# Scikit-Learn LogisticRegression
This notebook uses the `IRIS_VIEW` view from DWC, which has 150 records.

## Install fedml_aws library

In [1]:
pip install fedml_aws-1.0.0-py3-none-any.whl --force-reinstall

Processing ./fedml_aws-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-aws
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: fedml-aws
    Found existing installation: fedml-aws 1.0.0
    Uninstalling fedml-aws-1.0.0:
      Successfully uninstalled fedml-aws-1.0.0
Successfully installed fedml-aws-1.0.0 hdbcli-2.10.13
Note: you may need to restart the kernel to use updated packages.


## Import Libraries

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # plotting
import seaborn as sn

## Create DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='scikit-learn/log-reg-iris', bucket_name='fedml-bucket')

## Create DbConnection instance to get data from DWC

Before running the following cell, make sure a config.json file with the values required to access DWC is in the same directory as this notebook.

You should also have the view `IRIS_VIEW` created in your DWC. To obtain the data for it, refer to https://www.kaggle.com/uciml/iris
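A quick way to catch a missing or malformed config.json before opening the connection is sketched below. The helper name `load_config` is hypothetical and not part of `fedml_aws`; it deliberately makes no assumption about which keys the library expects, so check the library's documentation for those.

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Return the parsed config, failing early if it is missing or malformed.

    Hypothetical helper for illustration; not part of fedml_aws.
    """
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"{config_path} not found; place it next to this notebook")
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(config, dict):
        raise ValueError("config.json should contain a JSON object at the top level")
    return config
```

Calling `sorted(load_config())` lists the keys that are present without printing any credential values.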

In [4]:
%%time
db = DbConnection()
train_data = db.get_data_with_headers(table_name='IRIS_VIEW', size=1)
data = pd.DataFrame(train_data[0], columns=train_data[1])
data

CPU times: user 47.1 ms, sys: 254 µs, total: 47.4 ms
Wall time: 96.2 ms


| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3 | 5.2 | 2 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


In [5]:
data.isna().any()

sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
species         False
dtype: bool

In [6]:
data.isnull().any()

sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
species         False
dtype: bool

In [7]:
data.dtypes

sepal_length    object
sepal_width     object
petal_length    object
petal_width     object
species         object
dtype: object
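The `data.dtypes` output above shows every column as `object`, meaning the measurements are not stored as native floats. If numeric columns are wanted before splitting or plotting, one option is an explicit cast. The sketch below uses a small stand-in frame with made-up string values (an assumption; the actual values returned from DWC may be a different Python type).

```python
import pandas as pd

# Stand-in for the frame returned from DWC: every column has dtype object.
sample = pd.DataFrame(
    {
        "sepal_length": ["5.1", "4.9"],
        "sepal_width": ["3.5", "3"],
        "petal_length": ["1.4", "1.4"],
        "petal_width": ["0.2", "0.2"],
        "species": ["Iris-setosa", "Iris-setosa"],
    },
    dtype=object,
)

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Cast only the measurement columns; the label column is left untouched.
# astype raises if a value cannot be interpreted as a float.
converted = sample.astype({col: float for col in feature_cols})

print(converted.dtypes)
```

The same `astype` call can be applied to `data` itself; whether it is needed depends on how the training script parses the uploaded CSV.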

In [8]:
from sklearn.model_selection import train_test_split
data = data.sample(frac=1).reset_index(drop=True)
sub_data = data.head(100)
X_train, X_test, y_train, y_test = train_test_split(sub_data.drop(['species'], axis=1), sub_data['species'], test_size=0.3)
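The split above shuffles and splits without a fixed seed, so the rows and class proportions in each set change from run to run. A possible alternative, shown here on synthetic data with hypothetical class names, passes `stratify` and `random_state` to `train_test_split` to keep class proportions equal and the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with three balanced classes (hypothetical values).
frame = pd.DataFrame(
    {
        "feature": [float(i) for i in range(30)],
        "species": ["a"] * 10 + ["b"] * 10 + ["c"] * 10,
    }
)

X_tr, X_te, y_tr, y_te = train_test_split(
    frame.drop(columns=["species"]),
    frame["species"],
    test_size=0.3,
    stratify=frame["species"],  # keep class proportions the same in both splits
    random_state=42,            # fixed seed so the split is reproducible
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

For the notebook's data the equivalent change would be adding `stratify=sub_data['species']` and a `random_state` to the existing call, and a `random_state` to `data.sample`.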

In [9]:
train_data = pd.concat([X_train, y_train], axis=1)
train_data

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 63 | 5.1 | 2.5 | 3 | 1.1 | Iris-versicolor |
| 7 | 6 | 2.2 | 4 | 1 | Iris-versicolor |
| 74 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa |
| 30 | 5.8 | 4 | 1.2 | 0.2 | Iris-setosa |
| 77 | 5.6 | 3 | 4.5 | 1.5 | Iris-versicolor |
| ... | ... | ... | ... | ... | ... |
| 0 | 6.1 | 3 | 4.9 | 1.8 | Iris-virginica |
| 19 | 5.5 | 2.5 | 4 | 1.3 | Iris-versicolor |
| 91 | 5.6 | 2.5 | 3.9 | 1.1 | Iris-versicolor |
| 98 | 5.2 | 2.7 | 3.9 | 1.4 | Iris-versicolor |


In [10]:
pd.Series(train_data['species']).unique()

array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica'], dtype=object)

In [11]:
test_data = pd.concat([X_test, y_test], axis=1)
test_data

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 18 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 72 | 4.7 | 3.2 | 1.6 | 0.2 | Iris-setosa |
| 50 | 6.1 | 3.0 | 4.6 | 1.4 | Iris-versicolor |
| 29 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 78 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 85 | 5.4 | 3.0 | 4.5 | 1.5 | Iris-versicolor |
| 90 | 6.4 | 3.2 | 5.3 | 2.3 | Iris-virginica |
| 59 | 6.1 | 2.8 | 4.0 | 1.3 | Iris-versicolor |
| 31 | 5.1 | 3.5 | 1.4 | 0.3 | Iris-setosa |
| 24 | 6.7 | 3.0 | 5.0 | 1.7 | Iris-versicolor |


## Train Scikit-Learn Model

`train_data` is the data you want to train your model with.

To train and deploy a model on AWS with the SageMaker Scikit-learn SDK, you need a script that tells SageMaker how to train and load the model. Pass the path to that script to `train_sklearn_model` through the `train_script` parameter.
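For orientation, a training script of this kind usually follows the SageMaker scikit-learn script-mode conventions: read the training channel directory, fit a model, save it to the model directory, and expose a `model_fn` that the serving container calls to load it. The sketch below is not the `iris_trainV2.py` used in this notebook. It assumes the training channel is named `train`, that it holds CSV files with a header row, and that the label column is `species`; the `train` helper and the `model.joblib` file name are illustrative choices.

```python
import glob
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train(train_dir, model_dir, label_col="species"):
    """Fit a classifier on every CSV in train_dir and save it to model_dir."""
    # Assumption: each file has a header row and includes the label column.
    files = sorted(glob.glob(os.path.join(train_dir, "*.csv")))
    data = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    X = data.drop(columns=[label_col]).astype(float)
    y = data[label_col]
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Load the saved model; the scikit-learn serving container calls this."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# SageMaker sets these environment variables inside the training container,
# so the training step only runs there and not on a plain import or local run.
if __name__ == "__main__" and "SM_CHANNEL_TRAIN" in os.environ:
    train(os.environ["SM_CHANNEL_TRAIN"], os.environ["SM_MODEL_DIR"])
```

Keeping the fitting logic in a plain function makes it possible to run the script against a local directory of CSV files before paying for a training instance.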

`instance_type` selects the SageMaker instance type, and therefore the amount of compute, that AWS allocates for the training job.

In [12]:
clf = dwcs.train_sklearn_model(train_data=train_data,
                               test_data=test_data,
                               content_type='text/csv',
                               train_script='iris_trainV2.py',
                               instance_count=1,
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               base_job_name='scikit-learn-logistic-regression-iris'
                              )

Training data uploaded
Test data uploaded
2021-10-06 23:22:00 Starting - Starting the training job...
2021-10-06 23:22:02 Starting - Launching requested ML instancesProfilerReport-1633562520: InProgress
......
2021-10-06 23:23:21 Starting - Preparing the instances for training.........
2021-10-06 23:25:01 Downloading - Downloading input data...
2021-10-06 23:25:28 Training - Downloading the training image...
2021-10-06 23:26:02 Training - Training image download completed. Training in progress.
2021-10-06 23:25:51,424 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-10-06 23:25:51,427 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:25:51,438 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-10-06 23:25:59,116 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:25:59,132 sagemaker-t