# Scikit-learn Cross-Validation Score on a Logistic Regression Model
This notebook uses the `IRIS_VIEW` view from DWC, which has 150 records.

## Install the fedml-aws library

In [1]:
pip install fedml-aws --force-reinstall

## Import Libraries

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='<prefix>', bucket_name='<bucket_name>')

Bucket created in  us-east-1


## Create DbConnection instance to get data from DWC

Before running the following cell, make sure a `config.json` file with the values required to access DWC is in the same directory as this notebook.

You should also have the view `IRIS_VIEW` created in your DWC. To get the data for this view, refer to https://www.kaggle.com/uciml/iris

In [4]:
%%time
db = DbConnection()
train_data = db.get_data_with_headers(table_name='IRIS_VIEW', size=1)
data = pd.DataFrame(train_data[0], columns=train_data[1])
data

CPU times: user 47.1 ms, sys: 441 µs, total: 47.6 ms
Wall time: 105 ms


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3,5.2,2.3,Iris-virginica
146,6.3,2.5,5,1.9,Iris-virginica
147,6.5,3,5.2,2,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [5]:
data.isna().any()

sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
species         False
dtype: bool

In [6]:
data.isnull().any()

sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
species         False
dtype: bool

In [7]:
data.dtypes

sepal_length    object
sepal_width     object
petal_length    object
petal_width     object
species         object
dtype: object
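
Every column above is reported as `object` rather than a numeric type. The source does not show why, or whether the training script converts the values itself. As a minimal sketch, assuming the feature values are numeric strings, the columns could be converted before modelling like this. The helper name `to_numeric_features` and the small `sample` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Feature column names taken from the IRIS_VIEW output shown above.
FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def to_numeric_features(df, feature_columns=FEATURE_COLUMNS):
    """Return a copy of df with the feature columns converted to numeric dtypes.

    Hypothetical helper: errors="raise" makes a non-numeric value fail loudly
    instead of silently turning into NaN.
    """
    out = df.copy()
    for col in feature_columns:
        out[col] = pd.to_numeric(out[col], errors="raise")
    return out


# Stand-in frame that mimics object-typed columns (assumption: values are strings).
sample = pd.DataFrame(
    {
        "sepal_length": ["5.1", "4.9"],
        "sepal_width": ["3.5", "3.1"],
        "petal_length": ["1.4", "1.5"],
        "petal_width": ["0.2", "0.3"],
        "species": ["Iris-setosa", "Iris-setosa"],
    }
)

converted = to_numeric_features(sample)
print(converted.dtypes)
```

In the notebook, the equivalent call would be `data = to_numeric_features(data)`, run before the split.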

In [8]:
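from sklearn.model_selection import train_test_split
# Shuffle all rows, then keep the first 100 of the shuffled records.
data = data.sample(frac=1).reset_index(drop=True)
sub_data = data.head(100)
# Hold out 30% of the 100-row subset for testing (70 train / 30 test rows).
X_train, X_test, y_train, y_test = train_test_split(sub_data.drop(['species'], axis=1), sub_data['species'], test_size=0.3)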
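
The split above is unseeded, so each run produces different rows in each set, and nothing guarantees that the three species are evenly represented. As an optional variation, not what the notebook does, `train_test_split` accepts `random_state` for repeatability and `stratify` to keep class proportions equal across both sets. The helper name `reproducible_split`, the seed value, and the `demo` frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def reproducible_split(df, label_column="species", test_size=0.3, seed=42):
    """Split df into train/test frames with a fixed seed and stratified labels."""
    X = df.drop(columns=[label_column])
    y = df[label_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Re-attach the label column, as the notebook does with pd.concat.
    train = pd.concat([X_train, y_train], axis=1)
    test = pd.concat([X_test, y_test], axis=1)
    return train, test


# Tiny stand-in frame with two classes of five rows each (illustrative only).
demo = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5],
        "species": ["Iris-setosa"] * 5 + ["Iris-versicolor"] * 5,
    }
)

train_df, test_df = reproducible_split(demo, test_size=0.4)
print(len(train_df), len(test_df))
```

Calling the helper twice with the same seed returns the same rows, which makes later training runs easier to compare.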

In [9]:
train_data = pd.concat([X_train, y_train], axis=1)
train_data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
98,5.5,2.5,4,1.3,Iris-versicolor
53,5.7,2.8,4.1,1.3,Iris-versicolor
18,6.3,3.4,5.6,2.4,Iris-virginica
67,7.7,2.8,6.7,2,Iris-virginica
9,6.7,3.3,5.7,2.1,Iris-virginica
...,...,...,...,...,...
86,4.8,3,1.4,0.1,Iris-setosa
48,5.7,2.5,5,2,Iris-virginica
16,5.6,3,4.5,1.5,Iris-versicolor
12,6.8,3.2,5.9,2.3,Iris-virginica


In [10]:
pd.Series(train_data['species']).unique()

array(['Iris-versicolor', 'Iris-virginica', 'Iris-setosa'], dtype=object)

In [11]:
test_data = pd.concat([X_test, y_test], axis=1)
test_data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
7,5.2,4.1,1.5,0.1,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
17,6.1,3.0,4.9,1.8,Iris-virginica
95,6.6,2.9,4.6,1.3,Iris-versicolor
28,6.5,3.0,5.5,1.8,Iris-virginica
42,6.4,2.8,5.6,2.1,Iris-virginica
1,5.8,2.7,5.1,1.9,Iris-virginica
50,6.0,2.2,4.0,1.0,Iris-versicolor
47,4.6,3.1,1.5,0.2,Iris-setosa
81,6.7,3.1,4.4,1.4,Iris-versicolor


## Train the Scikit-learn Model

`train_data` is the data used to train the model.

To train and deploy a model on AWS with the SageMaker Scikit-learn SDK, you need a script that tells SageMaker how to train the model and how to load it for deployment. Pass the path to that script to the `train_sklearn_model` function through the `train_script` parameter.

`instance_type` specifies the type of compute instance, and therefore how much computing power, AWS allocates for this job.
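
The contents of `iris_trainV3.py` are not shown here. The following is a minimal sketch of what the core of such a training script might look like, assuming the uploaded CSV files have a header row and a `species` label column. The helpers `load_channel`, `fit_with_cross_val`, and `save_model`, and the file name `model.joblib`, are hypothetical. `model_fn` is the hook the SageMaker Scikit-learn container calls to load a model for serving. In a real script the directories would typically come from environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`, although the channel names used by `fedml_aws` are an assumption here.

```python
import glob
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

LABEL_COLUMN = "species"  # assumption: label column name in the uploaded CSV


def load_channel(channel_dir):
    """Read every CSV file in an input channel directory into one DataFrame."""
    files = sorted(glob.glob(os.path.join(channel_dir, "*.csv")))
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


def fit_with_cross_val(df, label_column=LABEL_COLUMN, cv=5):
    """Score a logistic regression with k-fold cross-validation, then fit it on all rows."""
    X = df.drop(columns=[label_column]).astype(float)
    y = df[label_column]
    model = LogisticRegression(max_iter=1000)
    # cross_val_score clones the estimator for each fold and returns one accuracy per fold.
    scores = cross_val_score(model, X, y, cv=cv)
    # The cross-validation clones are discarded, so fit the final model explicitly.
    model.fit(X, y)
    return model, scores


def save_model(model, model_dir):
    """Persist the fitted model where model_fn below expects to find it."""
    path = os.path.join(model_dir, "model.joblib")
    joblib.dump(model, path)
    return path


def model_fn(model_dir):
    """Load the fitted model for inference."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

A script built on these helpers would call `load_channel` on the training directory, print `scores.mean()` so that it shows up in the training log, and finish with `save_model`.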

In [12]:
clf = dwcs.train_sklearn_model(train_data=train_data,
                               test_data=test_data,
                               content_type='text/csv',
                               train_script='iris_trainV3.py',
                               instance_count=1,
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               base_job_name='scikit-learn-logistic-regression-crossval'
                              )

Training data uploaded
Test data uploaded
2022-01-27 00:03:27 Starting - Starting the training job...
2022-01-27 00:03:51 Starting - Launching requested ML instancesProfilerReport-1643241807: InProgress
.........
2022-01-27 00:05:06 Starting - Preparing the instances for training.........
2022-01-27 00:06:56 Downloading - Downloading input data...
2022-01-27 00:07:23 Training - Downloading the training image...
2022-01-27 00:07:55 Uploading - Uploading generated training model
2022-01-27 00:07:50,340 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-01-27 00:07:50,342 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-27 00:07:50,353 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-01-27 00:07:50,699 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-27 00:07:50,713 sagemaker-training-toolkit IN