## Scikit-Learn Hyperparameter Tuning
### Using local data (created by the preprocessor script)

## Install the fedml-aws library

In [1]:
pip install fedml-aws --force-reinstall

Note: you may need to restart the kernel to use updated packages.


## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='<prefix>', bucket_name='<bucket_name>')

Bucket created in  us-east-1


## In this example, we are using local data for training.

Before running this cell, run the Data Preprocessor model example.
That example downloads an output directory containing the `preprocessed_data.csv` and `labels.csv` files used for this model.

Set the correct output directory in the next cell before running it.
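
If several output directories exist, picking the right one by hand is error-prone. The following is a minimal sketch, assuming the directories follow the `output-<timestamp>` naming seen in the next cell; `latest_output_dir` is a hypothetical helper, not part of fedml-aws.

```python
from pathlib import Path


def latest_output_dir(base="."):
    """Return the newest output-<timestamp> directory that holds both CSV files."""
    required = ("preprocessed_data.csv", "labels.csv")
    candidates = sorted(
        d for d in Path(base).glob("output-*")
        if d.is_dir() and all((d / name).is_file() for name in required)
    )
    if not candidates:
        raise FileNotFoundError(f"no complete output-* directory under {base!r}")
    # Timestamps of the form YYYY-MM-DD-HH-MM-SS sort chronologically as text.
    return candidates[-1]
```

The returned path can then be joined with the file names, for example `pd.read_csv(latest_output_dir() / 'labels.csv')`.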

In [4]:
%%time
features = pd.read_csv('output-2021-09-17-18-50-56/preprocessed_data.csv')
labels = pd.read_csv('output-2021-09-17-18-50-56/labels.csv')

CPU times: user 6.46 ms, sys: 3.16 ms, total: 9.63 ms
Wall time: 10.4 ms


In [5]:
features = features.drop(['Unnamed: 0'], axis=1)
features

Unnamed: 0,num__PassengerId,num__Pclass,num__Age,num__SibSp,num__Parch,num__Fare,onehotencoder__x0_female,onehotencoder__x0_male,onehotencoder__x1_C,onehotencoder__x1_Q,onehotencoder__x1_S
0,1.0,3.0,22.0,1.0,0.0,7.0,0.0,1.0,0.0,0.0,1.0
1,2.0,1.0,38.0,1.0,0.0,71.0,1.0,0.0,1.0,0.0,0.0
2,3.0,3.0,26.0,0.0,0.0,7.0,1.0,0.0,0.0,0.0,1.0
3,4.0,1.0,35.0,1.0,0.0,53.0,1.0,0.0,0.0,0.0,1.0
4,5.0,3.0,35.0,0.0,0.0,8.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
886,851.0,3.0,4.0,4.0,2.0,31.0,0.0,1.0,0.0,0.0,1.0
887,852.0,3.0,74.0,0.0,0.0,7.0,0.0,1.0,0.0,0.0,1.0
888,853.0,3.0,9.0,1.0,1.0,15.0,1.0,0.0,1.0,0.0,0.0
889,854.0,1.0,16.0,0.0,1.0,39.0,1.0,0.0,0.0,0.0,1.0


In [6]:
labels

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0
...,...
886,0
887,0
888,0
889,1


In [7]:
data = pd.concat([features, labels], axis=1)
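
`pd.concat(..., axis=1)` pairs rows by index, so it is worth checking that both frames line up and that no leftover index column ends up in the training data. The sketch below uses small toy frames as stand-ins for `features` and `labels`; whether the real `labels` frame carries an `Unnamed: 0` column is an assumption to verify.

```python
import pandas as pd

# Toy stand-ins for the notebook's features and labels frames.
features = pd.DataFrame({"num__Age": [22.0, 38.0, 26.0]})
labels = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "Survived": [0, 1, 1]})

# Drop a leftover index column if the CSV export wrote one.
labels = labels.drop(columns=["Unnamed: 0"], errors="ignore")

# concat(axis=1) aligns on the index, so mismatched indexes would introduce NaN rows.
assert features.index.equals(labels.index)
data = pd.concat([features, labels], axis=1)
assert not data.isna().any().any()
print(list(data.columns))  # ['num__Age', 'Survived']
```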

## Train Scikit-Learn Model
`train_data` is the data you want to train your model with. 

To deploy a model to AWS with the SageMaker Scikit-learn SDK, you need a script that tells SageMaker how to train and deploy the model. Pass the path to that script to the `train_sklearn_model` function via the `train_script` parameter.
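
The contents of `tuning_script.py` are not shown in this notebook, so the following is only a minimal sketch of what such a script might contain, not the script used here. It assumes a random forest classifier tuned with `GridSearchCV`, list-valued hyperparameters arriving as string literals such as `"[2, 4, 6]"`, a training file named `train_data.csv` (hypothetical), and a container that sets the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables.

```python
import argparse
import ast
import os

GRID_KEYS = ("max_depth", "n_estimators", "max_features", "min_samples_leaf")


def parse_grid(argv=None):
    """Turn command-line hyperparameters into a search grid for GridSearchCV."""
    parser = argparse.ArgumentParser()
    # Assumption: each list arrives as one string literal, e.g. "[2, 4, 6]".
    for name in GRID_KEYS:
        parser.add_argument(f"--{name}", type=ast.literal_eval, required=True)
    parser.add_argument("--n_jobs", type=int, default=1)
    # Assumption: the container exports these SageMaker environment variables.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    args, _ = parser.parse_known_args(argv)
    grid = {name: getattr(args, name) for name in GRID_KEYS}
    return args, grid


def train(args, grid, label="Survived"):
    """Run an exhaustive grid search and save the best model."""
    # Heavy imports are kept local so argument parsing can be tested on its own.
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Hypothetical file name; it depends on how the training data was uploaded.
    df = pd.read_csv(os.path.join(args.train, "train_data.csv"))
    X, y = df.drop(columns=[label]), df[label]

    search = GridSearchCV(
        RandomForestClassifier(random_state=0), grid, cv=5, n_jobs=args.n_jobs
    )
    search.fit(X, y)

    # Save the refitted best estimator into the model directory.
    joblib.dump(search.best_estimator_, os.path.join(args.model_dir, "model.joblib"))
    return search.best_params_
```

A real script would call `train(*parse_grid())` under an `if __name__ == '__main__':` guard and, for deployment, typically also define a `model_fn` that loads `model.joblib`.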

`instance_type` specifies the AWS instance type, and therefore the computing power, allocated to the training job.
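
How much computing power is needed depends largely on the size of the hyperparameter grid in the next cell. The sketch below counts the combinations in that grid, assuming the training script runs an exhaustive grid search; the 5-fold cross-validation figure is likewise an assumption, since the script is not shown.

```python
from itertools import product

hyperparameters = {
    'max_depth': [2, 4, 6],
    'n_estimators': [100, 250, 300],
    'max_features': [4, 5, 6, 'sqrt'],
    'min_samples_leaf': [25, 30],
    'n_jobs': 24,
}

# Only list-valued entries are search dimensions; scalars such as n_jobs are fixed settings.
grid = {k: v for k, v in hyperparameters.items() if isinstance(v, list)}
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

cv_folds = 5  # assumed number of cross-validation folds
print(len(combinations))             # 72
print(len(combinations) * cv_folds)  # 360 model fits
```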

In [8]:
hyperparameters = {
    'max_depth': [2, 4, 6],
    'n_estimators': [100, 250, 300],
    'max_features': [4, 5, 6, 'sqrt'],
    'min_samples_leaf': [25, 30],
    'n_jobs': 24
}
clf = dwcs.train_sklearn_model(data,
                               train_script='tuning_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               hyperparameters=hyperparameters,
                               download_output=False,
                               logs='All')

Training data uploaded
2022-01-26 23:03:27 Starting - Starting the training job...
2022-01-26 23:03:55 Starting - Launching requested ML instancesProfilerReport-1643238206: InProgress
......
2022-01-26 23:04:55 Starting - Preparing the instances for training...............
2022-01-26 23:07:20 Downloading - Downloading input data...
2022-01-26 23:07:56 Training - Downloading the training image...
2022-01-26 23:08:16 Training - Training image download completed. Training in progress.....
2022-01-26 23:08:16,118 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-01-26 23:08:16,121 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 23:08:16,132 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-01-26 23:08:16,596 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 23:08:19,622 sagemaker-training-t