## Scikit-Learn Hyperparameter Tuning
### Using local data (created by the Data Preprocessor script)

## Install fedml_aws library

In [1]:
pip install fedml_aws-1.0.0-py3-none-any.whl --force-reinstall

Processing ./fedml_aws-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-aws
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: fedml-aws
    Found existing installation: fedml-aws 1.0.0
    Uninstalling fedml-aws-1.0.0:
      Successfully uninstalled fedml-aws-1.0.0
Successfully installed fedml-aws-1.0.0 hdbcli-2.10.13
Note: you may need to restart the kernel to use updated packages.


## Import Libraries

In [2]:
from fedml_aws import DwcSagemaker
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='scikit-learn/hyperparameter-tuning', 
                    bucket_name='fedml-bucket')


## In this example, we are using local data for training.

Before running the next cell, run the Data Preprocessor model example.
That example downloads an output directory containing the `preprocessed_data.csv` and `labels.csv` files used here.

Set the paths in the next cell to that output directory before running it.
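The output directory name appears to embed a timestamp, so it will differ from one preprocessor run to the next. As a minimal sketch, assuming the directories sit in the working directory and all follow the zero-padded `output-YYYY-MM-DD-HH-MM-SS` pattern, the most recent one could be located instead of typed by hand. The helper name `latest_output_dir` is hypothetical and is not part of `fedml_aws`.

```python
from pathlib import Path


def latest_output_dir(base="."):
    """Return the newest 'output-*' directory under `base`.

    Assumption: names are zero-padded timestamps, so sorting them
    alphabetically also sorts them chronologically.
    """
    candidates = sorted(p for p in Path(base).glob("output-*") if p.is_dir())
    if not candidates:
        raise FileNotFoundError(f"no 'output-*' directory found under {base!r}")
    return candidates[-1]
```

For example, `out = latest_output_dir()` followed by `pd.read_csv(out / 'preprocessed_data.csv')` would load the features from the newest run.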

In [4]:
%%time
features = pd.read_csv('output-2021-09-17-18-50-56/preprocessed_data.csv')
labels = pd.read_csv('output-2021-09-17-18-50-56/labels.csv')

CPU times: user 5.89 ms, sys: 3.77 ms, total: 9.66 ms
Wall time: 13.4 ms


In [5]:
features = features.drop(['Unnamed: 0'], axis=1)
features

Unnamed: 0,num__PassengerId,num__Pclass,num__Age,num__SibSp,num__Parch,num__Fare,onehotencoder__x0_female,onehotencoder__x0_male,onehotencoder__x1_C,onehotencoder__x1_Q,onehotencoder__x1_S
0,1.0,3.0,22.0,1.0,0.0,7.0,0.0,1.0,0.0,0.0,1.0
1,2.0,1.0,38.0,1.0,0.0,71.0,1.0,0.0,1.0,0.0,0.0
2,3.0,3.0,26.0,0.0,0.0,7.0,1.0,0.0,0.0,0.0,1.0
3,4.0,1.0,35.0,1.0,0.0,53.0,1.0,0.0,0.0,0.0,1.0
4,5.0,3.0,35.0,0.0,0.0,8.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
886,851.0,3.0,4.0,4.0,2.0,31.0,0.0,1.0,0.0,0.0,1.0
887,852.0,3.0,74.0,0.0,0.0,7.0,0.0,1.0,0.0,0.0,1.0
888,853.0,3.0,9.0,1.0,1.0,15.0,1.0,0.0,1.0,0.0,0.0
889,854.0,1.0,16.0,0.0,1.0,39.0,1.0,0.0,0.0,0.0,1.0


In [6]:
labels

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0
...,...
886,0
887,0
888,0
889,1


In [7]:
data = pd.concat([features, labels], axis=1)

## Train Scikit-learn Model
`train_data` is the data used to train the model (here, the combined `data` DataFrame).

To deploy a model to AWS using the SageMaker Scikit-learn SDK, you need a script that tells SageMaker how to train and deploy the model. Pass the path to that script to the `train_sklearn_model` function through the `train_script` parameter.
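The contents of `tuning_script.py` are not shown in this notebook. As a rough sketch of how such a script might pick up its settings, SageMaker training containers expose the job's hyperparameters in the `SM_HPS` environment variable (a JSON-encoded dictionary) and the model output location in `SM_MODEL_DIR`. The function name `read_training_config` is hypothetical, and whether `fedml_aws` forwards the `hyperparameters` dictionary unchanged is an assumption.

```python
import json
import os


def read_training_config(environ=None):
    """Collect settings that a SageMaker training container exposes via env vars."""
    environ = os.environ if environ is None else environ
    # SM_HPS holds the job's hyperparameters as one JSON-encoded dictionary.
    hyperparameters = json.loads(environ.get("SM_HPS", "{}"))
    # SM_MODEL_DIR is the directory where the fitted model should be saved.
    model_dir = environ.get("SM_MODEL_DIR", "/opt/ml/model")
    return hyperparameters, model_dir


# Simulated environment, for illustration only (values are made up).
example_env = {
    "SM_HPS": '{"max_depth": [2, 4, 6], "n_jobs": 24}',
    "SM_MODEL_DIR": "/tmp/model",
}
hps, model_dir = read_training_config(example_env)
print(hps, model_dir)
```

Passing the environment as an argument keeps the function easy to exercise outside a container, where these variables are not set.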

`instance_type` specifies the AWS instance type, and therefore the compute capacity, allocated to the training job.
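The `hyperparameters` dictionary in the training cell mixes list-valued entries with one scalar (`n_jobs`). Assuming the training script treats each list as a set of candidate values for an exhaustive grid search and each scalar as a fixed setting (the script is not shown, so this is a guess at its behavior), the number of candidate configurations can be estimated before launching the job. The helper `expand_grid` is hypothetical.

```python
from itertools import product

hyperparameters = {
    'max_depth': [2, 4, 6],
    'n_estimators': [100, 250, 300],
    'max_features': [4, 5, 6, 'sqrt'],
    'min_samples_leaf': [25, 30],
    'n_jobs': 24,
}


def expand_grid(params):
    """Expand list-valued entries into every combination; copy scalars as-is."""
    grid = {k: v for k, v in params.items() if isinstance(v, list)}
    fixed = {k: v for k, v in params.items() if not isinstance(v, list)}
    names = list(grid)
    return [dict(zip(names, combo), **fixed) for combo in product(*grid.values())]


combinations = expand_grid(hyperparameters)
print(len(combinations))  # 3 * 3 * 4 * 2 = 72
```

Under this assumption the job evaluates 72 configurations, and more model fits if cross-validation is used, which is worth keeping in mind when choosing `instance_type`.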

In [8]:
hyperparameters = {
    'max_depth': [2, 4, 6],
    'n_estimators': [100, 250, 300],
    'max_features': [4, 5, 6, 'sqrt'],
    'min_samples_leaf': [25, 30],
    'n_jobs': 24
}
clf = dwcs.train_sklearn_model(data,
                               train_script='tuning_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               hyperparameters=hyperparameters,
                               download_output=False,
                               logs='All')

Training data uploaded
2021-10-06 23:07:14 Starting - Starting the training job...
2021-10-06 23:07:40 Starting - Launching requested ML instancesProfilerReport-1633561633: InProgress
......
2021-10-06 23:08:41 Starting - Preparing the instances for training.........
2021-10-06 23:10:11 Downloading - Downloading input data...
2021-10-06 23:10:41 Training - Downloading the training image...
2021-10-06 23:11:01 Training - Training image download completed. Training in progress....
2021-10-06 23:10:58,511 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-10-06 23:10:58,514 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:10:58,525 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-10-06 23:10:58,960 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:11:00,401 sagemaker-training-toolkit 