## Scikit-Learn Data Preprocessor
### Using TITANIC_VIEW from DWC. This view has 861 records

## Install fedml_aws library

In [1]:
pip install fedml_aws-1.0.0-py3-none-any.whl --force-reinstall

Processing ./fedml_aws-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-aws
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: fedml-aws
    Found existing installation: fedml-aws 1.0.0
    Uninstalling fedml-aws-1.0.0:
      Successfully uninstalled fedml-aws-1.0.0
Successfully installed fedml-aws-1.0.0 hdbcli-2.10.13
Note: you may need to restart the kernel to use updated packages.


## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='scikit-learn/data-preprocessing', 
                    bucket_name='fedml-bucket')


## Create DbConnection instance to get data from DWC

Before running the following cell, make sure a `config.json` file containing the values needed to access DWC is in the same directory as this notebook.
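As a quick sanity check before connecting, the sketch below loads `config.json` and reports which expected keys are missing. The key names in `ASSUMED_KEYS` and the helper `missing_config_keys` are hypothetical and only illustrate the idea; replace them with the keys documented for your version of `fedml_aws`, which may also nest them differently.

```python
import json
from pathlib import Path

# Hypothetical key names for illustration only; use the keys that your
# fedml_aws version documents for config.json.
ASSUMED_KEYS = ("address", "port", "user", "password")


def missing_config_keys(path="config.json", required=ASSUMED_KEYS):
    """Return the required top-level keys absent from the JSON file at `path`."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"{config_path} was not found")
    with config_path.open() as fh:
        config = json.load(fh)
    return [key for key in required if key not in config]
```

Calling `missing_config_keys()` from the notebook directory returns an empty list when every assumed key is present.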

You should also have the following view, `TITANIC_VIEW`, created in your DWC. To obtain the data, download the train.csv file from https://www.kaggle.com/c/titanic/data.

In [4]:
%%time
db = DbConnection()
data = db.execute_query('SELECT * FROM %s' % ('SCE.TITANIC_VIEW'))
data = pd.DataFrame(data[0], columns=data[1])
data

CPU times: user 49.2 ms, sys: 3.55 ms, total: 52.8 ms
Wall time: 110 ms


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,False,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,True,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,False,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,False,2,"Montvila, Rev. Juozas",male,27,0,0,211536,13,,S
887,888,True,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30,B42,S
888,889,False,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,True,1,"Behr, Mr. Karl Howell",male,26,0,0,111369,30,C148,C
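The preview shows empty cells in columns such as `Age` and `Cabin`. A minimal sketch of how the gaps could be counted before preprocessing is given below. It rebuilds three rows from the preview by hand and assumes that empty cells arrive in pandas as missing values; in the notebook you would call the same method on `data` directly.

```python
import numpy as np
import pandas as pd

# Three rows copied from the preview above (PassengerId 1, 2 and 889).
# Assumption: empty cells are represented as NaN/None in the DataFrame.
sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 889],
        "Age": [22, 38, np.nan],
        "Cabin": [None, "C85", None],
        "Embarked": ["S", "C", "S"],
    }
)

# Count missing entries per column to see which columns need imputation.
missing_counts = sample.isna().sum()
print(missing_counts)
```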


## Train SciKit Model
`train_data` is the data you want to train your model with. Here it is the `data` DataFrame loaded above, passed as the first argument.

To deploy a model to AWS using the Scikit-learn SageMaker SDK, you need a script that tells SageMaker how to train and deploy the model. Pass the path to that script to the `train_sklearn_model` function in the `train_script` parameter.
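The contents of `preprocessor_script.py` are not shown in this notebook. As a rough illustration only, a SageMaker scikit-learn script for a preprocessor might look like the sketch below. It follows the usual SageMaker conventions (the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables and a `model_fn` loader), but the column choices, the `model.joblib` file name, the `train` channel name, and the assumption that the training data arrives as CSV files are all guesses, not the author's actual script.

```python
import argparse
import glob
import os

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column choices, based on the TITANIC_VIEW preview.
NUMERIC_COLS = ["Age", "Fare"]
CATEGORICAL_COLS = ["Sex", "Embarked"]


def build_preprocessor():
    """Impute and scale numeric columns; impute and one-hot encode categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])


def train(train_dir, model_dir):
    """Fit the preprocessor on every CSV in train_dir and save it to model_dir."""
    files = sorted(glob.glob(os.path.join(train_dir, "*.csv")))
    frame = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    preprocessor = build_preprocessor().fit(frame)
    joblib.dump(preprocessor, os.path.join(model_dir, "model.joblib"))
    return preprocessor


def model_fn(model_dir):
    """Called by the SageMaker container to load the fitted object for serving."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Run training only inside a SageMaker container, where SM_MODEL_DIR is set.
if __name__ == "__main__" and "SM_MODEL_DIR" in os.environ:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    args, _ = parser.parse_known_args()
    train(args.train, args.model_dir)
```

Keeping `train` and `model_fn` as plain functions makes it possible to exercise the script locally against a small CSV before paying for a training instance.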

`instance_type` names the AWS instance type (here `ml.c4.xlarge`) that SageMaker allocates for the training job, which determines how much computing power is available.

In [5]:
clf = dwcs.train_sklearn_model(data,
                               train_script='preprocessor_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               download_output=True,
                               logs='All')

Training data uploaded
2021-10-06 22:40:22 Starting - Starting the training job...
2021-10-06 22:40:46 Starting - Launching requested ML instancesProfilerReport-1633560021: InProgress
......
2021-10-06 22:41:47 Starting - Preparing the instances for training............
2021-10-06 22:43:47 Downloading - Downloading input data...
2021-10-06 22:44:07 Training - Downloading the training image..
2021-10-06 22:44:30,473 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-10-06 22:44:30,475 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 22:44:30,486 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-10-06 22:44:30,967 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 22:44:33,999 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[34m2021-10-06 22:44:34,013 