## Scikit-Learn Preprocessing and Training Pipeline
##### from sklearn.feature_extraction.text import TfidfVectorizer
##### from sklearn.naive_bayes import MultinomialNB
### Using data from S3 and DWC

## Install fedml_aws library

In [1]:
pip install fedml_aws-1.0.0-py3-none-any.whl --force-reinstall

Processing ./fedml_aws-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-aws
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: fedml-aws
    Found existing installation: fedml-aws 1.0.0
    Uninstalling fedml-aws-1.0.0:
      Successfully uninstalled fedml-aws-1.0.0
Successfully installed fedml-aws-1.0.0 hdbcli-2.10.13
Note: you may need to restart the kernel to use updated packages.


## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='scikit-learn/pipeline', 
                    bucket_name='fedml-bucket')


## Create DbConnection instance to get data from DWC

Before running the following cell, you need a `config.json` file in the same directory as this notebook, containing the values required to access DWC.

You also need the view `IMDB_TEST_VIEW` created in your DWC. To obtain the data for it, download the test dataset from https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv.
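A quick pre-flight check can catch a missing or malformed `config.json` before `DbConnection()` is created. The sketch below is only an illustration: `load_config` is a hypothetical helper, not part of `fedml_aws`, and it assumes the file holds a JSON object at the top level. It deliberately checks no specific keys, since the required keys are defined by the library and not listed here.

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Return the parsed config, or raise a clear error if it is missing or malformed.

    Hypothetical helper for illustration; fedml_aws reads config.json itself.
    """
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(
            f"{config_path} not found; place it next to this notebook "
            "before creating DbConnection."
        )
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)  # raises json.JSONDecodeError on invalid JSON
    # ASSUMPTION: the connection values are stored as a top-level JSON object.
    if not isinstance(config, dict):
        raise ValueError(f"{config_path} should contain a JSON object at the top level.")
    return config
```

Calling `sorted(load_config())` lists only the key names, which avoids echoing credentials into the notebook output.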

In [4]:
db = DbConnection()
res, column_headers = db.get_data_with_headers(table_name='IMDB_TEST_VIEW', size=1)
dwc_data = pd.DataFrame(res, columns=['0', '1'])

## Access the data residing in S3

Before running the cell below, download the training dataset from the link below and upload it to your S3 bucket under the key `imdb_train.csv`, which is the name the cell expects.
https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv
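If you prefer to upload the file from code rather than through the AWS console, a minimal sketch using the boto3 S3 client's `upload_file` call could look like the following. The helper name `upload_training_csv` is hypothetical, and the local path is a placeholder for wherever you saved the Kaggle download; the bucket and key mirror the names used in this notebook.

```python
def upload_training_csv(s3_client, local_path, bucket, key):
    """Upload a local CSV to S3 so that a later download_file call can find it.

    Hypothetical helper: s3_client is expected to be a boto3 S3 client.
    """
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"


# Typical use (needs AWS credentials, so it is left commented out):
# import boto3
# upload_training_csv(boto3.client("s3"), "imdb_train.csv",
#                     "fedml-bucket", "imdb_train.csv")
```

Passing the client in as an argument keeps the helper easy to exercise without AWS access.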

In [5]:
import boto3
downloaded_data_bucket = "fedml-bucket"

s3 = boto3.client("s3")
s3.download_file(downloaded_data_bucket, "imdb_train.csv", "imdb_train.csv")

In [6]:
df = pd.read_csv('imdb_train.csv')
df

Unnamed: 0,0,1
0,"This film is absolutely awful, but nevertheles...",0
1,Well since seeing part's 1 through 3 I can hon...,0
2,I got to see this film at a preview and was da...,1
3,This adaptation positively butchers a classic ...,0
4,Råzone is an awful movie! It is so simple. It ...,0
...,...,...
24995,With this movie being the only Dirty Harry mov...,1
24996,Any screen adaptation of a John Grisham story ...,1
24997,This film captured my heart from the very begi...,1
24998,A deplorable social condition triggers off the...,1


## Combine the data from S3 and DWC for training

In [7]:
data = pd.concat([df, dwc_data], axis=0)
data.shape

(50000, 2)
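Row-wise concatenation only lines up cleanly when both frames share the same column names, which is presumably why the DWC frame was built with `columns=['0', '1']` to match the CSV. The toy frames below are made-up stand-ins, not the IMDB data. They sketch the difference between matching and mismatched columns, and show `ignore_index=True`, an optional extra not used in the cell above, which renumbers the rows instead of keeping duplicate index labels.

```python
import pandas as pd

# Toy stand-ins for the S3 and DWC frames (illustrative values only).
s3_part = pd.DataFrame({"0": ["good film", "bad film"], "1": [1, 0]})
dwc_part = pd.DataFrame([("fine film", 1)], columns=["0", "1"])

# Matching column names: rows are stacked and the column count stays at 2.
combined = pd.concat([s3_part, dwc_part], axis=0, ignore_index=True)

# Mismatched column names: pandas keeps the union of columns and fills gaps with NaN.
renamed = dwc_part.rename(columns={"0": "text", "1": "label"})
mismatched = pd.concat([s3_part, renamed], axis=0, ignore_index=True)

print(combined.shape)
print(mismatched.shape)
```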

In [8]:
data.head()

Unnamed: 0,0,1
0,"This film is absolutely awful, but nevertheles...",0
1,Well since seeing part's 1 through 3 I can hon...,0
2,I got to see this film at a preview and was da...,1
3,This adaptation positively butchers a classic ...,0
4,Råzone is an awful movie! It is so simple. It ...,0


## Train Scikit-Learn Model
`train_data` is the data you want to train your model with. Here it is the combined `data` DataFrame, passed as the first argument.

To deploy a model to AWS using the SageMaker Scikit-learn SDK, you need a script that tells SageMaker how to train and load the model. Pass the path to that script to the `train_sklearn_model` function through the `train_script` parameter.

`instance_type` specifies the SageMaker instance type, and therefore how much computing power AWS allocates to the training job.
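The contents of `pipeline_script.py` are not shown in this notebook. As a rough illustration of what such a training script might contain, the sketch below follows the usual SageMaker scikit-learn script-mode pattern (the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables and a `model_fn` loader) with a `TfidfVectorizer` plus `MultinomialNB` pipeline. The training file name `train_data.csv`, the channel name `train`, and the column layout (text in `'0'`, label in `'1'`) are assumptions; check how `fedml_aws` uploads the data before relying on them.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline():
    """TF-IDF features followed by a multinomial naive Bayes classifier."""
    return Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])


def train(train_dir, model_dir, train_file="train_data.csv"):
    # ASSUMPTION: file name and column layout (text in '0', label in '1').
    df = pd.read_csv(os.path.join(train_dir, train_file))
    model = build_pipeline().fit(df["0"].astype(str), df["1"])
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Loader that the SageMaker scikit-learn container calls at inference time."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # SageMaker sets these environment variables inside the training container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    args, _ = parser.parse_known_args()
    # Skip training when run outside a container (no directories provided).
    if args.train and args.model_dir:
        train(args.train, args.model_dir)
```

Keeping the vectorizer and classifier in one `Pipeline` means the saved model accepts raw review text at inference time, with no separate preprocessing step to keep in sync.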

In [9]:
clf = dwcs.train_sklearn_model(data,
                               train_script='pipeline_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               download_output=False)

Training data uploaded
2021-10-06 23:36:18 Starting - Starting the training job...
2021-10-06 23:36:27 Starting - Launching requested ML instances......
ProfilerReport-1633563378: InProgress
2021-10-06 23:37:43 Starting - Preparing the instances for training............
2021-10-06 23:39:43 Downloading - Downloading input data...
2021-10-06 23:40:17 Training - Training image download completed. Training in progress..
2021-10-06 23:40:19,497 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-10-06 23:40:19,500 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:40:19,511 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-10-06 23:40:19,931 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:40:19,943 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[34