## Scikit-Learn Preprocessing and Training Pipeline
##### from sklearn.feature_extraction.text import TfidfVectorizer
##### from sklearn.naive_bayes import MultinomialNB
### Using data from S3 and SAP Datasphere

## Install the fedml-aws library

In [1]:
pip install fedml-aws --force-reinstall

## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='<prefix>', bucket_name='<bucket_name>')

Bucket created in  us-east-1


## Create a DbConnection instance to get data from SAP Datasphere

Before running the following cell, make sure a `config.json` file containing the values required to access SAP Datasphere is in the same directory as this notebook.

You should also have the view `IMDB_TEST_VIEW` created in SAP Datasphere. To obtain its data, download the test dataset from https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv.
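A missing or malformed `config.json` is easier to diagnose before `DbConnection()` is called. The following is a minimal sketch of such a check, not part of fedml-aws: the helper name `load_config` is hypothetical, and it assumes only that the file holds a single JSON object. The required keys are not listed here, so the sketch does not check for any.

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Return the parsed config, or raise a clear error if it is unusable.

    Hypothetical helper: it only checks that the file exists and contains
    a JSON object. It does not validate any connection settings.
    """
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(
            f"Expected a config file at {config_path.resolve()}"
        )
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)  # raises json.JSONDecodeError on bad JSON
    if not isinstance(config, dict):
        raise ValueError("config.json should contain a single JSON object")
    return config
```

Calling `load_config()` from the notebook directory and inspecting only `sorted(config.keys())` confirms the file is readable without printing credentials into the cell output.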

In [4]:
db = DbConnection()
res, column_headers = db.get_data_with_headers(table_name='IMDB_TEST_VIEW', size=1)
dwc_data = pd.DataFrame(res, columns=['0', '1'])

## Access data residing in S3

Before running the cell below, download the train dataset from the link below and upload it to your S3 bucket.
https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv

In [7]:
import boto3
downloaded_data_bucket = "fedml-bucket"

s3 = boto3.client("s3")
s3.download_file(downloaded_data_bucket, "imdb_train.csv", "imdb_train.csv")

In [8]:
df = pd.read_csv('imdb_train.csv')
df

Unnamed: 0,0,1
0,"This film is absolutely awful, but nevertheles...",0
1,Well since seeing part's 1 through 3 I can hon...,0
2,I got to see this film at a preview and was da...,1
3,This adaptation positively butchers a classic ...,0
4,Råzone is an awful movie! It is so simple. It ...,0
...,...,...
24995,With this movie being the only Dirty Harry mov...,1
24996,Any screen adaptation of a John Grisham story ...,1
24997,This film captured my heart from the very begi...,1
24998,A deplorable social condition triggers off the...,1


## Combine the data from S3 and SAP Datasphere for training

In [9]:
data = pd.concat([df, dwc_data], axis=0)
data.shape

(50000, 2)
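`pd.concat` with `axis=0` stacks rows and, by default, keeps each frame's own row labels, so the combined frame can contain repeated index values. The sketch below uses two toy frames, which are illustrative data and not the IMDB reviews, to show the effect of `ignore_index=True`. It assumes, as in the cells above, that column `'0'` holds the review text and column `'1'` the label. The integer cast is a precaution in case the database returns labels as strings.

```python
import pandas as pd

# Toy stand-ins for the S3 frame and the SAP Datasphere frame.
s3_frame = pd.DataFrame({"0": ["good film", "bad film"], "1": [1, 0]})
dwc_frame = pd.DataFrame(
    [["great story", "1"], ["dull plot", "0"]], columns=["0", "1"]
)

# Default behaviour: the row labels 0 and 1 each appear twice.
stacked = pd.concat([s3_frame, dwc_frame], axis=0)

# ignore_index=True renumbers the rows 0..n-1 after stacking.
combined = pd.concat([s3_frame, dwc_frame], axis=0, ignore_index=True)

# Make the label column a single integer dtype across both sources.
combined["1"] = combined["1"].astype(int)

print(list(stacked.index))
print(list(combined.index))
```

Whether the repeated labels matter depends on how the training script reads the data. Renumbering is mainly useful if rows are later selected by label.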

In [10]:
data.head()

Unnamed: 0,0,1
0,"This film is absolutely awful, but nevertheles...",0
1,Well since seeing part's 1 through 3 I can hon...,0
2,I got to see this film at a preview and was da...,1
3,This adaptation positively butchers a classic ...,0
4,Råzone is an awful movie! It is so simple. It ...,0


## Train Scikit-learn Model
`train_data` is the data you want to train your model with.

In order to deploy a model to AWS using the Scikit-learn SageMaker SDK, you must have a script that tells SageMaker how to train and deploy the model. The path to the script is passed to the `train_sklearn_model` function in the `train_script` parameter.

`instance_type` specifies the ML instance type, and therefore how much computing power, AWS allocates for the training job.
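The contents of `pipeline_script.py` are not shown in this notebook. As a rough illustration of what such a script might contain, the sketch below builds the `TfidfVectorizer` and `MultinomialNB` pipeline named in the title and adds a `model_fn` loader, which is the hook the SageMaker scikit-learn container calls to load a model for serving. The function names `build_pipeline` and `train`, the file name `model.joblib`, and the column layout (`'0'` for text, `'1'` for label) are assumptions. How fedml-aws hands the training data to the script is not covered here.

```python
import os

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline():
    """TF-IDF features followed by a multinomial naive Bayes classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("nb", MultinomialNB()),
    ])


def train(frame, model_dir):
    """Fit on column '0' (text) and column '1' (label), then save the model.

    Assumed layout: the real script may read and split its data differently.
    """
    model = build_pipeline()
    model.fit(frame["0"].astype(str), frame["1"].astype(int))
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Load the saved pipeline from the directory passed in for serving."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

Inside a SageMaker training container, the directory to save into is available in the `SM_MODEL_DIR` environment variable. Keeping the vectorizer and the classifier in one `Pipeline` means the deployed model accepts raw review text without a separate preprocessing step.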

In [11]:
clf = dwcs.train_sklearn_model(data,
                               train_script='pipeline_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               download_output=False)

Training data uploaded
2022-01-26 23:42:10 Starting - Starting the training job...
2022-01-26 23:42:34 Starting - Launching requested ML instances
ProfilerReport-1643240530: InProgress
......
2022-01-26 23:43:34 Starting - Preparing the instances for training.........
2022-01-26 23:45:02 Downloading - Downloading input data......
2022-01-26 23:45:58 Training - Training image download completed. Training in progress.
2022-01-26 23:46:00,175 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-01-26 23:46:00,178 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 23:46:00,190 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-01-26 23:46:00,590 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 23:46:03,648 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)