<a href="https://colab.research.google.com/github/vedantdave77/project.Orca/blob/master/ML%20deployment-%20AWS.SageMaker/Boston_House(AWS_SageMaker_Batch_Transform_HP_Tuning).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AWS SAGEMAKER 
**--------------------------------------**
Hello, I am [Vedant_Dave](mailto:vedantdave77@gmail.com), a data enthusiast with a deep interest in machine learning and deep learning. 

## Intro To Topic
Today, I am going to deploy the Boston Housing data project using AWS SageMaker's high-level API, the SageMaker Python SDK. 
> This API can train and deploy a model in the cloud directly from a Jupyter notebook, so I will follow the usual machine learning workflow. 

> Data loading --> Data Preparation --> Model Training --> HP Tuning --> Deployment in AWS. (Hopefully, I will also try to build a web application.) 

---
---

First of all, I will use SageMaker's batch transform feature, which is a high-throughput method for transforming data and generating inferences. 

- I personally think it is ideal for scenarios where you are dealing with large batches of data, don't need sub-second latency, or need to both preprocess and transform the training data. 

- My main focus is deploying the model, so on the analytics side I simply use SageMaker's built-in ML algorithms to predict the median house price from the housing characteristics of an area. 

 


## Set Environment (lib & SageMaker)

In [0]:
# Setting-up Notebook in relevant environment.

import os
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_boston
import sklearn.model_selection

In [0]:
# Set up SageMaker in the environment.
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer

# Object representing the current SageMaker session; it holds information we need later.
session = sagemaker.Session()

# IAM role of this notebook; SageMaker uses it when we launch training jobs.
role = get_execution_role()

## Download Data 

In [0]:
boston = load_boston()


## Data preparation and splitting.


In [0]:
# Put the features and the target into pandas DataFrames.
X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)
Y_bos_pd = pd.DataFrame(boston.target)

# Split into train and test sets (test = 1/3).
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)

# Split the training set again into train (2/3) and validation (1/3).
X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)
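To see how many rows the two successive 33% splits leave in each set, here is a minimal sketch that only does the size arithmetic. It assumes the dataset has 506 rows and that the held-out part of each split is rounded up; `split_sizes` is a hypothetical helper written for this note, not a scikit-learn function.

```python
import math

def split_sizes(n_rows, held_out_fraction):
    """Return (kept, held_out) row counts for one split.

    Assumption: the held-out part is rounded up and the rest is kept.
    """
    held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - held_out, held_out

n_rows = 506  # assumed number of rows in the Boston housing data

# First split: hold out the test set.
train_and_val, n_test = split_sizes(n_rows, 0.33)

# Second split: hold out the validation set from what is left.
n_train, n_val = split_sizes(train_and_val, 0.33)

print(n_train, n_val, n_test)
```

Under these assumptions, roughly 45% of the rows end up in training, 22% in validation and 33% in test.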

## Uploading data files to S3.

Keep in mind that: 

- When a training job is constructed using SageMaker, a container is executed which performs the training operation.
- This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. 
- In addition, when we perform a batch transform job, SageMaker expects the input data to be stored in S3. The SageMaker API can do the upload and hide some of the details, but the data must first be saved locally and then uploaded to the S3 bucket.




In [0]:
# Define the data directory and make sure it exists.
data_dir = '.../data/boston'
if not os.path.exists(data_dir):
  os.makedirs(data_dir)

In [0]:
# Write every split to data_dir as CSV. In the train and validation files the target goes in the first column.

X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
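SageMaker's built-in XGBoost reads CSV training data with the label in the first column and no header row. The sketch below shows that layout using only the standard library; the numbers are made-up sample rows, not values taken from the dataset.

```python
import csv
import io

# Hypothetical rows: one target value and three feature values each.
targets = [24.0, 21.6]
features = [[0.006, 18.0, 2.31], [0.027, 0.0, 7.07]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for y, x in zip(targets, features):
    writer.writerow([y] + x)  # target first, then the features; no header row

csv_text = buffer.getvalue()
print(csv_text)
```

The test file is the exception: it holds only features, because the target is what the model has to predict.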

### Upload to S3 - data storage.
It's good practice to give your S3 uploads a key prefix, so you can easily tell which objects belong to which project. 
- Here, I name it "dataset_name-algorithm_name-API_level".

I will use XGBoost, a gradient-boosted decision tree algorithm that is a popular modern choice for supervised learning and usually gives accurate results on tabular data like this. 

> For more info, visit [XGBoost](https://xgboost.readthedocs.io/en/latest/) official documentation.

In [0]:
prefix = 'boston-xgboost-HL'

test_location = session.upload_data(os.path.join(data_dir,'test.csv'),key_prefix = prefix)
val_location = session.upload_data(os.path.join(data_dir,'validation.csv'), key_prefix = prefix)
train_location = session.upload_data(os.path.join(data_dir,'train.csv'),key_prefix = prefix)
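Each `upload_data` call returns the S3 location of the uploaded file, which we pass to the training and transform jobs later. As a rough sketch, assuming the object key is the prefix followed by the file name, the location can be pictured like this. `expected_s3_uri` is a hypothetical helper and `my-sagemaker-bucket` is a placeholder bucket name.

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the URI an upload with this prefix is assumed to produce."""
    filename = posixpath.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix, filename)

# 'my-sagemaker-bucket' is a placeholder, not a real bucket name.
uri = expected_s3_uri("my-sagemaker-bucket", "boston-xgboost-HL", "data/boston/train.csv")
print(uri)
```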

## Training XGBoost Model
There are two options for training a model: the high-level API, in which SageMaker handles the training details itself, or the low-level API, in which we have to define each step ourselves. 

I will go through both cases to show the difference. Before that, SageMaker needs some information, such as which training container to use. You can find the details in the [common_para_list](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html)


In [0]:
# Look up the URI of the built-in XGBoost training container for the session's region.
container = get_image_uri(session.boto_region_name, 'xgboost')

# Now construct the estimator object with the required parameters.
xgb = sagemaker.estimator.Estimator(container,                                  # our training container
                                    role,                                       # IAM role used for training
                                    train_instance_count=1,                     # number of training instances; long jobs may need more
                                    train_instance_type='ml.m4.xlarge',         # instance type used for training (the price depends on the type, see https://aws.amazon.com/sagemaker/pricing/)
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),   # output destination
                                    sagemaker_session=session)                  # current session (it carries the region the instances run in)

> ***The XGBoost hyperparameters that SageMaker supports are listed*** [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html)

In [0]:
# Set the baseline hyperparameters of the model.
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='reg:linear',
                        early_stopping_rounds=10,
                        num_round=200)

The best thing about this AWS service is its hyperparameter tuning feature. 
> If you have some experience training ML models and deep networks on complex data, you know how frustrating it is to set hyperparameters for better model accuracy. We all run into problems such as overfitting, underfitting, exploding or vanishing gradients, and time/computation complexity.

With the HP tuner we provide a range for each hyperparameter and wait. SageMaker trains the candidate models and returns the best-performing one.

In [0]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb,                                    # the estimator (model) to tune
                                               objective_metric_name = 'validation:rmse',          # evaluation metric used to compare models
                                               objective_type = 'Minimize',                        # minimize or maximize the metric; RMSE is an error metric, so the lowest value is best
                                               max_jobs = 20,                                      # total number of models to train; more models => more time => more cost
                                               max_parallel_jobs = 3,                              # number of models trained in parallel; more parallel jobs use more instances at the same time
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(3, 12),        # visit -->  https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
                                                    'eta'      : ContinuousParameter(0.05, 0.5), 
                                                    'min_child_weight': IntegerParameter(2, 8),
                                                    'subsample': ContinuousParameter(0.5, 0.9),
                                                    'gamma': ContinuousParameter(0, 10),
                                               })
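To get a feel for what these ranges mean, the sketch below draws 20 candidate settings from the same ranges with plain random sampling. This is only an illustration: SageMaker's tuner uses its own search strategy (Bayesian by default), and `sample_candidate` is a hypothetical helper, not part of the SDK.

```python
import random

# The same ranges as in the tuner definition: (low, high, kind).
ranges = {
    "max_depth": (3, 12, "int"),
    "eta": (0.05, 0.5, "float"),
    "min_child_weight": (2, 8, "int"),
    "subsample": (0.5, 0.9, "float"),
    "gamma": (0, 10, "float"),
}

def sample_candidate(rng):
    """Draw one hyperparameter setting uniformly from the ranges."""
    candidate = {}
    for name, (low, high, kind) in ranges.items():
        if kind == "int":
            candidate[name] = rng.randint(low, high)  # both ends inclusive
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate

rng = random.Random(0)  # fixed seed so the run is repeatable
candidates = [sample_candidate(rng) for _ in range(20)]  # 20 mirrors max_jobs
print(candidates[0])
```

Each candidate corresponds to one training job. The real tuner picks later candidates based on the results of earlier ones instead of drawing them blindly.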

In [0]:
# Tell SageMaker where the input data is and what format it has.
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

# Start the tuning job (the tuner launches the individual training jobs).
xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

In [0]:
# wait for model response (after all you gave 20 model execution job!)
xgb_hyperparameter_tuner.wait()

In [0]:
# Get the name of the best training job (the one with the lowest validation RMSE).
xgb_hyperparameter_tuner.best_training_job()
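Because the objective type is `Minimize`, "best" here means the job with the smallest validation RMSE. A minimal sketch of that selection rule, using made-up job names and scores rather than real tuner output:

```python
# Hypothetical (job name, validation RMSE) pairs; real values come from the tuner.
finished_jobs = [
    ("tuning-job-001", 4.12),
    ("tuning-job-002", 3.47),
    ("tuning-job-003", 3.95),
]

# With a 'Minimize' objective the best job is the one with the lowest score.
best_name, best_rmse = min(finished_jobs, key=lambda job: job[1])
print(best_name, best_rmse)
```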

In [0]:
# Build an estimator from the best training job to use in place of our previous xgb.
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())

## Test Model
Testing uses SageMaker's batch transform. The job runs in the background, so we need to wait for it to finish.

So, first let's construct a transformer object and start the batch transform job.


In [0]:
xgb_transformer = xgb_attached.transformer(instance_count =1, instance_type = 'ml.m4.xlarge')

In [0]:
xgb_transformer.transform(test_location,content_type='text/csv',split_type='Line')

In [0]:
xgb_transformer.wait()

Our output is saved to S3 automatically, and it is a good idea to copy it locally so we can use it even after stopping or terminating the instance. After all, it is a matter of budget and application.

In [0]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir
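Batch transform writes one result object per input file and appends `.out` to the input file name, which is why the next cell reads `test.csv.out`. A small sketch of that naming rule follows. `transform_output_uri` is a hypothetical helper and both URIs are placeholders.

```python
import posixpath

def transform_output_uri(output_path, input_uri):
    """Return where the predictions for one input file are assumed to land."""
    return posixpath.join(output_path, posixpath.basename(input_uri) + ".out")

# Placeholder URIs for illustration only.
out = transform_output_uri(
    "s3://my-sagemaker-bucket/xgboost-transform-output",
    "s3://my-sagemaker-bucket/boston-xgboost-HL/test.csv",
)
print(out)
```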

## Output visualization (scatter plot)


In [0]:
Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
plt.scatter(Y_test, Y_pred)
plt.xlabel("Median Price")
plt.ylabel("Predicted Price")
plt.title("Median Price vs Predicted Price")
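The scatter plot gives a visual check. To put a number on it, we can compute the same RMSE metric the tuner minimized. The sketch below uses plain Python and four made-up price pairs; in the notebook you would pass the values from `Y_test` and `Y_pred` instead.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration, not real model output.
actual_prices = [24.0, 21.6, 34.7, 33.4]
predicted_prices = [25.0, 20.6, 33.7, 34.4]

error = rmse(actual_prices, predicted_prices)
print(round(error, 3))
```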

### Important Observation:
A SageMaker notebook instance does not have much disk space. After modeling, training and testing, the disk can fill up and cause errors that are not easy to diagnose. So, once you are done, remove data_dir and the files created along the way.

- You can also do this from the terminal of the notebook instance.

***Keep in mind*, YOU WILL LOSE ALL OF YOUR DATA**, even the given **INPUT DATA**


In [0]:
# remove all the files from data_dir
!rm $data_dir/*

# remove directory
!rmdir $data_dir
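The same cleanup can be done from Python with `shutil.rmtree`, which removes a directory and everything inside it in one call. The sketch below runs on a throwaway temporary directory standing in for `data_dir`, so nothing real is deleted.

```python
import os
import shutil
import tempfile

# Stand-in for data_dir: a throwaway directory with one file in it.
scratch_dir = tempfile.mkdtemp()
with open(os.path.join(scratch_dir, "test.csv"), "w") as handle:
    handle.write("1.0,2.0\n")

# Remove the files and the directory itself in one call.
shutil.rmtree(scratch_dir)
print(os.path.exists(scratch_dir))
```

Unlike `rmdir`, this does not require the directory to be empty first, so double-check the path before running it on real data.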

### Summary 
Let's review the flow of the notebook:
1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.

That's all. Please visit [image_folder]() to see screenshots of AWS at work.

In [0]:
# "Keep Learning, Enjoy Empowering" @dave117