# Lab: Bring your own script Challenge

## Introduction
Your new colleague in the data science team (who isn't very familiar with SageMaker) has written a nice notebook to tackle a classification problem with scikitlearn: `"skLearn-Local Notebook.ipynb"`.

It works OK with the simple IRIS data set they were working on before, but now they'd like to take advantage of some of the features of SageMaker to tackle bigger and harder challenges.

Can you help refactor the Local Notebook code, to show them how to use SageMaker effectively?

## Getting Started

First, check you can run the  **sklearn-Local Notebook.ipynb**  notebook through - reviewing what steps it takes.

This notebook sets out a structure you can use to migrate code into, and lists out some of the changes you'll need to make at a high level. You can either work directly in here, or duplicate this notebook so you still have an unchanged copy of the original.

Try to work through the sections first with an MVP goal in mind (fitting the model to data in S3 via a SageMaker Training Job, and deploying/using the model through a SageMaker Endpoint). The goal is to understand the big picture on how you can bring your own code to SageMaker and scale your training and deploy. You can always have more advance models and more complex training code. 

The excercise of bringing your own training code to SageMaker is what we call ***'Script Mode'***. 

## Sklearn script mode training and serving
Script mode is a training script format for a number of supported frameworks that lets you execute the training script in SageMaker with minimal modification (read more details in this blog [Script mode](https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/)). The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native SKlearn support sets up training-related environment variables and executes your training script. Script mode supports training with a Python script, a Python module, or a shell script. 

## Dependencies
Listing all our imports at the start helps to keep the requirements to run any script/file transparent up-front, and is specified by nearly every style guide including Python's official [PEP 8](https://peps.python.org/pep-0008/#imports)

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

# TODO: What else will you need?
# Have a look at the documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html
# to see which libraries need to be imported to use sagemaker and the Sklearn estimator estimator

import boto3
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split



##  Prepare the Data
We download the Iris data from UCI Machine Learning repository directly from the web. this is the url where you can get the data similar to what we did in the "sklearn-local Notebook.ipynb"

In [3]:
#TODO: download the data from internet
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data


--2022-06-23 05:36:19--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/x-httpd-php]
Saving to: ‘iris.data.1’


2022-06-23 05:36:19 (127 MB/s) - ‘iris.data.1’ saved [4551/4551]



In [4]:
#read in the data with the headers
local_data_path = './iris.data'

data = pd.read_csv(local_data_path,                   
                   names=['sepal length', 'sepal width', 
                          'petal length', 'petal width', 
                          'label'])
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data


Unnamed: 0,sepal length,sepal width,petal length,petal width,label
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [6]:
#split the data into train and test
train,test= np.split(data.sample(frac=1, random_state=22), [int(0.7 * len(data))])
train.head()


#TODO:convert the test and train to CSV 
train.to_csv("train.csv")
test.to_csv("test.csv")

## Set up the environment: Execution Role, Session and S3 Bucket
Now that we have downloaded and reduced the data in the local directory, we will need to upload it to Amazon S3 to make it available for Amazon Sagemaker training.

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.

- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the get_execution_role method from sagemaker python SDK.

In [7]:
# TODO: This is where you can setup execution role, session and S3 bucket.

region = boto3.Session().region_name

#define samemaker role 
role = get_execution_role()

#define sagemaker session
sess = sagemaker.Session()

#define default bucket
bucket = sess.default_bucket()


## Upload Data to Amazon S3
Next is the part where you need to upload the images to Amazon S3 for Sagemaker training. You can refer to the previous example on how to do it using the aws s3 sync CLI command or using the boto3 SDK. The high-level command aws s3 sync command synchronizes the contents of the target bucket and source directory. It allows the use of options such as --delete that allows to remove objects from the target that are not present in the source and --exclude or --include options that filter files or objects to exclude or not exclude.

⏰ Note: Uploading to Amazon S3 typically takes about 2-3 minutes assuming a reduction_factor of 2

In [12]:
#TODO: import aws boto3 library


#TODO:convert the test and train to CSV 
train.to_csv("train.csv")
test.to_csv("test.csv")

#TODO:upload the data on to your sagemaker defulat S3 bucket in a folder called training and your data called 'data.csv'
train_path_s3 = sess.upload_data(
    path='train.csv',  # source
    bucket=bucket,
    key_prefix='training'  # destination path in S3
)

test_path_s3 = sess.upload_data(
    path='test.csv',  # source
    bucket=bucket,
    key_prefix='testing'  # destination path in S3
)

print('Train set URI:', train_path_s3)
print('Test set URI:', test_path_s3)


Train set URI: s3://sagemaker-ap-southeast-2-006485324388/training/train.csv
Test set URI: s3://sagemaker-ap-southeast-2-006485324388/testing/test.csv


## Data Input ("Channels") Configuration
The draft code has 2 data sets: One for training, and one for test/validation. 

In SageMaker terminology, each input data set is a "channel" and we can name them however we like... Just make sure you're consistent about what you call each one!

For a simple input configuration, a channel spec might just be the S3 URI of the folder. For configuring more advanced options, there's the s3_input class in the SageMaker SDK.

In [None]:
# TODO: Define your 2 data channels (train and test)
# The data can be found in: "s3://{bucket_name}/mnist/training" and "s3://{bucket_name}/mnist/testing"
# We can use either the s3_input (which gives us additional configuration options), or a plain string:



## Algorithm ("Estimator") Configuration and Run
Instead of loading and fitting this data here in the notebook, we'll be creating a Sklearn Estimator through the SageMaker SDK, to run the code on a separate container that can be scaled as required.

The ["Using SKlearn with the SageMaker Python SDK"](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-scikit-learn-with-the-sagemaker-python-sdk) docs give a good overview of this process. You should run your estimator in script mode (which is easier to follow than the old default legacy mode) and as Python 3.

## Use the `**main.py**` file already prepared for you in your local directory as your entry point to port code into - which has already been created for you with some basic hints.


In [29]:
#TODO:define your estimator using SKlearn framework
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='main.py',
    role=role,
    instance_type='ml.m5.large',
    framework_version='0.20.0',
    base_job_name='iris-scikit',
    hyperparameters={
        'n_estimators': 100,
        'min_samples_leaf': 3})


### Before running the actual training on SageMaker TrainingJob, it can be good to run it locally first using the code below. If there is any error, you can fix them first before running using SageMaker TrainingJob.

In [27]:
!python3 ./main.py --train ./ --test ./ --model-dir ./ --n_estimators=100 --min_samples_leaf=3

extracting arguments
model saved at ./model.joblib


# Calling `fit`
When you're ready to try your script in a SageMaker training job, you can call estimator.fit() as we did in previous exercises:To start a training job, we call `estimator.fit(training_data_uri)`.

When training is complete, the training job will upload the saved model to S3 for deployment.

In [30]:
#TODO:call the fit function and pass on your data you  uploaded to S3 above for the training to start

sklearn_estimator.fit({'train': train_path_s3, 'test': test_path_s3})

2022-06-23 07:00:31 Starting - Starting the training job...
2022-06-23 07:00:47 Starting - Preparing the instances for trainingProfilerReport-1655967631: InProgress
......
2022-06-23 07:01:59 Downloading - Downloading input data......
2022-06-23 07:02:47 Training - Downloading the training image...
2022-06-23 07:03:27 Uploading - Uploading generated training model[34m2022-06-23 07:03:17,437 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2022-06-23 07:03:17,439 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2022-06-23 07:03:17,457 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m
[34m2022-06-23 07:03:17,892 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2022-06-23 07:03:17,910 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2022-06-23 07:03:17,929 sagemaker-training-toolkit I

## Deploy and Use Your Model (Real-Time Inference)
We are now ready to deploy our model to Sagemaker hosting services and make real time predictions

In [32]:
#TODO:deploy the model to a real time endpoint

predictor = sklearn_estimator.deploy(instance_type='ml.m5.xlarge',
                                     initial_instance_count=1)

-----!

Let's now send some data to our model to predict- the data shouldbe sent in the accepted format (The data sent to the endpoint for this model should be 'text.csv' format) and the code below just does that. We also ensure to perform the same processing on our test, same as what we did on our training data.

In [31]:
#TODO:now get some test data to test your model and process them similar to our training set
from sklearn.preprocessing import StandardScaler
test=pd.read_csv("test.csv")
test=test.iloc[:,0:4].values.tolist()
print(test)

[[60.0, 5.0, 2.0, 3.5], [58.0, 6.6, 2.9, 4.6], [83.0, 6.0, 2.7, 5.1], [87.0, 6.3, 2.3, 4.4], [135.0, 7.7, 3.0, 6.1], [1.0, 4.9, 3.0, 1.4], [47.0, 4.6, 3.2, 1.4], [123.0, 6.3, 2.7, 4.9], [75.0, 6.6, 3.0, 4.4], [52.0, 6.9, 3.1, 4.9], [13.0, 4.3, 3.0, 1.1], [94.0, 5.6, 2.7, 4.2], [11.0, 4.8, 3.4, 1.6], [105.0, 7.6, 3.0, 6.6], [122.0, 7.7, 2.8, 6.7], [106.0, 4.9, 2.5, 4.5], [110.0, 6.5, 3.2, 5.1], [23.0, 5.1, 3.3, 1.7], [103.0, 6.3, 2.9, 5.6], [134.0, 6.1, 2.6, 5.6], [7.0, 5.0, 3.4, 1.5], [91.0, 6.1, 3.0, 4.6], [66.0, 5.6, 3.0, 4.5], [19.0, 5.1, 3.8, 1.5], [121.0, 5.6, 2.8, 4.9], [38.0, 4.4, 3.0, 1.3], [81.0, 5.5, 2.4, 3.7], [29.0, 4.7, 3.2, 1.6], [144.0, 6.7, 3.3, 5.7], [27.0, 5.2, 3.5, 1.5], [111.0, 6.4, 2.7, 5.3], [133.0, 6.3, 2.8, 5.1], [8.0, 4.4, 2.9, 1.4], [127.0, 6.1, 3.0, 4.9], [34.0, 4.9, 3.1, 1.5], [93.0, 5.0, 2.3, 3.3], [45.0, 4.8, 3.0, 1.4], [14.0, 5.8, 4.0, 1.2], [136.0, 6.3, 3.4, 5.6], [84.0, 5.4, 3.0, 4.5], [102.0, 7.1, 3.0, 5.9], [100.0, 6.3, 3.3, 6.0], [44.0, 5.1, 3.8, 1.9

In [33]:
#TODO:the body that you send to your model enpoint should be text/csv format, get your data to the right format before sending it to you model endpoint for prediciton, each observation should be placed on a new line
request_body = ""
for row in test:
    request_body += ",".join([str(n) for n in row]) + "\n"


In [34]:
#TODO:now envoke your endpoint and get predictions

client = boto3.client('sagemaker-runtime')

endpoint=predictor.endpoint_name


content_type = "text/csv"

response = client.invoke_endpoint(
    EndpointName=endpoint,
    ContentType=content_type,
    Body=request_body
    )
response['Body'].read()


b'Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-versicolor | Iris-virginica | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-versicolor | Iris-virginica | Iris-versicolor | Iris-versicolor | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica | Iris-virginica'

In [35]:
predictor.delete_endpoint(delete_endpoint_config=True)
