<h1 align="center">Introduction to Amazon SageMaker AI</h1>

## Table of Contents
1. [Overview](#Overview)
2. [Amazon S3](#Amazon-S3)
   * [Introduction](#Introduction)
   * [Creating bucket](#Creating-bucket)
   * [Uploading data](#Uploading-data)
3. [Amazon SageMaker AI](#Amazon-SageMaker-AI)
   * [Processing jobs](#Processing-jobs)
   * [Training jobs](#Training-jobs)
   * [Endpoints](#Endpoints)
   * [Batch transform jobs](#Batch-transform-jobs)
4. [CloudWatch Logs](#CloudWatch-Logs)
5. [Epilogue](#Epilogue)

## Overview
Amazon SageMaker AI is an umbrella of services that AWS provides for Machine Learning (ML). In a nutshell, it lets developers spend less of their time on infrastructure when developing and deploying ML models. The workflow shown here applies across many learning algorithms and many production use cases.

In this tutorial, you will use some of the most common SageMaker AI microservices to construct the basic components of a machine learning workflow. By the end of this lesson, you will be able to:
* Launch a processing job to preprocess your data.
* Launch a training job and build your ML model.
* Deploy an endpoint to serve as an API for your trained model.
* Launch a batch transform job to try out your trained model.
<center><img src="img/sagemaker_microservices.png" width="80%"></center>

## Amazon S3
First of all, you need to create a bucket in Amazon S3 to store any future files and data.

### Introduction

Amazon Simple Storage Service (Amazon S3) is an object storage service that can store almost any object needed for machine learning. That includes datasets, model artifacts, logs, and more.

An [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html) is a container for objects (i.e., files) stored in S3. 

S3 supports the folder concept as a means of grouping objects. It does this by using a shared name *prefix*. In other words, the grouped objects have names that begin with a common string. This common string, or shared prefix, is the folder name. The prefix must end with a forward slash character `/` to indicate folder structure. Furthermore, object names are also referred to as key names.

For example: `s3://example-bucket/1/2/3/example.txt`
 * Bucket: `example-bucket`
 * Prefix: `1/2/3/`
 * Key name: `1/2/3/example.txt`
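
The breakdown above can be reproduced with plain string operations. The following is a minimal sketch; `split_s3_uri` is a hypothetical helper written for illustration, not part of `boto3`.

```python
def split_s3_uri(uri):
    """Split an S3 URI into (bucket, prefix, key name)."""
    scheme = 's3://'
    if not uri.startswith(scheme):
        raise ValueError(f'Not an S3 URI: {uri}')
    # The bucket is everything up to the first slash; the rest is the key name
    bucket, _, key = uri[len(scheme):].partition('/')
    # The prefix is the part of the key up to and including the last slash
    prefix = key.rsplit('/', maxsplit=1)[0] + '/' if '/' in key else ''
    return bucket, prefix, key


bucket, prefix, key = split_s3_uri('s3://example-bucket/1/2/3/example.txt')
print(bucket, prefix, key)  # example-bucket 1/2/3/ 1/2/3/example.txt
```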

### Creating bucket

The first step is to import the `boto3` module, which is the AWS SDK for Python. You are encouraged to explore [its documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) for future practice.

In [None]:
import boto3
from botocore.exceptions import ClientError

You will then connect to Amazon S3 and create a bucket. Enter a name for your bucket in the code below. It must be globally unique across all AWS accounts. Once the bucket is created, you cannot change its name.

In [None]:
# Create a service client to access S3
s3 = boto3.client('s3')

bucket_name = 'ml-workflow-2'  # Replace with a globally unique name for your bucket
s3.create_bucket(Bucket=bucket_name)
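
Note that a bare `create_bucket(Bucket=...)` call only works when the client's region is `us-east-1`. In any other region, S3 expects a `CreateBucketConfiguration` with a matching `LocationConstraint`. The sketch below shows one way to handle both cases; `create_bucket_kwargs` is a hypothetical helper introduced here for illustration, and the region name in the example is an assumption.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for s3.create_bucket in a given region."""
    kwargs = {'Bucket': bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint
    if region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs


print(create_bucket_kwargs('ml-workflow-2', 'eu-west-1'))

# Possible usage with the client created earlier (not executed here):
# s3.create_bucket(**create_bucket_kwargs(bucket_name, s3.meta.region_name))
```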

Now that you have created a bucket on AWS, you can upload files to it through `boto3` or the AWS console. You can also create folders to organize your files more effectively. Run the code below, changing the folder names if you like. To create nested folders, include more slashes, such as `parent-folder/child-folder`.

In [None]:
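data_prefix = 'data/'  # Folder for datasets; the trailing slash marks the key as a folder
model_prefix = 'models/'  # Folder for models

# A zero-byte object whose key ends with '/' is shown as a folder in the S3 console
s3.put_object(Bucket=bucket_name, Key=data_prefix)
s3.put_object(Bucket=bucket_name, Key=model_prefix)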

### Uploading data

With the bucket created, you can upload data and any other files to it. For now, upload all datasets in the [data folder](../data/) using the function below.

In [None]:
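def upload_file_to_s3(file_name, s3_prefix=''):
    # Make sure a non-empty prefix ends with '/' so the file lands inside the folder
    if s3_prefix and not s3_prefix.endswith('/'):
        s3_prefix += '/'
    # The key name is the prefix followed by the file's base name
    key_name = s3_prefix + file_name.rsplit('/', maxsplit=1)[-1]
    try:
        s3.upload_file(file_name, bucket_name, key_name)
    except ClientError as e:
        print(e)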

In [None]:
upload_file_to_s3('../data/reviews_Musical_Instruments_5.json.zip', data_prefix)
upload_file_to_s3('../data/reviews_Patio_Lawn_and_Garden_5.json.zip', data_prefix)
upload_file_to_s3('../data/reviews_Toys_and_Games_5.json.zip', data_prefix)

If the uploads succeed, you can see these datasets in your bucket in the [S3 console](https://console.aws.amazon.com/s3/home).

## Amazon SageMaker AI

After uploading the necessary files, the next step is to train a machine learning model and use it to produce inferences. This is the most important part, as you will perform common machine learning operations on AWS.

Step by step, you will create a model that predicts the usefulness of a product review, given only the text. This is an example of a problem in the domain of supervised sentiment analysis.

### Processing jobs

Before training a model, you need input data. The [dataset](../data/reviews_Toys_and_Games_5.json.zip) you will be working with is a collection of reviews for an assortment of toys and games found on Amazon. It includes, but is not limited to, the text of the review itself as well as the number of user votes on whether or not the review was helpful.

However, the dataset is inside a .zip file so you have to extract it before proceeding. Moreover, the dataset is a file containing a single JSON object per line representing a review with the following format: 
```JSON
{
 "reviewerID": "<string>",
 "asin": "<string>",
 "reviewerName": "<string>",
 "helpful": [
   <int>, (indicating number of helpful votes)
   <int>  (indicating total number of votes)
 ],
 "reviewText": "<string>",
 "overall": <int>,
 "summary": "<string>",
 "unixReviewTime": <int>,
 "reviewTime": "<string>"
}
```
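
Reading this format takes two standard-library modules: `zipfile` to open the archive and `json` to parse each line. The following is a minimal sketch, not the code of the provided script; `read_reviews` is a hypothetical helper name.

```python
import json
import zipfile


def read_reviews(zip_path):
    """Yield one review dict per non-empty line of every file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith('/'):
                continue  # Skip directory entries
            with archive.open(member) as handle:
                for raw_line in handle:
                    line = raw_line.decode('utf-8').strip()
                    if line:
                        # Each line is a standalone JSON object
                        yield json.loads(line)

# Possible usage (not executed here):
# for review in read_reviews('../data/reviews_Toys_and_Games_5.json.zip'):
#     print(review['helpful'], review['reviewText'])
```
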
Later, you will train with the [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html), which provides implementations of the [Word2Vec algorithm](https://en.wikipedia.org/wiki/Word2vec) and of text classification that are optimized for SageMaker AI. For this algorithm to work, you have to format the input data correctly. The same is true of any other algorithm or model you work with, as each requires a particular type and structure of input data. In this case, the data should consist only of plain text, with each line containing a label name followed by a sentence. Labels must be prefixed by the string `__label__`.

For the dataset in this exercise, you will extract the text from the *reviewText* field and generate a label from the *helpful* field for each review. If the majority of votes are helpful, assign the review `__label__1`; otherwise, assign it `__label__2`. If there is no majority or the review text is empty, drop the review from consideration. Then split the text into individual sentences, making sure that each sentence keeps the label of its review. When splitting on the character `.`, make sure that no empty sentences are created, since reviews often contain one or more ellipses (`...`). Your input data should look something like this:

```
__label__1 Even if you can only play with one other person, you'll want to pull Stone Age out often
__label__1 But if you have friends to join you, this game will be on the table a lot
__label__2 It's a fun game but not a favorite
__label__2 I prefer more complex games
__label__2 If you're new to gaming or like relatively simple games I recommend you try this
```
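
The labeling and splitting rules described above can be sketched as a single function. This is one possible reading of those rules, not the code of the provided script; `review_to_lines` is a hypothetical name, and a review with no votes at all is treated as having no majority.

```python
def review_to_lines(review):
    """Turn one review dict into a list of '__label__N sentence' lines."""
    helpful_votes, total_votes = review['helpful']
    unhelpful_votes = total_votes - helpful_votes
    text = review.get('reviewText', '').strip()
    # Drop reviews with no majority (including no votes) or with empty text
    if helpful_votes == unhelpful_votes or not text:
        return []
    label = '__label__1' if helpful_votes > unhelpful_votes else '__label__2'
    # Splitting on '.' turns an ellipsis into empty fragments, so filter them out;
    # ' '.join(s.split()) also collapses newlines so each sentence stays on one line
    sentences = [' '.join(s.split()) for s in text.split('.')]
    return [f'{label} {s}' for s in sentences if s]


example = {'helpful': [3, 4], 'reviewText': "It's a fun game... I recommend it."}
print(review_to_lines(example))
```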

Finally, it is your responsibility to split the dataset into a training set and a testing set. The training set should contain 80% of the dataset, and the testing set the remaining 20%. Make sure that they don't overlap.
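
One simple way to get a non-overlapping 80/20 split is to shuffle the labeled lines once and cut the list at the 80% mark. The sketch below assumes the lines fit in memory; `split_train_test` and its fixed `seed` are illustrative choices, not requirements of SageMaker AI. Each line lands in exactly one set, although identical sentences that occur twice in the data can still end up on both sides.

```python
import random


def split_train_test(lines, train_fraction=0.8, seed=0):
    """Shuffle the lines reproducibly and cut them into (train, test)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    # Slicing at one index guarantees every line lands in exactly one set
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test([f'__label__1 sentence {i}' for i in range(10)])
print(len(train), len(test))  # 8 2
```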

All of the procedures mentioned above and more are collectively called *data pre-processing*, the first and most crucial step in any machine learning project.

Now, implement a Python script that unzips, formats, and splits the raw dataset as instructed above, and save it in the same folder as this notebook. Alternatively, use the provided [hello_blaze_preprocess.py](hello_blaze_preprocess.py).

If you decide to write your own script, note that the processing job copies the dataset from S3 to a local directory inside the container, under `/opt/ml/processing/`. Your script should therefore read the dataset from this directory. It must also write the training set and the testing set to specified local directories, likewise under `/opt/ml/processing/`. You will set up these local directories in the next steps.
<center><img src="img/processing_model.png"></center>
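
For a custom script, writing the outputs could look like the sketch below. The helper `write_lines` and the file names `train.txt` and `test.txt` are hypothetical; only the `/opt/ml/processing/` base directory comes from the description above.

```python
import os


def write_lines(lines, directory, filename):
    """Write one line per entry to directory/filename and return the file path."""
    # The output directory may not exist yet inside the container
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, 'w', encoding='utf-8') as handle:
        for line in lines:
            handle.write(line + '\n')
    return path

# Possible usage inside the processing container (not executed here):
# write_lines(train_lines, '/opt/ml/processing/output/train', 'train.txt')
# write_lines(test_lines, '/opt/ml/processing/output/test', 'test.txt')
```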

In the upcoming steps, you are going to use the `sagemaker` module, which is a higher-level AWS SDK designed specifically for Amazon SageMaker AI. While `boto3` gives you general access to all AWS services, `sagemaker` is specialized for tasks within the Amazon SageMaker AI microservices.

You are also going to use the role that was created for this notebook instance to run SageMaker AI microservices.

In [None]:
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Get the execution role
role = sagemaker.get_execution_role()

In [None]:
container_prefix = '/opt/ml/processing/'
preprocess_code = 'hello_blaze_preprocess.py'  # Replace with your own Python script

# S3 path of the unprocessed dataset
s3_dataset = 's3://' + bucket_name + '/' + data_prefix.rstrip('/') + '/reviews_Toys_and_Games_5.json.zip'

# local directory path that the dataset will be downloaded into
input_path = container_prefix + 'input'
# local directory paths where your Python script saves the training/testing set
train_path = container_prefix + 'output/train'
test_path = container_prefix + 'output/test'

In [None]:
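def get_s3_path(filename='', prefix=''):
    # Join the bucket, prefix, and filename with single slashes, skipping empty parts
    parts = [bucket_name, prefix.strip('/'), filename]
    return 's3://' + '/'.join(part for part in parts if part)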

In [None]:
# Create an SKLearnProcessor, version 0.20.0
sklearn_processor = SKLearnProcessor(role=role, framework_version='0.20.0', instance_type='ml.m5.large', instance_count=1)

# Run the processing job. Pass the local path of the processing script, one processing
# input object, and two processing output objects. The input destination and the output
# sources are directories inside the container, not the files themselves.

sklearn_processor.run(code=preprocess_code,
                      inputs=[ProcessingInput(
                          source=s3_dataset,  # S3 path of the unprocessed dataset
                          destination=input_path,  # container directory the dataset is downloaded into
                      )],
                      outputs=[ProcessingOutput(source=train_path),  # container directory where the script saves the training set
                               ProcessingOutput(source=test_path)])  # container directory where the script saves the testing set