# NeoPulse® SageMaker Algorithm Usage Demonstration

Using NeoPulse® Algorithm with Amazon SageMaker APIs

This sample notebook demonstrates how to use a NeoPulse® Algorithm ARN to run a training job and then use the resulting model for inference.

***Pre-Requisite:*** Please subscribe to a NeoPulse® Algorithm before proceeding with this notebook.

***NOTE: Before running this notebook, please read through it and fill in the IAM role and ARN for the algorithm you wish to use.***

***NOTE:*** By default, this notebook uses the AWS credentials located in ${HOME}/.aws/.

## Contents of this notebook
This notebook contains all the code necessary to build a simple sentiment model based on the IMDB sentiment analysis dataset.

The example consists of the following files:

- <b>aws_example.ipynb</b> - this file.
- <b>src/build_csv.py</b> - function definitions to download and preprocess the IMDB data set.
- <b>data/training/train.nml</b> - NML script for training model.
- <b>images/workflow.jpeg</b> - Workflow image.

## Workflow
The typical workflow is depicted in the figure below. This illustrates the process of training a model, creating a model package, using the model package to create a model, then using the model for batch or real-time inference.
![fig1](img/workflow.jpeg "Fig.1 workflow")

## Prepare the data

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to prepare your data in S3. For the purposes of this example, we're using the classic IMDB dataset, which is small. First we'll download and pre-process the data locally, then upload the training data to S3 for use with Amazon SageMaker.

In [None]:
from src.build_csv import download_data, write_training_data, write_inference_data

download_data()

write_training_data()

write_inference_data()

You should now see the following files:

- <b>data/training/training_data.csv</b> - Training dataset. 50,000 records.
- <b>data/transform/batch_inference.zip</b> - Zipped batch inference data.
- <b>data/transform/full_query.csv</b> - Batch inference data. 25,000 records.
- <b>data/transform/short_query.csv</b> - Real-time inference data. 10 records.
- <b>data/transform/realtime_inference.zip</b> - Zipped real-time inference data.

## Training data
The training data <b>training_data.csv</b> contains two comma-separated columns, "Review" and "Label", with one record per line, for a total of 50,000 records. The NML script <b>train.nml</b> specifies a parameter "validation_split" with a value of 0.5, i.e. 50% of the data is withheld from training and used only for validation.
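
To make the effect of "validation_split" concrete, here is a minimal sketch of the arithmetic for 50,000 records. The helper `split_counts` is hypothetical and only counts records; how NeoPulse® chooses which records to withhold is not covered here.

```python
def split_counts(n_records, validation_split):
    """Return (n_train, n_validation) for a given validation fraction."""
    if not 0.0 <= validation_split <= 1.0:
        raise ValueError("validation_split must be between 0 and 1")
    n_validation = int(round(n_records * validation_split))
    return n_records - n_validation, n_validation


n_train, n_validation = split_counts(50000, 0.5)
print(n_train, n_validation)
```

With the values from this notebook, half of the records are used for training and half for validation.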

## Inference data
The inference data <b>full_query.csv</b> contains only the "Review" column, and only the last 25,000 records present in <b>training_data.csv</b>. It is then compressed into <b>batch_inference.zip</b>.

We have also created a smaller set for testing real-time inference: <b>short_query.csv</b>. It contains only the first 10 records of <b>full_query.csv</b>, and is compressed into <b>realtime_inference.zip</b>.

The CSV files are not used directly; the script leaves them in place for illustrative purposes.
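
The derivation described above can be sketched on a tiny in-memory sample. This is not the code in <b>src/build_csv.py</b>: the function `build_query_zip`, the member name `query.csv`, and the decision to keep a header row are all assumptions made for illustration.

```python
import csv
import io
import zipfile


def build_query_zip(training_csv_text, n_last, member_name="query.csv"):
    """Zip the "Review" column of the last n_last records, entirely in memory."""
    if n_last <= 0:
        raise ValueError("n_last must be positive")
    rows = list(csv.DictReader(io.StringIO(training_csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Review"])  # assumption: the query file keeps a header row
    for row in rows[-n_last:]:
        writer.writerow([row["Review"]])  # drop the "Label" column
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, out.getvalue())
    return buf.getvalue()


sample = "Review,Label\ngreat film,1\nawful plot,0\nloved it,1\n"
zip_bytes = build_query_zip(sample, n_last=2)

# Read the archive back to confirm what it contains
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    extracted = zf.read("query.csv").decode("utf-8")
print(extracted)
```

The archive holds only the reviews of the last two sample records, mirroring how the inference files omit the labels.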

## Set up the environment and create a session

We begin by setting some configuration variables: our local data directories, the name and prefixes for the S3 bucket, the IAM role, and the algorithm ARN. Please edit these to reflect the bucket, role, and algorithm you will use.

In [None]:
# Local data directories
TRAINING_WORKDIR = "data/training/"
TRANSFORM_WORKDIR = "data/transform/"


# AWS INFORMATION
# S3 Bucket -- Replace this with the name of the S3 bucket you want to use, or leave it as None to use the default bucket
BUCKET = None

# S3 Prefixes -- Replace these with the S3 key prefixes you want to use
COMMON_PREFIX = "neopulse-sagemaker-demo-imbd"
TRAINING_PREFIX = COMMON_PREFIX + "/training-data"
INFERENCE_PREFIX = COMMON_PREFIX + "/inference-data"

# IAM Role -- Replace this with a valid IAM role
IAM_ROLE = None

# Algorithm ARN -- Replace this with the ARN of your NeoPulse® Algorithm Subscription
ALGO_ARN = None
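
Since both `IAM_ROLE` and `ALGO_ARN` start out as `None`, a quick sanity check before continuing can save a failed job. The sketch below is a hypothetical helper, not part of the SageMaker SDK; the patterns only check the general ARN layout, and the SDK may accept other forms (for example, a bare role name), so treat a warning as a hint rather than an error. The example ARNs are placeholders.

```python
import re

# Deliberately loose patterns: they check the overall shape only
ROLE_ARN_RE = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")
ALGO_ARN_RE = re.compile(r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:algorithm/.+$")


def check_config(iam_role, algo_arn):
    """Return a list of human-readable problems; an empty list means both look plausible."""
    problems = []
    if not iam_role or not ROLE_ARN_RE.match(iam_role):
        problems.append("IAM_ROLE does not look like an IAM role ARN")
    if not algo_arn or not ALGO_ARN_RE.match(algo_arn):
        problems.append("ALGO_ARN does not look like a SageMaker algorithm ARN")
    return problems


# Unfilled values are reported ...
print(check_config(None, None))
# ... while placeholder ARNs with the expected layout pass
print(check_config(
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "arn:aws:sagemaker:us-east-2:123456789012:algorithm/example-algorithm",
))
```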

The session remembers our connection parameters to Amazon SageMaker. We'll use it to perform all of our Amazon SageMaker operations.

In [None]:
import sagemaker as sage

sm_sess = sage.Session()

## Upload data to S3

Now we use the session to upload the data to S3.

In [None]:
# First we upload the training data
training_data = sm_sess.upload_data(TRAINING_WORKDIR, bucket=BUCKET, key_prefix=TRAINING_PREFIX)
print("Training Data Location: " + training_data)

# Then the batch inference data
batch_inference_data = sm_sess.upload_data(TRANSFORM_WORKDIR + 'batch_inference.zip', bucket=BUCKET, key_prefix=INFERENCE_PREFIX)
print("Batch Inference Data Location: " + batch_inference_data)

## Creating Training Job using Algorithm ARN
Now we configure a training job by passing the algorithm ARN, the IAM role, the instance count and type, and a base job name to the AlgorithmEstimator.

In [None]:
from sagemaker.algorithm import AlgorithmEstimator

estimator = AlgorithmEstimator(sagemaker_session=sm_sess,
                               algorithm_arn=ALGO_ARN,
                               role=IAM_ROLE,
                               train_instance_count=1,
                               train_instance_type='ml.m4.xlarge',
                               base_job_name='imdb-demo')

## Run Training Job
We use the estimator.fit function to train the model. You can check the status of the training job in the SageMaker console (substitute your own region in the URL):
https://us-east-2.console.aws.amazon.com/sagemaker/home?region=<b>your-region-here</b>#/jobs

Look for a job starting with the <b>base_job_name</b> you specified in creating the estimator, above.

In [None]:
print("Now run the training job using algorithm arn %s in region %s" % (ALGO_ARN, sm_sess.boto_region_name))
estimator.fit({'training': training_data})
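
Besides the console, the job status can be read programmatically. The sketch below formats a DescribeTrainingJob response into one line; `summarize_training_job` is a hypothetical helper, and the example dictionary is made up so the code runs without AWS access. The commented lines show how a real description might be fetched, assuming you know the job name SageMaker assigned.

```python
def summarize_training_job(description):
    """Turn a DescribeTrainingJob response dict into a one-line status string."""
    name = description.get("TrainingJobName", "<unknown>")
    status = description.get("TrainingJobStatus", "<unknown>")
    if status == "Failed":
        reason = description.get("FailureReason", "no reason given")
        return "%s: Failed (%s)" % (name, reason)
    return "%s: %s" % (name, status)


# With AWS credentials, a real description could be fetched like this
# (job_name is whatever name SageMaker assigned to your job):
#   sm_client = sm_sess.boto_session.client("sagemaker")
#   description = sm_client.describe_training_job(TrainingJobName=job_name)

example = {"TrainingJobName": "imdb-demo-example", "TrainingJobStatus": "InProgress"}
print(summarize_training_job(example))
```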

## Batch Transform Job
Now let's use the trained model to run a batch inference job and verify that it works.

<b>NOTE:</b> Set the parameter <b>max_payload</b> to a value large enough to handle <b>batch_inference.zip</b>. Here we set it to 0, corresponding to unlimited payload size.

In [None]:
# One ml.m4.xlarge instance; max_payload=0 places no limit on payload size
transformer = estimator.transformer(1, 'ml.m4.xlarge', max_payload=0)
transformer.transform(batch_inference_data, content_type='application/zip', logs=True, wait=True)

print("Batch Transform output saved to " + transformer.output_path)

## Inspect the Batch Transform Output in S3

In [None]:
from urllib.parse import urlparse

# Split the s3://bucket/prefix output path into bucket name and key prefix
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
# The result is stored under the input file name with '.out' appended
file_key = '{}/{}.out'.format(parsed_url.path[1:], "batch_inference.zip")
print(bucket_name)
print(file_key)
s3_client = sm_sess.boto_session.client('s3')

# Download the result and show its first 11 lines
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
response_text = response['Body'].read().decode('utf-8')
print(response_text.split('\n')[:11])
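
The bucket and key computation above can be wrapped in a small reusable function. This is a minimal sketch: `transform_output_location` is a hypothetical name, the example bucket and prefix are placeholders, and the function additionally rejects non-S3 paths and tolerates an empty or slash-terminated prefix.

```python
from urllib.parse import urlparse


def transform_output_location(output_path, input_filename):
    """Return (bucket, key) of the '.out' object for a given input file name."""
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3:// URI, got %r" % output_path)
    prefix = parsed.path.strip("/")
    key = "%s.out" % input_filename
    if prefix:
        key = "%s/%s" % (prefix, key)
    return parsed.netloc, key


print(transform_output_location("s3://example-bucket/imdb-demo-example", "batch_inference.zip"))
```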

## Live Inference Endpoint
Finally, we demonstrate the creation of an endpoint for live inference using the model generated by the NeoPulse® Algorithm.

In [None]:
#from sagemaker.predictor import zip_serializer
predictor = estimator.deploy(1, 'ml.m4.xlarge', content_type="application/zip")

### Perform real-time prediction
We read the zip file as binary data and then use the predictor to get the results.

In [None]:
with open(TRANSFORM_WORKDIR + "realtime_inference.zip", 'rb') as f:
    data = f.read()

print(predictor.predict(data).decode("utf-8"))
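
The same request can also be sent without the predictor object, through the lower-level runtime API. The sketch below is an illustration under assumptions: `invoke_with_zip` is a hypothetical helper, and `StandInRuntime` is an offline stand-in so the code runs without AWS credentials. Only the `invoke_endpoint` call and its `EndpointName`, `ContentType`, and `Body` parameters belong to the real runtime client.

```python
import io


def invoke_with_zip(runtime_client, endpoint_name, zip_bytes):
    """Send zipped data to an endpoint and return the response body as text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/zip",
        Body=zip_bytes,
    )
    return response["Body"].read().decode("utf-8")


class StandInRuntime:
    """Offline stand-in for the real client, so the sketch runs without AWS."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        text = "%s received %d bytes as %s" % (EndpointName, len(Body), ContentType)
        return {"Body": io.BytesIO(text.encode("utf-8"))}


print(invoke_with_zip(StandInRuntime(), "example-endpoint", b"PK\x05\x06"))

# With AWS credentials, pass a real client instead, e.g.
#   runtime = sm_sess.boto_session.client("sagemaker-runtime")
```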

### Clean up the endpoint

In [None]:
# Delete the endpoint so that it stops incurring charges
estimator.delete_endpoint()