# UDACITY SageMaker Essentials: Endpoint Exercise

In the last exercise, you trained a BlazingText supervised sentiment analysis model. (Let's call this model HelloBlaze.) You've recently learned about how we can take a model we've previously trained and generate an endpoint that we can call to efficently evaluate new data. Here, we'll put what we've learned into practice. You will take HelloBlaze and use it to create an endpoint. Then, you'll evaluate some sample data on that model to see how well the model we've trained generalizes. (Sentiment analysis is a notoriously difficult problem, so we'll keep our expectations modest.)

In [3]:
!pip install sagemaker

Collecting sagemaker
  Using cached sagemaker-2.126.0-py2.py3-none-any.whl
Collecting attrs<23,>=20.3.0
  Using cached attrs-22.2.0-py3-none-any.whl (60 kB)
Collecting pathos
  Using cached pathos-0.3.0-py3-none-any.whl (79 kB)
Collecting smdebug-rulesconfig==1.0.1
  Using cached smdebug_rulesconfig-1.0.1-py2.py3-none-any.whl (20 kB)
Collecting numpy<2.0,>=1.9.0
  Downloading numpy-1.24.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
[K     |████████████████████████████████| 17.3 MB 32.7 MB/s eta 0:00:01
[?25hCollecting schema
  Using cached schema-0.7.5-py2.py3-none-any.whl (17 kB)
Collecting pandas
  Using cached pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting importlib-metadata<5.0,>=1.4.0
  Using cached importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Collecting protobuf<4.0,>=3.1
  Using cached protobuf-3.20.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
Collecting boto3<2.0,>=1.26.28
  Using cach

In [9]:
import boto3
import json
import sagemaker
import zipfile
import os

## Understanding Exercise: Preprocessing Data (again)

Before we start, we're going to do preprocessing on a new set of data that we'll be evaluating on HelloBlaze. We won't keep track of the labels here, we're just seeing how we could potentially evaluate new data using an existing model. This code should be very familiar, and requires no modification. Something to note: it is getting tedious to have to manually process the data ourselves whenever we want to do something with our model. We are also doing this on our local machine. Can you think of potential limitations and dangers to the preprocessing setup we currently have? Keep this in mind when we move on to our lesson about batch-transform jobs.  

In [16]:
# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')

# Input data is a file with a single JSON object per line with the following format: 
# {
#  "reviewerID": <string>,
#  "asin": <string>,
#  "reviewerName" <string>,
#  "helpful": [
#    <int>, (indicating number of "helpful votes")
#    <int>  (indicating total number of votes)
#  ],
#  "reviewText": "<string>",
#  "overall": <int>,
#  "summary": "<string>",
#  "unixReviewTime": <int>,
#  "reviewTime": "<string>"
# }
# 
# We are specifically interested in the fields "helpful" and "reviewText"
#

def label_data(input_data):
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"
     
    for l in open(input_data, 'r'):
        l_object = json.loads(l)
        helpful_votes = float(l_object['helpful'][0])
        total_votes = l_object['helpful'][1]
        reviewText = l_object['reviewText']
        if total_votes != 0:
            if helpful_votes / total_votes > .5:
                labeled_data.append(" ".join([HELPFUL_LABEL, reviewText]))
            elif helpful_votes / total_votes < .5:
                labeled_data.append(" ".join([UNHELPFUL_LABEL, reviewText]))
          
    return labeled_data


# Labeled data is a list of sentences, starting with the label defined in label_data. 

def split_sentences(labeled_data):
    new_split_sentences = []
    for d in labeled_data:       
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        for s in sentences:
            if s: # Make sure sentences isn't empty. Common w/ "..."
                new_split_sentences.append(s)
    return new_split_sentences


unzip_data('reviews_Musical_Instruments_5.json.zip')
labeled_data = label_data('reviews_Musical_Instruments_5.json')
new_split_sentence_data = split_sentences(labeled_data)

print(labeled_data[0:9])
print(' ')
print(new_split_sentence_data[0:9])

["__label__1 The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]", '__label__1 The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of volume or high frequencies. The double cloth filter blocks the pops and lets the voice through with no coloration. The metal clamp mount attaches to the mike stand secure enough to keep it attached. The goose neck needs a little coaxing to stay where you

In [12]:
import boto3
from botocore.exceptions import ClientError
# Note: This solution implies that the bucket below has already been made and that you have access
# to that bucket. You would need to change the bucket below to a bucket that you have write
# premissions to. This will take time depending on your internet connection, the training file is ~ 40 mb

BUCKET = "udacity-sagemaker-helloblazedata"
s3_prefix = 1
split_sentence_data = new_split_sentence_data

def cycle_data(fp, data):
    for d in data:
        fp.write(d + "\n")

def write_trainfile(split_sentence_data):
    train_path = "hello_blaze_train"
    with open(train_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return train_path

def write_validationfile(new_split_sentence_data):
    validation_path = "hello_blaze_validation"
    with open(validation_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return validation_path 

def upload_file_to_s3(file_name, s3_prefix):
    object_name = os.path.join(s3_prefix, file_name)
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, BUCKET, object_name)
    except ClientError as e:
        logging.error(e)
        return False

s3_prefix = "l2e1"

split_data_trainlen = int(len(split_sentence_data) * .9)
split_data_validationlen = int(len(split_sentence_data) * .1)


train_path = write_trainfile(split_sentence_data[:split_data_trainlen])
print("Training file written!")
validation_path = write_validationfile(split_sentence_data[split_data_trainlen:])
print("Validation file written!")

upload_file_to_s3(train_path, s3_prefix)
print("Train file uploaded!")
upload_file_to_s3(validation_path, s3_prefix)
print("Validation file uploaded!")

print(" ".join([train_path, validation_path]))

Training file written!
Validation file written!
Train file uploaded!
Validation file uploaded!
hello_blaze_train hello_blaze_validation


## Exercise: Deploy Model

Once you have your model, it's trivially easy to create an endpoint. All you need to do is initialize a "model" object, and call the deploy method. Fill in the method below with the proper addresses and an endpoint will be created, serving your model. Once this is done, confirm that the endpoint is live by consulting the SageMaker Console. You'll see this under "Endpoints" in the "Inference" menu on the left-hand side. If done correctly, this will take a while to get instantiated. 

You will need the following methods: 

* You'll need `image_uris.retrieve` method to determine the image uri to get a BlazingText docker image uri https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html
* You'll need a `model_data` to pass the S3 location of a SageMaker model data
* You'll need to use the `Model` object https://sagemaker.readthedocs.io/en/stable/api/inference/model.html
* You'll need to the get execution role. 
* You'll need to use the `deploy` method of the model object, using a single instance of "ml.m5.large"

In [2]:
from sagemaker import get_execution_role
role = get_execution_role()
print(role)


arn:aws:iam::106841699097:role/service-role/AmazonSageMaker-ExecutionRole-20221219T193483


In [13]:
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker import image_uris

# get the execution role
role = get_execution_role()
# get the image using the "blazingtext" framework and your region
image_uri = image_uris.retrieve(framework='blazingtext',region='us-east-1')
# get the S3 location of a SageMaker model data
model_data = "s3://toysreview/1/hello_blaze_output/BlazingText2/output/model.tar.gz"
# define a model object
model = Model(image_uri=image_uri, model_data=model_data, role=role)
# deploy the model using a single instance of "ml.m5.large"
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

-----!

## Exercise: Evaluate Data

Alright, we now have an easy way to evaluate our data! You will want to interact with the endpoint using the predictor interface: https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html

Predictor is not the endpoint itself, but instead is an interface that we can use to easily interact with our deployed model. Your task is to take `new_split_sentence_data` and evaluate it using the predictor.  

Note that the BlazingText supports "application/json" as the content-type for inference and the model expects a payload that contains a list of sentences with the key as “instances”.

The method you'll need to call is highlighted below.

Another recommendation: try evaluating a subset of the data before evaluating all of the data. This will make debugging significantly faster.

In [15]:
from sagemaker.predictor import Predictor
import json

predictor = Predictor("blazingtext-2022-12-26-14-47-46-668")

# load the first five reviews from new_split_sentence_data
example_sentences = new_split_sentence_data[0:6]

payload = {"instances": example_sentences}

print(json.dumps(payload))

# make predictions using the "predict" method. Set initial_args to {'ContentType': 'application/json'}
predictions = json.loads(predictor.predict(json.dumps(payload), initial_args={'ContentType': 'application/json'}))

print(predictions)

{"instances": ["The product does exactly as it should and is quite affordable", "I did not realized it was double screened until it arrived, so it was even better than I had expected", "As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording", " :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]", "The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of volume or high frequencies", " The double cloth filter blocks the pops and lets the voice through with no coloration"]}
[{'label': ['__label__1'], 'prob': [0.8800267577171326]}, {'label': ['__label__1'], 'prob': [0.8743716478347778]}, {'label': ['__labe

## Make sure you stop/delete the endpoint after completing the exercise to avoid cost.

In [17]:
predictor.delete_endpoint()

In [5]:
import boto3
import json
import os
import zipfile

# Todo: Input the s3 bucket
s3_bucket = "udacity-sagemaker-helloblazedata"

# Todo: Input the s3 prefix
s3_prefix = "1"

# Todo: Input the the file to write the data to
file_name = "musical-instruments-review.txt"

# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')


def split_sentences(input_data):
    split_sentences = []
    for l in open(input_data, 'r'):
        l_object = json.loads(l)
        helpful_votes = float(l_object['helpful'][0])
        total_votes = l_object['helpful'][1]
        if total_votes != 0 and helpful_votes/total_votes != .5:  # Filter out same data as prior jobs. 
            reviewText = l_object['reviewText']
            sentences = reviewText.split(".") 
            for s in sentences:
                if s: # Make sure sentences isn't empty. Common w/ "..."
                    split_sentences.append(s)
    return split_sentences

# Format the data as {'source': 'THIS IS A SAMPLE SENTENCE'}
# And write the data into a file
def cycle_data(fp, data):
    for d in data:
        fp.write(json.dumps({'source':d}) + '\n')

# Todo: write a function to upload the data to s3
def upload_file_to_s3(file_name, s3_prefix):
    object_name = os.path.join(s3_prefix, file_name)
    s3_client = boto3.client('s3')
    try: 
        response = s3_client.upload_file(file_name, s3_bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False


# Unzips file.
unzip_data('reviews_Musical_Instruments_5.json.zip')

# Todo: preprocess reviews_Musical_Instruments_5.json 
sentences = split_sentences('reviews_Musical_Instruments_5.json')

# Write data to a file and upload it to s3.
with open(file_name, 'w') as f:
    cycle_data(f, sentences)

upload_file_to_s3(file_name, s3_prefix)

# Get the s3 path for the data
batch_transform_input_path = "s3://" + "/".join([s3_bucket, s3_prefix, file_name])

print(batch_transform_input_path)

s3://udacity-sagemaker-helloblazedata/1/musical-instruments-review.txt


In [8]:
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker import image_uris
# Get the execution role

role = get_execution_role()

# Get the image uri using the "blazingtext" algorithm in your region. 

image_uri = image_uris.retrieve(framework='blazingtext',region='us-east-1')

# Get the model artifact from S3

model_data = 's3://toysreview/1/hello_blaze_output/BlazingText2/output/model.tar.gz'

# Get the s3 path for the batch transform data

batch_transform_output_path = 's3://udacity-sagemaker-helloblazedata/1/batch_transform_output'

# Define a model object

model = Model(image_uri=image_uri, model_data=model_data, role=role)


# Define a transformer object, using a single instance ml.m4.xlarge. Specify an output path to your s3 bucket. 

transformer = model.transformer(
    instance_count=1, 
    instance_type='ml.m4.xlarge', 
    output_path=batch_transform_output_path
    
)

# Call the transform method. Set content_type='application/jsonlines', split_type='Line'

transformer.transform(
    data=batch_transform_input_path, 
    data_type='S3Prefix',
    content_type='applications/json', 
    split_type='Line'
)

transformer.wait()

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating model with name: blazingtext-2022-12-26-16-59-07-537
INFO:sagemaker:Creating transform job with name: blazingtext-2022-12-26-16-59-08-091


..................................[34mArguments: serve[0m
[35mArguments: serve[0m
[34m[12/26/2022 17:04:38 INFO 140506817820480] Finding and loading model[0m
[34m[12/26/2022 17:04:38 INFO 140506817820480] Trying to load model from /opt/ml/model/model.bin[0m
[35m[12/26/2022 17:04:38 INFO 140506817820480] Finding and loading model[0m
[35m[12/26/2022 17:04:38 INFO 140506817820480] Trying to load model from /opt/ml/model/model.bin[0m
[34m[12/26/2022 17:04:39 INFO 140506817820480] Number of server workers: 4[0m
[34m[2022-12-26 17:04:39 +0000] [1] [INFO] Starting gunicorn 19.7.1[0m
[34m[2022-12-26 17:04:39 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)[0m
[34m[2022-12-26 17:04:39 +0000] [1] [INFO] Using worker: sync[0m
[34m[2022-12-26 17:04:39 +0000] [34] [INFO] Booting worker with pid: 34[0m
[34m[2022-12-26 17:04:39 +0000] [35] [INFO] Booting worker with pid: 35[0m
[35m[12/26/2022 17:04:39 INFO 140506817820480] Number of server workers: 4[0m
[35m[2022-12-

UnexpectedStatusException: Error for Transform job blazingtext-2022-12-26-16-59-08-091: Failed. Reason: ClientError: See job logs for more information