# Creating a Sentiment Analysis Web App
## Using PyTorch and SageMaker


---

My goal in this project is to build a simple web page where a user can enter a movie review. The web page then sends the review to the deployed model, which predicts the sentiment of the entered review.

## General Outline

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.

For this project, I will be following the steps in the general outline with some modifications. 

First, I will not be testing the model in its own step. I will still test the model, but I will do so by deploying it and then sending the test data to the deployed endpoint. One of the reasons for doing this is so that I can make sure that my deployed model is working correctly before moving forward.

In addition, I will deploy and use my trained model a second time. In the second iteration I will customize the way that my trained model is deployed by including some of my own code. My newly deployed model will then be used in the sentiment analysis web app.

In [1]:
# make sure that SageMaker 1.x is used
!pip install sagemaker==1.72.0



## Step 1: Downloading the data

For this notebook, I will be using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [2]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2021-09-03 16:10:56--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2021-09-03 16:11:00 (21.4 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Step 2: Preparing and Processing the data

As in the XGBoost notebook, I will begin with some initial data processing. First, I will read in each of the reviews and combine them into a single input structure. Then, I will split the dataset into a training set and a testing set.

In [3]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # here I represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [4]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


Now that I've read the raw training and testing data from the downloaded dataset, I will combine the positive and negative reviews and shuffle the resulting records.

In [5]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    # combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    # shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [6]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000
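
`sklearn.utils.shuffle` applies one shared permutation to every sequence it receives, which is what keeps each review aligned with its label. A minimal standard-library sketch of that idea follows; the helper name `shuffle_together` and the toy data are illustrative and not part of the notebook.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists with one shared permutation."""
    assert len(items) == len(labels), "items and labels must be the same length"
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # one permutation, applied to both lists
    return [items[i] for i in order], [labels[i] for i in order]

toy_reviews = ["great film", "dull plot", "loved it", "fell asleep"]
toy_labels = [1, 0, 1, 0]
shuffled_reviews, shuffled_labels = shuffle_together(toy_reviews, toy_labels, seed=0)

# every review still carries its original label after shuffling
original = dict(zip(toy_reviews, toy_labels))
pairs_preserved = all(original[r] == l for r, l in zip(shuffled_reviews, shuffled_labels))
```

Shuffling the two lists independently would break this pairing, which is why both are passed to a single `shuffle` call in `prepare_imdb_data`.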


Now that I have the training and testing sets unified and prepared, I should do a quick check and see an example of the data the model will be trained on. This is generally a good idea as it allows me to see how each of the further processing steps affects the reviews and it also ensures that the data has been loaded correctly.

In [7]:
print(train_X[100])
print(train_y[100])

"Scoop" is also the name of a late-Thirties Evelyn Waugh novel, and Woody Allen's new movie, though set today, has a nostalgic charm and simplicity. It hasn't the depth of characterization, intense performances, suspense or shocking final frisson of Allen's penultimate effort "Match Point," (argued by many, including this reviewer, to be a strong return to form) but "Scoop" does closely resemble Allen's last outing in its focus on English aristocrats, posh London flats, murder, and detection. This time Woody leaves behind the arriviste murder mystery genre and returns to comedy, and is himself back on the screen as an amiable vaudevillian, a magician called Sid Waterman, stage moniker The Great Splendini, who counters some snobs' probing with, "I used to be of the Hebrew persuasion, but as I got older, I converted to narcissism." Following a revelation in the midst of Splendini's standard dematerializing act, with Scarlett Johansson (as Sondra Pransky) the audience volunteer, the misma

The first step in processing the reviews is to remove any HTML tags that appear. In addition, I wish to stem the input so that words such as *entertained* and *entertaining* are considered the same with regard to sentiment analysis.

In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # build once; set lookups are fast
    
    text = BeautifulSoup(review, "html.parser").get_text() # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # lower-case, replace non-alphanumerics with spaces
    words = text.split() # split string into words
    words = [w for w in words if w not in stop_words] # remove stopwords
    words = [stemmer.stem(w) for w in words] # stem, reusing a single stemmer
    
    return words

The `review_to_words` method defined above uses `BeautifulSoup` to remove any HTML tags that appear and uses the `nltk` package to remove stop words and stem the reviews. As a check to ensure I know how everything is working, I'll try applying `review_to_words` to one of the reviews in the training set.

In [9]:
# apply review_to_words to a review 
review_to_words(train_X[100])

['scoop',
 'also',
 'name',
 'late',
 'thirti',
 'evelyn',
 'waugh',
 'novel',
 'woodi',
 'allen',
 'new',
 'movi',
 'though',
 'set',
 'today',
 'nostalg',
 'charm',
 'simplic',
 'depth',
 'character',
 'intens',
 'perform',
 'suspens',
 'shock',
 'final',
 'frisson',
 'allen',
 'penultim',
 'effort',
 'match',
 'point',
 'argu',
 'mani',
 'includ',
 'review',
 'strong',
 'return',
 'form',
 'scoop',
 'close',
 'resembl',
 'allen',
 'last',
 'outing',
 'focu',
 'english',
 'aristocrat',
 'posh',
 'london',
 'flat',
 'murder',
 'detect',
 'time',
 'woodi',
 'leav',
 'behind',
 'arrivist',
 'murder',
 'mysteri',
 'genr',
 'return',
 'comedi',
 'back',
 'screen',
 'amiabl',
 'vaudevillian',
 'magician',
 'call',
 'sid',
 'waterman',
 'stage',
 'monik',
 'great',
 'splendini',
 'counter',
 'snob',
 'probe',
 'use',
 'hebrew',
 'persuas',
 'got',
 'older',
 'convert',
 'narciss',
 'follow',
 'revel',
 'midst',
 'splendini',
 'standard',
 'demateri',
 'act',
 'scarlett',
 'johansson',
 'son

The above `review_to_words` method removes HTML formatting and stems the words found in a review, for example converting *entertained* and *entertaining* into *entertain* so that they are treated as the same word. It also lower-cases the text, replaces non-alphanumeric characters with spaces, and removes stop words.

The method below applies `review_to_words` to each of the reviews in the training and testing datasets. It also caches the results, because this processing step can take a long time.

In [10]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [11]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl


## Transforming the data

For the model I'm going to construct in this notebook, I first need a feature representation of each review. To start, I will represent each word as an integer. Some of the words that appear in the reviews occur very infrequently and so likely don't contain much information for the purposes of sentiment analysis. To deal with this, I will fix the size of the working vocabulary and only include the words that appear most frequently. I will then combine all of the infrequent words into a single category, which I will label `1`.

Since I will be using a recurrent neural network, it will be convenient if the length of each review is the same. To do this, I will fix the size for our reviews and then pad short reviews with the category 'no word' (which I will label `0`) and truncate long reviews.

### Creating a word dictionary

To begin with, I need to construct a way to map words that appear in the reviews to integers. Here I fix the size of our vocabulary (including the 'no word' and 'infrequent' categories) to be `5000`.

Note that even though the vocab_size is set to `5000`, I only want to construct a mapping for the most frequently appearing `4998` words. This is because I want to reserve the special labels `0` for 'no word' and `1` for 'infrequent word'.

In [12]:
import numpy as np


def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # determine how often each word appears in `data`. Note that `data` is a list of sentences and that a sentence is a list of words.
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    for review in data:
        for word in review:
            if word not in word_count:
                word_count[word] = 1
            else:
                word_count[word] += 1
    # sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and 
    # sorted_words[-1] is the least frequently appearing word.
    sorted_words = sorted(word_count, key=word_count.get, reverse=True)
    
    word_dict = {} # This is what I'm building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2                              # and 'infrequent' labels
        
    return word_dict

In [13]:
word_dict = build_dict(train_X)

In [14]:
# show the five most frequently appearing words in the training set.
list(word_dict.items())[:5]

[('movi', 2), ('film', 3), ('one', 4), ('like', 5), ('time', 6)]
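
The same mapping can be sketched on a toy corpus to make the reserved labels concrete. This version uses `collections.Counter` instead of a hand-written counting loop; the name `build_dict_sketch` and the toy reviews are assumptions made for illustration, not the notebook's code.

```python
from collections import Counter

def build_dict_sketch(data, vocab_size=5000):
    """Map the (vocab_size - 2) most frequent words to integers starting at 2."""
    counts = Counter(word for review in data for word in review)
    most_common = [word for word, _ in counts.most_common(vocab_size - 2)]
    # 0 is reserved for 'no word' (padding) and 1 for 'infrequent word'
    return {word: idx + 2 for idx, word in enumerate(most_common)}

toy_reviews = [
    ["movi", "great", "movi"],
    ["movi", "bad", "plot"],
    ["great", "plot", "great", "movi"],
]
toy_dict = build_dict_sketch(toy_reviews, vocab_size=5)
print(toy_dict)  # {'movi': 2, 'great': 3, 'plot': 4}
```

With `vocab_size=5` only three words receive their own integer. The least frequent word, `bad`, is left out of the dictionary and would later be encoded as the infrequent label `1`.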

### Save `word_dict`

Later on when I construct an endpoint which processes a submitted review, I will need to make use of the `word_dict` which I have created. As such, I will save it to a file now for future use.

In [15]:
data_dir = '../data/pytorch' # the folder I will use for storing data
if not os.path.exists(data_dir): # make sure that the folder exists
    os.makedirs(data_dir)

In [16]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the reviews

Now that I have the word dictionary which allows us to transform the words appearing in the reviews into integers, it's time to make use of it and convert our reviews to their integer sequence representation, making sure to pad or truncate to a fixed length, which in our case is `500`.

In [17]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # I will use 0 to represent the 'no word' category
    INFREQ = 1 # and use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [18]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

As a quick check to make sure that things are working as intended, I'll look at one of the reviews in the training set after it has been processed.

In [19]:
# examine one of the processed reviews to make sure everything is working as intended.
print(train_X[0])
print("Review length:",len(train_X[0]))

[  45  575 1212    2    5  419  629  753 1113   75 3665  514   85 1096
    2  187  102    2  584  634  164   97 1412  774 1219    1 4095    1
    5  343 1805 1436 1425    1    1  262 1263  645  229  123 1444    1
  194  132  535    1    1 1707 3961  378    1   57  758  136  943 4095
   34  758 1242  303 1623 1345   33   75 1910  212 1444   33  740  324
   64 4095  163 1444 2580  318  194  811  300    4   26    2   24   35
    2 1841 1408 2180 3606  137   10    1   17 1012    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

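In the cells above I use the `preprocess_data` and `convert_and_pad_data` methods to process both the training and testing sets. This might result in a problem because the most frequent words (the vocabulary) can be different in the testing set: `word_dict` is built from the training reviews only, so any test-set word missing from it is encoded as the infrequent label `1`.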
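
Because `word_dict` is built from the training reviews only, it helps to see how a review containing an unseen word is encoded. The sketch below restates the padding and truncation logic compactly; `convert_and_pad_sketch` and the three-word dictionary are illustrative assumptions, and `pad=5` is used only to keep the output short.

```python
def convert_and_pad_sketch(word_dict, sentence, pad=500):
    """Encode a list of words as `pad` integers; return the encoding and the true length."""
    NOWORD, INFREQ = 0, 1
    # words missing from word_dict fall back to the 'infrequent' label
    encoded = [word_dict.get(word, INFREQ) for word in sentence[:pad]]
    encoded += [NOWORD] * (pad - len(encoded))  # right-pad short reviews with 'no word'
    return encoded, min(len(sentence), pad)

toy_dict = {"movi": 2, "great": 3, "plot": 4}

short, short_len = convert_and_pad_sketch(toy_dict, ["great", "movi", "unseen"], pad=5)
print(short, short_len)   # [3, 2, 1, 0, 0] 3

long_, long_len = convert_and_pad_sketch(toy_dict, ["plot"] * 8, pad=5)
print(long_, long_len)    # [4, 4, 4, 4, 4] 5
```

The word `unseen` becomes `1`, the short review is padded with `0`, and the long review is cut to the first five words while its reported length is capped at `pad`.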

## Step 3: Upload the data to S3

As in the XGBoost notebook, I will need to upload the training dataset to S3 in order for our training code to access it. For now I will save it locally and I will upload to S3 later on.

### Saving the processed training dataset locally

It is important to note the format of the data that I'm saving as I will need to know it when I write the training code. In our case, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.

In [20]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
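
The `label`, `length`, `review[500]` layout can be sketched with the standard `csv` module. The helpers `make_row` and `parse_row` are hypothetical names, and a padded length of 5 stands in for the notebook's 500 so that the row stays readable.

```python
import csv
import io

PAD = 5  # the notebook uses 500; a small value keeps the example readable

def make_row(label, length, review):
    """Lay out one record as: label, length, review[PAD]."""
    assert len(review) == PAD
    return [label, length] + list(review)

def parse_row(row):
    """Split a parsed CSV record back into label, length and encoded review."""
    values = [int(v) for v in row]
    return values[0], values[1], values[2:]

buffer = io.StringIO()
csv.writer(buffer).writerow(make_row(1, 3, [3, 2, 1, 0, 0]))
line = buffer.getvalue().strip()
print(line)  # 1,3,3,2,1,0,0

label, length, review = parse_row(next(csv.reader(io.StringIO(line))))
```

Any code that reads `train.csv` has to rely on the same column order: column 0 is the label, column 1 is the review length, and the remaining columns are the encoded words.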

### Uploading the training data


Next, I need to upload the training data to the SageMaker default S3 bucket so that I can provide access to it while training our model.

In [21]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [22]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. This is fortunate as I will need this later on when I create an endpoint that accepts an arbitrary review. For now, I will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that I will need to make sure it gets saved in the model directory.

## Step 4: Building and Training the PyTorch Model

A model comprises three objects:

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
each of which interacts with the others. I will be using containers provided by Amazon with the added benefit of being able to include my own custom code.

I will start by implementing my own neural network in PyTorch along with a training script. The model object is in the `model.py` file, inside of the `train` folder. The implementation can be seen by running the cell below.

In [23]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)


The important takeaway from the implementation is that there are three parameters that I may wish to tweak to improve the performance of the model. These are the embedding dimension, the hidden dimension and the size of the vocabulary. I will likely want to make these parameters configurable in the training script so that if I wish to modify them I do not need to modify the script itself. I will see how to do this later on. To start I will write some of the training code in the notebook so that I can more easily diagnose any issues that arise.

First I will load a small portion of the training data set to use as a sample. It would be very time consuming to try to train the model completely in the notebook as I do not have access to a GPU and the compute instance that I'm using is not particularly powerful. However, I can work on a small bit of the data to get a feel for how the training script is behaving.

In [24]:
import torch
import torch.utils.data

# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

### Writing the training method

In [25]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # standard training step: reset gradients, forward pass, loss, backprop, update
            optimizer.zero_grad()
            output = model(batch_X)
            loss = loss_fn(output, batch_y)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

Now that I have the training method above, I will test that it is working by writing a bit of code in the notebook that executes the training method on the small sample training set that I loaded earlier. The reason for doing this in the notebook is so that I have an opportunity to fix any errors that arise early when they are easier to diagnose.

In [26]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6935802578926087
Epoch: 2, BCELoss: 0.6845847368240356
Epoch: 3, BCELoss: 0.6771884322166443
Epoch: 4, BCELoss: 0.668962550163269
Epoch: 5, BCELoss: 0.6587221145629882


In order to construct a PyTorch model using SageMaker I must provide SageMaker with a training script. I can optionally include a directory which will be copied to the container and from which the training code will be run. When the training container is executed it will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

### Training the model

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside of the `train` directory is a file called `train.py` which contains the necessary code to train the model.


The way that SageMaker passes hyperparameters to the training script is by way of arguments. These arguments can then be parsed and used in the training script.
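
A minimal sketch of how an entry point might parse those arguments with `argparse` is shown below. The flag names for `embedding_dim` and `vocab_size`, all default values, and the helper `parse_training_args` are assumptions made for illustration, since the project's `train.py` is not listed here. `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are environment variables set inside the SageMaker training container; the local fallbacks are placeholders.

```python
import argparse
import os

def parse_training_args(argv=None):
    """Parse hyperparameters passed to the training script as command-line flags."""
    parser = argparse.ArgumentParser()
    # hyperparameters arrive as flags, e.g. --epochs 10 --hidden_dim 200
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--embedding_dim", type=int, default=32)
    parser.add_argument("--hidden_dim", type=int, default=100)
    parser.add_argument("--vocab_size", type=int, default=5000)
    # locations come from the container environment; fallbacks are for local runs
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    return parser.parse_args(argv)

# simulate the flags generated from hyperparameters={'epochs': 10, 'hidden_dim': 200}
args = parse_training_args(["--epochs", "10", "--hidden_dim", "200"])
print(args.epochs, args.hidden_dim, args.embedding_dim)  # 10 200 32
```

Any hyperparameter that is not supplied falls back to its default, so the model can be reconfigured from the notebook without editing the script.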

In [27]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [28]:
estimator.fit({'training': input_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2021-09-03 16:11:39 Starting - Starting the training job...
2021-09-03 16:11:41 Starting - Launching requested ML instances......
2021-09-03 16:12:45 Starting - Preparing the instances for training......
2021-09-03 16:14:05 Downloading - Downloading input data...
2021-09-03 16:14:39 Training - Downloading the training image...
2021-09-03 16:15:10 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-09-03 16:15:11,531 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2021-09-03 16:15:11,556 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-09-03 16:15:12,976 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-09-03 16:15:13,228 sagemaker-containers INFO     Module train does not provide a setup.py.

## Step 5: Testing the model

As mentioned at the top of this notebook, I will be testing this model by first deploying it and then sending the testing data to the deployed endpoint. I will do this so that I can make sure that the deployed model is working correctly.

## Step 6: Deploy the model for testing

Now that I have trained the model, I would like to test it to see how it performs. Currently the model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. Fortunately, SageMaker provides built-in inference code for models with simple inputs such as this.

There is one thing that I need to provide, however, and that is a function which loads the saved model. This function will be called `model_fn()` and takes as its only parameter a path to the directory where the model artifacts are stored. This function is also present in the Python file which I specified as the entry point.

Since I don't need to change anything in the code that was uploaded during training, I can simply deploy the current model as-is.



In [29]:
# deploy the trained model
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


-------------!

## Step 7: Using the model for testing

Once deployed, I can read in the test data and send it off to the deployed model to get some results. Once I collect all of the results, I can determine how accurate the model is.

In [30]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [31]:
# split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions

In [32]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [33]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.85276
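
The chunk count in `predict` comes from `int(data.shape[0] / float(rows) + 1)`, and `np.array_split` then spreads the rows as evenly as possible over that many chunks. A small pure-Python sketch of the resulting chunk sizes for the 25,000 test reviews follows; the helper `chunk_sizes` is illustrative, only mirrors the splitting rule, and does not call the endpoint.

```python
def chunk_sizes(n_rows, rows=512):
    """Sizes of the chunks produced by np.array_split(data, int(n_rows / rows + 1))."""
    n_chunks = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, n_chunks)
    # array_split gives the first `extra` chunks one additional row
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)
print(len(sizes), max(sizes), min(sizes), sum(sizes))  # 49 511 510 25000
```

With `rows=512` this gives 49 requests of 510 or 511 rows each, so no single request exceeds the requested chunk size.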

### More testing

I now have a trained model which has been deployed and which I can send processed reviews to and which returns the predicted sentiment. However, ultimately I would like to be able to send the model an unprocessed review. That is, I would like to send the review itself as a string. For example, I wish to send the following review to the model.

In [34]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'


In order to process the review I will need to repeat these two steps:
 - Remove any HTML tags and stem the input
 - Encode the review as a sequence of integers using `word_dict`

Using the `review_to_words` and `convert_and_pad` methods defined earlier, I will convert `test_review` into a numpy array `test_data` suitable to send to the model.

In [35]:
# convert test_review into a form usable by the model and save the results in test_data
test_data, len_test  = convert_and_pad(word_dict, review_to_words(test_review))
test_data = np.array([np.array([len_test] + test_data)])

Now that I have processed the review, I can send the resulting array to the model to predict the sentiment of the review.

In [36]:
predictor.predict(test_data)

array(0.8228836, dtype=float32)

Since the return value of the model is close to `1`, the model predicts that the review I submitted is positive.

### Deleting the endpoint

Once I've deployed an endpoint it continues to run until I tell it to shut down. Since I'm done using the endpoint for now, I can delete it.

In [37]:
estimator.delete_endpoint()

estimator.delete_endpoint() will be deprecated in SageMaker Python SDK v2. Please use the delete_endpoint() function on your predictor instead.


## Step 6 (again): Deploy the model for the web app

Now that I know that the model is working, it's time to create some custom inference code so that I can send the model a review which has not been processed and have it determine the sentiment of the review.

As we saw above, by default the estimator which I created, when deployed, will use the entry script and directory which I provided when creating the model. However, since I now wish to accept a string as input and the model expects a processed review, I need to write some custom inference code.

I will store the code that I wrote in the `serve` directory. Provided in this directory is the `model.py` file that I used to construct the model, a `utils.py` file which contains the `review_to_words` and `convert_and_pad` pre-processing functions which I used during the initial data processing, and `predict.py`, the file which will contain our custom inference code. Note also that `requirements.txt` is present which will tell SageMaker what Python libraries are required by our custom inference code.

When deploying a PyTorch model in SageMaker, I'm expected to provide four functions which the SageMaker inference container will use.
 - `model_fn`: This function is the same function that I used in the training script and it tells SageMaker how to load our model.
 - `input_fn`: This function receives the raw serialized input that has been sent to the model's endpoint and its job is to de-serialize and make the input available for the inference code.
 - `output_fn`: This function takes the output of the inference code and its job is to serialize this output and return it to the caller of the model's endpoint.
 - `predict_fn`: The heart of the inference script, this is where the actual prediction is done.

For the simple website that I'm constructing during this project, the `input_fn` and `output_fn` functions are relatively straightforward. They only need to accept a string as input and return a single value as output. In a more complex application, though, the input or output may be image data or some other binary data which would require some effort to serialize.
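The way the four functions fit together can be sketched in plain Python. This is not the contents of `serve/predict.py`: the "model" returned by `model_fn` is a hypothetical stand-in that counts a few positive words, and the pre-processing is a simple lowercase split. Only the function names and argument order follow what the inference container expects.

```python
def model_fn(model_dir):
    """Load the model. This sketch returns a stand-in callable, not a real network."""
    good_words = {"great", "good", "excellent"}  # hypothetical toy "weights"

    def model(words):
        # Score = fraction of words that are in the positive word set.
        if not words:
            return 0.5
        return sum(w in good_words for w in words) / len(words)

    return model

def input_fn(serialized_input_data, content_type):
    """De-serialize the request body; only plain text is accepted here."""
    if content_type == "text/plain":
        if isinstance(serialized_input_data, bytes):
            return serialized_input_data.decode("utf-8")
        return serialized_input_data
    raise ValueError("Unsupported content type: " + content_type)

def predict_fn(input_data, model):
    """Pre-process the raw review, run the model, and return 0 or 1."""
    words = input_data.lower().split()
    score = model(words)
    return 1 if score >= 0.5 else 0

def output_fn(prediction_output, accept):
    """Serialize the prediction as a string for the caller."""
    return str(prediction_output)

# Chain the functions in the order a serving container would call them:
# model_fn once at start-up, then input_fn -> predict_fn -> output_fn per request.
model = model_fn(model_dir=".")
review = input_fn(b"Great excellent movie", "text/plain")
print(output_fn(predict_fn(review, model), "text/plain"))  # 1
```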

### Writing inference code

Before writing the custom inference code, I will begin by taking a look at the code already present in `serve/predict.py`.

In [38]:
!pygmentize serve/predict.py

import argparse
import json
import os
import pickle
import sys
import sagemaker_containers
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data

from

### Deploying the model

Now that the custom inference code has been added in the `serve/predict.py` file, I will create and deploy the model. To begin with, I need to construct a new `PyTorchModel` object which points to the model artifacts created during training and also points to the inference code that I wish to use. Then I can call the `deploy` method to launch the deployment container.

**NOTE**: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. In this case I want to send a string, so I need to construct a simple wrapper around the `RealTimePredictor` class to accommodate simple strings. In a more complicated situation I may want to provide a serialization object, for example if I wanted to send image data.

In [39]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


-------------!

### Testing the model

Now that I have deployed the model with the custom inference code, I should test whether everything is working. Here, I test the model by loading the first `250` positive and the first `250` negative test reviews, sending them to the endpoint, and collecting the results. I only send part of the data because the endpoint processes one review per request, which makes pre-processing and inference slow enough that testing the entire data set would take prohibitively long.

In [40]:
import glob
import os

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # I make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, I store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(float(predictor.predict(review_input)))
                
            # Sending reviews to our endpoint one at a time takes a while so I
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [41]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [42]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.846
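`accuracy_score` here is simply the fraction of predictions that match the ground truth. A minimal plain-Python sketch of the same calculation, using made-up labels rather than the notebook's results (`accuracy` is a hypothetical helper):

```python
def accuracy(ground, results):
    """Fraction of positions where the prediction equals the true label."""
    if not ground or len(ground) != len(results):
        raise ValueError("ground and results must be non-empty and of equal length")
    matches = sum(1 for g, r in zip(ground, results) if g == r)
    return matches / len(ground)

toy_ground = [1, 1, 0, 0]
toy_results = [1.0, 0.0, 0.0, 0.0]  # floats, as returned by float(predictor.predict(...))
print(accuracy(toy_ground, toy_results))  # 0.75
```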

As an additional test, I will try sending the `test_review` that we looked at earlier.

In [43]:
predictor.predict(test_review)

b'1'

Now that I know our endpoint is working as expected, I can set up the web page that will interact with it. 

## Step 7 (again): Using the model for the web app

This section and the next document the rest of the deployment process, most of which is done using the AWS console.

So far I have been accessing the model endpoint by constructing a predictor object which uses the endpoint and then using that predictor object to perform inference. What if I wanted to create a web app which accesses the model? The current setup does not allow that directly, since in order to access a SageMaker endpoint the app would first have to authenticate with AWS using an IAM role which includes access to SageMaker endpoints. However, there is an easier way! I just need to use some additional AWS services.

<img src="Web App Diagram.svg">

The diagram above gives an overview of how the various services will work together. On the far right is the model which I trained above and which is deployed using SageMaker. On the far left is the web app that collects a user's movie review, sends it off and expects a positive or negative sentiment in return.

In the middle is where some of the magic happens. I will construct a Lambda function, which can be thought of as a straightforward Python function that is executed whenever a specified event occurs. I will give this function permission to send data to and receive data from a SageMaker endpoint.

Lastly, the method I will use to execute the Lambda function is a new endpoint that I will create using API Gateway. This endpoint will be a url that listens for data to be sent to it. Once it gets some data it will pass that data on to the Lambda function and then return whatever the Lambda function returns. Essentially it will act as an interface that lets the web app communicate with the Lambda function.

### Setting up a Lambda function

The first thing I'm going to do is set up a Lambda function. This Lambda function will be executed whenever our public API has data sent to it. When it is executed it will receive the data, perform any sort of processing that is required, send the data (the review) to the SageMaker endpoint I've created and then return the result.

#### Part A: Creating an IAM Role for the Lambda function

Since I want the Lambda function to call a SageMaker endpoint, I need to make sure that it has permission to do so. To do this, I will construct a role that I can later give the Lambda function.
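As a rough sketch of what such a role involves, the two policy documents below are written as Python dictionaries. The trust policy lets the Lambda service assume the role, and the permissions policy allows invoking endpoints. Granting only `sagemaker:InvokeEndpoint` and using a wildcard `Resource` are assumptions made for illustration; a role created in the console may instead attach a broader managed policy.

```python
import json

# Trust policy: allows the Lambda service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: allows calling SageMaker endpoints.
# "Resource": "*" is an illustrative assumption; it can be narrowed to one endpoint ARN.
invoke_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sagemaker:InvokeEndpoint",
        "Resource": "*",
    }],
}

# Print the JSON form that would be pasted into a policy editor.
print(json.dumps(invoke_policy, indent=2))
```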

#### Part B: Creating a Lambda function

Next, I will create the Lambda function with the following code.



```python
# I need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that I've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now I'll use the SageMaker runtime to invoke our endpoint, sending the given review 
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',    # The name of the endpoint I created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual review

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```
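Before wiring the function to API Gateway, the handler's logic can be exercised locally with a stand-in for the SageMaker runtime. The sketch below is a hypothetical variant: `handle_review`, `FakeRuntime` and the endpoint name `'my-endpoint'` are made up for illustration, and the runtime client is passed in as a parameter so that no AWS credentials are needed. The shape of `event` (a dictionary with a `'body'` key) follows the handler above.

```python
import io

class FakeRuntime:
    """Stand-in for the boto3 'sagemaker-runtime' client (hypothetical, for local testing)."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Pretend the model answers '1' when the review mentions 'great'.
        text = Body.decode('utf-8') if isinstance(Body, bytes) else Body
        answer = b'1' if 'great' in text.lower() else b'0'
        # The real client returns a streaming body; BytesIO offers the same .read().
        return {'Body': io.BytesIO(answer)}

def handle_review(event, runtime, endpoint_name='my-endpoint'):
    """Same flow as lambda_handler, with the runtime client passed in."""
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='text/plain',
                                       Body=event['body'])
    result = response['Body'].read().decode('utf-8')
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'},
        'body': result,
    }

print(handle_review({'body': 'A great film'}, FakeRuntime())['body'])  # 1
```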

In [44]:
# get the name of the endpoint
predictor.endpoint

'sagemaker-pytorch-2021-09-03-16-29-16-241'

### Setting up API Gateway

Now that the Lambda function is set up, the only thing left to do is to create a new API using API Gateway that will trigger the Lambda function I have just created.
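Once the API is deployed, a client only needs to POST the raw review text to its URL. A minimal sketch using the standard library follows; the URL is a placeholder and `build_review_request` is a hypothetical helper. The request is built but not sent unless the last lines are uncommented.

```python
import urllib.request

def build_review_request(api_url, review):
    """Build a POST request carrying the review as plain UTF-8 text."""
    return urllib.request.Request(
        api_url,
        data=review.encode('utf-8'),
        headers={'Content-Type': 'text/plain'},
        method='POST',
    )

# Placeholder URL; substitute the invoke URL shown by API Gateway.
req = build_review_request('https://example.com/prod/review', 'A great film')
print(req.get_method(), req.data)

# To actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode('utf-8'))
```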


## Step 4: Deploying the web app

I created a simple web app for the deployment of the project.\
Now, I will try a negative and a positive review to make sure it's working properly.

### **First, I will try it with a negative review:**

**Screenshot:**
<img src="neg-review.png">

### **Then, I will try it with a positive review:**

**Screenshot:**
<img src="pos-review.png">

In [45]:
# delete the endpoint
predictor.delete_endpoint()