# Sentiment Analysis Web App
## Using PyTorch and SageMaker

This notebook uses SageMaker to build a complete project from end to end. The goal is a simple web page where a user can enter a movie review. The web page then sends the review to our deployed model, which predicts the sentiment of the entered review.

## General Outline

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.

This notebook deviates from the outline above in two ways. First, the model will not be tested in a separate step. Instead, it will be tested by deploying it and sending the test data to the deployed endpoint. This approach also verifies that the deployed model works correctly before proceeding further.

Second, the trained model will be deployed and used a second time. In this second deployment, the model will be customized with additional inference code, and the newly deployed model will be integrated into the sentiment analysis web app.

In [1]:
# Make sure that we use SageMaker 1.x
!pip install sagemaker==1.72.0

Collecting sagemaker==1.72.0
  Downloading sagemaker-1.72.0.tar.gz (297 kB)
     |████████████████████████████████| 297 kB 20.3 MB/s eta 0:00:01
Collecting smdebug-rulesconfig==0.1.4
  Downloading smdebug_rulesconfig-0.1.4-py2.py3-none-any.whl (10 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-1.72.0-py2.py3-none-any.whl size=386358 sha256=e7bd69a429ef6ebd40424085f39563d4bceaac80bbedd83a78f73e4aa14765c3
  Stored in directory: /home/ec2-user/.cache/pip/wheels/c3/58/70/85faf4437568bfaa4c419937569ba1fe54d44c5db42406bbd7
Successfully built sagemaker
Installing collected packages: smdebug-rulesconfig, sagemaker
  Attempting uninstall: smdebug-rulesconfig
    Found existing installation: smdebug-rulesconfig 1.0.1
    Uninstalling smdebug-rulesconfig-1.0.1:
      Successfully uninstalled smdebug-rulesconfig-1.0.1
  Attempting uninstall: sagemaker
    Found existing instal

## Step 1: Downloading the data

This notebook uses the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/):

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [2]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2021-08-17 10:50:24--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2021-08-17 10:50:29 (16.0 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Step 2: Preparing and Processing the data

The first step is to read each review and combine the reviews into a single input structure. The dataset is then divided into a training set and a testing set.

In [8]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [9]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg



Having read the raw training and testing data from the downloaded dataset, the positive and negative reviews will be combined, followed by shuffling the resulting records.

In [10]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    # Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    # Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [11]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000
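
The important property of the shuffling step is that reviews and labels are permuted together, so each review keeps its own label. The following is a minimal standard-library sketch of that idea, not the `sklearn.utils.shuffle` implementation; the helper name `shuffle_together` and the toy data are hypothetical.

```python
import random

def shuffle_together(items, labels, seed=0):
    """Shuffle two equal-length lists with the same permutation."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    pairs = list(zip(items, labels))
    # Shuffling the pairs keeps every item attached to its label
    random.Random(seed).shuffle(pairs)
    if not pairs:
        return [], []
    shuffled_items, shuffled_labels = zip(*pairs)
    return list(shuffled_items), list(shuffled_labels)

# Toy data (hypothetical): 1 = positive, 0 = negative
toy_reviews = ["great film", "awful plot", "loved it", "boring"]
toy_labels = [1, 0, 1, 0]
shuffled_reviews, shuffled_labels = shuffle_together(toy_reviews, toy_labels)
```

Whatever order results, looking up any shuffled review should still give the label it started with.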


Upon unifying and preparing our training and testing sets, it is prudent to conduct a quick inspection to examine an example of the data our model will be trained on. This practice is beneficial as it provides insight into how each subsequent processing step influences the reviews, while also verifying the correct loading of the data.

In [12]:
print(train_X[100])
print(train_y[100])

Before this, the flawed "Slaughterhouse Five" was the best. But this screen adaptation of "Mother Night" is very true to the book and keeps the comedy, mystery, and tragedy intent. Thankfully it wasn't Hollywood-ized or idiotized a la the movie of "Breakfast of Champions." Another good thing about this movie is that you don't have to be familiar with the book to follow it (as I think you do for Slaughterhouse Five). That's probably true of Breakfast of Champions also but they did such a bad job of that you're better off just reading the book and not seeing the movie! Nick Nolte did an excellent job in this film.
1



The first step in processing the reviews is to remove any HTML tags. The words are then stemmed so that words such as *entertained* and *entertaining* are treated as the same token in sentiment analysis.

In [13]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build the stopword set once for fast lookups
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case, replace non-alphanumeric characters with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word with the shared stemmer
    
    return words

The `review_to_words` method defined above uses `BeautifulSoup` to remove any HTML tags and the `nltk` package to remove stop words and stem the remaining words. To verify that it works, apply `review_to_words` to one of the reviews in the training set.

In [14]:
review_to_words(train_X[100])

['flaw',
 'slaughterhous',
 'five',
 'best',
 'screen',
 'adapt',
 'mother',
 'night',
 'true',
 'book',
 'keep',
 'comedi',
 'mysteri',
 'tragedi',
 'intent',
 'thank',
 'hollywood',
 'ize',
 'idiot',
 'la',
 'movi',
 'breakfast',
 'champion',
 'anoth',
 'good',
 'thing',
 'movi',
 'familiar',
 'book',
 'follow',
 'think',
 'slaughterhous',
 'five',
 'probabl',
 'true',
 'breakfast',
 'champion',
 'also',
 'bad',
 'job',
 'better',
 'read',
 'book',
 'see',
 'movi',
 'nick',
 'nolt',
 'excel',
 'job',
 'film']

In addition to removing HTML formatting and stemming the words, the `review_to_words` method also:

1) Converts text to lowercase

2) Splits the string into words

3) Removes stop words such as *in*, *the*, and *and*.
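
To make these steps concrete without the `nltk` and `bs4` dependencies, here is a simplified standard-library sketch. It is not the notebook's `review_to_words`: it performs no stemming, and the stop-word set `DEMO_STOPWORDS` and the helper name `simple_review_to_words` are hypothetical stand-ins chosen for illustration.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Tiny illustrative stop-word set (assumption; nltk's English list is much larger)
DEMO_STOPWORDS = {"the", "a", "is", "in", "and", "this"}

def simple_review_to_words(review, stop_words=DEMO_STOPWORDS):
    parser = _TextExtractor()
    parser.feed(review)
    parser.close()
    text = "".join(parser.chunks)                       # HTML tags removed
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())   # lowercase, drop punctuation
    words = text.split()                                # split into words
    return [w for w in words if w not in stop_words]    # remove stop words

print(simple_review_to_words("This movie is <b>GREAT</b>!<br />Loved the acting."))
# prints ['movie', 'great', 'loved', 'acting']
```

Note that without stemming, `loved` and `acting` stay as they are, whereas the Porter stemmer used in the notebook would reduce them further.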



The following method applies the review_to_words function to each review in both the training and testing datasets, while also caching the results. This is essential because this processing step can be time-consuming. By caching the results, it ensures that if the notebook cannot be completed in the current session, the data processing step can be skipped during subsequent sessions.

In [16]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, EOFError, pickle.UnpicklingError):
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [17]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl
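
The caching pattern used by `preprocess_data` can be isolated into a small sketch: try to load a pickle, and only run the expensive step when the load fails. The helper `load_or_compute`, the file name, and the toy payload below are hypothetical and use only the standard library.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return (result, from_cache): load the pickle if readable, else compute and store it."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    except (OSError, EOFError, pickle.UnpicklingError):
        pass  # no usable cache yet, fall through to the expensive step
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result, False

calls = []

def expensive_step():
    # Stand-in for running review_to_words over the whole dataset
    calls.append(1)
    return {"words_train": [["good", "movi"]], "labels_train": [1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_cache.pkl")
    first, first_from_cache = load_or_compute(path, expensive_step)
    second, second_from_cache = load_or_compute(path, expensive_step)
```

In this sketch the second call is served from the cache file, so `expensive_step` runs only once.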


For the model construction in this notebook, the feature representation will encode each word as an integer. However, infrequently occurring words in the reviews may not contribute significantly to sentiment analysis. To handle this, vocabulary size is limited to include only the most common words. Any uncommon words will be grouped into a single category, marked as `1`.

Given the use of a recurrent neural network, it's advantageous for all reviews to have uniform lengths. To achieve this, we'll establish a fixed review size. Short reviews will be padded with the 'no word' category (labeled as `0`), while longer reviews will be truncated.

To start, we construct a method that maps the words appearing in the reviews to integers. Here, the vocabulary size (which includes the 'no word' and 'infrequent' categories) is fixed at `5000`, although it can be adjusted to observe its effect on the model.

In [18]:
import numpy as np

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # Count how often each word appears in `data`. Note that `data` is a list of sentences and that a
    # sentence is a list of words.
    
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    for sentence in data:
        for word in sentence:
            word_count[word] = word_count.get(word, 0) + 1
            
    # Sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and
    # sorted_words[-1] is the least frequently appearing word.
    word_freq_sorted = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    sorted_words = [word for word, freq in word_freq_sorted]
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2                              # and 'infrequent' labels
        
    return word_dict

In [19]:
word_dict = build_dict(train_X)

In [20]:
# five most frequently appearing words in the training set.
wd=list(word_dict.keys())
print(wd[0:5])

['movi', 'film', 'one', 'like', 'time']
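
As an alternative sketch of the same idea, `collections.Counter` from the standard library can do the counting and ranking in one step. The function name `build_dict_with_counter` and the toy reviews are hypothetical; the reservation of labels `0` and `1` follows the description above.

```python
from collections import Counter

def build_dict_with_counter(data, vocab_size=5000):
    """Map the most frequent words to integers starting at 2 (0 and 1 are reserved)."""
    counts = Counter(word for sentence in data for word in sentence)
    # Keep vocab_size - 2 words so that 'no word' (0) and 'infrequent' (1) still fit
    most_common = counts.most_common(vocab_size - 2)
    return {word: idx + 2 for idx, (word, _count) in enumerate(most_common)}

# Toy, already-stemmed reviews (hypothetical)
toy_reviews = [["good", "movi", "good"], ["bad", "movi"], ["good", "plot"]]
toy_dict = build_dict_with_counter(toy_reviews, vocab_size=4)
print(toy_dict)  # prints {'good': 2, 'movi': 3}
```

With `vocab_size=4` only two real words fit, so `bad` and `plot` are left out of the dictionary and would later be encoded as infrequent.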


### Save `word_dict`

Later, when we construct an endpoint that processes a submitted review, we will need the `word_dict` created above. We therefore save it to a file now for future use.

In [21]:
data_dir = '../data/pytorch' # The folder we will use for storing data
if not os.path.exists(data_dir): # Make sure that the folder exists
    os.makedirs(data_dir)

In [22]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the reviews

Now that we have a word dictionary that maps the words in the reviews to integers, we use it to convert each review into its integer sequence representation, padding or truncating each sequence to a fixed length, which in our case is `500`.

In [23]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [24]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

As a quick check that everything is working as intended, we inspect one of the reviews in the training set after it has been processed.

In [23]:
# Use this cell to examine one of the processed reviews to make sure everything is working as intended.

print(train_X[100])
print(len(train_X[100]))

[   2  514    1 1166   56    1 1191   16   56  107    2   59  431  328
   70 1011  122  140  197  451  128  241   16   77  849    2  131 2302
    1   72    2   34    1   37   45  346  630 1167   12  226 4325   75
   49   50 3672    2  355  261  189  426    2  358    6    1 1396  121
   77  328 1393  617   10  405    1    1  204   10  170  465   39  464
   64  152    1    2  644 2771    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

Preprocessing the data lets us compute word counts and, from them, build the word dictionary.
One limitation appears when we convert and pad the data: any word that is not in `word_dict` is mapped to the 'infrequent' label `1`, so a new word is still treated as infrequent even if it appears many times in a review.
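
The effect is easy to see on a toy example. The sketch below re-implements the encoding logic in plain Python with a pad length of `6` instead of `500`; the helper name `encode_and_pad`, the toy dictionary, and the made-up word `zzyzx` are hypothetical.

```python
NOWORD, INFREQ = 0, 1  # same reserved labels as in convert_and_pad

def encode_and_pad(word_dict, words, pad):
    """Encode words as integers, truncate to `pad`, and right-pad with NOWORD."""
    encoded = [word_dict.get(word, INFREQ) for word in words[:pad]]
    length = len(encoded)
    return encoded + [NOWORD] * (pad - length), length

toy_word_dict = {"movi": 2, "film": 3, "good": 4}
short_review = ["good", "movi", "zzyzx", "zzyzx"]  # 'zzyzx' is not in the dictionary
long_review = ["film"] * 8                          # longer than the pad length

print(encode_and_pad(toy_word_dict, short_review, pad=6))
# prints ([4, 2, 1, 1, 0, 0], 4)
print(encode_and_pad(toy_word_dict, long_review, pad=6))
# prints ([3, 3, 3, 3, 3, 3], 6)
```

Both occurrences of the unknown word become `1`, the short review is padded with zeros, and the long review is cut off at six words.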

## Step 3: Upload the data to S3

As in the XGBoost notebook, the training dataset needs to be uploaded to S3 so that our training code can access it. For now, we save it locally and upload it to S3 afterwards.

### Saving the processed training dataset locally

It's crucial to understand the format of the saved data, as it will be necessary when writing the training code. In our case, each row of the dataset follows the format `label`, `length`, `review[500]`, where `review[500]` represents a sequence of `500` integers denoting the words in the review.

In [24]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
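
To illustrate the row layout `label`, `length`, `review[...]` without pandas, here is a small standard-library sketch that writes two toy rows and reads them back. It uses a review length of `5` instead of `500`, and the helper names `write_rows` and `read_rows` are hypothetical.

```python
import csv
import io

def write_rows(labels, lengths, reviews, handle):
    """Write one headerless CSV row per review: label, length, then the encoded words."""
    writer = csv.writer(handle)
    for label, length, review in zip(labels, lengths, reviews):
        writer.writerow([label, length, *review])

def read_rows(handle):
    """Split each row back into its label, length and encoded review."""
    labels, lengths, reviews = [], [], []
    for row in csv.reader(handle):
        values = [int(v) for v in row]
        labels.append(values[0])
        lengths.append(values[1])
        reviews.append(values[2:])
    return labels, lengths, reviews

buffer = io.StringIO()
write_rows([1, 0], [3, 2], [[4, 2, 9, 0, 0], [7, 1, 0, 0, 0]], buffer)
first_line = buffer.getvalue().splitlines()[0]
buffer.seek(0)
read_labels, read_lengths, read_reviews = read_rows(buffer)
print(first_line)  # prints 1,3,4,2,9,0,0
```

Training code that consumes this file has to apply the same split: column 0 is the label and the remaining columns are the model input.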

### Uploading the training data

Next, the training data should be uploaded to the SageMaker default S3 bucket, allowing access during model training.

In [25]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [26]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. This is fortunate as we will need this later on when we create an endpoint that accepts an arbitrary review. For now, we will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that we will need to make sure it gets saved in the model directory.

## Step 4: Build and Train the PyTorch Model

In SageMaker, a model comprises three objects:

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
each of which interacts with the others. Here we will be using containers provided by Amazon with the added benefit of being able to include our own custom code.

We start by implementing a neural network in PyTorch along with a training script. For this project, the model object is provided in the `model.py` file within the `train` folder. The provided implementation can be viewed by running the cell below.

In [27]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        sel

The key observation from the provided implementation is that three parameters may be worth tuning to improve the model's performance: the embedding dimension, the hidden dimension, and the vocabulary size. Making these parameters configurable in the training script lets us change them without modifying the script itself. We will see how to do this later on. To start, we write some of the training code in the notebook so that any issues are easier to diagnose.

Initially, a small portion of the training dataset will be loaded for use as a sample. Training the model entirely within the notebook would be time-consuming due to the absence of a GPU and the limited computational power of the current compute instance. Nonetheless, working with a small subset of the data allows us to assess the behavior of our training script.

In [28]:
import torch
import torch.utils.data

# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

### Training method

Next, the training code itself needs to be written. This process should resemble training methods previously developed for training PyTorch models. Complex tasks, such as model saving/loading and parameter loading, will be addressed later.

In [29]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
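            # Reset the gradients accumulated from the previous batch
            optimizer.zero_grad()
            
            # Forward pass: model output for each review in the batch
            output = model(batch_X)
            
            # Compute the loss, backpropagate, and update the weights
            loss = loss_fn(output, batch_y)
            loss.backward()
            
            optimizer.step()
            
            total_loss += loss.item()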
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

With the training method above in place, we test it by running it on the small sample training set loaded earlier. Doing this inside the notebook lets us catch and fix errors early, when they are easier to diagnose.

In [30]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6932786464691162
Epoch: 2, BCELoss: 0.6828211784362793
Epoch: 3, BCELoss: 0.6739500045776368
Epoch: 4, BCELoss: 0.6641816258430481
Epoch: 5, BCELoss: 0.6525045871734619


To construct a PyTorch model using SageMaker, a training script must be provided. Optionally, a directory can be included, which will be copied to the container and serve as the location for running our training code. Upon execution of the training container, it will examine the uploaded directory (if available) for a `requirements.txt` file. Subsequently, it will install any necessary Python libraries before running the training script.

### Training the model

When constructing a PyTorch model in SageMaker, an entry point needs specification. The designated Python file, executed during model training, is referred to as the entry point. Within the `train` directory, a file named `train.py` has been provided, containing most of the requisite code for model training. The only missing component is the implementation of the `train()` method, previously written in this notebook.

SageMaker passes hyperparameters to the training script as arguments. These arguments can then be parsed and utilized within the training script. To understand this process, refer to the provided `train/train.py` file.
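
As a rough illustration of this mechanism, the sketch below parses hyperparameters with `argparse`, assuming they arrive as command-line flags such as `--epochs 10`. It is not the contents of `train/train.py`: the function name `parse_hyperparameters` is hypothetical, and the default values are assumptions borrowed from the values used earlier in this notebook.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse model and training hyperparameters from command-line style arguments."""
    parser = argparse.ArgumentParser()
    # Defaults are assumptions for this sketch, not necessarily those of train.py
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--embedding_dim", type=int, default=32)
    parser.add_argument("--hidden_dim", type=int, default=100)
    parser.add_argument("--vocab_size", type=int, default=5000)
    return parser.parse_args(argv)

# Simulate the arguments produced by hyperparameters={'epochs': 10, 'hidden_dim': 200}
args = parse_hyperparameters(["--epochs", "10", "--hidden_dim", "200"])
print(args.epochs, args.hidden_dim, args.embedding_dim, args.vocab_size)
# prints 10 200 32 5000
```

Any hyperparameter that is not supplied falls back to its default, which is why the estimator only needs to list the values it wants to override.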

In [31]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [32]:
estimator.fit({'training': input_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2021-08-17 12:03:44 Starting - Starting the training job...
2021-08-17 12:03:46 Starting - Launching requested ML instances............
2021-08-17 12:05:59 Starting - Preparing the instances for training............
2021-08-17 12:07:50 Downloading - Downloading input data...
2021-08-17 12:08:32 Training - Downloading the training image..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-17 12:08:56,747 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2021-08-17 12:08:56,773 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-08-17 12:08:56,778 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-08-17 12:08:57,112 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2021-08-17 12:08:57,113 sagemaker-contain

## Step 5: Testing the model

As indicated at the top of this notebook, the model will be tested by deploying it first and then sending the testing data to the deployed endpoint. This process ensures that the deployed model is functioning correctly.

## Step 6: Deploying the model for testing

Now that the model is trained, testing is necessary to assess its performance. Currently, the model accepts input in the form of `review_length, review[500]`, where `review[500]` represents a sequence of `500` integers describing the words in the review, encoded using `word_dict`. Thankfully, SageMaker provides built-in inference code for models with such simple inputs.

However, one requirement is to provide a function to load the saved model. This function, named `model_fn()`, should take a path to the directory containing the model artifacts as its only parameter. This function must also exist in the Python file specified as the entry point. Fortunately, the model loading function has already been provided, requiring no further changes.

**NOTE:** When running the built-in inference code, it must import the `model_fn()` method from the `train.py` file. This is why the training code is wrapped in a main guard (i.e., `if __name__ == '__main__':`).

Since no changes are needed in the uploaded code during training, the current model can be deployed as-is.

**NOTE:** When deploying a model, SageMaker launches a compute instance that remains active until you shut it down. It's crucial to be aware of this because the cost of a deployed endpoint depends on its running duration.

In other words, **if you're no longer using a deployed endpoint, remember to shut it down!**

The cell below deploys the trained model.

In [None]:
# training_job_name = 'sagemaker-pytorch-2021-08-17-12-03-44-433'

In [34]:
#  Deploy the trained model
predictor = estimator.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
Using already existing model: sagemaker-pytorch-2021-08-17-12-03-44-433


---------------!

## Step 7: Use the model for testing

Once deployed, we can read in the test data and send it off to our deployed model to get some results. Once we collect all of the results we can determine how accurate our model is.

In [35]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [36]:
# We split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions

In [37]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [38]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.83664
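
The last two cells turn predicted probabilities into class labels and compare them with the true labels. A minimal pure-Python sketch of that computation is shown below; it uses an explicit `0.5` threshold rather than `round`, and the helper names and toy values are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each predicted probability to 1 (positive) or 0 (negative)."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

toy_probabilities = [0.91, 0.12, 0.60, 0.40]  # toy model outputs
toy_truth = [1, 0, 1, 1]
toy_predictions = to_labels(toy_probabilities)
print(toy_predictions, accuracy(toy_truth, toy_predictions))
# prints [1, 0, 1, 0] 0.75
```

One small difference from `round`: Python 3 rounds `0.5` to `0`, whereas the threshold above maps exactly `0.5` to the positive class. For real model outputs this edge case is unlikely to matter.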

### More testing

We now have a trained model that has been deployed and that accepts processed reviews, returning the predicted sentiment. However, the ultimate goal is to send the model an unprocessed review, meaning the review itself as a string. For example, consider sending the following review to the model.

In [39]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

To send the review to the model, the data processing steps performed on the IMDb dataset need to be repeated. Specifically, two steps were undertaken:
- Eliminating any HTML tags and stemming the input
- Encoding the review as a sequence of integers using `word_dict`

To process the review, the `review_to_words` and `convert_and_pad` methods from the first section should be utilized to convert `test_review` into a numpy array `test_data` suitable for transmission to the model. It's important to note that the model anticipates input in the format `review_length, review[500]`.

In [40]:
# Convert test_review into a form usable by the model and save the results 
r_words = review_to_words(test_review)
test_data, test_data_len = convert_and_pad(word_dict, r_words)
test_data = np.array([np.array([test_data_len] + test_data)])

Now that the review has been processed, the resulting array can be sent to the model to predict the sentiment of the review.

In [41]:
predictor.predict(test_data)

array(0.5964819, dtype=float32)

Since the value returned by the model (about `0.60`) is above `0.5`, the model classifies the submitted review as positive, although a score this close to `0.5` indicates only moderate confidence.

### Delete the endpoint

Once an endpoint has been deployed, it continues to run until instructed to shut down. As the endpoint is no longer needed, it can be deleted.

In [42]:
estimator.delete_endpoint()

estimator.delete_endpoint() will be deprecated in SageMaker Python SDK v2. Please use the delete_endpoint() function on your predictor instead.


## Step 6 (again): Deploy the model for the web app

Deploying the model for the web app requires custom inference code so that the model can accept an unprocessed review and return its sentiment. This code is stored in the `serve` directory, which contains files such as `model.py`, `utils.py`, and `predict.py`. The `requirements.txt` file specifies the Python libraries that the custom inference code needs.

When deploying a PyTorch model in SageMaker, the inference container expects four functions to be provided:
- `model_fn`: Loads the model.
- `input_fn`: Deserializes and prepares the input for inference.
- `output_fn`: Serializes the output for return.
- `predict_fn`: Performs the actual prediction; this is the function that needs to be implemented.

For the simple website being developed, the `input_fn` and `output_fn` functions are straightforward: they only need to accept a string as input and return a single value as output. In more complex applications, the input or output may be image data or other binary data, which requires more serialization work.
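
As a rough illustration of how small these two functions can be for plain-text traffic, consider the sketch below. It is not the code from `serve/predict.py`; the argument names are illustrative, and it assumes the request body arrives as `text/plain`, either as bytes or as a string.

```python
def input_fn(serialized_input_data, content_type):
    """Decode a plain-text request body into a Python string."""
    if content_type == "text/plain":
        if isinstance(serialized_input_data, bytes):
            return serialized_input_data.decode("utf-8")
        return serialized_input_data
    raise ValueError("Unsupported content type: " + content_type)


def output_fn(prediction_output, accept):
    """Serialize the single predicted value as text."""
    return str(prediction_output)


review = input_fn(b"A wonderful film", "text/plain")
body = output_fn(1.0, "text/plain")
print(review, body)
```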

### Writing Inference Code

Before proceeding with custom inference code, it's essential to review the provided code.

In [44]:
!pygmentize serve/predict.py

import argparse
import json
import os
import pickle
import sys
import sagemaker_containers
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data

from
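
The listing above is cut off, so the rest of `predict.py` is not shown here. As a rough illustration only of what a `predict_fn` for this task has to do, the sketch below encodes a review, builds the `review_length, review[500]` row, scores it, and rounds the score. Everything in it is an assumption: `predict_fn_sketch`, `toy_model`, and `toy_dict` are hypothetical, preprocessing is reduced to lowercasing and splitting, and `0` and `1` are assumed to encode padding and unknown words.

```python
import numpy as np


def predict_fn_sketch(review, score_fn, word_dict, pad=500):
    """Turn a raw review into model input, score it, and round to 0 or 1."""
    words = review.lower().split()                     # simplified preprocessing
    ids = [word_dict.get(w, 1) for w in words[:pad]]   # 1 = unknown word (assumed)
    length = len(ids)
    ids = ids + [0] * (pad - length)                   # 0 = padding (assumed)
    data = np.array([[length] + ids])                  # shape (1, pad + 1)
    score = float(score_fn(data))
    return 1 if score >= 0.5 else 0


toy_dict = {"good": 2, "bad": 3}  # hypothetical vocabulary


def toy_model(data):
    """Hypothetical scorer: high score if the id for 'good' appears."""
    return 0.9 if 2 in data[0, 1:] else 0.1


result = predict_fn_sketch("Good fun", toy_model, toy_dict)
print(result)
```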

### Deploying the model

Now that the custom inference code has been written, the model can be created and deployed. To begin with, a new PyTorchModel object needs to be constructed which points to the model artifacts created during training and also points to the inference code that is used. Then the deploy method can be called to launch the deployment container.

**NOTE**: By default, a deployed PyTorch model expects input in the form of a `numpy` array. However, since the input is a string, a basic wrapper around the `RealTimePredictor` class needs to be created to handle simple strings. In more complex scenarios, such as sending image data, a serialization object might need to be provided.
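
To make the difference concrete, the sketch below contrasts the two kinds of payload. It assumes the default predictor serializes arrays in NumPy's `.npy` format, and it builds the bytes by hand instead of calling the SageMaker SDK, so nothing here touches an endpoint.

```python
import io
import numpy as np

# Payload for an array input: the array serialized in .npy format (assumed default).
buffer = io.BytesIO()
np.save(buffer, np.array([[3, 1, 2, 3]]))
array_payload = buffer.getvalue()

# Payload for the web app: the review itself, encoded as UTF-8 text.
text_payload = "A great movie!".encode("utf-8")

print(array_payload[:6])  # the .npy magic prefix
print(text_payload)
```

The text payload is what a browser can send directly, which is why the string-based wrapper is convenient for the web app.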

In [45]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


-----------------!

### Testing the model

Now that the model has been deployed with the custom inference code, it's time to conduct a test to ensure everything is functioning correctly. In this test, we'll load the first `250` positive and negative reviews, send them to the endpoint, collect the results, and analyze the outcomes. The reason for only sending a portion of the data is that the time required for the model to process the input and perform inference is quite lengthy, making it impractical to test the entire dataset at once.

In [46]:
import glob
import os

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(float(predictor.predict(review_input)))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [47]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [48]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.84
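
The score is the fraction of predictions that match the ground truth. A minimal sketch of that calculation follows, using made-up labels and outputs; `accuracy` is an illustrative helper, not the scikit-learn implementation.

```python
def accuracy(ground, results):
    """Fraction of predictions that match the ground-truth labels."""
    if len(ground) != len(results):
        raise ValueError("ground and results must have the same length")
    matches = sum(1 for g, r in zip(ground, results) if g == int(round(r)))
    return matches / len(ground)


toy_ground = [1, 1, 0, 0]            # made-up labels
toy_results = [1.0, 0.0, 0.0, 0.0]   # made-up endpoint outputs
print(accuracy(toy_ground, toy_results))
```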

As an additional test, we can try sending the `test_review` that we looked at earlier.

In [49]:
predictor.predict(test_review)

b'1.0'

Now that the endpoint is confirmed to be functioning correctly, the web page can be set up to interact with it. If you cannot finish the project right now, remember to skip ahead to the end of this notebook and shut down the endpoint. The endpoint can be deployed again later.

## Step 7 (again): Use the model for the web app

> **NOTE:** The tasks in this section and the next are completed mostly in the AWS console.

So far, the model endpoint has been accessed by constructing a predictor object that utilizes the endpoint and then using it for inference. Accessing the model via a web app is not feasible with the current setup, as it requires the app to authenticate with AWS using an IAM role with access to SageMaker endpoints. However, an alternative method can be employed using additional AWS services.

<img src="Web App Diagram.svg">

The diagram above illustrates how these services will interact. On the right side is the trained model deployed using SageMaker. On the left side is our web app, which collects a user's movie review, sends it, and expects a positive or negative sentiment in return.

In the middle, a Lambda function will be constructed: a simple Python function that executes when a specified event occurs. This function will be granted permission to send and receive data from a SageMaker endpoint.

Lastly, an endpoint will be created using API Gateway to execute the Lambda function. This endpoint will listen for data, pass it to the Lambda function, and return the result, effectively serving as an interface for the web app to communicate with the Lambda function.

### Setting up a Lambda function

The first step is to create a Lambda function. This function will execute whenever data is sent to our public API. It will receive the data, process it, send it to the SageMaker endpoint, and return the result.

#### Part A: Create an IAM Role for the Lambda function

To allow the Lambda function to call the SageMaker endpoint, we need to create a role with the necessary permissions.

1. Navigate to the **IAM** page in the AWS Console and click on **Roles**.
2. Click **Create role**, ensuring that **AWS service** is selected as the trusted entity, and choose **Lambda** as the service.
3. Click **Next: Permissions**, search for `sagemaker`, and select the **AmazonSageMakerFullAccess** policy.
4. Click **Next: Review**, name the role (e.g., `LambdaSageMakerRole`), and click **Create role**.

#### Part B: Create a Lambda function

1. Navigate to the AWS Lambda page and click **Create function**.
2. Choose **Author from scratch**, name your Lambda function (e.g., `sentiment_analysis_func`), and select **Python 3.6** as the runtime.
3. Choose the role created in the previous step, and click **Create Function**.
4. In the editor, paste the code provided below.

```python
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual review

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```

Replace `**ENDPOINT NAME HERE**` in the code above with the name of the deployed endpoint, which the predictor reports:

In [50]:
predictor.endpoint

'sagemaker-pytorch-2021-08-17-13-18-21-055'

After adding the endpoint name to the Lambda function, click on **Save** to ensure the changes are applied. The Lambda function is now operational. Next, we need to establish a method for our web app to execute the Lambda function.
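
The handler's logic can also be checked locally before relying on the console. The sketch below repeats the same steps but takes the runtime client as an argument, so a stand-in can replace the real one. `FakeRuntime`, `handle`, and the endpoint name are hypothetical, and the canned `b"1.0"` reply is made up for the example.

```python
import io

ENDPOINT_NAME = "example-endpoint"  # hypothetical endpoint name


class FakeRuntime:
    """Hypothetical stand-in for the boto3 'sagemaker-runtime' client."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Record the call and return a canned prediction as a readable stream.
        self.last_call = (EndpointName, ContentType, Body)
        return {"Body": io.BytesIO(b"1.0")}


def handle(event, runtime):
    """Same steps as the handler above, with the client passed in."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/plain",
        Body=event["body"],
    )
    result = response["Body"].read().decode("utf-8")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/plain", "Access-Control-Allow-Origin": "*"},
        "body": result,
    }


fake = FakeRuntime()
reply = handle({"body": "A great movie!"}, fake)
print(reply["statusCode"], reply["body"])
```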

### Establishing API Gateway

In the AWS Console, navigate to **Amazon API Gateway** and select **Get started**.

Ensure that **New API** is chosen, then assign a name to the new API, such as `sentiment_analysis_api`. Proceed by clicking **Create API**.

Although the API has been created, it currently lacks functionality. Our objective is to configure it to trigger the previously created Lambda function.

From the **Actions** dropdown menu, choose **Create Method**. A new method will be generated; select its dropdown menu and opt for **POST**, then confirm with the check mark.

For the integration point, ensure **Lambda Function** is selected and choose **Use Lambda Proxy integration**. This option ensures that data sent to the API is directly forwarded to the Lambda function without any intermediary processing. It also necessitates that the return value conforms to a proper response object, as it won't be processed by API Gateway.
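
Because the return value is passed back without further processing, its shape matters. The sketch below checks the three fields that this notebook's handler returns. `is_proxy_response` is a hypothetical helper, and the check covers only `statusCode`, `headers`, and `body`, not everything API Gateway accepts.

```python
def is_proxy_response(response):
    """Check that a handler's return value has the basic proxy-response shape."""
    if not isinstance(response, dict):
        return False
    status_ok = isinstance(response.get("statusCode"), int)
    headers_ok = isinstance(response.get("headers", {}), dict)
    body_ok = isinstance(response.get("body", ""), str)
    return status_ok and headers_ok and body_ok


good = {"statusCode": 200, "headers": {"Content-Type": "text/plain"}, "body": "1.0"}
bad = {"statusCode": 200, "body": 1.0}  # body is a float, not a string
print(is_proxy_response(good), is_proxy_response(bad))
```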

Enter the name of the Lambda function created earlier into the **Lambda Function** field, then click **Save**. Grant permission to API Gateway to invoke the Lambda function by clicking **OK** in the subsequent prompt.

To finalize the creation of API Gateway, select **Actions**, then **Deploy API**. Create a new Deployment stage, naming it as desired, such as `prod`.

You've now successfully established a public API to access your SageMaker model. Be sure to record the URL provided for invoking your newly created public API, as it will be required in the next step. This URL is located at the top of the page, highlighted in blue next to the text **Invoke URL**.
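
The API can also be exercised from Python rather than from a web page. The sketch below only builds the request, so it runs without network access; `API_URL` is a placeholder to be replaced with your own Invoke URL, and `build_request` is an illustrative helper.

```python
import urllib.request

API_URL = "https://example.invalid/prod"  # hypothetical; replace with your Invoke URL


def build_request(review, url=API_URL):
    """Build a POST request carrying the review as plain text."""
    return urllib.request.Request(
        url,
        data=review.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_request("A great movie!")
print(request.get_method(), request.full_url)

# Sending is left commented out so the sketch runs offline:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```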

## Step 4: Deploying the web app

Now that there is a publicly available API, we can start using it in a web app. For our purposes, we have a simple static HTML file which can make use of the public API created earlier.

In the `website` folder there should be a file called `index.html`. Download the file to your computer and open it in a text editor of your choice. There should be a line which contains **\*\*REPLACE WITH PUBLIC API URL\*\***. Replace this string with the URL that you recorded in the last step and then save the file.

Now, if you open `index.html` on your local computer, your browser will load the page directly from disk and you can use the provided site to interact with your SageMaker model.

If you'd like to go further, you can host this HTML file anywhere you'd like, for example using GitHub or as a static site on Amazon S3. Once you have done this, you can share the link with anyone you'd like and have them play with it too!

> **Important Note** In order for the web app to communicate with the SageMaker endpoint, the endpoint has to actually be deployed and running. This means that you are paying for it. Make sure that the endpoint is running when you want to use the web app but that you shut it down when you don't need it, otherwise you will end up with a surprisingly large AWS bill.


**Sample Outputs:**

"The movie was utterly disappointing." - Your review was NEGATIVE!

"Beautiful. Would love to watch it again." - Your review was POSITIVE!

"I hate that I loved it from the bottom of my heart." - Your review was POSITIVE!

### Delete the endpoint

Remember to always shut down your endpoint if you are no longer using it. You are charged for the length of time that the endpoint is running so if you forget and leave it on you could end up with an unexpectedly large bill.

In [51]:
predictor.delete_endpoint()