# Sentiment Analysis Web App

## Introduction

In this project, using **SageMaker** and **PyTorch**, we are going to construct a basic (but functional) **Sentiment Analysis Web App** from end to end. Our goal is to have a simple web page in which a user can enter a movie review; the web page will then send the review to our deployed model, which will predict the sentiment of the entered review.

As a technical prerequisite, we first need a few imports:

In [1]:
# Needed imports:
import numpy as np
import pandas as pd
import os, glob, nltk, re, pickle, sagemaker, torch, time
from bs4 import BeautifulSoup
from collections import Counter
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from torch.nn import BCELoss
from torch.optim import Adam
from train.model import LSTMClassifier
from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import RealTimePredictor
from sklearn.metrics import accuracy_score
from nltk.stem.porter import PorterStemmer
from torch.utils.data import DataLoader, TensorDataset

## General Outline

Before starting this project, we can recall the general outline for SageMaker projects using a notebook instance:

1. Download or otherwise retrieve the data;
2. Process / Prepare the data;
3. Upload the processed data to S3;
4. Train a chosen model;
5. Test the trained model (typically using a batch transform job);
6. Deploy the trained model;
7. Use the deployed model.

For this project, we will be following the steps in the general outline with some modifications. 

Indeed, we will not be testing the model in its own step: instead, we will deploy the model and then send the test data to the deployed endpoint. One reason for doing this is to make sure that the deployed model is working correctly before moving forward.

In addition, we will deploy and use our trained model a second time: in this second iteration, we will customize the way the trained model is deployed by including some of our own code. This newly deployed model will then be used in the sentiment analysis web app.

## Step 1: Download the data

Here, we will be using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

*Nota Bene:* This dataset is described in the following publication:
> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [2]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2018-12-25 08:19:18--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2018-12-25 08:19:28 (8.52 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Step 2: Process / Prepare the data

Now, we will be doing some initial data processing: To begin with, we will read in each of the reviews and combine them into a single input structure; then, we will split the dataset into a training set and a testing set.

In [3]:
# Function read_imdb_data:

def read_imdb_data(data_dir='../data/aclImdb'):
    """
    Read data from source
    """
    
    # Initialize data and labels:
    data = {}
    labels = {}
    
    # Initialize train and test sections:
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        # Initialize sentiment classification sections:
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            # Create path and files:
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            # Build data and labels:
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
            
            # Sanity check (each review must have a label):
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
    
    # Return data and labels:
    return data, labels

In [4]:
# Create data and labels from source:
data, labels = read_imdb_data()

# Some stats on the dataset:
print("IMDb reviews:")
print("=> train: {} positive / {} negative".format(len(data['train']['pos']),
                                                   len(data['train']['neg'])))
print("=> test: {} positive / {} negative".format(len(data['test']['pos']),
                                                  len(data['test']['neg'])))

IMDb reviews:
=> train: 12500 positive / 12500 negative
=> test: 12500 positive / 12500 negative


Now that we've read the raw training and testing data from the downloaded dataset, we will combine the positive and negative reviews and shuffle the resulting records.

In [5]:
# Function prepare_imdb_data:

def prepare_imdb_data(data, labels):
    """
    Prepare training and test sets from IMDb movie reviews
    """
    
    # Combine positive and negative reviews and labels:
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    # Shuffle reviews and corresponding labels within training and test sets:
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return the unified training data, test data, training labels and test labels:
    return data_train, data_test, labels_train, labels_test
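
The key property of the shuffle step is that reviews and labels are permuted together, so that each review keeps its own label. Here is a minimal sketch of that idea using only the standard library: `random` stands in for `sklearn.utils.shuffle`, and `shuffle_together` is a hypothetical helper name that is not part of the notebook.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two parallel lists with the same permutation (illustrative helper)."""
    # Pair each item with its label so they move as a unit:
    pairs = list(zip(items, labels))
    random.Random(seed).shuffle(pairs)
    # Unzip back into two lists:
    shuffled_items = [item for item, _ in pairs]
    shuffled_labels = [label for _, label in pairs]
    return shuffled_items, shuffled_labels

toy_reviews = ["loved it", "hated it", "great cast", "dull plot"]
toy_labels = [1, 0, 1, 0]
mixed_reviews, mixed_labels = shuffle_together(toy_reviews, toy_labels, seed=0)

# Every review is still attached to its original label:
original = dict(zip(toy_reviews, toy_labels))
print(all(original[r] == l for r, l in zip(mixed_reviews, mixed_labels)))
```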

In [6]:
# Prepare the different datasets:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)

# Some stats on the dataset:
print("IMDb reviews (combined):")
print("=> train: {}".format(len(train_X)))
print("=> test: {}".format(len(test_X)))

IMDb reviews (combined):
=> train: 25000
=> test: 25000


Now that we have our training and testing sets unified and prepared, we should do a quick check and see an example of the data our model will be trained on. This is generally a good idea as it allows us to see how each of the further processing steps affects the reviews and it also ensures that the data has been loaded correctly.

In [7]:
# Review #42 in the training set:
print("=> Review #42 in the training set:")
print(train_X[42])

# Label #42 in the training set ('0': negative, '1': positive):
print("\n=> Label #42 in the training set ('0': negative, '1': positive): {}".format(train_y[42]))

=> Review #42 in the training set:
This is a dramatic film in the whole sense of the word. It tells a tail that here in Greece we live as a routine in everyday life without realizing how sad it is. Sure it has some extremes.. but every now and then real life sorrow surpasses art.It is deeply critical of the goals we pursue and the whole social structure build around them. The film has a deeper understanding of Greek ways of life, stereotypes, and social structure. Unlike most Greek films that have a certain fast-food-mainstream audience, this one does not target anyone in particular but while you watch it you feel that someone put the best possible words and pictures to describe your feelings. I am not a big fan of traditional music either but I wouldn't like to hear anything else when it was played during the film.<br /><br />If someone told me to say something against this film I'd define the following, sometimes the transition between scenes seemed sudden or somewhat cut. I guess th

The first step in processing the reviews is to remove any HTML tags that appear. In addition, we wish to tokenize and stem our input, so that words such as *entertained* and *entertaining* are considered the same with regard to sentiment analysis.

In [8]:
# Function review_to_words:

def review_to_words(review):
    """
    This function transforms a review into its corresponding list of words
    """

    # Download the stop word list and build the stemmer once:
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    # Remove HTML tags:
    text = BeautifulSoup(review, "html.parser").get_text()

    # Convert to lower case and replace non-alphanumeric characters by spaces:
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # Split string into words:
    words = text.split()

    # Remove stopwords:
    words = [w for w in words if w not in stop_words]

    # Stem each word (strip common morphological and inflexional endings):
    words = [stemmer.stem(w) for w in words]

    # Return review as a list of words:
    return words

The `review_to_words` function defined above uses `BeautifulSoup` to remove any HTML tags that appear and uses the `nltk` package to remove stop words and stem the remaining tokens. As a check to ensure we know how everything is working, we apply `review_to_words` to one of the reviews in the training set.

In [9]:
# Apply review_to_words to #42 in the training set:
review_42_to_words = review_to_words(train_X[42])
print("Review #42 in the training set processed by 'review_to_words':")
print(review_42_to_words)

Review #42 in the training set processed by 'review_to_words':
['dramat', 'film', 'whole', 'sens', 'word', 'tell', 'tail', 'greec', 'live', 'routin', 'everyday', 'life', 'without', 'realiz', 'sad', 'sure', 'extrem', 'everi', 'real', 'life', 'sorrow', 'surpass', 'art', 'deepli', 'critic', 'goal', 'pursu', 'whole', 'social', 'structur', 'build', 'around', 'film', 'deeper', 'understand', 'greek', 'way', 'life', 'stereotyp', 'social', 'structur', 'unlik', 'greek', 'film', 'certain', 'fast', 'food', 'mainstream', 'audienc', 'one', 'target', 'anyon', 'particular', 'watch', 'feel', 'someon', 'put', 'best', 'possibl', 'word', 'pictur', 'describ', 'feel', 'big', 'fan', 'tradit', 'music', 'either', 'like', 'hear', 'anyth', 'els', 'play', 'film', 'someon', 'told', 'say', 'someth', 'film', 'defin', 'follow', 'sometim', 'transit', 'scene', 'seem', 'sudden', 'somewhat', 'cut', 'guess', 'edit', 'cut', 'fit', '2hour', 'bit', 'theatr', 'anyway', 'could', 'write', 'express', 'thought', 'guess', 'u', 'se

Above, we mentioned that the `review_to_words` function removes HTML formatting and stems the words found in a review, converting, for example, *entertained* and *entertaining* into *entertain*, so that they are treated as though they are the same word.

Furthermore, this function converts a review to lower case, removes "stop words" (words that carry little information in a text classification task, e.g. "the", "a"...) and splits the review into a list of individual words.
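
To see the cleaning steps in isolation, here is a simplified sketch that relies only on the standard library. It is not the notebook's `review_to_words`: a regular expression stands in for `BeautifulSoup`, the stop word list is a tiny hand-written set, the stemming step is left out, and `clean_review` and `TOY_STOP_WORDS` are hypothetical names.

```python
import re

# Tiny hand-written stop word list, for illustration only:
TOY_STOP_WORDS = {"this", "was", "the", "a"}

def clean_review(review):
    # Drop anything that looks like an HTML tag (simplified stand-in for BeautifulSoup):
    text = re.sub(r"<[^>]+>", " ", review)
    # Lower-case and replace non-alphanumeric characters by spaces:
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Split on whitespace and filter out stop words:
    return [w for w in text.split() if w not in TOY_STOP_WORDS]

print(clean_review("This movie was <br /><br />GREAT, really!"))
# ['movie', 'great', 'really']
```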

The function below applies the `review_to_words` function to each of the reviews in the training and testing datasets, and, in addition, it caches the results.

*Nota Bene:* We do this because performing this processing step can take a long time, so, this way, if we are unable to complete the notebook in the current session, we can come back without needing to process the data a second time.

In [10]:
# Define where to store cache files:
cache_dir = os.path.join("../cache", "sentiment_analysis")

# Ensure cache directory exists:
os.makedirs(cache_dir, exist_ok=True)

# Function preprocess_data:

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """
    Convert each review to words (read from cache if available)
    """

    # If cache_file is not None, try to read from it first:
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), 'rb') as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file {}... Done.".format(cache_file))
        except Exception:
            # Unable to read from cache (but that's okay):
            pass
    
    # If cache is missing, then do the heavy lifting:
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review:
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs:
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), 'wb') as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file {}... Done.".format(cache_file))
    else:
        # Unpack data loaded from cache file:
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                                                              cache_data['words_test'],
                                                              cache_data['labels_train'],
                                                              cache_data['labels_test'])
    
    # Return preprocessed data:
    return words_train, words_test, labels_train, labels_test

In [11]:
# Preprocess data:
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Wrote preprocessed data to cache file preprocessed_data.pkl... Done.


For the model we are going to build in this notebook, we will construct a feature representation which is very similar to a bag-of-words feature representation.

To start, we will represent each word as an integer. Of course, some of the words that appear in the reviews occur very infrequently and so likely don't contain much information for the purposes of sentiment analysis. We will deal with this problem by fixing the size of our working vocabulary and only including the words that appear most frequently.

We will then combine all of the infrequent words into a single category and, in our case, we will label it as `1`.

Since we will be using a recurrent neural network, it will be convenient if the length of each review is the same: To do this, we will fix a size for our reviews and then pad short reviews with the category 'no word' (which we will label `0`) and truncate long reviews.

To begin with, we need to construct a way to map words that appear in the reviews to integers.

Here we fix the size of our vocabulary (including the 'no word' and 'infrequent' categories) to be `5000`, but it will be possible to change this to see how it affects the model.

*Nota Bene:* Below, we are going to implement the `build_dict()` function. Note that even though `vocab_size` is set to `5000`, we only want to construct a mapping for the `4998` most frequently appearing words. This is because we want to reserve the special labels `0` for 'no word' and `1` for 'infrequent word'.

In [12]:
# Function build_dict:

def build_dict(data, vocab_size = 5000):
    """
    Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer
    ('data' is a list of sentences, and a sentence is a list of words)
    """
    
    # Initialize and build word list:
    word_list = []
    for sentence in data:
        for word in sentence:
            word_list.append(word)
    
    # Store in a dictionary the words that appear in the reviews along with how often they occur:
    word_count = dict(Counter(word_list))
    
    # Sort in frequency descending order the words of the previous dictionary:
    sorted_words = sorted(word_count, key=word_count.get, reverse=True)
    
    # Initialize and build a dictionary that translates words into integers:
    word_dict = {}
    for idx, word in enumerate(sorted_words[:vocab_size - 2]):
        # Save room for the 'no word' and 'infrequent' labels:
        word_dict[word] = idx + 2
    
    # Return built dictionary:
    return word_dict
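
To illustrate why only `vocab_size - 2` words receive their own index, here is a toy version of the same idea on a three-review corpus. `build_toy_dict` is a hypothetical, condensed variant that uses `Counter.most_common`; it is meant as a sketch of the mapping, not as a replacement for `build_dict`.

```python
from collections import Counter

def build_toy_dict(sentences, vocab_size):
    # Count every word across all sentences:
    counts = Counter(word for sentence in sentences for word in sentence)
    # Keep the (vocab_size - 2) most frequent words; labels 0 and 1 stay reserved:
    kept = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: idx + 2 for idx, word in enumerate(kept)}

toy_reviews = [["great", "movi", "great", "act"],
               ["bad", "movi", "act"],
               ["great", "movi", "movi"]]

# With vocab_size=5, only the 3 most frequent words get an index (2, 3 and 4):
toy_dict = build_toy_dict(toy_reviews, vocab_size=5)
print(toy_dict)  # {'movi': 2, 'great': 3, 'act': 4}
```

The word `bad` does not make the cut here, so it would later be encoded with the 'infrequent' label `1`.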

In [13]:
# Construct word dictionary for training set:
word_dict = build_dict(train_X)

In [14]:
# Determine the five most frequently appearing words in the training set:
sorted_words = sorted(word_dict, key=word_dict.get)
print("The five most frequently appearing words in the training set are:")
print("1- {}".format(sorted_words[0]))
print("2- {}".format(sorted_words[1]))
print("3- {}".format(sorted_words[2]))
print("4- {}".format(sorted_words[3]))
print("5- {}".format(sorted_words[4]))

The five most frequently appearing words in the training set are:
1- movi
2- film
3- one
4- like
5- time


As can be seen above, the five most frequently appearing words in the training set are not surprising: they are common words belonging to the lexicon of movie reviews.

Later on, when we construct an endpoint that processes a submitted review, we will need the `word_dict` we have just created. As such, we will save it to a file now for future use.

In [15]:
# The folder we will use for storing data:
data_dir = '../data/pytorch'

# Make sure that the folder exists:
os.makedirs(data_dir, exist_ok=True)
    
# Save word_dict:
with open(os.path.join(data_dir, 'word_dict.pkl'), 'wb') as f:
    pickle.dump(word_dict, f)

Now that we have our word dictionary which allows us to transform the words appearing in the reviews into integers, it is time to make use of it and convert our reviews to their integer sequence representation, making sure to pad or truncate to a fixed length, which in our case is `500`.

In [16]:
# Function convert_and_pad:

def convert_and_pad(word_dict, sentence, pad=500):
    """
    Convert a sentence (list of words) into a list of 'pad' integers using a word dictionary,
    padding short sentences with '0' and truncating long ones
    """

    # Represent the 'no word' category by '0':
    NOWORD = 0

    # Represent the infrequent words (i.e. not appearing in word_dict) by '1':
    INFREQ = 1

    # Create default working sentence:
    working_sentence = [NOWORD]*pad

    # Treatment of each word of the sentence:
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ

    # Number of real (non-padding) entries, i.e. the sentence length capped at 'pad':
    min_length_zone = min(len(sentence), pad)

    # Return the results:
    return working_sentence, min_length_zone

# Function convert_and_pad_data:

def convert_and_pad_data(word_dict, data, pad=500):
    """
    Apply 'convert_and_pad' to every sentence of a dataset and return the results as arrays
    """
    
    # Initialize result lists:
    result = []
    lengths = []
    
    # Treatment of each sentence of the dataset:
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
    
    # Convert results into arrays:
    result_array = np.array(result)
    lengths_array = np.array(lengths)
    
    # Return results:
    return result_array, lengths_array
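
The behaviour on short and long inputs is easier to see with a small padding size. The sketch below follows the same conventions (`0` for 'no word', `1` for 'infrequent') on a toy dictionary; `encode_and_pad` is a hypothetical, condensed variant of `convert_and_pad` and the dictionary values are made up for the example.

```python
NOWORD, INFREQ = 0, 1

def encode_and_pad(word_dict, sentence, pad):
    # Map known words to their index and unknown words to INFREQ, truncating at 'pad':
    encoded = [word_dict.get(word, INFREQ) for word in sentence[:pad]]
    length = len(encoded)
    # Fill the remaining positions with NOWORD:
    return encoded + [NOWORD] * (pad - length), length

toy_dict = {"movi": 2, "great": 3, "act": 4}

# A short sentence is padded, and the unknown word 'bad' becomes 1:
short = encode_and_pad(toy_dict, ["great", "bad", "movi"], pad=5)
# A long sentence is truncated to 'pad' entries:
long = encode_and_pad(toy_dict, ["act"] * 7, pad=5)

print(short)  # ([3, 1, 2, 0, 0], 3)
print(long)   # ([4, 4, 4, 4, 4], 5)
```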

In [17]:
# Apply convert_and_pad_data to training set:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)

# Apply convert_and_pad_data to testing set:
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

As a quick check to make sure that things are working as intended, we look at one of the reviews in the training set after it has been processed. Does this look reasonable? What is the length of a review in the training set?

In [19]:
# Examine review #42 in the training set:
print("Review #42 in the training set processed by 'convert_and_pad_data':")
print("\n=> Processed review:\n{}".format(train_X[42]))
print("\n=> Length review: {}".format(train_X_len[42]))

Review #42 in the training set processed by 'convert_and_pad_data':

=> Processed review:
[ 702    3  144  196  307  133 3675    1   75 1735 2520   49  129  420
  488  142  385   93   71   49 4862 3819  337 1536  507 2268 2378  144
  867 1707  612  110    3 2425  244 3134   31   49  901  867 1707  601
 3134    3  735  636 1481 2180  179    4 1524  181  763   12   62  201
  140   53  284  307  269  773   62  116  123 1088   85  298    5  526
  153  265   33    3  201  512   38   66    3 1902  214  434 2811   18
   39 1866  582  373  355  445  373  651    1  124 1407  448   36  251
  802   99  355 1098   11 1299  132  130   11    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0   

As can be seen above, the processed review #42 of the training set looks reasonable: a sequence of word indices followed by `0` padding.

*Nota Bene:* In the cells above, we use the `preprocess_data` and `convert_and_pad_data` functions to process both the training and testing sets. The word dictionary was built from the training set only, so no information from the test set leaks into it. Applying the same transformation to the test set should not be a problem, provided both sets were constituted carefully, since test words missing from the dictionary are simply mapped to the 'infrequent' label `1`.

## Step 3: Upload the processed data to S3

Now, we need to upload the training dataset to S3 so that our training code can access it. First, however, we will save it locally; the upload to S3 comes afterwards.

It is important to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.
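
To make the row layout concrete, the sketch below writes a single row in the order `label`, `length`, `review[...]` and reads it back. It uses a padding size of `5` instead of `500` for readability and the standard `csv` module instead of pandas; the variable names are illustrative and this is not the parsing code of the actual training script.

```python
import csv
import io

PAD = 5  # the notebook uses 500; 5 keeps the example readable

label, length, review = 1, 3, [3, 1, 2, 0, 0]

# Write one row in the order label, length, review[PAD] (no header, no index):
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([label, length] + review)

# Read it back and split the columns the way a training script could:
buffer.seek(0)
row = [int(value) for value in next(csv.reader(buffer))]
row_label, row_length, row_review = row[0], row[1], row[2:]

print(row_label, row_length, row_review)  # 1 3 [3, 1, 2, 0, 0]
```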

In [20]:
# Save locally the training dataset:
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

Next, we need to upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [21]:
# Create SageMaker session:
sagemaker_session = sagemaker.Session()

# Define bucket, prefix and role:
bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'
role = sagemaker.get_execution_role()

INFO:sagemaker:Created S3 bucket: sagemaker-eu-west-1-579406810085


In [22]:
# Upload training data to SageMaker default S3 bucket:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

*Nota Bene:* The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. This is fortunate as we will need this later on when we create an endpoint that accepts an arbitrary review. For now, we will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that we will need to make sure it gets saved in the model directory.

## Step 4: Build and Train the PyTorch Model

In the SageMaker framework, a model comprises three objects:

- Model Artifacts;
- Training Code;
- Inference Code.
 
Each of these objects interacts with the others. It is possible to use training and inference code provided by Amazon; here, we will still be using containers provided by Amazon, but with the added benefit of being able to include our own custom code.

We will start by implementing our own neural network in PyTorch along with a training script.

For the purposes of this project, we are going to use the model object defined in the `model.py` file, inside of the `train` folder.

The provided implementation can be observed by running the cell below.

In [23]:
!pygmentize train/model.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


#############################################################
# PROGRAMMER: Pierre-Antoine Ksinant                        #
# DATE CREATED: 20/12/2018                                  #
# REVISED DATE: -                                           #
# PURPOSE: Construct a RNN model based on LSTM units        #
#############################################################


##################
# Needed imports #
##################

import torch.nn as nn


######################################################
# Class to construct a RNN model based on LSTM units #
######################################################

cla

The important takeaway from the implementation provided is that there are three parameters that we may wish to tweak to improve the performance of our model. These are the embedding dimension, the hidden dimension and the size of the vocabulary. We will likely want to make these parameters configurable in the training script so that if we wish to modify them we do not need to modify the script itself. We will see how to do this later on. To start we will write some of the training code in the notebook so that we can more easily diagnose any issues that arise.

First, we will load a small portion of the training data set to use as a sample. It would be very time consuming to train the model completely in the notebook, as we do not have access to a GPU and the compute instance that we are using is not particularly powerful. However, we can work on a small part of the data to get a feel for how our training script is behaving.

In [24]:
# Read in only the first 250 rows:
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors:
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset:
train_sample_ds = TensorDataset(train_sample_X, train_sample_y)

# Build the dataloader:
train_sample_dl = DataLoader(train_sample_ds, batch_size=50)

Next we need to write the training code itself.

In [25]:
# Function train:

def train(model, train_loader, epochs, optimizer, loss_fn, device):
    """
    This is the training function that is called by the PyTorch training script. The parameters
    passed are as follows:
    model        - The PyTorch model that we wish to train.
    train_loader - The PyTorch DataLoader that should be used during training.
    epochs       - The total number of epochs to train for.
    optimizer    - The optimizer to use during training.
    loss_fn      - The loss function used for training.
    device       - Where the model and data should be loaded (gpu or cpu).
    """
    
    # Track the training session:
    print("Training (for {} epoch(s)):\n*****".format(epochs))
    start_time = time.time()
    
    # Perform forward and backpropagation passes:
    for epoch in range(1, epochs + 1):
        
        # Put model in training mode:
        model.train()
        
        # Set total loss to zero:
        total_loss = 0
        
        # Move on training data loader:
        for batch in train_loader:
            
            # Define data and label:
            batch_X, batch_y = batch
            
            # Move to consistent device:
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # Zero accumulated gradients:
            optimizer.zero_grad()
            
            # Get the output from the model:
            batch_output = model(batch_X)
            
            # Calculate the loss:
            loss = loss_fn(batch_output, batch_y)
            
            # Perform backpropagation:
            loss.backward()
            
            # Perform optimization:
            optimizer.step()
            
            # Calculate loss over data loader:
            total_loss += loss.item()
        
        # Print and register loss stats:
        print("Epoch {}... Loss {}...".format(epoch, total_loss/len(train_loader)))
    
    # Time performance:
    end_time = time.time()
    total_time = int(end_time - start_time)
    hours = total_time//3600
    minutes = (total_time%3600)//60
    seconds = (total_time%3600)%60
    print("*****\nEnd of the training: {:02d}h {:02d}m {:02d}s".format(hours,
                                                                       minutes,
                                                                       seconds))

With the training function above in place, we test that it works by running it in the notebook on the small sample training set that we loaded earlier. Doing this in the notebook gives us an opportunity to fix errors early, when they are easier to diagnose.

In [26]:
# Set device:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define model and move it to device:
model = LSTMClassifier(32, 100, 5000).to(device)

# Define optimizer and loss function:
optimizer = Adam(model.parameters())
loss_fn = BCELoss()

# Perform training for 5 epochs:
train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Training (for 5 epoch(s)):
*****
Epoch 1... Loss 0.6964830875396728...
Epoch 2... Loss 0.688408088684082...
Epoch 3... Loss 0.6823609828948974...
Epoch 4... Loss 0.675974678993225...
Epoch 5... Loss 0.6683764576911926...
*****
End of the training: 00h 00m 29s
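
A useful reference point when reading these numbers is the loss of a classifier that always outputs a probability of `0.5`. The sketch below computes the mean binary cross-entropy by hand (the quantity that `BCELoss` reports with its default mean reduction); `binary_cross_entropy` is a hypothetical helper written for this illustration.

```python
import math

def binary_cross_entropy(predictions, targets):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(predictions, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(predictions)

# A model that always outputs 0.5 is at chance level:
chance_loss = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(round(chance_loss, 4))  # 0.6931, i.e. ln(2)
```

The first epoch above starts at roughly this value, which suggests the sample model begins near chance level and improves only slowly on 250 rows; the point of this run is to check that the training loop works, not to obtain a good model.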


In order to construct a PyTorch model using SageMaker, we must provide SageMaker with a training script.

We may optionally include a directory which will be copied to the container and from which our training code will be run: When the training container is executed, it will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

Furthermore, when a PyTorch model is constructed in SageMaker, an entry point must be specified: this is the Python file which will be executed when the model is trained.

Inside of the `train` directory, there is a file called `train.py` which contains most of the necessary code to train our model (in particular, the `train()` function written above is the same as the one found in this `train.py` file).

SageMaker passes hyperparameters to the training script as command-line arguments, which can then be parsed and used in the training script. To see how this is done, take a look at the `train/train.py` file.
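
As a rough idea of what such argument parsing can look like, here is a minimal sketch based on `argparse`. It is not the content of `train/train.py`: the default values are assumptions chosen for the example, and only the `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` environment variables come from the SageMaker training container (the latter corresponds to a channel named `training`).

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments (defaults here are assumptions):
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--embedding_dim", type=int, default=32)
    parser.add_argument("--hidden_dim", type=int, default=100)
    parser.add_argument("--vocab_size", type=int, default=5000)
    # Locations provided by the SageMaker container through environment variables
    # (the fallback values are placeholders for running outside SageMaker):
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    return parser.parse_args(argv)

# Simulate the arguments produced by hyperparameters={'epochs': 100, 'hidden_dim': 200}:
args = parse_args(["--epochs", "100", "--hidden_dim", "200"])
print(args.epochs, args.hidden_dim, args.embedding_dim)  # 100 200 32
```

Keeping the model dimensions as arguments means they can be changed from the estimator's `hyperparameters` dictionary without editing the script.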

In [27]:
# Define estimator for SageMaker:
estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 100,
                        'hidden_dim': 200,
                    })

In [28]:
# Perform training thanks to SageMaker:
estimator.fit({'training': input_data})

INFO:sagemaker:Creating training-job with name: sagemaker-pytorch-2018-12-25-09-12-36-047


2018-12-25 09:12:36 Starting - Starting the training job...
2018-12-25 09:12:38 Starting - Launching requested ML instances...
2018-12-25 09:13:34 Starting - Preparing the instances for training.........
2018-12-25 09:14:48 Downloading - Downloading input data...
2018-12-25 09:15:17 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2018-12-25 09:15:50,660 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2018-12-25 09:15:50,686 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2018-12-25 09:15:50,692 sagemaker_pytorch_container.training INFO     Invoking user training script.
2018-12-25 09:15:50,966 sagemaker-containers INFO     Module train does not provide a setup.py.
Generating setup.py
2018-12-25 09:15:50,967 sagemaker-containers INFO

Model loaded with embedding_dim 32, hidden_dim 200, vocab_size 5000...
Training (for 100 epoch(s)):
*****
Epoch 1... Loss 0.668552460719128...
Epoch 2... Loss 0.5984970866417398...
Epoch 3... Loss 0.5067791008219427...
Epoch 4... Loss 0.4338025779140239...
Epoch 5... Loss 0.39505745196829034...
Epoch 6... Loss 0.35590209887952223...
Epoch 7... Loss 0.32559014522299473...
Epoch 8... Loss 0.29723833653391624...
Epoch 9... Loss 0.27673907851686286...
Epoch 10... Loss 0.2626749964392915...
Epoch 11... Loss 0.26320676110228713...
Epoch 12... Loss 0.24904222695194944...
Epoch 13... Loss 0.23269278814598005...
Epoch 14... Loss 0.22138099919776527...
Epoch 15... Loss 0.2086204463730053...
Epoch 16... Loss 0.21428000835739835...
Epoch 17... Loss 0.20140253068233022...
Epoch 18... Loss 0.19684941792974667...


## Step 5: Test the trained model

As mentioned at the top of this notebook, we will be testing this model by first deploying it and then sending the testing data to the deployed endpoint. We will do this so that we can make sure that the deployed model is working correctly.

## Step 6: Deploy the trained model

Now that we have trained our model, we would like to test it to see how it performs. Currently our model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. Fortunately for us, SageMaker provides built-in inference code for models with simple inputs such as this.

There is one thing that we need to provide, however, and that is a function which loads the saved model. This function must be called `model_fn()` and takes as its only parameter a path to the directory where the model artifacts are stored. This function must also be present in the Python file which we specified as the entry point. In our case, the model loading function has been provided, so no changes need to be made.

*Nota Bene:* When the built-in inference code is run, it must import the `model_fn()` function from the `train.py` file. This is why the training code is wrapped in a main guard (i.e. `if __name__ == '__main__':`). Since we don't need to change anything in the code that was uploaded during training, we can simply deploy the current model as-is.
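
The effect of the main guard can be sketched with the standard library alone. The snippet below writes a tiny stand-in entry-point script to a temporary file and imports it: the module-level `model_fn()` becomes available, while the guarded block does not run. The script content, the `TRAINING_RAN` flag and the helper `import_entry_point` are hypothetical and only serve the illustration; a real `model_fn()` would rebuild the model from the files found in `model_dir`.

```python
import importlib.util
import os
import tempfile
import textwrap

# Hypothetical stand-in for an entry-point script:
SCRIPT = textwrap.dedent('''
    TRAINING_RAN = False

    def model_fn(model_dir):
        """Stand-in loader: only records where the model would be loaded from."""
        return {"loaded_from": model_dir}

    if __name__ == '__main__':
        # Training code would live here; it only runs when the file is executed directly.
        TRAINING_RAN = True
''')

def import_entry_point(source):
    """Write the source to a temporary file and import it as a module."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "entry_point.py")
        with open(path, "w") as f:
            f.write(source)
        spec = importlib.util.spec_from_file_location("entry_point", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    return module

module = import_entry_point(SCRIPT)
print(module.TRAINING_RAN)                 # the guarded block was skipped on import
print(module.model_fn("/opt/ml/model"))    # the loader is still importable
```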

*Nota Bene:* When deploying a model, we are asking SageMaker to launch a compute instance that will wait for data to be sent to it. As a result, this compute instance will continue to run until **we** shut it down: This is important to know since the cost of a deployed endpoint depends on how long it has been running. In other words, **if we are no longer using a deployed endpoint, we must shut it down!**

Now, we are going to deploy the trained model.

In [29]:
# Deploy the trained model:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-pytorch-2018-12-25-09-12-36-047
INFO:sagemaker:Creating endpoint with name sagemaker-pytorch-2018-12-25-09-12-36-047


--------------------------------------------------------------------------!

## Step 7: Use the deployed model

Once deployed, we can read in the test data and send it off to our deployed model to get some results, and, once we collect all of the results, we can determine how accurate our model is.

In [30]:
# Constitute test data:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [31]:
# We split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    """
    This function performs prediction on the data, one chunk at a time
    """

    # Split data into chunks of at most `rows` rows:
    split_array = np.array_split(data, int(data.shape[0]/float(rows) + 1))

    # Initialize predictions:
    predictions = np.array([])

    # Iterate over the chunks and accumulate the predictions:
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))

    # Return predictions:
    return predictions
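
To see what this chunking does, the pure-Python sketch below reproduces the documented sizing rule of `np.array_split` (the first `n % k` chunks get one extra row). The helper name `chunk_sizes` is ours, and the figure of 25000 rows is an assumption about the size of the test set.

```python
def chunk_sizes(n_rows, rows=512):
    """Sizes of the chunks produced by np.array_split(data, int(n_rows / rows + 1))."""
    n_chunks = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` chunks hold one more row than the others:
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)
print(len(sizes), max(sizes), sum(sizes))  # 49 511 25000
```

Every chunk therefore stays within the `rows` limit, which keeps each request to the endpoint reasonably small.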

In [32]:
# Perform predictions on testing set:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [33]:
# Determine accuracy on testing set:
print("On testing set, the accuracy we get is:")
accuracy_score(test_y, predictions)

On testing set, the accuracy we get is:


0.84464

*Nota Bene:* This model's accuracy score compares favorably with what can be obtained with other approaches, for example with XGBoost (the "Kaggle competition killer"): RNNs are particularly well suited to sequence tasks such as the one we face in this project.
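
The accuracy figure above comes from rounding each predicted probability to a label and comparing it with the ground truth. A minimal sketch of that computation, on toy values that are made up for the example (the helper names `to_labels` and `accuracy` are ours):

```python
def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels (same result as round() for values in [0, 1])."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy data, not taken from the notebook:
toy_probabilities = [0.9, 0.2, 0.6, 0.4]
toy_truth = [1, 0, 0, 0]

toy_labels = to_labels(toy_probabilities)
print(toy_labels)                       # [1, 0, 1, 0]
print(accuracy(toy_truth, toy_labels))  # 0.75
```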

We now have a trained model which has been deployed and which we can send processed reviews to, and which returns the predicted sentiment. However, ultimately, we would like to be able to send our model an unprocessed review, that is, we would like to send the review itself as a string.

For example, suppose we wish to send the following review to our model.

In [34]:
# Example of unprocessed review:
test_review = "The simplest pleasures in life are the best, " \
              "and this film is one of them. Combining a rather " \
              "basic storyline of love and adventure this movie " \
              "transcends the usual weekend fair with wit and unmitigated charm."

The question we now need to answer is, how do we send this review to our model?

In the first section of this notebook, we processed the IMDb dataset; in particular, we did two specific things to the provided reviews:

- Removed any html tags and stemmed the input;
- Encoded the review as a sequence of integers using `word_dict`.
 
So, in order to process the review, we will need to repeat these two steps.
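
As a rough illustration of these two steps, here is a simplified, self-contained sketch. It is not the notebook's `review_to_words` and `convert_and_pad`: stemming is left out, HTML is stripped with a plain regular expression, and the helper names, the toy dictionary and the special ids `0` (padding) and `1` (unknown word) are assumptions made for the example.

```python
import re

NO_WORD = 0      # assumed id used for padding
INFREQUENT = 1   # assumed id used for words missing from the dictionary

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space (simplified)."""
    return re.sub(r'<[^>]+>', ' ', text)

def encode_and_pad(word_dict, words, pad=500):
    """Encode words as integers, truncate to `pad` entries and pad with NO_WORD."""
    ids = [word_dict.get(word, INFREQUENT) for word in words[:pad]]
    length = len(ids)
    return ids + [NO_WORD] * (pad - length), length

# Toy dictionary, not the notebook's word_dict:
toy_dict = {'movi': 2, 'great': 3}
words = strip_html('A <b>great</b> movi').lower().split()
encoded, length = encode_and_pad(toy_dict, words, pad=6)
print(encoded, length)  # [1, 3, 2, 0, 0, 0] 3
```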

In [35]:
# Convert test_review into a form usable by the model and save the results in test_data:
test_data = [np.array(convert_and_pad(word_dict, review_to_words(test_review))[0])]

Now that we have processed the review, we can send the resulting array to our model to predict the sentiment of the review.

In [36]:
# Perform prediction on the test review:
predictor.predict(test_data)

array(0.775861, dtype=float32)

Since the return value of our model is above `0.5`, and therefore closer to `1` than to `0`, the model predicts that the review we submitted is positive.

As mentioned previously, once we've deployed an endpoint, it continues to run until we tell it to shut down. So, since we are done using our endpoint for now, we can delete it.

In [37]:
# Delete endpoint:
estimator.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2018-12-25-09-12-36-047
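
Since forgetting this step costs money, one possible pattern (a sketch, not something the notebook uses) is to wrap the endpoint in a context manager so that deletion happens even if an error occurs in between. `temporary_endpoint` and `FakePredictor` are hypothetical names; the fake class only stands in for a real predictor, which exposes a `delete_endpoint()` method.

```python
from contextlib import contextmanager

@contextmanager
def temporary_endpoint(deploy):
    """Call `deploy()` to get a predictor and always delete its endpoint afterwards."""
    predictor = deploy()
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()

class FakePredictor:
    """Stand-in for a deployed predictor, used only to exercise the pattern."""
    def __init__(self):
        self.deleted = False

    def predict(self, data):
        return 1

    def delete_endpoint(self):
        self.deleted = True

fake = FakePredictor()
with temporary_endpoint(lambda: fake) as endpoint:
    endpoint.predict('some review')
print(fake.deleted)  # True
```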


## Step 6 (again): Deploy the trained model (for the web app)

Now that we know that our model is working, it's time to create some custom inference code so that we can send the model a review which has not been processed and have it determine the sentiment of the review.

As we saw above, by default, the estimator which we created, when deployed, will use the entry script and directory which we provided when creating the model. However, since we now wish to accept a string as input and our model expects a processed review, we need to write some custom inference code.

We will store the code that we write in the `serve` directory: This directory contains the `model.py` file that we used to construct our model, a `utils.py` file which contains the `review_to_words` and `convert_and_pad` pre-processing functions which we used during the initial data processing, and `predict.py`, the file which will contain our custom inference code. Note also that `requirements.txt` is present and will tell SageMaker what Python libraries are required by our custom inference code.

When deploying a PyTorch model in SageMaker, we are expected to provide four functions which the SageMaker inference container will use.

- `model_fn`: This function is the same function that we used in the training script and it tells SageMaker how to load our model;
- `input_fn`: This function receives the raw serialized input that has been sent to the model's endpoint and its job is to de-serialize and make the input available for the inference code;
- `output_fn`: This function takes the output of the inference code and its job is to serialize this output and return it to the caller of the model's endpoint;
- `predict_fn`: The heart of the inference script, this is where the actual prediction is done.

For the simple website that we are constructing during this project, the `input_fn` and `output_fn` functions are relatively straightforward: We only need to accept a string as input and return a single value as output.
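
The division of labour between these functions can be sketched as follows. This is not the content of `serve/predict.py`: the function bodies are simplified, `toy_model` is a hypothetical stand-in for the loaded network, and `model_fn` is omitted since it has already been discussed.

```python
def input_fn(serialized_input_data, content_type):
    """De-serialize the request body; only plain text is accepted in this sketch."""
    if content_type == 'text/plain':
        if isinstance(serialized_input_data, bytes):
            return serialized_input_data.decode('utf-8')
        return serialized_input_data
    raise ValueError('Unsupported content type: ' + content_type)

def predict_fn(input_data, model):
    """Run the model on the de-serialized input and turn its score into a 0/1 label."""
    score = model(input_data)
    return 1 if score > 0.5 else 0

def output_fn(prediction_output, accept):
    """Serialize the prediction so that it can be returned to the caller."""
    return str(prediction_output)

def toy_model(review):
    """Hypothetical stand-in for the trained model: returns a made-up score."""
    return 0.9 if 'good' in review.lower() else 0.1

review = input_fn(b'A good film', 'text/plain')
print(output_fn(predict_fn(review, toy_model), 'text/plain'))  # 1
```

In the real inference script, `predict_fn` would also have to apply the same pre-processing as during training before calling the model.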

*Nota Bene:* In a more complex application, the input or output may be image data or some other binary data which would require some effort to serialize.

Now, it's time to create and deploy our model: To begin with, we need to construct a new `PyTorchModel` object which points to the model artifacts created during training and also points to the inference code that we wish to use. Then, we can call the `deploy` method to launch the deployment container.

*Nota Bene:* The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. In our case, we want to send a string, so we need to construct a simple wrapper around the `RealTimePredictor` class to accommodate simple strings. As mentioned previously, in a more complicated situation, we might need to provide a serialized object, for example if we wanted to send image data.

In [38]:
# Define StringPredictor class:
class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

# Define model:
model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)

# Define predictor:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-pytorch-2018-12-25-09-57-40-179
INFO:sagemaker:Creating endpoint with name sagemaker-pytorch-2018-12-25-09-57-40-179


--------------------------------------------------------------------------!

Now that we have deployed our model with the custom inference code, we should test to see if everything is working.

Here, we test our model by loading the first `250` positive and the first `250` negative reviews, sending them to the endpoint, and collecting the results. We only send some of the data because processing the input and performing inference takes our model quite a long time, so testing the entire data set would take prohibitively long.

In [39]:
# Function test_reviews:

def test_reviews(data_dir='../data/aclImdb', stop=250):
    """
    Function to test the model
    """
    
    # Initialize results and ground:
    results = []
    ground = []
    
    # Make sure to test both positive and negative reviews:
    for sentiment in ['pos', 'neg']:
        
        # Define path to data to test:
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        # Initialize files counter:
        files_read = 0
        
        print("Start testing '{}' files...".format(sentiment))
        
        # Iterate through the files and send them to the predictor:
        for f in files:
            with open(f) as review:
                # Store the ground truth:
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP:
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results:
                results.append(int(predictor.predict(review_input)))
                
            # Update files counter:
            files_read += 1
            if files_read == stop:
                print("Done.")
                break
            
    # Return labels and predictions:
    return ground, results

In [40]:
# Perform tests:
ground, results = test_reviews()

Start testing 'pos' files...
Done.
Start testing 'neg' files...
Done.


In [41]:
# Determine accuracy:
accuracy_score(ground, results)

0.838

As an additional test, we can try sending the `test_review` that we looked at earlier.

In [42]:
# Perform prediction on the test review:
predictor.predict(test_review)

b'1'

Now that we know our endpoint is working as expected, we can set up the web page that will interact with it.

## Step 7 (again): Use the deployed model (for the web app)

*Nota Bene:* This entire section, and the next, contain tasks to complete mostly using the AWS console.

So far, we have been accessing our model endpoint by constructing a predictor object which uses the endpoint, and then, just using the predictor object to perform inference. What if we wanted to create a web app which accessed our model?

With the current setup, this is not possible: In order to access a SageMaker endpoint, the app would first have to authenticate with AWS using an IAM role which included access to SageMaker endpoints. However, there is an easier way: We just need to use some additional AWS services.

<img src="assets/web_app_diagram.png">

The diagram above gives an overview of how the various services will work together. On the far right, there is the model which we trained above and which is deployed using SageMaker. On the far left, there is our web app that collects a user's movie review, sends it off, and expects a positive or negative sentiment in return.

The middle is where some of the "magic" happens: We will construct a Lambda function, which can be thought of as a straightforward Python function that can be executed whenever a specified event occurs.

We will give this function permission to send data to and receive data from a SageMaker endpoint.

Lastly, the method we will use to execute the Lambda function is a new endpoint that we will create using API Gateway. This endpoint will be a URL that listens for data to be sent to it. Once it gets some data, it will pass that data on to the Lambda function and then return whatever the Lambda function returns. Essentially, it will act as an interface that lets our web app communicate with the Lambda function.

The first thing we are going to do is to set up a Lambda function.

This Lambda function will be executed whenever our public API has data sent to it. When it is executed it will receive the data, perform any sort of processing that is required, send the data (the review) to the SageMaker endpoint we've created, and then return the result.

Now, we are going to create an IAM Role for the Lambda function.

Since we want the Lambda function to call a SageMaker endpoint, we need to make sure that it has permission to do so: To do this, we will construct a role that we can later give the Lambda function.

Using the AWS Console, we can navigate to the **IAM** page and click on **Roles**. Then, we can click on **Create role**. We have to make sure that the **AWS service** is the type of trusted entity selected and we have to choose **Lambda** as the service that will use this role, then, we can click on **Next: Permissions**.

In the search box, we can type `sagemaker` and select the check box next to the **AmazonSageMakerFullAccess** policy. Then, we have to click on **Next: Review**.

Lastly, we need to give this role a name, making sure to use a name that we will remember later on (for example, `LambdaSageMakerRole`). Then, we can click on **Create role**.
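
For reference, the console choices above correspond to a trust policy that lets the Lambda service assume the role, plus the AWS-managed `AmazonSageMakerFullAccess` policy. The sketch below only builds and prints these two pieces as plain data; it does not call AWS, and the variable names are ours.

```python
import json

# Trust policy: allows the Lambda service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# ARN of the AWS-managed policy selected in the console:
managed_policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"

print(json.dumps(trust_policy, indent=2))
print(managed_policy_arn)
```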

Now it is time to actually create the Lambda function.

Using the AWS Console, we can navigate to the AWS Lambda page and click on **Create a function**. When we get to the next page, we have to make sure that **Author from scratch** is selected. Now, we can name our Lambda function, using a name that we will remember later on (for example, `sentiment_analysis_func`). We have to make sure that the **Python 3.6** runtime is selected and then choose the role that we created just above. Then, we can click on **Create Function**.

On the next page we will see some information about the Lambda function we've just created. If we scroll down, we should see an editor in which we can write the code that will be executed when our Lambda function is triggered.

Here, we will use the code below: 

```python
# We need to use the low-level library to interact with SageMaker
# (the SageMaker API is not available natively through Lambda):
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created:
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given:
    # (endpoint name we created, expected data format, actual review)
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',
                                       ContentType = 'text/plain',
                                       Body = event['body'])

    # The response is an HTTP response whose body contains the result of our inference:
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```

Once we have copied and pasted the code above into the Lambda code editor, we have to replace the `**ENDPOINT NAME HERE**` portion with the name of the endpoint that we deployed earlier.

*Nota Bene:* We can determine the name of this endpoint using the code cell below.

In [43]:
# Determine deployed endpoint name:
predictor.endpoint

'sagemaker-pytorch-2018-12-25-09-57-40-179'

Once we have added the endpoint name to the Lambda function, we can click on **Save**: Our Lambda function is now up and running!

Next, we need to create a way for our web app to execute the Lambda function: It is time to create a new API using API Gateway that will trigger the Lambda function we have just created.

Using AWS Console, we can navigate to **Amazon API Gateway** and then click on **Get started**.

On the next page, we have to make sure that **New API** is selected and give the new API a name (for example, `sentiment_analysis_api`). Then, we can click on **Create API**.

Now, we have created an API, however, it doesn't currently do anything: What we want it to do is to trigger the Lambda function that we created earlier.

For that, we have to select the **Actions** dropdown menu and click **Create Method**. A new blank method will be created: We have to select its dropdown menu, select **POST**, and then click on the check mark beside it.

For the integration point, we have to make sure that **Lambda Function** is selected and check the **Use Lambda Proxy integration** option. This option makes sure that the data that is sent to the API is then sent directly to the Lambda function with no processing. It also means that the return value must be a proper response object as it will also not be processed by API Gateway.
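
To make the shape of this exchange concrete, the sketch below mimics it locally, without AWS. With proxy integration, the request body arrives as a string under the `'body'` key of the event, and the handler must return a dictionary with `statusCode`, `headers` and a string `body`. The handler `local_handler` and its `infer` parameter are hypothetical; `infer` replaces the call to the SageMaker endpoint.

```python
import json

def local_handler(event, context, infer):
    """Local stand-in for a Lambda handler behind a proxy integration."""
    result = infer(event['body'])
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'},
        'body': result,
    }

# Minimal fake event: a real proxy event carries more keys (headers, HTTP method, ...).
event = {'body': 'A wonderful film'}
response = local_handler(event, None, infer=lambda review: '1')
print(json.dumps(response, indent=2))
```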

We have to type the name of the Lambda function we created earlier into the **Lambda Function** text entry box, and then, click on **Save**. Now, we can click on **OK** in the pop-up box that then appears, giving permission to API Gateway to invoke the Lambda function we created.

The last step in creating the API Gateway is to select the **Actions** dropdown and click on **Deploy API**. We will need to create a new Deployment stage and name it anything we like (for example, `prod`).

We have now successfully set up a public API to access our SageMaker model. So, we need to make sure to copy or write down the URL provided to invoke our newly created public API, as this will be needed in the next step.

*Nota Bene:* This URL can be found at the top of the page, highlighted in blue next to the text **Invoke URL**.
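
Before wiring up the web page, the API could also be exercised from Python. The following is only a sketch: `API_URL` is a placeholder to be replaced with the real Invoke URL, `build_request` is a helper of ours, and the request is built but deliberately not sent.

```python
import urllib.request

API_URL = 'https://example.invalid/prod'  # placeholder, not a real Invoke URL

def build_request(url, review):
    """Build a POST request carrying the review as plain text."""
    return urllib.request.Request(
        url,
        data=review.encode('utf-8'),
        headers={'Content-Type': 'text/plain'},
        method='POST',
    )

request = build_request(API_URL, 'A wonderful film')
print(request.get_method(), request.full_url)

# With a real URL and a running endpoint, the request could then be sent with:
# urllib.request.urlopen(request).read().decode('utf-8')
```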

Now that we have a publicly available API, we can start using it in a web app. For this purpose, a simple static HTML file which makes use of the public API we created earlier is provided (in the `website` folder, there should be a file called `index.html`).

If we open `index.html` on our local computer, our browser will load the page directly from disk and we can use the provided site to interact with our SageMaker model!

*Nota Bene:* If we'd like to go further, we can host this HTML file anywhere we'd like, for example, using GitHub or hosting a static site on Amazon's S3. Once we have done this, we can share the link with anyone we'd like and have them play with it too!

> **Important Note**: In order for the web app to communicate with the SageMaker endpoint, the endpoint has to actually be deployed and running. This means that we are **paying** for it. So, we have to make sure that the endpoint is running when we want to use the web app, but that we shut it down when we don't need it, otherwise we will end up with a surprisingly **large AWS bill**!

Now that our web app is working, we can try to play around with it and see how well it works: Below are some examples!

<img src="assets/review_0.png">

<img src="assets/review_1.png">

<img src="assets/review_2.png">

<img src="assets/review_3.png">

<img src="assets/review_4.png">

> **Important Note**: As mentioned previously, we have to remember to always shut down our endpoint if we are no longer using it: We are charged for the length of time that the endpoint is running, so if we forget and leave it on, we could end up with an **unexpectedly large bill**!

In [44]:
# Delete endpoint to avoid unexpectedly large bill:
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: sagemaker-pytorch-2018-12-25-09-57-40-179
