#  Sentiment Analysis with TensorFlow

A Convolutional Neural Net (CNN) is sometimes used in text classification tasks such as sentiment analysis.  We'll use a CNN built with TensorFlow to perform sentiment analysis in Amazon SageMaker on the IMDB dataset, which consists of movie reviews labeled as having positive or negative sentiment. Three aspects of Amazon SageMaker will be demonstrated:

- How to use Script Mode with a prebuilt TensorFlow container, along with a training script similar to one you would use outside SageMaker. 
- Local Mode training, which allows you to test your code on your notebook instance before creating a full scale training job.
- Batch Transform for offline, asynchronous predictions on large batches of data. 

### Setup

Let's start with a little housekeeping to make sure we have the latest versions of everything we need.

Run the cell by (1) clicking the play symbol that appears to the left of In[] when you hover over it, (2) clicking the 'Run cell' button in the toolbar above, or (3) pressing Control + Enter on your keyboard.

#  Prepare Dataset

We'll begin by loading the reviews dataset, and padding the reviews so all reviews have the same length.  Each review is represented as an array of numbers, where each number represents an indexed word.  Training data for both Local Mode and Hosted Training must be saved as files, so we'll also save the transformed data to files.

In [1]:
import os
from keras.preprocessing import sequence
from keras.datasets import imdb


max_features = 20000
maxlen = 400

# Unfortunately an update in numpy broke the imdb.load_data functionality of tensorflow/keras. We will use a quick hack shown here to make this work again:
# https://stackoverflow.com/questions/55890813/how-to-fix-object-arrays-cannot-be-loaded-when-allow-pickle-false-for-imdb-loa
import numpy as np
np_load_old = np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

np.load = np_load_old



print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Using TensorFlow backend.



Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
25000 train sequences
25000 test sequences
x_train shape: (25000, 400)
x_test shape: (25000, 400)
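To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does to a single review with its default settings (zeros added at the front, and over-long reviews cut from the front). The helper name `pad_pre` and the word indices are hypothetical and used only for illustration; the real function works on whole batches and returns a NumPy array.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence (maxlen must be > 0):
    keep the last `maxlen` items, then left-pad with `value` up to `maxlen`."""
    truncated = seq[-maxlen:]  # 'pre' truncation drops items from the front
    return [value] * (maxlen - len(truncated)) + truncated


short_review = [1, 14, 22]                    # hypothetical word indices
long_review = [1, 14, 22, 16, 43, 530, 973]   # hypothetical word indices

print(pad_pre(short_review, 5))  # [0, 0, 1, 14, 22]
print(pad_pre(long_review, 5))   # [22, 16, 43, 530, 973]
```

In the notebook, the same idea is applied with `maxlen=400`, which is why both arrays end up with shape `(25000, 400)`.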


In [2]:
import os

data_dir = os.path.join(os.getcwd(), 'sentiment-files/data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'sentiment-files/data/train')
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(os.getcwd(), 'sentiment-files/data/test')
os.makedirs(test_dir, exist_ok=True)

csv_test_dir = os.path.join(os.getcwd(), 'sentiment-files/data/csv-test')
os.makedirs(csv_test_dir, exist_ok=True)

In [3]:
import numpy as np

np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
np.save(os.path.join(test_dir, 'x_test.npy'), x_test)
np.save(os.path.join(test_dir, 'y_test.npy'), y_test)
np.savetxt(os.path.join(csv_test_dir, 'csv-test.csv'), np.array(x_test[:100], dtype=np.int32), fmt='%d', delimiter=",")
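The CSV file written by `np.savetxt` above holds one review per line, as comma-separated integers. The sketch below reproduces that layout with the standard library only, using two short hypothetical rows (length 5 rather than 400) so the format is easy to see.

```python
import csv
import io

# Hypothetical padded reviews; the real rows each hold 400 integers.
rows = [[0, 0, 1, 14, 22], [0, 1, 9, 7, 3]]

# Write one review per line, comma-delimited, as savetxt does with fmt='%d'.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading a line back is just a split on commas plus an int conversion.
parsed = [[int(tok) for tok in line.split(',')] for line in csv_text.splitlines()]
print(parsed == rows)  # True
```

This plain format is what allows the file to be submitted later with a `text/csv` content type.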

# Local Mode Training

Amazon SageMaker’s Local Mode training feature is a convenient way to make sure your code is working as expected before moving on to full scale, hosted training. With Local Mode, you can run quick tests with just a sample of training data, and/or a small number of epochs (passes over the full training set), while avoiding the time and expense of attempting full scale hosted training using possibly buggy code.  

To train in Local Mode, docker-compose or nvidia-docker-compose (for GPU) must be installed on the notebook instance. Running the following script installs whichever one is needed and configures the notebook environment for you.

In [4]:
!/bin/bash ./sentiment-files/setup.sh

SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


The next step is to set up a TensorFlow Estimator for Local Mode training. A key parameter for the Estimator is `train_instance_type`, which is the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local_gpu` to invoke Local Mode training on the GPU, or to `local` if the instance only has a CPU. Other parameters of note are the algorithm's hyperparameters, which are passed in as a dictionary, and a Boolean parameter indicating that we are using Script Mode.

In [5]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

model_dir = '/opt/ml/model'
train_instance_type = 'local'
hyperparameters = {'epochs': 1, 'batch_size': 128}
local_estimator = TensorFlow(entry_point='sentiment-files/sentiment.py',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-keras-sentiment',
                       framework_version='1.12.0',
                       py_version='py3',
                       script_mode=True)
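The training script `sentiment-files/sentiment.py` is not shown in this notebook. As a rough sketch of how a Script Mode script usually receives its settings, the example below assumes the documented SageMaker behavior of passing the hyperparameters as command-line arguments and exposing each input channel's local directory through environment variables such as `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST`. The function `parse_args` and the paths used in the demo call are illustrative, not taken from the actual script.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Parse hyperparameters and data locations the way a Script Mode
    training script typically would (a sketch, not the real sentiment.py)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags.
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--model_dir', type=str, default=env.get('SM_MODEL_DIR'))
    # Channel directories default to the values in the environment.
    parser.add_argument('--train', type=str, default=env.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=env.get('SM_CHANNEL_TEST'))
    args, _ = parser.parse_known_args(argv)  # ignore any flags we don't know
    return args


# Demo with explicit arguments and a stand-in environment (hypothetical paths).
args = parse_args(
    ['--epochs', '1', '--batch_size', '128', '--model_dir', '/opt/ml/model'],
    env={'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
         'SM_CHANNEL_TEST': '/opt/ml/input/data/test'},
)
print(args.epochs, args.batch_size, args.train)
```

Because the script reads everything from arguments and environment variables, the same file can run unchanged in Local Mode, in hosted training, or outside SageMaker.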

Now we'll briefly train the model in Local Mode.  Since this is just to make sure the code is working, we'll train for only one epoch.  (Note that on a CPU-based notebook instance, this one epoch will take at least 3 or 4 minutes.)  As you'll see from the logs below the cell when training is complete, even when trained for only one epoch, the accuracy of the model on training data is already at almost 80%.  

In [6]:
inputs = {'train': f'file://{train_dir}',
          'test': f'file://{test_dir}'}

local_estimator.fit(inputs)

Creating tmpoposzn85_algo-1-l17zl_1 ... done
Attaching to tmpoposzn85_algo-1-l17zl_1
algo-1-l17zl_1  | 2020-03-28 08:28:14,496 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-l17zl_1  | 2020-03-28 08:28:14,502 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-l17zl_1  | 2020-03-28 08:28:14,655 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-l17zl_1  | 2020-03-28 08:28:14,673 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-l17zl_1  | 2020-03-28 08:28:14,687 sagemaker-containers INFO     Invoking user script
algo-1-l17zl_1  | 
algo-1-l17zl_1  | Training Env:
algo-1-l17zl_1  | 
algo-1-l17zl_1  | {
algo-1-l17zl_1  |     "additional_framework_parameters": {},
algo-1-l17zl_1  |     "channel_input_dirs": {
algo-1-l17zl_1  |  

#  Hosted Training

After we've confirmed our code seems to be working using Local Mode training, we can move on to use SageMaker's hosted training, which uses compute resources separate from your notebook instance.  Hosted training spins up one or more instances (cluster) for training, and then tears the cluster down when training is complete. In general, hosted training is preferred for doing actual training, especially for large-scale, distributed training. Before starting hosted training, the data must be uploaded to S3. 

In [7]:
s3_prefix = 'ml-immersion-day/sentiment-files'

traindata_s3_prefix = '{}/data/train'.format(s3_prefix)
testdata_s3_prefix = '{}/data/test'.format(s3_prefix)

train_s3 = sagemaker.Session().upload_data(path='./sentiment-files/data/train/', key_prefix=traindata_s3_prefix)
test_s3 = sagemaker.Session().upload_data(path='./sentiment-files/data/test/', key_prefix=testdata_s3_prefix)

inputs = {'train':train_s3, 'test': test_s3}
print(inputs)

{'train': 's3://sagemaker-us-east-1-300998013710/ml-immersion-day/sentiment-files/data/train', 'test': 's3://sagemaker-us-east-1-300998013710/ml-immersion-day/sentiment-files/data/test'}
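The S3 URIs printed above are simply the session's default bucket joined with the key prefix we supplied. The sketch below rebuilds the training URI by hand; `s3_uri` is a hypothetical helper, and the bucket name is copied from the output above rather than looked up.

```python
def s3_uri(bucket, key_prefix):
    """Join a bucket name and key prefix into an s3:// URI."""
    return 's3://{}/{}'.format(bucket, key_prefix.strip('/'))


s3_prefix = 'ml-immersion-day/sentiment-files'
bucket = 'sagemaker-us-east-1-300998013710'  # default bucket seen in the output above

train_uri = s3_uri(bucket, '{}/data/train'.format(s3_prefix))
print(train_uri)
# s3://sagemaker-us-east-1-300998013710/ml-immersion-day/sentiment-files/data/train
```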


With the training data now in S3, we're ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except that `train_instance_type` is now set to an ML instance type rather than a local type. Additionally, we've set the number of epochs to a number greater than one for actual training, as opposed to just testing the code.

In [8]:
#train_instance_type = 'ml.p3.2xlarge'
train_instance_type = 'ml.c5.4xlarge'
hyperparameters = {'epochs': 10, 'batch_size': 128}

estimator = TensorFlow(entry_point='sentiment-files/sentiment.py',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-keras-sentiment',
                       framework_version='1.12.0',
                       py_version='py3',
                       script_mode=True)

With the change in training instance type and increase in epochs, we simply call `fit` to start the actual hosted training.  At the end of hosted training, you'll see from the logs below the cell that accuracy on the training set has greatly increased, and accuracy on the validation set is around 90%.  The model may be overfitting now (less able to generalize to data it has not yet seen), even though we are employing dropout as a regularization technique.  In a production situation, further investigation would be necessary.

Training time should be around 15 minutes.

In [9]:
estimator.fit(inputs)

2020-03-28 08:34:15 Starting - Starting the training job...
2020-03-28 08:34:17 Starting - Launching requested ML instances.........
2020-03-28 08:35:50 Starting - Preparing the instances for training...
2020-03-28 08:36:39 Downloading - Downloading input data...
2020-03-28 08:37:09 Training - Training image download completed. Training in progress..
2020-03-28 08:37:10,840 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-03-28 08:37:10,844 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-03-28 08:37:26,412 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-03-28 08:37:26,424 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-03-28 08:37:26,433 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
2020-03-28 08:46:50,215 sagemaker-containers INFO     Reporting training SUCCESS

2020-03-28 08:47:21 Uploading - Uploading generated training model
2020-03-28 08:47:21 Completed - Training job completed
Training seconds: 642
Billable seconds: 642
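One simple way to follow up on the overfitting concern is to compare training and validation accuracy epoch by epoch. The sketch below does this with two hypothetical helpers, `best_epoch` and `generalization_gap`. The accuracy values are made up purely for illustration and are not taken from the training run above.

```python
def best_epoch(val_acc):
    """Return the 1-based epoch with the highest validation accuracy."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1


def generalization_gap(train_acc, val_acc):
    """Training accuracy minus validation accuracy for each epoch."""
    return [round(t - v, 3) for t, v in zip(train_acc, val_acc)]


# Made-up per-epoch accuracies, for illustration only.
train_acc = [0.78, 0.90, 0.94, 0.97]
val_acc = [0.85, 0.89, 0.90, 0.89]

print(best_epoch(val_acc))                     # 3
print(generalization_gap(train_acc, val_acc))  # [-0.07, 0.01, 0.04, 0.08]
```

A gap that keeps widening after validation accuracy has peaked is a common sign that further epochs mostly memorize the training set.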


# Batch Prediction


If our use case requires individual predictions in near real-time, SageMaker hosted endpoints can be created. Hosted endpoints also can be used for pseudo-batch prediction, but the process is more involved than simply using SageMaker's Batch Transform feature, which is designed for large-scale, asynchronous batch inference.

To use Batch Transform, first we must upload to Amazon S3 some test data in CSV format to be transformed.

In [10]:
csvtestdata_s3_prefix = '{}/data/csv-test'.format(s3_prefix)
csvtest_s3 = sagemaker.Session().upload_data(path='./sentiment-files/data/csv-test/', key_prefix=csvtestdata_s3_prefix)
print(csvtest_s3)

s3://sagemaker-us-east-1-300998013710/ml-immersion-day/sentiment-files/data/csv-test


A Transformer object must be set up to describe the Batch Transform job, including the amount and type of inference hardware to be used.  Then the actual transform job itself is started with a call to the `transform` method of the Transformer.

In [11]:
transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.xlarge')
transformer.transform(csvtest_s3, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

Waiting for transform job: tf-keras-sentiment-2020-03-28-08-34-15--2020-03-28-08-47-34-511
.....................
INFO:__main__:starting services
INFO:__main__:using default model name: model
INFO:__main__:tensorflow serving model config: 
model_config_list: {
  config: {
    name: "model",
    base_path: "/opt/ml/model",
    model_platform: "tensorflow"
  }
}

INFO:__main__:nginx config: 
load_module modules/ngx_http_js_module.so;

worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr info;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/json;
  access_log /dev/stdout combined;
  js_include tensorflow-serving.js;

  upstream tfs_upstream {
    server localhost:10001;
  }

  upstream gunicorn_upstream {
    server unix:/tmp/gunicorn.sock fail_timeo

We can now download the batch predictions from S3 to the local filesystem on the notebook instance; the predictions are contained in a file with a .out extension, and are embedded in JSON.  Next we'll load the JSON and examine the predictions, which are confidence scores from 0.0 to 1.0 where numbers close to 1.0 indicate positive sentiment, while numbers close to 0.0 indicate negative sentiment.

In [12]:
import json

batch_output = transformer.output_path
!mkdir -p sentiment-files/batch_data/output
!aws s3 cp --recursive $batch_output/ sentiment-files/batch_data/output/

with open('./sentiment-files/batch_data/output/csv-test.csv.out', 'r') as f:
    jstr = json.load(f)
    results = [float('%.3f'%(item)) for sublist in jstr['predictions'] for item in sublist]
    print(results)

download: s3://sagemaker-us-east-1-300998013710/tf-keras-sentiment-2020-03-28-08-34-15--2020-03-28-08-47-34-511/csv-test.csv.out to sentiment-files/batch_data/output/csv-test.csv.out
[0.0, 1.0, 0.988, 0.921, 1.0, 1.0, 1.0, 0.0, 0.984, 0.995, 0.999, 0.0, 0.0, 0.567, 1.0, 0.0, 1.0, 0.982, 0.0, 0.0, 1.0, 1.0, 0.059, 1.0, 0.788, 1.0, 0.0, 0.996, 1.0, 0.0, 1.0, 0.031, 0.086, 0.0, 0.0, 0.0, 1.0, 1.0, 0.005, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.998, 0.0, 0.0, 0.0, 0.504, 0.0, 0.009, 1.0, 1.0, 0.998, 0.05, 0.513, 1.0, 0.0, 0.0, 0.001, 0.0, 1.0, 0.0, 0.0, 1.0, 0.028, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.001, 0.0, 1.0, 0.619, 0.0, 0.471, 0.79, 0.92, 0.979, 0.0, 0.0, 0.0, 0.868, 0.0, 0.989, 1.0, 1.0, 0.0, 0.968, 1.0, 0.0, 0.993, 1.0, 0.0, 0.0]
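The parsing code above implies that the `.out` file holds a JSON object whose `predictions` key maps to a list of one-element lists, one per review. The sketch below parses a small hypothetical payload of that shape; the raw scores are invented, and `round` is used here as a close stand-in for the `'%.3f'` formatting in the cell above.

```python
import json

# Hypothetical payload with the same shape as the Batch Transform output.
raw = '{"predictions": [[0.00012], [0.99987], [0.56712]]}'

payload = json.loads(raw)
# Flatten the nested lists and keep three decimal places per score.
scores = [round(score, 3) for row in payload['predictions'] for score in row]
print(scores)  # [0.0, 1.0, 0.567]
```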


Now let's look at the text of some actual reviews to see the predictions in action.  First, we have to convert the integers representing the words back to the words themselves by using a reversed dictionary.  Next we can decode the reviews, taking into account that the first 3 indices were reserved for "padding", "start of sequence", and "unknown". These reserved indices decode to `?`, so we also strip the run of `?` placeholders that the padding leaves at the start of each review.
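Before running this on the real data, here is a toy version of the decoding logic. The three-word `word_index` and the encoded review are hypothetical; the real dictionary comes from `imdb.get_word_index()`. The offset of 3 and the regular expression match the cell that follows.

```python
import re

# Hypothetical tiny vocabulary; the real one has tens of thousands of words.
word_index = {'the': 1, 'movie': 2, 'great': 3}
reverse_word_index = {idx: word for word, idx in word_index.items()}


def decode(encoded):
    """Map indices back to words (shifted by 3 for the reserved indices),
    then strip the leading '?' placeholders produced by padding."""
    words = [reverse_word_index.get(i - 3, '?') for i in encoded]
    return re.sub(r'^[\?\s]+', '', ' '.join(words))


# padding, padding, start-of-sequence, 'the', 'movie', unknown, 'great'
encoded = [0, 0, 1, 4, 5, 2, 6]
print(decode(encoded))  # the movie ? great
```

Only the leading placeholders are removed; an unknown word in the middle of a review still shows up as `?`, as in the second review further down.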

In [13]:
import re

regex = re.compile(r'^[\?\s]+')

word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
first_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[3]])
regex.sub('', first_decoded_review)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


"i generally love this type of movie however this time i found myself wanting to kick the screen since i can't do that i will just complain about it this was absolutely idiotic the things that happen with the dead kids are very cool but the alive people are absolute idiots i am a grown man pretty big and i can defend myself well however i would not do half the stuff the little girl does in this movie also the mother in this movie is reckless with her children to the point of neglect i wish i wasn't so angry about her and her actions because i would have otherwise enjoyed the flick what a number she was take my advise and fast forward through everything you see her do until the end also is anyone else getting sick of watching movies that are filmed so dark anymore one can hardly see what is being filmed as an audience we are impossibly involved with the actions on the screen so then why the hell can't we have night vision"

Overall, this review looks fairly negative.  Let's compare the actual label with the prediction:

In [14]:
def get_sentiment(score):
    return 'positive' if score > 0.5 else 'negative' 

print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[3]), 
                                                                                  get_sentiment(results[3])))

Labeled sentiment for this review is negative, predicted sentiment is positive


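Here the prediction does not agree with the label: the review is labeled negative, but the model scored it 0.921, which maps to positive sentiment. Misclassifications like this are expected from time to time given validation accuracy of around 90%.  Let's now examine another review: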

In [15]:
second_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[10]])
regex.sub('', second_decoded_review)

"inspired by hitchcock's strangers on a train concept of two men swapping murders in exchange for getting rid of the two people messing up their lives throw ? from the train is an original and very inventive comedy take on the idea it's a credit to danny devito that he both wrote and starred in this minor comedy gem br br anne ramsey is the mother who inspires the film's title and it's understandable why she gets under the skin of danny devito with her sharp tongue and relentlessly putting him down for any minor ? billy crystal is the writer who's wife has stolen his book idea and is now being ? as a great new author even appearing on the oprah show to in ? he should be enjoying thus devito gets the idea of swapping murders to rid themselves of these nuisance factors br br of course everything and anything can happen when writer carl reiner lets his imagination roam with ? ideas for how the plot develops and it's amusing all the way through providing plenty of laughs and chuckles along

In [16]:
print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[10]), 
                                                                                  get_sentiment(results[10])))

Labeled sentiment for this review is positive, predicted sentiment is positive


For this review, the prediction agrees with the label for the test data.  Note that there is no need to clean up any Batch Transform resources:  after the transform job is complete, the cluster used to make inferences is torn down.
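Two reviews are a very small sample, so a broader sanity check is to measure agreement across the whole batch. The sketch below defines a hypothetical helper, `batch_accuracy`, that reuses the 0.5 threshold from `get_sentiment`; the scores and labels in the demo are invented for illustration.

```python
def get_sentiment(score):
    return 'positive' if score > 0.5 else 'negative'


def batch_accuracy(scores, labels):
    """Fraction of predictions whose thresholded sentiment matches the label."""
    matches = sum(get_sentiment(s) == get_sentiment(l)
                  for s, l in zip(scores, labels))
    return matches / len(labels)


# Invented scores and labels, for illustration only.
scores = [0.1, 0.9, 0.8, 0.7]
labels = [0, 1, 1, 0]
print(batch_accuracy(scores, labels))  # 0.75
```

In this notebook, the equivalent call would be `batch_accuracy(results, y_test[:100])`, since the CSV file submitted to Batch Transform contained the first 100 test reviews.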

Now that we've reviewed some sample predictions as a sanity check, we're finished.  Of course, in a typical production situation, the data science project lifecycle is iterative, with repeated cycles of refining the model using a tool such as Amazon SageMaker's Automatic Model Tuning feature, and gathering more data.  