# Keras Text Classification with Word Embeddings

In this notebook we'll look at how to use a Convolutional Neural Net (CNN) in Keras to perform text classification in Amazon SageMaker.  The CNN is based on the Keras example published at https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html.  Two aspects of SageMaker will be demonstrated.  First, we'll use SageMaker's Script Mode with a prebuilt TensorFlow/Keras container, along with a training script similar to one you would use outside SageMaker. Second, we'll see how to use SageMaker channels to load word embeddings into the container for training.  

**Prerequisite:  run this notebook on a GPU type (P3 or P2) notebook instance.**  

We'll begin with some necessary imports.

In [None]:
import os
import sys
import numpy as np
import tensorflow as tf

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# Prepare Dataset and Embeddings

First, we download the 20 Newsgroups dataset.  

In [None]:
!mkdir -p ./20_newsgroup
!wget -O ./20_newsgroup/news20.tar.gz http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
!tar -xvzf ./20_newsgroup/news20.tar.gz

The next step is to download the GloVe word embeddings that we will load in the neural net.

In [None]:
!mkdir -p ./glove.6B
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d ./glove.6B

Next, we build an index that maps each word to its GloVe embedding vector.

In [None]:
BASE_DIR = ''
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')
TEXT_DATA_DIR = os.path.join(BASE_DIR, '20_newsgroup')
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

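# Map each token to its 100-dimensional GloVe vector.
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        # Each line holds a token followed by its space-separated coefficients.
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs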

print('Found %s word vectors.' % len(embeddings_index))
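
To make the file layout concrete, here is a minimal sketch that runs the same parsing loop over two made-up lines held in memory. The tokens and the three-dimensional vectors are invented for illustration and are not taken from the GloVe file.

```python
import io

import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its
# space-separated vector components (3 dimensions here for brevity).
sample = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "cat 0.4 0.5 -0.6\n"
)

toy_index = {}
for line in sample:
    values = line.split()
    # values[0] is the token, the rest are the coefficients as strings.
    toy_index[values[0]] = np.asarray(values[1:], dtype="float32")

print(sorted(toy_index))       # tokens that were parsed
print(toy_index["cat"].shape)  # one fixed-length vector per token
```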

The 20 Newsgroups text also must be preprocessed.  For example, the labels for each sample must be extracted and mapped to a numeric index.

In [None]:
texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                args = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}
                with open(fpath, **args) as f:
                    t = f.read()
                    i = t.find('\n\n')  # skip header
                    if 0 < i:
                        t = t[i:]
                    texts.append(t)
                labels.append(label_id)

print('Found %s texts.' % len(texts))
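
The header-skipping step can be checked in isolation. The sketch below wraps it in a hypothetical helper, `strip_header`, and applies it to an invented message; note that, like the loop above, it keeps the blank line that separates header from body.

```python
def strip_header(message):
    """Drop everything before the first blank line, as the loop above does."""
    i = message.find("\n\n")
    # find() returns -1 when there is no blank line; in that case (or when
    # the text starts with one) the message is returned unchanged.
    return message[i:] if i > 0 else message


# Invented sample message: two header lines, a blank line, then the body.
raw = "From: someone@example.com\nSubject: test\n\nBody line one.\nBody line two.\n"
body = strip_header(raw)
print(repr(body))
```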

We can use Keras text preprocessing functions to tokenize the text, limit the sequence length of the samples, and pad shorter sequences as necessary.  Additionally, the preprocessed dataset must be split into training and validation sets.

In [None]:
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
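
As a rough illustration of the padding and splitting steps, the sketch below uses plain NumPy. `pad_pre` is a hypothetical helper that mimics what we understand to be the default behavior of `pad_sequences` (zeros added on the left, overlong sequences cut from the front); it is not the Keras implementation. The seed value is likewise an assumption, added only to show how the shuffle could be made repeatable.

```python
import numpy as np


def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep the last `maxlen` ids of long sequences."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        kept = list(seq)[-maxlen:]        # truncation drops ids from the front
        if kept:
            out[row, -len(kept):] = kept  # padding zeros end up on the left
    return out


toy = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy)

# Seeded shuffle so the train/validation split is repeatable across runs.
rng = np.random.RandomState(0)
indices = rng.permutation(10)
n_val = int(0.2 * len(indices))
train_idx, val_idx = indices[:-n_val], indices[-n_val:]
print(len(train_idx), len(val_idx))
```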

Now that the dataset text preprocessing is complete, we can map the 20 Newsgroups vocabulary words to their GloVe embedding vectors for use in an embedding matrix.  This matrix will be loaded in an Embedding layer of the neural net.

In [None]:
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    # Words missing from the GloVe index keep their all-zeros row.
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print('Number of words:', num_words)
print('Shape of embeddings:', embedding_matrix.shape)
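
A toy version of the same loop shows which rows end up filled. The three-word vocabulary and two-dimensional vectors below are invented; row 0 stays empty because the tokenizer's word indices start at 1, leaving 0 for padding.

```python
import numpy as np

# Invented vocabulary with 1-based ids, in the style of Tokenizer.word_index.
toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}
# Invented pretrained vectors; "zzyzx" is deliberately missing.
toy_embeddings = {"the": np.array([0.1, 0.2]), "cat": np.array([0.3, 0.4])}

max_words, dim = 20, 2
n_rows = min(max_words, len(toy_word_index)) + 1  # +1 for the padding row 0
matrix = np.zeros((n_rows, dim))
for word, i in toy_word_index.items():
    if i > max_words:
        continue
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = vector

print(matrix)
```

Rows 1 and 2 carry the pretrained vectors, while row 0 (padding) and row 3 (no pretrained vector) remain all zeros.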

Now both the data and the embedding matrix are saved to disk to prepare for training.

Note that we will not load the original, unprocessed set of embeddings into the training container. Instead, to save loading time, we save only the embedding matrix, which at 16MB is much smaller than the original set of embeddings at 892MB.  Depending on how large a set of embeddings you need for other use cases, you might save further space by saving the embeddings with joblib (more efficient than the original Python pickle), and/or by saving them in half precision (fp16) instead of full precision and restoring them to full precision after they are loaded.

In [None]:
data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'data/train')
os.makedirs(train_dir, exist_ok=True)

val_dir = os.path.join(os.getcwd(), 'data/val')
os.makedirs(val_dir, exist_ok=True)

embedding_dir = os.path.join(os.getcwd(), 'data/embedding')
os.makedirs(embedding_dir, exist_ok=True)

np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
np.save(os.path.join(val_dir, 'x_val.npy'), x_val)
np.save(os.path.join(val_dir, 'y_val.npy'), y_val)
np.save(os.path.join(embedding_dir, 'embedding.npy'), embedding_matrix)
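
The half-precision option mentioned above could look roughly like the following sketch. It uses a random stand-in matrix and temporary files rather than the real embedding matrix, and the file names are hypothetical. It only measures file size and round-trip error; whether the loss of precision is acceptable for a given model is a separate question.

```python
import os
import tempfile

import numpy as np

# Random stand-in for an embedding matrix (float64, like np.zeros by default).
rng = np.random.RandomState(0)
matrix = rng.uniform(-1, 1, size=(1000, 100))

with tempfile.TemporaryDirectory() as tmp:
    full_path = os.path.join(tmp, "embedding_fp64.npy")
    half_path = os.path.join(tmp, "embedding_fp16.npy")

    np.save(full_path, matrix)                     # original precision
    np.save(half_path, matrix.astype(np.float16))  # half precision on disk

    full_size = os.path.getsize(full_path)
    half_size = os.path.getsize(half_path)

    # Restore to a wider dtype after loading, before handing it to the model.
    restored = np.load(half_path).astype(np.float32)

max_err = float(np.abs(restored - matrix).max())
print(full_size, half_size, max_err)
```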

# Local Mode Training

Amazon SageMaker’s Local Mode training feature is a convenient way to make sure your training code is working as expected before moving on to full-scale, hosted training. To train in Local Mode, docker-compose or nvidia-docker-compose (for GPU) must be installed on the notebook instance. Running the following script will install docker-compose or nvidia-docker-compose and configure the notebook environment for you.

In [None]:
!/bin/bash ./setup.sh

Next, we'll set up a TensorFlow Estimator for Local Mode training. One of the key parameters for an Estimator is the train_instance_type, which is the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local_gpu` to invoke Local Mode training on the GPU (or to `local` if the instance is CPU only). Other parameters of note are the algorithm’s hyperparameters, which are passed in as a dictionary, and a Boolean parameter indicating that we are using Script Mode.

**Be sure to run this example on a GPU type (P3 or P2) notebook instance.**

In [None]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

model_dir = '/opt/ml/model'
train_instance_type = 'local_gpu'
hyperparameters = {'epochs': 20, 
                   'batch_size': 128, 
                   'num_words': num_words,
                   'word_index_len': len(word_index),
                   'labels_index_len': len(labels_index),
                   'embedding_dim': EMBEDDING_DIM,
                   'max_sequence_len': MAX_SEQUENCE_LENGTH
                  }

local_estimator = TensorFlow(entry_point='train.py',
                       source_dir='code',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-20-newsgroups',
                       framework_version='1.13',
                       py_version='py3',
                       script_mode=True)

In [None]:
inputs = {'train': f'file://{train_dir}',
          'val': f'file://{val_dir}',
          'embedding': f'file://{embedding_dir}'}

local_estimator.fit(inputs)
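
The contents of `train.py` are not shown in this notebook, so the following is only a sketch of how a Script Mode training script might pick up its inputs, not the actual script. It assumes that hyperparameters arrive as command-line flags named after the dictionary keys, and that each key of the `inputs` dictionary becomes a channel whose directory inside the container is exposed through an `SM_CHANNEL_<NAME>` environment variable. The function name `parse_args` and the default values are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Hypothetical argument parsing for a Script Mode training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters are passed as flags named after the dict keys.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--model_dir", type=str, default="/opt/ml/model")

    # One channel per key in the inputs dict; the environment variables
    # are only set inside the training container, so they default to None here.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--val", default=os.environ.get("SM_CHANNEL_VAL"))
    parser.add_argument("--embedding", default=os.environ.get("SM_CHANNEL_EMBEDDING"))

    # parse_known_args ignores flags this sketch does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "20", "--batch_size", "128", "--num_words", "20001"])
print(args.epochs, args.batch_size, args.model_dir)
```

Under these assumptions, the script would then load `embedding.npy` from the directory in `args.embedding` and the training arrays from `args.train`, the same file names saved earlier in this notebook.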

# Local Mode Endpoint

While Amazon SageMaker’s Local Mode training is a useful way to confirm that your training code works before moving on to full-scale training, it is equally useful to have a convenient way to test your model locally before incurring the time and expense of deploying it to production.  You could fetch the SavedModel artifact or a model checkpoint saved in Amazon S3 and load it in your notebook for testing.  However, it is easier to let the Amazon SageMaker SDK do this work for you.

The Estimator object from the training job can be used to deploy a model with a single line of code.  With one exception, this code is the same as the code you would use to deploy to production.  In particular, all you need to do is invoke the local Estimator's `deploy` method, and similarly to Local Mode training, specify the instance type as either `local_gpu` or `local`.

In [None]:
local_predictor = local_estimator.deploy(initial_instance_count=1, instance_type='local_gpu')

To get predictions from the local endpoint, simply invoke the Predictor's `predict` method.

In [None]:
local_results = local_predictor.predict(x_val[:10])['predictions'] 

As a sanity check, the predictions can be compared against the actual target values, which are the numbers 0 to 19 representing the twenty different newsgroup categories.

In [None]:
print('predictions: \t{}'.format(np.argmax(local_results, axis=1)))
print('target values: \t{}'.format(np.argmax(y_val[:10], axis=1)))
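
The comparison relies on `argmax` to turn both the per-class scores returned by the endpoint and the one-hot targets back into class ids. The sketch below shows this on invented three-class data and adds a simple accuracy figure, which the cell above does not compute.

```python
import numpy as np

# Invented per-class scores for three samples (each row sums to 1).
toy_scores = np.array([[0.1, 0.7, 0.2],
                       [0.8, 0.1, 0.1],
                       [0.2, 0.3, 0.5]])
# Invented one-hot targets, in the style produced by to_categorical.
toy_targets = np.array([[0, 1, 0],
                        [1, 0, 0],
                        [0, 1, 0]])

pred_ids = np.argmax(toy_scores, axis=1)   # highest-scoring class per row
true_ids = np.argmax(toy_targets, axis=1)  # position of the 1 in each row
accuracy = float(np.mean(pred_ids == true_ids))

print('predictions: \t{}'.format(pred_ids))
print('target values: \t{}'.format(true_ids))
print('accuracy: \t{:.2f}'.format(accuracy))
```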

To avoid having the TensorFlow Serving container running indefinitely on this notebook instance, shut it down gracefully by calling the `delete_endpoint` method of the Predictor object.

In [None]:
local_predictor.delete_endpoint()

# SageMaker Hosted Training

Now that we've confirmed our code is working locally, we can move on to SageMaker's hosted training functionality. Hosted training is preferred for actual training, especially large-scale, distributed training. Before starting hosted training, the data must be uploaded to S3. The word embedding matrix will be uploaded as well.  We'll do that now, and confirm the upload was successful.

In [None]:
s3_prefix = 'tf-20-newsgroups'

traindata_s3_prefix = '{}/data/train'.format(s3_prefix)
valdata_s3_prefix = '{}/data/val'.format(s3_prefix)
embeddingdata_s3_prefix = '{}/data/embedding'.format(s3_prefix)

# Reuse one session for all three uploads.
session = sagemaker.Session()
train_s3 = session.upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)
val_s3 = session.upload_data(path='./data/val/', key_prefix=valdata_s3_prefix)
embedding_s3 = session.upload_data(path='./data/embedding/', key_prefix=embeddingdata_s3_prefix)

inputs = {'train': train_s3, 'val': val_s3, 'embedding': embedding_s3}
print(inputs)

We're now ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except that `train_instance_type` is set to an ML instance type instead of `local_gpu`. With this change, we simply call `fit` to start the actual hosted training.

In [None]:
train_instance_type = 'ml.p3.2xlarge'
hyperparameters = {'epochs': 20, 
                   'batch_size': 128, 
                   'num_words': num_words,
                   'word_index_len': len(word_index),
                   'labels_index_len': len(labels_index),
                   'embedding_dim': EMBEDDING_DIM,
                   'max_sequence_len': MAX_SEQUENCE_LENGTH
                  }

estimator = TensorFlow(entry_point='train.py',
                       source_dir='code',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-20-newsgroups',
                       framework_version='1.13',
                       py_version='py3',
                       script_mode=True)

In [None]:
estimator.fit(inputs)

# SageMaker Hosted Endpoint

If we wish to deploy the model to production, the next step is to create a SageMaker hosted endpoint. The endpoint will retrieve the TensorFlow SavedModel created during training and deploy it within a TensorFlow Serving container. This all can be accomplished with one line of code, an invocation of the Estimator's deploy method.

In [None]:
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')

We can now compare the predictions generated by the endpoint with a sample of the validation data.  The results are shown as integer labels from 0 to 19 corresponding to the 20 different newsgroups.

In [None]:
results = predictor.predict(x_val[:10])['predictions'] 

print('predictions: \t{}'.format(np.argmax(results, axis=1)))
print('target values: \t{}'.format(np.argmax(y_val[:10], axis=1)))

When you're finished with your review of this notebook, you can delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
sagemaker.Session().delete_endpoint(predictor.endpoint)