# Introduction
Author: Tyler Koon

This document serves as a brief overview of my stage gate one project for CSCI 494-007, National Security Data Science in Action. To demonstrate my competency using Python and related data analysis libraries, I implemented an off-the-shelf neural network using TensorFlow and Keras. Given my inexperience with these two high-level libraries, I followed the "Basic Text Classification" developer guide from the official TensorFlow documentation. This project involved building, training, and evaluating a sequential model for binary sentiment analysis of movie reviews.

In [1]:
import matplotlib.pyplot as plt, os, re, shutil, string, tensorflow as tf
from tensorflow.keras import layers, losses

## Downloading and Preparing the Dataset
This project uses the Large Movie Review Dataset, which was published by Andrew Maas et al. at the Stanford AI Research Laboratory. The dataset contains 50,000 movie reviews sourced from the Internet Movie Database (IMDb); each sample consists of the plain text of a review and a binary label indicating whether its sentiment is positive or negative. These data are already split into balanced training and testing sets of 25,000 samples each. Additionally, these data are already clean and require little preprocessing save for vectorization of the review text and encoding of the labels.

The following code chunk downloads and extracts the Large Movie Review Dataset from the Stanford Artificial Intelligence Laboratory. This is achieved using the Keras `get_file` utility, which downloads a resource from a given URL (or reuses a locally cached copy) and, with `untar=True`, also extracts the contents of the compressed archive.

In [2]:
# The source URL for the Large Movie Review Dataset
datasetUrl = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and extract the dataset
dataset = tf.keras.utils.get_file("aclImdb_v1", datasetUrl, untar=True, cache_dir=".", cache_subdir="")

# Get a reference to the dataset directory
datasetDir = os.path.join(os.path.dirname(dataset), "aclImdb")

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [3]:
os.listdir(datasetDir)

['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [4]:
# Get a reference to the training data directory
trainDir = os.path.join(datasetDir, "train")

os.listdir(trainDir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

The training data consists of three things: the underlying reviews (split between the `neg`, `pos`, and `unsup` directories), the URLs for those reviews (split between `urls_neg.txt`, `urls_pos.txt`, and `urls_unsup.txt`), and what appear to be pre-computed bag-of-words features for the reviews (split between `labeledBow.feat` and `unsupBow.feat`).

In [5]:
# Get a reference to the file containing positive review 1181
sampleFile = os.path.join(trainDir, 'pos/1181_9.txt')

# Read in and print the review
with open(sampleFile) as posReview1181:
    print(posReview1181.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


## Loading the Dataset
The following section handles reading in and formatting the data into a structure suitable for preprocessing and consumption.
Because the dataset is organized into directories (i.e., each review is stored in a directory named after its class), we can easily load it using the `text_dataset_from_directory` utility.

According to the TensorFlow documentation, it is best practice to split data into three subsets: training, validation, and test sets. The training set is used to fit the model. The validation set is used to evaluate the model and tune its parameters during development, and the test set is used for the final evaluation of model performance. Ideally, these three subsets are disjoint from one another.
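The idea behind a seeded validation split can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: the helper name `split_training_validation` is hypothetical, and integer IDs stand in for the review files. The point it shows is that using the same seed for both subsets yields the same shuffle, so the training and validation portions never overlap.

```python
import random


def split_training_validation(samples, validation_split, seed):
    """Shuffle with a fixed seed, then hold out the last fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    # Number of samples reserved for validation (e.g., 20% of the total)
    validation_size = int(len(shuffled) * validation_split)
    split_point = len(shuffled) - validation_size

    return shuffled[:split_point], shuffled[split_point:]


# Integer IDs stand in for the 25,000 labeled training reviews
training_ids, validation_ids = split_training_validation(range(25000), 0.2, seed=42)
print(len(training_ids), len(validation_ids))
```

Because the shuffle depends only on the seed, calling the helper twice with the same seed reproduces the same partition.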

In [6]:
removeDir = os.path.join(trainDir, "unsup")

# Remove the 'unsup' directory to support the format required by `text_dataset_from_directory`
shutil.rmtree(removeDir)

In [7]:
# Define training parameters
batchSize = 32

# Use a seed for reproducibility (42 is the seed used in the TensorFlow guide)
seed = 42

# Create a new Dataset object that stores the training data
rawTrainingDataset = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batchSize,
    validation_split=0.2,
    subset="training",
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [8]:
# Take a look at the data
for text_batch, label_batch in rawTrainingDataset.take(1):
    for i in range(3):
        print("Review", text_batch.numpy()[i])
        print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 

Each data instance has two attributes: the underlying review and its classification (positive or negative). These labels are encoded as:

In [9]:
print("Label 0 corresponds to:", rawTrainingDataset.class_names[0])
print("Label 1 corresponds to:", rawTrainingDataset.class_names[1])

Label 0 corresponds to: neg
Label 1 corresponds to: pos


In [10]:
# Create a new Dataset object that stores the validation data
rawValidationDataset = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batchSize,
    validation_split=0.2,  # Fraction of the training data held out for validation (20%)
    subset="validation",
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [11]:
# Create a new Dataset object that stores the test data (already defined for this dataset)
rawTestDataset = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/test",
    batch_size=batchSize
)

Found 25000 files belonging to 2 classes.


In [12]:
print(len(rawValidationDataset) + len(rawTrainingDataset) == len(rawTestDataset))

True
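Note that `len()` of a batched dataset counts batches rather than individual reviews, so the check above compares batch counts. The arithmetic can be reproduced with a small sketch (the helper `batch_count` is hypothetical and only mirrors the batch size of 32 used above):

```python
import math


def batch_count(sample_count, batch_size):
    """Number of batches needed to hold every sample (the last batch may be partial)."""
    return math.ceil(sample_count / batch_size)


training_batches = batch_count(20000, 32)
validation_batches = batch_count(5000, 32)
test_batches = batch_count(25000, 32)

print(training_batches, validation_batches, test_batches)
print(training_batches + validation_batches == test_batches)
```

The equality holds here because 20,000 is an exact multiple of 32, so only the validation set ends with a partial batch. With other sizes the batch counts could differ by one even when the sample counts match.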


## Preparing the Data for Training
In this section, the training and validation data are prepared for the training step. This involves three stages: standardization, tokenization, and vectorization.

Standardization, or preprocessing, involves putting the data into a standard format. In our case, this includes lowercasing the text and removing punctuation and HTML tags that otherwise add complexity to each individual sample. In other settings it could include many other actions, such as dropping missing data or removing irrelevant columns.
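As a rough illustration of this step, the same three operations can be written in plain Python. This is a sketch operating on ordinary strings rather than tensors, and the function name `standardize_text` is hypothetical:

```python
import re
import string


def standardize_text(text):
    """Lowercase the text, replace HTML line breaks with spaces, and strip punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")

    # Build a character class containing every ASCII punctuation mark
    punctuation_pattern = "[%s]" % re.escape(string.punctuation)
    return re.sub(punctuation_pattern, "", stripped_html)


print(standardize_text("It wasn't funny.<br /><br />*1/2 (out of four)"))
```

Each `<br />` tag becomes a single space, so two consecutive tags leave two spaces behind; splitting on whitespace during tokenization makes this harmless.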

Tokenization splits each standardized string into individual tokens, often by splitting on whitespace. This structures the content into individual chunks that can be analyzed semantically.

Vectorization involves converting tokens into vectors which can be fed into the neural network.
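Tokenization and integer vectorization can be sketched together in plain Python. The following is a simplified illustration under stated assumptions, not the `TextVectorization` implementation: index 0 is reserved for padding, index 1 for out-of-vocabulary tokens, the remaining indices go to the most frequent tokens, and the helper names are hypothetical.

```python
from collections import Counter

PAD_INDEX = 0  # Fills the unused tail of short sequences
OOV_INDEX = 1  # Stands in for tokens missing from the vocabulary


def build_vocabulary(texts, max_tokens):
    """Map the most frequent whitespace-separated tokens to indices starting at 2."""
    counts = Counter(token for text in texts for token in text.split())

    # Two slots of the token budget are reserved for padding and OOV
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(most_common)}


def vectorize(text, vocabulary, sequence_length):
    """Convert text to a fixed-length list of indices (truncated or zero-padded)."""
    indices = [vocabulary.get(token, OOV_INDEX) for token in text.split()]
    indices = indices[:sequence_length]
    return indices + [PAD_INDEX] * (sequence_length - len(indices))


vocabulary = build_vocabulary(["good good good film film plot"], max_tokens=4)
print(vocabulary)
print(vectorize("good plot film", vocabulary, sequence_length=5))
```

In the example, `plot` falls outside the four-token budget and is therefore mapped to the out-of-vocabulary index, while the trailing zeros are padding.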


> Note: It is important that the training, validation, and testing datasets are processed identically and under the same conditions to prevent training-testing skew (a divergence in performance between training and testing).

In [13]:
# Custom standardization function that converts the text to lowercase, removes HTML elements, and removes punctuation
def standardizeData(input):
    # Convert all text to lowercase
    lowercase = tf.strings.lower(input)

    # Remove HTML elements
    strippedHTML = tf.strings.regex_replace(lowercase, '<br />', ' ')

    # Return the processed string with all punctuation removed
    return tf.strings.regex_replace(strippedHTML, '[%s]' % re.escape(string.punctuation), '')

In [14]:
# Define preprocessing parameters
maxFeatures = 10000
sequenceLength = 250

# Create a layer that handles standardization (using the custom `standardizeData` function), tokenization, and vectorization
vectorizeLayer = layers.TextVectorization(
    standardize=standardizeData,
    max_tokens=maxFeatures,
    output_mode='int',
    output_sequence_length=sequenceLength
)

In [15]:
# Fit the preprocessing layer's state on text-only (unlabeled) training data
trainText = rawTrainingDataset.map(lambda x, y: x)
vectorizeLayer.adapt(trainText)

In [16]:

def vectorizeText(text, label):
    text = tf.expand_dims(text, -1)
    return vectorizeLayer(text), label

# Test out the preprocessing on a batch of training data
textBatch, labelBatch = next(iter(rawTrainingDataset))
firstReview, firstLabel = textBatch[0], labelBatch[0]

print("Review", firstReview)
print("Label", rawTrainingDataset.class_names[firstLabel])
print("Vectorized Review", vectorizeText(firstReview, firstLabel))

Review tf.Tensor(b'Great movie - especially the music - Etta James - "At Last". This speaks volumes when you have finally found that special someone.', shape=(), dtype=string)
Label neg
Vectorized Review (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[  86,   17,  260,    2,  222,    1,  571,   31,  229,   11, 2418,
           1,   51,   22,   25,  404,  251,   12,  306,  282,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
       

The preprocessing layer appears to work as intended. The input review is processed, and then vectorized such that each token is represented with an index.

## Configuring Dataset for Performance
This section introduces some methods for improving training performance, specifically in regard to I/O operations. This includes `Caching` and `Prefetching`.

Caching will keep the data loaded in memory, which makes retrieval of the data during the training process more expedient and efficient. In the event that there is not enough memory to fit the entire dataset, caching can also produce a more efficient on-disk cache that will still result in improved I/O performance.

Prefetching will asynchronously read in the next batch of input while the current batch is being passed through the model. This reduces the 'step time' and consequently the overall training time.

> Note: Using tf.data.AUTOTUNE enables automatic tuning of the target parameter

The TensorFlow guide goes into more detail on how data pipelines can be optimized for better training performance: https://www.tensorflow.org/guide/data_performance#overview
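The idea behind prefetching can be illustrated with a background thread and a bounded queue. This is only a conceptual sketch in plain Python, not how `tf.data` implements it; the generator name `prefetch` and its `buffer_size` parameter are hypothetical.

```python
import queue
import threading


def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable` while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    end_marker = object()  # Signals that the producer has run out of items

    def producer():
        for item in iterable:
            buffer.put(item)  # Blocks while the buffer is full
        buffer.put(end_marker)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # Blocks until the producer has something ready
        if item is end_marker:
            return
        yield item


for batch in prefetch(iter(range(5)), buffer_size=2):
    print("consuming batch", batch)
```

While the consumer is busy with one item, the producer thread is free to load the next, which is the overlap that reduces step time.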

In [17]:
# Preprocess each dataset
trainDataset = rawTrainingDataset.map(vectorizeText)
validationDataset = rawValidationDataset.map(vectorizeText)
testDataset = rawTestDataset.map(vectorizeText)

In [18]:
# Constant to indicate that a hyperparameter should be automatically tuned
AUTOTUNE = tf.data.AUTOTUNE

# Enable caching and prefetching for the datasets
trainDataset = trainDataset.cache().prefetch(buffer_size=AUTOTUNE)
validationDataset = validationDataset.cache().prefetch(buffer_size=AUTOTUNE)
testDataset = testDataset.cache().prefetch(buffer_size=AUTOTUNE)

## Defining the Model
Here, we define the Sequential model using Keras. For this project, the model will contain five layers:

Embedding: Takes the integer-encoded reviews and looks up the vector for each word-index. This is essentially a translation layer that produces the vectors with which the model will learn.

GlobalAveragePooling1D: Returns a fixed-length vector for each example which is the average of the sequence dimension. This allows the model to handle input vectors of varying lengths.

Dense: A densely connected layer that outputs to a single dimension.
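To make the data flow through these three layers concrete, here is a minimal forward pass in plain Python. It is a sketch with tiny, made-up weights (a three-row embedding table with two dimensions instead of 16), and the helper names are hypothetical. Dropout is omitted because it is inactive at inference time, and this sketch averages over every position, padding included.

```python
def embed(token_indices, embedding_table):
    """Look up one embedding vector per token index."""
    return [embedding_table[index] for index in token_indices]


def global_average_pool(vectors):
    """Average the vectors over the sequence dimension, giving one fixed-length vector."""
    sequence_length = len(vectors)
    return [sum(column) / sequence_length for column in zip(*vectors)]


def dense(vector, weights, bias):
    """Single-unit dense layer: weighted sum of the inputs plus a bias (a raw logit)."""
    return sum(x * w for x, w in zip(vector, weights)) + bias


# Hypothetical parameters: row i of the table is the embedding for token index i
embedding_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
dense_weights = [2.0, -1.0]
dense_bias = 0.5

embedded = embed([1, 2, 0, 0], embedding_table)
pooled = global_average_pool(embedded)
logit = dense(pooled, dense_weights, dense_bias)
print(pooled, logit)
```

Whatever the sequence length, the pooled vector has as many entries as the embedding dimension, which is why the dense layer can have a fixed number of weights.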

> Note: The `Dropout` layers are used to prevent over-fitting by randomly setting input values to 0 during training (and scaling the remaining values up by 1/(1 - rate) so that the expected sum of the values remains unchanged): https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout
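The scaling behaviour described in the note can be sketched as follows. This is an illustrative plain-Python version of so-called inverted dropout, not the Keras implementation; the function name `dropout` and the explicit `rng` argument are assumptions made for reproducibility.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate` and scale the survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale for value in values]


rng = random.Random(0)
inputs = [1.0] * 1000
outputs = dropout(inputs, rate=0.2, rng=rng)

# The mean stays close to the original mean because the survivors are scaled up
print(sum(inputs) / len(inputs), sum(outputs) / len(outputs))
```

For any single batch the sum is only approximately preserved; it is the expected value that is unchanged.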

In [19]:
embeddingDim = 16

# Define a seed for reproducibility
tf.random.set_seed(12345)

# Define the model and its layers
model = tf.keras.Sequential([
    layers.Embedding(maxFeatures + 1, embeddingDim),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1)
])

# Store initial model weights for future tests
initial_model_weights = model.get_weights()

# Report a model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 16)          160016    
                                                                 
 dropout (Dropout)           (None, None, 16)          0         
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense (Dense)               (None, 1)                 17        
                                                                 
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
__________________________________________________
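The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the values used above (a vocabulary of `maxFeatures + 1` entries, an embedding dimension of 16, and a single dense unit); the helper names are hypothetical.

```python
def embedding_parameter_count(vocabulary_size, embedding_dim):
    """One trainable vector of length `embedding_dim` per vocabulary entry."""
    return vocabulary_size * embedding_dim


def dense_parameter_count(input_dim, units):
    """One weight per input for each unit, plus one bias per unit."""
    return input_dim * units + units


embedding_params = embedding_parameter_count(10000 + 1, 16)
dense_params = dense_parameter_count(16, 1)
total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

The dropout and pooling layers contribute no parameters, so the embedding table accounts for almost all of the model's weights.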

## Loss Function and Optimizer

This section defines the loss function and optimizer that will guide the training / optimization process. The TensorFlow guide uses binary cross-entropy for the loss function (https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy) and the Adam optimizer (a stochastic gradient descent optimizer based on 'adaptive estimation of first-order and second-order moments': https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam). I have only really worked with optimizing for MSE; however, binary cross-entropy appears to be a better-suited loss for working with, well, binary labels.
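For a single example, binary cross-entropy computed directly from a logit can be sketched in plain Python. This is an illustration of the formula under the assumption that the model outputs raw logits (as with `from_logits=True`); the helper names are hypothetical, and the numerically stable form used here is a common rearrangement rather than a claim about the library's internals.

```python
import math


def sigmoid(logit):
    """Convert a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_crossentropy_from_logit(logit, label):
    """Stable form of -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(logit)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def predict_label(logit, threshold=0.0):
    """A logit above 0.0 corresponds to a probability above 0.5."""
    return 1 if logit > threshold else 0


# A confident, correct prediction has a low loss; a confident, wrong one has a high loss
print(binary_crossentropy_from_logit(2.0, 1), binary_crossentropy_from_logit(2.0, 0))
print(sigmoid(0.0), predict_label(2.0), predict_label(-2.0))
```

Since `sigmoid(0.0)` is 0.5, thresholding the logit at 0.0 is equivalent to thresholding the predicted probability at 0.5, which is why the accuracy metric above uses `threshold=0.0`.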

In [20]:
# Define the loss function and optimizer to facilitate the training process
model.compile(
    loss = losses.BinaryCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=tf.metrics.BinaryAccuracy(threshold=0.0)
)

## Training

Train the model for 20 epochs.

In [21]:
epochs = 20

# Train the model
history = model.fit(
    trainDataset,
    validation_data = validationDataset,
    epochs = epochs
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [22]:
# Compute the loss and accuracy for the model on the testing data
loss, accuracy = model.evaluate(testDataset)



## Evaluation
Here, we consider the evaluation metrics produced in the training and validation processes. These values are retrieved from the history object that is returned from the `fit()` method on the model. They correspond to the loss and metrics specified when the model was compiled.

Assuming these data are suitable for training the sequential model previously defined, I expect the loss to decrease with each epoch (it should bottom out as the model converges to locally optimal weights), and the accuracy should simultaneously increase. In an ideal world, this behavior would be the same for both the training and validation data...

In [23]:
# Create a dictionary that stores the evaluation metrics from training/validation
historyDict = history.history
historyDict.keys()

dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])

In [None]:
acc = historyDict["binary_accuracy"]
validationAccuracy = historyDict["val_binary_accuracy"]

loss = historyDict["loss"]
validationLoss = historyDict["val_loss"]

epochs = range(1, len(acc) + 1)

# Plot the training and validation loss with respect to the number of epochs
plt.plot(epochs, loss, 'bo', label = "Training Loss")
plt.plot(epochs, validationLoss, 'b', label = "Validation Loss")
plt.title("Training and Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

plt.show()

# Plot the training and validation accuracy with respect to the number of epochs
plt.plot(epochs, acc, 'bo', label = "Training Accuracy")
plt.plot(epochs, validationAccuracy, 'b', label = "Validation Accuracy")
plt.title("Training and Validation Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()

plt.show()


It appears as though the behavior of the loss and accuracy metrics generally matches my expectations, with the loss decreasing and the accuracy increasing across epochs. That said, the metrics for the validation data appear to plateau before the metrics for the training data. This indicates that after about four or five epochs, the model stops improving on new data and starts to overfit the training data. One possible way to deal with this is to terminate the training process once improvement in accuracy/loss for the validation data diminishes. This option is explored in the 'Further Exploration' section below.
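One way to make this observation quantitative is to locate the epoch with the lowest validation loss and to track the gap between validation and training loss. The sketch below uses made-up loss values purely for illustration (they are not results from this notebook), and the helper names are hypothetical.

```python
def best_validation_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(validation_losses)), key=lambda i: validation_losses[i])
    return best_index + 1


def generalization_gaps(training_losses, validation_losses):
    """Validation loss minus training loss per epoch; a growing gap suggests overfitting."""
    return [v - t for t, v in zip(training_losses, validation_losses)]


# Hypothetical history, shaped like the dictionary returned by `history.history`
example_history = {
    "loss": [0.66, 0.55, 0.45, 0.38, 0.33, 0.30],
    "val_loss": [0.61, 0.50, 0.42, 0.38, 0.37, 0.38],
}

best_epoch = best_validation_epoch(example_history["val_loss"])
gaps = generalization_gaps(example_history["loss"], example_history["val_loss"])
print(best_epoch, gaps)
```

In this made-up example the training loss keeps falling after the validation loss has bottomed out, which is the pattern described above.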


## Further Exploration (Implementing Early Stopping)
To address the problem of overfitting, I implemented the `tf.keras.callbacks.EarlyStopping` callback, which monitors an evaluation metric and terminates the fitting process when that metric stops improving. I tested this callback when monitoring both the loss and the accuracy. The configuration shown here monitors the validation loss, with the patience set by `early_stopping_patience`. The results are presented below:
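The stopping rule itself is simple enough to sketch in plain Python. This is a simplified illustration of patience-based early stopping on a metric that should decrease, not the Keras callback; the function name and the loss values are hypothetical.

```python
import math


def epochs_before_stopping(validation_losses, patience):
    """Return how many epochs run before training stops for lack of improvement."""
    best_loss = math.inf
    epochs_without_improvement = 0

    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # A new best value resets the patience counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch

    # The metric kept improving often enough to use every epoch
    return len(validation_losses)


# Made-up validation losses with a temporary setback at epochs 4 and 5
example_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44]
for patience in (1, 2, 3):
    print(patience, epochs_before_stopping(example_losses, patience))
```

A small patience stops at the first setback, while a larger patience tolerates short plateaus at the cost of a few extra epochs.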

In [None]:
# Reset the metrics and restore the initial weights so that training restarts from scratch
model.reset_metrics()
model.set_weights(initial_model_weights)


In [None]:
# Define the patience for the Early Stopping callback (number of epochs
# with no improvement after which the fitting process is terminated)
early_stopping_patience = 1

# Define the Early Stopping callback to monitor the validation loss. That is, when the
# validation loss stops improving for early_stopping_patience epochs, it will terminate the fitting process
early_stopping = tf.keras.callbacks.EarlyStopping(monitor = "val_loss", patience = early_stopping_patience, mode="min")

In [None]:
early_stopping_epochs = 20

# Refit the model with the Early Stopping callback
early_stopping_history = model.fit(
    trainDataset,
    validation_data = validationDataset,
    callbacks = [early_stopping],
    epochs = early_stopping_epochs
)

In [None]:
# Create a dictionary that stores the evaluation metrics from training/validation
early_stop_history_dict = early_stopping_history.history
early_stop_history_dict.keys()

# Retrieve the metrics from the history dict
early_stop_acc = early_stop_history_dict["binary_accuracy"]
early_stop_validation_accuracy = early_stop_history_dict["val_binary_accuracy"]

early_stop_loss = early_stop_history_dict["loss"]
early_stop_validation_loss = early_stop_history_dict["val_loss"]

# Get the number of epochs that the early-stopping model trained for
early_stopping_epochs = range(1, len(early_stopping_history.history["loss"]) + 1)

# Plot the training and validation loss with respect to the number of epochs
plt.plot(early_stopping_epochs, early_stop_loss, 'bo', label = "Training Loss")
plt.plot(early_stopping_epochs, early_stop_validation_loss, 'b', label = "Validation Loss")
plt.title("Training and Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

plt.show()

# Plot the training and validation accuracy with respect to the number of epochs
plt.plot(early_stopping_epochs, early_stop_acc, 'bo', label = "Training Accuracy")
plt.plot(early_stopping_epochs, early_stop_validation_accuracy, 'b', label = "Validation Accuracy")
plt.title("Training and Validation Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()

plt.show()

In [None]:
# Get the loss and accuracy values for this new model on the testing data
loss, accuracy = model.evaluate(testDataset)

As evident in the graphs above, implementing the early stopping callback resulted in similar accuracy and loss while training for fewer epochs. There is clearly less deviation between the validation and training metrics, indicating less overfitting. This exemplifies the benefits of optimizing the training process, and raises the question of how other TensorFlow/Keras callbacks might be used to train models more efficiently.