# Sentiment Analysis with Recurrent Neural Networks

## Introduction

Sentiment analysis, also known as opinion mining, is a natural language processing task that involves determining the sentiment expressed in a piece of text. This project focuses on building a sentiment analysis model using Recurrent Neural Networks (RNNs) for classifying movie reviews as either positive or negative.

### Objective

The primary goal of this project is to create a model capable of accurately predicting sentiment in movie reviews. The model is trained on a dataset of labeled movie reviews, learning the nuances of language and context associated with positive and negative sentiments.

### Model Architecture

The sentiment analysis model is constructed using an RNN architecture, specifically a Long Short-Term Memory (LSTM) network. LSTMs are well-suited for sequence data, making them effective for capturing dependencies and patterns in textual information over time. The architecture includes an embedding layer for word representation, LSTM layers for sequence modeling, and a fully connected layer with a sigmoid activation for binary classification.

### Dataset

The dataset used for training and evaluation consists of movie reviews labeled with sentiment polarity. Each review is preprocessed and tokenized to create a numerical representation that the model can process. The dataset is split into training, validation, and test sets to ensure robust evaluation of the model's performance.

### Training Process

The model is trained using a binary cross-entropy loss function and optimized with the Adam optimizer. During training, the model undergoes backpropagation to adjust its parameters based on the calculated loss. To prevent overfitting, dropout layers are incorporated. The training process is monitored through both training and validation loss.

### Evaluation

The trained model is evaluated on a separate test set to assess its generalization to unseen data. Performance metrics such as test loss and accuracy are calculated to gauge the effectiveness of the sentiment analysis model.

### Inference on New Reviews

The notebook includes a function for making predictions on new reviews. Users can input a movie review, and the model will provide a sentiment prediction, indicating whether the review is likely positive or negative.

--- 

In [2]:
# Import libraries
import numpy as np

# Read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [3]:
# Print first 2000 characters in reviews
print(reviews[:2000])
print()

# Print first 20 characters in labels
print(labels[:20])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

## Data pre-processing

Next, I import the 'punctuation' string constant from the 'string' module and print the punctuation characters it contains. I then convert the previously loaded 'reviews' text to lowercase and remove all punctuation, on the assumption that punctuation carries little information about whether a review is positive or negative. The cleaned text is stored in the variable 'all_text'.

In [4]:
# Import libraries
from string import punctuation

# Print list of punctuation characters
print(punctuation)

# Clean up the reviews: lowercase, remove punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation]) # get rid of punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
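
To make the effect of this cleaning step concrete, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption, not taken from the dataset). It also tries `str.translate`, a standard-library alternative to the list comprehension that should give the same result.

```python
from string import punctuation

# Hypothetical sample text, not from reviews.txt
sample = "Great movie!!! It's a 10/10, wouldn't you say?"

# Same approach as the cell above: lowercase, then drop punctuation characters
lowered = sample.lower()
cleaned = ''.join(c for c in lowered if c not in punctuation)

# Alternative: a translation table that deletes every punctuation character
table = str.maketrans('', '', punctuation)
cleaned_alt = lowered.translate(table)

print(cleaned)
print(cleaned == cleaned_alt)
```

Note the side effect: contractions and tokens such as `10/10` are fused into single words (`its`, `1010`), which is a trade-off of this simple approach.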


In this code snippet, the cleaned text stored in 'all_text' is processed further. It first splits the text on newline characters, so that each element of the resulting list, 'reviews_split', is one review. It then joins these reviews back into one long string separated by spaces, stored again in 'all_text'. Finally, it splits that string on whitespace to produce a flat list of individual words, stored in 'words'. The per-review list is used later to tokenize each review, while the flat word list is used to build the vocabulary.

In [5]:
# Split by new lines: one review per line
reviews_split = all_text.split('\n') # list of individual reviews
all_text = ' '.join(reviews_split) # join all the reviews together into one long string

# Create a list of words
words = all_text.split() # split long string by spaces into words 

Here I print the first 30 words to check whether the pre-processing was successful.

In [6]:
# Validate pre-processing on first 30 words
words[:30]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me']

### Encoding the words (Tokenizing)

This code segment focuses on building a vocabulary and tokenizing the reviews. It uses the 'Counter' class from the 'collections' library to create a dictionary ('counts') mapping words to their frequencies in the 'words' list. The vocabulary is then established by sorting the words based on frequency. Each word is assigned a unique integer, excluding 0 for padding purposes, forming a dictionary named 'vocab_to_int.' The code subsequently tokenizes each review in the 'reviews_split' list by mapping the words to their corresponding integers using 'vocab_to_int,' and the results are stored in 'reviews_ints.' Finally, the code prints statistics about the vocabulary size and displays the tokenized representation of the first review.

In [7]:
# Import libraries
from collections import Counter

## Build a dictionary that maps words to integers (do not use 0, as it is reserved for padding later on)
counts = Counter(words) # map each word to its number of occurrences
vocab = sorted(counts, key=counts.get, reverse=True) # sort words from most to least frequent
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)} # enumerate from 1, so the most common word gets the smallest integer

# Alternative way to create vocab_to_int
# words_unique = np.unique(words)
# vocab_to_int = dict(zip(words_unique, range(1, len(words_unique)+1)))

# Use the dict to tokenize each review in reviews_split
reviews_ints = []
for review in reviews_split:
    word_token = [vocab_to_int[word] for word in review.split()] # map each word in the review to its integer
    reviews_ints.append(word_token) # append the tokenized review to reviews_ints

# Print stats about vocabulary
print('Unique words: ', len(vocab_to_int))  # expect roughly 74,000
print()

# Print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  74072

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]
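
The same vocabulary-building and encoding steps can be traced by hand on a toy corpus. This is a minimal sketch; the two tiny "reviews" and all `toy_` names are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: two tiny "reviews"
toy_reviews = ["the good the bad", "the good"]
toy_words = " ".join(toy_reviews).split()

# Count occurrences, then number words from 1 in descending order of frequency
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# Encode each review as a list of integers
toy_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]

print(toy_vocab_to_int)
print(toy_ints)
```

Here 'the' occurs three times and therefore receives index 1, while 0 stays unused so that it can serve as the padding value later.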


### Encoding the labels (Tokenizing)

This code section deals with encoding labels for sentiment analysis. It splits the 'labels' string at new lines, creating a list called 'labels_split.' Then, it encodes the labels into numerical values, assigning 1 for 'positive' and 0 for 'negative.' The encoded labels are stored in the 'encoded_labels' numpy array. To verify the correctness of the encoding, the code prints the original and encoded labels for the first 10 samples. This process prepares the sentiment labels in a format suitable for machine learning models, where positive sentiments are represented by 1 and negative sentiments by 0.

In [8]:
# Encode labels
labels_split = labels.split('\n') # split at new lines
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split]) # convert 'positive' to 1 and 'negative' to 0

# Check if labels are correct
for i in range(10):
    print(f' Label: {labels_split[i]}, Encoded label: {encoded_labels[i]}')

 Label: positive, Encoded label: 1
 Label: negative, Encoded label: 0
 Label: positive, Encoded label: 1
 Label: negative, Encoded label: 0
 Label: positive, Encoded label: 1
 Label: negative, Encoded label: 0
 Label: positive, Encoded label: 1
 Label: negative, Encoded label: 0
 Label: positive, Encoded label: 1
 Label: negative, Encoded label: 0


### Removing Outliers

This code snippet focuses on analyzing the lengths of tokenized reviews. It uses the 'Counter' class to create a dictionary ('review_lens') that maps the lengths of the tokenized reviews to their frequencies. The code then prints the number of zero-length reviews and the maximum review length in the dataset. This analysis provides insights into the distribution of review lengths, identifying any anomalies such as zero-length reviews and the longest review in terms of the number of tokens. Understanding the distribution of review lengths is crucial for identifying outliers in preprocessing and setting appropriate parameters when training machine learning models.

In [9]:
# Outlier review stats
review_lens = Counter([len(x) for x in reviews_ints]) # Dict of review lengths and their frequency
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


This code snippet removes outlier reviews with zero length. It starts by printing the initial number of reviews in the 'reviews_ints' list. Then, using a list comprehension, it collects the indices of all reviews with non-zero length and keeps only those entries in both 'reviews_ints' and 'encoded_labels', so that reviews and labels stay aligned. Finally, the code prints the number of reviews after the removal. This ensures that reviews with no content are excluded from the dataset before training.

In [10]:
# Print num reviews before removing outliers
print('Number of reviews before removing outliers: ', len(reviews_ints))

# Keep only the reviews/labels whose review has non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0] # indices of reviews with at least one token
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx] # drop zero-length reviews
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx]) # drop the matching labels

# Print num reviews after removing outliers
print('Number of reviews after removing outliers: ', len(reviews_ints))

Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000
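
One plausible source of the single empty review (an assumption, since the data files are not shown in full) is a trailing newline at the end of the file: `str.split('\n')` then yields an empty final element. The sketch below uses made-up strings to show that effect and how filtering by index keeps reviews and labels aligned.

```python
# Hypothetical file contents ending with a newline
toy_text = "good film\nbad film\n"
toy_labels = "positive\nnegative\n"

toy_reviews = toy_text.split('\n')       # last element is an empty string
toy_label_list = toy_labels.split('\n')  # last element is an empty string

# Keep only the positions whose review has at least one word, for both lists
keep = [i for i, r in enumerate(toy_reviews) if len(r.split()) != 0]
kept_reviews = [toy_reviews[i] for i in keep]
kept_labels = [1 if toy_label_list[i] == 'positive' else 0 for i in keep]

print(len(toy_reviews), len(kept_reviews))
print(kept_labels)
```

Filtering both lists with the same index list is what prevents a label from drifting onto the wrong review.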


## Padding & Truncating sequences

The provided code defines a function named `pad_features` that pads or truncates a given list of tokenized reviews (`reviews_ints`) to a specified sequence length (`seq_length`). The function initializes a NumPy array called 'features' with zeros, where the number of rows corresponds to the number of reviews, and the number of columns is set to the desired sequence length. 

For each review in the input list, the function populates the corresponding row in the 'features' array. If the review is longer than the specified sequence length, only its first `seq_length` tokens are kept. If the review is shorter, its tokens are right-aligned in the row, so the beginning of the row remains padded with zeros.

In summary, this function ensures that all reviews have a consistent length by either padding them with zeros or truncating them to the desired sequence length, making the data suitable for input to machine learning models with fixed-size inputs.

In [11]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is left-padded with 0's 
        or truncated to the input seq_length.
    '''
    # Getting the correct rows x cols shape:
    # this creates an array of zeros with shape (number of reviews, sequence length)
    features = np.zeros((len(reviews_ints), # rows = number of reviews
                        seq_length), # cols = desired sequence length
                        dtype=int)

    # Copy each review into its row of the zero array
    for i, row in enumerate(reviews_ints):
        # Longer reviews are truncated to their first seq_length tokens ([:seq_length]).
        # Shorter reviews are right-aligned ([-len(row):]), so the leading zeros remain as padding.
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features

In this code snippet, the implemented `pad_features` function is tested. It sets the desired sequence length (`seq_length`) to 200 and applies the function to the tokenized reviews stored in `reviews_ints`. The result is stored in the 'features' variable.

The subsequent test statements check whether the number of rows in the 'features' array matches the number of reviews in the original 'reviews_ints' list and whether each row in 'features' has the specified sequence length of 200.

Finally, the code prints the first 10 values of the first 30 rows of the 'features' array to provide a glimpse of the transformed data. Rows that start with zeros belong to reviews shorter than 200 tokens (left-padded), while rows that start with non-zero tokens belong to reviews of at least 200 tokens.

In [12]:
# Test implementation
seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)

# Test statements
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# Print first 10 values of the first 30 rows
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0
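
The padding and truncation rule can also be checked on plain Python lists. This is a minimal sketch with a hypothetical helper, `pad_left`, that is meant to mirror the behaviour of `pad_features` for a single review; it is not the function used in this notebook.

```python
def pad_left(tokens, seq_length):
    """Left-pad with 0 or keep the first seq_length tokens (hypothetical helper)."""
    truncated = tokens[:seq_length]
    return [0] * (seq_length - len(truncated)) + truncated

short = pad_left([5, 6, 7], 5)               # shorter than seq_length: zeros go in front
long = pad_left([1, 2, 3, 4, 5, 6, 7], 5)    # longer than seq_length: the tail is cut off

print(short)
print(long)
```

One difference worth noting: this list version also handles an empty review, whereas the NumPy version above depends on every review having at least one token, because the slice `-0:` selects the whole row.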

## Training, Validation, Test

This code segment defines a fraction (`split_frac`) to determine the proportion of data to allocate for training. It then calculates the split index based on this fraction, dividing the data into training, validation, and test sets for both features and labels (`train_x`, `val_x`, `test_x`, `train_y`, `val_y`, `test_y`). The split is performed in such a way that 80% of the data is used for training, and the remaining 20% is equally divided between the validation and test sets.

Finally, the code prints out the shapes of the resultant feature data for each set, providing information on the number of samples and the sequence length. This step ensures transparency in understanding the distribution of data across training, validation, and test sets.

In [13]:
# Define fraction to keep for training
split_frac = 0.8

# Split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features)*split_frac) # get index at which to split: number of reviews times the split fraction (80%)
train_x, remaining_x = features[:split_idx], features[split_idx:] # use split index to split features into training and remaining
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5) # midpoint of the remaining data, used to split it into validation and test
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:] # first half becomes the validation set, second half the test set
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## Print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))


			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)
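
The split arithmetic can be verified in isolation. The helper below, `split_sizes`, is a hypothetical name introduced for this sketch; it only reproduces the index calculations from the cell above.

```python
def split_sizes(n_samples, split_frac=0.8):
    """Return (train, validation, test) sizes for the split used above (hypothetical helper)."""
    n_train = int(n_samples * split_frac)   # first split_frac of the data
    n_remaining = n_samples - n_train
    n_val = int(n_remaining * 0.5)          # first half of the remainder
    n_test = n_remaining - n_val            # second half of the remainder
    return n_train, n_val, n_test

sizes = split_sizes(25000)
print(sizes)
```

Because the split is made by position rather than at random, it relies on the order of reviews in the file not being skewed towards one class in any of the three segments.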


## DataLoaders and Batching

This code snippet demonstrates the preparation of PyTorch DataLoader objects for training, validation, and test datasets. It utilizes the `TensorDataset` and `DataLoader` classes from the `torch.utils.data` module. 

- Tensor datasets (`train_data`, `valid_data`, `test_data`) are created, converting NumPy arrays to PyTorch tensors.
- The batch size for the data loaders is set to 50.
- The data is shuffled during loading to prevent the model from learning patterns based on the ordering of the data.

These DataLoader objects (`train_loader`, `valid_loader`, `test_loader`) enable efficient loading of batches during model training, validation, and testing. The use of DataLoader facilitates easier iteration over the datasets in mini-batches, which is a common practice in deep learning.

In [14]:
# Import libraries
import torch
from torch.utils.data import TensorDataset, DataLoader

# Create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# Set batch size for data loaders
batch_size = 50

# Shuffle data, so that model doesn't learn anything about ordering of the data, and instead focuses on content
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In this code snippet, a single batch of training data is obtained by calling the built-in `iter` and `next` functions on the `train_loader`. The resulting batch consists of both features (`sample_x`) and labels (`sample_y`). The code then prints the shapes and contents of the sampled input features and labels.

This step provides a quick check on the dimensions of a batch, ensuring that the data loading process is functioning as expected. Understanding the shapes of input features and labels in a batch is crucial for designing and debugging deep learning models.

In [15]:
# Obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

# Print shape of batch of features and labels
print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[   1, 2797, 2912,  ..., 9251,    4, 1374],
        [   0,    0,    0,  ...,   35, 1069,  226],
        [   0,    0,    0,  ...,   43,   44,    8],
        ...,
        [   0,    0,    0,  ..., 1511,   71,  350],
        [   0,    0,    0,  ...,   16,  339,  572],
        [   0,    0,    0,  ...,   39,   10,  120]], dtype=torch.int32)

Sample label size:  torch.Size([50])
Sample label: 
 tensor([1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
        0, 1], dtype=torch.int32)
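
To illustrate what shuffled mini-batching amounts to, here is a minimal pure-Python sketch. It is not how `DataLoader` is implemented; `iter_batches` and its `seed` parameter are hypothetical and only mimic the idea of shuffling sample indices and slicing them into batches.

```python
import random

def iter_batches(n_samples, batch_size, shuffle=True, seed=0):
    """Yield lists of sample indices, one list per mini-batch (hypothetical helper)."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)  # reorder samples before batching
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iter_batches(n_samples=20000, batch_size=50))
print(len(batches), len(batches[0]))
```

With 20,000 training samples and a batch size of 50, every batch is full. The same holds for the 2,500-sample validation and test sets, which matters later because the hidden state is created for a fixed `batch_size`.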


---
# Sentiment Network with PyTorch

This code snippet checks whether a GPU is available for training by using the `torch.cuda.is_available()` function. If a GPU is available, it prints "Training on GPU." Otherwise, it prints "No GPU available, training on CPU."

This check is important as training deep learning models on GPUs can significantly accelerate the training process compared to using CPUs. It allows for parallel processing and handling large-scale computations more efficiently.

In [16]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


This code defines a SentimentRNN class, which is a recurrent neural network (RNN) model for sentiment analysis. The model is built using PyTorch's `nn.Module` class. Here's a breakdown of the key components:

- **Embedding Layer:** Converts word indices to dense vectors of fixed size (`embedding_dim`). This layer is needed because there are 74,000+ words in our vocabulary, and one-hot encoding that many classes would be massively inefficient. Instead, the embedding layer acts as a lookup table. I could also have trained an embedding with Word2Vec and loaded it here. Instead, I simply add an embedding layer, use it only for dimensionality reduction, and let the network learn the weights.
- **LSTM Layer:** Processes the embedded input sequences, capturing sequential dependencies.
- **Dropout Layer:** Introduces regularization by randomly setting a fraction of input units to zero during training.
- **Fully-Connected Layer:** Reduces the dimensionality to the output size (`output_size`).
- **Sigmoid Activation Function:** Converts the output to a probability between 0 and 1.

The `forward` method defines how input is processed through the layers during a forward pass. The `init_hidden` method initializes the hidden state and cell state of the LSTM layers to zeros.

The model is designed to work with GPU acceleration if available (`train_on_gpu` is `True`). The embedding layer is used to convert word indices to vectors, and the LSTM processes these vectors sequentially. The final output is obtained by passing through dropout, a fully-connected layer, and a sigmoid activation function.
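
The "lookup table" view of the embedding layer can be made concrete with a tiny example. The sketch below uses plain Python lists and a made-up 4 x 3 weight table (all values are assumptions) to show that multiplying a one-hot vector by the table selects the same row as indexing directly.

```python
# Hypothetical tiny embedding table: 4 "words" (rows), 3 dimensions (columns)
weights = [
    [0.0, 0.0, 0.0],  # index 0: padding
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

word_index = 2
via_one_hot = matvec(one_hot(word_index, len(weights)), weights)
via_lookup = weights[word_index]

print(via_one_hot == via_lookup)
```

With a vocabulary of 74,000+ words, the one-hot route would multiply a mostly-zero vector of that length for every token, while the lookup simply reads one row.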

In [17]:
# import libraries
import torch.nn as nn

# Defining the model
class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        # Defining parameters for the model
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # Embedding words for dimensionality reduction (alternative to one-hot encoding, which is inefficient for large vocabularies)
        # creates a lookup table that maps words to vectors of a specified size
        self.embedding = nn.Embedding(vocab_size, # number of rows: word integers
                                      embedding_dim) # number of columns: the embedding dimension
        
        # Defining the LSTM layer
        self.lstm = nn.LSTM(embedding_dim, # input size is equal to the output of the embedding layer (i.e. embedding dimension)
                            hidden_dim, # output size is equal to the hidden dimension/number of units in the hidden layer
                            n_layers,
                            dropout=drop_prob, 
                            batch_first=True) # True because we are using DataLoaders to batch our data like that
        
        # Define dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # Define linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        # Extracting the size of the first dimension of the tensor x and assigning it to the variable batch_size
        batch_size = x.size(0)

        # Embed text and pass through LSTM layer
        x = x.long() # converts the elements of the tensor x to a 64-bit integer data type (torch.int64 in the case of PyTorch).
        embeds = self.embedding(x) # run the input x through the embedding layer
        lstm_out, hidden = self.lstm(embeds, hidden) # run the embedding output through the LSTM layer
        
        # Stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # Pass through dropout and fully-connected layer
        out = self.dropout(lstm_out) # run the LSTM output through the dropout layer
        out = self.fc(out) # run the dropout output through the fully-connected layer
        
        # Activate: Sigmoid function
        sig_out = self.sig(out) # run the fully-connected output through the sigmoid function
        
        # Reshape to (batch_size, seq_length) and keep only the output at the last time step of each sequence
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # last time step's prediction for each review in the batch
        
        # Return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
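
To follow how tensor shapes change through `forward`, the sketch below traces them as plain tuples. `forward_shapes` is a hypothetical helper written for this illustration; it does no computation and assumes `batch_first=True` and `output_size=1`, as in the model above.

```python
def forward_shapes(batch_size, seq_length, embedding_dim, hidden_dim, output_size=1):
    """Trace the tensor shapes through SentimentRNN.forward (hypothetical helper)."""
    return {
        "input": (batch_size, seq_length),                   # integer tokens
        "embeds": (batch_size, seq_length, embedding_dim),   # after the embedding layer
        "lstm_out": (batch_size, seq_length, hidden_dim),    # one hidden vector per time step
        "stacked": (batch_size * seq_length, hidden_dim),    # after .view(-1, hidden_dim)
        "fc_out": (batch_size * seq_length, output_size),    # one score per time step
        "reshaped": (batch_size, seq_length * output_size),  # after .view(batch_size, -1)
        "last_step": (batch_size,),                          # after [:, -1]
    }

shapes = forward_shapes(batch_size=50, seq_length=200, embedding_dim=400, hidden_dim=256)
for name, shape in shapes.items():
    print(name, shape)
```

The model therefore produces a sigmoid score at every time step and returns only the last one per review, which is the point at which the LSTM has seen the whole sequence.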
        

## Instantiate the network

This code segment instantiates an object of the `SentimentRNN` class with specified hyperparameters. The model is configured with the following hyperparameters:

- `vocab_size`: The size of the vocabulary plus 1 for zero-padding.
- `output_size`: 1, representing the one output neuron for positive/negative sentiment classification.
- `embedding_dim`: 400, indicating the dimensionality of the word embeddings.
- `hidden_dim`: 256, specifying the number of hidden features in the LSTM layers.
- `n_layers`: 2, indicating the number of LSTM layers.

The instantiated model is named `net`, and the `print(net)` statement is used to display the architecture of the model, providing information about its layers and parameters. This model is suitable for sentiment analysis tasks on the provided dataset.


In [18]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1 # one output neuron/class score for positive/negative
embedding_dim = 400 # just a smaller representation of vocabulary of 70k words --> any value between like 200 and 500 should work
hidden_dim = 256 # 256 hidden features should be enough to distinguish between positive and negative reviews.
n_layers = 2 # 2 layers should be enough to distinguish between positive and negative reviews.

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
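
As a rough check on model size, the parameter count can be worked out from the layer dimensions printed above. This is a sketch under the assumption that each LSTM layer stores input weights, recurrent weights, and two bias vectors for each of its four gates, which is how PyTorch's `nn.LSTM` lays out its parameters; `lstm_layer_params` is a hypothetical helper.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and two bias vectors (hypothetical helper)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, embedding_dim, hidden_dim, output_size = 74073, 400, 256, 1

embedding_params = vocab_size * embedding_dim
lstm_params = (lstm_layer_params(embedding_dim, hidden_dim)   # first layer reads embeddings
               + lstm_layer_params(hidden_dim, hidden_dim))   # second layer reads hidden states
fc_params = hidden_dim * output_size + output_size            # weights plus bias
total_params = embedding_params + lstm_params + fc_params

print(embedding_params, lstm_params, fc_params, total_params)
```

Under these assumptions the embedding table accounts for roughly 96% of all parameters, so the vocabulary size and `embedding_dim` dominate the memory footprint.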


## Training

In this code snippet, the learning rate (`lr`) is defined and the loss function and optimizer are set up for training the sentiment analysis model.

- `criterion`: Binary Cross Entropy Loss (`nn.BCELoss()`), a suitable choice for binary classification problems like sentiment analysis where the goal is to predict whether the sentiment is positive or negative. This loss function applies cross-entropy to a single value between 0 and 1.

- `optimizer`: Adam optimizer (`torch.optim.Adam`) is chosen, which is a popular optimization algorithm. It is used to update the parameters of the neural network during training. The optimizer is applied to the model's parameters (`net.parameters()`) with the specified learning rate (`lr`).

In [19]:
# Define learning rate
lr=0.001

# Set loss and optimization functions
criterion = nn.BCELoss() # Binary Cross Entropy Loss, applies cross entropy loss to a single value between 0 and 1.
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
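
For intuition about what the loss measures, binary cross-entropy can be computed by hand. The sketch below is a plain-Python version of the formula, the mean of -[y*log(p) + (1-y)*log(1-p)], applied to made-up predictions; it is an illustration, not the `nn.BCELoss` implementation.

```python
import math

def binary_cross_entropy(predictions, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(predictions, targets)]
    return sum(losses) / len(losses)

# Hypothetical sigmoid outputs for one positive (1) and one negative (0) review
confident_right = binary_cross_entropy([0.9, 0.2], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.8], [1, 0])

print(round(confident_right, 4), round(confident_wrong, 4))
```

Predictions that are confidently wrong are penalized far more heavily than predictions that are confidently right, which is what pushes the sigmoid output towards the correct label.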


This code snippet trains the sentiment analysis model for a specified number of epochs (`epochs`). It iterates through the training data in batches, calculating and backpropagating the loss to update the model parameters. Some key components of the training loop include:

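- **Hidden State Initialization:** The hidden state (`h`) is initialized to zeros at the beginning of each epoch. For each batch, it is detached from the previous computation graph so that backpropagation does not run through the entire training history.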
- **GPU Acceleration:** If a GPU is available (`train_on_gpu` is `True`), the inputs and labels are moved to the GPU.
- **Gradient Clipping:** The `clip_grad_norm_` function helps prevent the exploding gradient problem in RNNs/LSTMs by rescaling the gradients after the backward pass whenever their overall norm exceeds `clip`.
- **Validation Loss Calculation:** Every `print_every` batches, the model is evaluated on the validation set (`valid_loader`) to calculate the validation loss (`val_loss`). The average validation loss is printed along with the current training loss.
- **Model Mode Switching:** The model is switched between training and evaluation modes using `net.train()` and `net.eval()` to ensure correct behavior of layers like dropout during training and evaluation.

This training loop is crucial for training the sentiment analysis model and monitoring its performance on both the training and validation sets.
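
The idea behind gradient clipping by norm can be sketched without PyTorch. The helper below, `clip_by_global_norm`, is hypothetical and treats all gradients as one flat list of numbers, which is a simplification; `clip_grad_norm_` works on the parameter tensors of the model.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm
    (hypothetical helper)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [g * scale for g in grads], total_norm

# Large gradients are scaled down; small ones pass through unchanged
clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=5)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=5)

print(norm_before, clipped, untouched)
```

The direction of the gradient is preserved and only its length is capped, so a single unusually large update cannot destabilize training.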

In [20]:
# Set training params
epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing
counter = 0
print_every = 100
clip=5 # gradient clipping

# Move model to GPU, if available
if(train_on_gpu):
    net.cuda()

# Set model to train mode
net.train()

# Train for some number of epochs
for e in range(epochs):
    # Initialize hidden state
    h = net.init_hidden(batch_size)

    # Batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise this would backprop through the entire training history
        h = tuple([each.data for each in h])

        # Clear the gradients of all optimized variables
        net.zero_grad()

        # Get the output from the model
        output, h = net(inputs, h)

        # Calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float()) # making sure that our outputs are squeezed so that they do not have an empty dimension
        loss.backward() # Backward pass: compute gradient of the loss with respect to model parameters
        nn.utils.clip_grad_norm_(net.parameters(), clip) # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        optimizer.step() # Perform a single optimization step (parameter update)

        # Get loss stats every few batches
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []

            # Set model to evaluation mode
            net.eval()

            for inputs, labels in valid_loader:
                # Creating new variables for the hidden state, otherwise this would backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                # Get the output from the model
                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())  # making sure that our outputs are squeezed so that they do not have an empty dimension
                val_losses.append(val_loss.item()) # append the validation loss to the list of validation losses

            # Set model back to train mode
            net.train()

            # Print training and validation loss stats
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Train Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/4... Step: 100... Train Loss: 0.647868... Val Loss: 0.641024
Epoch: 1/4... Step: 200... Train Loss: 0.576166... Val Loss: 0.598000
Epoch: 1/4... Step: 300... Train Loss: 0.555865... Val Loss: 0.621067
Epoch: 1/4... Step: 400... Train Loss: 0.604562... Val Loss: 0.549225
Epoch: 2/4... Step: 500... Train Loss: 0.284617... Val Loss: 0.516605
Epoch: 2/4... Step: 600... Train Loss: 0.506833... Val Loss: 0.491125
Epoch: 2/4... Step: 700... Train Loss: 0.505795... Val Loss: 0.485445
Epoch: 2/4... Step: 800... Train Loss: 0.321016... Val Loss: 0.449118
Epoch: 3/4... Step: 900... Train Loss: 0.244150... Val Loss: 0.593418
Epoch: 3/4... Step: 1000... Train Loss: 0.350111... Val Loss: 0.509438
Epoch: 3/4... Step: 1100... Train Loss: 0.247316... Val Loss: 0.442915
Epoch: 3/4... Step: 1200... Train Loss: 0.268553... Val Loss: 0.433219
Epoch: 4/4... Step: 1300... Train Loss: 0.303633... Val Loss: 0.515278
Epoch: 4/4... Step: 1400... Train Loss: 0.311155... Val Loss: 0.526051
Epoch: 4/4... S
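
The validation loss in the log above fluctuates rather than decreasing steadily, so it helps to look for the step with the lowest value. The snippet below is a small sketch that does this for the logged numbers; the list `val_log` is copied by hand from the output and is not produced by the notebook.

```python
# (step, validation loss) pairs copied from the training log above
val_log = [
    (100, 0.641024), (200, 0.598000), (300, 0.621067), (400, 0.549225),
    (500, 0.516605), (600, 0.491125), (700, 0.485445), (800, 0.449118),
    (900, 0.593418), (1000, 0.509438), (1100, 0.442915), (1200, 0.433219),
    (1300, 0.515278), (1400, 0.526051),
]

# Pick the logged step with the smallest validation loss
best_step, best_val = min(val_log, key=lambda pair: pair[1])
print(f"Lowest logged validation loss: {best_val:.6f} at step {best_step}")
```

The minimum falls at the end of epoch 3, which matches the observation in the training parameters that validation loss stops improving after about 3 to 4 epochs.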

## Testing

This code segment evaluates the trained sentiment analysis model on the test set and prints out relevant statistics, including the average test loss and test accuracy. Here's a breakdown of the key components:

- **Loss Calculation:** For each batch in the test set, the model's output is compared to the true labels to calculate the test loss (`test_loss`).
- **Accuracy Calculation:** The number of correct predictions is tracked, and the overall test accuracy is computed as the ratio of correct predictions to the total number of test samples.
- **Conversion to NumPy:** The correctness tensor is converted to a NumPy array; if running on GPU (`train_on_gpu` is `True`), it is first moved to the CPU.
- **Printing Statistics:** The average test loss and test accuracy are printed to provide insights into the model's performance on unseen data.

This evaluation step is essential for assessing how well the trained model generalizes to new, unseen examples and gives an indication of its overall performance.
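
The accuracy calculation boils down to rounding each predicted probability and counting matches with the labels. Here is a minimal sketch on made-up values; the helper `accuracy` and its inputs are hypothetical, and the notebook performs the same steps on tensors with `torch.round` and `eq`.

```python
def accuracy(probs, labels):
    """Fraction of predictions that match the labels after rounding to 0 or 1."""
    preds = [round(p) for p in probs]
    num_correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return num_correct / len(labels)

probs = [0.91, 0.12, 0.55, 0.48]   # hypothetical sigmoid outputs
labels = [1, 0, 0, 1]
acc = accuracy(probs, labels)
print(acc)
```

Note that dividing by `len(test_loader.dataset)` assumes every test review is actually batched; if the loader drops an incomplete final batch, the reported accuracy is slightly understated.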

In [21]:
# Get test data loss and accuracy
test_losses = [] # track loss
num_correct = 0

# Init hidden state
h = net.init_hidden(batch_size)

# Set model to evaluation mode
net.eval()

# Iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise this would backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # Get predicted outputs
    output, h = net(inputs, h)
    
    # Calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # Convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # Compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

# Print stats
# Avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# Accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.558
Test accuracy: 0.796


### Inference on a test review

First I define a test review for early testing.

In [22]:
# Define negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'

Then I define the `tokenize_review` function, which processes a given test review by performing the following steps:

1. Convert the review to lowercase.
2. Remove punctuation from the review.
3. Split the processed text into individual words.
4. Tokenize the words using the `vocab_to_int` dictionary, where each word is mapped to its corresponding integer index. If a word is not present in the vocabulary, it is assigned an index of 0.

The resulting `test_ints` list contains the tokenized representation of the negative test review provided. It represents the integer indices of the words in the review based on the vocabulary used during model training.

In [23]:
from string import punctuation

# Define function to tokenize review
def tokenize_review(test_review):
    # Converting test review to lowercase
    test_review = test_review.lower()

    # Getting rid of punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])

    # Splitting by spaces
    test_words = test_text.split()

    # Tokenizing words using the vocab_to_int dictionary
    test_ints = []
    test_ints.append([vocab_to_int.get(word, 0) for word in test_words])

    return test_ints

# Tokenize test review
test_ints = tokenize_review(test_review_neg)
print(test_ints)

[[1, 247, 18, 10, 28, 108, 113, 14, 388, 2, 10, 181, 60, 273, 144, 11, 18, 68, 76, 113, 2, 1, 410, 14, 539]]
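
The output above contains no zeros, so every word of this review was found in the vocabulary. To show what happens with an unknown word, here is a small sketch using a made-up three-word vocabulary; `toy_vocab` and `tokenize_with` are hypothetical stand-ins for the notebook's `vocab_to_int` and `tokenize_review`.

```python
from string import punctuation

# Hypothetical three-word vocabulary; the real vocab_to_int is built from the training reviews
toy_vocab = {"the": 1, "movie": 2, "was": 3}

def tokenize_with(vocab, review):
    """Lowercase, strip punctuation, split on whitespace, and map words to integer ids."""
    text = "".join(c for c in review.lower() if c not in punctuation)
    # Words missing from the vocabulary fall back to index 0
    return [vocab.get(word, 0) for word in text.split()]

ids = tokenize_with(toy_vocab, "The movie was mesmerizing!")
print(ids)
```

Because 0 is also the value used for padding, the model cannot distinguish an unknown word from a padded position.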


Before wrapping everything into a single function in the final step, I first check that each preprocessing function works as intended.

In [24]:
# Test padding of test review
seq_length=200 # use the same sequence length that was trained on
features = pad_features(test_ints, seq_length)
print(features)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   1 247  18  10  28
  108 113  14 388   2  10 181  60 273 144  11  18  68  76 113   2   1 410
   14 539]]
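
The printed array shows that the 25 token ids are placed at the end of the sequence and preceded by zeros. A minimal sketch of this left-padding for a single review is given below. The helper `left_pad` is hypothetical, and the truncation of longer reviews to their first `seq_length` tokens is an assumption, since `pad_features` is defined earlier in the notebook.

```python
def left_pad(token_ids, seq_length):
    """Left-pad with zeros, or keep only the first seq_length tokens if too long."""
    if len(token_ids) >= seq_length:
        # Assumed behavior for long reviews: truncate to the first seq_length tokens
        return token_ids[:seq_length]
    return [0] * (seq_length - len(token_ids)) + token_ids

padded = left_pad([1, 247, 18], seq_length=6)
truncated = left_pad([5, 6, 7, 8], seq_length=3)
print(padded, truncated)
```

Padding on the left means the LSTM processes the actual words last, so they are closest to the final hidden state used for the prediction.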


In [25]:
# Test conversion to tensor before passing it into the model
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size())

torch.Size([1, 200])


The `predict` function takes a trained model (`net`), a test review (`test_review`), and an optional parameter for the sequence length (`sequence_length`). It performs the following steps:

1. Sets the model to evaluation mode.
2. Tokenizes the test review using the `tokenize_review` function.
3. Pads the tokenized sequence to the specified sequence length.
4. Converts the padded sequence to a PyTorch tensor.
5. Initializes the hidden state.
6. If a GPU is available, moves the tensor to the GPU.
7. Obtains the model's output.
8. Rounds the output to the nearest integer (0 or 1).
9. Prints the raw prediction value before rounding.
10. Outputs a custom response based on whether the rounded prediction is 1 (positive) or 0 (negative).

This function provides a convenient way to use the trained model to predict the sentiment of a given test review and print a custom response indicating whether the review is predicted to be positive or negative.

In [26]:
# Print custom response based on whether test_review is pos/neg
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''
    
    # Set model to evaluation mode
    net.eval()
    
    # Tokenize review
    test_ints = tokenize_review(test_review)
    
    # Pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # Convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    # Get batch size from feature tensor
    batch_size = feature_tensor.size(0)
    
    # Initialize hidden state
    h = net.init_hidden(batch_size)
    
    # Move to GPU if available
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # Get the output from the model
    output, h = net(feature_tensor, h)
    
    # Convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 

    # Printing output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # Print custom response
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")
        

Finally, I test the `predict` function on both a negative and a positive review.

In [27]:
# Define negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'

# Call function on negative test review
seq_length=200
predict(net, test_review_neg, seq_length)

Prediction value, pre-rounding: 0.004021
Negative review detected.


In [28]:
# Define positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'

# Call function on positive test review
seq_length=200
predict(net, test_review_pos, seq_length)

Prediction value, pre-rounding: 0.993114
Positive review detected!
