<a href="https://colab.research.google.com/github/danielgconti/Personal-Website/blob/main/Reccurent_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Download the dataset

Our goal today is to train a sentiment analysis neural network, which can determine whether a piece of text is positive (happy) or negative (unhappy). To do this, we will use the [Amazon Reviews for Sentiment Analysis](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews) dataset, which contains many 1-2 star reviews and many 4-5 star reviews.

## Step 1A: Download Data from Kaggle

In [None]:
#@markdown We are downloading the dataset from Kaggle. To do this using code, we need access to the Kaggle API, which you can get by downloading an API key from the Kaggle website. Download the key by going to **Kaggle → Account → Create New API Token**, and then run this cell and upload the `kaggle.json` file you created.

from google.colab import files
files.upload()

Once you've uploaded the `kaggle.json` file containing your API key, run the following code to install and prepare the Kaggle API Python package:

In [None]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Next, run this command to download and unzip the dataset:

In [None]:
!kaggle datasets download -d bittlingmayer/amazonreviews
!unzip amazonreviews.zip -d data

Finally, read the text files into Python variables. This will take about 2 minutes to run:

In [None]:
import bz2
train_file_all = bz2.BZ2File("./data/train.ft.txt.bz2").readlines()
test_file_all = bz2.BZ2File("./data/test.ft.txt.bz2").readlines()

The dataset is absolutely massive, which would be great if we had a supercomputer. To keep this manageable for our purposes today, let's take a smaller sample from the full dataset: a total of 200,000 random reviews (160,000 for training and 40,000 for testing) out of the millions of reviews available:

In [None]:
import random

# Take a smaller sample from the dataset to work with
num_train = 160000
num_test = 40000

train_file = [x.decode("utf-8") for x in random.sample(train_file_all, num_train)]
test_file = [x.decode("utf-8") for x in random.sample(test_file_all, num_test)]

## Step 1B: Split into sentences & labels

The variable `train_file` is an array of strings, where each string is a line of text from the training data. (Similarly, `test_file` is an array of strings from the testing data.)

**Your job: Print out the first item from the `train_file` array to see what each review string looks like.**

In [None]:
# ??? TODO: Print the first item of train_file
print(train_file[0])

Notice that this string begins with either `__label__1` or `__label__2`, then a space, and then the review text. `__label__1` means that it's a negative review (bad, 1-2 stars) and `__label__2` means that it's a positive review (good, 4-5 stars). Hopefully your example above is correctly labelled.

What we really want is to split `train_file` into two different variables: one array containing strings of just the review text, and one array containing just the numbers 0 or 1 corresponding to whether each review is negative or positive. (We're going to remap the labels so that 0 means negative and 1 means positive, because that feels much more intuitive.)

In [None]:
# The data set gives us two labels, which we can interpret as a positive or negative review

train_labels = [0 if x.split(" ")[0] == "__label__1" else 1 for x in train_file]
train_sentences = [x.split(" ", 1)[1][:-1].lower() for x in train_file]

test_labels = [0 if x.split(" ")[0] == "__label__1" else 1 for x in test_file]
test_sentences = [x.split(" ", 1)[1][:-1].lower() for x in test_file]
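
To see what those list comprehensions do to a single line, here is a minimal sketch on one made-up review line. The text of `example_line` is invented for illustration and is not taken from the dataset; the names `raw_label`, `text`, `label`, and `sentence` are also just for this example.

```python
# Hypothetical line, shaped like one entry of train_file
example_line = "__label__1 Disappointing: Stopped working after two days.\n"

# Split only on the first space: label prefix on the left, review text on the right
raw_label, text = example_line.split(" ", 1)

# __label__1 (negative) becomes 0; anything else becomes 1 (positive)
label = 0 if raw_label == "__label__1" else 1

# Drop the trailing newline character and lowercase the review text
sentence = text[:-1].lower()

print(label)
print(sentence)
```

The second argument to `split` matters here: without it, the review text itself would also be split on every space.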

**Print out the first five elements of `train_labels` and the first five elements of `train_sentences` to make sure the sentences and their label numbers seem correct:**

In [None]:
# ??? TODO: Print the first five elements of train_labels and the first five
# elements of train_sentences to make sure the sentences and their labels
# look good and make sense.
for x in range(5):
  print(train_labels[x])
  print(train_sentences[x])

# Step 2: Tokenize the Text

Since RNNs operate on *sequences* of data, we need to split our paragraph-long review strings into sequences of words (or, actually, sequences of **tokens**, which are slightly smaller chunks than words).

## Step 2A: Tokenize the Paragraphs

The following code imports and prepares a library called `nltk` which will tokenize text for us automatically:

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this resource

**Give it a try!** Call the function `nltk.word_tokenize()` and pass it a string with some kind of sentence that you want to tokenize.

In [None]:
# ??? TODO: Call nltk.word_tokenize() and pass in a sentence string to try tokenizing
nltk.word_tokenize("What's up this is a little string I made up to demonstrate tokenization")

Alright, so we can tokenize sentences. The variables `train_sentences` and `test_sentences` are each lists of strings. The following code converts `train_sentences` into a new list, `train_tokens`, which is actually a list of lists: each sentence has been converted to a list of tokens.

**Edit the code below so that in addition to converting `train_sentences` into `train_tokens`, it also converts `test_sentences` into `test_tokens`.** (This code will take about 2 minutes to run.)

In [None]:
train_tokens = [nltk.word_tokenize(sentence) for sentence in train_sentences]
test_tokens = [nltk.word_tokenize(sentence) for sentence in test_sentences]

Print out an example sentence and tokenized version from the training set and from the testing set:

In [None]:
print("First sentence from training set:")
print(train_sentences[0])
print(train_tokens[0])

print("\nFirst sentence from test set:")
print(test_sentences[0])
print(test_tokens[0])

## Step 2B: Convert Tokens to Numbers

Computers really like numbers, so we want to convert our tokens (words, basically) into numbers. To do this, we'll create a dictionary of all the tokens in our dataset, and assign a number to each token.

Let's start by counting up how many times each different token appears:

In [None]:
from collections import Counter

# Count the number of times each word shows up, so that we can calculate the size of our vocabulary
frequencies = Counter()

for i, tokens in enumerate(train_tokens):
  frequencies.update([token.lower() for token in tokens])

`frequencies` should now be a counter that can tell us how many times any given word appears in the training set. Try running the following code to check how many times the word "bad" appears. **Then, try checking the count for a different word.**

In [None]:
# ??? TODO: Try editing the line below to check the frequency of a different word
print(frequencies["bad"]) # Check number of times word appeared

Now we need to assign a number to each token. It's common practice to assign smaller numbers (1, 2, 3...) to the most common words and larger numbers (100, 1000...) to less common words. This code creates a sorted list of tokens (most to least common):

In [None]:
tokens = ["_PAD", "_UNK"] + sorted(frequencies, key=frequencies.get, reverse=True)
print(tokens[:100])

Notice that we've also added two special tokens, "_PAD" and "_UNK", which stand for "padding" and "unknown". We'll use these in cases where we get an unrecognized token (unknown) or need to fill extra space (padding).

Finally, let's assign a number to each token in the list:

In [None]:
token2num = {n: i for i, n in enumerate(tokens)}
token2num
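
If the full dictionary is too big to read comfortably, the same steps can be sketched on a tiny corpus. Everything prefixed with `toy_` below is made up for illustration and is not part of the notebook's data.

```python
from collections import Counter

# Tiny made-up corpus, already tokenized
toy_tokens = [
    ["the", "book", "was", "good"],
    ["the", "movie", "was", "bad"],
    ["the", "end"],
]

# Count how often each token appears
toy_frequencies = Counter()
for sentence_tokens in toy_tokens:
    toy_frequencies.update(sentence_tokens)

# Reserve 0 for padding and 1 for unknown tokens, then list the
# remaining tokens from most to least frequent
toy_vocab = ["_PAD", "_UNK"] + sorted(toy_frequencies, key=toy_frequencies.get, reverse=True)
toy_token2num = {token: index for index, token in enumerate(toy_vocab)}

print(toy_token2num["the"], toy_token2num["was"])

# A token that never appeared in the corpus falls back to 1 (_UNK)
print(toy_token2num.get("terrible", 1))
```

Here "the" (3 appearances) gets the number 2 and "was" (2 appearances) gets the number 3, right after the two special tokens.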

Now we can convert the training and testing data, `train_tokens` and `test_tokens`, from lists of token strings (which they currently are) into lists of numbers, which we'll call `train_x` and `test_x`. (We'll also copy `train_labels` and `test_labels` into `train_y` and `test_y` to be consistent with naming.)

The following code creates number versions `train_x` and `train_y`. **Write some additional code to generate `test_x` and `test_y`.**

In [None]:
import numpy as np

def to_numbers(tokens):
  return [token2num[token] if token in token2num else 1 for token in tokens]

train_x = [to_numbers(tokens) for tokens in train_tokens]
train_y = np.array(train_labels.copy())

test_x = [to_numbers(tokens) for tokens in test_tokens]
test_y = np.array(test_labels.copy())

Just to make sure it worked, let's take a look at an example data point from `test_tokens` and `test_x` to see if we properly converted a list of tokens into a list of numbers:

In [None]:
print(test_tokens[0])
print(test_x[0])

Hopefully it's correct!

# Step 3: Load Data in Batches

Even though an RNN is able to handle input sequences of any length, we still have a small issue: For efficiency purposes, it's *way* faster to train the neural network using data that is split up into "batches". (A batch is just a bunch of training examples that are all used at once.) Unfortunately, when training in batches, all the sequences in the batch need to be the same length, so for training purposes we are going to take all of our training data and make the sequences all equal in length (we've chosen to make them 200 tokens long). Sequences that are too long will be truncated (chopped) and sequences that are too short will be padded with the special `"_PAD"` token we made earlier.

Run the following code to generate the fixed-length sequences:

In [None]:
def pad(reviews, length):
  result = np.zeros((len(reviews), length), dtype=int)
  for i, review in enumerate(reviews):
    if len(review) != 0:
      result[i, -len(review) :] = np.array(review)[:length]
  return np.array(result)

seq_len = 200

train_x_fixed = pad(train_x, seq_len)
test_x_fixed = pad(test_x, seq_len)
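
To see exactly where the padding goes and which end gets chopped, here is a small sketch that repeats the same `pad` logic on two made-up reviews (the token numbers in `toy_reviews` are invented for illustration):

```python
import numpy as np

def pad(reviews, length):
    # Start from a matrix of zeros, i.e. all _PAD tokens
    result = np.zeros((len(reviews), length), dtype=int)
    for i, review in enumerate(reviews):
        if len(review) != 0:
            # Right-align the review and keep only its first `length` tokens
            result[i, -len(review):] = np.array(review)[:length]
    return result

toy_reviews = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7]]  # one short review, one long review
toy_fixed = pad(toy_reviews, 5)
print(toy_fixed)
```

The short review comes out as `[0 0 5 6 7]`, so the padding zeros go at the front. The long review comes out as `[1 2 3 4 5]`, so it is the end of the review that gets chopped off.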

Let's check to make sure our data is in a format that makes sense:

In [None]:
print(train_x_fixed.shape, train_y.shape)
print(test_x_fixed.shape, test_y.shape)

Sweet! Now that we have the fixed-length data, we can create some `DataLoader` objects which will help us load our training and testing data in batches efficiently. This code sets up those loaders:

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(torch.from_numpy(train_x_fixed), torch.from_numpy(train_y))

test_data = TensorDataset(torch.from_numpy(test_x_fixed), torch.from_numpy(test_y))

batch_size = 400

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
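
If you're curious what a loader actually hands back, this sketch builds a tiny one from made-up arrays (`toy_x` and `toy_y` stand in for `train_x_fixed` and `train_y`) and records the shape of every batch:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Made-up stand-ins: 10 sequences of length 4, with alternating labels
toy_x = np.arange(40).reshape(10, 4)
toy_y = np.array([0, 1] * 5)

toy_data = TensorDataset(torch.from_numpy(toy_x), torch.from_numpy(toy_y))
toy_loader = DataLoader(toy_data, shuffle=True, batch_size=4)

# Each iteration yields one (inputs, labels) pair
batch_shapes = [(tuple(inputs.shape), tuple(labels.shape)) for inputs, labels in toy_loader]
print(batch_shapes)
```

With 10 examples and a batch size of 4, the loader yields two full batches and one leftover batch of 2. In this notebook, 160,000 and 40,000 both divide evenly by 400, so every batch is full.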

# Step 4: Design RNN Model

Finally! Our data is prepared, so we can actually get to work on designing a model. We want to make a recurrent neural network that uses an embedding layer, an RNN layer, a linear layer, and a sigmoid layer. **Set up the model you want to train.** (See the slides for details.)

You can refer to the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) to read about the parameters the RNN layer takes.



In [None]:
import torch.nn as nn

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

input_size = 200
embed_dim = 20
hidden_size = 30
n_layers = 1
output_size = 1

vocab_size = len(token2num) + 1

class RNN(nn.Module):
  def __init__(self):
    super(RNN, self).__init__()
    self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
    self.rnn = nn.RNN(embed_dim, hidden_size, n_layers, batch_first=True)
    self.fc = nn.Linear(hidden_size, output_size)
    self.sigmoid = nn.Sigmoid()
    # ??? TODO: Set up the four different model layers we want
    # 1. Embedding layer
    # 2. RNN layer
    # 3. Linear ("fully connected"/MLP) layer
    # 4. Sigmoid layer

  def forward(self, x):
    # ??? TODO: Perform the forward pass
    embeds = self.embedding(x.long())
    out, hidden = self.rnn(embeds)
    last_out = out[:,-1,:]
    out = self.fc(last_out)
    out = self.sigmoid(out)
    return out, hidden
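
To make the forward pass easier to follow, here is a small shape check that runs the same sequence of layers on random token numbers. The sizes prefixed with `toy_` are made up so the shapes are easy to read; they are not the values used in the notebook.

```python
import torch
import torch.nn as nn

# Small made-up sizes
toy_vocab_size, toy_embed_dim, toy_hidden_size = 50, 8, 16
toy_batch, toy_seq_len = 3, 5

embedding = nn.Embedding(toy_vocab_size, toy_embed_dim, padding_idx=0)
rnn = nn.RNN(toy_embed_dim, toy_hidden_size, num_layers=1, batch_first=True)
fc = nn.Linear(toy_hidden_size, 1)

# Random token numbers standing in for one batch of reviews
x = torch.randint(0, toy_vocab_size, (toy_batch, toy_seq_len))

embeds = embedding(x)                # (batch, seq_len, embed_dim)
out, hidden = rnn(embeds)            # out: (batch, seq_len, hidden_size)
last_out = out[:, -1, :]             # output at the final time step: (batch, hidden_size)
score = torch.sigmoid(fc(last_out))  # (batch, 1), each value between 0 and 1

print(embeds.shape, out.shape, hidden.shape, last_out.shape, score.shape)
```

Only the output at the last time step is passed to the linear layer, so each review ends up with a single score no matter how long the sequence is.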

# Step 5: Train the RNN Model

Now that the training data is prepared and the model is designed, we can train the model! There's nothing to do here other than run the code. **Running this will take a few minutes, but you should see the loss (badness score) decreasing over time.** Hopefully Google Colab has assigned you a computer with a GPU. If not, training is going to be *really* slow. You might have to decrease the number of `epochs` of training if that's the case. (You'll get a worse model that way, but at least it won't take as long to train.)

In [None]:
model = RNN()
model.to(device)

epochs = 20
lr = 0.001

criterion = nn.BCELoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

model.train()
for i in range(epochs):
  step = 0

  for inputs, labels in train_loader:
    inputs, labels = inputs.to(device).float(), labels.to(device).float()

    model.zero_grad()

    output, h = model(inputs)

    output = output.squeeze()

    loss = criterion(output, labels)
    loss.backward()

    nn.utils.clip_grad_norm_(model.parameters(), 5)
    optimizer.step()

    step += 1
    if step % 100 == 0:
      print(f"epoch {i + 1}, step {step}, loss {loss.item():.4f}")
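
For a sense of what the loss number means, this sketch compares `nn.BCELoss` with the binary cross-entropy formula computed by hand on three made-up predictions (the values in `predictions` and `labels` are invented for illustration):

```python
import math
import torch
import torch.nn as nn

predictions = torch.tensor([0.9, 0.2, 0.6])  # made-up model outputs (after the sigmoid)
labels = torch.tensor([1.0, 0.0, 0.0])       # made-up true labels

loss = nn.BCELoss()(predictions, labels)

# Same thing by hand: the average of -log(p) when the label is 1
# and -log(1 - p) when the label is 0
manual = -(math.log(0.9) + math.log(1 - 0.2) + math.log(1 - 0.6)) / 3

print(loss.item(), manual)
```

The third prediction (0.6 for a negative review) contributes the most to the loss because it is the furthest from its label.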

# Step 6: Use & Test the RNN Model

## Step 6A: Try using the model

Amazing! We've trained a model. Let's give it a try. Run the code below to evaluate the sentence and predict whether it is positive or negative in sentiment. **Then, try changing the sentence to put your model to the test. How well does your model work? Can it handle negating phrases like "not good"? Can you fool it with sarcasm?**

In [None]:
sentence = "not amazing or good"

tokens = nltk.word_tokenize(sentence)

x = torch.from_numpy(np.array([to_numbers([token.lower() for token in tokens])])).to(device).float()

model.eval()
output, hidden = model(x)

score = output[0][0].item()  # convert the one-element tensor to a plain Python float

if score >= 0.5:
  print(f"Positive! :) (Score: {score})")
else:
  print(f"Negative! :( (Score: {score})")

## Step 6B: Test the model's accuracy

Finally, we can use our testing dataset to get a cold, hard measurement of how well our model works. Give it a try!

In [None]:
model.eval()

correct_count = 0
total_count = 0

for inputs, labels in test_loader:
  inputs, labels = inputs.to(device).float(), labels.to(device).float()

  outputs, hiddens = model(inputs)

  outputs = outputs.squeeze()

  correct = (outputs >= 0.5) == (labels == 1.)
  
  correct_count += correct.sum().item()  # count correct predictions as a plain Python int
  total_count += correct.shape[0]

accuracy = correct_count / total_count

print(f"Testing accuracy: {correct_count}/{total_count} = {int(accuracy * 10000) / 100}%")

# Step 7 (Bonus): Improve the model

You might be able to improve your accuracy by adjusting your model design or training process (learning rate, epochs, etc.).

You'll have to go back to steps 4 and 5 to do this. Make sure that if you edit step 4 (the design of the model), you re-train by rerunning step 5. Then test out your new model in step 6!

* Consider replacing `nn.RNN()` in the model with `nn.LSTM()` for possibly better results.
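
As a starting point for that swap, here is a minimal sketch of what an LSTM version of the model could look like. The class name `LSTMClassifier` and its constructor arguments are hypothetical, not part of the notebook; the layer sizes default to the ones used in step 4.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, embed_dim=20, hidden_size=30, n_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embeds = self.embedding(x.long())
        # nn.LSTM returns (output, (h_n, c_n)) instead of (output, h_n)
        out, (hidden, cell) = self.lstm(embeds)
        last_out = out[:, -1, :]
        return self.sigmoid(self.fc(last_out)), hidden

# Quick shape check on random token numbers
toy_model = LSTMClassifier(vocab_size=100)
toy_scores, toy_hidden = toy_model(torch.randint(0, 100, (4, 12)))
print(toy_scores.shape, toy_hidden.shape)
```

Because `forward` still returns a `(scores, hidden)` pair, the training and testing loops from steps 5 and 6 should work with this model without other changes.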