# Sentiment Analysis

Congrats, you finished the data preparation part, and we can now move on to the more exciting part: using RNNs/LSTMs to process sequential data. Be careful, though: even if the previous notebook seemed a little boring, it is of great importance. Since we switched to text data in this homework, make sure you understand how the data has been prepared.

For the last Deep Learning homework, we want to make use of Recurrent Neural Networks (RNNs) to process sequential data. We will stick with the same dataset we have been looking at in the previous notebook, namely the [IMDb](https://ai.stanford.edu/~amaas/data/sentiment/) sentiment analysis dataset that contains positive and negative movie reviews.

![](https://drive.google.com/uc?export=view&id=1WEg6_Y2cFMu163QHIXpRmbNT0rp80hEQ)


Sentiment analysis is the task of predicting the sentiment of a text. In this notebook, you will train a network to process reviews from the dataset and predict whether each review is positive or negative. Below are two examples:

![](https://drive.google.com/uc?export=view&id=1vOxwWdm3aB1k0SiWuMktX_cLRnTuWbOA)

## (Optional) Mount folder in Colab

Uncomment the following cell to mount your Google Drive if you are using the notebook in Google Colab:

In [1]:
# Use the following lines if you want to use Google Colab
# We presume you created a folder "DL_homeworks" within your main drive folder, and put the homework there.
# NOTE: terminate all other colab sessions that use GPU!
# NOTE 2: Make sure the correct homework folder (e.g homework_10) is given.

"""
from google.colab import drive
import os

gdrive_path='/content/gdrive/MyDrive/DL_homeworks/homework_10'

# This will mount your google drive under 'MyDrive'
drive.mount('/content/gdrive', force_remount=True)
# In order to access the files in this notebook we have to navigate to the correct folder
os.chdir(gdrive_path)
# Check manually if all files are present
print(sorted(os.listdir()))
"""


### Set up PyTorch environment in colab
- (OPTIONAL) Enable GPU via Runtime --> Change runtime type --> GPU
- Uncomment the following cell if you are using the notebook in google colab:

# 0. Setup

As always, we first import some packages to set up the notebook.

In [1]:
import os
import random
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.utils import clip_grad_norm_

from exercise_code.rnn.sentiment_dataset import (
    download_data,
    load_sentiment_data,
    load_vocab,
    SentimentDataset,
    collate
)

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# 1. Loading Data

As we learned in Notebook 1, this time we load not only the raw data but also the corresponding vocabulary. Let us load the data that we prepared for you:

In [2]:
DL_homeworks_path = os.path.dirname(os.path.abspath(os.getcwd()))
data_root = os.path.join(DL_homeworks_path, "datasets", "SentimentData")
base_dir = download_data(data_root)
vocab = load_vocab(base_dir)
train_data, val_data, test_data = load_sentiment_data(base_dir, vocab)

print("number of training samples: {}".format(len(train_data)))
print("number of validation samples: {}".format(len(val_data)))
print("number of test samples: {}".format(len(test_data)))

number of training samples: 9154
number of validation samples: 3133
number of test samples: 3083


## Dataset Samples

Our raw data consists of tuples `(raw_text, token_list, token_indices, label)`. Let's sample some relatively short texts from our dataset to get a sense of what they look like:

In [3]:
sample_data0 = [datum for datum in train_data if len(datum[1]) < 20 and datum[-1] == 0] # negative
sample_data1 = [datum for datum in train_data if len(datum[1]) < 20 and datum[-1] == 1] # positive

# we sample 2 tuples each from positive set and negative set
sample_data = random.sample(sample_data0, 2) + random.sample(sample_data1, 2)
for text, tokens, indices, label in sample_data:
    print('Text: \n {}\n'.format(text))
    print('Tokens: \n {}\n'.format(tokens))
    print('Indices: \n {}\n'.format(indices))
    print('Label:\n {}\n'.format(label))
    print()

Text: 
 I wouldn't rent this one even on dollar rental night.

Tokens: 
 ['i', 'wouldn', 't', 'rent', 'this', 'one', 'even', 'on', 'dollar', 'rental', 'night']

Indices: 
 [7, 555, 23, 414, 10, 27, 64, 25, 2506, 1292, 314]

Label:
 0


Text: 
 You'd better choose Paul Verhoeven's even if you have watched it.

Tokens: 
 ['you', 'd', 'better', 'choose', 'paul', 'verhoeven', 's', 'even', 'if', 'you', 'have', 'watched', 'it']

Indices: 
 [20, 232, 107, 1999, 855, 4624, 16, 64, 35, 20, 26, 214, 8]

Label:
 0


Text: 
 I don't know why I like this movie so well, but I never get tired of watching it.

Tokens: 
 ['i', 'don', 't', 'know', 'why', 'i', 'like', 'this', 'movie', 'so', 'well', 'but', 'i', 'never', 'get', 'tired', 'of', 'watching', 'it']

Indices: 
 [7, 74, 23, 126, 138, 7, 32, 10, 13, 34, 68, 17, 7, 115, 82, 1225, 5, 116, 8]

Label:
 1


Text: 
 This is the definitive movie version of Hamlet. Branagh cuts nothing, but there are no wasted moments.

Tokens: 
 ['this', 'is', 'the', 'de
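
The token lists above suggest a simple scheme: lowercase the text and split on anything that is not a letter, which is why `wouldn't` becomes `wouldn` and `t`. The sketch below reproduces that behaviour with a regular expression. The name `tokenize_sketch` and the exact pattern are assumptions for illustration; the course's own `tokenize` in `exercise_code/rnn/sentiment_dataset.py` may treat digits or other characters differently.

```python
import re


def tokenize_sketch(text):
    """Lowercase the text and keep runs of ASCII letters (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


tokens = tokenize_sketch("I wouldn't rent this one even on dollar rental night.")
print(tokens)
```

For the first sample review this yields the same eleven tokens as shown in the output above.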

## Checking the Vocabulary

In the previous notebook, we discussed the need for a vocabulary that maps words to unique integer IDs. Instead of having you create the vocabulary manually, we provide it for you. Let's have a look at some sample words from the dataset's vocabulary:

In [4]:
print('Vocabulary size:', len(vocab), '\n\n  Sample words\n{}'.format('-' * 20))
sample_words = random.sample(list(vocab.keys()), 10)
for word in sample_words:
    print(' {}'.format(word))

Vocabulary size: 5002 

  Sample words
--------------------
 trail
 code
 sticks
 clue
 bone
 davies
 britney
 jack
 tell
 held


We also saw that the raw data we loaded already contains indices. We can check whether the indices in the vocabulary match the raw data using the last sentence in `sample_data`. Words that are not in the vocabulary are mapped to the symbol `<unk>`. The output of the following cell should be the same as the indices in the last example of our loaded raw data:

In [5]:
# Last sample from above 
(text, tokens, indices, label) = sample_data[-1]
print('Text: \n {}\n'.format(text))
print('Tokens: \n {}\n'.format(tokens))
print('Indices: \n {}\n'.format(indices))
print('Label:\n {}\n'.format(label))

# Compare with the vocabulary
print('Indices drawn from vocabulary: \n {}\n'.format([vocab.get(x, vocab['<unk>']) for x in sample_data[-1][1]]))

Text: 
 This is the definitive movie version of Hamlet. Branagh cuts nothing, but there are no wasted moments.

Tokens: 
 ['this', 'is', 'the', 'definitive', 'movie', 'version', 'of', 'hamlet', 'branagh', 'cuts', 'nothing', 'but', 'there', 'are', 'no', 'wasted', 'moments']

Indices: 
 [10, 9, 2, 1, 13, 304, 5, 2974, 2206, 2865, 151, 17, 44, 28, 62, 628, 385]

Label:
 1

Indices drawn from vocabulary: 
 [10, 9, 2, 1, 13, 304, 5, 2974, 2206, 2865, 151, 17, 44, 28, 62, 628, 385]
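
To make the `<unk>` fallback concrete, here is a small sketch with a toy vocabulary. The indices for real words are copied from the output above; the `<pad>` entry and the assumption that `<unk>` sits at index 1 (which would explain why `definitive` maps to 1) are illustrative guesses, not facts about the provided vocabulary file.

```python
# Toy vocabulary; indices for real words are copied from the output above.
# The "<pad>" entry and the position of "<unk>" are assumptions.
toy_vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "is": 9, "this": 10, "movie": 13}
inverse_vocab = {index: word for word, index in toy_vocab.items()}

tokens = ["this", "is", "the", "definitive", "movie"]

# Unknown words fall back to the <unk> index, as in the cell above.
indices = [toy_vocab.get(token, toy_vocab["<unk>"]) for token in tokens]

# Decoding shows which words the model can no longer tell apart.
decoded = [inverse_vocab[index] for index in indices]

print(indices)  # [10, 9, 2, 1, 13]
print(decoded)  # ['this', 'is', 'the', '<unk>', 'movie']
```

Once a word is mapped to `<unk>`, its identity is lost to the model, so a larger vocabulary trades more parameters in the embedding for fewer unknown tokens.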



## Wrapping to PyTorch Datasets

Great, the raw data is loaded properly and the vocabulary is matching. Let us wrap our data in a PyTorch dataset. For more details, check out the previous notebook and the corresponding dataset class defined in `exercise_code/rnn/sentiment_dataset.py`.

In [6]:
# Define a Dataset Class for train, val and test set
train_dataset = SentimentDataset(train_data)
val_dataset = SentimentDataset(val_data)
test_dataset = SentimentDataset(test_data)
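
As a reminder of what such a wrapper and its batching step might look like, here is a minimal sketch. `SketchDataset` and `collate_sketch` are hypothetical stand-ins, not the course's `SentimentDataset` and `collate`; the dictionary keys `data`, `lengths`, and `label` follow the way batches are used later in this notebook, while the padding value 0 and the `(T, B)` layout are assumptions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class SketchDataset(Dataset):
    """Hypothetical minimal dataset over (text, tokens, indices, label) tuples."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        _, _, indices, label = self.data[i]
        return {
            "data": torch.tensor(indices, dtype=torch.long),
            "label": torch.tensor(label, dtype=torch.float),
        }


def collate_sketch(batch):
    """Pad sequences to a common length; padding value 0 is an assumption."""
    lengths = torch.tensor([len(item["data"]) for item in batch])
    padded = pad_sequence([item["data"] for item in batch], padding_value=0)  # (T, B)
    labels = torch.stack([item["label"] for item in batch])
    return {"data": padded, "lengths": lengths, "label": labels}


# Two index sequences copied from the samples above (text and tokens omitted).
toy_data = [
    ("", [], [7, 555, 23, 414, 10, 27, 64, 25, 2506, 1292, 314], 0),
    ("", [], [20, 232, 107, 1999, 855, 4624, 16, 64, 35, 20, 26, 214, 8], 0),
]
loader = DataLoader(SketchDataset(toy_data), batch_size=2, collate_fn=collate_sketch)
batch = next(iter(loader))
print(batch["data"].shape, batch["lengths"], batch["label"])
```

The shorter sequence is padded up to the length of the longer one, and `lengths` keeps the true lengths so the model can ignore the padding.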

# 2. Creating a Sentiment Classifier

After we have loaded the data, it is time to define a model and start training and testing.

## Evaluation Metrics

Since we only need to predict positive or negative, we can train our model with the binary cross-entropy loss (`nn.BCELoss`) and use accuracy to assess its performance. We will use the following evaluation function to compute the accuracy.

In [7]:
bce_loss = nn.BCELoss()

@torch.no_grad()
def compute_accuracy(model, data_loader):
    corrects = 0
    total = 0
    device = next(model.parameters()).device
    
    for i, x in enumerate(data_loader):
        inputs = x['data'].to(device)  # renamed to avoid shadowing the built-in input()
        lengths = x['lengths']
        label = x['label'].to(device)
        pred = model(inputs, lengths)
        corrects += ((pred > 0.5) == label).sum().item()
        total += label.numel()
        
        if i > 0  and i % 100 == 0:
            print('Step {} / {}'.format(i, len(data_loader)))
    
    return corrects / total
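
To see what these two metrics compute, the following sketch evaluates binary cross-entropy by hand on three made-up predictions and compares it with `nn.BCELoss`, then applies the same 0.5 threshold as `compute_accuracy`. The probabilities and labels are invented for illustration.

```python
import torch
import torch.nn as nn

# Made-up sigmoid outputs and ground-truth labels for three reviews.
probs = torch.tensor([0.9, 0.2, 0.6])
labels = torch.tensor([1.0, 0.0, 0.0])

# Binary cross-entropy: -mean(y * log(p) + (1 - y) * log(1 - p))
manual = -(labels * probs.log() + (1 - labels) * (1 - probs).log()).mean()
builtin = nn.BCELoss()(probs, labels)

# Accuracy with the 0.5 decision threshold: the third prediction is wrong.
accuracy = ((probs > 0.5) == labels.bool()).float().mean().item()

print(manual.item(), builtin.item(), accuracy)
```

Note that `nn.BCELoss` expects probabilities in `[0, 1]`, so the model's last layer has to be a sigmoid.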

## Step 1: Design your own model

In this part, you need to create a classifier using the Embedding layer you implemented in the first notebook and an LSTM. For the LSTM, you may also use the PyTorch implementation.


<div class="alert alert-info">
    <h3>Task: Implement a Classifier</h3>
    
   Go to <code>exercise_code/rnn/text_classifiers.py</code> and implement the <code>RNNClassifier</code>. In the skeleton code, we inherited <code>nn.Module</code>. You can also inherit <code>LightningModule</code> if you want to use PyTorch Lightning.
</div>

This file is mostly empty but contains the expected class name, and the methods that your model needs to implement (only `forward()` basically). 
The only rules your model design has to follow are:
* Inherit from `torch.nn.Module` or `pytorch_lightning.LightningModule`
* Perform the forward pass in `forward()`.
* Have fewer than 2 million parameters
* Have a model size of less than 50MB after saving

After you have finished, edit the cell below to make sure your implementation is correct. You should define the model yourself; it must be small enough (fewer than 2 million parameters) and have the correct output format.

In [8]:
from exercise_code.rnn.tests import classifier_test, parameter_test
from exercise_code.rnn.text_classifiers import RNNClassifier

model = None

########################################################################
# TODO - Create a Model                                               #
########################################################################

num_embeddings = len(vocab)
embedding_dim = 128
hidden_size = 256
use_lstm = True

model = RNNClassifier(num_embeddings, embedding_dim, hidden_size, use_lstm)

########################################################################
#                           END OF YOUR CODE                           #
########################################################################

# Check whether your model is sufficiently small and has the correct output format
parameter_test(model), classifier_test(model, len(vocab))

Total number of parameters: 1035777
Your model is sufficiently small.
All output tests are passed!


(True, True)
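
For intuition about where the parameter count comes from, here is a sketch of one possible architecture built from PyTorch's own layers. `SketchClassifier` is a hypothetical stand-in and not the reference `RNNClassifier`; it simply assumes an embedding, a single-layer LSTM, and a linear layer followed by a sigmoid, with the same sizes as in the cell above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class SketchClassifier(nn.Module):
    """Hypothetical embedding -> LSTM -> linear -> sigmoid classifier."""

    def __init__(self, num_embeddings, embedding_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_size)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, sequence, lengths=None):
        # sequence: (T, B) token indices; lengths: (B,) lengths before padding
        embedded = self.embedding(sequence)
        if lengths is not None:
            # Packing lets the LSTM stop at each sequence's true last step.
            embedded = pack_padded_sequence(
                embedded, lengths.cpu(), enforce_sorted=False
            )
        _, (h_n, _) = self.rnn(embedded)
        # h_n[-1] is the final hidden state, shape (B, hidden_size).
        return torch.sigmoid(self.fc(h_n[-1])).squeeze(-1)


sketch = SketchClassifier(num_embeddings=5002, embedding_dim=128, hidden_size=256)
num_params = sum(p.numel() for p in sketch.parameters())

# Embedding: 5002 * 128; LSTM: 4 * (128 * 256 + 256 * 256 + 2 * 256); Linear: 256 + 1
print(num_params)

scores = sketch(torch.randint(0, 5002, (7, 3)), torch.tensor([7, 5, 2]))
print(scores.shape)
```

With these sizes the embedding dominates the budget (about 640K of roughly 1.04M parameters), so the vocabulary size and embedding dimension are the first knobs to check when a model gets too large.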

## Step 2: Train your own model

In this section, you need to train the classifier you created. Below, you can see some setup code we provided to you. Note the **collate function** used with the `DataLoader`. If you forgot why we need the collate function here, check this out in Notebook 1.

You are free to change the below configs (e.g. batch size, device setting etc.) as you wish.

In [9]:
# Training configs
if torch.cuda.is_available():
    device = torch.device('cuda')  # e.g. a Colab GPU runtime
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

print('Using {}...\n'.format(device))

# Move model to the device we are using
model = model.to(device)

# To tackle the exploding gradient problem, you may want to set gclip and use clip_grad_norm_
# see the first notebook for the explanation
gclip = None

# Dataloaders, note the collate function
train_loader = DataLoader(
  train_dataset, batch_size=128, collate_fn=collate, drop_last=True
)
val_loader = DataLoader(
  val_dataset, batch_size=128, collate_fn=collate, drop_last=False
)

Using mps...
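
The `gclip` setting above refers to gradient-norm clipping. The sketch below shows what `clip_grad_norm_` does on a tiny layer whose gradients are set by hand to an artificially large value; the layer, the gradient value of 10, and the threshold of 5 are all chosen for illustration only.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

layer = nn.Linear(4, 1)  # 4 weights + 1 bias = 5 parameters

# Fake "exploding" gradients: every entry is 10, so the norm is 10 * sqrt(5).
for p in layer.parameters():
    p.grad = torch.full_like(p, 10.0)

max_norm = 5.0
# clip_grad_norm_ rescales all gradients in place and returns the norm
# they had before clipping.
norm_before = float(clip_grad_norm_(layer.parameters(), max_norm))
norm_after = float(
    torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))
)

print(norm_before, norm_after)
```

Clipping changes the length of the gradient vector but not its direction, which is why it can stabilise training without biasing the update towards particular parameters.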



<div class="alert alert-info">
<h3>Task: Implement Training</h3>
    <p>
        In the cell below, implement a training loop for your model. Iterate over the data with the training loader provided above; to evaluate your model periodically, use the validation loader. You can use pure PyTorch or PyTorch Lightning.
    </p>
    <p>
        Use <code>torch.nn.BCELoss</code> as the loss function.
    </p>
</div>

In [10]:
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()
NUM_EPOCHS = 15

scheduler = CosineAnnealingLR(optimizer, T_max=len(train_loader) * NUM_EPOCHS)

for epoch in range(NUM_EPOCHS):
    
    for batch_idx, data in enumerate(train_loader, 0):
        
        inputs = data["data"]
        target = data["label"]
        lengths = data["lengths"]

        inputs = inputs.to(device)
        target = target.to(device)
        lengths = lengths.to(device)

        optimizer.zero_grad()
        
        y_pred = model(inputs, lengths)             
        loss = criterion(y_pred, target) 
        loss.backward()             
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
        
        optimizer.step()
        
        scheduler.step()  # Step the scheduler after each batch processed
        
        if not batch_idx % 71:
            print(f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                  f'Batch {batch_idx:03d}/{len(train_loader):03d} | '
                  f'Loss: {loss:.4f}')
            
    with torch.set_grad_enabled(False):
        train_acc = compute_accuracy(model, train_loader)  
        val_acc = compute_accuracy(model, val_loader)  
        
        # compute_accuracy returns a fraction in [0, 1], so no percent sign here
        print(f'Training Accuracy: {train_acc:.2f} | Validation Accuracy: {val_acc:.2f}')

print('FINISH.')

Epoch: 001/015 | Batch 000/071 | Loss: 0.6976
Training Accuracy: 0.69 | Validation Accuracy: 0.69
Epoch: 002/015 | Batch 000/071 | Loss: 0.7016
Training Accuracy: 0.80 | Validation Accuracy: 0.77
Epoch: 003/015 | Batch 000/071 | Loss: 0.5410
Training Accuracy: 0.81 | Validation Accuracy: 0.76
Epoch: 004/015 | Batch 000/071 | Loss: 0.4623
Training Accuracy: 0.88 | Validation Accuracy: 0.82
Epoch: 005/015 | Batch 000/071 | Loss: 0.3348
Training Accuracy: 0.90 | Validation Accuracy: 0.83
Epoch: 006/015 | Batch 000/071 | Loss: 0.2987
Training Accuracy: 0.90 | Validation Accuracy: 0.85
Epoch: 007/015 | Batch 000/071 | Loss: 0.2793
Training Accuracy: 0.91 | Validation Accuracy: 0.85
Epoch: 008/015 | Batch 000/071 | Loss: 0.2123
Training Accuracy: 0.95 | Validation Accuracy: 0.84
Epoch: 009/015 | Batch 000/071 | Loss: 0.1673
Training Accuracy: 0.96 | Validation Accuracy: 0.85
Epoch: 010/015 | Batch 000/071 | Loss: 0.1423
Training Accuracy: 0.97 | Validation Accuracy: 0.84
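
The loop above steps a cosine annealing schedule once per batch, so the learning rate decays smoothly from its initial value towards zero over all `len(train_loader) * NUM_EPOCHS` steps. The sketch below reproduces this with a dummy parameter and compares it with the closed-form cosine curve; the 71 steps per epoch are taken from the log above, everything else (the dummy parameter, the helper `closed_form`) is illustrative.

```python
import math

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

steps_per_epoch, num_epochs, base_lr = 71, 15, 1e-3
t_max = steps_per_epoch * num_epochs

# Dummy parameter with a zero gradient so that optimizer.step() is valid.
param = torch.nn.Parameter(torch.zeros(1))
param.grad = torch.zeros_like(param)
optimizer = torch.optim.Adam([param], lr=base_lr)
scheduler = CosineAnnealingLR(optimizer, T_max=t_max)


def closed_form(step):
    """Cosine curve from base_lr down to 0 (the scheduler's default eta_min)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / t_max))


lrs = []
for step in range(t_max):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()  # one scheduler step per batch, as in the loop above

final_lr = optimizer.param_groups[0]["lr"]
print(lrs[0], lrs[t_max // 2], final_lr)
```

Because `T_max` spans the whole run, the learning rate is roughly half its initial value midway through training and close to zero in the last epoch.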


In [10]:
########################################################################
#                     TODO - Train Your Model                          #
########################################################################

import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=10, gradient_clip_val=40)
trainer.fit(model, train_loader, val_loader)

########################################################################
#                           END OF YOUR CODE                           #
########################################################################

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name      | Type      | Params
----------------------------------------
0 | embedding | Embedding | 640 K 
1 | rnn       | LSTM      | 395 K 
2 | fc        | Linear    | 257   
3 | sigmoid   | Sigmoid   | 0     
----------------------------------------
1.0 M     Trainable params
0         Non-trainable params
1.0 M     Total params
4.143     Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=10` reached.


## Testing the Model

Now that you have trained a model and tuned it on the validation set, you can evaluate it on the test set.

In [11]:
test_loader = DataLoader(
  test_dataset, batch_size=8, collate_fn=collate, drop_last=False
)

print("accuracy on test set: {}".format(compute_accuracy(model, test_loader)))

Step 100 / 386
Step 200 / 386
Step 300 / 386
accuracy on test set: 0.8462536490431398


## Demo


Now that you have trained a sufficiently good sentiment classifier, run the cell below and type some text to see predictions (type `exit` to quit the demo). Since we used a small dataset, don't expect too much.
<div class="alert alert-warning">
<h3>Warning!</h3>
    <p>
        As there is a while True loop in the cell below, you can skip this one for now and run the cell under '3. Submission' first to save your model. 
   </p>
</div>

In [12]:
from exercise_code.rnn.sentiment_dataset import tokenize

text = ''
w2i = vocab
# The input indices must live on the same device as the model's weights.
# PyTorch Lightning may move the model back to the CPU after training, so we
# query the model's current device instead of reusing `device` from above.
model_device = next(model.parameters()).device
model.eval()
while True:
    text = input()
    if text == 'exit':
        break

    words = torch.tensor([
        w2i.get(word, w2i['<unk>'])
        for word in tokenize(text)
    ]).long().to(model_device).view(-1, 1)  # T x B

    with torch.no_grad():
        pred = model(words).item()
    sent = pred > 0.5
    
    print('Sentiment -> {}, Confidence -> {}'.format(
        ':)' if sent else ':(', pred if sent else 1 - pred
    ))
    print()

# 3. Submission

If you got sufficient performance on the test data, you are ready to save your model.

In [12]:
from exercise_code.util.save_model import save_model

save_model(model, 'rnn_classifier.p')

...Your model is saved to models/rnn_classifier.p successfully!


'models/rnn_classifier.p'
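
One of the rules above is that the saved model stays below 50MB. A rough way to estimate this before submitting is to serialize the weights into memory and measure the buffer. This is only a sketch: `saved_size_mb` is a hypothetical helper that uses `torch.save` on the state dict, which may differ slightly from what the course's `save_model` writes to disk.

```python
import io

import torch
import torch.nn as nn


def saved_size_mb(module):
    """Serialize the state dict to memory and return its size in megabytes."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6


# An LSTM with the sizes used in this notebook: 395,264 float32 parameters,
# i.e. about 1.58 MB of raw weights plus a small serialization overhead.
lstm = nn.LSTM(input_size=128, hidden_size=256)
size_mb = saved_size_mb(lstm)
print(f"{size_mb:.2f} MB")
```

As a rule of thumb, a float32 model needs about 4 bytes per parameter, so the 2 million parameter limit already keeps the file far below 50MB.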

Congrats, you finished the last Deep Learning homework! One last time this semester, let's prepare the submission:

In [13]:
# Now zip the folder for upload
from exercise_code.util.submit import submit_exercise

submit_exercise('../output/homework_10')

relevant folders: ['exercise_code', 'models']
notebooks files: ['1_recurrent_neural_networks.ipynb', '3_sentiment_analysis.ipynb', '2_text_preprocessing_and_embedding.ipynb']
Adding folder exercise_code
Adding folder models
Adding notebook 1_recurrent_neural_networks.ipynb
Adding notebook 3_sentiment_analysis.ipynb
Adding notebook 2_text_preprocessing_and_embedding.ipynb
Zipping successful! Zip is stored under: /Users/tigrangaplanyan/Downloads/DL_homeworks/output/homework_10.zip


# 4. Submission Goals

- Goal: Implement and train a recurrent neural network for sentiment analysis.
- Passing Criteria: Reach **Accuracy >= 83%** on the test dataset.

- Submission deadline: __Sunday December 10, 2023 - 23:59__ 
- You can make **$\infty$** submissions until the deadline. Your __best submission__ will be considered.