# Part 3. Enhancement
The RNN model used in Part 2 is a basic model to perform the task of sentiment classification. In
this section, you will design strategies to improve upon the previous model you have built. You are
required to implement the following adjustments:

1. Instead of keeping the word embeddings fixed, now update the word embeddings (the same
way as model parameters) during the training process.
2. As discussed in Question 1(c), apply your solution in mitigating the influence of OOV words
and train your model again.
3. Keeping the above two adjustments, replace your simple RNN model in Part 2 with a biLSTM model and a biGRU model, incorporating recurrent computations in both directions and
stacking multiple layers if possible.
4. Keeping the above two adjustments, replace your simple RNN model in Part 2 with a Convolutional Neural Network (CNN) to produce sentence representations and perform sentiment
classification.
5. Further improve your model. You are free to use any strategy other than the above mentioned solutions. Changing hyper-parameters or stacking more layers is not counted towards
a meaningful improvement.


## Question 1

Instead of keeping the word embeddings fixed, now update the word embeddings (the same
way as model parameters) during the training process.

### Approach

We use the same model as in the Part 2 notebook, but now we also back-propagate
the loss into the word embeddings themselves. As the model learns, the embedding
vectors are updated along with the other parameters, so the encoding of each word changes.
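
The idea can be checked on a toy example. The following is a minimal sketch, not the notebook's training loop: the 5x4 `pretrained` matrix, the mean-pooling classifier and the SGD optimizer are assumptions made only for illustration. It shows that with `freeze=False` a single optimization step changes the embedding rows of the tokens in the batch, while the padding row stays fixed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the pretrained matrix: 5 words, 4 dimensions.
pretrained = torch.randn(5, 4)
pad_idx = 0

# freeze=False registers the embedding weights as trainable parameters.
embedding = nn.Embedding.from_pretrained(
    pretrained.clone(), freeze=False, padding_idx=pad_idx
)
classifier = nn.Linear(4, 1)

params = list(embedding.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

# One toy batch: two sequences of three token indices, padded with pad_idx.
tokens = torch.tensor([[1, 2, 0], [3, 4, 4]])
labels = torch.tensor([[1.0], [0.0]])

before = embedding.weight.detach().clone()

# Mean-pool the token vectors, classify, and back-propagate the loss.
logits = classifier(embedding(tokens).mean(dim=1))
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embedding.weight.detach()
# True for every row whose vector was modified by the update.
changed_rows = (before != after).any(dim=1)
print(changed_rows)
```

The padding row receives no gradient because of `padding_idx`, so only the rows of real tokens move.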

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from common_utils import EmbeddingMatrix

In [None]:
class RNN(nn.Module):

    def __init__(
        self,
        hidden_dim: int,
        embedding_dim: int,
        word_embeddings: torch.Tensor,
        pad_idx,
        num_layers=1,
        output_size=1,
        dropout_rate=0
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout_rate = dropout_rate
        # freeze=False lets the optimizer update the pretrained embeddings during training
        self.embedding = nn.Embedding.from_pretrained(word_embeddings, freeze=False, padding_idx=pad_idx)
        self.rnn = nn.RNN(
            embedding_dim, hidden_dim, num_layers, batch_first=True
        )  # the input size of the RNN is the embedding dimension
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()
        
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        
        if self.dropout_rate > 0:
            embedded = self.dropout(self.embedding(x)).float()
        else:
            embedded = self.embedding(x).float()
        
        out, _ = self.rnn(embedded, h0)
        
        if self.dropout_rate > 0:
            out = self.dropout(out)
        # With stacked layers, max-pool over the time dimension;
        # otherwise use the RNN output at the last time step.
        if self.num_layers > 1:
            out, _ = torch.max(out, 1)
        else:
            out = out[:, -1, :]
        
        out = self.fc(out)  # map the sentence representation to the output score
        sig_out = self.sigmoid(out)
        return sig_out

from common_utils import HIDDEN_SIZE, EMBEDDING_DIM, LEARNING_RATE
# initialize word embeddings
word_embeddings = EmbeddingMatrix.load()
word_embeddings.add_padding()
word_embeddings.add_unk_token()

print("The index of <PAD> is: ", word_embeddings.pad_idx)


basic_RNN = RNN(
    hidden_dim=HIDDEN_SIZE,
    embedding_dim=EMBEDDING_DIM,
    word_embeddings=word_embeddings.to_tensor,
    pad_idx=word_embeddings.pad_idx,
    num_layers=1,
)

optim = torch.optim.Adam(basic_RNN.parameters(), lr=LEARNING_RATE)
# scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=5, gamma=0.9)
scheduler = torch.optim.lr_scheduler.LinearLR(optim, start_factor=1.0, end_factor=0.01, total_iters=100)
criterion = nn.BCELoss()


## Question 2

### Approach

In Part 1, we described two approaches to handling OOV words. We now
demonstrate the first approach, which is to replace every OOV word with the
special token `<UNK>`.
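
At lookup time, this replacement amounts to falling back to the index of `<UNK>` whenever a token is missing from the vocabulary. The following is a minimal sketch; the `encode` helper and the toy `word2idx` mapping are hypothetical and not part of `common_utils`.

```python
UNK_TOKEN = "<UNK>"  # assumed spelling of the special token


def encode(tokens, word2idx, unk_token=UNK_TOKEN):
    """Map tokens to indices, sending unknown words to the <UNK> index."""
    unk_idx = word2idx[unk_token]
    return [word2idx.get(token, unk_idx) for token in tokens]


# Toy vocabulary for illustration only.
word2idx = {"<UNK>": 0, "a": 1, "good": 2, "movie": 3}

# "scary-funny" is not in the vocabulary, so it is encoded as <UNK>.
encoded = encode(["a", "scary-funny", "movie"], word2idx)
print(encoded)
```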

In [None]:
import json
from common_utils import EMBEDDING_DIM, EMBEDDING_MATRIX_PATH, IDX2WORD_PATH, UNK_TOKEN, WORD2IDX_PATH, load_glove_embeddings


w2v_model = EmbeddingMatrix.load()
# Copy the vocabulary so that the loaded model's own set is not mutated
extended_vocab = set(w2v_model.vocab)

glove_dict = load_glove_embeddings()

# Collect the OOV words, i.e. words that have no GloVe vector
missing_words = [word for word in extended_vocab if word not in glove_dict]

# Remove the OOV words from the vocab; they are represented by UNK_TOKEN instead
extended_vocab.difference_update(missing_words)
# Add UNK_TOKEN only after filtering, otherwise it could be removed as a missing word
extended_vocab.add(UNK_TOKEN)

print(f"Number of missing words: {len(missing_words)}")
print(f"The missing words are: {missing_words}")

# mapping of words to indices and vice versa
word2idx = {word: idx for idx, word in enumerate(sorted(extended_vocab))}
idx2word = {idx: word for word, idx in word2idx.items()}

print("Building embedding matrix...")
vocab_size = len(word2idx)
print(f"Vocab size: {vocab_size}")
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

for word, idx in word2idx.items():
    # UNK_TOKEN may have no GloVe vector; its row is filled in below
    if word in glove_dict:
        embedding_matrix[idx] = glove_dict[word]

# use a random vector for the unknown-word token
embedding_matrix[word2idx[UNK_TOKEN]] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

print("Embedding matrix built successfully.")


Building embedding matrix...
Loading GloVe embeddings...


Repo card metadata block was not found. Setting CardData to empty.


Total GloVe words loaded: 400000
Embedding matrix built successfully.
Embedding matrix saved as './result/embedding_matrix.npy'.
Mapping 'word2idx' saved as './result/word2idx.json'.
Mapping 'idx2word' saved as './result/idx2word.json'.


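For the second approach, we use FastText embeddings, which are built from
subword (character n-gram) information. Because a word vector is composed from
its subwords, the model can also produce a vector for an OOV word. Here the
FastText model is trained on the training split of the dataset.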
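
To illustrate why subword information helps with OOV words, the sketch below extracts the character n-grams of a word, using the `<` and `>` boundary markers and the 3 to 6 n-gram range that FastText uses by default. It is a simplification: the `char_ngrams` helper is hypothetical, and the real implementation additionally hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams of "where", including the boundary markers.
trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# A rare word such as "unslick" shares n-grams with the known word "slick",
# so part of its representation can be built from subwords seen in training.
shared = set(char_ngrams("unslick")) & set(char_ngrams("slick"))
print(sorted(shared))
```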

In [20]:
# Implementation of FastText for word embedding
import gensim
from gensim.models import FastText
from common_utils import EMBEDDING_DIM
# Create a FastText model with the same dimensions as the Word2Vec model
fasttext_model = FastText(
    vector_size=EMBEDDING_DIM,
    window=5, # context window size 
    min_count=1, # threshold for word frequency
    workers=4
)

In [4]:
from datasets import load_dataset, Dataset
from common_utils import EmbeddingMatrix, tokenize
import nltk
dataset = load_dataset("rotten_tomatoes")
train_dataset = dataset['train']

corpus = []
for example in train_dataset:
    tokens = nltk.word_tokenize(example['text'])
    corpus.append(tokens)
print("The corpus has {} documents.".format(len(corpus)))

The corpus has 8530 documents.


In [21]:
fasttext_model.build_vocab(corpus_iterable=corpus)
# fasttext_model.build_vocab(corpus_iterable=corpus, update=True)

print("Length of vocabulary:", len(fasttext_model.wv.key_to_index))


fasttext_model.train(
    corpus_iterable=corpus, epochs=fasttext_model.epochs,
    total_examples=fasttext_model.corpus_count, total_words=fasttext_model.corpus_total_words,
)

Length of vocabulary: 18030


(657457, 919840)

In [22]:
"computer" in fasttext_model.wv.key_to_index

True

In [23]:
fasttext_vocab = set(fasttext_model.wv.key_to_index.keys())
w2v_model = EmbeddingMatrix.load()

In [24]:
w2v_model.vocab - fasttext_vocab

set()

In [25]:
fasttext_vocab - w2v_model.vocab

{'unslick',
 'rocky-like',
 'one-joke',
 'heavy-handedness',
 "'fish",
 'sequel-itis',
 'japanimator',
 'enrapturing',
 'again-courage',
 'enviará',
 "'enigma",
 'redneck-versus-blueblood',
 'scary-funny',
 "'topless",
 'in-the-ring',
 'unreligious',
 "'wayne",
 'post-full',
 'not-so-stock',
 'dridi',
 'speeds/',
 'escapa',
 'satirizado',
 'cliché-riddled',
 'semi-coherent',
 'disney-fied',
 'boom-bam',
 'techno-horror',
 "'korean",
 'scene-by-scene',
 'half-formed',
 'raw-nerved',
 'poo-poo',
 'walking-dead',
 'miscasts',
 "'tweener",
 'contando',
 'oft-described',
 "'true",
 'young-guns',
 'quick-cut',
 'clung-to',
 'janklowicz-mann',
 'decirles',
 'anti-hollywood',
 'self-mutilating',
 'migraine-inducing',
 'contemplarse',
 'episódio',
 'time-switching',
 'anti-darwinian',
 'efteriades',
 'heartwarmingly',
 'odd-couple',
 'half-asleep',
 'fleet-footed',
 'turkey-on-rolls',
 'dearly-loved',
 "'tonight",
 'junk-calorie',
 "'have-yourself-a-happy-little-holocaust",
 'gay-niche',
 'fare

In [26]:
print("The FastText model has {} words.".format(len(fasttext_model.wv.key_to_index)))
print("The Word2Vec model has {} words.".format(len(w2v_model.vocab)))
print("The FastText model has {} words that are not in the Word2Vec model.".format(len(fasttext_vocab - w2v_model.vocab)))
print("The Word2Vec model has {} words that are not in the FastText model.".format(len(w2v_model.vocab - fasttext_vocab)))

The FastText model has 18030 words.
The Word2Vec model has 16163 words.
The FastText model has 1867 words that are not in the Word2Vec model.
The Word2Vec model has 0 words that are not in the FastText model.


In [27]:
# save the FastText model
fasttext_model.save("fasttext_model.model")

# Question 3. Enhancement
(a) Report the accuracy score on the test set when the word embeddings are updated (Part 3.1).
   
(b) Report the accuracy score on the test set when applying your method to deal with OOV words
in Part 3.2.
   
(c) Report the accuracy scores of biLSTM and biGRU on the test set (Part 3.3).
   
(d) Report the accuracy scores of CNN on the test set (Part 3.4).
   
(e) Describe your final improvement strategy in Part 3.5. Report the accuracy on the test set
using your improved model.
   
(f) Compare the results across different solutions above and describe your observations with possible discussions.
