# Deep Learning 2021
## Assignment 5 - Recurrent Neural Networks

### 1. Word Embeddings
Consider a Word2Vec model with the vocabulary $\{A, B, C, D, E\}$ and the weight matrices
\begin{equation*}
    W =
        \begin{pmatrix}
            1 & -1  & 0\\
            0 & 1   & 2\\
            2 & 2   & 2\\
            2 & 1   & 0\\
            2 & -1  & 0
        \end{pmatrix}
    \quad \text{and} \quad
    W' =
        \begin{pmatrix}
            2 & 3   & 4   & 2 & 0\\
            1 & 3   & -1  & 2 & 3\\
            1 & -2  & 1   & 0 & 1
        \end{pmatrix}.
\end{equation*}
Assume that the model uses the __CBOW__ architecture and that the one-hot indices correspond to the order of the words in the vocabulary above (the one-hot vector $(1, 0, 0, 0, 0)$ encodes the word $A$, the vector $(0, 1, 0, 0, 0)$ encodes $B$ and so on).

1. What is one way the word $A$ can be embedded using these matrices?
2. Suppose that at one point in time during training $W$ and $W'$ have the above values and the window is $B\ C\ D$. What would be the corresponding network input and output?
3. What loss function is used in Word2Vec? What would be the inputs of the loss function in the given example?
4. Compute the loss for the given example.

#### Solution

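1. The one-hot vector of the word $A$ is $x_A = (1, 0, 0, 0, 0)$. Multiplying it with $W$ in the hidden layer gives $h = W^T x_A = (1, -1, 0)$, which is the first row of the matrix $W$; this is the (input) embedding of $A$. Alternatively, the first column of $W'$, $(2, 1, 1)$, can be used as an (output) embedding of $A$. Note that $W'^T h$ is not an embedding: it is the 5-dimensional vector of output scores over the vocabulary.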
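2. In CBOW the context words are the input and the center word is the prediction target. For the window $B\ C\ D$ the input therefore consists of the one-hot vectors of $B$ and $D$:
$$
(0,1,0,0,0)\\
(0,0,0,1,0)
$$
Hidden layer (average of the two context embeddings):
$$
h = \frac{1}{2} W^T (B + D)\\
  = (1, 1, 1)
$$
Output scores:
$$
y = W'^T h = (4, 4, 4, 4, 4)
$$
Applying the softmax to $y$ gives the uniform distribution $(1/5, 1/5, 1/5, 1/5, 1/5)$ as the network output; the target is the one-hot vector of $C$, $(0,0,1,0,0)$.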
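3. Word2Vec with a full softmax output uses the cross-entropy loss, i.e. the negative log-likelihood of the target word (the negative-sampling variant uses a binary logistic loss instead). In the given example its inputs are the softmax of the output scores $y$ and the one-hot vector of the target word $C$, $(0, 0, 1, 0, 0)$.
4. All five output scores are equal, so the softmax assigns probability $1/5$ to every word and the loss is $-\log(1/5) = \log 5 \approx 1.609$.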
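
The forward pass and the loss for this exercise can be checked numerically. The following is a minimal sketch in plain Python, assuming the standard CBOW convention (context words $B$ and $D$ as input, center word $C$ as target) and the natural logarithm; the helper name `cbow_forward` is hypothetical and not part of the assignment.

```python
import math

# Weight matrices from the task statement.
W = [[1, -1, 0],
     [0, 1, 2],
     [2, 2, 2],
     [2, 1, 0],
     [2, -1, 0]]
W_prime = [[2, 3, 4, 2, 0],
           [1, 3, -1, 2, 3],
           [1, -2, 1, 0, 1]]
vocab = ["A", "B", "C", "D", "E"]

def cbow_forward(context, target):
    """Return hidden vector, scores, softmax probabilities and loss."""
    rows = [W[vocab.index(w)] for w in context]
    # Hidden layer: average of the context words' input embeddings.
    h = [sum(col) / len(rows) for col in zip(*rows)]
    # Scores: dot product of h with each column of W'.
    scores = [sum(h[k] * W_prime[k][j] for k in range(len(h)))
              for j in range(len(vocab))]
    # Numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    probs = [e / sum(exps) for e in exps]
    # Cross-entropy against the one-hot target.
    loss = -math.log(probs[vocab.index(target)])
    return h, scores, probs, loss

h, scores, probs, loss = cbow_forward(context=["B", "D"], target="C")
print(h, scores, round(loss, 4))
```

Because every column of $W'$ sums to the same value, any hidden vector with equal components produces equal scores, which is why the softmax is uniform here.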

### 2. Backpropagation through Time 
What happens to the gradient in vanilla RNNs if you backpropagate through a long sequence? What are some of the possible solutions proposed in literature to solve such problems?

#### Solution

Your solution goes here
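
As a small illustration of the question, the sketch below tracks the gradient of the final state with respect to the initial state in a scalar toy RNN $h_t = \tanh(w\,h_{t-1})$. It is only an assumption-laden toy: the state starts at $h_0 = 0$, so the $\tanh$ derivative is exactly 1 and the gradient reduces to $w^T$; the function names are hypothetical. Gradient clipping is shown as one remedy for the exploding case.

```python
import math

def grad_wrt_initial_state(w, steps, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: each step multiplies by w * tanh'(.) = w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return grad

def clip_by_norm(grad, max_norm):
    """Rescale a scalar gradient so that its magnitude is at most max_norm."""
    norm = abs(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

for w in (0.5, 1.5):
    g = grad_wrt_initial_state(w, steps=50)
    print(f"w={w}: gradient={g:.3e}, clipped={clip_by_norm(g, 5.0):.3e}")
```

With a recurrent weight below 1 the product shrinks towards zero over 50 steps, and with a weight above 1 it grows by many orders of magnitude; clipping bounds the large gradient but does not help the vanishing one.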

### C0.
We recommend reading [this tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html) on building an LSTM model in PyTorch.

### C1. Word2Vec
The [MovieLens 25M](https://grouplens.org/datasets/movielens/25m/) dataset contains movie titles and corresponding tags added by users. For every movie in the dataset, we concatenate all tags and treat the resulting list of tags as a sentence.

Your task is to build a simple search engine based on Word2Vec, treating each tag as a word:

1. Train a Word2Vec model on the tag sentences to obtain $64$-dimensional word embeddings. Use [gensim's Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class for this task. Set the window size to $5$ and `min_count` to $1$.

Now suppose we represent a movie $m$ that has a set of tags $T$ as the average of all word vectors, i.e.
\begin{equation}
    v_m = \frac{\sum_{t \in T} E(t)}{|T|}
\end{equation}
where $E(t)$ is the embedding of a tag $t$ and $v_m$ is the vector representation of $m$.

2. Implement a small search engine, where each movie is represented by a vector as described above. Queries are represented the same way. The relevance of a movie w.r.t. a query should be the cosine similarity between the two vectors. Print the top-$10$ results (the movie titles) for the query title `Toy Story (1995)`. Hint: the most relevant movie to a query is the query movie itself. It should have a cosine similarity of 1.0.
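
The averaging and ranking steps described above can be sketched on a toy example. The snippet below uses made-up 2-dimensional embeddings and hypothetical movie names instead of trained Word2Vec vectors, and plain Python instead of gensim or PyTorch; it is only meant to show the mechanics.

```python
import math

# Toy 2-dimensional tag embeddings (made-up values, not trained vectors).
E = {"pixar": [1.0, 0.0], "animation": [0.8, 0.2], "war": [0.0, 1.0]}
movie_tags = {
    "Movie X": ["pixar", "animation"],
    "Movie Y": ["animation"],
    "Movie Z": ["war"],
}

def movie_vector(tags):
    """Average of the tag embeddings, i.e. v_m from the formula above."""
    vectors = [E[t] for t in tags]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_title, k=10):
    """Rank all movies by cosine similarity to the query movie's vector."""
    q = movie_vector(movie_tags[query_title])
    scored = [(title, cosine(q, movie_vector(tags)))
              for title, tags in movie_tags.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

results = search("Movie X")
print(results)
```

The query movie ranks first with a similarity of (numerically almost exactly) 1, followed by the movie that shares a tag with it.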

Run the cells below to download, extract and set up the data.

In [None]:
# installing gensim
!pip install gensim==4.0.0

Collecting gensim==4.0.0
  Downloading https://files.pythonhosted.org/packages/c3/dd/5e00b6e788a9c522b48f9df10472b2017102ffa65b10bc657471e0713542/gensim-4.0.0-cp37-cp37m-manylinux1_x86_64.whl (23.9MB)
     |████████████████████████████████| 23.9MB 44.7MB/s 
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.0


In [None]:
!wget 'http://files.grouplens.org/datasets/movielens/ml-25m.zip' && unzip -o 'ml-25m.zip'

--2021-05-17 14:30:16--  http://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip’


2021-05-17 14:30:33 (15.2 MB/s) - ‘ml-25m.zip’ saved [261978986/261978986]

Archive:  ml-25m.zip
   creating: ml-25m/
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       


In [None]:
# load and preprocess data
from pathlib import Path
import pandas as pd
data_dir = Path.cwd() / 'ml-25m'
movies_df = pd.read_csv(data_dir / 'movies.csv')
tags_df = pd.read_csv(data_dir / 'tags.csv', converters={'tag': str}).groupby('movieId')['tag'].agg(list)
df = pd.merge(movies_df[['movieId', 'title']], tags_df, how='right', left_on='movieId', right_index=True).rename(columns={'tag': 'tags'})
df['movie_id'] = list(range(len(df)))

# these dictionaries contain everything you need for the task
movie_id_to_title = dict(zip(df['movie_id'], df['title']))  # maps movie_id to title
title_to_movie_id = dict(zip(df['title'], df['movie_id']))  # maps title to movie_id
movie_id_to_tags = dict(zip(df['movie_id'], df['tags']))    # maps movie_id to a list of tags. Use all the lists of tags as the input to train the word2vec model.

print(f'number of movies in dataset: {len(movie_id_to_title)}')
print(f'first movie: title: {movie_id_to_title[0]}')
print(f'first movie: first 20 tags: {movie_id_to_tags[0][:20]}...')
print(f'id of title "Toy Story (1995)": {title_to_movie_id["Toy Story (1995)"]}')

number of movies in dataset: 45251
first movie: title: Toy Story (1995)
first movie: first 20 tags: ['Owned', 'imdb top 250', 'Pixar', 'Pixar', 'time travel', 'children', 'comedy', 'funny', 'witty', 'rated-G', 'animation', 'Pixar', 'computer animation', 'good cartoon chindren', 'pixar', 'friendship', 'bright', 'DARING RESCUES', 'fanciful', 'HEROIC MISSION']...
id of title "Toy Story (1995)": 0


#### Solution

In [None]:
import torch
from torch.nn.functional import cosine_similarity
from gensim.models import Word2Vec # you can ignore any UserWarning about the missing levenshtein module

# ToDo: Your solution goes here
# Each movie's list of tags is one training sentence.
sentences = list(movie_id_to_tags.values())

# 64-dimensional embeddings, window size 5, keep every tag (min_count=1).
model = Word2Vec(sentences, vector_size=64, window=5, min_count=1)
model.save("word2vec.model")

In [None]:
model = Word2Vec.load("word2vec.model")
word_vectors = model.wv

In [None]:
# Represent each movie as the average of its tag embeddings (v_m in the task description).
# Indexing the KeyedVectors with a list of tags returns a 2D array of shape (num_tags, 64),
# so the mean over axis 0 is the averaged 64-dimensional movie vector.
# movie_id_to_tags is keyed by 0, 1, 2, ... in order, so movie_vectors[i] belongs to movie_id i.
movie_vectors = []
for movie_id, tags in movie_id_to_tags.items():
    movie_vectors.append(word_vectors[tags].mean(axis=0))

In [None]:
title_to_vector = {}
for i in range(len(movie_id_to_tags)):
    title_to_vector[movie_id_to_title[i]] = movie_vectors[i]

In [None]:
# Score every movie by the cosine similarity between its vector and the query movie's vector.
query = torch.Tensor(title_to_vector["Toy Story (1995)"])
score = {}
for title, vector in title_to_vector.items():
    score[title] = cosine_similarity(query, torch.Tensor(vector), dim=-1)
# Sort by similarity, most relevant first.
sorted_score = sorted(score.items(), key=lambda d: d[1], reverse=True)

In [None]:
sorted_score[:10]

[('Toy Story (1995)', tensor(1.)),
 ('Toy Story 2 (1999)', tensor(0.9921)),
 ('Finding Nemo (2003)', tensor(0.9914)),
 ("Bug's Life, A (1998)", tensor(0.9879)),
 ('Ice Age (2002)', tensor(0.9865)),
 ('Monsters, Inc. (2001)', tensor(0.9835)),
 ('Storks (2016)', tensor(0.9821)),
 ('The Secret Life of Pets (2016)', tensor(0.9786)),
 ('Ratatouille (2007)', tensor(0.9774)),
 ("Emperor's New Groove, The (2000)", tensor(0.9758))]

### C2. Sentiment Classification with LSTMs
This task is about implementing a many-to-one LSTM for sentiment classification. We will use the IMDB dataset, which contains movie reviews associated with sentiments (positive/negative). The task is to classify each review into one of these two classes.

We provide you with the following data setup:
First, we install `torchtext` with `pip install torchtext==0.9.1`. Then we load the IMDB dataset from `torchtext.datasets`, build the vocabulary and split the data into train/valid/test instances using predefined functions from `torchtext.data`. Don't worry if you get confused by the dataset API; let's focus on the model architecture and training methods.

In [None]:
!pip install torchtext==0.9.1



In [None]:
import re
import torch
from torchtext.legacy import data, datasets

max_length = 200   # we want the maximum words in each text instance to be 200.
max_vocab = 20000  # We want the vocabulary size not to exceed 20000.

# define a function to preprocess raw text input, define the input and output field.
text_cleaner = lambda X: [re.sub(r'[^\w\s]', '', x) for x in X]  
TEXT = data.Field(preprocessing=text_cleaner, lower=True, batch_first=True, fix_length=max_length)
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)  # This will download IMDB for the first run.
train_data, valid_data = train_data.split(split_ratio=0.7)

# Init the vocabulary from training data. 
TEXT.build_vocab(train_data, max_size= max_vocab)
LABEL.build_vocab(train_data)

You can get the text vocabulary size with `len(TEXT.vocab)`. Note that two special tokens `('<pad>', '<unk>')` are added to the vocabulary. You can check the index of a token with e.g. `TEXT.vocab.stoi['<pad>']` and the inverse mapping with `TEXT.vocab.itos[1]`. Similarly, to check the index for labels, you can use `LABEL.vocab`.
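
To make the roles of `stoi`, `itos`, the special tokens and `fix_length` more concrete, here is a simplified plain-Python sketch of what a vocabulary and the padding step do. It is not the torchtext implementation: the helper names `build_vocab` and `encode` are hypothetical, and ties between equally frequent tokens may be ordered differently than in torchtext.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    """Map the max_size most frequent tokens to indices, after the special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi, fix_length):
    """Truncate or pad to fix_length and replace unknown tokens by '<unk>'."""
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokens[:fix_length]]
    return ids + [stoi["<pad>"]] * (fix_length - len(ids))

texts = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
stoi, itos = build_vocab(texts, max_size=2)
encoded = encode(["a", "great", "movie"], stoi, fix_length=5)
print(len(itos), encoded)
```

The vocabulary size is `max_size` plus the two special tokens, the unseen word `great` is mapped to the `<unk>` index, and the sequence is filled up with the `<pad>` index.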

Next, we make the dataset iterable (in batches) for training the model. We need to decide on the batch size and send the data to the GPU or CPU device. Note that each batch of data contains a tuple of input text and output label.

In [None]:
batch_size = 32   # you can change this size.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = batch_size,
    sort_within_batch = True,
    device = device)

In [None]:
# example of iterating the training data
for batch in train_iter:
    print(batch.text) # tensor of size (batch_size x max_length) containing the tokenized words
    print(batch.label) # 1d tensor containing the binary labels
    print(f'first sentence in batch:\n{" ".join([TEXT.vocab.itos[token_id] for token_id in batch.text[0]])}')
    print(f'first label in batch: {LABEL.vocab.itos[batch.label[0]]}')
    break

tensor([[132,  10,  60,  ...,   1,   1,   1],
        [250, 528,  39,  ...,   1,   1,   1],
        [150,  10, 175,  ...,   1,   1,   1],
        ...,
        [ 11,  20,   7,  ...,   1,   1,   1],
        [ 58, 601,  14,  ...,   1,   1,   1],
        [ 10, 417, 386,  ...,   1,   1,   1]])
tensor([1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 1, 1, 0, 1, 0, 1])
first sentence in batch:
while i would say i enjoy the show i expected something completely different from when i first saw what i like about you i expected to find something along the lines of all that i am not sure if it is going on anymore but i have to say i do like the show and while i dont classify it as a breakthrough show it is very charming and i do like the chemistry between the characters as well including the supporting castbr br i would definitely say that it is great to see wesley jonathan back on the screen because i really loved him in city guy i had also seen the woman who 

Now, the datasets have been prepared.


***Your task is to*** : 

1. Define a model named SentimentClassifier with the following features:
* An Embedding layer with embedding dimension 100. Use `nn.Embedding`.
* One bidirectional LSTM layer with hidden dimension 400. Use `nn.LSTM` and set `bidirectional=True`.
* Some dropout with probability 0.3 on the LSTM output. Use `nn.Dropout`.
* One fully connected layer to map the output features of the LSTM layer to a single output. Use a sigmoid activation for the output.

2. Train your model with binary cross entropy loss (`torch.nn.BCELoss`) and the Adam optimizer (use `torch.optim.Adam` with default parameters) for a maximum of 20 epochs. After every epoch, compute the validation loss. Implement an early stopping mechanism that keeps track of the best model parameters based on the lowest validation loss. Stop training if the validation loss does not improve for 3 consecutive epochs and revert back to the best model.

3. Evaluate the accuracy of your trained model on the `test_iter` dataset.
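
The early stopping logic from step 2 can be sketched independently of any model. The helper class `EarlyStopping` below is hypothetical (it is not part of PyTorch), and the validation losses are simulated numbers chosen only to exercise the logic.

```python
import copy

class EarlyStopping:
    """Track the best validation loss and signal when to stop (hypothetical helper)."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_loss, state):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)  # snapshot of the best parameters
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation losses; the "state" is just the epoch number here.
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([0.60, 0.50, 0.55, 0.52, 0.51, 0.40]):
    if stopper.step(loss, state={"epoch": epoch}):
        break
print(epoch, stopper.best_loss, stopper.best_state)
```

With a PyTorch model, `state` could be `model.state_dict()`, and reverting to the best model would then be `model.load_state_dict(stopper.best_state)`. The deep copy matters because the state dict otherwise refers to the live parameters, which keep changing during training.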

In [None]:
import torch.nn as nn
import torch.optim as optim

# todo: fill the __init__() and forward() function.
# Architecture: embedding -> one bidirectional LSTM layer -> dropout -> linear -> sigmoid.
class SentimentClassifier(nn.Module):
    def __init__(self):
        super(SentimentClassifier, self).__init__()

        self.output_size = 1
        self.n_layers = 1          # the task asks for a single bidirectional LSTM layer
        self.hidden_dim = 400
        self.vocab_size = 20002    # max_vocab plus the '<unk>' and '<pad>' tokens
        self.embedding_dim = 100
        self.bidirectional = True

        self.embed = nn.Embedding(self.vocab_size, self.embedding_dim)
        # batch_first=True because the iterator yields tensors of shape (batch_size, max_length)
        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, self.n_layers,
                            batch_first=True, bidirectional=self.bidirectional)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(self.hidden_dim * 2, self.output_size)
        self.sig = nn.Sigmoid()

    def forward(self, text, hidden=None):
        # `hidden` is accepted only so that callers passing a state keep working. It is ignored:
        # the reviews are independent, so every batch starts from a zero state.
        embed_out = self.embed(text)                 # (batch, seq, embedding_dim)
        _, hidden = self.lstm(embed_out)             # hidden = (h_n, c_n)
        h_n = hidden[0]                              # (n_layers * 2, batch, hidden_dim)
        # many-to-one: concatenate the final forward and backward states of the last layer
        out = torch.cat((h_n[-2], h_n[-1]), dim=1)   # (batch, hidden_dim * 2)
        out = self.dropout(out)
        sig_out = self.sig(self.fc(out)).squeeze(1)  # (batch,)
        return sig_out, hidden

    def init_hidden(self, batch_size=1):
        # zero (h_0, c_0) pair, kept for callers that still expect it
        weight = next(self.parameters()).data
        shape = (self.n_layers * 2, batch_size, self.hidden_dim)
        return (weight.new_zeros(shape), weight.new_zeros(shape))

Defining `loss function` and `optimizer`:

In [None]:
# todo
net = SentimentClassifier()
print(net)

loss_function = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

SentimentClassifier(
  (embed): Embedding(20002, 100)
  (lstm): LSTM(100, 400, batch_first=True, bidirectional=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=800, out_features=1, bias=True)
  (sig): Sigmoid()
)


In [None]:
train_on_gpu=torch.cuda.is_available()
 
if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

No GPU available, training on CPU.


Training

Please re-run this cell.

In [None]:
# todo
import numpy as np

Max_epoch = 20
clip = 5        # gradient-norm clipping threshold
patience = 3    # stop after this many consecutive epochs without improvement

if(train_on_gpu):
    net.cuda()

total_val_loss = []
best_val_loss = float('inf')
epochs_without_improvement = 0
for e in range(Max_epoch):

    net.train()
    h = net.init_hidden()
    counter = 0

    for text, labels in train_iter:
        counter += 1

        if(train_on_gpu):
            text, labels = text.cuda(), labels.cuda()

        # detach the state so that gradients do not flow back into earlier batches
        h = tuple([each.data for each in h])
        optimizer.zero_grad()

        output, h = net(text, h)

        loss = loss_function(output.squeeze(), labels.float())
        loss.backward()

        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

    # validation pass: dropout switched off, no gradient tracking
    val_h = net.init_hidden()
    val_losses = []
    net.eval()
    with torch.no_grad():
        for text, labels in valid_iter:

            if(train_on_gpu):
                text, labels = text.cuda(), labels.cuda()

            output, val_h = net(text, val_h)
            val_loss = loss_function(output.squeeze(), labels.float())
            val_losses.append(val_loss.item())

    total_val_loss.append(np.mean(val_losses))

    print("Epoch: {}/{}...".format(e+1, Max_epoch),
          "Step: {}...".format(counter),
          "Val Loss: {:.4f}".format(total_val_loss[e]))

    if total_val_loss[e] < best_val_loss:
        # new best model: save it so that it can be restored after training
        torch.save(net, "best_net.pth")
        best_val_loss = total_val_loss[e]
        epochs_without_improvement = 0
        print("saved best model: epoch", e+1)
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print("Early Stopping\n")
            print("best val loss: {:.4f}".format(best_val_loss))
            break


KeyboardInterrupt: ignored

Testing

In [None]:
# todo
net = torch.load("best_net.pth")   # restore the best model saved during training
net.eval()                         # disable dropout for evaluation
test_h = net.init_hidden()
test_losses = []
correct = 0
total = 0
with torch.no_grad():
    for text, labels in test_iter:

        if(train_on_gpu):
            text, labels = text.cuda(), labels.cuda()

        output, test_h = net(text, test_h)
        output = output.squeeze()
        test_loss = loss_function(output, labels.float())
        test_losses.append(test_loss.item())

        # threshold the sigmoid output at 0.5 to obtain the predicted class
        predictions = (output >= 0.5).long()
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

# accuracy over all test instances, not the mean of per-batch accuracies
print("total test loss : %.4f" % np.mean(test_losses))
print("total test accuracy: %.4f" % (correct / total))