# Deep Learning
## Exercise 5 - Recurrent Neural Networks

### 0. LSTMs in PyTorch
We recommend reading [this tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html) on building an LSTM model in PyTorch.

### 1. Word2Vec
The [MovieLens 25M](https://grouplens.org/datasets/movielens/25m/) dataset contains movie titles and corresponding tags added by users. For every movie in the dataset, we concatenate all tags and treat the resulting list of tags as a sentence.

Your task is to build a simple search engine based on Word2Vec, treating each tag as a word.

Run the cells below to download, extract and set up the data.

In [None]:
!wget -P './data' 'http://files.grouplens.org/datasets/movielens/ml-25m.zip' && unzip -o './data/ml-25m.zip' -d './data'

In [None]:
# load and preprocess data
from pathlib import Path
import pandas as pd
data_dir = './data/ml-25m'
movies_df = pd.read_csv(data_dir + '/movies.csv')
tags_df = pd.read_csv(data_dir + '/tags.csv', converters={'tag': str}).groupby('movieId')['tag'].agg(list)
df = pd.merge(movies_df[['movieId', 'title']], tags_df, how='right', left_on='movieId', right_index=True).rename(columns={'tag': 'tags'})
df = df.drop('movieId', axis=1)
df['movie_id'] = list(range(len(df)))
df = df.set_index('movie_id', drop=False)

print(f'number of movies in dataset: {df.shape[0]}')
print(f'first movie: title: {df.iloc[0]["title"]}')
print(f'first movie: first 20 tags: {df.iloc[0]["tags"][:20]}...')
print(f'id of title "Toy Story (1995)": {df.loc[df.loc[:,"title"]=="Toy Story (1995)", "movie_id"].item()}')

The data has been loaded into a pandas DataFrame, which now has the columns: `movie_id`, `title`, `tags`. It also has an index, which is identical to the column `movie_id`.

The cleanest way to access the data is using `.loc` or `.iloc` (make sure you understand the difference).
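To make the difference concrete, here is a minimal sketch on a toy frame. The titles, tags and index labels are made up for illustration and are not taken from MovieLens. `.loc` selects by index label, while `.iloc` selects by integer position.

```python
import pandas as pd

# Toy frame with a non-default index; all values are illustrative only.
toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "tags": [["x"], ["y", "z"], ["x", "z"]]},
    index=[10, 20, 30],
)

by_label = toy.loc[20, "title"]      # row whose index label is 20
by_position = toy.iloc[0]["title"]   # first row, regardless of its label

# Boolean masks also work with .loc, as in the data-loading cell above.
match = toy.loc[toy["title"] == "C", "tags"].item()

print(by_label, by_position, match)
```

In the exercise data the index equals `movie_id`, so labels and positions coincide until the frame is re-sorted; after sorting, only `.iloc` still means "the n-th row".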

In the following tasks, exploit the data structure and avoid explicit loops over the entries. *Hint*: check out `pandas.Series.map`, which applies a function to every entry of a column.
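As a small illustration of the hint, the sketch below applies functions to a column of tag lists without writing a loop. The frame and the helper `has_tag` are hypothetical and only serve the example.

```python
import pandas as pd

# Toy column of tag lists; values are illustrative only.
toy = pd.DataFrame({"tags": [["pixar", "animation"], ["noir"], []]})

# Series.map applies the function to every entry and returns a new Series.
toy["n_tags"] = toy["tags"].map(len)

# A lambda can forward extra arguments to a helper function.
def has_tag(tags, wanted):
    return wanted in tags

toy["is_noir"] = toy["tags"].map(lambda tags: has_tag(tags, "noir"))
print(toy)
```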

#### 1. Train a Word2Vec model

Use [gensim's Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class to train on the tag sentences and obtain $64$-dimensional word embeddings. Set the window size to $5$ and `min_count` to $1$.

In [None]:
import torch
from gensim.models import Word2Vec # you can ignore any UserWarning about the missing levenshtein module
import numpy as np

In [None]:
#ToDo: build your Word2Vec Model

In [None]:
model = Word2Vec(sentences = df['tags'], vector_size=64, window = 5, min_count=1)

In [None]:
#We only need the word vectors for all further tasks, so we don't need to keep the model.
#You can access the representation of 'word' by word_vectors['word']
word_vectors = model.wv
del model

#### 2. Create a vector representation for each movie

We want to represent a movie $m$ that has a set of tags $T$ as the average of the word vectors of its tags, i.e.
\begin{equation}
    v_m = \frac{\sum_{t \in T} E(t)}{|T|}
\end{equation}
where $E(t)$ is the embedding of a tag $t$ and $v_m$ is the vector representation of $m$.
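A worked example of this formula, assuming two tags with hypothetical 3-dimensional embeddings (real Word2Vec vectors would have 64 dimensions here):

```python
import numpy as np

# Hypothetical embeddings E(t) for two tags; not real Word2Vec output.
E = {
    "pixar": np.array([1.0, 0.0, 2.0]),
    "animation": np.array([3.0, 2.0, 0.0]),
}
T = ["pixar", "animation"]

# v_m = (sum of E(t) for t in T) / |T|
v_m = np.sum([E[t] for t in T], axis=0) / len(T)
print(v_m)  # [2. 1. 1.]
```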



In [None]:
#ToDo: Extract vector representations for each movie.
# Don't use a for-loop. Exploit the functionalities of the pandas.DataFrame().

In [None]:
def get_representation(tags_list, word_vectors):
    """
    Transform the list of tags into a representation for the corresponding movie.
    
    Input values:
        tags_list : a list of tags
        word_vectors : the KeyedVectors from your Word2Vec model.
    
    Output value:
        representation : the mean of the tag embeddings (1-D numpy array)
    """
    vectors_list=[word_vectors[tag] for tag in tags_list]
    vectors = np.array(vectors_list)
    representation = np.mean(vectors, axis=0)
    return representation

print(get_representation(df.loc[0,'tags'], word_vectors)) 
df['representations'] = df['tags'].map(lambda tags: get_representation(tags, word_vectors))

#### 3. Implement a small search engine for the query `Toy Story (1995)`

For movies and queries, we use the representation defined above. The relevance of a movie w.r.t. a query should be the cosine similarity between the two vectors.

Print the top-$10$ results (the movie titles) for the query title `Toy Story (1995)`.

*Hint*: the most relevant movie to a query is the query movie itself. It should have a cosine similarity of 1.0.
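The cosine similarity of two vectors $q$ and $v$ is $\frac{q \cdot v}{\|q\| \, \|v\|}$. The sketch below shows one possible way to score a query against all movies at once and pick the best matches. The 2-dimensional representations are made up; stacking all representations into one tensor is an assumption of this sketch, not a requirement of the task.

```python
import torch
from torch.nn.functional import cosine_similarity

# Hypothetical 2-dimensional movie representations, one row per movie.
reps = torch.tensor([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
query = reps[0]  # use the first movie as the query

# Compare the query against every row at once: one score per movie.
sims = cosine_similarity(reps, query.unsqueeze(0), dim=1)

# Scores and indices of the two most similar movies; the query itself scores 1.0.
top_scores, top_idx = torch.topk(sims, k=2)
print(sims, top_idx)
```

Note that the second movie also scores 1.0: cosine similarity only compares directions, so a vector and any positive multiple of it are treated as identical.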

In [None]:
from torch.nn.functional import cosine_similarity
q = "Toy Story (1995)"


In [None]:
#ToDo: Find the 10 most similar movies to the query

In [None]:
query = get_representation(df.loc[df.loc[:,"title"]==q, "tags"].item(), word_vectors)
df['cosine_sim'] = df['representations'].map(
    lambda rep: cosine_similarity(torch.tensor(rep), torch.tensor(query), dim=0).item())
df = df.sort_values(by='cosine_sim', ascending=False)
print(df.iloc[:10][['title', 'cosine_sim']])

### 2. Sentiment Classification with LSTMs
This task is about implementing a many-to-one LSTM for sentiment classification. We will use the IMDB dataset, which contains movie reviews associated with sentiments (positive/negative). The task is to classify each review into one of these two classes.

We provide you with the following data setup:
First, we load the IMDB dataset from `torchtext.datasets`, build the vocabulary, and split the data into train/validation/test sets. Don't worry if the dataset API is confusing; let's focus on the model architecture and training methods.

For development and debugging you can set `debugging = True` which will load a smaller subset for training and testing.

In [None]:
import random
import re
import torch
from torchtext import data, datasets, vocab
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

import numpy as np
from collections import Counter, OrderedDict
from sklearn.model_selection import train_test_split

random_seed = 0
data_directory = './data'
debugging = True #This can be set to True, if you want to test your implementation on a smaller subset


random.seed(random_seed)
torch.manual_seed(random_seed)
np.random.seed(random_seed)

max_length = 200   # we want the maximum words in each text instance to be 200.
max_vocab = 20000  # We want the vocabulary size not to exceed 20000.

# define a function to preprocess and tokenize raw text input
tokenizer = data.get_tokenizer('basic_english')
def text_tokenizer(entry):
    entry = re.sub(r'<\w{1,2} />', ' ', entry) #replace <br /> and similar
    entry = re.sub(r'[^\w\s]', ' ', entry) #replace every character that is neither a word character nor whitespace
    entry = re.sub(r'\s+', ' ', entry) #replace multiple spaces by one space
    tokens = tokenizer(entry)
    return tokens

# read the dataset, the first call also downloads the dataset. Split the training_data into training and validation
train_set, test_set = datasets.IMDB(root=data_directory)
train_set = list(train_set)
test_set = list(test_set)


if debugging: 
    train_labels = [l for l,t in train_set]
    test_labels = [l for l, t in test_set]
    train_set, _ = train_test_split(train_set, train_size=0.2, stratify=train_labels, random_state=random_seed)
    test_set, _ = train_test_split(test_set, train_size=0.2, stratify=test_labels, random_state=random_seed)

train_labels = [l for l,t in train_set]
train_set, val_set = train_test_split(train_set, train_size=0.7, stratify=train_labels, random_state=random_seed)

# build the vocabulary from the training data
counter = Counter()
for label, text in train_set:
    tokens = text_tokenizer(text)
    counter.update(tokens)
    
vocabulary = vocab.vocab(OrderedDict(counter.most_common()[:max_vocab]))
special_tokens = ['<unk>', '<pad>']
for i, tok in enumerate(special_tokens):
    vocabulary.insert_token(tok, i)
vocabulary.set_default_index(vocabulary['<unk>'])

You can get the text vocabulary size with `len(vocabulary)`. Note that two special tokens, `'<unk>'` (index 0) and `'<pad>'` (index 1), are added to the vocabulary. You can look up an index with e.g. `vocabulary.get_stoi()['<pad>']` or simply `vocabulary['<pad>']`, and the inverse mapping with `vocabulary.get_itos()[1]`.
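To illustrate how the lookup behaves, here is a plain-Python stand-in for the vocabulary. It does not use torchtext; the names `itos`, `stoi` and `lookup` are hypothetical and only mimic the behaviour described above, including the fallback to `'<unk>'` for unseen tokens.

```python
from collections import Counter

# Illustrative stand-in for the torchtext vocabulary, built with plain Python.
tokens = "a great movie a great cast a dull plot".split()
counter = Counter(tokens)

specials = ["<unk>", "<pad>"]  # same order as in the cell above
# Keep only the 3 most frequent tokens (the exercise keeps up to 20000).
itos = specials + [tok for tok, _ in counter.most_common(3)]
stoi = {tok: idx for idx, tok in enumerate(itos)}

def lookup(token):
    # Unknown tokens fall back to the index of '<unk>', like set_default_index.
    return stoi.get(token, stoi["<unk>"])

print(len(itos), lookup("<pad>"), lookup("a"), lookup("unseen"))
```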

Next, we make the dataset iterable (in batches) for training the model. We need to decide on the batch size. Note that each batch of data contains a tuple of input text and output label.
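Reviews differ in length, so each batch has to be truncated and padded to a common length before it can be stacked into one tensor. A minimal sketch with made-up token indices, assuming the padding index is 1 and using a small cap of 4 tokens instead of 200:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_index = 1   # assumed index of '<pad>'
max_length = 4  # small cap for illustration (the exercise uses 200)

# Three token-index sequences of different lengths (made-up indices).
sequences = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
tensors = [torch.tensor(seq[:max_length]) for seq in sequences]

# Pad to the longest sequence in the batch; batch_first gives (batch, time).
batch = pad_sequence(tensors, padding_value=pad_index, batch_first=True)
print(batch.shape)
print(batch)
```

The padded width is the length of the longest sequence in the batch, so it can be smaller than `max_length` when all reviews in a batch are short.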

In [None]:
# build the train, validation and test dataloaders
def collate_fn(batch):
    labels, indexes = [], []
    for label, text in batch:
        labels.append(1 if label =='pos' else 0)
        
        tokens = text_tokenizer(text)
        indexes += [torch.tensor([vocabulary[t] for t in tokens][:max_length])]
    labels = torch.tensor(labels)
    padded_indices = pad_sequence(indexes, padding_value=vocabulary['<pad>'], batch_first=True)
        
    return labels, padded_indices

train_dataloader = DataLoader(train_set, batch_size=32, shuffle=True, 
                              collate_fn=collate_fn, drop_last=True)

val_dataloader = DataLoader(val_set, batch_size=32, shuffle=True, 
                              collate_fn=collate_fn, drop_last=True)

test_dataloader = DataLoader(test_set, batch_size=32, shuffle=False, 
                              collate_fn=collate_fn, drop_last=False)

In [None]:
# example of iterating the training data
for label, indices in train_dataloader:
    print(indices) # tensor of size (batch_size x sequence length), sequence length <= max_length, containing the token indices
    print(label) # 1d tensor containing the binary labels
    print(f'first sentence in batch:\n{" ".join(vocabulary.lookup_tokens(indices.squeeze()[0].numpy().tolist()))}')
    print(f'first label in batch: {"pos" if label[0]==1 else "neg"}')
    break

Now, the datasets have been prepared.

#### 1. Define a model named SentimentClassifier with following features:
* An Embedding layer with 100 embedding dimension. Use `nn.Embedding`.
* One bidirectional LSTM layer with hidden dimension 400. Use `nn.LSTM` and set `bidirectional=True`.
* Some dropout with probability 0.3 on the LSTM output. Use `nn.Dropout`.
* One fully connected layer to map the output features of the LSTM layer to a single output. Use a sigmoid activation for the output.
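Before wiring these layers together, it can help to check the tensor shapes on a toy example. The sketch below uses deliberately small, made-up sizes rather than the exercise settings. It shows that a bidirectional LSTM concatenates both directions, so its output has twice `hidden_size` features, and where each direction's final state sits in the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, hidden = 50, 8, 6   # toy sizes, not the exercise settings
batch_size, seq_len = 4, 5

emb = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
output, (h_n, c_n) = lstm(emb(tokens))

print(output.shape)  # (batch, seq_len, 2 * hidden): both directions concatenated
print(h_n.shape)     # (2, batch, hidden): final state of each direction

# The forward direction finishes at the last time step, the backward one at the first.
forward_last = output[:, -1, :hidden]
backward_last = output[:, 0, hidden:]

# One possible many-to-one summary: concatenate both final states.
sentence_vec = torch.cat([h_n[0], h_n[1]], dim=1)
print(sentence_vec.shape)
```

Whichever summary you pick, the input size of the fully connected layer has to match its feature dimension.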

In [None]:
import torch.nn as nn
import torch.optim as optim

In [None]:
# todo: fill the __init__() and forward() function. Add arguments to both if you need them.
class SentimentClassifier(nn.Module):
    def __init__(self, ):
        super(SentimentClassifier, self).__init__()
        
        
    def forward(self, ):
        
    
SC = SentimentClassifier()

for label, indices in train_dataloader:
    print(indices[0])
    print(SC(indices))
    break

In [None]:
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size):
        super(SentimentClassifier, self).__init__()
        self.emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=100)
        # 200 hidden units per direction, so the concatenated output has 2 * 200 = 400 features
        self.lstm = nn.LSTM(input_size=100, hidden_size=200, num_layers=1, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(p=0.3)
        self.full_conn = nn.Linear(in_features=400, out_features=1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, text):
        h = self.emb(text)      # (batch, seq_len, 100)
        h,_ = self.lstm(h)      # (batch, seq_len, 400)
        h = h[:, -1, :]         # output at the last time step (a padding position for shorter reviews)
        h = self.dropout(h)
        h = self.full_conn(h)   # (batch, 1)
        res = self.sigmoid(h)
        return res.flatten()    # (batch,) probabilities of the positive class
    
SC = SentimentClassifier(len(vocabulary))

for label, indices in train_dataloader:
    print(indices[0])
    print(SC(indices))
    break


#### 2. Train your model

Use the binary cross-entropy loss (`torch.nn.BCELoss`) and the Adam optimizer (`torch.optim.Adam` with default parameters), and train for a maximum of 20 epochs. After every epoch, compute the validation loss. Implement an early stopping mechanism that keeps track of the best model parameters, based on the lowest validation loss. Stop training if the validation loss does not improve for 3 consecutive epochs, and revert to the best model.
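The early stopping logic can be tried out in isolation before it goes into a training loop. The sketch below runs it on a made-up list of validation losses; the function name `early_stopping_demo`, the `patience` argument and the dictionary standing in for the model parameters are all hypothetical.

```python
import copy

def early_stopping_demo(val_losses, patience):
    """Return (best_epoch, best_loss, stop_epoch, best_state) for given validation losses."""
    best_loss = float("inf")
    best_epoch = None
    best_state = None   # stands in for a copy of model.state_dict()
    epochs_without_improvement = 0
    stop_epoch = None

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember the loss and snapshot the "parameters".
            best_loss, best_epoch = loss, epoch
            best_state = copy.deepcopy({"epoch": epoch})
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            stop_epoch = epoch
            break
    return best_epoch, best_loss, stop_epoch, best_state

# Made-up validation losses: improvement stops after epoch 2.
result = early_stopping_demo([0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.40], patience=3)
print(result)  # (2, 0.5, 5, {'epoch': 2})
```

Training stops at epoch 5 and never sees the lower loss at epoch 6. That is the trade-off of early stopping: a larger patience costs more epochs but is less likely to stop too soon.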

In [None]:
#ToDo: Implement training your model with early stopping

In [None]:
from tqdm import tqdm
loss_function = torch.nn.BCELoss()
optimizer = torch.optim.Adam(SC.parameters())

def train(num_epochs, model, loss_function, optimizer, train_loader, val_loader, break_criterium):
    best_val_loss = float('inf')
    no_improve=0
    for epoch in range(num_epochs):
        train_loss = 0
        model.train()
        for labels, texts in tqdm(train_loader, desc='Train Iter', ascii=True):
            output = model(texts)
            loss = loss_function(output, labels.float())
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            train_loss += loss.item()
        train_loss = train_loss / len(train_loader)
        model.eval()
        with torch.no_grad():
            cum_loss = 0
            for labels, texts in tqdm(val_loader, desc='Valid Iter', ascii=True):
                output = model(texts)
                loss = loss_function(output, labels.float())
                cum_loss += loss.item()
            cum_loss = cum_loss/len(val_loader)
            print(f"Epoch {epoch} \t Train Loss {train_loss:.5f} \t Val Loss {cum_loss:.5f}")
            if cum_loss < best_val_loss:
                best_val_loss = cum_loss
                torch.save(model.state_dict(), 'imdb.pt')   # save the best model
                no_improve=0
            else:
                no_improve+=1
            if no_improve >= break_criterium:
                break
    model.load_state_dict(torch.load('imdb.pt'))
    return model
    
SC = train(20, SC, loss_function, optimizer, train_dataloader, val_dataloader, 3) 

#### 3. Evaluate the accuracy of your trained model on the test set (`test_dataloader`).
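Accuracy is the fraction of reviews whose thresholded prediction matches the label. A minimal sketch on made-up sigmoid outputs, assuming a decision threshold of 0.5:

```python
import torch

# Made-up sigmoid outputs and labels for six reviews.
probs = torch.tensor([0.91, 0.20, 0.65, 0.49, 0.08, 0.77])
labels = torch.tensor([1, 0, 0, 1, 0, 1])

preds = (probs > 0.5).long()                # threshold at 0.5
correct = (preds == labels).sum().item()    # number of matching predictions
accuracy = correct / labels.numel()
print(f"{correct}/{labels.numel()} correct, accuracy {accuracy:.4f}")
```

When evaluating over a dataloader, accumulate `correct` and the number of examples across batches and divide once at the end, since the last batch may be smaller than the others.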

In [None]:
#ToDo: Implement the evaluation of your model

In [None]:

def evaluate(model, test_loader):
    model.eval()
    correct = 0
    total_entries = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for labels, texts in tqdm(test_loader, desc='Test Iter', ascii=True):
            output = model(texts)
            preds = (output>0.5).int()
            correct += (preds == labels).sum().item()
            total_entries += len(texts)
    return correct/total_entries
    
acc = evaluate(SC, test_dataloader)

print(f"Test Accuracy: {acc:.4f}")