# Task definition
Implement an LSTM sentiment tagger for the IMDb reviews dataset.

1. (5pt) Fill in the missing code below
    * 1pt: implement vectorization
    * 2pt: implement the \_\_init\_\_ and forward methods of the model
    * 2pt: implement the collate function
2. (4pt) Implement the training loop and choose a proper loss function. Use ClearML for max points.
    * 2pt is the baseline for well-written, working code
    * 2pt if ClearML is used properly
3. (3pt) Train the model (find proper hyperparameters). Make sure you are neither overfitting nor underfitting. Visualize the training of your best model (plot training and test loss/accuracy over time). Your model should reach at least 87% accuracy. For max points it should exceed 89%.
    * 1pt for accuracy above 89%
    * 1pt for accuracy above 87%
    * 1pt for visualizations

Remarks:
* Use embeddings of size 50.
* Use a 0.5 threshold when computing accuracy.
* Use the supplied dataset for training and evaluation.
* You do not have to use a validation set.
* You should monitor overfitting during training.
* For max points, use ClearML to store and manage logs from your experiments.
* We encourage you to use the PyTorch Lightning library (additional point for using it, but the total must not exceed 12).
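
As a hedged illustration of the 0.5-threshold remark, the sketch below counts a prediction as positive when its probability is at least 0.5. The helper name `accuracy_at_threshold` and the decision to treat exactly 0.5 as positive are assumptions made for this example, not part of the assignment.

```python
def accuracy_at_threshold(probs, labels, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding.

    Assumption: a probability equal to the threshold counts as positive.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("cannot compute accuracy of an empty batch")
    predicted = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(p == int(y)) for p, y in zip(predicted, labels))
    return correct / len(labels)


# Toy batch: three of the four predictions agree with the labels.
print(accuracy_at_threshold([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))  # 0.75
```

Note that `torch.round` rounds halves to the nearest even number, so code that relies on rounding treats a probability of exactly 0.5 as negative.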

[Clear ML documentation](https://clear.ml/docs/latest/docs/)

[Clear ML notebook exercise from bootcamp](https://colab.research.google.com/drive/1wtLb4gg8beLS7smcyJlOZppn6_rQvSxL?usp=sharing)

In [1]:
!pip install clearml
!pip install pytorch-lightning

import os
from collections import defaultdict

import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import torchtext
from clearml import Task

import torch
from torch import nn
from torch import optim
from torch.nn import functional as F

from torch.utils.data import Dataset, DataLoader

from pytorch_lightning import Trainer, LightningDataModule, LightningModule, Callback
from pytorch_lightning.callbacks import ModelCheckpoint



In [2]:
web_server = 'https://app.community.clear.ml'
api_server = 'https://api.community.clear.ml'
files_server = 'https://files.community.clear.ml'
# Read the credentials from the environment instead of hardcoding them in the notebook.
access_key = os.environ.get('CLEARML_API_ACCESS_KEY', '')
secret_key = os.environ.get('CLEARML_API_SECRET_KEY', '')

Task.set_credentials(web_host=web_server,
                     api_host=api_server,
                     files_host=files_server,
                     key=access_key,
                     secret=secret_key)

In [3]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1hK-3iiRPlbePb99Fe-34LJNZ5yB-nduq
!tar -xvzf imdb_dataset.gz
data = pd.read_csv("imdb_dataset.csv")

Downloading...
From: https://drive.google.com/uc?id=1hK-3iiRPlbePb99Fe-34LJNZ5yB-nduq
To: /content/imdb_dataset.gz
100% 77.0M/77.0M [00:00<00:00, 211MB/s]
imdb_dataset.csv



In [4]:
data

Unnamed: 0.1,Unnamed: 0,text,split,id,stars,sentiment,tokenized
0,0,"Gary Cooper, (Michael Brandon) played the role...",test,6182,8,1.0,"gary cooper , ( michael brandon ) played the r..."
1,1,"This film is a tapestry, a series of portraits...",test,7654,10,1.0,"this film is a tapestry , a series of portrait..."
2,2,i see there are great reviews of this film alr...,test,10435,7,1.0,i see there are great reviews of this film alr...
3,3,This film says everything there is to say abou...,test,10476,10,1.0,this film says everything there is to say abou...
4,4,Apparently this Australian film based on Nevil...,test,9769,9,1.0,apparently this australian film based on nevil...
...,...,...,...,...,...,...,...
99995,99995,Wow! I am still in absolute shock from this fi...,unsup,13698,0,,wow ! i am still in absolute shock from this f...
99996,99996,As someone who always likes to solve the New Y...,unsup,6887,0,,as someone who always likes to solve the new y...
99997,99997,What can i say positive about this movie? Abso...,unsup,13748,0,,what can i say positive about this movie ? abs...
99998,99998,I am really amazed how bad acting can really b...,unsup,48085,0,,i am really amazed how bad acting can really b...


In [5]:
import re
from collections import Counter
from itertools import chain

PADDING_VALUE = 0

class NaiveVectorizer:
    def __init__(self, tokenized_data, **kwargs):
        """Converts data from strings to vectors of ints that represent words.
        Prepares a lookup dict (self.vw) that maps each token to an int. Index 0 is reserved for padding.
        """

        tokenized_data = [seq.split() for seq in tokenized_data]
        tokenized_data = list(chain(*tokenized_data))

        counter = Counter(tokenized_data)
        most_common = counter.most_common(5_000)

        self.vw = {token: ind for ind, (token, _) in enumerate(most_common, start=1)}

        ##################################

    def vectorize(self, tokenized_seq):
        """Converts a sequence of tokens into a sequence of indices.
        If a token does not appear in the vocabulary (self.vw), it is omitted.
        Returns a torch tensor of shape (seq_len,) and type long."""

        inds = []
        
        for token in tokenized_seq:
            if token in self.vw:
                inds.append(self.vw[token])

        return torch.tensor(inds, dtype=torch.long)

        ##################################

class ImdbDataset(Dataset):
    SPLIT_TYPES = ["train", "test", "unsup"]

    def __init__(self, data, preprocess_fn, split="train"):
        super(ImdbDataset, self).__init__()
        if split not in self.SPLIT_TYPES:
            raise ValueError(f"No such split type: {split}")

        self.split = split
        self.label = data.columns.get_loc("sentiment")
        self.data_col = data.columns.get_loc("tokenized")
        self.data = data[data["split"] == self.split]
        self.preprocess_fn = preprocess_fn

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        seq = self.preprocess_fn(self.data.iloc[idx, self.data_col].split())
        label = self.data.iloc[idx, self.label]
        return (seq, label)

naive_vectorizer = NaiveVectorizer(data.loc[data["split"] == "train", "tokenized"])

def get_datasets():
    train_dataset = ImdbDataset(data, naive_vectorizer.vectorize)
    test_dataset = ImdbDataset(data, naive_vectorizer.vectorize, split="test")
        
    return train_dataset, test_dataset

def custom_collate_fn(pairs):
    """Prepares a batch for the DataLoader.
    Input: list of tuples (sequence, label).
    Output: sequences padded to the same length (shape (max_len, batch)),
    original lengths of the sequences, labels.
    torch.nn.utils.rnn.pad_sequence might be useful here.
    """
    
    sequences = [seqc for seqc, _ in pairs]
    lengths = torch.Tensor([len(seqc) for seqc in sequences])
    labels = torch.Tensor([label for _, label in pairs])
    seqcs = nn.utils.rnn.pad_sequence(sequences, padding_value=PADDING_VALUE)

    #################################
    return seqcs, lengths, labels

In [6]:
naive_vectorizer.vw

{'the': 1,
 '.': 2,
 ',': 3,
 'and': 4,
 'a': 5,
 'of': 6,
 'to': 7,
 'is': 8,
 'it': 9,
 'in': 10,
 'i': 11,
 'this': 12,
 'that': 13,
 '"': 14,
 "'s": 15,
 '-': 16,
 'was': 17,
 'as': 18,
 'for': 19,
 'with': 20,
 'movie': 21,
 'but': 22,
 'film': 23,
 ')': 24,
 'on': 25,
 'you': 26,
 "n't": 27,
 '(': 28,
 'not': 29,
 'are': 30,
 'he': 31,
 'his': 32,
 'have': 33,
 'be': 34,
 'one': 35,
 '!': 36,
 'all': 37,
 'at': 38,
 'they': 39,
 'by': 40,
 'an': 41,
 'who': 42,
 'so': 43,
 'from': 44,
 'like': 45,
 'there': 46,
 'her': 47,
 'or': 48,
 'just': 49,
 'do': 50,
 "'": 51,
 'about': 52,
 'has': 53,
 'out': 54,
 'if': 55,
 'what': 56,
 '?': 57,
 'some': 58,
 'good': 59,
 'more': 60,
 'she': 61,
 'when': 62,
 'very': 63,
 'would': 64,
 'up': 65,
 'time': 66,
 'no': 67,
 'even': 68,
 'my': 69,
 'can': 70,
 'which': 71,
 'story': 72,
 'only': 73,
 'really': 74,
 'had': 75,
 'see': 76,
 'their': 77,
 'were': 78,
 'we': 79,
 'me': 80,
 'did': 81,
 'well': 82,
 'does': 83,
 'than': 84,
 'much
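
The lookup printed above assigns indices by frequency rank, starting at 1 so that 0 stays free for padding. A minimal, torch-free sketch of the same idea on a toy corpus follows. The helper names `build_vocab` and `to_indices`, the `max_vocab` parameter, and the toy sentences are hypothetical and only serve the illustration.

```python
from collections import Counter


def build_vocab(tokenized_texts, max_vocab):
    """Map the max_vocab most frequent tokens to indices 1..max_vocab (0 = padding)."""
    counts = Counter(tok for text in tokenized_texts for tok in text.split())
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_vocab), start=1)}


def to_indices(tokens, vocab):
    """Drop tokens that are not in the vocabulary, keep the rest in order."""
    return [vocab[tok] for tok in tokens if tok in vocab]


toy_texts = ["the movie was good", "the movie was bad", "the plot"]
vocab = build_vocab(toy_texts, max_vocab=3)
print(vocab)  # {'the': 1, 'movie': 2, 'was': 3}
print(to_indices("the plot was thin".split(), vocab))  # [1, 3]
```

Because out-of-vocabulary tokens are dropped rather than mapped to an unknown-token index, a review made up entirely of rare words would vectorize to an empty sequence.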

In [7]:
pairs = [(torch.Tensor([1,1,1]), 0), (torch.Tensor([7]), 1)]
seqcs, lengths, labels = custom_collate_fn(pairs)

print(seqcs)
print(labels)
print(lengths)

tensor([[1., 7.],
        [1., 0.],
        [1., 0.]])
tensor([0., 1.])
tensor([3., 1.])
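
The printed batch is time-major: each row is a time step and each column is one review. The following pure-Python sketch mimics that layout with plain lists and shows how the stored lengths locate the last real token of every column. `pad_time_major` and `last_valid_items` are hypothetical helpers written for this illustration and are not part of PyTorch.

```python
PAD = 0  # assumption: same padding index as PADDING_VALUE in the notebook


def pad_time_major(sequences, pad=PAD):
    """Pad variable-length lists into a (max_len, batch) grid of rows."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    grid = [
        [seq[t] if t < len(seq) else pad for seq in sequences]
        for t in range(max_len)
    ]
    return grid, lengths


def last_valid_items(grid, lengths):
    """Pick the final non-padding element of every column."""
    return [grid[length - 1][col] for col, length in enumerate(lengths)]


grid, lengths = pad_time_major([[1, 1, 1], [7]])
print(grid)     # [[1, 7], [1, 0], [1, 0]]
print(lengths)  # [3, 1]
print(last_valid_items(grid, lengths))  # [1, 7]
```

Keeping the lengths next to the padded grid is what later allows the padded positions to be skipped by the recurrent layer.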


# Training loop and visualizations


In [8]:
"""Implement LSTMSentimentTagger.
The model should use an LSTM module.
Use torch.nn.utils.rnn.pack_padded_sequence to optimize processing of sequences.
When computing the vocab_size of the embedding layer, remember that the padding symbol counts towards the vocab.
Use a sigmoid activation function.
"""
class LSTMSentimentTagger(LightningModule):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, lr):
        super(LSTMSentimentTagger, self).__init__()
        self.save_hyperparameters()

        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.lr = lr

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.dropout = nn.Dropout(0.5)
        self.lstm = nn.LSTM(embedding_dim, self.hidden_dim, proj_size = int(embedding_dim/2))
        self.fc = nn.Linear(int(embedding_dim/2), 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, sentence, lengths):
        # pack_padded_sequence expects the lengths on the CPU.
        lengths = lengths.to('cpu')

        # sentence: (seq_len, batch) -> embedded: (seq_len, batch, embedding_dim)
        out = self.embedding(sentence)
        out = self.dropout(out)

        # Packing lets the LSTM skip the padded positions.
        out = torch.nn.utils.rnn.pack_padded_sequence(out, lengths, enforce_sorted=False)
        _, (hidden, _) = self.lstm(out)

        # hidden[-1] is the output at the last valid step of every sequence,
        # shape (batch, proj_size).
        out = self.fc(hidden[-1])
        out = self.sigmoid(out)

        # (batch, 1) -> (batch,); squeeze only the last dim so a batch of one keeps its shape.
        scores = out.squeeze(-1)

        return scores

    def training_step(self, batch, batch_idx):
        seqcs, lengths, labels = batch
        output = self(seqcs, lengths)
        loss = F.binary_cross_entropy(output.float(), labels.float())

        self.log("train loss", loss , batch_size=len(lengths),
                 on_step=True, on_epoch=True, prog_bar=True)
        self.log("train accuracy", self.acc(output,labels) / len(labels), batch_size=len(lengths),
                 on_step=True, on_epoch=True, prog_bar=True)
        
        return loss

    def validation_step(self, batch, batch_idx):
        seqcs, lengths, labels = batch
        output = self(seqcs, lengths)
        loss = F.binary_cross_entropy(output.float(), labels.float())

        self.log("validation loss", loss, batch_size=len(lengths),
                 on_epoch=True, prog_bar=True)
        self.log("validation accuracy", self.acc(output,labels) / len(labels), batch_size=len(lengths),
                 on_epoch=True, prog_bar=True)

        return loss

    def test_step(self, batch, batch_idx):
        seqcs, lengths, labels = batch
        output = self(seqcs, lengths)
        loss = F.binary_cross_entropy(output.float(), labels.float())

        self.log("test loss", loss, batch_size=len(lengths),
                 on_epoch=True, prog_bar=True)
        self.log("test accuracy", self.acc(output,labels) / len(labels), batch_size=len(lengths),
                 on_epoch=True, prog_bar=True)

        return loss

    def acc(self, pred, label):
        pred = torch.round(pred.squeeze())
        return torch.sum(pred == label.squeeze()).item()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
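
The steps above pass sigmoid outputs to `F.binary_cross_entropy`. As a rough illustration of the quantity being minimized, here is a pure-Python sketch of the mean binary cross-entropy. The helper name and the `eps` clamp are assumptions made for this example; PyTorch guards against log(0) in its own way.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean BCE over a batch; eps clamps probabilities away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# A maximally uncertain prediction costs ln(2) per example.
print(round(binary_cross_entropy([0.5, 0.5], [1, 0]), 4))  # 0.6931
# Confident and correct predictions cost much less.
print(round(binary_cross_entropy([0.9, 0.1], [1, 0]), 4))  # 0.1054
```

The loss penalizes confident mistakes heavily, which is why it can keep moving even when the thresholded accuracy stays flat.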

In [9]:
class DataModule(LightningDataModule):
    def __init__(self):
        super().__init__()

    def setup(self, stage = None):
        self.train_dataset, self.test_dataset = get_datasets()

    def train_dataloader(self):
        return DataLoader(self.train_dataset, shuffle=True, batch_size=64, collate_fn=custom_collate_fn)

    def val_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=64, collate_fn=custom_collate_fn)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=64, collate_fn=custom_collate_fn)

In [10]:
class MetricTracker(Callback):
    def __init__(self, clearML_logger):
        self.clearML_logger = clearML_logger

        self.train_loss = []
        self.train_accuracy = []
        self.validation_loss = []
        self.validation_accuracy = []

        self.first_validation = True

    def on_validation_epoch_end(self, trainer, module):
        logged = trainer.logged_metrics

        if self.first_validation:
            self.first_validation = False
            return

        # Convert to plain floats so the values can be reported and plotted directly.
        loss = float(logged['validation loss'])
        acc = float(logged['validation accuracy'])

        self.validation_loss.append(loss)
        self.validation_accuracy.append(acc)

        self.clearML_logger.report_scalar(title='Validation Loss', series='validation', 
                                          iteration=module.current_epoch, value=loss)
        self.clearML_logger.report_scalar(title='Validation Accuracy', series='validation',
                                          iteration=module.current_epoch, value=acc)

    def on_train_epoch_end(self, trainer, module):
        logged = trainer.logged_metrics
    
        self.train_loss.append(float(logged['train loss_epoch']))
        self.train_accuracy.append(float(logged['train accuracy_epoch']))


In [11]:
!rm chechpoints/*

checkpoint_callback = ModelCheckpoint(
    monitor="validation accuracy",
    dirpath="chechpoints/",
    filename="lstm-imdb-{epoch:02d}-{validation accuracy:.2f}",
    mode="max",
)

In [12]:
embedding_dim = 50
hidden_dim = 400
vocab_size = len(naive_vectorizer.vw)+1
lr = 0.005

config = {
  'hidden_dim': hidden_dim,
  'embedding_dim': embedding_dim,
  'vocab_size': vocab_size,
  'lr': lr,
}

task = Task.create(project_name='Assigment3', task_name='Assigment 3')
task.mark_started()
logger = task.get_logger()
task.connect(config)

mt = MetricTracker(logger)
dm = DataModule()
model = LSTMSentimentTagger(embedding_dim, hidden_dim, vocab_size, lr)
trainer = Trainer(callbacks=[mt, checkpoint_callback], gpus=1, max_epochs=15)

trainer.fit(model, dm)

task.mark_completed()
task.close()

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type      | Params
----------------------------------------
0 | embedding | Embedding | 250 K 
1 | dropout   | Dropout   | 0     
2 | lstm      | LSTM      | 133 K 
3 | fc        | Linear    | 26    
4 | sigmoid   | Sigmoid   | 0     
----------------------------------------
383 K     Trainable params
0         Non-trainable params
383 K     Total params
1.533     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

In [13]:
import plotly.express as px

fig = px.line(x=range(len(mt.validation_accuracy)), y=mt.validation_accuracy, 
                      title='Validation accuracy')
fig.show()

fig = px.line(x=range(len(mt.validation_loss)), y=mt.validation_loss, 
                      title='Validation loss')
fig.show()

fig = px.line(x=range(len(mt.train_accuracy)), y=mt.train_accuracy, 
                      title='Train accuracy')
fig.show()

fig = px.line(x=range(len(mt.train_loss)), y=mt.train_loss, 
                      title='Train loss')
fig.show()
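
The four curves can also be compared numerically to monitor overfitting. The sketch below assumes two lists with one loss value per epoch, such as the ones collected by the metric tracker. The helper names and the numbers are made up for illustration and are not results from this notebook.

```python
def best_epoch(validation_loss):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(validation_loss)), key=validation_loss.__getitem__)


def generalization_gap(train_loss, validation_loss):
    """Per-epoch difference validation - train; a growing gap hints at overfitting."""
    return [v - t for t, v in zip(train_loss, validation_loss)]


# Made-up numbers for illustration only, not results from this notebook.
train = [0.60, 0.40, 0.30, 0.22, 0.15]
val = [0.58, 0.42, 0.36, 0.38, 0.45]

print(best_epoch(val))  # 2
print([round(g, 2) for g in generalization_gap(train, val)])
# [-0.02, 0.02, 0.06, 0.16, 0.3]
```

In this toy run the validation loss bottoms out at epoch 2 while the training loss keeps falling, which is the pattern to look for in the plots.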

In [14]:
from os import listdir

filename = listdir('chechpoints/')[0]
trainer = Trainer(gpus=1)
model = LSTMSentimentTagger.load_from_checkpoint(f'chechpoints/{filename}')
trainer.test(model, dm)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test accuracy': 0.8947200179100037, 'test loss': 0.26467078924179077}
--------------------------------------------------------------------------------


[{'test accuracy': 0.8947200179100037, 'test loss': 0.26467078924179077}]