# Assignment 10 - NLP using Deep Learning

## Goals

In this assignment you will work with recurrent network architectures applied to language processing tasks and observe the learning behaviour using tensorboard visualization.

You'll learn to use

 * word embeddings
 * LSTMs
 * tensorboard visualization
 * optionally, a state-of-the-art transformer model (easy to try)

While the notebook is heavy with code, the actual **TODO**s for you are lightweight and easy to find. Use the lab machines and provided environment to get started and finish quickly.
The main intention of this exercise is to provide you with entry points to approach common NLP tasks with simple and elaborate methods.

## Use the deep learning environment in the lab

With the same kind of preparation as in [Assignment 6](../A6/A6.ipynb) we are going to use **[pytorch](http://pytorch.org)** for the deep learning aspects of the assignment.

A pytorch setup is available in the big data lab as part of the globally available anaconda installation.
However, it is recommended that you use the custom **gt** conda environment, which contains all python package dependencies that are relevant for this assignment (and also tensorflow, etc.).

You can activate it directly:
```
source activate /usr/shared/CMPT/big-data/condaenv/gt
```
Once activated, you could also add it as a user kernel to your jupyter installation
```
python -m ipykernel install --user --name="py-gt"
```
and then choose it as kernel when running this notebook.
To reproduce this environment on your own system, you could use `conda env export > environment.yml` and then use `mamba env update --prefix wherever_you_want_to_create_yours -f environment.yml` to make your own instance of this environment.

In [356]:
import torch
import torch.nn as nn
import numpy as np

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# location of "GoogleNews-vectors-negative300.bin.gz", only required if word2vec embedding is chosen
from pathlib import Path
bdenv_loc = Path('/usr/shared/CMPT/big-data')
bdata = bdenv_loc / 'data'

# Task 1: Explore Word Embeddings

Word embeddings map words to multi-dimensional vectors such that the distance between two word vectors reflects how related the corresponding words are in meaning; ideally, words with similar meanings are mapped close together. This part of the assignment should enable you to

* Load a pretrained word embedding
* Perform basic operations, such as running distance queries and evaluating simple analogies
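
To make the idea of "distance queries" and "analogies" concrete, here is a minimal pure-Python sketch. The 3-dimensional vectors are made-up toy values (an assumption for illustration, not real word2vec entries), and the helper `most_similar` only imitates the spirit of gensim's method of the same name; gensim's actual implementation differs in details such as vector normalization.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embedding: made-up numbers, chosen so the analogy works out.
toy_embedding = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "bus":   [0.5, 0.5, 0.5],
}

def most_similar(positive, negative, embedding):
    """Rank words by cosine similarity to sum(positive) - sum(negative).

    Words that are part of the query are excluded from the ranking.
    """
    dim = len(next(iter(embedding.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, embedding[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, embedding[word])]
    candidates = [w for w in embedding if w not in positive and w not in negative]
    scored = [(w, cosine_similarity(query, embedding[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = most_similar(positive=["woman", "king"], negative=["man"],
                       embedding=toy_embedding)
print(ranking[0][0])  # prints: queen
```

With real embeddings the arithmetic is the same, but the nearest neighbour of the query vector is rarely an exact match, which is why many analogies give less convincing results.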

In [357]:
import gensim
# Load Google's pre-trained Word2Vec model, trained on news articles
model = gensim.models.KeyedVectors.load_word2vec_format(bdata/'GoogleNews-vectors-negative300.bin.gz', binary=True)

In [358]:
# read up about the word2vec API in gensim and
# obtain a vector representation for a word of your choice

# TODO ...
vector_word=model['Bus']
print(len(vector_word))

# to confirm that this worked, print out the number of elements of the vector

300


In [359]:
# determine the 10 words that are closest in the embedding to the word vector you produced above

# TODO ...
model.most_similar('Bus')
# are the nearest neighbours similar in meaning?
# try different seed words, until you find one whose neighbourhood looks OK

[('bus', 0.7499595880508423),
 ('Buses', 0.681564211845398),
 ('Minibus', 0.6650453209877014),
 ('buses', 0.6404081583023071),
 ('Train', 0.6006012558937073),
 ('Trolley', 0.5985462069511414),
 ('Bus_Union_RTBU', 0.5962828397750854),
 ('Taxi', 0.5935410857200623),
 ('busses', 0.5892709493637085),
 ('Busses', 0.5729268193244934)]

In [360]:
# using a combination of positive and negative words, find out which word is most
# similar to woman + king - man

# TODO ...
model.most_similar(positive=["woman","king"],negative=["man"])[0]
# note, gensim's API allows you to combine positive and negative words without having to obtain their vectors

('queen', 0.7118193507194519)

In [361]:
# you may find that the results of most word analogy combinations don't work as well as we'd hope.
# however, explore a bit and find two more cases where the output of your word vector algebra makes sense.

# TODO ...
model.most_similar(positive=['domesticated','pets'],negative=['dog'])[0]

('domesticated_cats', 0.5584822297096252)

In [362]:
model.most_similar(positive=['Grocery','retailer'],negative=['supermarkets'])[0]

('store', 0.5448563694953918)

In [364]:
# Rerun at least one of your above word embedding examples using a different embedding instead of word2vec, e.g. this version of GloVe

import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")

# TODO ...


In [365]:
word_vectors.most_similar(positive=['domesticated','pets'],negative=['dog'])[0]

('sirenians', 0.5787887573242188)

In [366]:
word_vectors.most_similar(positive=['grocery','retailer'],negative=['supermarkets'])[0]

('store', 0.6948113441467285)

# Task 2: Sequence modeling with RNNs or transformers

In this task you will use a learned model and a rule-based model for text sentiment analysis. To keep things simple, you are given almost all of the code and only need to tune the given algorithms; see the part about instrumentation below.
Look for *TODO* to find places where your input is required.

## SST-2 Binary text classification with XLM-RoBERTa model and LSTMs

The XLM-RoBERTa related portions of this notebook are from [a tutorial](https://pytorch.org/text/main/tutorials/sst2_classification_non_distributed.html) authored by `Parmeet Bhatia <parmeetbhatia@fb.com>`

Adaptation of the modern torchtext pipeline to also allow switching to a recurrent model with different pre-trained word embeddings by `Steven Bergner <sbergner@sfu.ca>`

The steps below demonstrate how to train a text classifier on the SST-2 binary dataset using a pre-trained XLM-RoBERTa (XLM-R) model. Customizations to switch parts of the pipeline to different models are also enabled.

We will show how to use torchtext library to:

1. build text pre-processing pipeline for XLM-R model
2. read SST-2 dataset and transform it using text and label transformation
3. instantiate a classification model using pre-trained XLM-R encoder
4. change pipeline components to swap out any part of the data and model pipeline


## Data Transformation

Models like XLM-R cannot work directly with raw text. The first step in training
these models is to transform input text into tensor (numerical) form such that it
can then be processed by models to make predictions. A standard way to process text is:

1. Tokenize text
2. Convert tokens into (integer) IDs
3. Add any special tokens IDs

XLM-R uses a sentencepiece model for text tokenization. Below, we use the pre-trained sentencepiece
model along with the corresponding vocabulary to build a text pre-processing pipeline using torchtext's transforms.
The transforms are pipelined using `torchtext.transforms.Sequential`, which is similar to `torch.nn.Sequential`
but is torchscriptable. Note that the transforms support both batched and non-batched text inputs, i.e. one
can pass either a single sentence or a list of sentences.
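
The three processing steps listed above can be sketched in plain Python as follows. This is only an illustration: the whitespace tokenizer, the tiny vocabulary, the unknown-token ID and the name `toy_text_transform` are all assumptions standing in for the sentencepiece model and XLM-R vocabulary. The special token IDs (0, 1, 2) match the values used by the notebook's pipeline.

```python
# Special token IDs, matching bos_idx, padding_idx and eos_idx in this notebook.
BOS_IDX, PAD_IDX, EOS_IDX = 0, 1, 2
UNK_IDX = 3  # assumption: ID for out-of-vocabulary tokens in this toy setup

# Hypothetical toy vocabulary; the real pipeline loads the XLM-R vocabulary.
toy_vocab = {"the": 4, "movie": 5, "was": 6, "great": 7, "boring": 8}

def toy_text_transform(text, max_seq_len=6):
    """Map a sentence to a list of token IDs with begin/end markers."""
    # 1. Tokenize (whitespace split as a stand-in for sentencepiece).
    tokens = text.lower().split()
    # 2. Convert tokens into integer IDs.
    ids = [toy_vocab.get(tok, UNK_IDX) for tok in tokens]
    # Truncate so that the two special tokens still fit into max_seq_len.
    ids = ids[: max_seq_len - 2]
    # 3. Add the special token IDs.
    return [BOS_IDX] + ids + [EOS_IDX]

encoded = toy_text_transform("The movie was great")
print(encoded)  # prints: [0, 4, 5, 6, 7, 2]

# A longer sentence is cut off after max_seq_len - 2 regular tokens.
truncated = toy_text_transform("the movie was boring the movie was great")
print(len(truncated))  # prints: 6
```

Truncating to `max_seq_len - 2` before adding the markers is the same reason the notebook pipeline uses `T.Truncate(max_seq_len - 2)`.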




Caution: If you want to learn more about torchtext, be careful to **not** read the docs at:
https://torchtext.readthedocs.io/en/latest/
They claim to be "latest", but are of version 0.4.0

Instead, find **current docs** here: https://pytorch.org/text/stable/index.html
or simply keep reading, as this tutorial shows how to use the recent version.

In [367]:
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url
from torch.utils.data import DataLoader

padding_idx = 1
bos_idx = 0
eos_idx = 2
max_seq_len = 256
xlmr_vocab_path = r"https://download.pytorch.org/models/text/xlmr.vocab.pt"
xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"

text_transform = T.Sequential(
    T.SentencePieceTokenizer(xlmr_spm_model_path),
    T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),
    T.Truncate(max_seq_len - 2),
    T.AddToken(token=bos_idx, begin=True),
    T.AddToken(token=eos_idx, begin=False),
)

In [368]:
# obtain the vocabulary of the data pipeline, so that we can convert word <--> word_index
# allowing us to plug in different word embeddings
vocab = text_transform[1].vocab.vocab
word_to_idx = vocab.get_stoi()

In addition to the transformer model, we also create an LSTM based model for text classification.

Change the parameters below to switch between models and make adjustments to the training.

In [369]:
import time

# TODO make adjustments here to achieve acceptable training performance with LSTMs
# Also, try out the Roberta model for comparison

EPOCHS = 8
USE_GPU = torch.cuda.is_available()
DROPOUT = .1
timestamp = str(int(time.time()))
best_dev_acc = 0.0

do_use_roberta_model = False
if do_use_roberta_model:
    LEARNING_RATE = 1e-5
    EPOCHS = 1
    BATCH_SIZE = 16
    EMBEDDING_TYPE = 'built-in'
else:
    #EMBEDDING_TYPE = 'word2vec'
    #EMBEDDING_TYPE = 'glove'
    EMBEDDING_TYPE = 'glovefull'
    EMBEDDING_DIM = 300
    HIDDEN_DIM = 500
    BATCH_SIZE = 128
    USE_BILSTM = True
    LEARNING_RATE=1e-3
    do_freeze_embedding = True
    do_use_roberta_classifier = False

In [370]:
def maybe_gpu(v):
    return v.cuda() if USE_GPU else v

In [371]:
from torch.autograd import Variable
import torch.nn.functional as nnF

class LSTMSentiment(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size,
                 use_gpu, batch_size, dropout=0.5, bidirectional=False, classifier_head=None):
        """Prepare individual layers"""
        super(LSTMSentiment, self).__init__()
        self.hidden_dim = hidden_dim
        self.use_gpu = use_gpu
        self.batch_size = batch_size
        self.dropout = dropout
        self.num_directions = 2 if bidirectional else 1
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, bidirectional=bidirectional)
        self.hidden2label = nn.Linear(hidden_dim*self.num_directions, label_size)
        self.hidden = self.init_hidden()
        self.classifier_head = classifier_head

    def init_hidden(self, batch_size=None):
        """Choose appropriate size and type of hidden layer"""
        if not batch_size:
            batch_size = self.batch_size
        #what = torch.randn
        what = torch.zeros
        # first is the hidden h
        # second is the cell c
        return (maybe_gpu(Variable(what(self.num_directions, batch_size, self.hidden_dim))),
                maybe_gpu(Variable(what(self.num_directions, batch_size, self.hidden_dim))))

    def classify(self, features):
        y = self.hidden2label(features)
        log_probs = nnF.log_softmax(y, dim=1)
        return log_probs

    def forward(self, sentence):
        """Use the layers of this model to propagate input and return class log probabilities"""
        if self.use_gpu:
            sentence = sentence.cuda()
        x = self.embeddings(sentence).permute(1,0,2)
        batch_size = x.shape[1]
        self.hidden = self.init_hidden(batch_size=batch_size)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        features = lstm_out[-1]
        if self.classifier_head:
            #unsqueeze: introduce dummy second dimension, so that classifier_head can drop it
            return self.classifier_head(torch.unsqueeze(features, 1))
        else:
            return self.classify(features)

Choose and load a word embedding that provides the feature input to the RNN/LSTM.

In [None]:
if 'glove' == EMBEDDING_TYPE:
    from torchtext.vocab import GloVe
    glove_vectors = GloVe(name="6B")
    EMBEDDING_DIM = glove_vectors.vectors.shape[1]
    use_embedding_directly = False
    if use_embedding_directly:
        pretrained_embeddings = maybe_gpu(glove_vectors.vectors)
    else:
        
        pretrained_embeddings = np.random.uniform(-0.25, 0.25, (len(vocab), EMBEDDING_DIM)).astype('f')
        pretrained_embeddings[0] = 0
        for word, wi in glove_vectors.stoi.items():
            try:
                pretrained_embeddings[word_to_idx[word]-1] = glove_vectors.__getitem__(word)
            except KeyError:
                pass
        pretrained_embeddings = maybe_gpu(torch.from_numpy(pretrained_embeddings))
elif 'glovefull' == EMBEDDING_TYPE:
    from torchtext.vocab import GloVe
    glove_vectors = GloVe(cache="/usr/shared/CMPT/big-data/dot_torch_shared/.vector_cache/")
    # set do_freeze_embedding = False in the configuration above if you want the embeddings to be trainable
    pretrained_embeddings = maybe_gpu(glove_vectors.vectors)
    #my_embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
elif 'word2vec' == EMBEDDING_TYPE:
    pretrained_embeddings = np.random.uniform(-0.25, 0.25, (len(vocab), EMBEDDING_DIM)).astype('f')
    pretrained_embeddings[0] = 0
    try:
        word2vec
    except NameError:  # word2vec has not been loaded in this session yet
        print('Load word embeddings...')
        import gensim
        word2vec = gensim.models.KeyedVectors.load_word2vec_format(
                         bdata / 'GoogleNews-vectors-negative300.bin.gz', binary=True)
        EMBEDDING_DIM = 300
    for word, wi in word2vec.key_to_index.items():
        try:
            pretrained_embeddings[word_to_idx[word]-1] = word2vec.vectors[wi]
        except KeyError:
            pass
    # text_field.vocab.load_vectors(wv_type='', wv_dim=300)
    pretrained_embeddings = maybe_gpu(torch.from_numpy(pretrained_embeddings))
else:
    if not do_use_roberta_model:
        print('Unknown embedding type {}'.format(EMBEDDING_TYPE))

## Model preparation LSTM
Initialize the RNN model, if the above configuration is set to use it.

In [341]:
num_classes = 2

if not do_use_roberta_model:
    lstm_model = LSTMSentiment(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,
                                vocab_size=len(vocab), label_size=num_classes,\
                                use_gpu=USE_GPU, batch_size=BATCH_SIZE, dropout=DROPOUT, bidirectional=USE_BILSTM)
    lstm_model.embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=do_freeze_embedding)
    model = lstm_model

Alternatively, we can use the transform shipped with the pre-trained model, which performs all of the text pre-processing steps shown earlier out of the box: `text_transform = XLMR_BASE_ENCODER.transform()`




## Dataset
torchtext provides several standard NLP datasets. For a complete list, refer to the documentation
at https://pytorch.org/text/stable/datasets.html. These datasets are built using composable torchdata
datapipes and hence support standard flow-control and mapping/transformation using user-defined functions
and transforms. Below, we demonstrate how to use text and label processing transforms to pre-process the
SST-2 dataset.
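
To clarify what the batching and row-to-column steps of such a pipeline produce, here is a plain-Python sketch. The helpers `batch` and `rows_to_columnar` are hypothetical stand-ins written for this illustration; they mimic the shape of the data after the datapipe's `.batch(...)` and `.rows2columnar(...)` calls but are not the torchdata implementation.

```python
def batch(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def rows_to_columnar(batch_rows, column_names):
    """Turn a list of row tuples into one dict that maps column name -> list."""
    columns = {name: [] for name in column_names}
    for row in batch_rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

# Hypothetical (token_ids, label) rows, as they would look after the text transform.
rows = [([0, 4, 2], 1), ([0, 5, 6, 2], 0), ([0, 7, 2], 1)]

batches = [rows_to_columnar(b, ["token_ids", "target"])
           for b in batch(rows, batch_size=2)]

print(batches[0])  # prints: {'token_ids': [[0, 4, 2], [0, 5, 6, 2]], 'target': [1, 0]}
print(batches[1])  # prints: {'token_ids': [[0, 7, 2]], 'target': [1]}
```

Each batch is a dictionary, which is why the training and evaluation code accesses `batch["token_ids"]` and `batch["target"]`. Note that the token ID lists inside a batch still have different lengths at this point.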





In [314]:
from torchtext.datasets import SST2
from torch.utils.data import DataLoader

batch_size = BATCH_SIZE

train_datapipe = SST2(split="train")
dev_datapipe = SST2(split="dev")

# Transform the raw dataset using non-batched API (i.e apply transformation line by line)
train_datapipe = train_datapipe.map(lambda x: (text_transform(x[0]), x[1]))
train_datapipe = train_datapipe.batch(batch_size)
train_datapipe = train_datapipe.rows2columnar(["token_ids", "target"])
train_dataloader = DataLoader(train_datapipe, batch_size=None)

dev_datapipe = dev_datapipe.map(lambda x: (text_transform(x[0]), x[1]))
dev_datapipe = dev_datapipe.batch(batch_size)
dev_datapipe = dev_datapipe.rows2columnar(["token_ids", "target"])
dev_dataloader = DataLoader(dev_datapipe, batch_size=None)

In [315]:
# # # Alternately we can also use batched API
# train_datapipe = train_datapipe.batch(batch_size).rows2columnar(["text", "label"])
# train_datapipe = train_datapipe.map(lambda x: {"token_ids": text_transform(x["text"]), "target": label_transform(x["label"])})
# dev_datapipe = dev_datapipe.batch(batch_size).rows2columnar(["text", "label"])
# dev_datapipe = dev_datapipe.map(lambda x: {"token_ids": text_transform(x["text"]), "target": label_transform(x["label"])})

## Model preparation - RoBERTa

torchtext provides SOTA pre-trained models that can be fine-tuned on downstream NLP tasks.
Below we use a pre-trained XLM-R encoder with the standard base architecture and attach a classifier head to fine-tune it
on the SST-2 binary classification task. We use the standard classifier head from the library, but users can define
their own task head and attach it to the pre-trained encoder. For additional details on available pre-trained models,
please refer to the documentation at https://pytorch.org/text/main/models.html





In [316]:
num_classes = 2

from torchtext.models import RobertaClassificationHead, XLMR_BASE_ENCODER

if do_use_roberta_model:
    input_dim = 768
    classifier_head = RobertaClassificationHead(num_classes=num_classes, input_dim=input_dim)
    model = XLMR_BASE_ENCODER.get_model(head=classifier_head)
else:
    model = lstm_model
    if do_use_roberta_classifier:
        feature_dim = model.hidden_dim + (USE_BILSTM * model.hidden_dim)
        classifier_head = RobertaClassificationHead(num_classes=num_classes, input_dim=feature_dim)
        model.classifier_head = classifier_head

model.to(DEVICE);

## Training methods

Let's now define the standard optimizer and training criteria as well as some helper functions
for training and evaluation. The methods below work for either choice of model.




In [317]:
import torchtext.functional as F
from torch.optim import AdamW

learning_rate = LEARNING_RATE
optim = AdamW(model.parameters(), lr=learning_rate)
criteria = nn.CrossEntropyLoss()


def train_step(input, target):
    model.train()
    output = model(input)
    loss = criteria(output, target)
    optim.zero_grad()
    loss.backward()
    optim.step()


def eval_step(input, target):
    output = model(input)
    loss = criteria(output, target).item()
    return float(loss), (output.argmax(1) == target).type(torch.float).sum().item()


def evaluate():
    model.eval()
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0
    counter = 0
    with torch.no_grad():
        for batch in dev_dataloader:
            input = F.to_tensor(batch["token_ids"], padding_value=padding_idx).to(DEVICE)
            target = torch.tensor(batch["target"]).to(DEVICE)
            loss, predictions = eval_step(input, target)
            total_loss += loss
            correct_predictions += predictions
            total_predictions += len(target)
            counter += 1

    return total_loss / counter, correct_predictions / total_predictions
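
Two details of the helper functions above are easy to overlook: the token ID lists of a batch are padded to a common length before they become a tensor, and accuracy is accumulated as a count of correct predictions over all batches. The following plain-Python sketch illustrates both with made-up data; `pad_batch` is a hypothetical helper that only imitates the padding effect of `F.to_tensor(..., padding_value=padding_idx)` and returns nested lists instead of a tensor.

```python
PAD_IDX = 1  # matches padding_idx in this notebook

def pad_batch(token_id_lists, padding_value):
    """Right-pad every list to the length of the longest list in the batch."""
    max_len = max(len(ids) for ids in token_id_lists)
    return [ids + [padding_value] * (max_len - len(ids)) for ids in token_id_lists]

padded = pad_batch([[0, 4, 2], [0, 5, 6, 7, 2]], PAD_IDX)
print(padded)  # prints: [[0, 4, 2, 1, 1], [0, 5, 6, 7, 2]]

# Hypothetical predictions and targets for two batches of different size.
batches = [
    {"predicted": [1, 0, 1], "target": [1, 1, 1]},
    {"predicted": [0], "target": [0]},
]

correct_predictions = 0
total_predictions = 0
for b in batches:
    # Count matches per batch, then accumulate, as evaluate() does.
    correct_predictions += sum(int(p == t) for p, t in zip(b["predicted"], b["target"]))
    total_predictions += len(b["target"])

accuracy = correct_predictions / total_predictions
print(accuracy)  # prints: 0.75
```

Dividing the accumulated count by the total number of examples weights every example equally, even when the last batch is smaller than the others.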

### The actual task (B1): Tensorboard instrumentation

To get you to work with some of the basic tools that enable development and tuning of deep learning architectures, we would like you to use Tensorboard.

1. read up on how to instrument your code for profiling and visualization in [tensorboard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard), e.g. [at this tutorial](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html)
1. [partly done] use the tensorboard `SummaryWriter` to keep track of training loss for each epoch, writing to a local `runs` folder (which is the default)
1. launch tensorboard and inspect the log folder, i.e. run `tensorboard --logdir runs` from the assignment folder

In [318]:
from torch.utils.tensorboard import SummaryWriter
import os
out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))

writer = SummaryWriter(comment='-{}lstm-em{}{}-hid{}-do{}-bs{}-lr{}'
                                .format('BI' if USE_BILSTM else '',
                                        EMBEDDING_TYPE, EMBEDDING_DIM,
                                        HIDDEN_DIM,
                                        DROPOUT, BATCH_SIZE, LEARNING_RATE))
print("Writing to {}\n".format(out_dir))
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

Writing to /home/rri/A10/runs/1648791647



## Train

Now we have all the ingredients to train our classification model. Note that the datapipe could also be iterated
directly without a DataLoader, since our pre-processed dataset already yields batches of data thanks to the
batching datapipe we have applied; this is why the DataLoader above is created with `batch_size=None`.
For distributed training, we would need to use DataLoader to take care of data-sharding.




In [330]:
num_epochs = EPOCHS

from tqdm import tqdm

torch.autograd.set_detect_anomaly(False)

trial = 0 # increment this if you manually decide to add more epochs to the current training
for e in range(EPOCHS*trial,EPOCHS*(trial+1)):
    for batch in tqdm(train_dataloader):
        input = F.to_tensor(batch["token_ids"], padding_value=padding_idx).to(DEVICE)
        target = torch.tensor(batch["target"]).to(DEVICE)
        train_step(input, target)

    loss, accuracy = evaluate()
    print("Epoch = [{}], loss = [{}], accuracy = [{}]".format(e, loss, accuracy))
    # TODO add loss and accuracy to the tensorboard writer
    writer.add_scalar('Loss',loss,e)
    writer.add_scalar('Accuracy',accuracy,e)

527it [00:11, 44.79it/s]


Epoch = [0], loss = [0.917794236115047], accuracy = [0.7993119266055045]


527it [00:11, 44.19it/s]


Epoch = [1], loss = [0.9936426622526986], accuracy = [0.7947247706422018]


527it [00:12, 43.57it/s]


Epoch = [2], loss = [0.8787091885294233], accuracy = [0.7912844036697247]


527it [00:12, 43.27it/s]


Epoch = [3], loss = [0.9760624766349792], accuracy = [0.7935779816513762]


527it [00:12, 43.04it/s]


Epoch = [4], loss = [0.9195316689355033], accuracy = [0.7947247706422018]


527it [00:12, 42.99it/s]


Epoch = [5], loss = [0.9802014146532331], accuracy = [0.7947247706422018]


527it [00:12, 42.75it/s]


Epoch = [6], loss = [1.11028425182615], accuracy = [0.7901376146788991]


527it [00:12, 42.70it/s]


Epoch = [7], loss = [1.0888057265962874], accuracy = [0.7912844036697247]


In [331]:
#TODO ensure that the test accuracy is visible in the saved notebook for submission

In [332]:
writer.close()

### Task B2: Tune the model (TODO)

After connecting your model's training and test performance output to tensorboard, change the model and training parameters above to improve the model performance. We would like to see plots of how validation accuracy evolves over a number of epochs for at least two different parameter choices. You can stop exploring when you exceed a model accuracy of 76%.

Show a tensorboard screenshot with performance plots that combine at least 2 different tuning attempts. Store the screenshot as `tensorboard.png`. Then keep the best performing parameters set in this notebook for submission and evaluate the comparison below with your best model. 
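
To compare several tuning attempts in one tensorboard plot, each attempt needs its own run name. The notebook achieves this through the `comment` suffix passed to `SummaryWriter`, which is built from the hyperparameters. The sketch below reproduces only that naming scheme in plain Python; the helper `run_comment` and the listed hyperparameter values are hypothetical examples, not recommended settings.

```python
from itertools import product

def run_comment(use_bilstm, embedding_type, embedding_dim,
                hidden_dim, dropout, batch_size, learning_rate):
    """Build a run-name suffix in the same format as the notebook's writer comment."""
    return "-{}lstm-em{}{}-hid{}-do{}-bs{}-lr{}".format(
        "BI" if use_bilstm else "",
        embedding_type, embedding_dim,
        hidden_dim, dropout, batch_size, learning_rate)

# Hypothetical values to explore; every combination gets a distinct suffix.
hidden_dims = [100, 500]
learning_rates = [1e-3, 1e-4]

comments = [run_comment(True, "glovefull", 300, h, 0.1, 128, lr)
            for h, lr in product(hidden_dims, learning_rates)]

for c in comments:
    print(c)
# First line printed: -BIlstm-emglovefull300-hid100-do0.1-bs128-lr0.001
```

Because the suffix encodes the settings, runs with different parameters end up in different subfolders of `runs` and appear as separate curves, while the legend in tensorboard tells you which settings produced which curve.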

## Comparison with Vader (NLTK)
Vader is a rule-based sentiment analysis algorithm that performs quite well against more complex architectures. The test below checks whether LSTMs are able to beat its performance.

In [333]:
# get text data from torchtext dataloader
vocab_itos = vocab.get_itos()
text_data = []
for ba in dev_dataloader:
    text = ("".join(
            ["".join(
                vocab_itos[tid]) for tokens in ba["token_ids"] 
                for tid in tokens ])
                .replace("▁"," ")
                .replace("<s>","")
                .split("</s>"))
    text_and_target = list(zip(text, ba["target"]))
    text_data.extend(text_and_target)

In [334]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

lab_vpred = np.zeros((len(text_data), 2))
for k, (sentence, label) in enumerate(text_data):
    ss = sid.polarity_scores(sentence)
    lab_vpred[k,:] = (int(ss['compound']>0), int(label))

vader_acc = 1-abs(lab_vpred[:,0]-lab_vpred[:,1]).mean()
print('vader acc: {}'.format(vader_acc))
writer.add_scalar('Final/VaderAcc', vader_acc)

vader acc: 0.6594036697247707


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/rri/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Perform the model tuning and training in the previous task until you outperform the Vader algorithm by at least 7% in accuracy using the LSTM model.
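
The Vader accuracy above is computed as one minus the mean absolute difference between predicted and true labels. For binary 0/1 labels this equals the fraction of matching predictions, as the following sketch with made-up labels shows. The helper `beats_baseline` is hypothetical and assumes that "7%" means 7 percentage points of accuracy; check with the instructor if a relative improvement is meant instead.

```python
# Hypothetical binary predictions and labels for illustration.
predictions = [1, 0, 1, 1, 0, 1, 0, 0]
labels      = [1, 0, 0, 1, 0, 1, 1, 0]
n = len(labels)

# Formula used in the Vader cell: 1 - mean(|prediction - label|).
acc_abs_diff = 1 - sum(abs(p - l) for p, l in zip(predictions, labels)) / n
# Equivalent definition: fraction of predictions that match the label.
acc_matches = sum(p == l for p, l in zip(predictions, labels)) / n

print(acc_abs_diff, acc_matches)  # prints: 0.75 0.75

def beats_baseline(model_acc, baseline_acc, margin=0.07):
    """Assumption: the margin is an absolute difference in accuracy."""
    return model_acc - baseline_acc >= margin

# Accuracies reported in this notebook's output: last LSTM epoch vs. Vader.
print(beats_baseline(0.7912844036697247, 0.6594036697247707))  # prints: True
```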

## Submission

Save [this notebook](A10.ipynb) containing all cell output and upload your submission as one `A10.ipynb` file.
Also, include the screenshot of your tensorboard debugging session as `tensorboard.png`.