# Week 4 - NLP and Deep Learning

---

# Lecture 7. RNN 1

In assignments for this lecture we are going to implement an RNN POS tagger in Pytorch.

You can use the following function for data reading:

In [1]:
def read_conll_file(path):
    """
    read in conll file
    
    :param path: path to read from
    :returns: list with sequences of words and labels for each sentence
    """
    data = []
    current_words = []
    current_tags = []

    for line in open(path, encoding='utf-8'):
        line = line.strip()

        if line:
            if line[0] == '#':
                continue # skip comments
            tok = line.split('\t')

            current_words.append(tok[0])
            current_tags.append(tok[1])
        else:
            if current_words:  # skip empty lines
                data.append((current_words, current_tags))
            current_words = []
            current_tags = []

    # check for last one
    if current_tags != []:
        data.append((current_words, current_tags))
    return data

train_data = read_conll_file('pos-data/en_ewt-train.conll')
dev_data = read_conll_file('pos-data/en_ewt-dev.conll')

print(len(train_data))
print(len(dev_data))
print(max([len(x[0]) for x in train_data ]))

12543
2000
159


## 1. Prepare data for use in PyTorch

* a) Convert the data to a format that can be used in a Pytorch module. This means we require:

  * training data: matrix of number of instances (12543) by the maximum sentence length (159), filled with word indices
  * training labels: matrix of the same size, but filled with label indexes instead ( total of 17)
  * the same two sets for the dev data (note that no word indices can be added anymore)
  
A special `<PAD>` token can be used for padding, for sentences shorter as 91 words. For the unknown words in the test set, you can use the `<PAD>` token as well.

**HINT** It will be beneficial in the long run to make a function to convert your data to the right format, as you would have to do it for the train, dev and test sets, and for any other dataset you want to evaluate on.

In [2]:
import torch
from copy import deepcopy
torch.set_printoptions(sci_mode=False)

class Vocab():
    def __init__(self, pad_unk='<PAD>'):
        """
        A convenience class that can help store a vocabulary
        and retrieve indices for inputs.
        """
        self.pad_unk = pad_unk
        self.word2idx = {self.pad_unk: 0}
        self.idx2word = [self.pad_unk]

    def getIdx(self, word, add=False):
        if word not in self.word2idx:
            if add:
                self.word2idx[word] = len(self.idx2word)
                self.idx2word.append(word)
            else:
                return self.word2idx[self.pad_unk]
        return self.word2idx[word]

    def getWord(self, idx):
        return self.idx2word(idx)
    
    def __len__(self): # helpful utility shorthand
        return len(self.idx2word)

max_len= max([len(x[0]) for x in train_data ])

# Your implementation goes here:
def prepare(data, word_vocab=None, label_vocab=None):
    # Get dims
    rows = len(data)
    cols = max([len(x[0]) for x in data])
    wordmat = torch.zeros((rows, cols), dtype=int)
    labelmat = torch.zeros((rows, cols), dtype=int)

    # Create vocabs
    if word_vocab is None:
        allwords = [word for sentence in [datapoint[0] for datapoint in data] for word in sentence] # flattens a 2d list
        word_vocab = Vocab()
        [word_vocab.getIdx(word, True) for word in allwords]
    if label_vocab is None:
        alllabels = [label for sentence in [datapoint[1] for datapoint in data] for label in sentence]
        label_vocab = Vocab()
        [label_vocab.getIdx(label, True) for label in alllabels]
    
    # Fill matrix with numbers
    for i, (words, labels) in enumerate(data):
        assert len(words) == len(labels) # these two *should* be the exact same length, right?
        # pad lists to fit in the matrix
        words += [word_vocab.pad_unk] * (cols - len(words))
        labels += [label_vocab.pad_unk] * (cols - len(labels))
        # turn into numbers and tensors
        words_idx = torch.tensor([word_vocab.getIdx(word) for word in words], dtype=torch.int)
        labels_idx = torch.tensor([label_vocab.getIdx(label) for label in labels], dtype=torch.int)
        # replace the rows in the matrices
        wordmat[i] = words_idx
        labelmat[i] = labels_idx
    
    return wordmat, labelmat, word_vocab, label_vocab # return vocabs too so we can reuse them for dev data.

train_X_mat, train_Y_mat, word_vocab, label_vocab = prepare(deepcopy(train_data))
print(train_X_mat.shape)
# lets find the longest sentence and print that
index = [i for i in range(len(train_data)) if len(train_data[i][0]) == train_X_mat.shape[1]]
print(train_X_mat[index], train_Y_mat[index])
# as expected, every value is filled

torch.Size([12543, 159])
tensor([[13649,  1305, 13605,  2380,    38,  7528,   102,  4731,  4455,    12,
           107,    73,    99,  4321,    28, 13650,   107,   969,    38,   633,
            28, 13651,  1525,    19,  4455,   720,   102, 13652, 13653,    15,
            13,  8446, 12609,  8467, 13654,   559,  1927,   559,    12,    58,
           387,   468,    99,    91,  2100,    60,    13,  8942,  2878,   138,
         13655,   107, 13656,   151,  5030,    38,   431,   491,   538, 13605,
          4136,   468,    99,  3039,    17,    13, 12512,    19,   559,    66,
         13657,    19,  1305,   107, 13652,   559,     4,   559,    13,   178,
           558,   196,   372,    12,    46,  4051,   372,    38, 13658,   133,
           873,   559,    12,    66, 13620,   787, 13619,  3718,  1305, 13605,
          3415,   176,   102,   389,  9324,    19,  5434,   107, 13659,   485,
           114, 13605,   911,    13,  1458,    38,    13,   559, 13643, 13660,
           559, 13661,   19

* b) Until now, we have used a batch-size of 1 in our implemented models, meaning that the models weights are updated after each sentence. This is not very computationally efficient. Larger batch-sizes increase the training speed, and can also lead to better performance (more stable training). You can easily convert existing training data to batches, by splitting it up in chunks of `batch_size` sentences, like this (*Make sure you understand this code!*):

In [3]:
import torch
# 200 instances, 100 features/weights
train_y_feats = torch.zeros((200, 100))

batch_size = 32
num_batches = int(len(train_y_feats)/batch_size)

print(num_batches)

print(train_y_feats.shape)

tmp_feats_batches = train_y_feats[:batch_size*num_batches].view(num_batches,batch_size, 100)

# 6 batches with 32 instances with 100 features
print(tmp_feats_batches.shape)

print()
for feats_batch in tmp_feats_batches:
    print(feats_batch.shape)
    # Here you can call forward/calculate the loss etc.

6
torch.Size([200, 100])
torch.Size([6, 32, 100])

torch.Size([32, 100])
torch.Size([32, 100])
torch.Size([32, 100])
torch.Size([32, 100])
torch.Size([32, 100])
torch.Size([32, 100])


Note that this throws away a tiny part of the data (the last `len(tmp_feats)%batch_size`=6 samples), an alternative would be to pad, and ignore the padded part of the last batch for the loss. For the following assignments you can leave the remaining samples out (note that the dev set is dividable by 16 in this case). Furthermore, note that PyTorch supports a more advanced method for batching: [data loaders](https://pytorch.org/docs/stable/data.html), which we will not cover in this course (but you can use them for the final project).

Convert your training data and labels to batches of batch size 16

In [4]:
# Your implementation goes here:
batch_size = 16
num_batches = int(len(train_X_mat)/batch_size)

print(f"{num_batches} batches")
train_X_batches = train_X_mat[:batch_size*num_batches].view(num_batches, batch_size, train_X_mat.shape[1])
train_Y_batches = train_Y_mat[:batch_size*num_batches].view(num_batches, batch_size, train_X_mat.shape[1])
print(train_X_batches.shape)
print(f"Lost {train_X_mat.shape[0] - train_X_batches.shape[0]*batch_size} datapoints")

783 batches
torch.Size([783, 16, 159])
Lost 15 datapoints


In [5]:
train_X_batches[0]

tensor([[  1,   2,   3,  ...,   0,   0,   0],
        [ 25,  26,  27,  ...,   0,   0,   0],
        [ 41,   4,  42,  ...,   0,   0,   0],
        ...,
        [180, 181,  12,  ...,   0,   0,   0],
        [190, 182,  69,  ...,   0,   0,   0],
        [ 26, 202,  69,  ...,   0,   0,   0]])

## 2 Train an RNN

* a) Implement a POS tagger model in Pytorch using a [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer for word representations and a [`torch.nn.RNN`](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) layer. You can use a [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layer for prediction of the label. Train this tagger on the language identification data, and evaluate its performance. Note that during each training step, you now get the predictions and loss on a whole batch directly. Use the following hyperparameters: 5 epochs over the full training data, word embeddings dimension: 100, rnn dimension of 50, learning rate of 0.01 in an [Adam optimizer](https://pytorch.org/docs/stable/optim.html) and a [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).

Hints:
* **Set batch_first to true!**, as explained on the [`torch.nn.RNN`](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) page. By default the RNN expects the input to be in the shape: `(seq_len, batch, rnn_dim)`, when it is set to true it should be: `(batch, seq_len, rnn_dim)`.
* Training an RNN is generally much slower compared to the machine learning models we implemented before on this data, so we suggest to start with only a sub-part of the data, like 100 or 1,000 sentences. It is also ok to use only 1,000 sentences for your final model (or use the HPC to train the full model).
* To calculate the cross entropy loss, we need the predictions to be in the first dimension. We can convert the predictions values from our model (16\*159\*18 for 1 batch) to a flatter representation (2544\*18) by using: `.view(BATCH_SIZE * max_len, -1)`. Of course, we also have to adapt the labels from 16\*159 to 2544\*1.

For more information on how to implement a Pytorch module, we refer to the code used to obtain the weights in the assignment of week 3 (`week4/train_ff.py`), and the following tutorial series: https://pytorch.org/tutorials/beginner/nlp/index.html (especially the 2nd and 4th tutorials are relevant). You can use the code below as a starting point

In [6]:
from torch import nn
import torch
torch.manual_seed(0)
DIM_EMBEDDING = 100
RNN_HIDDEN = 50
BATCH_SIZE = 16
LEARNING_RATE = 0.01
EPOCHS = 5

class TaggerModel(torch.nn.Module):
    def __init__(self, nwords, ntags):
        super(TaggerModel, self).__init__()
        self.embedding = nn.Embedding(nwords, DIM_EMBEDDING, padding_idx=0)
        self.rnn = nn.RNN(DIM_EMBEDDING, RNN_HIDDEN, batch_first=True)
        self.out = nn.Linear(RNN_HIDDEN, ntags)
        
    def forward(self, inputData):
        embedded = self.embedding(inputData)
        outputs, hidden = self.rnn(embedded)
        predictions = self.out(outputs)
        return predictions

        
model = TaggerModel(len(word_vocab), len(label_vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='sum')

for epoch in range(EPOCHS):
    model.train()
    # reset the gradient
    model.zero_grad()
    # loop over batches
    for i, (X, Y) in enumerate(zip(train_X_batches, train_Y_batches)):
        # Y.shape = [16, 159], 16 sentences, 159 "words" (most are pads)
        probs: torch.Tensor = model.forward(X.type(torch.int))
        # predicted_values.rhape = [16, 159, 18], 16 sentences, 159 "words", 18 possible labels per word (one prediction per label)
        preds, indices = probs.max(dim=2)
        # calculate loss (and print)
        Yflat = Y.view((BATCH_SIZE * max_len)) # flattes into 16 * 159
        probsflat = probs.view((BATCH_SIZE * max_len, -1)) # flattens into 16 * 159, 18
        loss = loss_function(probsflat, Yflat)
        loss.backward()
        
        # update
        optimizer.step()
        print(f"Epoch {epoch+1}/{EPOCHS}, batch {i+1}/{num_batches}. Current loss: {loss.item()}", end="\r")
    print()

# wth why did the loss increate with every epoch. Does it actually represent accuracy or something or did I accidentally lobotomise my model somehow

# set to evaluation mode
model.eval()

Epoch 1/5, batch 783/783. Current loss: 2912.4770507812575
Epoch 2/5, batch 783/783. Current loss: 3222.5026855468755
Epoch 3/5, batch 783/783. Current loss: 4062.1633300781255
Epoch 4/5, batch 783/783. Current loss: 4184.4956054687555
Epoch 5/5, batch 783/783. Current loss: 4378.6992187537555


TaggerModel(
  (embedding): Embedding(19671, 100, padding_idx=0)
  (rnn): RNN(100, 50, batch_first=True)
  (out): Linear(in_features=50, out_features=18, bias=True)
)

* b) Now evaluate the tagger on the dev data (`pos-data/en_ewt-dev.conll`) with accuracy (make sure to not count the padding tokens).

In [29]:
dev_X_mat, dev_Y_mat, _, _ = prepare(dev_data, word_vocab=word_vocab, label_vocab=label_vocab)
m = nn.Softmax(dim=2)
y_prob = m(model(dev_X_mat))
probs, preds = y_prob.max(dim=2)
print(preds.shape)
print(dev_Y_mat.shape)
y_pred = preds
y_true = dev_Y_mat

correct = sum(torch.sum(y_pred == y_true, dim=1))
print(correct / (2000 * 75))
# thats not great

torch.Size([2000, 75])
torch.Size([2000, 75])
tensor(0.0858)


# Lecture 8: Bi-LSTM for language classification

In assignments for this lecture we are going to implement an LSTM classifier in Pytorch including dropout layers, and train it for the task of topic classification.

You can use the following function for data reading:

In [None]:
def load_topics(path):
    text = []
    labels = []
    for lineIdx, line in enumerate(open(path)):
        tok = line.strip().split('\t')
        labels.append(tok[0])
        text.append(tok[1].split(' '))
    return text, labels

topic_train_text, topic_train_labels = load_topics('topic-data/train.txt')
topic_dev_text, topic_dev_labels = load_topics('topic-data/dev.txt')

## 3. Implement a Bi-LSTM in Pytorch

* a) Convert the data to a format that can be used in a Pytorch module. In this assignment, we can cap the size of an utterance, as each utterance only needs 1 label. Use a maximum length of 32 words, for longer sentences, only keep the first 32 words. A special `<PAD>` token can be used for padding for sentences shorter as 32 words. For the unknown words in the test set, you can use the `<PAD>` token as well.

**hint**: the shape of the training data should be 13,000 by 32

In [54]:
import torch


# Your implementation goes here:
"""
I'd like to take this moment to mention the following like, found in the week 3 exercises:
> # This is the definition of an FNN model in PyTorch, and can mostly be ignored for now.
> # We will focus on how to create Torch models in week 5
The name of this file is "week4.ipynb". We are now being told to create PyTorch models from scratch, a week before we're meant to be learning about them. I have no idea what I'm doing.
"""
# anyway, doesn't even mentino what data we're meant to be using here. Guess I'll have to assume it's `topic-data`?
train_y, train_x = read_conll_file("topic-data/train.txt")[0]
dev_y, dev_x = read_conll_file("topic-data/dev.txt")[0]
# split words
train_X = [sentence.split() for sentence in train_x]
dev_X = [sentence.split() for sentence in dev_x]
# Limit to 32 words
train_X = [words[:32] for words in train_X]
dev_X = [words[:32] for words in train_X]
# Pad
for i in range(len(train_X)):
    train_X[i] += ["<PAD>"] * (32 - len(train_X[i]))
for i in range(len(dev_X)):
    dev_X[i] += ["<PAD>"] * (32 - len(dev_X[i]))
# Build vocab
word_vocab = Vocab()
label_vocab = Vocab()
[word_vocab.getIdx(word, add=True) for words in train_X for word in words]
label_idxs = [label_vocab.getIdx(label, add=True) for label in train_y]

# Build tensors
train_X_mat = torch.zeros((len(train_X), 32), dtype=torch.int)
for i, words in enumerate(train_X):
    idxs = [word_vocab.getIdx(word) for word in words]
    train_X_mat[i] = torch.tensor(idxs, dtype=torch.int)
train_y_mat = torch.tensor(label_idxs)

print(train_X_mat)
print(train_X_mat.shape)
print(train_y_mat)
print(train_y_mat.shape)

tensor([[    1,     2,     3,  ...,     0,     0,     0],
        [   17,    18,    19,  ...,     0,     0,     0],
        [   31,    32,    33,  ...,     0,     0,     0],
        ...,
        [  418,   408,    57,  ...,     0,     0,     0],
        [25643,   308,   233,  ...,   138,  3846,   279],
        [25644,    40,  8349,  ...,  1638, 11782, 25648]], dtype=torch.int32)
torch.Size([13000, 32])
tensor([1, 1, 1,  ..., 3, 2, 2])
torch.Size([13000])


* b) Convert your input into batches of size 64, similar as you did in assignment 1b

In [58]:
batch_size = 64
num_batches = int(len(train_X_mat)/batch_size)

print(f"{num_batches} batches")
train_X_batches = train_X_mat[:batch_size*num_batches].view(num_batches, batch_size, train_X_mat.shape[1])
train_y_batches = train_y_mat[:batch_size*num_batches].view(num_batches, batch_size)
print(train_X_batches.shape)
print(train_y_batches.shape)

print(f"Lost {train_X_mat.shape[0] - train_X_batches.shape[0]*batch_size} datapoints")

203 batches
torch.Size([203, 64, 32])
torch.Size([203, 64])
Lost 8 datapoints


* c) Implement a classification model in Pytorch using a [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer and a [`torch.nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) layer. Train this classification model on the language identification data, and evaluate its performance. Note that during each training step, you now get the predictions and loss on a whole batch directly. Use the following hyperparameters: 5 epochs over the full training data, word embeddings dimension: 100, lstm dimension of 50, learning rate of 0.01 in an [Adam optimizer](https://pytorch.org/docs/stable/optim.html) and a [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).

Hints:
* see also the hints for assignment2a
* Set bidirectional=True for the LSTM layer (so that we are training a Bi-LSTM), note that the input dimensions of the next layer should then be lstm_dim*2. 
* We use words as inputs, and need only one label per sentence, so you should use the output of the last item from the forward layer, and the output of the first item for the backward layer.

In [60]:
# Hint: In torch, the BiLSTM returns a concatenation of the forward and 
# backward layer. Here is an example of how these can be extracted again
#     backward_out = bilstm_out[:,0,-size:].squeeze()
#     forward_out = bilstm_out[:,-1,:size].squeeze()

# I am taking *big* inspiration from https://github.com/bentrevett/pytorch-pos-tagging/blob/master/1_bilstm.ipynb
DIM_EMBEDDING = 100
LSTM_HIDDEN = 50
LEARNING_RATE = 0.01
EPOCHS = 5

class BiLSTMClassifier(torch.nn.Module):
    def __init__(self, nwords) -> None:
        super(BiLSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(nwords, DIM_EMBEDDING, padding_idx=0)
        self.lstm = nn.LSTM(DIM_EMBEDDING, LSTM_HIDDEN, bidirectional=True)
    
    def forward(self, inputData):
        embedded = self.embedding(inputData)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs

model = BiLSTMClassifier(len(word_vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = nn.CrossEntropyLoss(ignore_index=0, reduce='sum')

for epoch in range(EPOCHS):
    model.train()
    model.zero_grad()
    for i, (X, y) in enumerate(zip(train_X_batches, train_y_batches)):
        probs = model.forward(X.type(torch.int))
        # okay so the output here is the y value for every word from start to and + every word from end to start.
        # but 50 times each, because thats the output dimension. That means I have to turn a 32 * 100 matrix into a single value?
        # or at least into N values where N=number of labels. I have no idea how to do that. We were not told how to do that.
        # Therefore I give up. I'll ask about it during the next exercise class or smthn.
        print(probs.shape)
        print(X.shape)

model.eval()



torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([64, 32, 100])
torch.Size([64, 32])
torch.Size([6

BiLSTMClassifier(
  (embedding): Embedding(25649, 100, padding_idx=0)
  (lstm): LSTM(100, 50, bidirectional=True)
)


* d) Add a `torch.nn.Dropout` layer with a masking probability of 0.2 between the word embeddings and the LSTM layer and
  another dropout layer with a masking probability of 0.3 between the LSTM layer and the output layer. Evaluate the
  performance again, is the performance higher?, why would this be the case?

![](https://media1.tenor.com/m/pgQLrCGpKKYAAAAd/cat-blueberry.gif)