# Assignment 4 (Part 2) (CPSC 436N): LSTM-Based Part-of-Speech (POS) Tagger

In Part 2 of this assignment, we again use the text file __cmpt-hw2-3.txt__, this time to train a bidirectional stacked LSTM-based neural sequence labeling model that predicts the part-of-speech tags of unseen input sentences.

## 1. Import the libraries and packages needed for this assignment

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [5]:
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np
import math
from sklearn.metrics import f1_score

from tqdm import tqdm

## 2. Load the dataset (cmpt-hw2-3.txt)

We will use the same dataset as in Part 1, namely **cmpt-hw2-3.txt**. Please first upload this text file from your local computer to Colab by following these instructions:

https://colab.research.google.com/notebooks/io.ipynb

**Please note that the uploaded file is deleted once the connection is terminated, so you need to upload the file every time you connect to Google Colab.**

<br>

For convenience, we will load and save each line of the dataset as **a token list** and **a pos list**. For example, one line in our dataset:

*There_EX are_VBP also_RB plant_NN and_CC gift_NN shops_NNS ._.*

will be saved as:

__token list:__ ['there', 'are', 'also', 'plant', 'and', 'gift', 'shops', '.']

__pos list:__ ['EX', 'VBP', 'RB', 'NN', 'CC', 'NN', 'NNS', '.']


**Please note that we convert each token to lowercase.**
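
As a minimal sketch of this conversion, the helper below turns one line of the dataset into a token list and a POS list. The function name `parse_tagged_line` is hypothetical, and splitting on the last underscore with `rsplit` is an assumption made here so that a token containing an underscore would stay intact; the loading cell in this notebook splits on every underscore instead.

```python
def parse_tagged_line(line):
    """Split a line of token_TAG pairs into a lowercased token list and a tag list."""
    tokens, tags = [], []
    for pair in line.strip().split(' '):
        # Split on the LAST underscore only (an assumption of this sketch),
        # so the tag is always the part after the final '_'.
        token, tag = pair.rsplit('_', 1)
        tokens.append(token.lower())  # tokens are lowercased, tags are kept as-is
        tags.append(tag)
    return tokens, tags


tokens, tags = parse_tagged_line(
    "There_EX are_VBP also_RB plant_NN and_CC gift_NN shops_NNS ._.")
print(tokens)
print(tags)
```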

In [6]:
data_path = './cmpt-hw2-3.txt'

dataset = [] #initialize dataset list.
with open(data_path) as f: #open the text file; it is closed automatically when the block ends.
    for line in f: #load data from the text file line by line.
        text_list = []; pos_list = []
        l = line.strip().split(' ') #split each line in the file by space.
        for w in l:
            w = w.split('_') #each token and its pos tag are joined by an underscore, so we split on the underscore.
            text_list.append(w[0].lower()) #add token to the text list for the current line.
            pos_list.append(w[1]) #add pos tag to the pos list for the current line.
        dataset.append((text_list, pos_list)) #add the processed line (text and pos list) into the dataset list.

## 3. Split the loaded dataset into training/dev/testing subsets

Here we follow the common scheme of splitting the dataset into training/dev/test sets with a ratio of 80%-10%-10%. After shuffling the data to avoid any possible ordering bias, we take the first 80% of the samples as the training set, the next 10% as the dev set, and the last 10% as the test set.
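
The slicing arithmetic can be checked on dummy data. The sketch below is illustrative only: the helper name `split_80_10_10` and the `seed` parameter are assumptions (the notebook shuffles without a fixed seed), and the size of 2220 dummy items was chosen because it is consistent with the 1776 training and 222 test samples shown in the progress bars further down.

```python
import math
import random


def split_80_10_10(samples, seed=0):
    """Shuffle a copy of `samples` and cut it into train/dev/test slices."""
    shuffled = list(samples)
    # A local, seeded RNG makes the split repeatable (an assumption of this sketch).
    random.Random(seed).shuffle(shuffled)
    train_end = math.floor(len(shuffled) * 0.8)
    dev_end = math.floor(len(shuffled) * 0.9)
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


train, dev, test = split_80_10_10(range(2220))
print(len(train), len(dev), len(test))
```

Because the three slices share their boundaries, every sample lands in exactly one subset.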

In [7]:
import random

random.shuffle(dataset)

training_data = dataset[0:math.floor(len(dataset)*0.8)]
dev_data = dataset[math.floor(len(dataset)*0.8):math.floor(len(dataset)*0.9)]
test_data = dataset[math.floor(len(dataset)*0.9):]

## 4. Construct word-to-index and tag-to-index mapping dictionary

We model POS tagging as a sequence labeling task. The pipeline can be described as:

<br>

*a sequence of tokens --(mapping)--> a sequence of token indices --(input)--> POS Tagger --(output)--> a sequence of POS tag indices --(mapping)--> a sequence of POS tags*

<br>

So in this step, we want to construct two mapping dictionaries to assign a unique index to each token and tag respectively (see code below). These two mapping dictionaries will be used in the first and last mapping steps in the above pipeline.
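
To make the two mapping steps concrete, here is a toy round trip through the pipeline. The vocabularies are hand-written for illustration (the real ones are built from the training data), and the tagger is replaced by a hard-coded list of tag indices, so this only illustrates the dictionary lookups, not the model.

```python
# Toy vocabularies (hypothetical values for illustration only).
word2idx = {'_unk_': 0, 'there': 1, 'are': 2, 'shops': 3, '.': 4}
tag2idx = {'EX': 0, 'VBP': 1, 'NNS': 2, '.': 3}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

tokens = ['there', 'are', 'shops', '.']

# First mapping step: tokens -> token indices.
token_ids = [word2idx[tok] for tok in tokens]

# Stand-in for the POS tagger: a hard-coded output, NOT a real prediction.
predicted_tag_ids = [0, 1, 2, 3]

# Last mapping step: tag indices -> POS tags.
predicted_tags = [idx2tag[i] for i in predicted_tag_ids]

print(token_ids)
print(predicted_tags)
```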

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

As in Part 1, we also need to deal with **unknown tokens** at test time. The code below implements one possible way to address this problem. ***Q5: Please add comments to the lines with "COMMENT NEEDED", and describe in your own words the implemented solution.***

In [8]:
class Word_Dictionary(object): #this class is used to generate and return the word-to-index (index-to-word) vocabulary dictionary.
    def __init__(self): #initializing.
        self.word2idx = {'_unk_':0} # Reserve index 0 for the placeholder token '_unk_', used later for unknown tokens
        self.idx2word = {0:'_unk_'} # Reverse mapping: index 0 maps back to '_unk_'
        self.idx = 1                # Real words are numbered from 1, since index 0 is taken by '_unk_'
    
    def add_word(self, word): #add a new word into the dictionary and assign a unique index to it.
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1
        
    def __len__(self): #return the number of words in this dictionary.
        return len(self.word2idx)


class Tag_Dictionary(object): #this class is used to generate and return the tag-to-index (index-to-tag) dictionary.
    def __init__(self): #initializing.
        self.tag2idx = {}
        self.idx2tag = {}
        self.idx = 0
    
    def add_tag(self, tag): #add a new tag into the dictionary and assign a unique index to it.
        if tag not in self.tag2idx:
            self.tag2idx[tag] = self.idx
            self.idx2tag[self.idx] = tag
            self.idx += 1
        
    def __len__(self): #return the number of tags in this dictionary.
        return len(self.tag2idx)


wd_dict = Word_Dictionary() #initialize the word-to-index dictionary.
tag_dict = Tag_Dictionary() #initialize the tag-to-index dictionary.
unknown_threshold = 1 # Frequency threshold: a word must occur more than this many times in the training data
                      # to get its own index; rarer words are mapped to '_unk_' (index 0)

word_count = {} #initialize the dictionary to save frequencies of words.

for sample in training_data: #fill the word frequency dictionary with training data.
    text, tags = sample
    for word in text:
      if word not in word_count.keys():
        word_count[word] = 1
      else:
        word_count[word] += 1

for sample in training_data: #fill the word-to-index and tag-to-index dictionary with training data.
    text, tags = sample
    for word in text:
      if word_count[word] > unknown_threshold:  # Only words occurring more than unknown_threshold times (here, at least twice)
                                                # get their own index. Words seen only once are left out of the dictionary
                                                # and are replaced with '_unk_' by the data processing code below,
                                                # so the model also sees '_unk_' during training and learns an embedding for it
        wd_dict.add_word(word)
    for tag in tags:
      tag_dict.add_tag(tag)
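
The rare-word strategy used in this step can be sketched in a more compact form with `collections.Counter`. This is an illustrative rewrite under stated assumptions, not the cell above: `build_vocab` and `encode` are hypothetical helper names, and the two toy sentences stand in for the training data.

```python
from collections import Counter


def build_vocab(sentences, unknown_threshold=1, unk_token='_unk_'):
    """Index only words that occur more than `unknown_threshold` times."""
    counts = Counter(word for sent in sentences for word in sent)
    word2idx = {unk_token: 0}  # index 0 is reserved for unknown tokens
    for sent in sentences:
        for word in sent:
            if counts[word] > unknown_threshold and word not in word2idx:
                word2idx[word] = len(word2idx)
    return word2idx


def encode(sentence, word2idx, unk_token='_unk_'):
    """Map each word to its index, falling back to the unknown index."""
    return [word2idx.get(word, word2idx[unk_token]) for word in sentence]


toy_train = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]
vocab = build_vocab(toy_train)
print(vocab)
print(encode(['the', 'cat', 'sat'], vocab))   # 'cat' is a training singleton
print(encode(['the', 'bird', 'sat'], vocab))  # 'bird' never appeared in training
```

Both the training singleton and the never-seen word end up on index 0, which is what lets the model learn a representation for unknown tokens from the training data itself.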

## 5. Implement the function for data processing

In this step, we implement a function that converts our data into a format ready for training and testing the neural POS tagger.

More specifically, we convert the input token and pos tag list into tensors (i.e., vectors) containing their corresponding indexes.

In [9]:
def data_processing_for_lstm(sample, word_dict, tag_dict, unknown_threshold): #this function converts a sample into the format ready for our neural POS tagger.
    text, tags = sample
    word_ids = []; tag_ids = []

    for word in text:
      if word in word_dict.word2idx.keys() and word_count[word] > unknown_threshold: #map the token to its index if its frequency is over a threshold.
        word_ids.append(word_dict.word2idx[word])
      else: #otherwise (the token was rare in training or never seen) map it to the index of the unknown token.
        word_ids.append(word_dict.word2idx['_unk_'])

    for tag in tags: #map pos tags into indices using the tag-to-index dictionary.
      tag_ids.append(tag_dict.tag2idx[tag])

    word_ids = torch.from_numpy(np.array(word_ids)) #convert the list of token ids into tensor format.
    tag_ids = torch.from_numpy(np.array(tag_ids)) #convert the list of tag ids into tensor format.

    return word_ids, tag_ids

## 6. The class of LSTM-based POS tagger

Now let's design the architecture of our bidirectional stacked LSTM-based POS tagger. It should consist of three layers:

* [Embedding layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html): projecting input token ids into their embedding space.
* [(Bi-)LSTM hidden state layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html): computing a hidden state for each token from both its left and right context.
* [Output layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html): converting the hidden states to POS predictions.
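
The following sketch traces tensor shapes through these three layers for a single sentence. All sizes are toy values chosen for illustration, and the initial LSTM states are simply omitted, in which case `nn.LSTM` defaults them to zeros.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions for illustration, not the notebook's hyperparameters).
vocab_size, embed_size, hidden, num_layers, tag_size = 50, 8, 16, 2, 5
seq_len = 6

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(input_size=embed_size, hidden_size=hidden,
               num_layers=num_layers, bidirectional=True)
out_layer = nn.Linear(hidden * 2, tag_size)  # * 2: forward and backward states are concatenated

token_ids = torch.randint(0, vocab_size, (seq_len,))
x = embed(token_ids)                  # (seq_len, embed_size)
x = x.view(seq_len, 1, -1)            # (seq_len, batch=1, embed_size)
lstm_out, (h_n, c_n) = lstm(x)        # lstm_out: (seq_len, 1, hidden * 2)
scores = out_layer(lstm_out.reshape(seq_len, -1))  # (seq_len, tag_size)

print(x.shape, lstm_out.shape, h_n.shape, scores.shape)
```

Note that `h_n` has `num_layers * 2` entries along its first dimension, one per layer and direction, which is why the initial states of a bidirectional LSTM need that same leading size.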

**Q6:** After carefully reading the PyTorch links above, please ***add comments to the lines with "COMMENT NEEDED".***

In [16]:
def zero_state(module, batch_size, device): #the function to initialize the states of the bidirectional LSTM.
    # * 2 is for the two directions
    return Variable(torch.zeros(module.num_layers * 2, batch_size, module.hidden)).to(device), \
           Variable(torch.zeros(module.num_layers * 2, batch_size, module.hidden)).to(device)

class LSTM_tagger(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden, num_layers, tag_size, device):
        super(LSTM_tagger, self).__init__()
        self.device = device #the device we run our model on.
        self.num_layers = num_layers # Number of recurrent layers.
                                     # Since we want a stacked LSTM, we set it to the desired number of layers;
                                     # each layer after the first takes the output sequence of the layer
                                     # below it as its input, and the top layer produces the final outputs.

        self.hidden = hidden #the dimension of hidden state.
        self.input_size = embed_size # the size of each embedding vector
        self.embed = nn.Embedding(vocab_size, embed_size) #the word embedding layer (convert words to embeddings).
        self.lstm = nn.LSTM(input_size=self.input_size,
                            hidden_size=self.hidden,
                            num_layers=self.num_layers,
                            dropout=0.5, # Dropout probability p: during training, each element of the output of every
                                       # LSTM layer except the last is zeroed with probability p
                                       # (PyTorch's default is 0, meaning no dropout; here we use 0.5)
                            bidirectional=True) # Process the sequence both left-to-right and right-to-left, so each
                                                # hidden state has context from both sides (output size is hidden * 2)

        self.h2s = nn.Linear(hidden * 2, tag_size) #the linear output layer (convert from hidden states of LSTM to the list of tag scores).

    def forward(self, x):
        x = self.embed(x) #convert word list to embedding list for the input sample.
        x = x.view(len(x),1,-1)
        s = zero_state(self, 1, self.device) #initialize the states of LSTM.
        lstm_output, _ = self.lstm(x, s) #LSTM.
        outputs = self.h2s(lstm_output.view(len(x),-1)) # Drop the batch dimension (batch size is 1), then apply the linear layer to each
                                                        # token's concatenated forward/backward hidden state (size hidden * 2),
                                                        # giving one row of tag_size scores per token

        tags_probs = F.log_softmax(outputs,dim=1) # The linear layer's scores can be any real numbers, not a probability distribution,
                                                  # so we apply log softmax over the tag dimension to normalize them into
                                                  # log-probabilities (their exponentials sum to 1 for each token)

        return tags_probs

## 7. Initialize model and define parameters



In [17]:
embed_size = 128
intermediate_size = 256
num_layers = 2
vocab_size = len(wd_dict)
tag_size = len(tag_dict)

num_epochs = 10
learning_rate = 0.002
device = 0 if torch.cuda.is_available() else 'cpu' # Specify the device (CPU/GPU) to train/eval the model on
criterion = nn.CrossEntropyLoss() # Loss function comparing predicted tag scores with the gold tags; its gradients drive the weight updates

model = LSTM_tagger(vocab_size, embed_size, intermediate_size, num_layers, tag_size, device)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) # The optimizer holds the current state 
                                                                   # and will update the parameters based on the computed gradients. 
                                                                   # We need this to update the weights to improve accuracy of the model.
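
One detail worth checking is how the model's `log_softmax` output interacts with `nn.CrossEntropyLoss`, which applies a log softmax internally. The sketch below, using random toy scores, suggests that the combination is redundant but harmless: applying log softmax to values that are already log-probabilities leaves them unchanged, so the loss matches both the loss on raw scores and `nn.NLLLoss` on the log-probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)           # toy scores: 4 tokens, 3 tags
gold = torch.tensor([0, 2, 1, 2])    # toy gold tag indices

log_probs = F.log_softmax(logits, dim=1)

ce_on_logits = nn.CrossEntropyLoss()(logits, gold)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, gold)   # the combination used in this notebook
nll_on_log_probs = nn.NLLLoss()(log_probs, gold)

# Each row of log_probs exponentiates to a distribution that sums to 1.
rows_sum_to_one = torch.allclose(log_probs.exp().sum(dim=1), torch.ones(4))
same_loss = (torch.allclose(ce_on_logits, ce_on_log_probs)
             and torch.allclose(ce_on_logits, nll_on_log_probs))
print(rows_sum_to_one, same_loss)
```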

## 8. Train the model

In [21]:
for epoch in range(num_epochs):
    avg_ppl = 0; avg_loss = 0;
    for i in tqdm(range(0, len(training_data))): #input one sample at a time to train the model.
        inputs, targets = data_processing_for_lstm(training_data[i], wd_dict, tag_dict, unknown_threshold)  #prepare the sample for model training.
        
        #forward pass.
        outputs = model(inputs.to(device)) # Returns, for each token, the log-probabilities over all tags
        loss = criterion(outputs, targets.reshape(-1).to(device)) # Pass the per-token tag scores and the gold tag indices to CrossEntropyLoss,
                                                                  # which returns the loss averaged over the tokens of this sentence

        avg_loss += loss.item() # Accumulate the total loss to average it later  
        
        #backward and optimize.
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
    
    #validation step.
    preds = []; targets = [];
    for i in range(0, len(dev_data)):
        dev_inputs, dev_targets = data_processing_for_lstm(dev_data[i], wd_dict, tag_dict, unknown_threshold)
        dev_outputs = model(dev_inputs.to(device))
        dev_pred = torch.argmax(dev_outputs, dim=-1)
        preds += dev_pred.detach().cpu().numpy().tolist()
        targets += dev_targets.tolist()
    
    print ('Epoch [{}/{}], Loss: {:.4f}, F1 score: {:5.2f}'
        .format(epoch+1, num_epochs, avg_loss/len(training_data), f1_score(targets, preds, average='macro')))

100%|██████████| 1776/1776 [00:33<00:00, 52.80it/s]


Epoch [1/10], Loss: 0.0566, F1 score:  0.93


100%|██████████| 1776/1776 [00:32<00:00, 54.82it/s]


Epoch [2/10], Loss: 0.0562, F1 score:  0.92


100%|██████████| 1776/1776 [00:32<00:00, 54.93it/s]


Epoch [3/10], Loss: 0.0578, F1 score:  0.89


100%|██████████| 1776/1776 [00:32<00:00, 55.31it/s]


Epoch [4/10], Loss: 0.0577, F1 score:  0.91


100%|██████████| 1776/1776 [00:31<00:00, 55.75it/s]


Epoch [5/10], Loss: 0.0601, F1 score:  0.92


100%|██████████| 1776/1776 [00:31<00:00, 55.66it/s]


Epoch [6/10], Loss: 0.0616, F1 score:  0.92


100%|██████████| 1776/1776 [00:31<00:00, 55.58it/s]


Epoch [7/10], Loss: 0.0564, F1 score:  0.92


100%|██████████| 1776/1776 [00:31<00:00, 56.51it/s]


Epoch [8/10], Loss: 0.0625, F1 score:  0.93


100%|██████████| 1776/1776 [00:31<00:00, 56.31it/s]


Epoch [9/10], Loss: 0.0598, F1 score:  0.92


100%|██████████| 1776/1776 [00:31<00:00, 56.56it/s]


Epoch [10/10], Loss: 0.0634, F1 score:  0.93
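
The F1 scores printed above are macro-averaged: an F1 score is computed per tag and the per-tag scores are averaged with equal weight, so rare tags count as much as frequent ones. The pure-Python sketch below illustrates the idea on toy labels; `macro_f1` is a hypothetical helper written for illustration, while the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(targets, preds):
    """Unweighted mean of per-label F1 over all labels seen in targets or preds."""
    labels = sorted(set(targets) | set(preds))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(targets, preds) if t == label and p == label)
        fp = sum(1 for t, p in zip(targets, preds) if t != label and p == label)
        fn = sum(1 for t, p in zip(targets, preds) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels: the frequent tag 0 is always right, the rare tag 2 is always missed.
targets = [0, 0, 0, 0, 1, 2]
preds = [0, 0, 0, 0, 1, 1]
accuracy = sum(t == p for t, p in zip(targets, preds)) / len(targets)
print(accuracy, macro_f1(targets, preds))
```

In this toy case five of six predictions are correct, yet the macro F1 is much lower than the accuracy because the missed rare tag contributes a per-tag F1 of 0.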


## 9. Test the model

In [22]:
test_preds = []; test_targets = [];
for i in tqdm(range(0, len(test_data))):
    test_input, test_target = data_processing_for_lstm(test_data[i], wd_dict, tag_dict, unknown_threshold)
    test_output = model(test_input.to(device))
    test_pred = torch.argmax(test_output, dim=-1)
    test_preds += test_pred.detach().cpu().numpy().tolist()
    test_targets += test_target.tolist()
print(f1_score(test_targets, test_preds, average='macro'))

100%|██████████| 222/222 [00:01<00:00, 127.27it/s]

0.881153838569011





**Q7:** Now train and test a different model with a dropout rate of 0.05. Report the performance and a plausible explanation for the different performance of the two tested models (if any).



With dropout d = 0

Performance (macro F1 on the test set): 0.9050487484286546

With dropout d = 0.5

Performance (macro F1 on the test set): 0.8930199021777645

Performance after retraining without changing hyperparameters: 0.881153838569011

The goal of adding dropout is to improve performance by preventing overfitting. Hence, I was expecting the performance with d = 0.5 to be better because the regularization parameter, p(1-p), is maximum at p = 0.5. 

However, the results here do not align with the principle of dropout. Perhaps this is a product of what is being randomly dropped, or perhaps the dataset is so small that dropping some units from the network during training is not helpful.

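That said, the difference in performance is small, roughly 0.01 to 0.02 in macro F1 (0.905 vs. 0.893 and 0.881). Retraining with identical hyperparameters already shifts the score by about 0.01 (0.893 vs. 0.881), so the gap between the two dropout settings is of the same order as the run-to-run variation, and the dropout value may not matter much here.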
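
One further factor that might contribute to the noise is the module mode. Dropout in PyTorch is only switched off in evaluation mode, and the validation and test loops in this notebook never call `model.eval()`, so the dropout between the LSTM layers stays active while scoring. The sketch below shows the mode dependence with a standalone `nn.Dropout` layer; it illustrates the mechanism only and is not a rerun of the experiment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()          # training mode: dropout is active
y_train = drop(x)     # entries are zeroed with probability p, survivors scaled by 1 / (1 - p) = 2

drop.eval()           # evaluation mode: dropout is a no-op
y_eval = drop(x)

train_values = set(y_train.tolist())
print(sorted(train_values), torch.equal(y_eval, x))
```

Under this reading, calling `model.eval()` (ideally inside `torch.no_grad()`) before validation and testing, and `model.train()` at the start of each epoch, would make the scores for d = 0.5 deterministic for a given set of weights and could change the comparison with d = 0.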
