# Introduction

### Due March 17th, 23:59

In this homework you will implement an LSTM model for POS tagging.

You are given the following files:
- `POS_NEMM.ipynb`: Notebook file for NEMM model (Optional)
- `POS_LTML.ipynb`: Notebook file for LSTM model
- `train.txt`: Training set to train your model
- `test.txt`: Test set to report your model’s performance
- `tags.csv`: Treebank tag universe
- `sample_prediction.csv`: Sample file your prediction result should look like
- `utils/`: folder containing all utility code for the series of homeworks


### Deliverables (zip them all)

- A PDF or HTML version of your final notebook.
- `prediction.csv`: the predictions for `test.txt`, generated with the best model you
trained (be careful: the model that scores best on your training set might not
be the best model on the test set).
- `writeup.pdf`: summarize the methods you used and report their performance.
If you worked on the optional task, add a discussion of it. Also add a short essay
on the biggest challenges you encountered during this assignment and
what you learned.

(**You are encouraged to add the writeup to your notebook
using markdown/HTML, just as these notes are prepared.**)


# Load data

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import sys
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from scipy import sparse

# add utils folder to path
p = os.path.dirname(os.getcwd())
if p not in sys.path:
    sys.path = [p] + sys.path
    
from utils.hw5 import load_data, save_prediction, ignore_class_accuracy, whole_sentence_accuracy
from utils.general import show_keras_model

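`tags` is the list of [Treebank tags](https://www.clips.uantwerpen.be/pages/mbsp-tags) read from `tags.csv`; the model code below uses each tag's position in the list as its numerical encoding. There are 45 tags in total, plus a special tag `START` (`tags[-1]`) to indicate the beginning of a sentence. The class below also looks up a `<PAD>` tag in this list when padding label sequences.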
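
As a small illustration of this encoding, the sketch below builds both directions of the mapping from a short, made-up tag list. The three tags are only examples; the real list comes from `tags.csv`. The tag-to-index lookup is built with `enumerate`, in the same way as `reverse_tag_vocab` in the class further down.

```python
# Hypothetical miniature tag list; the real one is read from tags.csv.
example_tags = ["NN", "VB", "START"]

# Index -> tag is plain list indexing; tag -> index is built with enumerate.
tag_to_index = {tag: i for i, tag in enumerate(example_tags)}

encoded = [tag_to_index[t] for t in ["VB", "NN"]]
decoded = [example_tags[i] for i in encoded]

print(encoded)           # [1, 0]
print(decoded)           # ['VB', 'NN']
print(example_tags[-1])  # START
```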

In [2]:
tags = list(pd.read_csv('tags.csv', index_col=0).tag_encode.keys())

train, train_label = load_data("train.txt")
train, dev, train_label, dev_label = train_test_split(train, train_label)
test, _ = load_data("test.txt")

print("Training set: %d" % len(train))
print("Dev set: %d" % len(dev))
print("Testing set: %d" % len(test))

Training set: 33539
Dev set: 11180
Testing set: 9955
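
The sizes printed above are consistent with the default behaviour of `train_test_split`, which holds out 25% of the samples and rounds the held-out count up. A quick check of that arithmetic (the helper `split_sizes` is made up for this illustration and is not part of scikit-learn):

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_held_out) for a fractional hold-out, rounding the hold-out up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

total = 33539 + 11180      # sentences loaded from train.txt
print(split_sizes(total))  # (33539, 11180)
```

Because no `random_state` is passed to `train_test_split`, rerunning the cell draws a different split; passing a fixed value such as `random_state=0` would make the split repeatable.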


# LSTM

In [12]:
from collections import Counter
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

class POS_LSTMM:
    """
    To help you focus on the LSTM model, most of the code is already written. Make sure you
    read all of it to understand how the code works. You only need to modify the prepare method
    to add the RNN model.
    """
    def __init__(self, tag_vocab=tags, max_sent_len=40, 
                 voc_min_freq=5, **kwargs):
        """
        input: 
            tag_vocab: list of tags; you are unlikely to need to change this
            voc_min_freq: use this to truncate low frequency vocabulary
            max_sent_len: truncate/pad all sentences to this length
            
            kwargs: Use as needed to pass extra parameters
        """
        self.vocab = []
        self.reverse_vocab = {}
        self.tag_vocab = tag_vocab
        self.reverse_tag_vocab = {k:v for v, k in enumerate(tag_vocab)}
        self._voc_min_freq = voc_min_freq
        self._max_sent_len = max_sent_len

        """
        Feel free to add code here as you need
        """

    def collect_vocab(self, X):
        """
        Create vocabulary from all input data
        input:
            X: list of sentences
        """
        vocab = Counter([t for s in X for t in s])
        vocab = {k: v for k, v in vocab.items() if v > self._voc_min_freq}
        vocab = ["<PAD>", "<UNK>"] + sorted(vocab, key=lambda x: vocab[x], reverse=True)
        reverse_vocab = {k: v for v, k in enumerate(vocab)}
        
        return vocab, reverse_vocab
                
    def transform_X(self, X):
        """
        Translate input raw data X into trainable numerical data
        input:
            X: list of sentences
        """
        X_out = []
        
        default = self.reverse_vocab['<UNK>']
        for sent in X:
            X_out.append([self.reverse_vocab.get(t, default) for t in sent])
            
        X_out = pad_sequences(sequences=X_out, maxlen=self._max_sent_len, 
                              padding='post', truncating='post',
                              value=self.reverse_vocab['<PAD>'])
        
        return X_out
    
    def transform_Y(self, Y):
        """
        Translate input raw data Y into trainable numerical data
        input:
            Y: list of lists of tags
        """
        Y_out = [] 
        
        for labs in Y:
            Y_out.append([self.reverse_tag_vocab[lab] for lab in labs])
            
        Y_out = pad_sequences(sequences=Y_out, maxlen=self._max_sent_len, 
                              padding='post', truncating='post',
                              value=self.reverse_tag_vocab['<PAD>'])
        
        return Y_out
    
    def prepare(self, X, Y):
        """
        input:
            X: list of sentences
            Y: list of lists of tags
        """
        self.vocab, self.reverse_vocab = self.collect_vocab(X)
        X, Y = self.transform_X(X), self.transform_Y(Y)
        
        embedding_dim = 100
        lstm_node = 128

        """
        Write your own model here
        Hints:
            - Remember to use an embedding layer at the beginning
            - Use a Bidirectional LSTM to take advantage of the history in both directions
        """
        model = Sequential()
        model.add(InputLayer(input_shape=(self._max_sent_len,)))
        model.add(Embedding(len(self.vocab), embedding_dim))
        model.add(Bidirectional(LSTM(lstm_node, return_sequences=True)))
        model.add(Dropout(0.2))
        model.add(TimeDistributed(Dense(len(self.tag_vocab))))
        model.add(Dropout(0.2))
        model.add(Activation('softmax'))

        """
        You can read the source code to understand how ignore_class_accuracy works.
        This customized metric is needed because the training data is padded with
        many '<PAD>' tags. Predicting this tag is easy and uninformative, so it must
        be ignored when calculating the accuracy.
        """
        model.compile(loss='categorical_crossentropy',
                      optimizer=Adam(0.001),
                      metrics=['accuracy', 
                               ignore_class_accuracy(self.reverse_tag_vocab['<PAD>']),
                               whole_sentence_accuracy(self.reverse_tag_vocab['<PAD>'])])

        self.model = model
        
        return self
        
        
    def fit(self, X, Y, batch_size=128, epochs=10):
        X, Y = self.transform_X(X), self.transform_Y(Y)
        self.model.fit(X, to_categorical(Y, num_classes=len(self.tag_vocab)),
                       batch_size=batch_size, 
                       epochs=epochs, validation_split=0.2)

        return self

    def predict(self, X):
        results = []
        X_new = self.transform_X(X)
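        # Sequential.predict_classes was removed in newer TensorFlow releases;
        # taking the argmax over the class axis gives the same class indices.
        Y_pred = np.argmax(self.model.predict(X_new), axis=-1)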
    
        for i, y in enumerate(Y_pred):
            results.append(
                [self.tag_vocab[y[j]] for j in range(min(len(X[i]), len(X_new[i])))]
            )
            
        return results

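This POS_LSTMM model uses two Dropout layers to reduce overfitting during training: one after the bidirectional LSTM and one after the TimeDistributed Dense layer, just before the softmax. Both use a dropout rate of 20%, which means that at each training step a random 20% of the layer's inputs are set to 0. Keras scales the remaining values by 1/(1 - 0.2) so that the expected sum is unchanged, and dropout is inactive at prediction time.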
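
To make the effect of a 20% dropout rate concrete, here is a minimal pure-Python sketch of training-time ("inverted") dropout on a list of numbers. It is only an illustration of the idea, not the Keras implementation, and the function name `dropout` is chosen for this example.

```python
import random

def dropout(values, rate=0.2, rng=None):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
out = dropout([1.0] * 10000, rate=0.2, rng=rng)

zero_fraction = out.count(0.0) / len(out)
print(zero_fraction)  # roughly 0.2
print(max(out))       # survivors are scaled to 1 / 0.8
```

Scaling the surviving values keeps the expected activation the same with and without dropout, which is why nothing needs to be rescaled at prediction time.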

In [7]:
lstm = POS_LSTMM().prepare(train, train_label)
lstm.model.summary()
# show_keras_model(lstm.model)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 40, 100)           803900    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 40, 256)           234496    
_________________________________________________________________
dropout_2 (Dropout)          (None, 40, 256)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 40, 47)            12079     
_________________________________________________________________
dropout_3 (Dropout)          (None, 40, 47)            0         
_________________________________________________________________
activation_1 (Activation)    (None, 40, 47)            0         
Total params: 1,050,475
Trainable params: 1,050,475
Non-trainable params: 0
____________________________________________
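
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are made up for this illustration, and the vocabulary size of 8039 is inferred from the embedding row of the summary (803900 / 100).

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size = 803900 // 100          # 8039, inferred from the summary
emb = embedding_params(vocab_size, 100)
bilstm = 2 * lstm_params(100, 128)  # forward and backward LSTM
dense = dense_params(2 * 128, 47)   # BiLSTM output is 256 wide, 47 tag classes

print(emb, bilstm, dense, emb + bilstm + dense)
# 803900 234496 12079 1050475
```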

In [8]:
lstm = POS_LSTMM().prepare(train, train_label)
lstm.fit(train, train_label)

Train on 26831 samples, validate on 6708 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<__main__.POS_LSTMM at 0x295b7980d68>
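
The "Train on 26831 samples, validate on 6708 samples" line follows from `validation_split=0.2` in `fit`: Keras sets aside the last 20% of the arrays it is given, before any shuffling. The sketch below reproduces the two counts; the helper name `validation_split_sizes` is hypothetical.

```python
import math

def validation_split_sizes(n_samples, validation_split=0.2):
    """Return (n_train, n_validation) when the tail fraction is held out."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return split_at, n_samples - split_at

print(validation_split_sizes(33539))  # (26831, 6708)
```

Note that this validation subset is carved out of `train`; the separate `dev` split created earlier is not used by `fit`.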

## Save your model prediction

In [9]:
prediction = lstm.predict(test)
save_prediction(prediction)
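
Since the deliverables warn that the best model on the training set may not be the best on the test set, one way to compare candidate models before saving predictions is to score them on the held-out `dev` split. The sketch below computes plain token-level accuracy from lists of tag sequences. It is a hypothetical helper, not part of `utils`, and it assumes gold labels and predictions are both lists of tag lists, as `predict` returns.

```python
def token_accuracy(gold, pred):
    """Fraction of gold tags predicted correctly.

    Positions missing from a prediction (for example, a sentence truncated
    to the maximum length) count as errors.
    """
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        total += len(gold_sent)
        correct += sum(g == p for g, p in zip(gold_sent, pred_sent))
    return correct / total if total else 0.0

# Toy example with made-up sentences.
gold = [["DT", "NN"], ["VB", "NN", "."]]
pred = [["DT", "NN"], ["VB", "JJ"]]
print(token_accuracy(gold, pred))  # 0.6
```

With the objects defined in this notebook, the call would look like `token_accuracy(dev_label, lstm.predict(dev))`.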