# HW 5 - Neural POS Tagger

In this exercise, you are going to build a set of deep learning models on part-of-speech (POS) tagging using Pytorch.

To complete this exercise, you will need to build deep learning models for POS tagging in Thai using NECTEC's ORCHID corpus. You will build one model for each of the following type:

- Neural POS Tagging (without pretrained weights)
- Neural POS Tagging with CRF (with and without pretrained weights)

Pretrained word embeddding are already given for you to use (albeit, a very bad one).

We also provide the code for data cleaning, preprocessing and some starter code for pytorch in this notebook but feel free to modify those parts to suit your needs. Feel free to use additional libraries (e.g. scikit-learn) as long as you have a model for each type mentioned above.

### Don't forget to change hardware accelrator to GPU in runtime on Google Colab ###

## 1. Setup and Preprocessing

We use POS data from [ORCHID corpus](https://www.researchgate.net/profile/Virach-Sornlertlamvanich/publication/2630580_Building_a_Thai_part-of-speech_tagged_corpus_ORCHID/links/02e7e514db19a98619000000/Building-a-Thai-part-of-speech-tagged-corpus-ORCHID.pdf), which is a POS corpus for Thai language.
A method used to read the corpus into a list of sentences with (word, POS) pairs have been implemented already. The example usage has shown below.
We also create a word vector for unknown word by random.

In [1]:
!wget https://www.dropbox.com/s/tuvrbsby4a5axe0/resources.zip
!unzip resources.zip

--2023-02-19 12:49:48--  https://www.dropbox.com/s/tuvrbsby4a5axe0/resources.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:6016:18::a27d:112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/tuvrbsby4a5axe0/resources.zip [following]
--2023-02-19 12:49:48--  https://www.dropbox.com/s/raw/tuvrbsby4a5axe0/resources.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc96cb8c41a09f63ff397e86c330.dl.dropboxusercontent.com/cd/0/inline/B2zQgwO2c1zVRvbhz03yvf5Sf2fJueaenaO49Ci_i8dU3awt48U18Ux7BAetY3BDwTTQEl7E584tP4ukFiL-oJ26wVXb-vMl2YF3lIh6T5jmbRYFsLR9V7z03HwLzsduOXh4C08qdpXt68Rqf7sTwFfg-Zmj_eDfTqXMT6DvX8fZrg/file# [following]
--2023-02-19 12:49:49--  https://uc96cb8c41a09f63ff397e86c330.dl.dropboxusercontent.com/cd/0/inline/B2zQgwO2c1zVRvbhz03yvf5Sf2fJueaenaO49Ci_i8dU3awt48U18Ux7BAetY3BDwTTQEl7E584tP

In [2]:
!pip install pytorch-crf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
from data.orchid_corpus import get_sentences
import numpy as np
import numpy.random
np.random.seed(42)

In [4]:
yunk_emb =np.random.randn(32)
train_data = get_sentences('train')
test_data = get_sentences('test')
tags = set()
for x in train_data:
  for y in x:
    tags.add(y[1])
print(len(tags))
tags

47


{'ADVI',
 'ADVN',
 'ADVP',
 'ADVS',
 'CFQC',
 'CLTV',
 'CMTR',
 'CMTR@PUNC',
 'CNIT',
 'CVBL',
 'DCNM',
 'DDAC',
 'DDAN',
 'DDAQ',
 'DDBQ',
 'DIAC',
 'DIAQ',
 'DIBQ',
 'DONM',
 'EAFF',
 'EITT',
 'FIXN',
 'FIXV',
 'JCMP',
 'JCRG',
 'JSBR',
 'NCMN',
 'NCNM',
 'NEG',
 'NLBL',
 'NONM',
 'NPRP',
 'NTTL',
 'PDMN',
 'PNTR',
 'PPRS',
 'PREL',
 'PUNC',
 'RPRE',
 'VACT',
 'VATT',
 'VSTA',
 'XVAE',
 'XVAM',
 'XVBB',
 'XVBM',
 'XVMM'}

Next, we load pretrained weight embedding using pickle. The pretrained weight is a dictionary which map a word to its embedding.

In [5]:
import pickle
fp = open('basic_ff_embedding.pt', 'rb')
embeddings = pickle.load(fp)
fp.close()
embeddings

{'พุทธเจ้าพระองค์': array([-0.01917224, -0.00415204, -0.02412283, -0.04142096,  0.04691369,
         0.03376952, -0.00270034, -0.04676848,  0.03299177,  0.03790374,
         0.0432213 , -0.01537431, -0.02517369, -0.04052844, -0.01157572,
         0.00185845, -0.00034374,  0.03099574, -0.00553056,  0.03075998,
        -0.02743803, -0.03812069, -0.02771009, -0.00890391, -0.03464903,
        -0.03346384, -0.04095409,  0.03574741,  0.04473687,  0.0170097 ,
        -0.00490531,  0.01063981], dtype=float32),
 'จุ๊บุ': array([ 0.02896592,  0.02110482,  0.03715003,  0.02296479, -0.03441135,
         0.03496312,  0.03625641, -0.02355627, -0.03617386,  0.01206947,
         0.02429886, -0.02565069, -0.02642049, -0.03778682, -0.00951525,
        -0.0446926 , -0.02631601,  0.04875654,  0.04526813,  0.0079442 ,
         0.0340622 ,  0.00625456,  0.01675535,  0.01817935, -0.03839616,
        -0.04811118,  0.03423071,  0.015117  ,  0.00746933,  0.02313724,
         0.01740095,  0.02209598], dtype=floa

The given code below generates an indexed dataset(each word is represented by a number) for training and testing data. The index 0 is reserved for padding to help with variable length sequence. (Additionally, You can read more about padding here https://suzyahyah.github.io/pytorch/2019/07/01/DataLoader-Pad-Pack-Sequence.html)

## 2. Prepare Data

In [6]:
word_to_idx ={}
idx_to_word ={}
label_to_idx = {}
for sentence in train_data:
    for word,pos in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)+1
            idx_to_word[word_to_idx[word]] = word
        if pos not in label_to_idx:
            label_to_idx[pos] = len(label_to_idx)+1
word_to_idx['UNK'] = len(word_to_idx)

n_classes = len(label_to_idx.keys())+1

This section is tweaked a little from the demo, word2features will return word index instead of features, and sent2labels will return a sequence of word indices in the sentence.

In [7]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

def sent2labels(sent):
    return numpy.asarray([label_to_idx[label] for (word, label) in sent],dtype='int32')

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [8]:
sent2features(train_data[100], embeddings)

array([ 29, 327,   5, 328])

Next we create train and test dataset, then we use pytorch to post-pad the sequence to max sequence with 0. Our labels are changed to a one-hot vector.

In [9]:
%%time
x_train = np.asarray([sent2features(sent, embeddings) for sent in train_data])
y_train = [sent2labels(sent) for sent in train_data]

x_test =  np.asarray([sent2features(sent, embeddings) for sent in test_data]) 
y_test = [sent2labels(sent) for sent in test_data]



CPU times: user 395 ms, sys: 5.32 ms, total: 400 ms
Wall time: 472 ms


In [10]:
import torch 
from torch.nn.utils.rnn import pad_sequence

x_train = [torch.LongTensor(sentence) for sentence in x_train]
y_train = [torch.LongTensor(sentence) for sentence in y_train] 
x_test = [torch.LongTensor(sentence) for sentence in x_test]

x_train = pad_sequence(x_train, batch_first=True)
y_train = pad_sequence(y_train, batch_first=True)
x_test = pad_sequence(x_test, batch_first=True)

maxlen = x_train.size(1)  

# Pad the sequence length of x_test to be maxlen 
remaining_len = x_train.size(1) - x_test.size(1)
remaining_mat = torch.zeros((x_test.size(0), remaining_len), dtype=torch.long) 
x_test = torch.cat((x_test, remaining_mat), dim=1) 


In [11]:
import torch 
from torch.utils.data import Dataset, DataLoader 

class POSTaggerDataset(Dataset): 
  def __init__(self, data, labels=None):  
    self.data = data 
    self.labels = labels

    if labels is not None: 
      assert len(data) == len(labels)  

  def __getitem__(self, idx):
    if self.labels is None: 
      return torch.LongTensor(self.data[idx])
    else: 
      return (
          torch.LongTensor(self.data[idx]), 
          torch.LongTensor(self.labels[idx])
      )

  def __len__(self):
    return len(self.data)

train_dataset = POSTaggerDataset(x_train, y_train) 
test_dataset = POSTaggerDataset(x_test)

num_workers = 2
batch_size = 64

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True) 
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)   

In [12]:
out = next(iter(train_dataloader)) 
out[0].size(), out[1].size()

(torch.Size([64, 102]), torch.Size([64, 102]))

## 3. Evaluate

Our output from pytorch is a distribution of problabilities on all possible label. outputToLabel will return an indices of maximum problability from output sequence.

evaluation_report is the same as in the demo. 

**Hint** 
1. ```categorical_accuracy``` is for evaluating training set.
2. ```evaluation_report(y_true, y_pred, get_only_acc=True)``` is for evaluating test set since ```y_train``` and ```y_test``` are in different formats.

In [13]:
def outputToLabel(yt,seq_len):
    out = []
    for i in range(0,len(yt)):
        if(i==seq_len):
            break
        out.append(np.argmax(yt[i]))
    return out

In [14]:
import pandas as pd
from IPython.display import display

# for validation part 
def categorical_accuracy(preds, y, tag_pad_idx=0):
  if len(preds.shape) == 2: 
    preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements]) 
  else: 
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = preds[non_pad_elements].eq(y[non_pad_elements]) 
  return correct.sum() / y[non_pad_elements].shape[0]

def evaluation_report(y_true, y_pred, get_only_acc=False):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    for sent in y_pred:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))
    
    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count)

    # get only accuracy for testing 
    if get_only_acc: 
      return accuracy 

    accuracy *= 100 
 
            
    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if (tag_info[tag]['y_true'] > 0) else 0
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else '-'
        
        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f_score': ''})
    
    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f_score', 'correct_count']]
  
    display(df)


## 4. Train a model

The model is this section is separated to two groups

- Neural POS Tagger (4.1)
- Neural CRF POS Tagger (4.2)

## 4.1.1 Neural POS Tagger  (Example)

We create a simple Neural POS Tagger as an example for you. This model dosen't use any pretrained word embbeding so it need to use Embedding layer to train the word embedding from scratch.

In [15]:
!pip install torchinfo 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [16]:
import torchinfo 
from torch import nn 
from torch.nn import Embedding, Dropout, GRU, LSTM, Linear, CrossEntropyLoss 
from torch.optim import Adam


class BiGRU(nn.Module): 
  def __init__(self, word_to_idx):
    super(BiGRU, self).__init__() 
    self.embed = Embedding(len(word_to_idx), 32, padding_idx=0)
    self.bi_gru = GRU(32, 32, bidirectional=True, batch_first=True) 
    self.dropout = Dropout(0.2) 
    self.classifier = Linear(64, 48)
    
  def forward(self, x):
    x = self.embed(x) 
    x, _ = self.bi_gru(x) 
    x = self.dropout(x) 
    out = self.classifier(x)
    return out

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = BiGRU(word_to_idx) 
model.to(device) 

optimizer = Adam(model.parameters(), lr=1e-3) 
criterion = CrossEntropyLoss(ignore_index=0)
print(torchinfo.summary(model))

num_epochs = 10 

for epoch in range(1, num_epochs+1): 
  train_losses = [] 
  train_accs = [] 
  
  model.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)
    pred = model(inputs)

    pred = pred.reshape(-1, pred.size(-1))
    targets = targets.reshape(-1)
    
    loss = criterion(pred, targets) 
    
    train_losses.append(loss.item())
    train_accs.append(categorical_accuracy(pred, targets)) 

    loss.backward() 
    optimizer.step() 

  model.eval() 
  y_pred = [] 
  for inputs in test_dataloader: 
    inputs = inputs.to(device)
    with torch.no_grad(): 
      pred = model(inputs)
      y_pred.append(pred.cpu().detach())
    
  y_pred = torch.cat(y_pred).numpy() 
  y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
  test_acc = evaluation_report(y_test, y_pred, get_only_acc=True)  

  print(f'epoch = {epoch:02d},\
    training loss = {sum(train_losses) / len(train_losses):.3f}, \
    training acc = {sum(train_accs) / len(train_accs):.3f}, \
    testing acc = {test_acc:.3f}')  

Layer (type:depth-idx)                   Param #
BiGRU                                    --
├─Embedding: 1-1                         480,608
├─GRU: 1-2                               12,672
├─Dropout: 1-3                           --
├─Linear: 1-4                            3,120
Total params: 496,400
Trainable params: 496,400
Non-trainable params: 0
epoch = 01,    training loss = 1.727,     training acc = 0.588,     testing acc = 0.758
epoch = 02,    training loss = 0.778,     training acc = 0.803,     testing acc = 0.839
epoch = 03,    training loss = 0.562,     training acc = 0.856,     testing acc = 0.868
epoch = 04,    training loss = 0.456,     training acc = 0.882,     testing acc = 0.883
epoch = 05,    training loss = 0.393,     training acc = 0.897,     testing acc = 0.894
epoch = 06,    training loss = 0.347,     training acc = 0.908,     testing acc = 0.904
epoch = 07,    training loss = 0.316,     training acc = 0.916,     testing acc = 0.909
epoch = 08,    training loss = 

In [17]:
y_pred = [] 

for inputs in test_dataloader: 
  inputs = inputs.to(device)
  with torch.no_grad(): 
    pred = model(inputs)
    y_pred.append(pred.cpu().detach())
  
y_pred = torch.cat(y_pred).numpy() 
y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
evaluation_report(y_test, y_pred) 


Unnamed: 0,tag,precision,recall,f_score,correct_count
0,1,99.891127,99.592944,99.741813,3670.0
1,2,91.322751,94.168283,92.723691,7767.0
2,3,89.737293,93.036888,91.357307,15713.0
3,4,99.930016,99.450549,99.689706,12851.0
4,5,86.956522,89.552239,88.235294,60.0
5,6,99.116998,86.015326,92.102564,449.0
6,7,97.301205,97.113997,97.207511,2019.0
7,8,66.559486,49.879518,57.024793,207.0
8,9,49.318182,58.967391,53.712871,217.0
9,10,58.888889,44.219309,50.510551,371.0


## 4.2 CRF Viterbi

Your next task is to incorporate Conditional random fields (CRF) to your model. <b>You do not need to use pretrained weight</b>.

To use the CRF layer, you need to use an extension repository for pytorch library, call torch2crf. If you want to see the detailed implementation, you should read the official pytorch tutorial of CRF (https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html). 

torch2crf link :  https://github.com/kmkurn/pytorch-crf

For inference, you should look at crf.py at the method call and view the input/output argmunets. 

link for documentation: https://pytorch-crf.readthedocs.io/en/stable/

link for source code : https://github.com/kmkurn/pytorch-crf/blob/master/torchcrf/__init__.py




### 4.2.1 CRF without pretrained weight
### #TODO 1
Incoperate CRF layer to your model in 4.1. CRF is quite complex compare to previous example model, so you should train it with more epoch, so it can converge.

To finish this excercise you must train the model and show the evaluation report with this model as shown in the example.

Do not forget to save this model weight. (Refer to https://pytorch.org/tutorials/beginner/saving_loading_models.html)

In [37]:
# INSERT YOUR CODE HERE
from torchcrf import CRF

print(len(label_to_idx)+1)

class BiGRU_with_CRF(nn.Module): 
  def __init__(self, word_to_idx):
    super(BiGRU_with_CRF, self).__init__() 
    self.embed = Embedding(len(word_to_idx), 32, padding_idx=0)
    self.bi_gru = GRU(32, 32, bidirectional=True, batch_first=True) 
    self.classifier = nn.Linear(64,48)
    self.crf = CRF(48) 
    
  def forward(self, x):
    x = self.embed(x) 
    x, _ = self.bi_gru(x) 
    out = self.classifier(x)
    return out

device = 'cuda' if torch.cuda.is_available() else 'cpu'

crf_model = BiGRU_with_CRF(word_to_idx) 
crf_model.to(device) 

optimizer = Adam(crf_model.parameters(), lr=1e-3) 
# criterion = CrossEntropyLoss(ignore_index=0)
print(torchinfo.summary(crf_model))

num_epochs = 20

for epoch in range(1, num_epochs+1): 
  train_losses = [] 
  train_accs = [] 
  
  crf_model.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)
    pred = crf_model(inputs)

    loss = -crf_model.crf(pred, targets)

    pred = pred.reshape(-1, pred.size(-1))
    targets = targets.reshape(-1)


    train_losses.append(loss.item())
    train_accs.append(categorical_accuracy(pred, targets)) 

    loss.backward() 
    optimizer.step() 

  crf_model.eval() 
  y_pred = [] 
  for inputs in test_dataloader: 
    inputs = inputs.to(device)
    with torch.no_grad(): 
      pred = crf_model(inputs)
      y_pred.append(pred.cpu().detach())
    
  y_pred = torch.cat(y_pred).numpy() 
  y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
  test_acc = evaluation_report(y_test, y_pred, get_only_acc=True)  

  print(f'epoch = {epoch:02d},\
    training loss = {sum(train_losses) / len(train_losses):.3f}, \
    training acc = {sum(train_accs) / len(train_accs):.3f}, \
    testing acc = {test_acc:.3f}')  

48
Layer (type:depth-idx)                   Param #
BiGRU_with_CRF                           --
├─Embedding: 1-1                         480,608
├─GRU: 1-2                               12,672
├─Linear: 1-3                            3,120
├─CRF: 1-4                               2,400
Total params: 498,800
Trainable params: 498,800
Non-trainable params: 0
epoch = 01,    training loss = 4914.555,     training acc = 0.464,     testing acc = 0.701
epoch = 02,    training loss = 950.940,     training acc = 0.766,     testing acc = 0.813
epoch = 03,    training loss = 624.018,     training acc = 0.840,     testing acc = 0.860
epoch = 04,    training loss = 474.727,     training acc = 0.875,     testing acc = 0.882
epoch = 05,    training loss = 387.423,     training acc = 0.896,     testing acc = 0.895
epoch = 06,    training loss = 330.789,     training acc = 0.909,     testing acc = 0.903
epoch = 07,    training loss = 290.634,     training acc = 0.919,     testing acc = 0.908
epoch = 08

In [38]:
y_pred = [] 

for inputs in test_dataloader: 
  inputs = inputs.to(device)
  with torch.no_grad(): 
    pred = crf_model(inputs)
    y_pred.append(pred.cpu().detach())
  
y_pred = torch.cat(y_pred).numpy() 
y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
evaluation_report(y_test, y_pred) 

Unnamed: 0,tag,precision,recall,f_score,correct_count
0,0,0.0,0.0,-,0.0
1,1,99.755169,99.511533,99.633202,3667.0
2,2,93.730178,93.161979,93.445215,7684.0
3,3,90.229212,95.79608,92.929351,16179.0
4,4,99.930081,99.543414,99.736373,12863.0
5,5,81.481481,98.507463,89.189189,66.0
6,6,98.253275,86.206897,91.836735,450.0
7,7,97.621359,96.729197,97.17323,2011.0
8,8,63.943662,54.698795,58.961039,227.0
9,9,57.03125,59.51087,58.244681,219.0


In [39]:
torch.save(crf_model.state_dict(), "crf_model.pt")

### #TODO 2
We would like you create a neural CRF postagger model  with the pretrained word embedding as an input and the word embedding is trainable (not fixed). To finish this excercise you must train the model and show the evaluation report with this model as shown in the example.

Please note that the given pretrained word embedding only have weights for the vocabuary in BEST corpus.

Optionally, you can use your own pretrained word embedding.

<B> Hint: You can get the embedding from get_embeddings function from embeddings/emb_reader.py . </b>

(You may want to read about Trainable parameter)

In [48]:
# INSERT YOUR CODE HERE
from embeddings.emb_reader import get_embeddings
embed = get_embeddings()

In [49]:
pretrained_data = []

for word in word_to_idx:
    if word not in embed: 
        pretrained_data.append(embed['<UNK>'])
    else:
        pretrained_data.append(embed[word])

pretrained_data

[array([ 0.14022678,  0.09092573, -0.1790319 ,  0.10448399,  0.00299431,
         0.2409007 ,  0.1816121 , -0.04165503, -0.06544363,  0.05947479,
         0.04661421, -0.33226913, -0.06150463, -0.1317102 , -0.07574838,
         0.02249195,  0.07715735,  0.04643886, -0.17077078,  0.1182287 ,
        -0.04673801, -0.08885621, -0.00981161, -0.10771678,  0.02951091,
         0.09110227, -0.00879415,  0.06625252,  0.07654156,  0.09935185,
        -0.01619297, -0.08740353, -0.09052251,  0.21771173,  0.11034786,
         0.03606765,  0.23416947,  0.12849383, -0.18925685, -0.00396164,
         0.25828162, -0.07934788, -0.10694123,  0.2146667 , -0.12551983,
         0.17596188,  0.1115311 ,  0.1988364 ,  0.1605845 ,  0.12886152,
        -0.18652754, -0.11023001, -0.17567527, -0.01010482, -0.09603874,
         0.052766  , -0.11988053,  0.17181544, -0.03366903,  0.17951967,
         0.17481324,  0.01515466,  0.05193159, -0.17966549], dtype=float32),
 array([-0.035163  ,  0.05414555, -0.05874871, 

In [50]:
pretrained_data = np.asarray(pretrained_data)
pretrained_data.shape

(15019, 64)

In [52]:
class CRF_with_pretrained(nn.Module): 
  def __init__(self, word_to_idx):
    super(CRF_with_pretrained, self).__init__() 
    self.embed = Embedding(len(word_to_idx), 64, padding_idx=0)
    self.embed.weight.data.copy_(torch.from_numpy(pretrained_data))
    self.bi_gru = GRU(64, 64, bidirectional=True, batch_first=True) 
    self.classifier = nn.Linear(128,48)
    self.crf = CRF(48)
    
  def forward(self, x):
    x = self.embed(x) 
    x, _ = self.bi_gru(x) 
    out = self.classifier(x)
    return out

device = 'cuda' if torch.cuda.is_available() else 'cpu'

crf_pretrained_model = CRF_with_pretrained(word_to_idx) 
crf_pretrained_model.to(device) 

optimizer = Adam(crf_pretrained_model.parameters(), lr=1e-3) 
print(torchinfo.summary(crf_model))

num_epochs = 20

for epoch in range(1, num_epochs+1): 
  train_losses = [] 
  train_accs = [] 
  
  crf_pretrained_model.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)
    pred = crf_pretrained_model(inputs)

    loss = -crf_pretrained_model.crf(pred, targets)

    pred = pred.reshape(-1, pred.size(-1))
    targets = targets.reshape(-1)


    train_losses.append(loss.item())
    train_accs.append(categorical_accuracy(pred, targets)) 

    loss.backward() 
    optimizer.step() 

  crf_pretrained_model.eval() 
  y_pred = [] 
  for inputs in test_dataloader: 
    inputs = inputs.to(device)
    with torch.no_grad(): 
      pred = crf_pretrained_model(inputs)
      y_pred.append(pred.cpu().detach())
    
  y_pred = torch.cat(y_pred).numpy() 
  y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
  test_acc = evaluation_report(y_test, y_pred, get_only_acc=True)  

  print(f'epoch = {epoch:02d},\
    training loss = {sum(train_losses) / len(train_losses):.3f}, \
    training acc = {sum(train_accs) / len(train_accs):.3f}, \
    testing acc = {test_acc:.3f}')

Layer (type:depth-idx)                   Param #
BiGRU_with_CRF                           --
├─Embedding: 1-1                         480,608
├─GRU: 1-2                               12,672
├─Linear: 1-3                            3,120
├─CRF: 1-4                               2,400
Total params: 498,800
Trainable params: 498,800
Non-trainable params: 0
epoch = 01,    training loss = 3566.574,     training acc = 0.326,     testing acc = 0.623
epoch = 02,    training loss = 808.062,     training acc = 0.806,     testing acc = 0.884
epoch = 03,    training loss = 355.239,     training acc = 0.910,     testing acc = 0.910
epoch = 04,    training loss = 242.704,     training acc = 0.936,     testing acc = 0.923
epoch = 05,    training loss = 191.579,     training acc = 0.947,     testing acc = 0.929
epoch = 06,    training loss = 163.488,     training acc = 0.952,     testing acc = 0.931
epoch = 07,    training loss = 145.637,     training acc = 0.956,     testing acc = 0.933
epoch = 08,  

In [56]:
y_pred = []

for inputs in test_dataloader: 
  inputs = inputs.to(device)
  with torch.no_grad():
    pred = crf_pretrained_model(inputs)
    y_pred.append(pred.cpu().detach())
  
y_pred = torch.cat(y_pred).numpy() 
y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
evaluation_report(y_test, y_pred) 

Unnamed: 0,tag,precision,recall,f_score,correct_count
0,0,0.0,0.0,-,0.0
1,1,99.72863,99.72863,99.72863,3675.0
2,2,95.193625,92.689137,93.924688,7645.0
3,3,90.932304,94.884244,92.866249,16025.0
4,4,99.96115,99.558892,99.759615,12865.0
5,5,91.666667,98.507463,94.964029,66.0
6,6,99.58159,91.187739,95.2,476.0
7,7,97.48184,96.825397,97.15251,2013.0
8,8,79.016393,58.072289,66.944444,241.0
9,9,57.208238,67.934783,62.111801,250.0


### #TODO 3
Compare the result between all neural tagger models in 4.1.x and provide a convincing reason and example for the result of these models (which model perform better, why?)

(If you use your own weight please state so in the answer)

<b>Write your answer here :</b> แบบ pretrained ดีกว่า เนื่องจากมี ebedding layer ที่ถูก train มาแล้วจึงทำให้ไม่ต้องเริ่มใหม่ตั้งแต่ต้น จะเห็นได้ได้ว่า loss ของ pretrained จะน้อยกว่าอย่างเห็นได้ชัดตั้งแต่ epoc แรก

### TODO 4

Upon inference, the model also returns its transition matrix, which is learned during training. Your task is to observe and report whether the returned matrix is sensible. You can provide some examples to support your argument.

#### **Hint** : The transition matrix must have the shape  of (num_class, num_class).
##### **Write your answer here**

In [83]:
print('XVBB :', label_to_idx['XVBB'])
print('VACT :', label_to_idx['VACT'])
print('NLBL :', label_to_idx['NLBL'])

XVBB : 28
VACT : 2
NLBL : 34


In [91]:
# INSERT YOUR CODE HERE IF NEEDED
print(crf_pretrained_model.crf.transitions.max())
print('preverb -> actual verb      :', crf_pretrained_model.crf.transitions[28,2])
print('number label -> actual verb :', crf_pretrained_model.crf.transitions[34,2])

tensor(0.2976, device='cuda:0', grad_fn=<MaxBackward1>)
preverb -> actual verb      : tensor(-0.0057, device='cuda:0', grad_fn=<SelectBackward0>)
number label -> actual verb : tensor(-0.0514, device='cuda:0', grad_fn=<SelectBackward0>)


สังเกตว่าจาก preverb ไป actual verb ได้คะแนนสูงว่า จาก number ไป actual verb ซึ่ง make sense