# HW 5 - Neural POS Tagger

In this exercise, you will build a set of deep learning models for part-of-speech (POS) tagging using PyTorch.

To complete this exercise, you will need to build deep learning models for POS tagging in Thai using NECTEC's ORCHID corpus. You will build one model for each of the following types:

- Neural POS Tagging (without pretrained weights)
- Neural POS Tagging with CRF (with and without pretrained weights)

Pretrained word embeddings are already provided for you to use (albeit very poor ones).

We also provide the code for data cleaning and preprocessing, plus some PyTorch starter code, in this notebook, but feel free to modify those parts to suit your needs. You may also use additional libraries (e.g. scikit-learn) as long as you have a model for each type mentioned above.

### Don't forget to change the hardware accelerator to GPU in the runtime settings on Google Colab ###

## 1. Setup and Preprocessing

We use POS data from the [ORCHID corpus](https://www.researchgate.net/profile/Virach-Sornlertlamvanich/publication/2630580_Building_a_Thai_part-of-speech_tagged_corpus_ORCHID/links/02e7e514db19a98619000000/Building-a-Thai-part-of-speech-tagged-corpus-ORCHID.pdf), a POS-tagged corpus for the Thai language.
A function that reads the corpus into a list of sentences made of (word, POS) pairs has already been implemented. Example usage is shown below.
We also create a random word vector for unknown words.

In [2]:
!wget https://www.dropbox.com/s/tuvrbsby4a5axe0/resources.zip
!unzip resources.zip


In [1]:
!pip install pytorch-crf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch-crf
  Downloading pytorch_crf-0.7.2-py3-none-any.whl (9.5 kB)
Installing collected packages: pytorch-crf
Successfully installed pytorch-crf-0.7.2


In [3]:
from data.orchid_corpus import get_sentences
import numpy as np
import numpy.random
np.random.seed(42)

In [4]:
yunk_emb =np.random.randn(32)
train_data = get_sentences('train')
test_data = get_sentences('test')
print(train_data[0])

[('การ', 'FIXN'), ('ประชุม', 'VACT'), ('ทาง', 'NCMN'), ('วิชาการ', 'NCMN'), ('<space>', 'PUNC'), ('ครั้ง', 'CFQC'), ('ที่ 1', 'DONM')]


Next, we load the pretrained embeddings using pickle. They are stored as a dictionary that maps each word to its embedding.

In [5]:
import pickle
# A context manager closes the file even if loading fails
with open('basic_ff_embedding.pt', 'rb') as fp:
    embeddings = pickle.load(fp)

The code below builds an indexed dataset (each word is represented by a number) for the training and test data. Index 0 is reserved for padding, which helps with variable-length sequences. (You can read more about padding here: https://suzyahyah.github.io/pytorch/2019/07/01/DataLoader-Pad-Pack-Sequence.html)
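As a small illustration of the indexing and padding idea described above, here is a minimal pure-Python sketch. The marker names `<PAD>` and `<UNK>` and the helper names `build_vocab` and `encode_and_pad` are hypothetical and are not part of the provided starter code. The sketch reserves index 0 for padding and gives unknown words their own index.

```python
PAD, UNK = "<PAD>", "<UNK>"  # hypothetical marker names, chosen for this sketch


def build_vocab(sentences):
    """Map every word to a unique integer; 0 is padding, 1 is the unknown word."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in sentence:
            # setdefault only assigns a new index the first time a word is seen
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_and_pad(sentences, vocab):
    """Convert words to indices and post-pad every sentence to the longest length."""
    max_len = max(len(sentence) for sentence in sentences)
    batch = []
    for sentence in sentences:
        ids = [vocab.get(word, vocab[UNK]) for word in sentence]
        batch.append(ids + [vocab[PAD]] * (max_len - len(ids)))
    return batch


train_sentences = [["a", "b", "c"], ["b", "d"]]
vocab = build_vocab(train_sentences)
# "zzz" was never seen during vocabulary construction, so it maps to UNK
print(encode_and_pad(train_sentences + [["zzz"]], vocab))
```

Because the two special indices are assigned before any word, no real word can share an index with the padding or unknown-word entries.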

## 2. Prepare Data

In [6]:
word_to_idx ={}
idx_to_word ={}
label_to_idx = {}
for sentence in train_data:
    for word,pos in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)+1
            idx_to_word[word_to_idx[word]] = word
        if pos not in label_to_idx:
            label_to_idx[pos] = len(label_to_idx)+1
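# Note: word indices start at 1, so len(word_to_idx) equals the index of the
# last word added above; 'UNK' therefore shares that word's index.
word_to_idx['UNK'] = len(word_to_idx)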

n_classes = len(label_to_idx.keys())+1

This section is tweaked slightly from the demo: word2features returns a word index instead of features, and sent2labels returns the sequence of label indices for the sentence.

In [7]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

def sent2labels(sent):
    return numpy.asarray([label_to_idx[label] for (word, label) in sent],dtype='int32')

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [8]:
sent2features(train_data[100], embeddings)

array([ 29, 327,   5, 328])

Next, we create the train and test datasets, then use PyTorch to post-pad each sequence with 0 up to the maximum sequence length. The labels are kept as sequences of integer label indices.

In [9]:
%%time
# Keep the sentences as plain lists of index arrays: they have different lengths,
# so they cannot be stacked into one rectangular NumPy array.
x_train = [sent2features(sent, embeddings) for sent in train_data]
y_train = [sent2labels(sent) for sent in train_data]

x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

CPU times: user 228 ms, sys: 4.39 ms, total: 232 ms
Wall time: 232 ms




In [10]:
import torch 
from torch.nn.utils.rnn import pad_sequence

x_train = [torch.LongTensor(sentence) for sentence in x_train]
y_train = [torch.LongTensor(sentence) for sentence in y_train] 
x_test = [torch.LongTensor(sentence) for sentence in x_test]

x_train = pad_sequence(x_train, batch_first=True)
y_train = pad_sequence(y_train, batch_first=True)
x_test = pad_sequence(x_test, batch_first=True)

maxlen = x_train.size(1)  

# Pad the sequence length of x_test to be maxlen 
remaining_len = x_train.size(1) - x_test.size(1)
remaining_mat = torch.zeros((x_test.size(0), remaining_len), dtype=torch.long) 
x_test = torch.cat((x_test, remaining_mat), dim=1) 


In [11]:
import torch 
from torch.utils.data import Dataset, DataLoader 

class POSTaggerDataset(Dataset): 
  def __init__(self, data, labels=None):  
    self.data = data 
    self.labels = labels

    if labels is not None: 
      assert len(data) == len(labels)  

  def __getitem__(self, idx):
    if self.labels is None: 
      return torch.LongTensor(self.data[idx])
    else: 
      return (
          torch.LongTensor(self.data[idx]), 
          torch.LongTensor(self.labels[idx])
      )

  def __len__(self):
    return len(self.data)

train_dataset = POSTaggerDataset(x_train, y_train) 
test_dataset = POSTaggerDataset(x_test)

batch_size = 64

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [12]:
out = next(iter(train_dataloader)) 
out[0].size(), out[1].size()

(torch.Size([64, 102]), torch.Size([64, 102]))

## 3. Evaluate

The output from PyTorch contains, for each token, a score for every possible label. outputToLabel returns the index of the highest score at each position of the output sequence.
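To make the argmax step concrete, here is a minimal pure-Python sketch of the same idea. The function name `scores_to_labels` and the toy scores are assumptions for illustration; the notebook's own `outputToLabel` uses NumPy.

```python
def scores_to_labels(score_rows, seq_len):
    """Return the index of the highest score for the first seq_len positions."""
    labels = []
    for row in score_rows[:seq_len]:
        # index of the maximum value; ties resolve to the first maximum
        labels.append(max(range(len(row)), key=row.__getitem__))
    return labels


# Four positions with three candidate labels each; the last position is padding
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.3, 0.3, 0.4],
    [1.0, 0.0, 0.0],
]
print(scores_to_labels(toy_scores, seq_len=3))
```

Truncating to `seq_len` drops the predictions made on padded positions, which is why the true sentence length is passed alongside the scores.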

evaluation_report is the same as in the demo. 

**Hint** 
1. ```categorical_accuracy``` is for evaluating training set.
2. ```evaluation_report(y_true, y_pred, get_only_acc=True)``` is for evaluating test set since ```y_train``` and ```y_test``` are in different formats.
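The idea behind an accuracy that ignores padding can be sketched in plain Python as follows. This is an illustrative stand-in with a hypothetical name (`masked_accuracy`), not the tensor-based `categorical_accuracy` used in this notebook.

```python
def masked_accuracy(pred_ids, gold_ids, pad_idx=0):
    """Token accuracy over positions whose gold label is not the padding index."""
    correct = 0
    total = 0
    for pred_row, gold_row in zip(pred_ids, gold_ids):
        for pred, gold in zip(pred_row, gold_row):
            if gold == pad_idx:
                continue  # padded position: not a real token
            total += 1
            correct += int(pred == gold)
    return correct / total if total else 0.0


pred = [[1, 2, 3], [2, 2, 0]]
gold = [[1, 2, 0], [2, 3, 0]]
print(masked_accuracy(pred, gold))
```

Only four positions carry a real gold label here, so the two padded positions do not affect the score.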

In [13]:
def outputToLabel(yt,seq_len):
    out = []
    for i in range(0,len(yt)):
        if(i==seq_len):
            break
        out.append(np.argmax(yt[i]))
    return out

In [14]:
import pandas as pd
from IPython.display import display

# for validation part 
def categorical_accuracy(preds, y, tag_pad_idx=0):
  if len(preds.shape) == 2: 
    preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements]) 
  else: 
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = preds[non_pad_elements].eq(y[non_pad_elements]) 
  return correct.sum() / y[non_pad_elements].shape[0]

def evaluation_report(y_true, y_pred, get_only_acc=False):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    for sent in y_pred:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))
    
    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count)

    # get only accuracy for testing 
    if get_only_acc: 
      return accuracy 

    accuracy *= 100 
 
            
    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if (tag_info[tag]['y_true'] > 0) else 0
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else '-'
        
        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f_score': ''})
    
    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f_score', 'correct_count']]
  
    display(df)
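The per-tag precision, recall, and F-score computed by the report can be summarised in a compact sketch. The function name `per_tag_prf` is hypothetical, and unlike `evaluation_report` this version returns fractions between 0 and 1 and uses 0.0 instead of '-' for undefined values.

```python
from collections import Counter


def per_tag_prf(y_true, y_pred):
    """Return {tag: (precision, recall, f1)} from parallel lists of tag sequences."""
    true_positive, n_true, n_pred = Counter(), Counter(), Counter()
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            n_true[tag_true] += 1
            n_pred[tag_pred] += 1
            if tag_true == tag_pred:
                true_positive[tag_true] += 1

    report = {}
    for tag in sorted(set(n_true) | set(n_pred)):
        precision = true_positive[tag] / n_pred[tag] if n_pred[tag] else 0.0
        recall = true_positive[tag] / n_true[tag] if n_true[tag] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[tag] = (precision, recall, f1)
    return report


print(per_tag_prf([[1, 2, 2]], [[1, 2, 1]]))
```

Precision divides the correct predictions of a tag by how often the tag was predicted, while recall divides them by how often the tag actually occurs.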


## 4. Train a model

The models in this section are separated into two groups:

- Neural POS Tagger (4.1)
- Neural CRF POS Tagger (4.2)

## 4.1 Neural POS Tagger  (Example)

We create a simple neural POS tagger as an example for you. This model doesn't use any pretrained word embeddings, so it needs an Embedding layer to learn the word embeddings from scratch.

In [15]:
!pip install torchinfo 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchinfo
  Downloading torchinfo-1.7.2-py3-none-any.whl (22 kB)
Installing collected packages: torchinfo
Successfully installed torchinfo-1.7.2


In [None]:
import torchinfo 
from torch import nn 
from torch.nn import Embedding, Dropout, GRU, LSTM, Linear, CrossEntropyLoss 
from torch.optim import Adam


class BiGRU(nn.Module): 
  def __init__(self, word_to_idx):
    super(BiGRU, self).__init__() 
    self.embed = Embedding(len(word_to_idx), 32, padding_idx=0)
    self.bi_gru = GRU(32, 32, bidirectional=True, batch_first=True) 
    self.dropout = Dropout(0.2) 
    self.classifier = Linear(64, 48) 
    
  def forward(self, x):
    x = self.embed(x) 
    x, _ = self.bi_gru(x) 
    x = self.dropout(x) 
    out = self.classifier(x)
    return out

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = BiGRU(word_to_idx) 
model.to(device) 

optimizer = Adam(model.parameters(), lr=1e-3) 
criterion = CrossEntropyLoss(ignore_index=0)
print(torchinfo.summary(model))

num_epochs = 10 

for epoch in range(1, num_epochs+1): 
  train_losses = [] 
  train_accs = [] 
  
  model.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)

    pred = model(inputs)

    pred = pred.reshape(-1, pred.size(-1))
    targets = targets.reshape(-1)
    
    loss = criterion(pred, targets) 

    train_losses.append(loss.item())
    train_accs.append(categorical_accuracy(pred, targets)) 

    loss.backward() 
    optimizer.step() 

  model.eval() 
  y_pred = [] 
  for inputs in test_dataloader: 
    inputs = inputs.to(device)
    with torch.no_grad(): 
      pred = model(inputs)
      y_pred.append(pred.cpu().detach())
    
  y_pred = torch.cat(y_pred).numpy() 
  y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
  test_acc = evaluation_report(y_test, y_pred, get_only_acc=True)  

  print(f'epoch = {epoch:02d},\
    training loss = {sum(train_losses) / len(train_losses):.3f}, \
    training acc = {sum(train_accs) / len(train_accs):.3f}, \
    testing acc = {test_acc:.3f}')  

Layer (type:depth-idx)                   Param #
BiGRU                                    --
├─Embedding: 1-1                         480,608
├─GRU: 1-2                               12,672
├─Dropout: 1-3                           --
├─Linear: 1-4                            3,120
Total params: 496,400
Trainable params: 496,400
Non-trainable params: 0
epoch = 01,    training loss = 1.788,     training acc = 0.572,     testing acc = 0.750
epoch = 02,    training loss = 0.803,     training acc = 0.801,     testing acc = 0.826
epoch = 03,    training loss = 0.576,     training acc = 0.855,     testing acc = 0.852
epoch = 04,    training loss = 0.467,     training acc = 0.879,     testing acc = 0.869
epoch = 05,    training loss = 0.397,     training acc = 0.896,     testing acc = 0.881
epoch = 06,    training loss = 0.350,     training acc = 0.907,     testing acc = 0.890
epoch = 07,    training loss = 0.317,     training acc = 0.915,     testing acc = 0.895
epoch = 08,    training loss = 
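The parameter counts in the summary above can be reproduced by hand. The sketch below is a worked calculation under two assumptions: the vocabulary size is 15019 (which is what 480,608 / 32 gives), and the GRU follows PyTorch's layout of three gates per direction, each with an input weight matrix, a hidden weight matrix, and two bias vectors. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def gru_params(input_size, hidden_size, bidirectional=True):
    # 3 gates, each with input weights, hidden weights and two bias vectors
    per_direction = 3 * (
        input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size
    )
    return per_direction * (2 if bidirectional else 1)


def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights + bias


vocab_size = 15019  # assumption: inferred from 480,608 / 32
counts = {
    "Embedding": embedding_params(vocab_size, 32),
    "GRU": gru_params(32, 32),
    "Linear": linear_params(64, 48),
}
print(counts, sum(counts.values()))
```

The Linear layer takes 64 inputs because the bidirectional GRU concatenates the 32-dimensional forward and backward hidden states.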

In [None]:
y_pred = [] 

for inputs in test_dataloader: 
  inputs = inputs.to(device)
  with torch.no_grad(): 
    pred = model(inputs)
    y_pred.append(pred.cpu().detach())
  
y_pred = torch.cat(y_pred).numpy() 
y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
evaluation_report(y_test, y_pred) 


| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 1 | 99.727965 | 99.484396 | 99.606032 | 3666.0 |
| 2 | 93.503169 | 93.004365 | 93.2531 | 7671.0 |
| 3 | 91.090611 | 88.868494 | 89.965833 | 15009.0 |
| 4 | 99.859889 | 99.280297 | 99.56925 | 12829.0 |
| 5 | 95.238095 | 89.552239 | 92.307692 | 60.0 |
| 6 | 96.969697 | 85.823755 | 91.056911 | 448.0 |
| 7 | 98.007775 | 97.017797 | 97.510273 | 2017.0 |
| 8 | 24.618736 | 54.457831 | 33.908477 | 226.0 |
| 9 | 55.232558 | 51.630435 | 53.370787 | 190.0 |
| 10 | 51.211073 | 35.280095 | 41.778405 | 296.0 |


## 4.2 CRF Viterbi

Your next task is to incorporate conditional random fields (CRF) into your model. <b>You do not need to use pretrained weights</b>.

To use the CRF layer, you need an extension library for PyTorch called pytorch-crf (imported as `torchcrf`). If you want to see a detailed implementation, read the official PyTorch CRF tutorial (https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html).

pytorch-crf link: https://github.com/kmkurn/pytorch-crf

For inference, look at the method call in crf.py and check its input/output arguments.

link for documentation: https://pytorch-crf.readthedocs.io/en/stable/

link for source code : https://github.com/kmkurn/pytorch-crf/blob/master/torchcrf/__init__.py
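To show what Viterbi decoding does with a transition matrix, here is a small pure-Python sketch. It is a simplified illustration, not the pytorch-crf implementation: it omits start and end transition scores and works on a single unpadded sentence. The convention `transitions[i][j]` = score of moving from tag `i` to tag `j` is assumed.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_tag_sequence, best_score) for one sentence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            candidates = [score[i] + transitions[i][j] for i in range(n_tags)]
            best_prev = max(range(n_tags), key=candidates.__getitem__)
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path
    last = max(range(n_tags), key=score.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


emissions = [[2.0, 1.0], [1.0, 1.2]]
transitions = [[0.0, -2.0], [0.0, 0.0]]  # moving from tag 0 to tag 1 is penalised
print(viterbi_decode(emissions, transitions))
```

In this toy input, a per-token argmax over the emissions would pick tag 1 at the second position, but the penalised 0-to-1 transition makes the path that stays on tag 0 score higher overall.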




### 4.2.1 CRF without pretrained weight
### #TODO 1
Incorporate a CRF layer into your model from 4.1. A CRF is more complex than the previous example model, so you should train it for more epochs to let it converge.

To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

Do not forget to save the model weights. (Refer to https://pytorch.org/tutorials/beginner/saving_loading_models.html)
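For reference, the loss of a linear-chain CRF is the negative log-likelihood of the gold tag sequence: the log of the sum over all possible tag paths (the partition function) minus the score of the gold path. The sketch below is a simplified pure-Python version for one unpadded sentence, without the start and end transition scores that pytorch-crf also learns. All function names are hypothetical.

```python
import math
from itertools import product


def log_sum_exp(values):
    m = max(values)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions, transitions, tags):
    """Sum of emission scores along the path plus the transitions between tags."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores of all tag paths."""
    n_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    return log_sum_exp(alpha)


def crf_nll(emissions, transitions, tags):
    return log_partition(emissions, transitions) - path_score(emissions, transitions, tags)


def brute_force_log_partition(emissions, transitions):
    """Enumerate every path; only feasible for tiny inputs, used as a check."""
    n_tags = len(transitions)
    scores = [
        path_score(emissions, transitions, tags)
        for tags in product(range(n_tags), repeat=len(emissions))
    ]
    return log_sum_exp(scores)


emissions = [[1.0, 0.5], [0.2, 0.9], [0.3, 0.1]]
transitions = [[0.1, -0.4], [0.6, 0.0]]
print(crf_nll(emissions, transitions, [0, 1, 0]))
```

pytorch-crf's `CRF.forward` returns the log-likelihood, summed over the batch by default, which is why it is negated to obtain a loss and why the reported values are large. Both `forward` and `decode` also accept an optional `mask` argument for excluding padded positions.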

In [None]:
# INSERT YOUR CODE HERE

import torchinfo 
from torch import nn 
from torch.nn import Embedding, Dropout, GRU, LSTM, Linear, CrossEntropyLoss 
from torch.optim import Adam
from torchcrf import CRF

class BiGRU_CRF(nn.Module): 
  def __init__(self, word_to_idx):
    super(BiGRU_CRF, self).__init__() 
    self.embed = Embedding(len(word_to_idx), 32, padding_idx=0)
    self.bi_gru = GRU(32, 32, bidirectional=True, batch_first=True) 
    self.classifier = Linear(64, 48)
    
  def forward(self, x):
    x = self.embed(x) 
    x, _ = self.bi_gru(x)
    out = self.classifier(x)
    return out

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_BiGRU_CRF = BiGRU_CRF(word_to_idx).to(device)

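# Note: the optimizer receives only the BiGRU parameters. The CRF layer's own
# parameters (criterion_CRF.parameters()) are not included, so they keep their initial values.
optimizer = Adam(model_BiGRU_CRF.parameters(), lr=1e-3)
criterion_CRF = CRF(48, batch_first=True).to(device)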
print(torchinfo.summary(model_BiGRU_CRF))

Layer (type:depth-idx)                   Param #
BiGRU_CRF                                --
├─Embedding: 1-1                         480,608
├─GRU: 1-2                               12,672
├─Linear: 1-3                            3,120
Total params: 496,400
Trainable params: 496,400
Non-trainable params: 0


In [None]:
num_epochs = 20

for epoch in range(1, num_epochs+1): 
  train_losses = [] 
  train_accs = [] 
  
  model_BiGRU_CRF.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)
    pred = model_BiGRU_CRF(inputs)

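    # The CRF returns the log-likelihood (summed over the batch by default), so negate it.
    # No mask is passed, so padded positions (tag 0) are scored like real tokens.
    loss = -criterion_CRF(pred, targets)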

    pred = pred.reshape(-1, pred.size(-1))
    targets = targets.reshape(-1)

    train_losses.append(loss.item())
    train_accs.append(categorical_accuracy(pred, targets)) 

    loss.backward() 
    optimizer.step() 

  model_BiGRU_CRF.eval() 
  y_pred = [] 
  for inputs in test_dataloader: 
    inputs = inputs.to(device)
    with torch.no_grad(): 
      pred = model_BiGRU_CRF(inputs)
      y_pred.append(pred.cpu().detach())
    
  y_pred = torch.cat(y_pred).numpy() 
  y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
  test_acc = evaluation_report(y_test, y_pred, get_only_acc=True)  

  print(f'epoch = {epoch:02d},\
    training loss = {sum(train_losses) / len(train_losses):.3f}, \
    training acc = {sum(train_accs) / len(train_accs):.3f}, \
    testing acc = {test_acc:.3f}')  

epoch = 01,    training loss = 4881.409,     training acc = 0.445,     testing acc = 0.656
epoch = 02,    training loss = 1008.821,     training acc = 0.748,     testing acc = 0.791
epoch = 03,    training loss = 650.311,     training acc = 0.834,     testing acc = 0.843
epoch = 04,    training loss = 488.887,     training acc = 0.874,     testing acc = 0.868
epoch = 05,    training loss = 396.659,     training acc = 0.896,     testing acc = 0.884
epoch = 06,    training loss = 336.953,     training acc = 0.909,     testing acc = 0.897
epoch = 07,    training loss = 295.740,     training acc = 0.919,     testing acc = 0.904
epoch = 08,    training loss = 264.625,     training acc = 0.926,     testing acc = 0.909
epoch = 09,    training loss = 240.027,     training acc = 0.933,     testing acc = 0.913
epoch = 10,    training loss = 219.973,     training acc = 0.937,     testing acc = 0.916
epoch = 11,    training loss = 203.318,     training acc = 0.942,     testing acc = 0.919
epoch = 

In [None]:
torch.save(model_BiGRU_CRF.state_dict(), "model_BiGRU_CRF.pt")

In [None]:
model_BiGRU_CRF = BiGRU_CRF(word_to_idx).to(device)
model_BiGRU_CRF.load_state_dict(torch.load("model_BiGRU_CRF.pt"))

<All keys matched successfully>

In [None]:
y_pred = [] 

for inputs in test_dataloader: 
  inputs = inputs.to(device)
  with torch.no_grad(): 
    pred = model_BiGRU_CRF(inputs)
    y_pred.append(pred.cpu().detach())
  
y_pred = torch.cat(y_pred).numpy() 
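# Labels are taken as the per-token argmax of the emission scores;
# the CRF transition scores are not used at this step.
y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]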
evaluation_report(y_test, y_pred) 



| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | - | 0.0 |
| 1 | 99.891097 | 99.565807 | 99.728187 | 3669.0 |
| 2 | 94.803922 | 93.792435 | 94.295466 | 7736.0 |
| 3 | 90.911647 | 95.713186 | 93.250649 | 16165.0 |
| 4 | 99.891304 | 99.566631 | 99.728703 | 12866.0 |
| 5 | 95.652174 | 98.507463 | 97.058824 | 66.0 |
| 6 | 97.198276 | 86.398467 | 91.48073 | 451.0 |
| 7 | 96.380952 | 97.354497 | 96.865279 | 2024.0 |
| 8 | 65.449438 | 56.144578 | 60.440986 | 233.0 |
| 9 | 67.801858 | 59.51087 | 63.386397 | 219.0 |


### #TODO 2
We would like you to create a neural CRF POS tagger model that takes the pretrained word embeddings as input, with the embeddings trainable (not fixed). To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

Please note that the given pretrained word embeddings only have weights for the vocabulary in the BEST corpus.

Optionally, you can use your own pretrained word embeddings.

<b>Hint: You can get the embeddings with the get_embeddings function from embeddings/emb_reader.py.</b>

(You may want to read about Trainable parameter)
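Aligning a pretrained embedding dictionary with the notebook's word indices can be sketched in plain Python as below. The function name `build_embedding_matrix`, the toy vectors, and the explicit `pad_vector` and `unk_vector` arguments are assumptions for illustration. Row `i` of the result is the vector for word index `i`, assuming the indices run contiguously from 1.

```python
def build_embedding_matrix(idx_to_word, pretrained, pad_vector, unk_vector):
    """Return a list of rows where row i holds the vector for word index i."""
    rows = [list(pad_vector)]  # index 0 is reserved for padding
    for idx in sorted(idx_to_word):
        word = idx_to_word[idx]
        # Words missing from the pretrained vocabulary fall back to the UNK vector
        rows.append(list(pretrained.get(word, unk_vector)))
    return rows


toy_pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}  # toy values, not real weights
toy_idx_to_word = {1: "dog", 2: "bird", 3: "cat"}
matrix = build_embedding_matrix(
    toy_idx_to_word, toy_pretrained, pad_vector=[0.0, 0.0], unk_vector=[9.0, 9.0]
)
print(matrix)
```

In PyTorch, a matrix like this can be converted to a float tensor and loaded with `nn.Embedding.from_pretrained(weights, freeze=False)`. With `freeze=False` the vectors are updated during training, whereas the default `freeze=True` keeps them fixed.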

In [19]:
from embeddings import emb_reader # load emb_reader.py from PATH
pretrained_embeddings = emb_reader.get_embeddings()

In [20]:
pretrained_embeddings_weight = [pretrained_embeddings['PAD']] + [ pretrained_embeddings[idx_to_word[idx]]
                                if idx_to_word[idx] in pretrained_embeddings
                                else pretrained_embeddings['<UNK>']
                                for idx in idx_to_word
                                ]
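# Stack the rows into one NumPy array first; building a tensor from a list of arrays is slow
pretrained_embeddings_weight = torch.from_numpy(np.asarray(pretrained_embeddings_weight, dtype=np.float32))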
print(pretrained_embeddings_weight.size())

torch.Size([15019, 64])




In [21]:
# INSERT YOUR CODE HERE

import torchinfo 
from torch import nn 
from torch.nn import Embedding, Dropout, GRU, LSTM, Linear, CrossEntropyLoss 
from torch.optim import Adam
from torchcrf import CRF

class BiGRU_CRF_pretrained(nn.Module): 
  def __init__(self, word_to_idx):
    super(BiGRU_CRF_pretrained, self).__init__() 
    # freeze=False keeps the pretrained vectors trainable
    self.embed = Embedding.from_pretrained(pretrained_embeddings_weight, freeze=False, padding_idx=0)
    self.bi_gru = GRU(64, 32, bidirectional=True, batch_first=True) 
    self.classifier = Linear(64, 48)
    
  def forward(self, x):
    x = self.embed(x) 
    x, _ = self.bi_gru(x)
    out = self.classifier(x)
    return out

In [22]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_BiGRU_CRF_pretrained = BiGRU_CRF_pretrained(word_to_idx).to(device)

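# Note: the optimizer receives only the BiGRU parameters, not those of the CRF layer created below.
optimizer = Adam(model_BiGRU_CRF_pretrained.parameters(), lr=1e-3)
criterion_CRF_pretrained = CRF(48, batch_first=True).to(device)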
print(torchinfo.summary(model_BiGRU_CRF_pretrained))

Layer (type:depth-idx)                   Param #
BiGRU_CRF_pretrained                     --
├─Embedding: 1-1                         961,216
├─GRU: 1-2                               18,816
├─Linear: 1-3                            3,120
Total params: 983,152
Trainable params: 983,152
Non-trainable params: 0


In [None]:
num_epochs = 10 

for epoch in range(1, num_epochs+1): 
  train_losses = [] 
  train_accs = [] 
  
  model_BiGRU_CRF_pretrained.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)
    pred = model_BiGRU_CRF_pretrained(inputs)

    loss = -criterion_CRF_pretrained(pred, targets)

    pred = pred.reshape(-1, pred.size(-1))
    targets = targets.reshape(-1)

    train_losses.append(loss.item())
    train_accs.append(categorical_accuracy(pred, targets)) 

    loss.backward() 
    optimizer.step() 

  model_BiGRU_CRF_pretrained.eval() 
  y_pred = [] 
  for inputs in test_dataloader: 
    inputs = inputs.to(device)
    with torch.no_grad(): 
      pred = model_BiGRU_CRF_pretrained(inputs)
      y_pred.append(pred.cpu().detach())
    
  y_pred = torch.cat(y_pred).numpy() 
  y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
  test_acc = evaluation_report(y_test, y_pred, get_only_acc=True)  

  print(f'epoch = {epoch:02d},\
    training loss = {sum(train_losses) / len(train_losses):.3f}, \
    training acc = {sum(train_accs) / len(train_accs):.3f}, \
    testing acc = {test_acc:.3f}')  

epoch = 01,    training loss = 4562.286,     training acc = 0.318,     testing acc = 0.672
epoch = 02,    training loss = 884.104,     training acc = 0.800,     testing acc = 0.857
epoch = 03,    training loss = 443.194,     training acc = 0.895,     testing acc = 0.897
epoch = 04,    training loss = 288.577,     training acc = 0.927,     testing acc = 0.916
epoch = 05,    training loss = 218.866,     training acc = 0.942,     testing acc = 0.922
epoch = 06,    training loss = 182.505,     training acc = 0.949,     testing acc = 0.925
epoch = 07,    training loss = 160.808,     training acc = 0.953,     testing acc = 0.927
epoch = 08,    training loss = 145.958,     training acc = 0.956,     testing acc = 0.928
epoch = 09,    training loss = 134.480,     training acc = 0.959,     testing acc = 0.931
epoch = 10,    training loss = 125.285,     training acc = 0.961,     testing acc = 0.931


In [None]:
torch.save(model_BiGRU_CRF_pretrained.state_dict(), "model_BiGRU_CRF_pretrained.pt")

In [23]:
model_BiGRU_CRF_pretrained = BiGRU_CRF_pretrained(word_to_idx).to(device)
model_BiGRU_CRF_pretrained.load_state_dict(torch.load("model_BiGRU_CRF_pretrained.pt"))

<All keys matched successfully>

In [None]:
y_pred = [] 

for inputs in test_dataloader: 
  inputs = inputs.to(device)
  with torch.no_grad(): 
    pred = model_BiGRU_CRF_pretrained(inputs)
    y_pred.append(pred.cpu().detach())
  
y_pred = torch.cat(y_pred).numpy() 
y_pred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))] 
evaluation_report(y_test, y_pred) 


| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | - | 0.0 |
| 1 | 99.891127 | 99.592944 | 99.741813 | 3670.0 |
| 2 | 94.626426 | 93.513579 | 94.066711 | 7713.0 |
| 3 | 90.741679 | 95.405293 | 93.015067 | 16113.0 |
| 4 | 99.930151 | 99.644018 | 99.78688 | 12876.0 |
| 5 | 95.652174 | 98.507463 | 97.058824 | 66.0 |
| 6 | 100.0 | 87.547893 | 93.360572 | 457.0 |
| 7 | 97.12506 | 97.498797 | 97.31157 | 2027.0 |
| 8 | 73.381295 | 49.156627 | 58.874459 | 204.0 |
| 9 | 75.153374 | 66.576087 | 70.605187 | 245.0 |


### #TODO 3
Compare the results of all neural tagger models in 4.1.x and provide a convincing reason and example for the results (which model performs better, and why?).

(If you use your own weight please state so in the answer)

<b>Write your answer here :</b> <br>
The accuracy of the first model (BiGRU with CrossEntropyLoss) is 90.49. <br>
The accuracy of the second model (BiGRU with CRF) is 92.83. <br>
The accuracy of the third model (BiGRU with pretrained embedding and CRF) is 93.08. <br>
The second model scores higher because using a CRF lets the model see more of the relationships between neighbouring words, and the third model scores highest because the pretrained embedding layer lets the model understand the meaning of each token better than the two previous models.


### #TODO 4

Upon inference, the model also returns its transition matrix, which is learned during training. Your task is to observe and report whether the returned matrix is sensible. You can provide some examples to support your argument.

#### **Hint** : The transition matrix must have the shape  of (num_class, num_class).
##### **Write your answer here** From the transition matrix, the score for "บรรลุ" (VSTA) followed by "วัตถุประสงค์" (NCMN) is 0.085, whereas the score for "วัตถุประสงค์" (NCMN) followed by "บรรลุ" (VSTA) is 0.059, which means that the transition matrix is sensible.
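One way to inspect a transition matrix more broadly than a single word pair is to rank all tag-to-tag scores and look at the highest and lowest ones. The sketch below uses a hypothetical helper `rank_transitions` and a toy 3x3 matrix with made-up values, not results from this notebook. Keep in mind that such scores are only informative if the CRF layer's parameters were actually updated by the optimizer during training.

```python
def rank_transitions(transitions, idx_to_label, k=3):
    """Return the k highest and k lowest (score, from_tag, to_tag) triples."""
    scored = []
    for i, row in enumerate(transitions):
        for j, score in enumerate(row):
            # Skip indices without a label name (e.g. the padding index 0)
            if i in idx_to_label and j in idx_to_label:
                scored.append((score, idx_to_label[i], idx_to_label[j]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k], scored[-k:]


toy_idx_to_label = {1: "NCMN", 2: "VACT"}
toy_transitions = [  # made-up values for illustration only
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 1.2],
    [0.0, -0.3, -1.0],
]
highest, lowest = rank_transitions(toy_transitions, toy_idx_to_label, k=1)
print("most favoured:", highest)
print("least favoured:", lowest)
```

With the real model, the matrix could be converted first with something like `criterion_CRF_pretrained.transitions.detach().cpu().tolist()` and combined with the `idx_to_label` dictionary built in the next cell.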

In [146]:
# INSERT YOUR CODE HERE IF NEEDED
idx_to_label = {v:k for k,v in label_to_idx.items()}
print('label_to_idx:\n', label_to_idx)
print()

print('transition matrix:\n', criterion_CRF_pretrained.transitions)
print()

label_to_idx:
 {'FIXN': 1, 'VACT': 2, 'NCMN': 3, 'PUNC': 4, 'CFQC': 5, 'DONM': 6, 'JCRG': 7, 'NCNM': 8, 'CNIT': 9, 'NPRP': 10, 'NTTL': 11, 'XVAM': 12, 'VSTA': 13, 'RPRE': 14, 'ADVN': 15, 'JSBR': 16, 'DDAC': 17, 'XVBM': 18, 'XVMM': 19, 'DIBQ': 20, 'PREL': 21, 'VATT': 22, 'XVAE': 23, 'DCNM': 24, 'CMTR': 25, 'FIXV': 26, 'PPRS': 27, 'XVBB': 28, 'DIAC': 29, 'PDMN': 30, 'DDAN': 31, 'CLTV': 32, 'ADVP': 33, 'NLBL': 34, 'ADVI': 35, 'CMTR@PUNC': 36, 'JCMP': 37, 'ADVS': 38, 'DDBQ': 39, 'NEG': 40, 'PNTR': 41, 'EITT': 42, 'DDAQ': 43, 'NONM': 44, 'EAFF': 45, 'DIAQ': 46, 'CVBL': 47}

transition matrix:
 Parameter containing:
tensor([[ 0.0996, -0.0638,  0.0298,  ..., -0.0597,  0.0878, -0.0530],
        [ 0.0620,  0.0842, -0.0516,  ...,  0.0478, -0.0706, -0.0515],
        [ 0.0333, -0.0524, -0.0739,  ..., -0.0048,  0.0614, -0.0261],
        ...,
        [-0.0615, -0.0587,  0.0685,  ..., -0.0978, -0.0429, -0.0337],
        [-0.0540, -0.0957,  0.0096,  ..., -0.0417,  0.0515,  0.0089],
        [ 0.0441, -

In [121]:
def w2t(w1, w2):
  return torch.LongTensor([ word_to_idx[w1], word_to_idx[w2] ])

def t2tags(tokens):
  model_BiGRU_CRF_pretrained.eval()
  with torch.no_grad(): 
    tag_ids = model_BiGRU_CRF_pretrained(tokens.to(device)).cpu().detach().argmax(axis=1).numpy()
    tags = [idx_to_label[t] for t in tag_ids]
  return tags, tag_ids

def llh(tag_id1, tag_id2):
  return criterion_CRF_pretrained.transitions[tag_id1,tag_id2].cpu().detach().item()

def get_POS_score(w1, w2):
  tokens = w2t(w1, w2)
  tags, tag_ids = t2tags(tokens)
  print(tags)
  l1, l2 = tag_ids
  return llh(l1, l2)

In [144]:
get_POS_score('บรรลุ', 'วัตถุประสงค์')

['VSTA', 'NCMN']


0.08487164974212646

In [145]:
get_POS_score('วัตถุประสงค์', 'บรรลุ')

['NCMN', 'VSTA']


0.05870155245065689