# POS tagging

During the labs we covered HMM, LSTM, and BERT models. The goal of this assignment is to evaluate and compare these models on the POS tagging task. The input is a line of text, and the output is a POS tag for every word (token) in that line.

- You should already have the code for the models from labs
- Use validation split to decide when to stop training
- Evaluate all models on test data
- You can use any PoS tagging dataset to train and test your models

Refer to:
- Lab 4 - HMM for Tagging
- Lab 10 - LSTM for Tagging
- Lab 5 - Hugging Face and BERT fine-tuning
- [Datasets](https://universaldependencies.org/)
- [Dataset from Labs 4 and 10](https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/train_pos.txt)


Grading:
- 30 points - HMM
- 30 points - BiLSTM
- 30 points - BERT (for masters only)
- 40 points - Evaluation and conclusions 


Remarks: 
- Use Python 3
- Max is 100 points for bachelors, 130 points for masters

# Hidden Markov Models for POS Tagging

You can use the Viterbi algorithm implementation from Lab 4.

In [1]:
!wget https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/train_pos.txt
!wget https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/test_pos.txt

--2023-04-23 15:10:52--  https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/train_pos.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1855828 (1.8M) [text/plain]
Saving to: ‘train_pos.txt.2’


2023-04-23 15:10:52 (31.5 MB/s) - ‘train_pos.txt.2’ saved [1855828/1855828]

--2023-04-23 15:10:52--  https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/test_pos.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 418682 (409K) [text/plain]
Saving to: ‘test_pos.txt.2’


2023-04-23 15:10:52 (11.

In [2]:
import numpy as np

def viterbi(y, A, B, Pi=None):
    """
    Return the MAP estimate of state trajectory of Hidden Markov Model.

    Parameters
    ----------
    y : array (T,)
        Observation state sequence. int dtype.
    A : array (K, K)
        State transition matrix. See HiddenMarkovModel.state_transition  for
        details.
    B : array (K, M)
        Emission matrix. See HiddenMarkovModel.emission for details.
    Pi: optional, (K,)
        Initial state probabilities: Pi[i] is the probability x[0] == i. If
        None, uniform initial distribution is assumed (Pi[:] == 1/K).

    Returns
    -------
    x : array (T,)
        Maximum a posteriori probability estimate of hidden state trajectory,
        conditioned on observation sequence y under the model parameters A, B,
        Pi.
    T1: array (K, T)
        the probability of the most likely path so far
    T2: array (K, T)
        the x_j-1 of the most likely path so far
    """
    # Cardinality of the state space
    K = A.shape[0]
    # Initialize the priors with default (uniform dist) if not given by caller
    Pi = Pi if Pi is not None else np.full(K, 1 / K)
    T = len(y)
    T1 = np.empty((K, T), 'd')
    T2 = np.empty((K, T), 'B')

    # Initialize the tracking tables from the first observation
    T1[:, 0] = Pi * B[:, y[0]]
    T2[:, 0] = 0

    # Iterate through the observations, updating the tracking tables
    for i in range(1, T):
        T1[:, i] = np.max(T1[:, i - 1] * A.T * B[np.newaxis, :, y[i]].T, 1)
        T2[:, i] = np.argmax(T1[:, i - 1] * A.T, 1)

    # Build the output, optimal model trajectory
    x = np.empty(T, 'B')
    x[-1] = np.argmax(T1[:, T - 1])
    for i in reversed(range(1, T)):
        x[i - 1] = T2[x[i], i]

    return x, T1, T2
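
The `viterbi` function above multiplies probabilities at every step, so the values in `T1` shrink as the sentence gets longer and can underflow for very long inputs. A common remedy is to run the same recursion on log-probabilities, where products become sums. The sketch below is an illustration under that assumption, not the Lab 4 implementation: the name `viterbi_log` and the toy matrices are made up for the example, it returns only the state path, and it stores back-pointers as 64-bit integers instead of the unsigned bytes (`'B'`) used above.

```python
import numpy as np

def viterbi_log(y, A, B, Pi):
    """Most likely state path, computed with log-probabilities (illustrative sketch)."""
    K = A.shape[0]
    T = len(y)
    # log(0) = -inf is a valid score here, so silence the divide warning
    with np.errstate(divide="ignore"):
        log_A, log_B, log_Pi = np.log(A), np.log(B), np.log(Pi)

    score = np.empty((K, T))
    back = np.zeros((K, T), dtype=np.int64)
    score[:, 0] = log_Pi + log_B[:, y[0]]

    for t in range(1, T):
        # cand[j, i] = best score ending in state i at t-1, then moving i -> j
        cand = score[:, t - 1][np.newaxis, :] + log_A.T
        back[:, t] = np.argmax(cand, axis=1)
        score[:, t] = cand.max(axis=1) + log_B[:, y[t]]

    # Follow the back-pointers from the best final state
    x = np.empty(T, dtype=np.int64)
    x[-1] = np.argmax(score[:, -1])
    for t in range(T - 1, 0, -1):
        x[t - 1] = back[x[t], t]
    return x

# Toy two-state model (hypothetical numbers, not estimated from the dataset)
toy_A = np.array([[0.7, 0.3], [0.4, 0.6]])
toy_B = np.array([[0.9, 0.1], [0.2, 0.8]])
toy_Pi = np.array([0.5, 0.5])
toy_path = viterbi_log([0, 0, 1, 1], toy_A, toy_B, toy_Pi)
print(toy_path.tolist())
```

Because the logarithm is monotonic, the path chosen this way should match the probability-space version whenever the latter does not underflow.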

In [3]:
def tokenize_with_pos(dataset):
    tokens_with_pos = []
    sentence = []

    for line in dataset:
        parts = line.split()
        if len(parts) == 2:
            word, tag = parts
            sentence.append((word.lower(), tag))
        elif not parts and sentence:
            # A blank line closes the current sentence
            tokens_with_pos.append(sentence)
            sentence = []
    # Keep the last sentence if the file does not end with a blank line
    if sentence:
        tokens_with_pos.append(sentence)
    return tokens_with_pos

# download the dataset https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/train_pos.txt
# read the dataset
with open("train_pos.txt", "r") as file:
    dataset = file.readlines()

# tokenize the dataset preserving the PoS tags information
tokenized_dataset = tokenize_with_pos(dataset)

In [4]:
from collections import Counter
from itertools import zip_longest
import numpy as np

# extract the following statistics A, B, Pi from the data:
# A - State transition matrix: A[i, j] is the probability of PoS tag j occurring right after PoS tag i (matrix N_tags x N_tags)
# B - Emission matrix: B[i, k] is the probability of word k being emitted given PoS tag i (matrix N_tags x N_words)
# Pi - Initial state probabilities: Pi[i] is the probability of tag i starting a sentence (vector N_tags)

# Count the occurrences of each tag
tag_counts = Counter()
for sentence in tokenized_dataset:
    for token in sentence:
        tag = token[1]
        tag_counts[tag] += 1

# Create a list of all possible tags
all_tags = list(tag_counts.keys())

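# Count the occurrences of each tag pair. zip_longest pads the final pair with the tag ".",
# so every sentence-final tag is counted as transitioning to "."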
tag_pair_counts = Counter()
for sentence in tokenized_dataset:
    for pair in zip_longest(sentence, sentence[1:], fillvalue=("", ".")):
        tag1, tag2 = pair[0][1], pair[1][1]
        tag_pair_counts[(tag1, tag2)] += 1

# Create the state transition matrix A
A = np.zeros((len(all_tags), len(all_tags)))
for i, tag1 in enumerate(all_tags):
    for j, tag2 in enumerate(all_tags):
        A[i, j] = tag_pair_counts[(tag1, tag2)] / tag_counts[tag1]

# Count the occurrences of each word for each tag
word_tag_counts = {tag: Counter() for tag in all_tags}
for sentence in tokenized_dataset:
    for token in sentence:
        word, tag = token
        word_tag_counts[tag][word] += 1

# Create the emission matrix B
all_words = set()
for tag in all_tags:
    all_words.update(word_tag_counts[tag].keys())
all_words = list(all_words)

B = np.zeros((len(all_tags), len(all_words)))
for i, tag in enumerate(all_tags):
    for j, word in enumerate(all_words):
        B[i, j] = word_tag_counts[tag][word] / tag_counts[tag]

# Create the initial state probabilities vector Pi
initial_tags = [sentence[0][1] for sentence in tokenized_dataset]
initial_tag_counts = Counter(initial_tags)
Pi = np.zeros(len(all_tags))
for i, tag in enumerate(all_tags):
    Pi[i] = initial_tag_counts[tag] / len(tokenized_dataset)
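
The emission matrix above only has columns for words seen in training, so `all_words.index(...)` raises a `ValueError` for any new word. One common way to handle this, shown here only as a sketch, is add-alpha (Laplace) smoothing with a reserved `<UNK>` column that every unseen word maps to. The function name `smoothed_emissions`, the `alpha` parameter, and the toy sentences are assumptions made for this example and are not part of the lab code.

```python
import numpy as np

def smoothed_emissions(tagged_sentences, alpha=1.0):
    """Emission matrix with add-alpha smoothing and an extra <UNK> column (sketch)."""
    tags = sorted({tag for sent in tagged_sentences for _, tag in sent})
    words = sorted({word for sent in tagged_sentences for word, _ in sent}) + ["<UNK>"]
    tag_to_index = {tag: i for i, tag in enumerate(tags)}
    word_to_index = {word: i for i, word in enumerate(words)}

    counts = np.zeros((len(tags), len(words)))
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[tag_to_index[tag], word_to_index[word]] += 1

    # Every cell gets alpha pseudo-counts, so no emission probability is zero
    # and each row still sums to 1.
    B = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * len(words))
    return B, tag_to_index, word_to_index

# Toy data in the same (word, tag) format produced by tokenize_with_pos
toy_sentences = [
    [("the", "DT"), ("dog", "NN")],
    [("a", "DT"), ("cat", "NN"), ("runs", "VBZ")],
]
toy_B, toy_tag_to_index, toy_word_to_index = smoothed_emissions(toy_sentences)

# Unseen words fall back to the <UNK> column instead of raising an error
unk = toy_word_to_index["<UNK>"]
indices = [toy_word_to_index.get(w, unk) for w in "the bird runs".split()]
```

With this mapping, a sentence that contains new words can be passed to Viterbi whole, so the tags around an unknown word still see it as context.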


In [5]:
sentence1 = "I am excited to go on vacation next week"
word_indices = [all_words.index(word.lower()) for word in sentence1.split()]
tags_indices, T1, T2 = viterbi(word_indices, A, B, Pi)
print("POS Sequence:", [all_tags[x] for x in tags_indices])

POS Sequence: ['PRP', 'VBP', 'VBN', 'TO', 'VB', 'IN', 'NN', 'JJ', 'NN']


## Evaluate HMM

In [6]:
# Step 1: Read and tokenize the test dataset
with open("test_pos.txt", "r") as file:
    test_dataset = file.readlines()

test_tokenized_dataset = tokenize_with_pos(test_dataset)

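# Step 2: Predict the POS tags for each sentence in the test dataset
predicted_tags = []
true_tags = []

# A dictionary gives constant-time lookups; list.index would scan the whole vocabulary
word_to_index = {word: i for i, word in enumerate(all_words)}

for sentence in test_tokenized_dataset:
    words = [token[0] for token in sentence]
    true_tags_sentence = [token[1] for token in sentence]
    
    # -1 marks words that never appeared in the training data
    word_indices = [word_to_index.get(word, -1) for word in words]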
    
    known_word_indices = [i for i in word_indices if i != -1]
    
    if known_word_indices:  # Only predict if there are known words in the sentence
        tags_indices, _, _ = viterbi(known_word_indices, A, B, Pi)
        predicted_tags_sentence = [all_tags[x] for x in tags_indices]
    else:
        predicted_tags_sentence = []

    # Insert default tags for unseen words
    for i, index in enumerate(word_indices):
        if index == -1:
            predicted_tags_sentence.insert(i, "None")
    
    predicted_tags.extend(predicted_tags_sentence)
    true_tags.extend(true_tags_sentence)

# Step 3: Compare the predicted POS tags with the ground truth POS tags
correct_count = sum(p == t for p, t in zip(predicted_tags, true_tags))
total_count = len(true_tags)

# Step 4: Calculate evaluation metrics such as accuracy
accuracy = correct_count / total_count
print("Accuracy:", accuracy)

Accuracy: 0.864596745256137
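
A single accuracy number hides which tags the model gets wrong. For the evaluation part of the assignment, a per-tag breakdown can help. The helper below is a minimal sketch that assumes two flat, aligned lists of tag strings, like `true_tags` and `predicted_tags` above. The name `per_tag_accuracy` and the toy lists are made up for illustration.

```python
from collections import Counter

def per_tag_accuracy(true_tags, predicted_tags):
    """Return {tag: fraction of tokens with that gold tag predicted correctly}."""
    total, correct = Counter(), Counter()
    for gold, pred in zip(true_tags, predicted_tags):
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    return {tag: correct[tag] / total[tag] for tag in total}

# Toy example; "None" mimics the placeholder used for unseen words above
toy_true = ["NN", "NN", "VB", "DT"]
toy_pred = ["NN", "JJ", "VB", "None"]
toy_report = per_tag_accuracy(toy_true, toy_pred)
for tag, acc in sorted(toy_report.items()):
    print(f"{tag}: {acc:.2f}")
```

The same helper works for the LSTM and BERT predictions once they are converted back to tag strings.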


# LSTM for POS Tagging

Use a 2-layer BiLSTM from PyTorch, as we did in Lab 10:
- nn.LSTM(..., num_layers=2, bidirectional=True)

In [7]:
! wget https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/train_pos.txt
! wget https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/test_pos.txt

In [8]:
import torch
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

class PoSDataset(Dataset):
    def __init__(self, sentences, tags):
        self.sentences = sentences
        self.tags = tags

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        return self.sentences[index], self.tags[index]


def read_data(file_path):
    with open(file_path, "r") as f:
        content = f.read()
    sentences_raw = content.split(" \n")[:-1]
    sentences, tags = [], []
    for sent in sentences_raw:
        words, pos_tags = [], []
        for word_tag in sent.split("\n"):
            if word_tag:
                word, tag = word_tag.split(" ")
                words.append(word)
                pos_tags.append(tag)
        sentences.append(words)
        tags.append(pos_tags)
    return sentences, tags

In [9]:
train_sentences, train_tags = read_data("train_pos.txt")
test_sentences, test_tags = read_data("test_pos.txt")

In [10]:
def create_vocab(sentences, tags):
    vocab = {"<PAD>": 0, "<UNK>": 1}
    tags_vocab = {"<PAD>": 0}

    for sent, pos_tags in zip(sentences, tags):
        for word, tag in zip(sent, pos_tags):
            if word not in vocab:
                vocab[word] = len(vocab)
            if tag not in tags_vocab:
                tags_vocab[tag] = len(tags_vocab)

    return vocab, tags_vocab

In [11]:
vocab, tags_vocab = create_vocab(train_sentences, train_tags)

In [12]:
def tokenize_and_pad(sentences, tags, vocab, tags_vocab, seq_len):
    tokenized_sentences = []
    tokenized_tags = []

    for sent, pos_tags in zip(sentences, tags):
        tokenized_sent = [vocab.get(word, vocab["<UNK>"]) for word in sent[:seq_len]]
        tokenized_tag = [tags_vocab[tag] for tag in pos_tags[:seq_len]]

        # Padding
        tokenized_sent += [vocab["<PAD>"]] * (seq_len - len(tokenized_sent))
        tokenized_tag += [tags_vocab["<PAD>"]] * (seq_len - len(tokenized_tag))

        tokenized_sentences.append(tokenized_sent)
        tokenized_tags.append(tokenized_tag)

    return tokenized_sentences, tokenized_tags


def collate_fn(batch):
    batch_input, batch_output = [], []
    for x in batch:
        batch_input.append(x[0])
        batch_output.append(x[1])

    batch_input = torch.tensor(batch_input, dtype=torch.int)
    batch_output = torch.tensor(batch_output, dtype=torch.long)
    return batch_input, batch_output

In [13]:
# Find the longest sentence in the dataset
seq_len = max([len(sent) for sent in train_sentences])
SEQ_LEN = seq_len

train_sentences, train_tags = tokenize_and_pad(train_sentences, train_tags, vocab, tags_vocab, seq_len)
test_sentences, test_tags = tokenize_and_pad(test_sentences, test_tags, vocab, tags_vocab, seq_len)

train_dataset = PoSDataset(train_sentences, train_tags)
test_dataset = PoSDataset(test_sentences, test_tags)

dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

In [14]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

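        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim per direction. batch_first=True because
        # forward() feeds tensors shaped (batch, seq_len, embedding_dim).
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, batch_first=True)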

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(sentence.shape[0], sentence.shape[1], -1))
        tag_space = self.hidden2tag(lstm_out.view(sentence.shape[0] * sentence.shape[1], -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

In [15]:
EMBEDDING_DIM = 256
HIDDEN_DIM = 256
VOCAB_SIZE = len(vocab)
TARGET_SIZE = len(tags_vocab)

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE, TARGET_SIZE)

# define loss function and optimizer
optimizer = optim.Adam(model.parameters())
criterion = nn.NLLLoss()

# make model instance and send it to training device
model = model.to(device)
criterion = criterion.to(device)

In [17]:
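def accuracy_calculator(preds, y):
    # Fraction of matching positions. y still contains <PAD> targets here,
    # so padded positions count towards this accuracy as well.
    return (preds == y).sum() / len(y)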

def train(model, dataloader, optimizer, criterion, device):
    epoch_loss = 0
    epoch_acc = 0

    model.train()
    
    for text, tags in dataloader:
        text = text.to(device)
        tags = tags.to(device)

        # initialize optimizer
        optimizer.zero_grad()

        # predict tags and compute loss
        predictions = model(text).view(-1, TARGET_SIZE)
        loss = criterion(predictions, tags.view(-1))

        # Calculate accuracy
        _, preds = torch.max(predictions, dim=1)
        acc = accuracy_calculator(preds, tags.view(-1))
        
        # backpropagate loss and optimize weights
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item() * tags.shape[0]
        epoch_acc += acc.item() * tags.shape[0]
        
    return epoch_loss / len(dataloader.dataset), epoch_acc / len(dataloader.dataset)
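
`accuracy_calculator` compares every position of the flattened batch, and the targets include the `<PAD>` entries added by `tokenize_and_pad`. Since every sentence is padded to the length of the longest training sentence, correctly predicted padding can dominate the score. A token-level accuracy that skips padding might look like the sketch below. It uses NumPy arrays to stay self-contained; the same boolean-mask indexing is available on PyTorch tensors. The name `masked_accuracy` and the toy arrays are assumptions for this example.

```python
import numpy as np

def masked_accuracy(preds, targets, pad_idx=0):
    """Accuracy over positions whose target is not the padding index."""
    preds = np.asarray(preds)
    targets = np.asarray(targets)
    mask = targets != pad_idx
    return float((preds[mask] == targets[mask]).sum() / mask.sum())

# One padded toy sentence: 3 real tokens followed by 5 padding positions
toy_targets = np.array([3, 5, 2, 0, 0, 0, 0, 0])
toy_preds = np.array([3, 4, 1, 0, 0, 0, 0, 0])

plain = float((toy_preds == toy_targets).mean())   # counts padding as correct
masked = masked_accuracy(toy_preds, toy_targets)   # real tokens only
print(f"all positions: {plain:.2f}, real tokens only: {masked:.2f}")
```

In this toy case only one of the three real tokens is right, yet the unmasked score is much higher because the five padding positions all match.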



In [18]:
epochs = 5

for epoch in range(epochs):
    train_loss, train_acc = train(model, dataloader, optimizer, criterion, device)
    print(f'Epoch: {epoch+1}, Train [Loss:  {train_loss:.3f}  Acc: {train_acc*100:.2f}]')



Epoch: 1, Train [Loss:  0.512  Acc: 86.56]
Epoch: 2, Train [Loss:  0.160  Acc: 95.52]
Epoch: 3, Train [Loss:  0.097  Acc: 97.17]
Epoch: 4, Train [Loss:  0.067  Acc: 97.91]
Epoch: 5, Train [Loss:  0.050  Acc: 98.38]
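
The assignment asks for a validation split to decide when to stop training, while the loop above runs for a fixed five epochs. One possible shape for that logic is sketched below using only the standard library. `split_train_val` and `train_with_early_stopping` are hypothetical helpers, and `run_epoch` and `validate` stand in for calls such as `train(...)` and `evaluate_model(...)` on a validation loader. `train_test_split` from scikit-learn, which the notebook already imports, could replace the manual split.

```python
import random

def split_train_val(sentences, tags, val_fraction=0.1, seed=0):
    """Shuffle sentence indices and hold out a fraction for validation."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(indices) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(sentences, train_idx), pick(tags, train_idx),
            pick(sentences, val_idx), pick(tags, val_idx))

def train_with_early_stopping(run_epoch, validate, max_epochs=50, patience=2):
    """Stop once validation loss has not improved for `patience` epochs in a row."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        run_epoch()            # one pass over the training data
        val_loss = validate()  # loss on the held-out validation data
        if val_loss < best_loss:
            # In a real run, this is where the model weights would be saved
            best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_loss

# Simulated validation losses (made-up numbers) to show the stopping rule
fake_losses = iter([0.50, 0.30, 0.25, 0.27, 0.26, 0.20])
best_epoch, best_loss = train_with_early_stopping(
    run_epoch=lambda: None, validate=lambda: next(fake_losses), patience=2)
print(best_epoch, best_loss)
```

In the simulated run, training stops after the fifth epoch because epochs four and five both fail to beat the loss from epoch three, so the sixth value is never reached.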


## Evaluate LSTM

In [19]:
def evaluate_model(model, data_batches, criterion, device):
    eval_loss = 0
    eval_acc = 0
    
    model.eval()
    
    with torch.no_grad():
        for text, tags in data_batches:
            text = text.to(device)
            tags = tags.to(device)

            predictions = model(text).view(-1, TARGET_SIZE)
            loss = criterion(predictions, tags.view(-1))

            _, preds = torch.max(predictions, dim=1)
            acc = accuracy_calculator(preds, tags.view(-1))

            eval_loss += loss.item() * tags.shape[0]
            eval_acc += acc.item() * tags.shape[0]
    
    return eval_loss / len(data_batches.dataset), eval_acc / len(data_batches.dataset)

In [20]:
test_loss, test_acc = evaluate_model(model, test_dataloader, criterion, device)
print(f'Accuracy on test data: {test_acc*100:.2f}%')

Accuracy on test data: 97.11%


# BERT for POS Tagging

You can fine-tune a pretrained model from Hugging Face or train one from randomly initialized weights. You **don't** need to implement the model architecture yourself.

Refer to the fine-tuning BERT part of Lab 5.

In [21]:
# !pip install transformers

In [22]:
!wget https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/train_pos.txt
!wget https://raw.githubusercontent.com/Gci04/AML-DS-2021/main/data/PosTagging/test_pos.txt
!pip install transformers

In [23]:
!pip install pytorch-pretrained-bert

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [24]:
import os
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import numpy as np
import torch
import torch.nn as nn
from torch.utils import data
import torch.optim as optim
from pytorch_pretrained_bert import BertTokenizer

In [25]:
def read_data(file_path):
    with open(file_path, "r") as f:
        content = f.read()
    sentences_raw = content.split(" \n")[:-1]
    sentences, tags = [], []
    for sent in sentences_raw:
        words, pos_tags = [], []
        for word_tag in sent.split("\n"):
            if word_tag:
                word, tag = word_tag.split(" ")
                words.append(word)
                pos_tags.append(tag)
        sentences.append(words)
        tags.append(pos_tags)
    return sentences, tags

train_sentences, train_tags = read_data("train_pos.txt")
test_sentences, test_tags = read_data("test_pos.txt")

In [26]:
# Sort the tag set so that the tag-to-index mapping is the same on every run
tags = sorted(set(tag for sent in train_tags for tag in sent))
# By convention, the 0'th slot is reserved for padding.
tags = ["<pad>"] + tags

tag2idx = {tag:idx for idx, tag in enumerate(tags)}
idx2tag = {idx:tag for idx, tag in enumerate(tags)}

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

In [27]:
class PosDataset(data.Dataset):
    def __init__(self, sentences, tags):
        sents, tags_li = [], [] # list of lists
        for words, tags_ in zip(sentences, tags):
            sents.append(["[CLS]"] + words + ["[SEP]"])
            tags_li.append(["<pad>"] + tags_ + ["<pad>"])
        self.sents, self.tags_li = sents, tags_li

    def __len__(self):
        return len(self.sents)

    def __getitem__(self, idx):
        words, tags = self.sents[idx], self.tags_li[idx] # words, tags: string list

        # Only the first word piece of each word receives the tag; the remaining pieces are labeled <pad>.
        x, y = [], [] # list of ids
        is_heads = [] # list. 1: the token is the first piece of a word
        for w, t in zip(words, tags):
            tokens = tokenizer.tokenize(w) if w not in ("[CLS]", "[SEP]") else [w]
            xx = tokenizer.convert_tokens_to_ids(tokens)

            is_head = [1] + [0]*(len(tokens) - 1)

            t = [t] + ["<pad>"] * (len(tokens) - 1)  # <PAD>: no decision
            yy = [tag2idx[each] for each in t]  # (T,)

            x.extend(xx)
            is_heads.extend(is_head)
            y.extend(yy)

        assert len(x)==len(y)==len(is_heads), "len(x)={}, len(y)={}, len(is_heads)={}".format(len(x), len(y), len(is_heads))

        # seqlen
        seqlen = len(y)

        # to string
        words = " ".join(words)
        tags = " ".join(tags)
        return words, x, is_heads, tags, y, seqlen
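
`__getitem__` above has to reconcile word-level tags with subword tokens: a word may be split into several pieces, but it has only one tag. The sketch below isolates that alignment step. To stay runnable without downloading a vocabulary, it uses `toy_wordpiece`, a made-up splitter that cuts words into fixed-size chunks. It only imitates the `##` continuation convention and is not how `BertTokenizer` actually segments text.

```python
def toy_wordpiece(word, max_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks, '##' on continuations."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]

def align_labels(words, tags, tokenize=toy_wordpiece, pad_tag="<pad>"):
    """Give each word's tag to its first piece and pad_tag to the remaining pieces."""
    tokens, labels, is_heads = [], [], []
    for word, tag in zip(words, tags):
        pieces = tokenize(word)
        tokens.extend(pieces)
        labels.extend([tag] + [pad_tag] * (len(pieces) - 1))
        is_heads.extend([1] + [0] * (len(pieces) - 1))
    return tokens, labels, is_heads

toy_tokens, toy_labels, toy_heads = align_labels(["Tagging", "works"], ["NN", "VBZ"])
print(toy_tokens)   # subword pieces
print(toy_labels)   # one real tag per word, <pad> elsewhere

# Filtering by is_heads recovers exactly one label per original word
word_level = [label for label, head in zip(toy_labels, toy_heads) if head == 1]
```

The `is_heads` mask is what allows the evaluation code to discard predictions made on continuation pieces and compare one prediction per word.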

In [28]:
def pad(batch):
    '''Pads to the longest sample'''
    f = lambda x: [sample[x] for sample in batch]
    words = f(0)
    is_heads = f(2)
    tags = f(3)
    seqlens = f(-1)
    maxlen = np.array(seqlens).max()

    f = lambda x, seqlen: [sample[x] + [0] * (seqlen - len(sample[x])) for sample in batch] # 0: <pad>
    x = f(1, maxlen)
    y = f(-2, maxlen)


    f = torch.LongTensor

    return words, f(x), is_heads, tags, f(y), seqlens

In [29]:
from pytorch_pretrained_bert import BertModel

class Net(nn.Module):
    def __init__(self, vocab_size=None):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')

        self.fc = nn.Linear(768, vocab_size)
        self.device = device

    def forward(self, x, y):
        '''
        x: (N, T). int64
        y: (N, T). int64
        '''
        x = x.to(device)
        y = y.to(device)
        
        if self.training:
            self.bert.train()
            encoded_layers, _ = self.bert(x)
            enc = encoded_layers[-1]
        else:
            self.bert.eval()
            with torch.no_grad():
                encoded_layers, _ = self.bert(x)
                enc = encoded_layers[-1]
        
        logits = self.fc(enc)
        y_hat = logits.argmax(-1)
        return logits, y, y_hat

In [30]:
def train(model, iterator, optimizer, criterion):
    model.train()
    for i, batch in enumerate(iterator):
        words, x, is_heads, tags, y, seqlens = batch
        _y = y # for monitoring
        optimizer.zero_grad()
        logits, y, _ = model(x, y) # logits: (N, T, VOCAB), y: (N, T)

        logits = logits.view(-1, logits.shape[-1]) # (N*T, VOCAB)
        y = y.view(-1)  # (N*T,)

        loss = criterion(logits, y)
        loss.backward()

        optimizer.step()

        if i%10==0: # monitoring
            print("step: {}, loss: {}".format(i, loss.item()))





In [31]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Net(vocab_size=len(tag2idx))
model.to(device)
model = nn.DataParallel(model)

In [32]:

train_dataset = PosDataset(train_sentences, train_tags)
eval_dataset = PosDataset(test_sentences, test_tags)

train_iter = data.DataLoader(dataset=train_dataset,
                             batch_size=8,
                             shuffle=True,
                             num_workers=1,
                             collate_fn=pad)
test_iter = data.DataLoader(dataset=eval_dataset,
                             batch_size=8,
                             shuffle=False,
                             num_workers=1,
                             collate_fn=pad)

optimizer = optim.Adam(model.parameters(), lr = 0.0001)

criterion = nn.CrossEntropyLoss(ignore_index=0)
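
Index 0 is the `<pad>` tag, so `ignore_index=0` keeps padded positions and non-initial word pieces out of the loss. To make the effect concrete, the sketch below reimplements the idea in NumPy: a mean cross-entropy taken only over targets that differ from the ignored index. `cross_entropy_ignore` is an illustrative helper written for this example, not PyTorch's implementation, and the logits are made-up numbers.

```python
import numpy as np

def cross_entropy_ignore(logits, targets, ignore_index=0):
    """Mean cross-entropy over rows whose target is not ignore_index (sketch)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the class dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    keep = targets != ignore_index
    return float(-picked[keep].mean())

# Three positions, three classes; the last position is padding (target 0)
toy_logits = np.array([[0.0, 2.0, 0.0],
                       [0.0, 0.0, 3.0],
                       [5.0, 0.0, 0.0]])
toy_targets = np.array([1, 2, 0])

loss = cross_entropy_ignore(toy_logits, toy_targets)

# Changing the logits at the padded position leaves the loss untouched
changed = toy_logits.copy()
changed[2] = [-4.0, 7.0, 1.0]
loss_changed = cross_entropy_ignore(changed, toy_targets)
```

Since ignored positions contribute nothing to the loss, they also contribute no gradient, which is why the model is never pushed to predict `<pad>`.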



In [33]:
train(model, train_iter, optimizer, criterion)


step: 0, loss: 3.8752777576446533
step: 10, loss: 1.6787902116775513
step: 20, loss: 0.8217227458953857
step: 30, loss: 0.3105090260505676
step: 40, loss: 0.3609287738800049
step: 50, loss: 0.20232824981212616
step: 60, loss: 0.21270792186260223
step: 70, loss: 0.1429995447397232
step: 80, loss: 0.19636942446231842
step: 90, loss: 0.1384022831916809
step: 100, loss: 0.11066035181283951
step: 110, loss: 0.1303025782108307
step: 120, loss: 0.05210668966174126
step: 130, loss: 0.11403147876262665
step: 140, loss: 0.057761166244745255
step: 150, loss: 0.1569521278142929
step: 160, loss: 0.1303919404745102
step: 170, loss: 0.09457496553659439
step: 180, loss: 0.08607296645641327
step: 190, loss: 0.09879419207572937
step: 200, loss: 0.11601632088422775
step: 210, loss: 0.07898501306772232
step: 220, loss: 0.16456657648086548
step: 230, loss: 0.08946497738361359
step: 240, loss: 0.07853136956691742
step: 250, loss: 0.10468841344118118
step: 260, loss: 0.0906369537115097
step: 270, loss: 0.073

## Evaluate BERT

In [34]:
def eval(model, iterator):
    model.eval()

    Words, Is_heads, Tags, Y, Y_hat = [], [], [], [], []
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            words, x, is_heads, tags, y, seqlens = batch

            _, _, y_hat = model(x, y)  # y_hat: (N, T)

            Words.extend(words)
            Is_heads.extend(is_heads)
            Tags.extend(tags)
            Y.extend(y.numpy().tolist())
            Y_hat.extend(y_hat.cpu().numpy().tolist())

    ## collect the results and save them
    with open("result", 'w') as fout:
        for words, is_heads, tags, y_hat in zip(Words, Is_heads, Tags, Y_hat):
            y_hat = [hat for head, hat in zip(is_heads, y_hat) if head == 1]
            preds = [idx2tag[hat] for hat in y_hat]
            assert len(preds)==len(words.split())==len(tags.split())
            for w, t, p in zip(words.split()[1:-1], tags.split()[1:-1], preds[1:-1]):
                fout.write("{} {} {}\n".format(w, t, p))
            fout.write("\n")
            
    ## calc metric
    # Read the result file once and close it properly
    with open("result", "r") as fin:
        rows = [line.split() for line in fin.read().splitlines() if len(line) > 0]
    y_true = np.array([tag2idx[row[1]] for row in rows])
    y_pred = np.array([tag2idx[row[2]] for row in rows])

    acc = (y_true==y_pred).astype(np.int32).sum() / len(y_true)

    print("Accuracy=%.2f"%acc)

In [35]:
eval(model, test_iter)

Accuracy=0.98


# Conclusion

Write your opinions and conclusions about the application of HMM, LSTM and BERT to PoS Tagging
- discuss the results
- pros and cons of each model
- 4-6 sentences

Answer: On the test set, BERT reached the highest accuracy (98%), followed by the BiLSTM (97.1%) and the HMM (86.5%). The HMM is the cheapest model to train, since fitting it only requires counting tag transitions and word emissions, but its first-order Markov assumption limits the context to the previous tag, and the evaluation above scores every word unseen in training as an error. The BiLSTM learns word embeddings and can use context from both directions of the sentence, which improves accuracy at the cost of slower, gradient-based training; its reported accuracy is computed over padded sequences, so it is not strictly comparable with the other two figures. BERT starts from pre-trained contextual representations and gives the best accuracy after a single pass of fine-tuning over the training data, but it has by far the highest memory and compute requirements. None of the models was tuned on a separate validation split, so the comparison should be read as indicative rather than final.