<a href="https://colab.research.google.com/github/RafalCer/Machine_learning/blob/main/PyTorch_implementation_and_extensions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Note**: This notebook was submitted for the third assignment of the Machine Learning for Language Technology course at Uppsala University. The original code implementing data pre-processing and classifiers from PyTorch was prepared by the teacher with reference to the PyTorch tutorials. 

# **INTRODUCTION**

This notebook addresses the third Machine Learning assignment. It first introduces the data pre-processing, followed by the Tagger class. The third part of the document addresses the seven extensions along with a discussion:



1.   Implementation of GRU
2.   Bi-directional LSTM
3.  Dataloader class
4.  Regularization and weight decay
5.  Tagging a morphologically rich language
6.  Cross-language transferring
7.  Language-specific tagset

I split the analysis into several sections that also address the extensions.

I would argue that learning to work with PyTorch was a real challenge. This was, however, also the most useful part. This assignment also helped me grasp recurrent neural networks in greater detail. By far the most difficult part was the extension involving data augmentation. It was difficult to find useful examples on the internet and even more difficult to implement the augmentation techniques described in the literature.

# **IMPORTING AND PARSING THE DATA**

Importing the whole UD data set so that any language can be used.

In [None]:
import tarfile
import os 

os.chdir('/content')
!wget -N https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3687{/ud-treebanks-v2.8.tgz,/ud-documentation-v2.8.tgz,/ud-tools-v2.8.tgz}

tar = tarfile.open('ud-treebanks-v2.8.tgz')
tar.extractall()
tar.close()

--2021-05-20 00:30:52--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3687/ud-treebanks-v2.8.tgz
Resolving lindat.mff.cuni.cz (lindat.mff.cuni.cz)... 195.113.20.140
Connecting to lindat.mff.cuni.cz (lindat.mff.cuni.cz)|195.113.20.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 430576293 (411M) [application/x-gzip]
Saving to: ‘ud-treebanks-v2.8.tgz’


2021-05-20 00:31:16 (18.2 MB/s) - ‘ud-treebanks-v2.8.tgz’ saved [430576293/430576293]

--2021-05-20 00:31:16--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3687/ud-documentation-v2.8.tgz
Reusing existing connection to lindat.mff.cuni.cz:443.
HTTP request sent, awaiting response... 200 OK
Length: 94335511 (90M) [application/x-gzip]
Saving to: ‘ud-documentation-v2.8.tgz’


2021-05-20 00:31:21 (19.4 MB/s) - ‘ud-documentation-v2.8.tgz’ saved [94335511/94335511]

--2021-05-20 00:31:21--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3687/ud-too

Parsing the downloaded data sets with a custom *conllu_parser* function.

In [None]:
import os

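def conllu_parser(file_path, file_name):
  """Read a CoNLL-U file and return parallel lists of token and UPOS-tag sentences."""
  sentences, tags = [], []
  sentence_, sentence_tags = [], []
  with open(os.path.join(file_path, file_name), 'r', encoding='utf-8') as infile:
    for line in infile:
      line = line.rstrip('\r\n')
      if not line.strip():                            # A blank line marks the end of a sentence
        if sentence_:
          sentences.append(sentence_)
          tags.append(sentence_tags)
          sentence_, sentence_tags = [], []
        continue
      if line.startswith('#'):                        # Skip comment and metadata lines
        continue
      fields = line.split('\t')
      if not fields[0].isdigit() or len(fields) < 4:  # Skip multiword ranges (1-2), empty nodes (1.1) and malformed rows
        continue
      sentence_.append(fields[1])                     # Column 2: word form
      sentence_tags.append(fields[3])                 # Column 4: universal POS tag
  if sentence_:                                       # Keep the final sentence if the file lacks a trailing blank line
    sentences.append(sentence_)
    tags.append(sentence_tags)

  return sentences, tags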
############################## LT ALKSNIS ######################################
X_lt_train_alksnis, y_lt_train_alksnis = conllu_parser('/content/ud-treebanks-v2.8/UD_Lithuanian-ALKSNIS', 'lt_alksnis-ud-train.conllu')
X_lt_test_alksnis, y_lt_test_alksnis = conllu_parser('/content/ud-treebanks-v2.8/UD_Lithuanian-ALKSNIS', 'lt_alksnis-ud-test.conllu')

############################## PL LFG ##########################################
X_pl_train_lfg, y_pl_train_lfg = conllu_parser('/content/ud-treebanks-v2.8/UD_Polish-LFG', 'pl_lfg-ud-train.conllu')
X_pl_test_lfg, y_pl_test_lfg = conllu_parser('/content/ud-treebanks-v2.8/UD_Polish-LFG', 'pl_lfg-ud-test.conllu')

############################# SE TALBANKEN #####################################
X_se_train_talbanken, y_se_train_talbanken = conllu_parser('/content/ud-treebanks-v2.8/UD_Swedish-Talbanken', 'sv_talbanken-ud-train.conllu')
X_se_test_talbanken, y_se_test_talbanken = conllu_parser('/content/ud-treebanks-v2.8/UD_Swedish-Talbanken', 'sv_talbanken-ud-test.conllu')

################################### EN ESL #####################################
X_en_train_esl, y_en_train_esl = conllu_parser('/content/ud-treebanks-v2.8/UD_English-ESL', 'en_esl-ud-train.conllu')
X_en_test_esl, y_en_test_esl = conllu_parser('/content/ud-treebanks-v2.8/UD_English-ESL', 'en_esl-ud-test.conllu')

################################### EN EWT #####################################
X_en_train_ewt, y_en_train_ewt = conllu_parser('/content/ud-treebanks-v2.8/UD_English-EWT', 'en_ewt-ud-train.conllu')
X_en_test_ewt, y_en_test_ewt = conllu_parser('/content/ud-treebanks-v2.8/UD_English-EWT', 'en_ewt-ud-test.conllu')
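
For reference, each token row in a CoNLL-U file consists of ten tab-separated columns, and the parser above keeps the second column (the word form) and the fourth (the universal POS tag). The following is a minimal sketch of that idea on an in-memory sample; the helper names `read_conllu_text` and `make_row` and the two toy sentences are made up for illustration and are not part of the assignment code.

```python
def read_conllu_text(text):
    """Return (sentences, tags) parsed from a CoNLL-U formatted string."""
    sentences, tags = [], []
    words, upos = [], []
    for line in text.splitlines():
        if not line.strip():              # blank line: sentence boundary
            if words:
                sentences.append(words)
                tags.append(upos)
                words, upos = [], []
            continue
        if line.startswith('#'):          # comment or metadata line
            continue
        fields = line.split('\t')
        if not fields[0].isdigit():       # skip ranges like "1-2" and empty nodes like "1.1"
            continue
        words.append(fields[1])           # column 2: FORM
        upos.append(fields[3])            # column 4: UPOS
    if words:                             # last sentence when there is no trailing blank line
        sentences.append(words)
        tags.append(upos)
    return sentences, tags


def make_row(index, form, tag):
    """Build a ten-column CoNLL-U row with unused columns set to '_' (toy helper)."""
    return '\t'.join([str(index), form, '_', tag] + ['_'] * 6)


# Hypothetical sample: a three-token sentence followed by a one-token sentence.
sample = '\n'.join([
    '# sent_id = 1',
    make_row(1, 'It', 'PRON'),
    make_row(2, 'works', 'VERB'),
    make_row(3, '.', 'PUNCT'),
    '',
    '# sent_id = 2',
    make_row(1, 'Ok', 'INTJ'),
])

toy_sentences, toy_tags = read_conllu_text(sample)
print(toy_sentences)
print(toy_tags)
```

Using blank lines as sentence boundaries means that one-token sentences and the last sentence of a file are kept, which is worth checking whenever the number of parsed sentences looks lower than expected.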



# **DATASET and DATALOADER as DataProcessing class**

The DataProcessing class is a mixture of Dataset and DataLoader. The class pads and encodes the data and iterates through the batches. It is briefly discussed in the Extension 3 section.

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class DataProcessing(Dataset, DataLoader):
  def __init__(self, X_, y_):
    tokens = {token for sentence in X_ for token in sentence}
    idx2token = list(tokens) 
    idx2token.insert(0, '<UNK>')
    idx2token.append('<PAD>')
    self.idx2token = idx2token
    self.token2idx = {token:idx for idx, token in enumerate(idx2token)}

    tags = {tag for tags in y_ for tag in tags}
    idx2tag = list(tags)
    idx2tag.append('<PAD>')
    self.idx2tag = idx2tag
    self.tag2idx = {tag:idx for idx, tag in enumerate(idx2tag)}

  def pad_and_encode(self, X_, y_=None):
    self.max_sentence_length = max(len(sentence) for sentence in X_)
    padded_sentences = torch.full((len(X_), self.max_sentence_length),
                                  self.token2idx['<PAD>'], dtype=torch.long)
    for i, sentence in enumerate(X_):
      for j, token in enumerate(sentence):
        padded_sentences[i, j] = self.token2idx.get(token, self.token2idx['<UNK>'])  # Unseen tokens map to <UNK>

    if y_ is None:                                    # No labels given, e.g. when encoding untagged text
      return padded_sentences, None

    padded_labels = torch.full((len(X_), self.max_sentence_length),
                               self.tag2idx['<PAD>'], dtype=torch.long)
    for i, tags in enumerate(y_):
      for j, tag in enumerate(tags):
        padded_labels[i, j] = self.tag2idx[tag]

    return padded_sentences, padded_labels

  def batch_iterator(self, sentences, labels, batch_size=64):
    for i in range(0, len(sentences), batch_size):
      X, y = self.pad_and_encode(sentences[i:min(i+batch_size, len(sentences))],
                            labels[i:min(i+batch_size, len(sentences))])
      if torch.cuda.is_available():
        yield (X.cuda(), y.cuda())
      else:
        yield (X, y)

processor = DataProcessing(X_lt_train_alksnis, y_lt_train_alksnis)
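
To make the encoding step concrete, here is a small plain-Python sketch of what padding and encoding does to a batch. It uses nested lists instead of tensors and sorts the vocabulary so that the indices are reproducible; both choices, as well as the helper names `build_vocab` and `pad_and_encode_lists`, are assumptions made for illustration and differ from the class above.

```python
def build_vocab(sentences):
    """Map tokens to indices, with <UNK> first and <PAD> last."""
    tokens = sorted({token for sentence in sentences for token in sentence})
    idx2token = ['<UNK>'] + tokens + ['<PAD>']
    return {token: idx for idx, token in enumerate(idx2token)}


def pad_and_encode_lists(sentences, token2idx):
    """Encode sentences as index lists, right-padded to the longest sentence."""
    max_len = max(len(sentence) for sentence in sentences)
    pad, unk = token2idx['<PAD>'], token2idx['<UNK>']
    return [
        [token2idx.get(token, unk) for token in sentence] + [pad] * (max_len - len(sentence))
        for sentence in sentences
    ]


# Toy data: 'dog' does not occur in the training sentences, so it becomes <UNK>.
toy_train = [['the', 'cat', 'sleeps'], ['dogs', 'bark']]
toy_token2idx = build_vocab(toy_train)
toy_encoded = pad_and_encode_lists([['the', 'dog', 'sleeps'], ['bark']], toy_token2idx)

print(toy_token2idx)
print(toy_encoded)  # [[5, 0, 4], [1, 6, 6]]
```

The second sentence is padded with the `<PAD>` index (6 here) up to the length of the first, and the unseen word receives the `<UNK>` index (0).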

# **THE MODEL**

I created an outer class called Tagger in which I instantiate the RNN classifier. This made the code look messy, yet the model arguably became more user-friendly, as will be briefly mentioned in the extensions. Additionally, such a layout is somewhat closer to that of the scikit-learn API.

In [None]:
import torch
import torch.nn as nn
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import matplotlib.pyplot as plt

class Tagger():
  def __init__(self, bidirectional_bool=False, rnn_type='LSTM', data_storage=None,
               word_embedding_dim=32, 
                   rnn_hidden_dim=64,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1,):
    
    self.rnn_type = rnn_type
    self.data_storage = data_storage

    class rnn(nn.Module):
      def __init__(self, word_embedding_dim, rnn_hidden_dim, vocabulary_size, tagset_size):
        super(rnn, self).__init__()   

        self.rnn_hidden_dim_ = rnn_hidden_dim                                     # This simply stores the parameters
        self.vocabulary_size_ = vocabulary_size
        self.tagset_size_ = tagset_size

        self._word_embedding = nn.Embedding(num_embeddings=vocabulary_size,         # Creates the vector space for the input words
                                            embedding_dim=word_embedding_dim, 
                                            padding_idx=data_storage.token2idx['<PAD>'])
        
        if rnn_type == 'LSTM': 
          self._lstm = nn.LSTM(input_size=word_embedding_dim,                         # The LSTM takes an embedded sentence as input, and outputs 
                              hidden_size=rnn_hidden_dim,                           # vectors with dimensionality lstm_hidden_dim.
                              batch_first=True, bidirectional=bidirectional_bool)
                
        elif rnn_type == 'GRU': 
          self._lstm = nn.GRU(input_size=word_embedding_dim, 
                              hidden_size=rnn_hidden_dim,
                              batch_first=True, bidirectional=bidirectional_bool)
          
        elif rnn_type == 'RNN': 
          self._lstm = nn.RNN(input_size=word_embedding_dim,                        # A plain (Elman) recurrent layer
                              hidden_size=rnn_hidden_dim,
                              batch_first=True, bidirectional=bidirectional_bool)
          
        if bidirectional_bool:
          self._fc = nn.Linear(self.rnn_hidden_dim_*2, tagset_size)     # The linear layer maps from the RNN output space to tag space
        else:
          self._fc = nn.Linear(self.rnn_hidden_dim_, tagset_size)   

        self._softmax = nn.LogSoftmax(dim=1)                                        # Log-softmax turns the scores into log-probabilities over tags
        
        self.training_loss_ = list()                                                # For plotting
        self.training_accuracy_ = list()

        if torch.cuda.is_available():                                               # Move the model to the GPU (if we have one)
          self.cuda()

      def forward(self, padded_sentences):
        """The forward pass through the network"""
        batch_size, max_sentence_length = padded_sentences.size()

        embedded_sentences = self._word_embedding(padded_sentences)                 # Sentences encoded as integers are mapped to vectors    

        sentence_lengths = (padded_sentences!=data_storage.token2idx['<PAD>']).sum(dim=1)        # Find the length of sentences
        sentence_lengths = sentence_lengths.long().cpu()                            # Ensure the correct format
        X = nn.utils.rnn.pack_padded_sequence(embedded_sentences, sentence_lengths, # Pack the embedded data
                                              batch_first=True, enforce_sorted=False)
        lstm_out, _ = self._lstm(X)                                                 # Run the LSTM layer
        X, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)         # Unpack the output from the LSTM

        X = X.contiguous().view(-1, X.shape[2])                                     # The output from the LSTM layer is flattened
        tag_space = self._fc(X)                                                     # Fully connected layer
        tag_scores = self._softmax(tag_space)                                       # Log-softmax is applied to normalise the outputs
        return tag_scores.view(batch_size, max_sentence_length, self.tagset_size_)


    self.model = rnn(word_embedding_dim=word_embedding_dim, rnn_hidden_dim=rnn_hidden_dim,   # Pass the constructor arguments through to the network
                     vocabulary_size=vocabulary_size, tagset_size=tagset_size)
    
################################ TRAIN #########################################

  def train(self, X_train_, y_train_, no_epochs, batch_size=256, regularizer=None):
    loss_function = nn.NLLLoss(ignore_index=self.data_storage.tag2idx['<PAD>'])                       # A loss function that fits our choice of output layer and data. The loss function will ignore the padding index in the targets.
    if regularizer:
      optimizer = torch.optim.Adam(self.model.parameters(), lr=0.01,  weight_decay=regularizer)  
    else:
      optimizer = torch.optim.Adam(self.model.parameters(), lr=0.01)                       # We give the optimiser the parameters to work with, note that we can choose to only give some parameters

    self.batch_size = batch_size                                                               # Define the size of each batch
    for epoch in range(int(no_epochs)):                                                          # Times to loop over the full dataset
      with tqdm(self.data_storage.batch_iterator(X_train_, y_train_, batch_size=batch_size), 
                total=len(X_train_)//batch_size+1, unit="batch", desc="Epoch %i" % epoch) as batches:
        for inputs, targets in batches:                                             # Loop once over the training data
          self.model.zero_grad()                                                         # Reset gradients
          scores = self.model(inputs)                                                    # Forward pass
          loss = loss_function(scores.view(-1, self.model.tagset_size_), targets.view(-1))               
          loss.backward()                                                           # Backpropagate the error
          optimizer.step()                                                          # Run the optimizer to change the weights w.r.t the loss
          predictions = scores.argmax(dim=2, keepdim=True).squeeze()                # Calculate the batch training accuracy
          mask = targets!=self.data_storage.tag2idx['<PAD>']                                          # Create a mask for ignoring <PAD> in the targets
          correct = (predictions[mask] == targets[mask]).sum().item()               # Item pulls the value from the GPU automatically (if needed)
          accuracy = correct / mask.sum().item()*100
          self.model.training_accuracy_.append(accuracy)                                 # Save the accuracy for plotting
          self.model.training_loss_.append(loss.item())                                  # Save the loss for plotting
          batches.set_postfix(loss=loss.item(), accuracy=accuracy)                  # Update the progress bar

################################# TEST #########################################
  
  def test(self, X_test_, y_test_):
    with torch.no_grad():                                                           # Do not use the following forward passes to calculate a gradient
      n_correct = 0
      n_total = 0
      for inputs, targets in self.data_storage.batch_iterator(X_test_, y_test_, batch_size=self.batch_size): # Loop once over the test data
        scores = self.model(inputs)                                                      # Runs the test data through the model
        predictions = scores.argmax(dim=2, keepdim=True).squeeze()                  # Finds the predictions
        mask = targets!=self.data_storage.tag2idx['<PAD>']                                            # Create a mask for ignoring <PAD> in the targets
        n_correct += (predictions[mask] == targets[mask]).sum().item()              # Sums the number of correct predictions
        n_total += mask.sum().item()
    print("Test accuracy %.1f%%" % (100*n_correct/n_total))

############################## PREDICT #########################################

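  def predict(self, X_):

    device = next(self.model.parameters()).device                               # Use whichever device the model already lives on
    with torch.no_grad():                                                        # No gradients are needed for inference
      scores = self.model(X_.to(device))                                         # Runs the encoded sentences through the model
    predictions = scores.argmax(dim=2).cpu()                                     # Finds the predictions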

    tags_ = np.zeros(predictions.size(), dtype=object)
    decoded_sentences = np.zeros(predictions.size(), dtype=object)
    tagged_sentences = []

    for i_1, sentence in enumerate(predictions):
      for i_2, tag in enumerate(sentence):
        tags_[i_1, i_2] = self.data_storage.idx2tag[tag]

    for i_1, sentence in enumerate(X_):
      for i_2, word in enumerate(sentence):
        decoded_sentences[i_1, i_2] = self.data_storage.idx2token[word]

    for i_1, sentence in enumerate(decoded_sentences):
      tagged_sentence_ = []
      for i_2, word in enumerate(sentence):
        if word != '<PAD>':
          tagged_sentence_.append((word, tags_[i_1, i_2]))
      tagged_sentences.append(tagged_sentence_)

    print(tagged_sentences)
        

# **DIMENSIONALITIES**

My tests suggested that a higher number of word-embedding dimensions and hidden units tends to give better results for all languages and all models. For instance, as shown below, 100 embedding dimensions with 60 hidden dimensions significantly outperformed a 50-50 combination. 

The improvement was only seen for values below 130; more than 130 dimensions often brought no improvement and sometimes even lowered the accuracy.

In [None]:
####### 50 embedding dimensions, 50 hidden dimensions ##########

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,
                   rnn_hidden_dim=50,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

In [None]:
pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)
pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 16.71batch/s, accuracy=56.2, loss=1.57]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 16.40batch/s, accuracy=63, loss=1.23]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 17.21batch/s, accuracy=68.5, loss=1.02]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 16.83batch/s, accuracy=74.8, loss=0.846]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 16.89batch/s, accuracy=80.3, loss=0.667]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 16.95batch/s, accuracy=85.3, loss=0.499]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 17.14batch/s, accuracy=90.1, loss=0.362]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 16.66batch/s, accuracy=93.9, loss=0.243]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 16.46batch/s, accuracy=96.6, loss=0.156]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 17.32batch/s, accuracy=97.7, loss=0.0988]


Test accuracy 56.5%


In [None]:
####### 100 embedding dimensions, 60 hidden dimensions ##########

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=100,
                   rnn_hidden_dim=60,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

In [None]:
pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)
pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 16.82batch/s, accuracy=53.5, loss=1.57]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 17.20batch/s, accuracy=64, loss=1.22]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 16.95batch/s, accuracy=69.1, loss=1.01]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 17.24batch/s, accuracy=76.3, loss=0.812]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 17.28batch/s, accuracy=82.6, loss=0.622]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 16.94batch/s, accuracy=87.9, loss=0.448]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 17.50batch/s, accuracy=92.2, loss=0.309]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 17.03batch/s, accuracy=96.8, loss=0.198]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 17.23batch/s, accuracy=98.2, loss=0.127]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 17.14batch/s, accuracy=99.2, loss=0.0788]


Test accuracy 72.7%


In [None]:
pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Test accuracy 74.2%


In [None]:
####### 140 embedding dimensions, 140 hidden dimensions ##########

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=140,
                   rnn_hidden_dim=140,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

In [None]:
pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)
pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 17.39batch/s, accuracy=50.2, loss=1.58]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 16.90batch/s, accuracy=64, loss=1.2]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 16.85batch/s, accuracy=70, loss=0.994]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 16.82batch/s, accuracy=77.7, loss=0.804]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 17.07batch/s, accuracy=81.6, loss=0.623]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 17.29batch/s, accuracy=87.6, loss=0.458]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 17.19batch/s, accuracy=91.3, loss=0.329]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 17.38batch/s, accuracy=94.5, loss=0.228]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 17.17batch/s, accuracy=96.6, loss=0.152]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 16.96batch/s, accuracy=97.7, loss=0.0995]


Test accuracy 63.3%


# **BRIEF QUALITATIVE ANALYSIS**

I implemented the function *predict* that outputs tuples of tokens and their respective tags to be able to go through some of the data and look at some real examples in the classification. 

In [None]:
X_lt_test_padded, y_lt_test_padded = processor.pad_and_encode(X_lt_test_alksnis, y_lt_test_alksnis)

pos_tagger.predict(X_lt_test_padded)

[[('<UNK>', 'PUNCT'), ('galvos', 'NOUN'), ('<UNK>', 'NOUN')], [('<UNK>', 'PUNCT'), ('<UNK>', 'PUNCT'), ('<UNK>', 'NOUN'), ('<UNK>', 'NOUN')], [('<UNK>', 'PUNCT'), ('gali', 'VERB'), ('būti', 'AUX'), ('galvos', 'VERB'), ('<UNK>', 'PUNCT'), ('?', 'PUNCT')], [('1', 'NUM'), ('.', 'PUNCT'), ('<UNK>', 'PUNCT'), ('.', 'PUNCT')], [('Tai', 'DET'), ('tokie', 'DET'), ('galvos', 'NOUN'), ('<UNK>', 'PUNCT'), (',', 'PUNCT'), ('kai', 'SCONJ'), ('nėra', 'VERB'), ('akivaizdžių', 'VERB'), ('ligų', 'NOUN'), ('ir', 'CCONJ'), ('<UNK>', 'NOUN'), ('<UNK>', 'NOUN'), (',', 'PUNCT'), ('<UNK>', 'NOUN'), ('juos', 'PRON'), ('sukelti', 'VERB'), ('.', 'PUNCT')], [('Jie', 'PRON'), ('linkę', 'VERB'), ('kartotis', 'VERB'), ('.', 'PUNCT')], [('Šiai', 'NOUN'), ('grupei', 'NOUN'), ('priklauso', 'VERB'), ('trys', 'NUM'), ('rūšys', 'NOUN'), ('–', 'PUNCT'), ('<UNK>', 'PUNCT'), (',', 'PUNCT'), ('<UNK>', 'NOUN'), ('<UNK>', 'NOUN'), ('ir', 'CCONJ'), ('<UNK>', 'NOUN'), ('galvos', 'NOUN'), ('<UNK>', 'NOUN'), ('.', 'PUNCT')], [('2'

As expected, most errors occur with polysemous words such as *galvos* (verb: think; noun: heads). However, the model also mistagged some rather obvious words, such as *kur* and *trys*, which only have one sense (preposition and numeral respectively). 

Overall, I did not manage to find any distinctive pattern linking a particular part of speech to errors in Lithuanian. Something that did somewhat surprise me, though, was how often the `<UNK>` placeholder appeared in the tagged data. I did not get to the bottom of this matter, but my best guess is that every word that did not appear in the training data was assigned the index corresponding to `<UNK>`, which I then decoded manually. 
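
One way to check this guess would be to measure how many test tokens never occur in the training data, since those are exactly the tokens that get mapped to the `<UNK>` index. The sketch below is a minimal illustration on toy data; the function name `oov_rate` and the toy sentences are hypothetical and not part of the original notebook.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training sentences."""
    vocabulary = {token for sentence in train_sentences for token in sentence}
    test_tokens = [token for sentence in test_sentences for token in sentence]
    if not test_tokens:                   # avoid dividing by zero on empty input
        return 0.0
    unseen = sum(token not in vocabulary for token in test_tokens)
    return unseen / len(test_tokens)


# Toy data: 'dog' and 'cats' are unseen, so 2 of the 6 test tokens are out of vocabulary.
toy_train = [['the', 'cat', 'sleeps', '.'], ['dogs', 'bark', '.']]
toy_test = [['the', 'dog', 'sleeps', '.'], ['cats', 'bark']]

toy_rate = oov_rate(toy_train, toy_test)
print(f"Out-of-vocabulary rate: {toy_rate:.1%}")
```

In this notebook, the same function could be called as `oov_rate(X_lt_train_alksnis, X_lt_test_alksnis)`. A high rate would be unsurprising for a morphologically rich language with a small training set, because many inflected forms appear only once.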

# **BASELINE CLASSIFIER**

To evaluate the models objectively, I first had to define a baseline. I chose the DummyClassifier available in scikit-learn, which is designed specifically for this purpose. 

I had to change the data slightly. Tensors are not accepted, so I flattened the data into one-dimensional NumPy arrays.

I used a simple majority baseline, which assigns the single most frequent tag in the training data to every token.

The results of such a tagger are pretty horrible, but they constitute a nice baseline to work from. Note that the score printed below is computed on the training tokens themselves.

In [None]:
from sklearn.dummy import DummyClassifier
import numpy as np

X_lt_train_alksnis, y_lt_train_alksnis = conllu_parser('/content/ud-treebanks-v2.8/UD_Lithuanian-ALKSNIS', 'lt_alksnis-ud-train.conllu')
X_lt_test_alksnis, y_lt_test_alksnis = conllu_parser('/content/ud-treebanks-v2.8/UD_Lithuanian-ALKSNIS', 'lt_alksnis-ud-test.conllu')

flat_x = [item for sublist in X_lt_train_alksnis for item in sublist] # TODO: rename, since this snippet was copied
flat_y = [item for sublist in y_lt_train_alksnis for item in sublist] # TODO: rename, since this snippet was copied

X_np = np.array(flat_x)
y_np = np.array(flat_y)

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_np, y_np)
print(dummy_clf.score(X_np, y_np))

0.31340807928109515
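
For comparison, the same majority baseline can be written in a few lines of plain Python and scored on held-out data rather than on the training tokens. This is only a sketch under that assumption; the function name `majority_baseline` and the toy tag sequences are invented for illustration.

```python
from collections import Counter


def majority_baseline(train_tags, test_tags):
    """Return the most frequent training tag and its accuracy on the test tags."""
    flat_train = [tag for sentence in train_tags for tag in sentence]
    flat_test = [tag for sentence in test_tags for tag in sentence]
    majority_tag, _ = Counter(flat_train).most_common(1)[0]
    correct = sum(tag == majority_tag for tag in flat_test)
    return majority_tag, correct / len(flat_test)


# Toy data: NOUN is the most frequent training tag and covers 2 of 4 test tokens.
toy_train_tags = [['NOUN', 'VERB', 'PUNCT'], ['NOUN', 'NOUN', 'PUNCT']]
toy_test_tags = [['NOUN', 'VERB', 'NOUN', 'PUNCT']]

toy_tag, toy_accuracy = majority_baseline(toy_train_tags, toy_test_tags)
print(toy_tag, toy_accuracy)  # NOUN 0.5
```

In the notebook this could be called as `majority_baseline(y_lt_train_alksnis, y_lt_test_alksnis)`, which would make the baseline directly comparable to the test accuracies reported for the taggers.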


# **EXTENSION 1**

I added the option to use GRU layers in the classifier by simply passing 'GRU' as the *rnn_type* argument of the Tagger class. 

The performance of the GRU and LSTM models does not differ much with identical parameters, as both reach an accuracy of roughly 70% on the Lithuanian ALKSNIS dataset. Nevertheless, the GRU almost always outperformed the LSTM by at least 0.5 percentage points. 

Lastly, it is important to note that a GRU is less computationally demanding. Hence, it allows the use of more hidden and embedding dimensions, which can lead to better results without a significant increase in computational cost. It therefore appears to be the more viable choice overall for the task of PoS tagging.
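
The lower cost of the GRU can be made concrete by counting trainable parameters. The sketch below assumes a single recurrent layer with bias terms, laid out the way PyTorch stores them (one input-to-hidden and one hidden-to-hidden weight matrix plus two bias vectors per gate block); the function name `recurrent_parameters` is a hypothetical helper, not part of PyTorch.

```python
# Number of stacked gate blocks per cell type: a plain RNN has one,
# a GRU has three (reset, update, new), an LSTM has four (input, forget, cell, output).
GATE_BLOCKS = {'RNN': 1, 'GRU': 3, 'LSTM': 4}


def recurrent_parameters(rnn_type, input_size, hidden_size, bidirectional=False):
    """Trainable parameters of one recurrent layer (weights and both bias vectors)."""
    per_block = (input_size * hidden_size      # input-to-hidden weights
                 + hidden_size * hidden_size   # hidden-to-hidden weights
                 + 2 * hidden_size)            # two bias vectors
    directions = 2 if bidirectional else 1     # a bidirectional layer holds a second copy
    return GATE_BLOCKS[rnn_type] * per_block * directions


# 50 embedding dimensions and 80 hidden dimensions, as in the cells of this extension.
gru_count = recurrent_parameters('GRU', 50, 80)
lstm_count = recurrent_parameters('LSTM', 50, 80)
print(gru_count, lstm_count)  # 31680 42240
```

Under these assumptions the GRU layer has three quarters of the parameters of an LSTM layer of the same size, which is consistent with the slightly higher batch throughput seen in the progress bars.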

In [None]:
pos_tagger = Tagger(rnn_type = 'LSTM', word_embedding_dim=50,
                   rnn_hidden_dim=80,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)

print('''

PERFORMANCE WITH *LSTM* ON LITHUANIAN UD:

''')

pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)


Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 16.63batch/s, accuracy=49.9, loss=1.67]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 16.51batch/s, accuracy=61.6, loss=1.32]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 16.61batch/s, accuracy=67.7, loss=1.11]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 16.80batch/s, accuracy=74.2, loss=0.914]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 16.64batch/s, accuracy=79, loss=0.729]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 16.85batch/s, accuracy=83.5, loss=0.548]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 16.71batch/s, accuracy=88.9, loss=0.396]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 16.99batch/s, accuracy=91.9, loss=0.273]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 16.76batch/s, accuracy=96.1, loss=0.181]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 16.96batch/s, accuracy=98.5, loss=0.119]




PERFORMANCE WITH *LSTM* ON LITHUANIAN UD:


Test accuracy 70.1%


In [None]:
pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,
                   rnn_hidden_dim=80,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)

print('''

PERFORMANCE WITH *GRU* ON LITHUANIAN UD:

''')


pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 17.18batch/s, accuracy=48.8, loss=1.64]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 16.96batch/s, accuracy=63.2, loss=1.26]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 17.07batch/s, accuracy=68.3, loss=1.06]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 17.28batch/s, accuracy=74, loss=0.856]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 17.06batch/s, accuracy=80.1, loss=0.649]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 17.03batch/s, accuracy=86.3, loss=0.469]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 17.26batch/s, accuracy=90.8, loss=0.323]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 17.25batch/s, accuracy=95.5, loss=0.21]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 17.58batch/s, accuracy=97.3, loss=0.133]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 16.88batch/s, accuracy=98.1, loss=0.0826]




PERFORMANCE WITH *GRU* ON LITHUANIAN UD:


Test accuracy 71.1%


# **EXTENSION 2**

I added the option to use bidirectional layers with all available types of RNN by setting the *bidirectional_bool* argument of the Tagger class to *True*.

On the training set, this improved the performance of both LSTM and GRU: the models converged faster and reached a higher training accuracy. An interesting observation is that the bidirectional models also overfitted more. As a result, performance on the test set decreased when bidirectional layers were used. 

Higher dimensionality partially solved the issue, yet the best performance was still observed with the unidirectional model. 

In [None]:
pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,
                   rnn_hidden_dim=80,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)

print('''

PERFORMANCE WITH GRU ON LITHUANIAN UD, *UNIDIRECTIONAL ENCODINGS*:

''')


pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 17.31batch/s, accuracy=55.4, loss=1.62]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 17.31batch/s, accuracy=62.7, loss=1.27]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 17.37batch/s, accuracy=69.1, loss=1.05]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 17.49batch/s, accuracy=76.4, loss=0.859]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 17.42batch/s, accuracy=81.7, loss=0.664]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 17.44batch/s, accuracy=85.8, loss=0.488]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 17.39batch/s, accuracy=90, loss=0.342]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 17.09batch/s, accuracy=94.2, loss=0.221]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 17.18batch/s, accuracy=97.7, loss=0.141]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 17.54batch/s, accuracy=98.7, loss=0.087]




PERFORMANCE WITH GRU ON LITHUANIAN UD, *UNIDIRECTIONAL ENCODINGS*:


Test accuracy 74.5%


In [None]:
pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,
                   rnn_hidden_dim=80,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=True,
                  data_storage = processor)

pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)

print('''

PERFORMANCE WITH *GRU* ON LITHUANIAN UD, *BIDIRECTIONAL ENCODINGS*:

''')


pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 14.53batch/s, accuracy=58.2, loss=1.45]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 14.51batch/s, accuracy=67.2, loss=1.07]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 14.67batch/s, accuracy=76.6, loss=0.777]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 13.94batch/s, accuracy=84.3, loss=0.56]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 14.58batch/s, accuracy=89.8, loss=0.382]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 14.44batch/s, accuracy=92.9, loss=0.249]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 14.63batch/s, accuracy=95.6, loss=0.159]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 14.46batch/s, accuracy=98.1, loss=0.091]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 14.52batch/s, accuracy=99.8, loss=0.0497]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 14.47batch/s, accuracy=100, loss=0.0259]




PERFORMANCE WITH *GRU* ON LITHUANIAN UD, *BIDIRECTIONAL ENCODINGS*:


Test accuracy 65.5%


In [None]:
pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=100,
                   rnn_hidden_dim=100,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=True,
                  data_storage = processor)

pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=10)

print('''

PERFORMANCE WITH *GRU* ON LITHUANIAN UD, *BIDIRECTIONAL ENCODINGS*, HIGHER DIMENSIONS:

''')


pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 14.83batch/s, accuracy=58.5, loss=1.49]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 14.49batch/s, accuracy=67.2, loss=1.05]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 14.69batch/s, accuracy=75.3, loss=0.785]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 14.27batch/s, accuracy=83.4, loss=0.579]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 14.94batch/s, accuracy=88.9, loss=0.391]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 14.84batch/s, accuracy=93.2, loss=0.253]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 14.61batch/s, accuracy=97.1, loss=0.155]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 14.41batch/s, accuracy=98.9, loss=0.0923]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 14.85batch/s, accuracy=99.4, loss=0.0506]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 14.47batch/s, accuracy=99.7, loss=0.0279]




PERFORMANCE WITH *GRU* ON LITHUANIAN UD, *BIDIRECTIONAL ENCODINGS*, HIGHER DIMENSIONS:


Test accuracy 71.6%


# **EXTENSION 3**


The Dataset and DataLoader functionality was implemented in a single class, **DataProcessing**, which can be seen in the second section of this notebook, *DATASET and DATALOADER as DataProcessing class*. The class provides two methods used by the PoS tagger: *pad_and_encode* and *batch_iterator*.

Overall, the class made it easier to keep the data somewhat organized.
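
To make the roles of the two helpers concrete, here is a minimal sketch of what padding/encoding and batch iteration might look like. It is not the notebook's actual `DataProcessing` code: the function bodies, the `<PAD>`/`<UNK>` symbols, the `build_index` helper, and the `batch_size` parameter are assumptions for illustration, and plain Python lists stand in for tensors.

```python
# Hypothetical sketch: names mirror the notebook, bodies are assumptions.
PAD, UNK = '<PAD>', '<UNK>'

def build_index(sequences):
    """Map each symbol to an integer id; 0 is padding, 1 is unknown."""
    index = {PAD: 0, UNK: 1}
    for seq in sequences:
        for item in seq:
            index.setdefault(item, len(index))
    return index

def pad_and_encode(sentences, token2idx):
    """Encode tokens as ids and right-pad every sentence to the longest one."""
    max_len = max(len(s) for s in sentences)
    return [
        [token2idx.get(tok, token2idx[UNK]) for tok in s]
        + [token2idx[PAD]] * (max_len - len(s))
        for s in sentences
    ]

def batch_iterator(sentences, tags, batch_size=2):
    """Yield consecutive (sentences, tags) slices of at most batch_size items."""
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size], tags[start:start + batch_size]

X = [['Aš', 'einu', 'namo'], ['Labas'], ['Ji', 'skaito']]
y = [['PRON', 'VERB', 'ADV'], ['INTJ'], ['PRON', 'VERB']]
token2idx = build_index(X)
batches = [pad_and_encode(xb, token2idx) for xb, _ in batch_iterator(X, y)]
```

Padding per batch rather than per corpus keeps short sentences from being padded to the length of the longest sentence in the whole data set.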

# **EXTENSION 4**



I added the option to use a regularizer in the form of weight decay by passing a *regularizer* argument to the *train* function of the Tagger class. According to the literature, this should improve the generalisability of the model; however, this was not the case for my model.

To test this, I had to increase the number of training epochs to at least 100, since the regularizer slowed the convergence of the model considerably. In addition, no value above 0.01 worked in my case: larger values either left the model stuck at the same accuracy and loss, or even increased the loss and thus lowered the accuracy.

As can be seen below, even with 100 epochs the model did not achieve a reasonable result. Nevertheless, the loss was decreasing more or less steadily.

There are two possible explanations. The more likely one is that I implemented the decay incorrectly (or used unsuitable values). The other is that it does work but requires much more time, and possibly more data, to converge. Given enough time to converge, the model could possibly outperform the previously discussed models and parameters.
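
For reference, the effect of weight decay on a single update can be written out by hand. The sketch below assumes plain SGD with an L2 penalty, which is how the `weight_decay` argument of PyTorch optimizers such as `torch.optim.SGD` behaves; whether the *regularizer* argument of the Tagger is wired up this way is an assumption. The function name `sgd_step` and the learning rate are illustrative.

```python
def sgd_step(weights, grads, lr=0.1, weight_decay=0.0):
    """One SGD update with L2 weight decay: w <- w - lr * (g + weight_decay * w)."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]

# With a zero gradient, decay alone shrinks each weight
# by a factor of (1 - lr * weight_decay) per step.
no_decay = sgd_step(weights, zero_grads, weight_decay=0.0)
small_decay = sgd_step(weights, zero_grads, weight_decay=0.01)
large_decay = sgd_step(weights, zero_grads, weight_decay=1.0)
```

The larger the coefficient, the stronger the pull toward zero at every step, which would be consistent with the slow convergence described above.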

In [None]:
pos_tagger = Tagger(rnn_type = 'LSTM', word_embedding_dim=100,
                   rnn_hidden_dim=100,
                   vocabulary_size=len(processor.token2idx),
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=True,
                  data_storage = processor)

pos_tagger.train(X_lt_train_alksnis, y_lt_train_alksnis, no_epochs=100, regularizer=0.01)

print('''

PERFORMANCE WITH BIDIRECTIONAL *LSTM* WITH WEIGHT DECAY ON LITHUANIAN UD:

''')

pos_tagger.test(X_lt_test_alksnis, y_lt_test_alksnis)

Epoch 0: 100%|██████████| 10/10 [00:00<00:00, 14.23batch/s, accuracy=55.6, loss=1.64]
Epoch 1: 100%|██████████| 10/10 [00:00<00:00, 14.48batch/s, accuracy=60.3, loss=1.41]
Epoch 2: 100%|██████████| 10/10 [00:00<00:00, 14.57batch/s, accuracy=60.1, loss=1.41]
Epoch 3: 100%|██████████| 10/10 [00:00<00:00, 14.62batch/s, accuracy=60.9, loss=1.4]
Epoch 4: 100%|██████████| 10/10 [00:00<00:00, 14.73batch/s, accuracy=60.3, loss=1.42]
Epoch 5: 100%|██████████| 10/10 [00:00<00:00, 14.84batch/s, accuracy=60.1, loss=1.39]
Epoch 6: 100%|██████████| 10/10 [00:00<00:00, 14.54batch/s, accuracy=60.3, loss=1.4]
Epoch 7: 100%|██████████| 10/10 [00:00<00:00, 14.64batch/s, accuracy=60.9, loss=1.38]
Epoch 8: 100%|██████████| 10/10 [00:00<00:00, 14.76batch/s, accuracy=61.2, loss=1.38]
Epoch 9: 100%|██████████| 10/10 [00:00<00:00, 14.82batch/s, accuracy=61.2, loss=1.39]
Epoch 10: 100%|██████████| 10/10 [00:00<00:00, 14.60batch/s, accuracy=61.6, loss=1.38]
Epoch 11: 100%|██████████| 10/10 [00:00<00:00, 14.77bat



PERFORMANCE WITH BIDIRECTIONAL *LSTM* WITH WEIGHT DECAY ON LITHUANIAN UD:


Test accuracy 50.4%


# **EXTENSION 5**

In extension 7, I compare the performance of a PoS tagger on Lithuanian with the UD tagset and with the language-specific Jablonskis tagset. A crucial observation is that it took a lot of training data to achieve great results. In fact, I had to use 80% of the whole MATAS corpus, which consists of 5 genres:

*   Documents (14%)
*   Fiction (19%)
*   Periodicals (36%)
*   Scientific texts (24%)
*   Transcripts (7%)

and a total of 144,047 sentences. This is likely due to the morphological richness of the language, which gives rise to a very large number of word forms.

Thus, when using a large corpus with 5 genres, namely the MATAS corpus, the accuracy achieved by the model reaches 95%. However, as seen from some of the examples above, the performance is much worse on a smaller data set such as Alksnis, which is tagged with the UD tagset and consists of 3,643 sentences.
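
One way to quantify this effect is the out-of-vocabulary (OOV) rate: the share of test tokens whose exact word form never occurs in the training data. The sketch below is illustrative only; the function name `oov_rate` and the toy sentences are assumptions, and no figures for the actual corpora are implied.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens whose exact form is absent from the training data."""
    vocab = {tok for sent in train_sentences for tok in sent}
    test_tokens = [tok for sent in test_sentences for tok in sent]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

# Toy example: inflected forms of the same lemma count as different tokens,
# so 'namo' (genitive) is unseen even though 'namas' (nominative) was in training.
train = [['namas', 'yra', 'didelis']]
test = [['namo', 'stogas', 'yra', 'didelis']]
rate = oov_rate(train, test)
```

Calling the same function on the training and test splits of each treebank would give a comparable number per language.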

**HYPOTHESIS**: This is not the case for English, as English is an analytic language. Rather than through inflected word forms, grammatical cases are often expressed by prepositions. Hence, in theory, a model should require less data to achieve great results when it comes to tagging English sentences.

To test that, I decided to use a small UD corpus, namely English ESL, which consists of only 5,124 sentences. To my surprise, the results were very poor. However, I quickly learned that this is due to a problem with the data itself: every word form is replaced with an underscore (_) throughout the corpus. See the **en_esl-ud-train.conllu** file for reference.
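
A quick sanity check of this kind can catch the problem before training. The sketch below assumes sentences are lists of token strings, as the parser in this notebook appears to produce; the function name `placeholder_fraction` is hypothetical.

```python
def placeholder_fraction(sentences, placeholder='_'):
    """Share of tokens that consist only of the placeholder symbol."""
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        return 0.0
    return sum(tok == placeholder for tok in tokens) / len(tokens)

# Toy data: a fully masked corpus versus one with a single stray underscore.
masked = [['_', '_', '_'], ['_', '_']]
normal = [['The', 'cat', 'sleeps'], ['_', 'is', 'rare']]

masked_share = placeholder_fraction(masked)
normal_share = placeholder_fraction(normal)
```

A share close to 1.0 means the tagger sees the same input token everywhere and can only learn tag frequencies and tag order, not lexical information.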

In [None]:
processor = DataProcessing(X_en_train_esl, y_en_train_esl)

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,                                     
                   rnn_hidden_dim=60,                                        
                   vocabulary_size=len(processor.token2idx),                           
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_en_train_esl, y_en_train_esl, no_epochs=10)

pos_tagger.test(X_en_train_esl, y_en_train_esl)

Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 18.76batch/s, accuracy=17.1, loss=2.42]
Epoch 1: 100%|██████████| 17/17 [00:00<00:00, 18.80batch/s, accuracy=16.5, loss=2.4]
Epoch 2: 100%|██████████| 17/17 [00:00<00:00, 19.01batch/s, accuracy=17.2, loss=2.4]
Epoch 3: 100%|██████████| 17/17 [00:00<00:00, 19.39batch/s, accuracy=17.2, loss=2.39]
Epoch 4: 100%|██████████| 17/17 [00:00<00:00, 18.74batch/s, accuracy=17.6, loss=2.39]
Epoch 5: 100%|██████████| 17/17 [00:00<00:00, 18.87batch/s, accuracy=18.1, loss=2.39]
Epoch 6: 100%|██████████| 17/17 [00:00<00:00, 19.23batch/s, accuracy=18, loss=2.39]
Epoch 7: 100%|██████████| 17/17 [00:00<00:00, 18.99batch/s, accuracy=17.8, loss=2.39]
Epoch 8: 100%|██████████| 17/17 [00:00<00:00, 18.98batch/s, accuracy=18, loss=2.39]
Epoch 9: 100%|██████████| 17/17 [00:00<00:00, 19.38batch/s, accuracy=17.8, loss=2.39]


Test accuracy 19.2%


I therefore shifted to another English corpus available on UD, namely English EWT. This corpus contains 16,622 sentences: roughly three times more than the previous one, yet still more than 8 times fewer than MATAS.
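
The size comparison can be checked directly from the sentence counts quoted in this section (the dictionary and variable names are only for illustration):

```python
# Sentence counts as quoted in this notebook.
corpus_sizes = {'English ESL': 5124, 'English EWT': 16622, 'MATAS': 144047}

ewt_vs_esl = corpus_sizes['English EWT'] / corpus_sizes['English ESL']
matas_vs_ewt = corpus_sizes['MATAS'] / corpus_sizes['English EWT']

print(f'EWT is {ewt_vs_esl:.1f}x the size of ESL')
print(f'MATAS is {matas_vs_ewt:.1f}x the size of EWT')
```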

The dataset consists of five genres:



*   weblogs
*   newsgroups
*   emails
*   reviews
*   Yahoo! answers

all of which were used for both training and testing.

As seen from the results below, the corpus supported the hypothesis: the model achieved an accuracy of over 88% with a dataset that is more than 8 times smaller. To the best of my knowledge, this is mainly because a synthetic, morphologically rich language such as Lithuanian requires a lot more data to train a model.

In [None]:
processor = DataProcessing(X_en_train_ewt, y_en_train_ewt)

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,                                     
                   rnn_hidden_dim=60,                                        
                   vocabulary_size=len(processor.token2idx),                           
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_en_train_ewt, y_en_train_ewt, no_epochs=10)

pos_tagger.test(X_en_test_ewt, y_en_test_ewt)

Epoch 0: 100%|██████████| 47/47 [00:02<00:00, 19.13batch/s, accuracy=74.6, loss=0.842]
Epoch 1: 100%|██████████| 47/47 [00:02<00:00, 19.00batch/s, accuracy=84.7, loss=0.493]
Epoch 2: 100%|██████████| 47/47 [00:02<00:00, 19.13batch/s, accuracy=88.6, loss=0.348]
Epoch 3: 100%|██████████| 47/47 [00:02<00:00, 18.99batch/s, accuracy=91.2, loss=0.263]
Epoch 4: 100%|██████████| 47/47 [00:02<00:00, 19.04batch/s, accuracy=92.8, loss=0.211]
Epoch 5: 100%|██████████| 47/47 [00:02<00:00, 18.93batch/s, accuracy=93.8, loss=0.179]
Epoch 6: 100%|██████████| 47/47 [00:02<00:00, 18.86batch/s, accuracy=94.7, loss=0.154]
Epoch 7: 100%|██████████| 47/47 [00:02<00:00, 18.72batch/s, accuracy=95.3, loss=0.135]
Epoch 8: 100%|██████████| 47/47 [00:02<00:00, 18.69batch/s, accuracy=95.9, loss=0.121]
Epoch 9: 100%|██████████| 47/47 [00:02<00:00, 19.04batch/s, accuracy=96.2, loss=0.107]


Test accuracy 88.4%


Lastly, to ensure that this was not simply due to some corpus or language feature, I tested one more analytic language: Swedish. I used the Talbanken UD corpus, which consists of only 6,000 sentences.

The performance of the model on the Swedish dataset was surprising: even with such a small corpus, the result was usually well above 80%. This supports the claim that an analytic language requires less data.

In [None]:
processor = DataProcessing(X_se_train_talbanken, y_se_train_talbanken)

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,                                     
                   rnn_hidden_dim=60,                                        
                   vocabulary_size=len(processor.token2idx),                           
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_se_train_talbanken, y_se_train_talbanken, no_epochs=10)

pos_tagger.test(X_se_test_talbanken, y_se_test_talbanken)

Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 20.26batch/s, accuracy=47.9, loss=1.56]
Epoch 1: 100%|██████████| 17/17 [00:00<00:00, 20.58batch/s, accuracy=65.4, loss=1.03]
Epoch 2: 100%|██████████| 17/17 [00:00<00:00, 20.52batch/s, accuracy=76.1, loss=0.737]
Epoch 3: 100%|██████████| 17/17 [00:00<00:00, 20.69batch/s, accuracy=82.6, loss=0.54]
Epoch 4: 100%|██████████| 17/17 [00:00<00:00, 20.16batch/s, accuracy=87.4, loss=0.399]
Epoch 5: 100%|██████████| 17/17 [00:00<00:00, 20.68batch/s, accuracy=90.4, loss=0.306]
Epoch 6: 100%|██████████| 17/17 [00:00<00:00, 20.65batch/s, accuracy=92.1, loss=0.244]
Epoch 7: 100%|██████████| 17/17 [00:00<00:00, 20.48batch/s, accuracy=93.6, loss=0.195]
Epoch 8: 100%|██████████| 17/17 [00:00<00:00, 20.45batch/s, accuracy=94.7, loss=0.161]
Epoch 9: 100%|██████████| 17/17 [00:00<00:00, 20.69batch/s, accuracy=95.3, loss=0.134]


Test accuracy 82.3%


As the very last experiment, I also tried Polish. I used the LFG corpus consisting of 17,200 sentences. Polish is also a synthetic language; I therefore expected performance very similar to that on Lithuanian.

It was somewhat surprising that the model handled the Polish data really well, achieving an accuracy of over 85%. It should be noted, though, that this corpus is the second largest used in my experiments; hence, the decent performance might be influenced by the amount of data.

In [None]:
processor = DataProcessing(X_pl_train_lfg, y_pl_train_lfg)

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,                                     
                   rnn_hidden_dim=60,                                        
                   vocabulary_size=len(processor.token2idx),                           
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_pl_train_lfg, y_pl_train_lfg, no_epochs=10)

pos_tagger.test(X_pl_test_lfg, y_pl_test_lfg)

Epoch 0: 100%|██████████| 54/54 [00:01<00:00, 38.64batch/s, accuracy=61.9, loss=1.1]
Epoch 1: 100%|██████████| 54/54 [00:01<00:00, 38.75batch/s, accuracy=72.7, loss=0.807]
Epoch 2: 100%|██████████| 54/54 [00:01<00:00, 38.96batch/s, accuracy=80.3, loss=0.565]
Epoch 3: 100%|██████████| 54/54 [00:01<00:00, 38.78batch/s, accuracy=89.8, loss=0.351]
Epoch 4: 100%|██████████| 54/54 [00:01<00:00, 38.93batch/s, accuracy=94.8, loss=0.197]
Epoch 5: 100%|██████████| 54/54 [00:01<00:00, 38.42batch/s, accuracy=97.1, loss=0.109]
Epoch 6: 100%|██████████| 54/54 [00:01<00:00, 39.10batch/s, accuracy=98.4, loss=0.0606]
Epoch 7: 100%|██████████| 54/54 [00:01<00:00, 38.90batch/s, accuracy=99.5, loss=0.0308]
Epoch 8: 100%|██████████| 54/54 [00:01<00:00, 38.52batch/s, accuracy=99.7, loss=0.019]
Epoch 9: 100%|██████████| 54/54 [00:01<00:00, 38.69batch/s, accuracy=99.8, loss=0.017]


Test accuracy 85.7%


# **EXTENSION 6**

Going into this assignment, I had the impression that Lithuanian would have very limited data for the models to train on. Hence, I expected the models to do poorly on Lithuanian data and thought about an interesting way of addressing this, proposed by Yang, Salakhutdinov, and Cohen (2017). The model proposed by the authors has a rather unique feature: it is versatile and therefore able to transfer across domains and even languages. The authors achieve this by sharing the parameters that overlap between the source and target tasks and keeping the rest task-specific.

The authors record an improvement of at least 8% for cross-domain transfers (Penn Treebank PoS task -> Twitter corpus) and even a 3% increase in a cross-language transfer (Spanish -> English). This might not seem like a significant increase, but it is striking that patterns learned from one language can contribute to the performance of the model on another.

None of the models I have implemented or tested is able to transfer across languages. They can arguably be applied across domains, but there is no option to tailor the shared or task-specific features. That is something I would like to implement in the future with the knowledge I gained throughout the course.
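
As a rough illustration of sharing some parameters while keeping others task-specific, the sketch below copies only selected parameter groups from a source model's parameter dictionary into a target model's. This is not the architecture of Yang et al. and not part of the Tagger class; the function `transfer_shared`, the parameter names, the `shared_prefixes` argument, and the dictionary layout (loosely modelled on a PyTorch `state_dict`) are all assumptions.

```python
def transfer_shared(source_params, target_params, shared_prefixes):
    """Return target parameters with shared groups overwritten from the source.

    Parameters whose name starts with one of shared_prefixes are copied from
    the source; everything else (e.g. the output layer) stays task-specific.
    """
    merged = dict(target_params)
    for name, value in source_params.items():
        if name in merged and name.startswith(tuple(shared_prefixes)):
            merged[name] = value
    return merged

# Toy parameter dictionaries; real values would be tensors.
source = {'char_rnn.weight': [0.1, 0.2], 'word_rnn.weight': [0.3], 'output.weight': [0.9]}
target = {'char_rnn.weight': [0.0, 0.0], 'word_rnn.weight': [0.0], 'output.weight': [0.5, 0.5]}

# Share the recurrent encoders, keep the tag-specific output layer.
adapted = transfer_shared(source, target, shared_prefixes=['char_rnn', 'word_rnn'])
```

Keeping the output layer out of the shared set matters when the source and target tagsets differ, since its shape depends on the number of tags.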

**REFERENCE**: Z. Yang, R. Salakhutdinov, and W. W. Cohen. Transfer learning for sequence tagging with hierarchical recurrent networks. ArXiv, abs/1703.06345, 2017.

# **EXTENSION 7**

UD Lithuanian versus Jablonskis Lithuanian tagset

As shown and discussed below, the language-specific tagset outperforms UD, which was against my expectations. I quickly learned, however, that the reason is simply that the model had a much larger training dataset for Jablonskis.

In [None]:
import tarfile
import os
from sklearn.model_selection import train_test_split

os.chdir('/content')
!wget -N https://clarin.vdu.lt/xmlui/bitstream/handle/20.500.11821/33/MATAS-v1.0.zip
!unzip -q 'MATAS-v1.0.zip'

X_MATAS = []
y_MATAS = []

os.chdir('/content/MATAS-v1.0/CONLLU')
for file in os.listdir():
    X_, y_ = conllu_parser('/content/MATAS-v1.0/CONLLU', file)
    X_MATAS.extend(X_)
    y_MATAS.extend(y_)

X_train_MATAS, X_test_MATAS, y_train_MATAS, y_test_MATAS = train_test_split(X_MATAS, y_MATAS, test_size=0.2, random_state=0)



In [None]:
processor = DataProcessing(X_MATAS, y_MATAS)

pos_tagger = Tagger(rnn_type = 'GRU', word_embedding_dim=50,                                       # Dimensionality of the word embedding
                   rnn_hidden_dim=60,                                          # Dimensionality of the hidden state in the GRU
                   vocabulary_size=len(processor.token2idx),                              # The vocabulary includes both the 'padding' and 'unknown' symbols
                   tagset_size=len(processor.tag2idx)-1, bidirectional_bool=False,
                  data_storage = processor)

pos_tagger.train(X_train_MATAS, y_train_MATAS, no_epochs=5)


Epoch 0: 100%|██████████| 448/448 [00:23<00:00, 18.75batch/s, accuracy=92.4, loss=0.272]
Epoch 1: 100%|██████████| 448/448 [00:23<00:00, 19.03batch/s, accuracy=97.2, loss=0.113]
Epoch 2: 100%|██████████| 448/448 [00:23<00:00, 18.98batch/s, accuracy=98.4, loss=0.0657]
Epoch 3: 100%|██████████| 448/448 [00:23<00:00, 19.00batch/s, accuracy=98.9, loss=0.0478]
Epoch 4: 100%|██████████| 448/448 [00:23<00:00, 19.06batch/s, accuracy=99.1, loss=0.0361]


In [None]:
pos_tagger.test(X_test_MATAS, y_test_MATAS)

Test accuracy 95.1%


As seen above, the tagger reaches a really impressive score with the Lithuanian-specific Jablonskis tagset, trained on the MATAS corpus. It should be noted, however, that the difference in the results is strongly affected by the difference in the sizes of the training sets. The training set for Jablonskis is much larger than the one for UD, which is the most likely reason for the better results on the test set.

This also relates to **extension 5**, as it suggests that morphologically rich languages, such as Lithuanian, require much larger training sets to achieve reasonable results in the task of PoS tagging.

In [None]:
print(f'Size of Lithuanian-specific Jablonskis training set: {len(X_train_MATAS)}')
print(f'Size of Lithuanian UD training set: {len(X_lt_train_alksnis)}')

Size of Lithuanian-specific MATAS training set: 114621
Size of Lithuanian UD training set: 2337
