# L4: Part-of-speech tagging

Part-of-speech tagging is the task of labelling the words (tokens) of a sentence with parts-of-speech such as noun, adjective, and verb. In this lab you will implement the simple, autoregressive fixed-window tagger that was presented in Lecture&nbsp;4.2.

In [164]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
print(torch.__version__)

cuda
1.13.1+cu116


## The data set

The data set for the lab is the English Web Treebank from the [Universal Dependencies Project](http://universaldependencies.org), a corpus containing more than 16,000 sentences (254,000&nbsp;tokens) annotated with, among other things, parts-of-speech. The Universal Dependencies Project distributes its data in the [CoNLL-U format](https://universaldependencies.org/format.html), but for this lab we have converted the data into a simpler format: words and their part-of-speech tags are separated by tabs, sentences are separated by empty lines. The code in the next cell defines a container class for data with this format.

In [50]:
class Dataset():

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        tmp = []
        with open(self.filename, 'rt', encoding='utf-8') as lines:
            for line in lines:
                line = line.rstrip()
                if line:
                    tmp.append(tuple(line.split('\t')))
                else:
                    yield tmp
                    tmp = []

We load the training data and the development data for this lab:

In [51]:
train_data = Dataset('train.txt')
dev_data = Dataset('dev.txt')

Both data sets consist of **tagged sentences**. On the Python side of things, a tagged sentence is represented as a list of string pairs, where the first component of each pair represents a word token and the second component represents the word’s tag. The possible tags are listed and exemplified in the [Annotation Guidelines](http://universaldependencies.org/u/pos/all.html) of the Universal Dependencies Project. Run the next code cell to see an example of a tagged sentence.

In [165]:
#[x[0] for x in list(train_data)[42]]
list(train_data)[42]

[('There', 'PRON'),
 ('has', 'AUX'),
 ('been', 'VERB'),
 ('talk', 'NOUN'),
 ('that', 'SCONJ'),
 ('the', 'DET'),
 ('night', 'NOUN'),
 ('curfew', 'NOUN'),
 ('might', 'AUX'),
 ('be', 'AUX'),
 ('implemented', 'VERB'),
 ('again', 'ADV'),
 ('.', 'PUNCT')]

## Tagger interface

The tagger that you will implement in this lab follows a simple interface with just one method:

In [53]:
class Tagger(object):

    def predict(self, sentence):
        raise NotImplementedError

The single method of this interface has the following specification:

**predict** (*self*, *sentence*)

> Returns the list of predicted tags (a list of strings) for a single *sentence* (a list of string tokens).

One trivial implementation of this interface is a tagger that always predicts the same tag for every word, independently of the input:

In [54]:
class ConstantTagger(Tagger):
    
    def __init__(self, the_tag):
        self.the_tag = the_tag
    
    def predict(self, words):
        return [self.the_tag] * len(words)

## Problem 1: Implement an evaluation function

Your first task is to implement a function that computes the accuracy of a tagger on gold-standard data.

In [55]:
def accuracy(tagger, gold_data):
    # TODO: Replace the next line with your own code
    nr_correct = 0
    nr_words = 0

    for sentence in gold_data:
        words = [tokens[0] for tokens in sentence]
        
        nr_words += len(words)

        correct_tags = [tokens[1] for tokens in sentence]
        predicted_tags = tagger.predict(words)

        for i in range(len(words)):
            if predicted_tags[i] == correct_tags[i]:
                nr_correct += 1

    acc = nr_correct / nr_words

    return acc

Your implementation should conform to the following specification:

**accuracy** (*tagger*, *gold_data*)

> Computes the accuracy of the *tagger* on the gold-standard data *gold_data* (an iterable of tagged sentences) and returns it as a float. Recall that the accuracy is defined as the percentage of tokens to which the tagger assigns the correct tag (as per the gold standard).

### 🤞 Test your code

Test your code by computing the accuracy on the development set of a trivial tagger that tags each word as a noun. The expected value is 16.69%.

In [166]:
tagger = ConstantTagger('NOUN')

accuracy(tagger, dev_data)

0.1668919993637665

## Problem 2: Implement a baseline

Before you start working on the tagger as such, we ask you to first implement a simple baseline:

> Tag each input word with the most frequent tag for that word in the training data. If an input word does not occur in the training data, tag it with the overall most frequent tag in the training data. Break ties by choosing that tag which comes first in the alphabetical order.

To implement the baseline, you need to implement both a class `BaselineTagger` and a function `train_baseline`. A `BaselineTagger` has two fields: a dictionary mapping each word in the training data to the most frequent tag for that word, and a string representing the fallback tag (overall most frequent tag in the training data). Both of these fields are set in the `train_baseline` function.

In [97]:
class BaselineTagger(Tagger):

    def __init__(self):
        self.most_frequent = {}
        self.fallback = None

    def predict(self, words):
        # TODO: Replace the next line with your own code
        prediction = []
        for word in words:
            if word in self.most_frequent:
                prediction.append(self.most_frequent[word])
            else:
                prediction.append(self.fallback)
                
        return prediction

def train_baseline(train_data):
    # TODO: Replace the next line with your own code
    temp_frequent = {}
    total_tag_freq = {}

    for sentence in train_data:
        for word, tag in sentence:
            if tag not in total_tag_freq:
                total_tag_freq[tag] = 1
            else:
                total_tag_freq[tag] += 1

            if word not in temp_frequent:
                tag_freq = {}
                tag_freq[tag] = 1
                temp_frequent[word] = tag_freq
            else:
                if tag not in temp_frequent[word]:
                    temp_frequent[word][tag] = 1
                else:
                    temp_frequent[word][tag] += 1
    
    for word in temp_frequent:
        temp_frequent[word] = max(temp_frequent[word], key=temp_frequent[word].get)

    tagger = BaselineTagger()
    tagger.most_frequent = temp_frequent
    tagger.fallback = max(total_tag_freq, key=total_tag_freq.get)
    
    return tagger

### 🤞 Test your code

Test your implementation by computing the accuracy of the baseline tagger on the development data. The expected value is 85.61%.

In [167]:
tagger = train_baseline(train_data)

accuracy(tagger, dev_data)

0.8564100524892636

## Problem 3: Create the vocabularies

As in previous labs, you will need an explicit representation of your vocabulary. Here we actually have two vocabularies: one for the words and one for the tags. Both should be represented as dictionaries that map words/tags to a contiguous range of integers, starting at zero.

The next cell contains skeleton code for a function `make_vocabs` that constructs the two vocabularies from gold-standard data. The code cell also defines a name for the ‘unknown word’ (`UNK`) and for an additional pseudoword that you will use as a placeholder for undefined values (`PAD`).

In [59]:
PAD = '<pad>'
UNK = '<unk>'

def make_vocabs(gold_data):
    # TODO: Replace the next line with your own code
    vocab = {PAD: 0, UNK: 1}
    tags = {PAD: 0}
    for sentence in gold_data:
        for pair in sentence:
            word = pair[0]
            tag = pair[1]
            
            if word not in vocab:
                vocab[word] = len(vocab)
            
            if tag not in tags:
                tags[tag] = len(tags)
                    
    return vocab, tags

Complete the code according to the following specification:

**make_vocabs** (*gold_data*)

> Returns a pair of dictionaries mapping the unique words and tags in the gold-standard data *gold_data* (an iterable over tagged sentences) to contiguous ranges of integers starting at zero. The word dictionary contains the pseudowords `PAD` (index&nbsp;0) and `UNK` (index&nbsp;1); the tag dictionary contains `PAD` (index&nbsp;0).

### 🤞 Test your code

Test your implementation by computing the total number of unique words and tags in the training data (including the pseudowords). The expected values are 19,674&nbsp;words and 18&nbsp;tags.

In [168]:
vocab, tags = make_vocabs(train_data)
print(len(vocab))
print(len(tags))


19674
18


## Problem 4: Fixed-window tagger

Your main task in this lab is to implement a complete, autoregressive part-of-speech tagger based on the fixed-window architecture. This implementation has four parts: the fixed-window model; a tagger that uses the fixed-window model to make predictions; a function that generates training examples for the tagger; and the training function.

**⚠️ We expect that solving this problem will take you the longest time in this lab.**

### Problem 4.1: Implement the fixed-window model

The architecture of the fixed-window model is presented in Lecture&nbsp;4.2. An input to the network takes the form of a $k$-dimensional vector of word ids and/or tag ids. Each integer $i$ is mapped to an $e_i$-dimensional embedding vector. These vectors are concatenated to form a vector of length $e_1 + \cdots + e_k$, and sent through a feed-forward network with a single hidden layer and a rectified linear unit (ReLU).

#### Default features

We ask you to implement a fixed-window model with the following features ($k=4$):

0. current word
1. previous word
2. next word
3. tag predicted for the previous word

Whenever the value of a feature is undefined, you should use the special value `PAD`.

#### Embedding specifications

To make your implementation of the fixed-window model useful for a range of different applications (including the parser that you will build in lab&nbsp;5), it should support other feature sets than the default model. To this end, the constructor of your model should accept a list of what we call *embedding specifications*. An embedding specification is a triple $(m, n, e)$ consisting of three integers. Such a triple specifies that the model should include $m$ instances of an embedding from $n$ items to vectors of size $e$. All of the $m$ instances are to share their weights. In this lab, the embeddings will be embeddings for words and tags. For example, to instantiate the default feature model, you would initialise the model with the following specifications:

``
[(3, num_words, word_dim), (1, num_tags, tag_dim)]
``

This specifies that the model should use 3 instances of an embedding from *num_words* words to vectors of length *word_dim*, and 1 instance of an embedding from *num_tags* tags to vectors of length *tag_dim*. All 3 instances of the word embedding would share their weights. If you rather wanted to have word embeddings with separate weights, you would initialise the model with the following specifications:

``
[(1, num_words, word_dim), (1, num_words, word_dim), (1, num_words, word_dim), (1, num_tags, tag_dim)]
``

We recommend that you initialize the weights of each embedding with values drawn from $\mathcal{N}(0, 10^{-2})$.

#### Hyperparameters

The network architecture introduces a number of hyperparameters. The following choices are reasonable defaults:

* width of each word embedding: 50
* width of each tag embedding: 10
* size of the hidden layer: 100

The next cell contains skeleton code for the implementation of the fixed-window model.

In [169]:
import torch
import torch.nn as nn

class FixedWindowModel(nn.Module):

    def __init__(self, embedding_specs, hidden_dim, output_dim):
        super().__init__()
        # TODO: Add your code here
        # Extract embedding_specs
        emb_spec_words = embedding_specs[0]
        emb_spec_tags = embedding_specs[1]

        n_words = emb_spec_words[0]
        vocab_size = emb_spec_words[1]
        word_dim = emb_spec_words[2]

        n_tags = emb_spec_tags[0]
        tags_size = emb_spec_tags[1]
        tag_dim = emb_spec_tags[2]

        # Create embeddings
        self.embeddings = nn.ModuleDict([
                        ['word_embs', nn.Embedding(vocab_size, word_dim, padding_idx=0)],
                        ['tag_embs', nn.Embedding(tags_size, tag_dim, padding_idx=0)]])

        # Create hidden layers
        self.hidden = nn.Linear(n_words * word_dim + n_tags * tag_dim, hidden_dim) # 3 * 50 + 1 * 10,

        # Create RELU
        self.activation = nn.ReLU()

        # Create output layers
        self.output = nn.Linear(hidden_dim, output_dim)

    def forward(self, features):
        # TODO: Replace the next line with your own code
        batch_size = len(features)
        
        # Extract words and tags 
        words = features[:,:-1]
        tags = features[:,-1]

        # Get the word and tag embeddings
        word_embs = self.embeddings['word_embs'](words) # 3 * 50
        tag_embs = self.embeddings['tag_embs'](tags) # 1 * 10
        
        concat_words = word_embs.view(batch_size, -1)
        
        concat_embs = torch.cat([concat_words, tag_embs], dim=1)

        hidden = self.hidden(concat_embs)

        relu = self.activation(hidden)

        output = self.output(relu)

        return output

Your implementation should meet the following specification:

**__init__** (*self*, *embedding_specs*, *hidden_dim*, *output_dim*)

> A fixed-window model is initialized with a list of specifications for the embeddings the network should use (*embedding_specs*), the size of the hidden layer (*hidden_dim*), and the size of the output layer (*output_dim*).

**forward** (*self*, *features*)

> Computes the network output for a given feature representation *features*. This is a tensor of shape $B \times k$ where $B$ is the batch size (number of samples in the batch) and $k$ is the total number of embeddings specified upon initialisation. For example, for the default feature model, $k=4$, as this model includes 3 (weight-sharing) word embeddings and 1 tag embedding.

#### 💡 Hint on the implementation

You will have to construct embeddings based on the embedding specifications. It is natural to store these embeddings in a list- or dictionary-valued attribute of the `FixedWindowModel` object. However, in order to expose the embeddings to the auto-differentiation magic of PyTorch (so that their weights are updated during training), you must instead store them in an [`nn.ModuleList`](https://pytorch.org/docs/stable/nn.html#torch.nn.ModuleList) or [`nn.ModuleDict`](https://pytorch.org/docs/stable/nn.html#torch.nn.ModuleDict).

In [170]:
vocab_words, vocab_tags =  make_vocabs(train_data)

word_dim=50 
tag_dim=10 
hidden_dim=100

embedding_specs = [(3, len(vocab_words), word_dim), (1, len(vocab_tags), tag_dim)]

fwm_model = FixedWindowModel(embedding_specs, hidden_dim, len(vocab_tags))

features = torch.tensor([[200, 7, 17, 5],
                      [234, 78, 98, 12]])
                      
fwm_model.forward(features).shape

torch.Size([2, 18])

### Problem 4.2: Implement the tagger

The next step is to implement the tagger itself. The tagger will use the simple algorithm that was presented in Lecture&nbsp;4.2: It processes an input sentence from left to right, and at each position, predicts the tag for the current word based on the features extracted from the current feature window.

In [79]:
class FixedWindowTagger(Tagger):

    def __init__(self, vocab_words, vocab_tags, word_dim=50, tag_dim=10, hidden_dim=100):
        embedding_specs = [(3, len(vocab_words), word_dim), (1, len(vocab_tags), tag_dim)]
        self.model = FixedWindowModel(embedding_specs, hidden_dim, len(vocab_tags)).to(device)
        # TODO: Replace the next line with your own code
        self.vocab_words = vocab_words
        self.vocab_tags = vocab_tags

    def featurize(self, words, i, pred_tags):
        # TODO: Replace the next line with your own code
        feature = []
        if len(words) == 1:
            feature = [words[i], 0, 0, 0]

        elif i == 0: # first word
            # Wi, PAD, PAD, PAD
            feature = [words[i], words[i+1], 0, 0]
        elif i == len(words)-1: # last word
            # Wi, Wi+1, PAD, PAD
            feature = [words[i], 0, words[i-1], pred_tags[i-1]]
        else:
            # Wi, Wi+1, Wi-1, Ti-1
            feature = [words[i], words[i+1], words[i-1], pred_tags[i-1]]
        return torch.tensor([feature]).to(device)

    def predict(self, words):
        # TODO: Replace the next line with your own code

        # find word indexes for given words
        words_idxs = []
        for word in words:
            if not word in self.vocab_words:
                words_idxs.append(self.vocab_words[UNK])
            else:
                words_idxs.append(self.vocab_words[word])

        # predict tags
        pred_tags_idxs = [0] * len(words)
        for i in range(0, len(words_idxs)):
            feature = self.featurize(words_idxs, i, pred_tags_idxs)
            pred_tags = self.model.forward(feature)
            # Find tag index with highest probability
            pred_tags_idxs[i] = torch.argmax(pred_tags).item()
        
        # convert tag indexes
        pred_tags = []
        for tag_idx in pred_tags_idxs:
            tag = [k for k, v in self.vocab_tags.items() if v == tag_idx][0]
            pred_tags.append(tag)
        
        return pred_tags

Complete the skeleton code by implementing the methods of this interface:

**__init__** (*self*, *vocab_words*, *vocab_tags*, *word_dim* = 50, *tag_dim* = 10, *hidden_dim* = 100)

> Creates a new fixed-window model of appropriate dimensions and sets up any other data structures that you consider relevant. The parameters *vocab_words* and *vocab_tags* are the word vocabulary and tag vocabulary. The parameters *word_dim* and *tag_dim* specify the embedding width for the word embeddings and tag embeddings.

**featurize** (*self*, *words*, *i*, *pred_tags*)

> Extracts features from the specified tagger configuration according to the default feature model. The configuration is specified in terms of the words in the input sentence (*words*, a list of word ids), the position of the current word (*i*), and the list of already predicted tags (*pred_tags*, a list of tag ids). Returns a tensor that can be fed to the fixed-window model.

**predict** (*self*, *words*)

> Processes the input sentence *words* (a list of string tokens) and makes calls to the fixed-window model to predict the tag of each word. Returns the list of the predicted tags (strings).

In [171]:
vocab_words, vocab_tags =  make_vocabs(train_data)
fwt = FixedWindowTagger(vocab_words, vocab_tags)

sentence = list(train_data)[42]
words = []
tags = []
for word_tag_pair in sentence:
    words.append(word_tag_pair[0])
    tags.append(word_tag_pair[1])

pred = fwt.predict(words)
print('pred',pred)
print('tags', tags)

pred ['INTJ', 'DET', 'X', 'PROPN', 'PRON', 'ADP', 'SCONJ', 'PRON', 'ADP', 'NOUN', 'PRON', 'X', 'PART']
tags ['PRON', 'AUX', 'VERB', 'NOUN', 'SCONJ', 'DET', 'NOUN', 'NOUN', 'AUX', 'AUX', 'VERB', 'ADV', 'PUNCT']


### Problem 4.3: Generate the training examples

Your next task is to implement a function that generates the training examples for the tagger. You will train the tagger as usual, using minibatch training.

In [144]:
def training_examples(vocab_words, vocab_tags, gold_data, tagger, batch_size=100):
    batch = []
    gold_label = []
    sentence_idx = 0
    for sentence in gold_data:
        sentence_idx += 1
        all_words_idx = []
        all_tags_idx = []

        for word, tag in sentence:
            all_words_idx.append(vocab_words[word])
            all_tags_idx.append(vocab_tags[tag])

        for i in range(0, len(all_words_idx)):
            batch.append(tagger.featurize(all_words_idx, i, all_tags_idx))
            gold_label.append(all_tags_idx[i])

            # Yield batch
            if len(batch) == batch_size:
                batch_tensor = torch.Tensor(batch_size, 4).long().to(device)
                bx = torch.cat(batch, out=batch_tensor).to(device)
                by = torch.Tensor(gold_label).long().to(device)
                yield bx, by
                batch = []
                gold_label = []

        # Yield remaining batch
        if sentence_idx == len(list(gold_data))-1:
            remainder = len(batch)
            batch_tensor = torch.Tensor(remainder, 4).long().to(device)
            bx = torch.cat(batch, out=batch_tensor).to(device)
            by = torch.Tensor(gold_label).long().to(device)
            yield bx, by

Your code should comply with the following specification:

**training_examples** (*vocab_words*, *vocab_tags*, *gold_data*, *tagger*, *batch_size* = 100)

> Iterates through the given *gold_data* (an iterable of tagged sentences), encodes it into word ids and tag ids using the specified vocabularies *vocab_words* and *vocab_tags*, and then yields batches of training examples for gradient-based training. Each batch contains *batch_size* examples, except for the last batch, which may contain fewer examples. Each example in the batch is created by a call to the `featurize` function of the *tagger*.

In [118]:
vocab_words, vocab_tags =  make_vocabs(train_data)
fwt = FixedWindowTagger(vocab_words, vocab_tags)

train = training_examples(vocab_words, vocab_tags, train_data, fwt, batch_size=100)
#len(list(train_data))

torch.Size([100, 4])

### Problem 4.4: Training loop

What remains to be done is the implementation of the training loop. This should be a straightforward generalization of the training loops that you have seen so far. Complete the skeleton code in the cell below:

In [119]:
def var_init(model, std=0.01):
    for name, param in model.named_parameters():
        param.data.normal_(mean=0.0, std=std)

In [162]:
import torch.optim as optim
import torch.nn.functional as F
from tqdm import tqdm

def train_fixed_window(train_data, n_epochs=1, batch_size=100, lr=1e-2):
    # TODO: Replace the next line with your own code
    vocab_words, vocab_tags =  make_vocabs(train_data)

    tagger = FixedWindowTagger(vocab_words, vocab_tags)
    
    # Initialize embedding weights
    #var_init(tagger.model)

    # Load word embeddings from GloVe
    glove = torch.load('glove.pt').to(device)
    tagger.model.embeddings['word_embs'] = nn.Embedding.from_pretrained(glove, freeze=False)

    optimizer = optim.Adam(tagger.model.parameters(), lr=lr)

    nr_words = 0

    for sentence in train_data:
        words = [tokens[0] for tokens in sentence]
        nr_words += len(words)

    try:    
        for epoch in range(n_epochs):
            # Begin training
            with tqdm(total=nr_words) as pbar:
                batch = 0
                tagger.model.train()
                for bx, by in training_examples(vocab_words, vocab_tags, train_data, fwt, batch_size):
                    curr_batch_size = len(bx)

                    score = tagger.model.forward(bx)
                    optimizer.zero_grad()
                    loss = F.cross_entropy(score, by)
                    loss.backward()
                    optimizer.step()

                    pbar.set_postfix(loss=(loss.item()), batch=batch+1)
                    pbar.update(curr_batch_size)
                    batch += 1
                
    except KeyboardInterrupt:
        pass
    
    return tagger

Here is the specification of the training function:

**train_fixed_window** (*train_data*, *n_epochs* = 1, *batch_size* = 100, *lr* = 1e-2)

> Trains a fixed-window tagger from a set of training data *train_data* (an iterable over tagged sentences) using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*.

The next code cell trains a tagger and evaluates it on the development data:

In [146]:
tagger = train_fixed_window(train_data)
print('{:.4f}'.format(accuracy(tagger, dev_data)))

100%|█████████▉| 204559/204585 [16:34<00:00, 205.74it/s, batch=2046, loss=0.289]  


0.8789


#### GloVe embedding

In [163]:
tagger_glove = train_fixed_window(train_data)
print('{:.4f}'.format(accuracy(tagger_glove, dev_data)))
print(tagger.model.embeddings['word_embs'])

Init Embedding(19674, 50, padding_idx=0)
Pre Embedding(19674, 50)


100%|█████████▉| 204559/204585 [15:29<00:00, 220.14it/s, batch=2046, loss=0.237] 


0.8854
Embedding(19674, 50, padding_idx=0)


**⚠️ Your submitted notebook must contain output demonstrating at least 88% accuracy on the development set.**

## Problem 5: Pre-trained embeddings (reflection)

Many neural systems for natural language processing use pre-trained word embeddings, either to augment or to replace randomly initialised task-based embeddings. In this problem, you will investigate whether pre-trained embeddings help your part-of-speech tagger.

The file `glove.pt` contains a PyTorch tensor containing 50-dimensional pre-trained word embeddings from the [GloVe project](https://nlp.stanford.edu/projects/glove/). You can load this tensor using the command

```
glove = torch.load('glove.pt')
```

and should be able to use it as a drop-in replacement for your randomly initialized word embeddings, assuming that the words in your vocabulary are numbered in the order in which they are found in the training data. Have a look at the documentation of the class [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) to learn how to do this.

Run experiments to assess the effect that pre-trained embeddings have on (a)&nbsp;the accuracy of the tagger, and (b)&nbsp;the speed of learning, i.e., the number of training examples it takes to reach a certain loss. Document your exploration in a short reflection piece (ca. 150&nbsp;words). Respond to the following prompts:

* How did you integrate the pre-trained embeddings into your system? What did you measure? What results did you get?
* Based on what you know about word embeddings and transfer learning, did you expect your results? How do you explain them?
* What did you learn? How, exactly, did you learn it? Why does this learning matter?

### Reflection
**How did you integrate the pre-trained embeddings into your system? What did you measure? What results did you get?**
* We integrated the pre-trained embeddings upon initializing our model, where we replaced our word embeddings with the GloVe embeddings. We measured the accuracy and runtime of both taggers and received very similar results. 

* Our own embeddings:
`[runtime: 16:34, 205.74it/s, batch=2046, loss=0.289, acc: 0.8789]`

* GloVe embeddings:
`[runtime: 15:29, 220.14it/s, batch=2046, loss=0.237, acc: 0.8854]`


**Based on what you know about word embeddings and transfer learning, did you expect your results? How do you explain them?**

* The results we're somewhat expected, since transfer learning is mainly used to compensate for insuffcient data or computing power. It's not really correleated with improved accuracy, though it might help with the learning speed of the model since it has better intial parameters. This is why our model trained from scratch performs almost identically to the model with the fine-tuned GloVe embeddings.

**What did you learn? How, exactly, did you learn it? Why does this learning matter?**
* Transfer learning: how to use pre-trained embeddings and fine-tuning them according to our needs
* Using vectorisation and avoiding unnecessary looping are important key points during training.
* When creating a list with generator function, it will take much longer time since it will yield all dataset. Instead, use generator function directly. In our training_examples() function, we tried to create list of gold_data and access each sentence by index. But this approach resulted in a very slow runtime, since the entire dataset was unnecessarily copied each time it was accessed.
* The things we learned in this lab will be very important henceforth, especially since our project will utilize parts from both lab 4 and 5.

**🥳 Congratulations on finishing this lab! 🥳**