# L5: Dependency parsing

Dependency parsing is the task of mapping a sentence to a formal representation of its syntactic structure in the form of a dependency tree, which consists of directed arcs between individual words (tokens). In this lab you will implement a dependency parser based on the arc-standard algorithm and the fixed-window model that you implemented in Lab&nbsp;L4.

## The data set

The data set for this lab is the same as for Lab&nbsp;L4: the English Web Treebank from the [Universal Dependencies Project](http://universaldependencies.org). The code below defines an iterable-style dataset for parser data in the [CoNLL-U format](https://universaldependencies.org/format.html) that the project uses to distribute its data.

In [1]:
class Dataset():

    ROOT = ('<root>', '<root>', 0)  # Pseudo-root

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'rt', encoding='utf-8') as lines:
            tmp = [Dataset.ROOT]
            for line in lines:
                if not line.startswith('#'):  # Skip lines with comments
                    line = line.rstrip()
                    if line:
                        columns = line.split('\t')
                        if columns[0].isdigit():  # Skip range tokens
                            tmp.append((columns[1], columns[3], int(columns[6])))
                    else:
                        yield tmp
                        tmp = [Dataset.ROOT]
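To see which columns the dataset class reads, here is a minimal sketch that parses a single token line. The line itself is an illustrative example and is not taken from the treebank files; the column order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) follows the CoNLL-U format.

```python
# Illustrative CoNLL-U token line (not copied from the treebank)
line = "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_"
columns = line.split('\t')

# Columns 1, 3 and 6 hold the word form, the universal POS tag and the head
token = (columns[1], columns[3], int(columns[6]))
print(token)  # ('I', 'PRON', 2)

# Range tokens ("1-2") and empty nodes ("8.1") have non-numeric ids,
# which is why the isdigit() test skips them
skipped = ["1-2".isdigit(), "8.1".isdigit()]
print(skipped)  # [False, False]
```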

We load the training data and the development data:

In [2]:
train_data = Dataset('en_ewt-ud-train-projectivized.conllu')
dev_data = Dataset('en_ewt-ud-dev.conllu')

Both data sets consist of **parsed sentences**. A parsed sentence is represented as a list of triples, where the first component of each triple (a string) represents a word, and the second component (also a string) represents the word’s part-of-speech tag. The third component (an integer) specifies the position of the word’s syntactic head, i.e., its parent in the dependency tree. Run the following code cell to see an example:

In [3]:
example_sentence = list(train_data)[531]

example_sentence

[('<root>', '<root>', 0),
 ('I', 'PRON', 2),
 ('like', 'VERB', 0),
 ('yuor', 'PRON', 4),
 ('blog', 'NOUN', 2),
 ('.', 'PUNCT', 2)]

In this example the head of the pronoun *I* is the word at position&nbsp;2 – the verb *like*. The dependents of *like* are *I* (position&nbsp;1) and the noun *blog* (position&nbsp;4), as well as the final punctuation mark. Note that each sentence starts with the so-called **pseudo-root** (position&nbsp;0). This pseudo-root is a pseudo-word that is guaranteed to be the root of the dependency tree.
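The head positions can be inverted to list the dependents of each word. The following is a minimal sketch on the example sentence; the helper name `dependents_by_head` is hypothetical and not part of the lab code.

```python
from collections import defaultdict

example = [('<root>', '<root>', 0), ('I', 'PRON', 2), ('like', 'VERB', 0),
           ('yuor', 'PRON', 4), ('blog', 'NOUN', 2), ('.', 'PUNCT', 2)]

def dependents_by_head(sentence):
    """Map each head position to the positions of its dependents."""
    result = defaultdict(list)
    for position, (_, _, head) in enumerate(sentence):
        # Skip position 0: the pseudo-root has no real head
        if position > 0:
            result[head].append(position)
    return dict(result)

deps = dependents_by_head(example)
print(deps)  # {2: [1, 4, 5], 0: [2], 4: [3]}
```

The entry for position&nbsp;2 (*like*) lists positions 1, 4 and 5, matching the description above.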

## Parser interface

Like the tagger in the previous lab, the parser that you will implement in this lab follows a simple interface:

In [4]:
class Parser(object):

    def predict(self, words, tags):
        raise NotImplementedError

The single method of this interface has the following specification:

**predict** (*self*, *words*, *tags*)

> Returns the list of predicted heads (a list of integers) for a single sentence, specified in terms of its *words* (a list of strings) and their corresponding *tags* (also a list of strings).

One trivial implementation of this interface is a parser that attaches each (real) word to its preceding word:

In [5]:
class TrivialParser(Parser):

    def predict(self, words, tags):
        return [0] + list(range(len(words)-1))

## Problem 1: Implement an evaluation function

Your first task is to implement a function that computes the **unlabelled attachment score (UAS)** of a parser on gold-standard data.

In [6]:
def uas(parser, gold_data):
    # TODO: Replace the next line with your own code
    nr_correct = 0
    nr_words = 0

    for sentence in gold_data:
        words = [tokens[0] for tokens in sentence]
        tags = [tokens[1] for tokens in sentence]
        # Do not include pseudo-root
        nr_words += (len(words) - 1)

        correct_head = [tokens[2] for tokens in sentence]
        predicted_head = parser.predict(words, tags)

        # skip pseudo-root
        for i in range(1, len(words)):
            if predicted_head[i] == correct_head[i]:
                nr_correct += 1

    acc = nr_correct / nr_words
    return acc

Your implementation should conform to the following specification:

**uas** (*parser*, *gold_data*)

> Computes the unlabelled attachment score of the specified *parser* on the gold-standard data *gold_data* (an iterable of parsed sentences) and returns it as a float. The unlabelled attachment score is the proportion of all tokens to which the parser assigns the correct head (as per the gold standard), so a score of 9.76% is returned as approximately 0.0976. The calculation excludes the pseudo-roots.
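As a small worked example of the definition, the sketch below scores one sentence. The function name `attachment_score` and the predicted heads are illustrative assumptions; the corpus-level score in this lab sums correct heads and tokens over all sentences before dividing.

```python
def attachment_score(gold_heads, predicted_heads):
    """Fraction of non-root tokens whose predicted head matches the gold head."""
    pairs = list(zip(gold_heads, predicted_heads))[1:]  # drop the pseudo-root
    return sum(g == p for g, p in pairs) / len(pairs)

gold = [0, 2, 0, 4, 2, 2]        # heads of the example sentence
predicted = [0, 0, 1, 4, 1, 1]   # illustrative prediction
score = attachment_score(gold, predicted)
print(score)  # 0.2
```

Only the word at position&nbsp;3 receives the correct head, so one out of five real words is counted as correct.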

### 🤞 Test your code

Test your code by computing the unlabelled attachment score for the trivial parser that attaches every word to its preceding word. The expected score on the development set is 9.76%.

In [7]:
uas(TrivialParser(), dev_data)

0.09757843254204938

## Problem 2: Create the vocabularies

The next cell contains skeleton code for a function `make_vocabs` that constructs the two vocabularies of the parser: one for the words and one for the tags. You should be quite familiar with this task by now, and you will be able to re-use your code from Lab&nbsp;L4.

In [8]:
PAD = '<pad>'
UNK = '<unk>'

def make_vocabs(gold_data):
    vocab = {PAD: 0, UNK: 1}
    tags = {PAD: 0}
    for sentence in gold_data:
        for pair in sentence:
            word = pair[0]
            tag = pair[1]
            
            if word not in vocab:
                vocab[word] = len(vocab)
            
            if tag not in tags:
                tags[tag] = len(tags)
                    
    return vocab, tags

Complete the code according to the following specification:

**make_vocabs** (*gold_data*)

> Returns a pair of dictionaries mapping the unique words and tags in the gold-standard data *gold_data* (an iterable over parsed sentences) to contiguous ranges of integers starting at zero. The word dictionary contains the pseudowords `PAD` (index&nbsp;0) and `UNK` (index&nbsp;1); the tag dictionary contains `PAD` (index&nbsp;0).
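The sketch below illustrates the index layout described in the specification and how an unseen word can fall back to `UNK` at lookup time. The helper `build_vocab` is a hypothetical stand-in, not the `make_vocabs` function of the lab.

```python
PAD, UNK = '<pad>', '<unk>'

def build_vocab(items, specials):
    """Assign contiguous ids, reserving the first ids for the pseudowords."""
    vocab = {s: i for i, s in enumerate(specials)}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab

words = ['<root>', 'I', 'like', 'yuor', 'blog', '.']
vocab = build_vocab(words, specials=(PAD, UNK))

# 'cats' is not in the vocabulary and is mapped to the id of UNK
ids = [vocab.get(w, vocab[UNK]) for w in ['I', 'like', 'cats', '.']]
print(ids)  # [3, 4, 1, 7]
```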

### 🤞 Test your code

Test your implementation by computing the total number of unique words and part-of-speech tags in the training data (including the pseudowords and the part-of-speech tag for the pseudoroot). The expected values are 19,676&nbsp;words and 19&nbsp;tags.

In [9]:
vocab, tags = make_vocabs(train_data)
print(len(vocab))
print(len(tags))

19676
19


## Problem 3: Implement the arc-standard algorithm

The parser that you will implement in this lab consists of two parts: a static part that implements the logic of the arc-standard algorithm (presented in Lecture&nbsp;5.2), and a non-static part that contains the learning component – the fixed-window model that you implemented in Lab&nbsp;L4. In this problem you will implement the static part; the learning component is covered in Problem&nbsp;5.

Recall that, in the arc-standard algorithm, the next move (also called ‘transition’) of the parser is predicted based on features extracted from the current parser configuration, with references to the words and part-of-speech tags of the input sentence. On the Python side of things, the words and part-of-speech tags are represented as lists of strings, and a configuration is represented as a triple

$$
(i, \mathit{stack}, \mathit{heads})
$$

where $i$ is an integer specifying the position of the next word in the buffer, $\mathit{stack}$ is a list of integers specifying the positions of the words currently on the stack (with the topmost element last in the list), and $\mathit{heads}$ is a list of integers specifying the positions of the head words. If a word has not yet been assigned a head, its head value is&nbsp;0. To illustrate this representation, the initial configuration for the example sentence above is

$$
(0, [], [0, 0, 0, 0, 0, 0])
$$

and a possible final configuration is

$$
(6, [0], [0, 2, 0, 4, 2, 2])
$$

**Note:** In Lecture&nbsp;5.2, both the buffer and the stack were presented as lists of words. Here we only represent the *stack* as a list (of word positions). To represent the *buffer*, we simply record the position of the next word that has not been processed yet (the integer $i$). This exploits the fact that the buffer (in contrast to the stack) can never grow, but is processed from left to right.
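To make the triple representation concrete, here is a minimal sketch of how the three moves transform a configuration. The function name `apply_move` is hypothetical, and this is only one possible way to write the transitions, not the reference solution.

```python
SH, LA, RA = 0, 1, 2

def apply_move(config, move):
    """Return the configuration reached by taking `move`; the input is left untouched."""
    i, stack, heads = config
    stack, heads = list(stack), list(heads)  # work on copies
    if move == SH:
        # Move the next buffer word onto the stack
        stack.append(i)
        i += 1
    elif move == LA:
        # The second-topmost word becomes a dependent of the topmost word
        dependent = stack.pop(-2)
        heads[dependent] = stack[-1]
    elif move == RA:
        # The topmost word becomes a dependent of the second-topmost word
        dependent = stack.pop()
        heads[dependent] = stack[-1]
    return (i, stack, heads)

config = (0, [], [0] * 6)
for move in [SH, SH, SH, LA]:
    config = apply_move(config, move)
print(config)  # (3, [0, 2], [0, 2, 0, 0, 0, 0])
```

After three shifts and one left arc, the word at position&nbsp;1 has been attached to the word at position&nbsp;2 and removed from the stack.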

The cell below contains a complete skeleton for the logic of the arc-standard algorithm:

In [10]:
class ArcStandardParser(Parser):

    MOVES = tuple(range(3))

    SH, LA, RA = MOVES  # Parser moves are specified as integers.

    @staticmethod
    def initial_config(num_words):
        # TODO: Replace the next line with your own code
        return (0, [], [0] * num_words)

    @staticmethod
    def valid_moves(config):
        # TODO: Replace the next line with your own code
        valid_moves = []
        buffer, stack, heads = config

        if buffer < len(heads):
            valid_moves.append(ArcStandardParser.SH)
        if len(stack) > 2:
            valid_moves.append(ArcStandardParser.LA)
        if len(stack) > 1:
            valid_moves.append(ArcStandardParser.RA)
        return valid_moves

    @staticmethod
    def next_config(config, move):
        buffer, stack, heads = config
        # Work on copies so that the input configuration is not modified
        stack = list(stack)
        heads = list(heads)
        # SHIFT: move the next buffer word onto the stack
        if move == ArcStandardParser.SH:
            stack.append(buffer)
            buffer += 1
        # LEFT ARC: the second-topmost word becomes a dependent of the topmost word
        elif move == ArcStandardParser.LA:
            heads[stack[-2]] = stack[-1]
            del stack[-2]
        # RIGHT ARC: the topmost word becomes a dependent of the second-topmost word
        elif move == ArcStandardParser.RA:
            heads[stack[-1]] = stack[-2]
            stack.pop()

        return (buffer, stack, heads)

    @staticmethod
    def is_final_config(config):
        # TODO: Replace the next line with your own code
        buffer, stack, heads = config
        return buffer == len(heads) and len(stack) == 1 and stack[0] == 0

Your implementation should conform to the following specification:

**initial_config** (*num_words*)

> Returns the initial configuration for a sentence with the specified number of words (*num_words*).

**valid_moves** (*config*)

> Returns the list of valid moves for the specified configuration (*config*).

**next_config** (*config*, *move*)

> Applies the *move* in the specified configuration *config* and returns the new configuration. This must not modify the input configuration.

**is_final_config** (*config*)

> Tests whether *config* is a final configuration.

### 🤞 Test your code

To test your implementation, you can run the code below. The code in this cell creates the initial configuration for the example sentence, simulates a sequence of moves, and then tests that the resulting configuration is the expected final configuration.

In [11]:
moves = [0, 0, 0, 1, 0, 0, 1, 2, 0, 2, 2]    # 0 = SH, 1 = LA, 2 = RA

parser = ArcStandardParser()
config = parser.initial_config(len(example_sentence))
for move in moves:
    assert move in parser.valid_moves(config)
    config = parser.next_config(config, move)
assert parser.is_final_config(config)
assert config == (6, [0], [0, 2, 0, 4, 2, 2])

print('Looks good!')

Looks good!


## Problem 4: Implement the oracle

The learning component of the parser is the next move classifier. To train this classifier, we need training examples of the form $(\mathbf{x}, m)$, where $\mathbf{x}$ is a feature vector extracted from a given parser configuration $c$, and $m$ is the corresponding gold-standard move. To obtain $m$, we need an **oracle**.

Recall that, in the context of transition-based dependency parsing, an oracle is a function that translates a gold-standard dependency tree (here represented as a list of head ids) into a sequence of moves such that, when the parser takes the moves starting from the initial configuration, then it recreates the original dependency tree. Here we ask you to implement the static oracle that was presented in Lecture&nbsp;5.2.

In [12]:
def oracle_moves(gold_heads):
    parser = ArcStandardParser()
    config = parser.initial_config(len(gold_heads))
    SH, LA, RA = parser.MOVES

    # For each word, count how many dependants it still has to collect
    dependants = {}
    for head in gold_heads:
        dependants[head] = dependants.get(head, 0) + 1

    # If we haven't reached our final configuration, keep looking
    while not parser.is_final_config(config):
        # Re-read the configuration in every iteration: next_config returns a new one
        buffer, stack, heads = config

        # SHIFT is the default: it applies if neither LA nor RA is the right move,
        # or if there are fewer than two words on the stack
        move = SH

        if len(stack) >= 2:
            top = stack[-1]
            second_top = stack[-2]

            # LEFT ARC
            # The top of the stack is the gold head of second_top. Since second_top
            # leaves the stack, all of its dependants must have been processed.
            if top == gold_heads[second_top] and dependants.get(second_top, 0) == 0:
                move = LA
                dependants[top] -= 1  # 1 dependant processed

            # RIGHT ARC
            # second_top is the gold head of the top of the stack. Since top
            # leaves the stack, all of its dependants must have been processed.
            elif second_top == gold_heads[top] and dependants.get(top, 0) == 0:
                move = RA
                dependants[second_top] -= 1  # 1 dependant processed

        yield config, move
        config = parser.next_config(config, move)

Your implementation should conform to the following specification:

**oracle_moves** (*gold_heads*)

> Translates a gold-standard head assignment for a single sentence (*gold_heads*) into the corresponding stream of oracle moves. More specifically, this yields pairs $(c, m)$ where $m$ is a move (an integer, as specified in the `ArcStandardParser` interface) and $c$ is the parser configuration in which $m$ was taken.

### 🤞 Test your code

Test your code by running the cell below. This uses your implementation of *oracle_moves* to extract the oracle move sequence from the example sentence and compares it to the gold-standard move sequence *gold_moves*.

In [13]:
gold_heads = [h for w, t, h in example_sentence]
gold_moves = [0, 0, 0, 1, 0, 0, 1, 2, 0, 2, 2]

assert list(m for _, m in oracle_moves(gold_heads)) == gold_moves
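A quick sanity check on any complete move sequence in this setup: every token, including the pseudo-root, is shifted exactly once, and every real word receives its head through exactly one LA or RA move. A sentence with $n$ tokens therefore needs $2n - 1$ moves. The sketch below checks this count for the gold-standard sequence of the example sentence; the variable names are illustrative.

```python
gold_moves = [0, 0, 0, 1, 0, 0, 1, 2, 0, 2, 2]  # 0 = SH, 1 = LA, 2 = RA
num_tokens = 6                                   # five words plus the pseudo-root

num_shifts = gold_moves.count(0)
num_arcs = len(gold_moves) - num_shifts
print(num_shifts, num_arcs, len(gold_moves))  # 6 5 11
```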

## Problem 5: Fixed-window parser

Now it is time to put everything together. For the full implementation of the fixed-window parser, you will need the counterparts of the four parts of the fixed-window tagger from Lab&nbsp;L4: an implementation of the fixed-window model; a parser that uses the fixed-window model to make predictions; a function that generates the training examples for the parser; and the training loop.

### Problem 5.1: Implement the fixed-window model

The fixed-window model for the parser is the same as the fixed-window model for the tagger in Lab&nbsp;L4. You can simply copy your code from that lab.

In [14]:
import torch
import torch.nn as nn

class FixedWindowModel(nn.Module):

    def __init__(self, embedding_specs, hidden_dim, output_dim):
        super().__init__()
        # TODO: Add your code here
        # Extract embedding_specs
        emb_spec_words = embedding_specs[0]
        emb_spec_tags = embedding_specs[1]

        n_words = emb_spec_words[0]
        vocab_size = emb_spec_words[1]
        word_dim = emb_spec_words[2]

        n_tags = emb_spec_tags[0]
        tags_size = emb_spec_tags[1]
        tag_dim = emb_spec_tags[2]

        # Create embeddings
        self.embeddings = nn.ModuleDict([['word_embs', nn.Embedding(vocab_size, word_dim, padding_idx=0)],
                                         ['tag_embs', nn.Embedding(tags_size, tag_dim, padding_idx=0)]])

        # Create hidden layers
        self.hidden = nn.Linear(n_words * word_dim + n_tags * tag_dim, hidden_dim) # 3 * 50 + 3 * 10,

        # Create ReLU
        self.activation = nn.ReLU()

        # Create output layers
        self.output = nn.Linear(hidden_dim, output_dim)

    def forward(self, features):
        # TODO: Replace the next line with your own code
        batch_size = len(features)
        
        # Extract words and tags for buffer 1, stack 1, stack 2
        words, tags = torch.split(features, 3, dim=1)

        # Get the word and tag embeddings
        word_embs = self.embeddings['word_embs'](words) # 3 * 50
        tag_embs = self.embeddings['tag_embs'](tags) # 3 * 10
        
        concat_words = word_embs.view(batch_size, -1)
        concat_tags = tag_embs.view(batch_size, -1)
        
        concat_embs = torch.cat([concat_words, concat_tags], dim=1)

        hidden = self.hidden(concat_embs)

        relu = self.activation(hidden)

        output = self.output(relu)

        return output

### Problem 5.2: Implement the parser

The next step is to implement the parser itself. This parser will use the fixed-window model to predict the next move for a given configuration in the arc-standard algorithm, based on the features extracted from the current feature window.

#### Default feature model

For the parser, we ask you to implement a fixed-window model with the following features ($k=6$):

0. word form of the next word in the buffer
1. word form of the topmost word on the stack
2. word form of the second-topmost word on the stack
3. part-of-speech tag of the next word in the buffer
4. part-of-speech tag of the topmost word on the stack
5. part-of-speech tag of the second-topmost word on the stack

Whenever the value of a feature is undefined, you should use the special value `PAD`.
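The feature model can be sketched on plain lists of ids as follows. This is a minimal illustration under the assumption that `PAD` has id&nbsp;0 in both vocabularies; the function name `window_features` and the ids are hypothetical.

```python
PAD_ID = 0  # assumed id of the PAD pseudoword in both vocabularies

def window_features(word_ids, tag_ids, config):
    """Return the six feature values for a configuration (i, stack, heads)."""
    i, stack, _ = config
    # Positions of: next buffer word, topmost and second-topmost stack word
    positions = [
        i if i < len(word_ids) else None,
        stack[-1] if len(stack) >= 1 else None,
        stack[-2] if len(stack) >= 2 else None,
    ]
    words = [word_ids[p] if p is not None else PAD_ID for p in positions]
    tags = [tag_ids[p] if p is not None else PAD_ID for p in positions]
    return words + tags

word_ids = [2, 3, 4, 5, 6, 7]  # illustrative ids for a six-token sentence
tag_ids = [1, 2, 3, 2, 4, 5]

first = window_features(word_ids, tag_ids, (0, [], [0] * 6))
last = window_features(word_ids, tag_ids, (6, [0, 2], [0] * 6))
print(first)  # [2, 0, 0, 1, 0, 0]
print(last)   # [0, 4, 2, 0, 3, 1]
```

In the initial configuration the stack is empty, so both stack features are padded; once the buffer is exhausted, the buffer features are padded instead.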

#### Hyperparameters

The following choices are reasonable defaults for the hyperparameters of the network architecture used by the parser:

* width of the word embedding: 50
* width of the tag embedding: 10
* size of the hidden layer: 180
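With these defaults and the six features above, the layer sizes work out as in the following sketch. The variable names are illustrative, and the embedding tables are left out of the count because their size depends on the vocabularies.

```python
num_word_features, word_dim = 3, 50
num_tag_features, tag_dim = 3, 10
hidden_dim, num_moves = 180, 3

# Width of the concatenated embedding vector fed to the hidden layer
input_dim = num_word_features * word_dim + num_tag_features * tag_dim

# Weights plus biases of the two linear layers
hidden_params = input_dim * hidden_dim + hidden_dim
output_params = hidden_dim * num_moves + num_moves
print(input_dim, hidden_params, output_params)  # 180 32580 543
```

That the input width equals the default hidden size here is a coincidence of the chosen values.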

In [15]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
print(torch.__version__)

cpu
1.13.1


In [16]:
class FixedWindowParser(ArcStandardParser):

    def __init__(self, vocab_words, vocab_tags, word_dim=50, tag_dim=10, hidden_dim=180):
        # TODO: Add your own code
        num_moves = len(ArcStandardParser.MOVES)
        embedding_specs = [(3, len(vocab_words), word_dim), (3, len(vocab_tags), tag_dim)]
        self.model = FixedWindowModel(embedding_specs, hidden_dim, num_moves).to(device)
        self.vocab_words = vocab_words
        self.vocab_tags = vocab_tags

    def featurize(self, words, tags, config):
        # TODO: Replace the next line with your own code
        buffer, stack, heads = config

        # stack might be empty or not have enough words, set words and tags to PAD
        word_2 = self.vocab_words[PAD]
        tag_2 = self.vocab_tags[PAD]
        word_3 = self.vocab_words[PAD]
        tag_3 = self.vocab_tags[PAD]

        if buffer < len(heads):
            word_1 = words[buffer]
            tag_1 = tags[buffer]
        else:
            word_1 = self.vocab_words[PAD]
            tag_1 = self.vocab_tags[PAD]
        
        if len(stack) >= 2 and len(stack) <= len(words):
            word_2 = words[stack[-1]]
            tag_2 = tags[stack[-1]]
            word_3 = words[stack[-2]]
            tag_3 = tags[stack[-2]]

        elif len(stack) == 1:
            word_2 = words[stack[-1]]
            tag_2 = tags[stack[-1]]

        # next word in buffer, topmost word on stack, 2nd topmost word on stack,
        # tag of next word in buffer, tag of topmost word on stack, tag of 2nd topmost word on stack
        feature = [word_1, word_2, word_3, tag_1, tag_2, tag_3]
        return torch.tensor([feature]).to(device)

    def predict(self, words, tags):
        # TODO: Replace the next line with your own code
        # find word indexes for given words
        words_idxs = []
        for word in words:
            if word in self.vocab_words:
                words_idxs.append(self.vocab_words[word])
            else:
                words_idxs.append(self.vocab_words[UNK])

        # find tag indexes for given tags
        tags_idxs = []
        for tag in tags:
            if tag in self.vocab_tags:
                tags_idxs.append(self.vocab_tags[tag])
            else:
                tags_idxs.append(self.vocab_tags[PAD])

        config = self.initial_config(len(words))

        # No gradients are needed at prediction time
        with torch.no_grad():
            while not self.is_final_config(config):
                valid_moves = self.valid_moves(config)
                feature = self.featurize(words_idxs, tags_idxs, config)
                # Scores for the three moves (SH, LA, RA) in this configuration
                scores = self.model(feature)[0]
                # Customised argmax: pick the valid move with the highest score
                new_move = max(valid_moves, key=lambda move: scores[move].item())
                config = self.next_config(config, new_move)

        return config[2]

Complete the skeleton code by implementing the methods of this interface:

**__init__** (*self*, *vocab_words*, *vocab_tags*, *word_dim* = 50, *tag_dim* = 10, *hidden_dim* = 180)

> Creates a new fixed-window model of appropriate dimensions and sets up any other data structures that you consider relevant. The parameters *vocab_words* and *vocab_tags* are the word vocabulary and tag vocabulary. The parameters *word_dim* and *tag_dim* specify the embedding width for the word embeddings and tag embeddings; *hidden_dim* specifies the size of the hidden layer.

**featurize** (*self*, *words*, *tags*, *config*)

> Extracts features from the specified parser state according to the feature model given above. The state is specified in terms of the words in the input sentence (*words*, a list of word ids), their part-of-speech tags (*tags*, a list of tag ids), and the parser configuration proper (*config*, as specified in Problem&nbsp;3).

**predict** (*self*, *words*, *tags*)

> Predicts the list of all heads for the input sentence. This simulates the arc-standard algorithm, calling the move classifier whenever it needs to take a decision. The input sentence is specified in terms of the list of its words (strings) and the list of its tags (strings). Both of these should include the pseudoroot.

#### 💡 Hint on the implementation

In the *predict* function, you must make sure to only execute valid moves. One simple way to do so is to let the fixed-window model predict scores for all moves, and to implement your own, customised argmax operation to find the *valid* move with the highest score.
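One way to write such a customised argmax is sketched below on a plain list of scores. The function name `best_valid_move` and the score values are illustrative assumptions.

```python
def best_valid_move(scores, valid_moves):
    """Return the valid move with the highest score."""
    if not valid_moves:
        raise ValueError('no valid move in this configuration')
    return max(valid_moves, key=lambda move: scores[move])

scores = [0.1, 2.5, 1.3]  # illustrative scores for SH, LA, RA
chosen = best_valid_move(scores, [0, 2])
print(chosen)  # 2
```

LA has the highest score overall, but it is not valid here, so the best remaining move (RA) is chosen.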

In [17]:
vocab_words, vocab_tags =  make_vocabs(train_data)
fwp = FixedWindowParser(vocab_words, vocab_tags)

words = []
tags = []
heads = []
for word, tag, head in example_sentence:
    words.append(word)
    tags.append(tag)
    heads.append(head)

pred = fwp.predict(words, tags)
print('pred',pred)
print('heads', heads)

pred [0, 0, 1, 4, 1, 1]
heads [0, 2, 0, 4, 2, 2]


### Problem 5.3: Generate the training examples

Your next task is to implement a function that generates the training examples for the parser. You will train as usual, using minibatch training.

In [18]:
def training_examples(vocab_words, vocab_tags, gold_data, parser, batch_size=100):
    batch = []
    moves = []
    for sentence in gold_data:
        # Encode the sentence into word ids, tag ids and gold-standard heads
        all_words_idx = [vocab_words.get(word, vocab_words[UNK]) for word, _, _ in sentence]
        all_tags_idx = [vocab_tags.get(tag, vocab_tags[PAD]) for _, tag, _ in sentence]
        all_heads = [head for _, _, head in sentence]

        # One training example per oracle move
        for c, m in oracle_moves(all_heads):
            batch.append(parser.featurize(all_words_idx, all_tags_idx, c))
            moves.append(m)

            # Yield a full batch
            if len(batch) == batch_size:
                bx = torch.cat(batch).to(device)
                by = torch.tensor(moves, dtype=torch.long, device=device)
                yield bx, by
                batch = []
                moves = []

    # Yield the remaining examples as a final, smaller batch
    if batch:
        bx = torch.cat(batch).to(device)
        by = torch.tensor(moves, dtype=torch.long, device=device)
        yield bx, by


Your code should comply with the following specification:

**training_examples** (*vocab_words*, *vocab_tags*, *gold_data*, *parser*, *batch_size* = 100)

> Iterates through the given *gold_data* (an iterable of parsed sentences), encodes it into word ids and tag ids using the specified vocabularies *vocab_words* and *vocab_tags*, and then yields batches of training examples for gradient-based training. Each batch contains *batch_size* examples, except for the last batch, which may contain fewer examples. Each example in the batch is created by a call to the `featurize` function of the *parser*.
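The batching behaviour in this specification, including the smaller final batch, can be sketched independently of the parser. The generator `batches` below is a hypothetical helper that works on any iterable of examples.

```python
def batches(examples, batch_size):
    """Yield lists of at most batch_size examples; the last one may be smaller."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover examples form the final, smaller batch
        yield batch

sizes = [len(b) for b in batches(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Testing `if batch:` after the loop avoids both dropping the leftover examples and yielding an empty batch when the number of examples is a multiple of the batch size.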

In [19]:
vocab_words, vocab_tags =  make_vocabs(train_data)

fwp = FixedWindowParser(vocab_words, vocab_tags)
train = training_examples(vocab_words, vocab_tags, train_data, fwp, batch_size=100)

x, y = next(train)

print(x.shape)
print(y.shape)

torch.Size([100, 6])
torch.Size([100])


### Problem 5.4: Training loop

The last piece of the puzzle is the training loop. This should be straightforward by now. Complete the skeleton code in the cell below:

In [20]:
import torch.optim as optim
import torch.nn.functional as F
from tqdm import tqdm

def train_fixed_window(train_data, n_epochs=1, batch_size=100, lr=1e-2):
    # TODO: Replace the next line with your own code
    vocab_words, vocab_tags =  make_vocabs(train_data)

    parser = FixedWindowParser(vocab_words, vocab_tags)

    optimizer = optim.Adam(parser.model.parameters(), lr=lr, weight_decay=1e-5)

    nr_examples = 421700  # Hard-coded total, used only for the progress bar

    try:    
        for epoch in range(n_epochs):
            # Begin training
            with tqdm(total=nr_examples) as pbar:
                batch = 1
                train_loss = 0

                parser.model.train()
                for bx, by in training_examples(vocab_words, vocab_tags, train_data, parser, batch_size):
                    curr_batch_size = len(bx)

                    score = parser.model.forward(bx)
                    optimizer.zero_grad()
                    loss = F.cross_entropy(score, by)
                    train_loss += loss.item()
                    loss.backward()
                    optimizer.step()

                    pbar.set_postfix(loss=(train_loss/batch), batch=batch)
                    pbar.update(curr_batch_size)
                    batch += 1
                
    except KeyboardInterrupt:
        pass
    
    return parser

Here is the specification of the training function:

**train_fixed_window** (*train_data*, *n_epochs* = 1, *batch_size* = 100, *lr* = 1e-2)

> Trains a fixed-window parser from a set of training data *train_data* (an iterable over parsed sentences) using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*.

The next code cell trains a parser and evaluates it on the development data (gold-standard part-of-speech tags):

In [22]:
parser = train_fixed_window(train_data, n_epochs=1)
print('{:.4f}'.format(uas(parser, dev_data)))

100%|██████████| 421700/421700 [00:18<00:00, 22619.12it/s, batch=4217, loss=0.278]


0.7025


**⚠️ Your submitted notebook must contain output demonstrating at least 68% UAS on the development set.**

The next code cell trains a parser and evaluates it on the development data (predicted part-of-speech tags):

In [21]:
train_data_retagged = Dataset('en_ewt-ud-train-projectivized-retagged.conllu')
dev_data_retagged = Dataset('en_ewt-ud-dev-retagged.conllu')

parser_retagged = train_fixed_window(train_data_retagged, n_epochs=1)
print('{:.4f}'.format(uas(parser_retagged, dev_data_retagged)))

100%|██████████| 421700/421700 [00:18<00:00, 22586.58it/s, batch=4217, loss=0.303]


0.6548


## Problem 6: Predicted part-of-speech tags (reflection)

The data that you have used in this lab so far contains gold-standard part-of-speech tags, which makes the evaluation of your parser somewhat misleading: In a practical system (including the baseline for the standard project), one does not have access to gold-standard tags; instead one has to first tag the sentences with an automatic part-of-speech tagger.

The lab directory contains the following alternative versions of the two data sets for this lab:

* `en_ewt-ud-train-projectivized-retagged.conllu`
* `en_ewt-ud-dev-retagged.conllu`

In each of them, the gold-standard part-of-speech tags have been replaced by part-of-speech tags automatically predicted by the tagger from Lab&nbsp;L4.

Run an experiment to assess the effect that using predicted part-of-speech tags instead of gold-standard tags has on the unlabelled attachment score of your parser. Document your exploration in a short reflection piece (ca. 150&nbsp;words). Respond to the following prompts:

* How did you set up your experiment? What results did you get?
* Based on what you know about machine learning, did you expect your results? How do you explain them?
* What did you learn? How, exactly, did you learn it? Why does this learning matter?

### Reflection
**How did you set up your experiment? What results did you get?**
* We ran our first experiment on the data set with gold-standard part-of-speech tags, applying L2 regularization through `weight_decay=1e-5` in the Adam optimizer. This setup gave us a much better UAS than running without regularization. For the second experiment, we trained and evaluated the parser on the data set with predicted part-of-speech tags and obtained a lower UAS than in the first experiment.

* With gold-standard part-of-speech tags:
`[runtime: 00:18, 22619.12it/s, batch=4217, loss=0.278, uas: 0.7025]`

* With predicted part-of-speech tags:
`[runtime: 00:18, 22586.58it/s, batch=4217, loss=0.303, uas: 0.6548]`

**Based on what you know about machine learning, did you expect your results? How do you explain them?**
* Yes, we expected the result of the second experiment to be lower than that of the first. The reason lies in the nature of the training process: when the model is trained on gold-standard tags, the predicted dependencies are more accurate than when it is trained on predicted tags. Predicted part-of-speech tags carry over the tagger's existing inaccuracy, which is then reflected in the predicted dependencies.

**What did you learn? How, exactly, did you learn it? Why does this learning matter?**
* In this lab, we learnt how the arc-standard algorithm builds a dependency tree, and how the static oracle derives the gold-standard transition sequence for a tree.
* What we learnt in this lab will be important going forward, especially since our project will use parts from both Lab&nbsp;L4 and Lab&nbsp;L5.

**🥳 Congratulations on finishing the last lab in this course! 🥳**