## The data set

The data set for this lab is the English Web Treebank from the [Universal Dependencies Project](http://universaldependencies.org). The code below defines an iterable-style dataset for parser data in the [CoNLL-U format](https://universaldependencies.org/format.html) that the project uses to distribute its data.

In [166]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
print(torch.__version__)

cpu
1.13.1


In [167]:
class Dataset():

    ROOT = ('<root>', '<root>', 0)  # Pseudo-root

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'rt', encoding='utf-8') as lines:
            tmp = [Dataset.ROOT]
            for line in lines:
                if not line.startswith('#'):  # Skip lines with comments
                    line = line.rstrip()
                    if line:
                        columns = line.split('\t')
                        if columns[0].isdigit():  # Skip range tokens
                            tmp.append((columns[1], columns[3], int(columns[6])))
                    else:
                        yield tmp
                        tmp = [Dataset.ROOT]

We load the training data and the development data:

In [168]:
train_data = Dataset('en_ewt-ud-train-projectivized.conllu')
dev_data = Dataset('en_ewt-ud-dev.conllu')

In [169]:
example_sentence = list(train_data)[1000]
example_sentence

[('<root>', '<root>', 0),
 ('They', 'PRON', 4),
 ('are', 'AUX', 4),
 ('merely', 'ADV', 4),
 ('imposters', 'NOUN', 0),
 ('.', 'PUNCT', 4)]

## Tagger evaluation function

**accuracy** (*tagger*, *gold_data*)

> Computes the accuracy of the *tagger* on the gold-standard data *gold_data* (an iterable of tagged sentences) and returns it as a float. Recall that the accuracy is defined as the percentage of tokens to which the tagger assigns the correct tag (as per the gold standard).

In [170]:
def accuracy(tagger, gold_data):
    nr_correct = 0
    nr_words = 0

    for sentence in gold_data:
        words = [tokens[0] for tokens in sentence]
        
        nr_words += len(words)

        correct_tags = [tokens[1] for tokens in sentence]
        predicted_tags = tagger.predict(words)

        for i in range(len(words)):
            if predicted_tags[i] == correct_tags[i]:
                nr_correct += 1

    acc = nr_correct / nr_words

    return acc

## Create the vocabularies

**make_vocabs** (*gold_data*)

> Returns a pair of dictionaries mapping the unique words and tags in the gold-standard data *gold_data* (an iterable over tagged sentences) to contiguous ranges of integers starting at zero. The word dictionary contains the pseudowords `PAD` (index&nbsp;0) and `UNK` (index&nbsp;1); the tag dictionary contains `PAD` (index&nbsp;0).

In [171]:
PAD = '<pad>'
UNK = '<unk>'

def make_vocabs(gold_data):
    vocab = {PAD: 0, UNK: 1}
    tags = {PAD: 0}
    for sentence in gold_data:
        for pair in sentence:
            word = pair[0]
            tag = pair[1]
            
            if word not in vocab:
                vocab[word] = len(vocab)
            
            if tag not in tags:
                tags[tag] = len(tags)
                    
    return vocab, tags

In [172]:
vocab, tags = make_vocabs(train_data)
print(len(vocab))
print(len(tags))

19676
19


## Fixed-window tagger

An input to the network takes the form of a $k$-dimensional vector of word ids and/or tag ids. Each integer $i$ is mapped to an $e_i$-dimensional embedding vector. These vectors are concatenated to form a vector of length $e_1 + \cdots + e_k$, and sent through a feed-forward network with a single hidden layer and a rectified linear unit (ReLU).

#### Default features

A fixed-window model with the following features ($k=4$):

0. current word
1. previous word
2. next word
3. tag predicted for the previous word

Whenever the value of a feature is undefined, the special value `PAD` is used.

#### Embedding specifications

 An embedding specification is a triple $(m, n, e)$ consisting of three integers. Such a triple specifies that the model should include $m$ instances of an embedding from $n$ items to vectors of size $e$. All of the $m$ instances are to share their weights. The embeddings are embeddings for words and tags. For example, to instantiate the default feature model:

``
[(3, num_words, word_dim), (1, num_tags, tag_dim)]
``

#### Hyperparameters

The network architecture introduces a number of hyperparameters. The following choices are reasonable defaults:

* width of each word embedding: 50
* width of each tag embedding: 10
* size of the hidden layer: 100

**__init__** (*self*, *embedding_specs*, *hidden_dim*, *output_dim*)

> A fixed-window model is initialized with a list of specifications for the embeddings the network should use (*embedding_specs*), the size of the hidden layer (*hidden_dim*), and the size of the output layer (*output_dim*).

**forward** (*self*, *features*)

> Computes the network output for a given feature representation *features*. This is a tensor of shape $B \times k$ where $B$ is the batch size (number of samples in the batch) and $k$ is the total number of embeddings specified upon initialisation. For example, for the default feature model, $k=4$, as this model includes 3 (weight-sharing) word embeddings and 1 tag embedding.

In [173]:
class FixedWindowTaggerModel(nn.Module):

    def __init__(self, embedding_specs, hidden_dim, output_dim):
        super().__init__()
        # Extract embedding_specs
        emb_spec_words = embedding_specs[0]
        emb_spec_tags = embedding_specs[1]

        n_words = emb_spec_words[0]
        vocab_size = emb_spec_words[1]
        word_dim = emb_spec_words[2]

        n_tags = emb_spec_tags[0]
        tags_size = emb_spec_tags[1]
        tag_dim = emb_spec_tags[2]

        # Create embeddings
        self.embeddings = nn.ModuleDict([
                        ['word_embs', nn.Embedding(vocab_size, word_dim, padding_idx=0)],
                        ['tag_embs', nn.Embedding(tags_size, tag_dim, padding_idx=0)]])

        # Create hidden layers
        self.hidden = nn.Linear(n_words * word_dim + n_tags * tag_dim, hidden_dim) # 3 * 50 + 1 * 10,

        # Create RELU
        self.activation = nn.ReLU()

        # Create output layers
        self.output = nn.Linear(hidden_dim, output_dim)

    def forward(self, features):
        batch_size = len(features)
        
        # Extract words and tags 
        words = features[:,:-1]
        tags = features[:,-1]

        # Get the word and tag embeddings
        word_embs = self.embeddings['word_embs'](words) # 3 * 50
        tag_embs = self.embeddings['tag_embs'](tags) # 1 * 10
        
        concat_words = word_embs.view(batch_size, -1)
        
        concat_embs = torch.cat([concat_words, tag_embs], dim=1)

        hidden = self.hidden(concat_embs)

        relu = self.activation(hidden)

        output = self.output(relu)

        return output

## Tagger interface

**predict** (*self*, *sentence*)

> Returns the list of predicted tags (a list of strings) for a single *sentence* (a list of string tokens).

In [174]:
class Tagger(object):

    def predict(self, sentence):
        raise NotImplementedError

## The Tagger

**__init__** (*self*, *vocab_words*, *vocab_tags*, *word_dim* = 50, *tag_dim* = 10, *hidden_dim* = 100)

> Creates a new fixed-window model of appropriate dimensions and sets up any other data structures that you consider relevant. The parameters *vocab_words* and *vocab_tags* are the word vocabulary and tag vocabulary. The parameters *word_dim* and *tag_dim* specify the embedding width for the word embeddings and tag embeddings.

**featurize** (*self*, *words*, *i*, *pred_tags*)

> Extracts features from the specified tagger configuration according to the default feature model. The configuration is specified in terms of the words in the input sentence (*words*, a list of word ids), the position of the current word (*i*), and the list of already predicted tags (*pred_tags*, a list of tag ids). Returns a tensor that can be fed to the fixed-window model.

**predict** (*self*, *words*)

> Processes the input sentence *words* (a list of string tokens) and makes calls to the fixed-window model to predict the tag of each word. Returns the list of the predicted tags (strings).

In [175]:
class FixedWindowTagger(Tagger):

    def __init__(self, vocab_words, vocab_tags, word_dim=50, tag_dim=10, hidden_dim=100):
        embedding_specs = [(3, len(vocab_words), word_dim), (1, len(vocab_tags), tag_dim)]
        self.model = FixedWindowTaggerModel(embedding_specs, hidden_dim, len(vocab_tags)).to(device)
        self.vocab_words = vocab_words
        self.vocab_tags = vocab_tags

    def featurize(self, words, i, pred_tags):
        feature = []
        if len(words) == 1:
            feature = [words[i], 0, 0, 0]

        elif i == 0: # first word
            # Wi, PAD, PAD, PAD
            feature = [words[i], words[i+1], 0, 0]
        elif i == len(words)-1: # last word
            # Wi, Wi+1, PAD, PAD
            feature = [words[i], 0, words[i-1], pred_tags[i-1]]
        else:
            # Wi, Wi+1, Wi-1, Ti-1
            feature = [words[i], words[i+1], words[i-1], pred_tags[i-1]]
        return torch.tensor([feature]).to(device)

    def predict(self, words):
        # find word indexes for given words
        words_idxs = []
        for word in words:
            if not word in self.vocab_words:
                words_idxs.append(self.vocab_words[UNK])
            else:
                words_idxs.append(self.vocab_words[word])

        # predict tags
        pred_tags_idxs = [0] * len(words)
        for i in range(0, len(words_idxs)):
            feature = self.featurize(words_idxs, i, pred_tags_idxs)
            pred_tags = self.model.forward(feature)
            # Find tag index with highest probability
            pred_tags_idxs[i] = torch.argmax(pred_tags).item()
        
        # convert tag indexes
        pred_tags = []
        for tag_idx in pred_tags_idxs:
            tag = [k for k, v in self.vocab_tags.items() if v == tag_idx][0]
            pred_tags.append(tag)
        
        return pred_tags

### Generate the training examples for the Tagger

**training_examples_tagger** (*vocab_words*, *vocab_tags*, *gold_data*, *tagger*, *batch_size* = 100)

> Iterates through the given *gold_data* (an iterable of tagged sentences), encodes it into word ids and tag ids using the specified vocabularies *vocab_words* and *vocab_tags*, and then yields batches of training examples for gradient-based training. Each batch contains *batch_size* examples, except for the last batch, which may contain fewer examples. Each example in the batch is created by a call to the `featurize` function of the *tagger*.

In [176]:
def training_examples_tagger(vocab_words, vocab_tags, gold_data, tagger, batch_size=100):
    batch = []
    gold_label = []
    sentence_idx = 0
    for sentence in gold_data:
        sentence_idx += 1
        all_words_idx = []
        all_tags_idx = []

        for word, tag, _ in sentence:
            all_words_idx.append(vocab_words[word])
            all_tags_idx.append(vocab_tags[tag])

        for i in range(0, len(all_words_idx)):
            batch.append(tagger.featurize(all_words_idx, i, all_tags_idx))
            gold_label.append(all_tags_idx[i])

            # Yield batch
            if len(batch) == batch_size:
                batch_tensor = torch.Tensor(batch_size, 4).long().to(device)
                bx = torch.cat(batch, out=batch_tensor).to(device)
                by = torch.Tensor(gold_label).long().to(device)
                yield bx, by
                batch = []
                gold_label = []

        # Yield remaining batch
        if sentence_idx == len(list(gold_data))-1:
            remainder = len(batch)
            batch_tensor = torch.Tensor(remainder, 4).long().to(device)
            bx = torch.cat(batch, out=batch_tensor).to(device)
            by = torch.Tensor(gold_label).long().to(device)
            yield bx, by

### Training loop for the Tagger

**train_fixed_window_tagger** (*train_data*, *n_epochs* = 1, *batch_size* = 100, *lr* = 1e-2)

> Trains a fixed-window tagger from a set of training data *train_data* (an iterable over tagged sentences) using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*.

In [177]:
def train_fixed_window_tagger(train_data, n_epochs=1, batch_size=100, lr=1e-2):
    vocab_words, vocab_tags =  make_vocabs(train_data)

    tagger = FixedWindowTagger(vocab_words, vocab_tags)
    tagger.model.to(device)
    
    optimizer = optim.Adam(tagger.model.parameters(), lr=lr)

    nr_iterations = 0

    for sentence in train_data:
        words = [tokens[0] for tokens in sentence]
        nr_iterations += len(words)

    try:    
        for epoch in range(n_epochs):
            # Begin training
            with tqdm(total=nr_iterations) as pbar:
                batch = 0
                tagger.model.train()
                for bx, by in training_examples_tagger(vocab_words, vocab_tags, train_data, tagger, batch_size):
                    curr_batch_size = len(bx)

                    score = tagger.model.forward(bx)
                    optimizer.zero_grad()
                    loss = F.cross_entropy(score, by)
                    loss.backward()
                    optimizer.step()

                    pbar.set_postfix(loss=(loss.item()), batch=batch+1)
                    pbar.update(curr_batch_size)
                    batch += 1
                
    except KeyboardInterrupt:
        pass
    
    return tagger

In [178]:
vocab_words, vocab_tags =  make_vocabs(train_data)
tagger = FixedWindowTagger(vocab_words, vocab_tags)
tagger.model = torch.load('tagger_model', map_location=device)
print('{:.4f}'.format(accuracy(tagger, dev_data)))

0.8917


In [179]:
# tagger = train_fixed_window_tagger(train_data)
# vocab_words, vocab_tags =  make_vocabs(train_data)
# tagger = FixedWindowTagger(vocab_words, vocab_tags)
# print('{:.4f}'.format(accuracy(tagger, dev_data)))

## Create predicted part-of-speech tags dataset

Use tagger to create predicted part-of-speech tags dataset for parser!

In [180]:
class TaggedDataset():

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'rt', encoding='utf-8') as lines:
            tmp = []
            for line in lines:
                if not line.startswith('#'):  # Skip lines with comments
                    line = line.rstrip()
                    if line:
                        columns = line.split('\t')
                        if columns[0].isdigit():  # Skip range tokens
                            tmp.append(columns)
                    else:
                        yield tmp
                        tmp = []

In [181]:
with open('en_ewt-ud-train-projectivized-retagged.conllu', 'wt', encoding='utf-8') as target:
    for sentence in TaggedDataset('en_ewt-ud-train-projectivized.conllu'):
        words = [columns[1] for columns in sentence]
        for i, t in enumerate(tagger.predict(words)):
            sentence[i][3] = t
        for columns in sentence:
            print('\t'.join(c for c in columns), file=target)
        print(file=target)

In [182]:
with open('en_ewt-ud-dev-retagged.conllu', 'wt', encoding='utf-8') as target:
    for sentence in TaggedDataset('en_ewt-ud-dev.conllu'):
        words = [columns[1] for columns in sentence]
        for i, t in enumerate(tagger.predict(words)):
            sentence[i][3] = t
        for columns in sentence:
            print('\t'.join(c for c in columns), file=target)
        print(file=target)

In [183]:
train_data_retagged = Dataset('en_ewt-ud-train-projectivized-retagged.conllu')
dev_data_retagged = Dataset('en_ewt-ud-dev-retagged.conllu')

## Parser evaluation function

**uas** (*parser*, *gold_data*)

> Computes the unlabelled attachment score of the specified *parser* on the gold-standard data *gold_data* (an iterable of tagged sentences) and returns it as a float. The unlabelled attachment score is the percentage of all tokens to which the parser assigns the correct head (as per the gold standard). The calculation excludes the pseudo-roots.

In [184]:
def uas(parser, gold_data):
    nr_correct = 0
    nr_words = 0

    for sentence in gold_data:
        words = [tokens[0] for tokens in sentence]
        tags = [tokens[1] for tokens in sentence]
        correct_head = [tokens[2] for tokens in sentence]
        # Do not include pseudo-root
        nr_words += (len(words) - 1)

        
        predicted_head = parser.predict(words, tags)

        # skip pseudo-root
        for i in range(1, len(words)):
            if predicted_head[i] == correct_head[i]:
                nr_correct += 1

    acc = nr_correct / nr_words
    return acc

## Parser interface

In [185]:
class Parser(object):

    def predict(self, words, tags):
        raise NotImplementedError

## The arc-hybrid algorithm

In [186]:
class ArcHybridParser(Parser):

    MOVES = tuple(range(3))

    SH, LA, RA = MOVES  # Parser moves are specified as integers.

    @staticmethod
    def initial_config(num_words):
        return (0, [], [0] * num_words)

    @staticmethod
    def valid_moves(config):
        # TODO: Replace the next line with your own code
        valid_moves = []
        buffer, stack, heads = config

        if buffer < len(heads):
            valid_moves.append(ArcHybridParser.SH)
        if len(stack) > 1 and buffer < len(config[2]):
            valid_moves.append(ArcHybridParser.LA)
        if len(stack) > 1:
            valid_moves.append(ArcHybridParser.RA)
        return valid_moves

    @staticmethod
    def next_config(config, move):
        buffer, stack, heads = config
        # SHIFT
        if move == ArcHybridParser.SH:
            stack.append(buffer)
            buffer += 1
        # LEFT ARC
        elif move == ArcHybridParser.LA:
            heads[stack[-1]] = buffer
            stack = stack[:-1]
        # RIGHT ARC
        elif move == ArcHybridParser.RA:
            heads[stack[-1]] = stack[-2]
            stack = stack[:-1]
            
        return (buffer, stack, heads)

    @staticmethod
    def is_final_config(config):
        buffer, stack, heads = config
        return buffer == len(heads) and len(stack) == 1 and stack[0] == 0
    
    # Buffer = 0
    # stack = 1
    # heads = 2
    @staticmethod
    def zero_cost_shift(current_config, gold_config):
        if current_config[0] == len(current_config[2]):
            return False
        if len(current_config[1]) == 0:
            return True
        item = current_config[0]

        # SH 1
        if item in gold_config:
            for d in current_config[1][0:-1]:
                if item == gold_config[d]:
                    return False

        # SH 2
        for h in current_config[1][0:-1]:
            if h == gold_config[item]:
                return False

        return True
    
    # Buffer = 0
    # stack = 1
    # heads = 2
    @staticmethod
    def zero_cost_la(current_config, gold_config):
        s0 = current_config[1][-1]
        if len(current_config[1]) < 2:
            s1 = None
        else:
            s1 = current_config[1][-2]

        # LA 1
        for buffer_item in range(current_config[0],len(current_config[2])):
            if s0 == gold_config[buffer_item]:
                return False 
    
        # LA 2
        if s1 == gold_config[s0]:
            return False
         
        # LA 3
        for buffer_item in range(current_config[0]+1,len(current_config[2])):
            if buffer_item == gold_config[s0]:
                return False
        return True


    # Buffer = 0
    # stack = 1
    # heads = 2
    @staticmethod
    def zero_cost_ra(current_config, gold_config):
        s0 = current_config[1][-1]

        # RA 1
        for buffer_item in range(current_config[0],len(current_config[2])):
            if s0 == gold_config[buffer_item]:
                return False

        # RA 2
        for buffer_item in range(current_config[0],len(current_config[2])):
            if buffer_item == gold_config[s0]:
                return False
        return True

## Test zero cost functions

In [187]:
parser = ArcHybridParser()

gold_config = [0, 2, 0, 5, 2, 2]
sh_cost_1 = (5, [0, 2, 3, 4], [0, 2, 0, 0, 0, 0]) # stack[..., d(3), ...] buffer[h(5), ...]
assert not parser.zero_cost_shift(sh_cost_1, gold_config)

gold_config = [0, 2, 0, 4, 2, 2]
sh_cost_2 = (5, [0, 2, 4], [0, 2, 0, 4, 0, 0]) # stack[..., h(2), ...] buffer[d(5), ...]
assert not parser.zero_cost_shift(sh_cost_2, gold_config)

gold_config = [0, 2, 0, 4, 2, 2]
la_cost_1 = (4, [0, 2], [0, 2, 0, 0, 0, 0]) # stack[..., h(2)] buffer[b, ..., d(5)]
assert not parser.zero_cost_la(la_cost_1, gold_config)

gold_config = [0, 2, 0, 4, 2, 2]
la_cost_2 = (5, [0, 2, 4], [0, 2, 0, 4, 0, 0]) # stack[..., h(2), d(4)] buffer[b, ...]
assert not parser.zero_cost_la(la_cost_2, gold_config)

gold_config = [0, 2, 0, 5, 2, 2]
la_cost_3 = (4, [0, 2, 3], [0, 2, 0, 0, 0, 0]) # stack[..., d(3)] buffer[b, ..., h(5)]
assert not parser.zero_cost_la(la_cost_3, gold_config)

gold_config = [0, 2, 0, 5, 2, 2]
ra_cost_1 = (3, [0, 2], [0, 2, 0, 0, 0, 0]) # stack[..., h(2)] buffer[..., d(4)]
assert not parser.zero_cost_ra(ra_cost_1, gold_config)

gold_config = [0, 2, 0, 5, 2, 2]
ra_cost_2 = (4, [0, 2, 3], [0, 2, 0, 0, 0, 0]) # stack[..., d(3)] buffer[..., h(5)]
assert not parser.zero_cost_ra(ra_cost_2, gold_config)


## The dynamic oracle

In [188]:
SHIFT, LA, RA = 0,1,2
def dynamic_oracle(gold_config, current_config, legal_transition,parser):
    moves = []
    if SHIFT in legal_transition and parser.zero_cost_shift(current_config,gold_config):
        moves.append(SHIFT)
    if LA in legal_transition and parser.zero_cost_la(current_config,gold_config):
        moves.append(LA)
    if RA in legal_transition and parser.zero_cost_ra(current_config,gold_config):
        moves.append(RA)
    return moves

## Fixed-window parser

In [189]:
class FixedWindowParserModel(nn.Module):

    def __init__(self, embedding_specs, hidden_dim, output_dim):
        super().__init__()
        # Extract embedding_specs
        emb_spec_words = embedding_specs[0]
        emb_spec_tags = embedding_specs[1]

        n_words = emb_spec_words[0]
        vocab_size = emb_spec_words[1]
        word_dim = emb_spec_words[2]

        n_tags = emb_spec_tags[0]
        tags_size = emb_spec_tags[1]
        tag_dim = emb_spec_tags[2]

        # Create embeddings
        self.embeddings = nn.ModuleDict([['word_embs', nn.Embedding(vocab_size, word_dim, padding_idx=0)],
                                         ['tag_embs', nn.Embedding(tags_size, tag_dim, padding_idx=0)]])

        # Create hidden layers
        self.hidden = nn.Linear(n_words * word_dim + n_tags * tag_dim, hidden_dim) # 12 * 50 + 12 * 10,

        # Create ReLU
        self.activation = nn.ReLU()

        # Create output layers
        self.output = nn.Linear(hidden_dim, output_dim)

    def forward(self, features):
        batch_size = len(features)
        
        # Extract words and tags
        words, tags = torch.split(features, 12, dim=1)
        
        # Get the word and tag embeddings
        word_embs = self.embeddings['word_embs'](words) # 12 * 50
        tag_embs = self.embeddings['tag_embs'](tags) # 12 * 10
        
        concat_words = word_embs.view(batch_size, -1)
        concat_tags = tag_embs.view(batch_size, -1)
        
        concat_embs = torch.cat([concat_words, concat_tags], dim=1)

        hidden = self.hidden(concat_embs)

        relu = self.activation(hidden)

        output = self.output(relu)

        return output

### The parser

In [190]:
class FixedWindowParser(ArcHybridParser):

    def __init__(self, vocab_words, vocab_tags, word_dim=50, tag_dim=10, hidden_dim=180):
        num_moves = len(ArcHybridParser.MOVES)
        embedding_specs = [(12, len(vocab_words), word_dim), (12, len(vocab_tags), tag_dim)]
        self.model = FixedWindowParserModel(embedding_specs, hidden_dim, num_moves).to(device)
        self.vocab_words = vocab_words
        self.vocab_tags = vocab_tags

    def featurize(self, words, tags, gold_heads, config):
        buffer, stack, heads = config
    
        s0_w = self.vocab_words[PAD]
        s0_t = self.vocab_tags[PAD]
        s1_w = self.vocab_words[PAD]
        s1_t = self.vocab_tags[PAD]
        s2_w = self.vocab_words[PAD]
        s2_t = self.vocab_tags[PAD]

        b0_w = self.vocab_words[PAD]
        b0_t = self.vocab_tags[PAD]
        b1_w = self.vocab_words[PAD]
        b1_t = self.vocab_tags[PAD]
        b2_w = self.vocab_words[PAD]
        b2_t = self.vocab_tags[PAD]

        if buffer < len(heads):
            b0_w = words[buffer]
            b0_t = tags[buffer]
            if buffer + 1 < len(heads):
                b1_w = words[buffer + 1]
                b1_t = tags[buffer + 1]
                if buffer + 2 < len(heads):
                    b2_w = words[buffer + 2]
                    b2_t = tags[buffer + 2]
        
        if len(stack) >= 1:
            s0_w = words[stack[-1]]
            s0_t = tags[stack[-1]]
            if len(stack) >= 2:
                s1_w = words[stack[-2]]
                s1_t = tags[stack[-2]]
                if len(stack) >= 3:
                    s2_w = words[stack[-3]]
                    s2_t = tags[stack[-3]]
        
        s0_b1_w = self.vocab_words[PAD]
        s0_b2_w = self.vocab_words[PAD]
        s0_b1_t = self.vocab_tags[PAD]
        s0_b2_t = self.vocab_tags[PAD]
        for idx, head in enumerate(gold_heads[0:s0_w]):
            if head == s0_w and s0_b1_w == self.vocab_tags[PAD]:
                s0_b1_w = words[idx]
                s0_b1_t = tags[idx]
            if head == s0_w and s0_b2_w == self.vocab_tags[PAD]:
                s0_b2_w = words[idx]
                s0_b2_t = tags[idx]


        s0_f1_w = self.vocab_words[PAD]
        s0_f2_w = self.vocab_words[PAD]
        s0_f1_t = self.vocab_tags[PAD]
        s0_f2_t = self.vocab_tags[PAD]
        if len(stack) >= 1:
            for idx, head in enumerate(gold_heads[s0_w:]):
                if head == s0_w and s0_f1_w == self.vocab_tags[PAD]:
                    s0_f1_w = words[idx]
                    s0_f1_t = tags[idx]
                if head == s0_w and s0_f2_w == self.vocab_tags[PAD]:
                    s0_f2_w = tags[idx]
                    s0_f2_t = tags[idx]


        n0_b1_w = self.vocab_words[PAD]
        n0_b2_w = self.vocab_words[PAD]
        n0_b1_t = self.vocab_tags[PAD]
        n0_b2_t = self.vocab_tags[PAD]
        for idx, head in enumerate(gold_heads[0:b0_w]):
            if head == b0_w and n0_b1_w == self.vocab_tags[PAD]:
                n0_b1_w = words[idx]
                n0_b1_t = tags[idx]
            if head == b0_w and n0_b2_w == self.vocab_tags[PAD]:
                n0_b2_w = words[idx]
                n0_b2_t = tags[idx]


        feature = [b0_w, b1_w, b2_w, s0_w, s1_w, s2_w,
                   s0_b1_w, s0_b2_w, s0_f1_w, s0_f2_w, n0_b1_w, n0_b2_w,
                   b0_t, b1_t, b2_t, s0_t, s1_t, s2_t,
                   s0_b1_t, s0_b2_t, s0_f1_t, s0_f2_t, n0_b1_t, n0_b2_t]
        return torch.tensor([feature]).to(device)

    def predict(self, words, tags):
        # find word indexes for given words
        words_idxs = []
        for word in words:
            if word in self.vocab_words:
                words_idxs.append(self.vocab_words[word])
            else:
                words_idxs.append(self.vocab_words[UNK])

        # find tag indexes for given tags
        tags_idxs = []
        for tag in tags:
            if tag in self.vocab_tags:
                tags_idxs.append(self.vocab_tags[tag])
            else:
                tags_idxs.append(self.vocab_tags[PAD])

        config = self.initial_config(len(words))

        while not self.is_final_config(config):
            valid_moves = self.valid_moves(config)
            feature = self.featurize(words_idxs, tags_idxs, list(config[2]), config)
            pred_moves = self.model.forward(feature)
            _, sorted_indexes = torch.sort(pred_moves, descending=True)
            # find valid move with highest score (SH, LA, RA)
            if len(valid_moves) > 0:
                sorted_move_list = sorted_indexes.tolist()[0]
                # choose first valid move as default move
                new_move = valid_moves[0]
                for move in sorted_move_list:
                    if move in valid_moves:
                        new_move = move
                        break
                config = self.next_config(config, new_move)

        return config[2]

### Training loop for the Parser

**train_fixed_window_parser** (*train_data*, *n_epochs* = 1, *batch_size* = 100, *lr* = 1e-2)

> Trains a fixed-window parser from a set of training data *train_data* (an iterable over parsed sentences) using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*.

In [191]:
def find_highest_move(scores, legal_transitions):
    _, sorted_indexes = torch.sort(scores, descending=True)
    # find valid move with highest score (SH, LA, RA)
    if len(legal_transitions) > 0:
        sorted_move_list = sorted_indexes.tolist()[0]
        # choose first valid move as default move
        t_p = legal_transitions[0]
        for move in sorted_move_list:
            if move in legal_transitions:
                return t_p

In [192]:
import random
def train_fixed_window_parser(train_data, n_epochs=1, batch_size=100, lr=1e-2):
    vocab_words, vocab_tags =  make_vocabs(train_data)

    parser = FixedWindowParser(vocab_words, vocab_tags)
    arc_parser = ArcHybridParser()

    optimizer = optim.Adam(parser.model.parameters(), lr=lr, weight_decay=1e-5)

    nr_iterations = 0

    for sentence in train_data:
        nr_iterations += 1

    try:    
        for epoch in range(n_epochs):
            # Begin training
            with tqdm(total=nr_iterations) as pbar:
                total_moves = 1
                train_loss = 0
                batch_loss = []
                batch_iter = 0

                parser.model.train()
                for sentence in train_data:
                    all_words_idx = []
                    all_tags_idx = []
                    all_heads = []

                    for word, tag, head in sentence:
                        all_words_idx.append(vocab_words[word])
                        all_tags_idx.append(vocab_tags[tag])
                        all_heads.append(head)
                    
                    config = arc_parser.initial_config(len(all_heads))

                    while not arc_parser.is_final_config(config):
                        # Get all legal moves
                        legal_transitions = arc_parser.valid_moves(config)

                        # Compute which move should be taken next
                        scores = parser.model.forward(parser.featurize(all_words_idx, all_tags_idx, list(config[2]), config)).to(device)

                        # Get legal move with highest probability
                        t_p = find_highest_move(scores, legal_transitions)

                        # Extract scores to list
                        scores_list = scores.tolist()[0]

                        # Compute which moves are zero cost
                        zero_cost_moves = dynamic_oracle(all_heads, config, legal_transitions, arc_parser)
                        
                        # Get the best legal zero cost move
                        t_o = max(zero_cost_moves, key=lambda p: scores_list[p])
                        # Target vector    
                        y = torch.tensor([t_o]).long().to(device)

                        loss = F.cross_entropy(scores, y)
                        batch_loss.append(loss)
                        train_loss += loss.item()

                        # If predicted transition is not in the zero cost moves, update weights.
                        if t_p not in zero_cost_moves:
                            # choose random transition from zero_cost. Might be bad move but such is life.
                            config = parser.next_config(config, random.choice(zero_cost_moves))
                        else:
                            config = parser.next_config(config, t_p)

                        pbar.set_postfix(loss=(train_loss/total_moves), configs=total_moves)
                        total_moves += 1
                        batch_iter += 1

                        # Update the parameters
                        if len(batch_loss) > 0 and batch_iter >= batch_size:
                            optimizer.zero_grad()
                            loss = sum(batch_loss)
                            loss.backward()
                            optimizer.step()
                            batch_loss = []
                            batch_iter = 0
                    
                    #if batch_iter == 10:
                    #    break
                    
                    pbar.update(1)
                
    except KeyboardInterrupt:
        pass
    
    return parser

In [193]:
parser = train_fixed_window_parser(train_data_retagged, n_epochs=1)

100%|██████████| 12544/12544 [11:16<00:00, 18.53it/s, configs=421696, loss=0.267]


In [194]:
print('{:.4f}'.format(uas(parser, dev_data_retagged)))

0.6717
