# Neural architectures for POS tagging
###### Campardo, Chinellato, Fanti, Longhi
##### NLP Course, AI Master's Degree, University of Bologna

In this report we apply neural architectures, trained on the Penn Treebank dataset, to POS tagging. In particular, we show that recurrent models reach an average F1-score of 0.8 and an accuracy of 0.93 on this dataset.

In [1]:
import re
import os
import math
from collections import defaultdict
from functools import reduce
import random

import numpy as np
import torch

try:
    import wandb
except ImportError:
    wandb = None
from torch import nn
from torchinfo import summary
from pprint import pprint
from sklearn.metrics import f1_score

import config as cfg

In [None]:
# Just some utility functions and constants
PAD_TOKEN = 400000

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

classes = {'$', 'NN', ',', 'RBS', 'FW', 'CC', '#', 'VBD', 'PRP', 'RBR', 'LS', ':', 'VBZ', 'MD',
           'EX', 'RB', 'WRB', 'NNS', 'VBG', 'PRP$', 'JJR', 'WP$', 'WP', '-LRB-', 'WDT', '``',
           '.', 'CD', 'JJ', "''", 'UH', 'VBN', 'IN', 'SYM', 'DT', 'JJS', '-RRB-', 'RP', 'VB',
           'POS', 'NNP', 'PDT', 'NNPS', 'VBP', 'TO', '<PAD>'}
punctuation_cls = {'$', ',', '#', ':', '-LRB-', '``', '.', "''", 'SYM', '-RRB-', '<PAD>'}
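# sort the classes so the mapping is deterministic: set iteration order can change between Python sessions
class2idx = {c: i for i, c in enumerate(sorted(classes))}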


def download_and_unzip(url, save_dir='.'):
    # downloads and unzips url, if not already downloaded
    # used for downloading dataset and glove embeddings
    import os
    from urllib.request import urlopen
    from io import BytesIO
    from zipfile import ZipFile
    fname = url.split('/')[-1][:-4] if save_dir == '.' else save_dir
    if fname not in os.listdir():
        print(f'downloading and unzipping {fname}...', end=' ')
        r = urlopen(url)
        zipf = ZipFile(BytesIO(r.read()))
        zipf.extractall(path=save_dir)
        print(f'completed')

def get_wandbkey():
    with open('wandbkey.txt') as f:
        return f.read().strip()

## Word embeddings
We first define functions for loading GloVe embeddings and for computing embeddings of OOV terms from their context: given an OOV word $w$, we define its embedding as the average embedding of the words appearing in the same contexts (sequences) as $w$.

In [2]:
def get_glove(emb_size=100, number_token=False):
  """
    Download and load glove embeddings. 
    Parameters:
      emb_size: embedding size (50/100/200/300-dimensional vectors).    
    Returns tuple (voc, emb) where voc is dict from words to idx (in emb) and emb is (numpy) embedding matrix
  """
  n_tokens = 400000 + 1 # glove vocabulary size + PAD
  if emb_size not in (50, 100, 200, 300):
    raise ValueError(f'wrong size parameter: {emb_size}')
  
  if number_token: 
    n_tokens += 1
  download_and_unzip('http://nlp.stanford.edu/data/glove.6B.zip', save_dir='glove')
  vocabulary = dict()
  embedding_matrix = np.ones((n_tokens, emb_size))

  with open(f'glove/glove.6B.{emb_size}d.txt', encoding="utf8") as f:
    for i, line in enumerate(f):
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embedding_matrix[i] = coefs
        vocabulary[word] = i
  
  # add embeddings for the padding and number tokens
  if number_token:
    embedding_matrix[n_tokens - 2] = 0
    vocabulary['<PAD>'] = n_tokens - 2
    digits = list(filter(lambda s: re.fullmatch(r'\d+(\.\d*)?', s) is not None, vocabulary.keys()))
    embedding_matrix[n_tokens - 1] = np.mean(embedding_matrix[[vocabulary[d] for d in digits]], axis=0)
    vocabulary['<NUM>'] = n_tokens - 1
  else: 
    embedding_matrix[n_tokens - 1] = 0
    vocabulary['<PAD>'] = n_tokens - 1
  return vocabulary, embedding_matrix

def add_oov(start_voc, oovs, embedding_matrix, sentences, verbose=True):
  """
    Computes new embedding matrix, adding embeddings for oovs
    Parameters:
      start_voc: dict, starting vocabulary that is extended with oovs
      oovs: set of string, oovs to add to the starting vocabulary and embedding matrix
      embedding_matrix: starting embedding matrix (numpy)
      sentences: list of list of strings, set used to compute oov embeddings
    Returns tuple (voc, emb) where voc is dict from words to idx (in emb) and emb is (numpy) embedding matrix with oovs
  """
  oovs = oovs - set(start_voc.keys())
  start_voc_size, emb_size = embedding_matrix.shape
  oov_embeddings = np.zeros((start_voc_size + len(oovs), emb_size))
  oov_embeddings[:start_voc_size] = embedding_matrix
  new_voc = dict(start_voc)

  for i, oov in enumerate(oovs):
    context_words = [new_voc[word]
                    for sentence in filter(lambda s: oov in s, sentences)
                    for word in sentence if word in new_voc and word not in (oov, '<PAD>')]
    if verbose and len(context_words) == 0:
        print(f'Empty context for oov: {oov}')
        print([sentence for sentence in filter(lambda s: oov in s, sentences)])
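    if context_words: # with an empty context the embedding stays at zero (the mean of an empty array would be NaN)
        oov_embeddings[start_voc_size + i] = np.mean(oov_embeddings[context_words], axis=0)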
    new_voc[oov] = start_voc_size + i
  return new_voc, oov_embeddings
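
To make the contextual OOV embedding concrete, here is a minimal sketch on toy data. The vocabulary, the 2-dimensional vectors, and the word 'zorp' are invented for illustration; the averaging step mirrors what add_oov does.

```python
import numpy as np

# Toy vocabulary and 2-dimensional embeddings (illustrative values, not GloVe).
voc = {'the': 0, 'cat': 1, 'sat': 2}
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

sentences = [['the', 'zorp', 'sat'], ['the', 'cat', 'sat']]
oov = 'zorp'  # hypothetical out-of-vocabulary word

# Indices of known words that share a sentence with the OOV word.
context = [voc[w] for s in sentences if oov in s for w in s if w in voc]

# The OOV embedding is the mean of the embeddings of its context words.
oov_vector = emb[context].mean(axis=0)
print(oov_vector)
```

Only the first sentence contains 'zorp', so its vector is the mean of the vectors of 'the' and 'sat'.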

## Preprocessing data
The following function preprocesses the data: it loads the documents in a given range and extends a starting vocabulary and embedding matrix with OOV embeddings. Later, in the train function, we call load_data once per split (training, validation, test), updating the vocabulary and embedding matrix at each step using only the sentences of that split. By default, we do not drop punctuation (although it is not considered by the evaluation metrics, it can still be useful to the model) and we split documents into individual sentences, i.e. each sequence in the preprocessed set is a sentence. We also tried masking all numeric tokens (e.g. '12', '2021', '19.24') with a single number token '<NUM>' to make the task easier (we noticed that numeric tokens can belong to only two classes), but it turned out to hurt performance, so we dropped the idea.

In [3]:
def load_data(start, end, start_voc, embedding_matrix, number_token=False,
              drop_punctuation=False, split_docs=True, ret_counts=False):
  """
    Downloads and preprocesses the dataset.
    Params:
      start: idx of first file to include in data
      end: idx of last file to include in data
      start_voc: starting vocabulary that is extended with oov terms
      embedding_matrix: starting embedding matrix that is extended with OOV embeddings
      number_token: if True, use a single token for all cardinal numbers
      drop_punctuation: if True, drop punctuation
      split_docs: if True, each sequence is one sentence; if false, each sequence is one document
      ret_counts: if True, also return counts of each word in the documents
    Returns tuple (inputs, labels, voc, em, [counts]) where:
      inputs, labels are lists containing words and their respective labels
      voc is the union of start_voc and the set of OOV terms
      em is the updated embedding matrix
      (optional) counts is dict of frequencies of words in the dataset
  """
  # download dataset
  download_and_unzip('https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip')

  inputs, labels = [], []
  vocabulary = set()
  counts = defaultdict(int)
  
  # build dataset
  for doc in range(start, end+1):
    with open(f'dependency_treebank/wsj_{doc:04d}.dp') as f:
      
      input_seq, label_seq = [], []
      
      for line in f:
        if line.strip(): # check for empty lines
          word, label, _ = line.split('\t')
          word = word.lower()
          if r'\/' in word:
            word = word.replace(r'\/', '-')
          if number_token and re.fullmatch(r'\d+(\.\d*)?(\,\d*)?', word) is not None:
            word = '<NUM>'
          if not drop_punctuation or label.isalpha(): # optionally drop punctuation
            vocabulary.add(word)
            input_seq.append(word)
            label_seq.append(label)
            counts[word] += 1
        elif split_docs and input_seq: # sentence over, add to input if splitting documents
          inputs.append(input_seq)
          labels.append(label_seq)
          input_seq, label_seq = [], []

      # add either last sentence or whole document, skipping empty sequences
      if input_seq:
        inputs.append(input_seq)
        labels.append(label_seq)

  vocabulary, embedding_matrix = add_oov(start_voc, vocabulary, embedding_matrix, inputs)

  if ret_counts:
    return inputs, labels, vocabulary, embedding_matrix, counts
  else:
    return inputs, labels, vocabulary, embedding_matrix
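
The number-token masking used when number_token=True can be tried in isolation. The sketch below reuses the pattern from load_data; the helper name mask_numbers and the sample tokens are hypothetical.

```python
import re

# Same pattern as in load_data, written as a raw string.
NUM_PATTERN = re.compile(r'\d+(\.\d*)?(\,\d*)?')

def mask_numbers(tokens):
    """Replace purely numeric tokens with '<NUM>' and leave the rest untouched."""
    return ['<NUM>' if NUM_PATTERN.fullmatch(t) else t for t in tokens]

masked = mask_numbers(['in', '2021', 'shares', 'rose', '19.24', '%', 'to', '1,200'])
print(masked)
```

Since fullmatch requires the whole token to match, tokens that merely contain digits (e.g. 'mid-1990s') are left unchanged.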

## Batching and padding
We define a POSDataset class, a subclass of the PyTorch Dataset class, so that we can use the automatic batching provided by the library. We also define a custom collate function that pads each batch separately, which uses fewer padding tokens during training than padding the whole dataset to a common length before building the individual batches.

In [4]:
class POSDataset(torch.utils.data.Dataset):
    """ Simple dataset class to use dataloaders (batching) """
    def __init__(self, inputs, labels, vocabulary):
        self.inputs_str = inputs
        self.labels_str = labels
        self.voc = vocabulary
        # map each string of the dataset into its corresponding numeric index
        self.inputs = [[vocabulary[word] for word in sequence] for sequence in inputs]
        self.labels = [[class2idx[label] for label in sequence] for sequence in labels]
    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]
    def __len__(self):
        return len(self.inputs)

def collate_fn(batch):
    """ Used by DataLoader to pad each batch independently """
    seq_lens = torch.as_tensor([len(seq[0]) for seq in batch])
    padded_inputs = torch.nn.utils.rnn.pad_sequence([torch.as_tensor(seq[0]) for seq in batch], 
                                                     batch_first=True, padding_value=PAD_TOKEN)
    padded_targets = torch.nn.utils.rnn.pad_sequence([torch.as_tensor(seq[1]) for seq in batch], 
                                                     batch_first=True, padding_value=class2idx['<PAD>'])
    return padded_inputs, padded_targets, seq_lens
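
A rough way to see the saving from batch-level padding is to count padding tokens under both strategies. The sketch below uses plain Python; the helper padding_cost and the sentence lengths are made up for illustration, and batches are taken in order rather than shuffled.

```python
def padding_cost(lengths, batch_size):
    """Return (global, per_batch) padding-token counts for consecutive batches."""
    longest = max(lengths)
    # Global padding: every sequence is padded to the longest one in the dataset.
    global_pad = sum(longest - n for n in lengths)
    # Batch-level padding: every sequence is padded to the longest one in its batch.
    per_batch_pad = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        per_batch_pad += sum(max(batch) - n for n in batch)
    return global_pad, per_batch_pad

lengths = [3, 4, 5, 30, 6, 7]  # hypothetical sentence lengths
global_pad, per_batch_pad = padding_cost(lengths, batch_size=2)
print(global_pad, per_batch_pad)
```

With these lengths, one long sentence forces 125 padding tokens under global padding, but only 27 when each batch is padded on its own.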

## Model definition
The POSTagger class implements the four neural architectures we consider. The baseline $LSTM\_1L$ is composed of an LSTM layer followed by a linear layer; the three variations are $LSTM\_2L$, which stacks two LSTM layers; $GRU$, which uses one GRU layer instead of the LSTM; and $FC\_2L$, which is a single LSTM layer followed by two linear layers with a ReLU between them (without the ReLU the model would be equivalent to $LSTM\_1L$).
We also implemented an initialization trick that sets the bias of the forget (reset) gate of the LSTM (GRU) layers to 1 (-1); RNNs can take a while to learn to retain information from the previous time step, so we make them remember more by default.
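
As a small worked example of the LSTM case: the forget gate is a sigmoid, so a larger bias keeps a larger share of the previous cell state. The sketch assumes the rest of the gate's pre-activation is zero, so only the bias contributes. PyTorch LSTM layers carry two bias vectors per layer (bias_ih and bias_hh) that are summed, so setting both to 1 gives an effective offset of 2.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Fraction of the previous cell state kept by the forget gate for a given bias,
# assuming the input and hidden-state contributions are zero.
for bias in (0.0, 1.0, 2.0):
    print(f'forget-gate bias {bias:.0f}: fraction kept = {sigmoid(bias):.2f}')
```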

We also implemented a custom focal loss (a variation of cross entropy originally used to train CNNs for object detection) to tackle the class imbalance in the Penn Treebank dataset we are using, but it did not help noticeably, so we later switched to a simpler weighted cross entropy.

In [5]:
class POSTagger(torch.nn.Module):

  def __init__(self, embedding_matrix, type, rec_size=1, units=None, hid_size=64):
    """
      A recurrent network performing multiclass classification (POS tagging).
      Params:
        type: type of rnn, either 'lstm' or 'gru'
        embedding_matrix: embedding matrix for embedding layer
        rec_size: number of stacked recurrent modules
        units: int or None, if given then add one additional linear layer with given number of units
        hid_size: size of hidden state of recurrent module
    """
    super().__init__()

    emb_size = embedding_matrix.shape[1]
    self.emb_layer = nn.Embedding.from_pretrained(torch.as_tensor(embedding_matrix))

    # build recurrent layer(s)
    if type == 'lstm':
      self.rec_modules = nn.LSTM(input_size=emb_size, hidden_size=hid_size, bidirectional=True,
                                 batch_first=True, num_layers=rec_size)
    elif type == 'gru':
      self.rec_modules = nn.GRU(input_size=emb_size, hidden_size=hid_size, bidirectional=True,
                                batch_first=True, num_layers=rec_size)
    else:
      raise ValueError(f'wrong type {type}, either lstm or gru')
    self._rec_init(type)

    # build linear layer(s)
    self.fc_modules = nn.Sequential(nn.Linear(2 * hid_size, units if units is not None else len(classes)))
    if units is not None:
      self.fc_modules.add_module('fc1_relu', nn.ReLU())
      self.fc_modules.add_module('fc2', nn.Linear(units, len(classes)))


  def forward(self, x, seq_lens):
    embed_vecs = self.emb_layer(x).float()
    packed_vecs = torch.nn.utils.rnn.pack_padded_sequence(embed_vecs, seq_lens, batch_first=True, enforce_sorted=False)
    rec_out, _ = self.rec_modules(packed_vecs)
    unpacked_rec_out, _ = torch.nn.utils.rnn.pad_packed_sequence(rec_out, batch_first=True, padding_value=PAD_TOKEN)
    fc_out = self.fc_modules(unpacked_rec_out)
    return fc_out

  def _rec_init(self, type):
      # initialization trick: set the gate biases so that the LSTM retains more of its state at the beginning of training
      for names in self.rec_modules._all_weights:
        for name in filter(lambda n: "bias" in n,  names):
            bias = getattr(self.rec_modules, name)
            n = bias.size(0)
            if type == 'lstm': # init lstm forget gate bias to 1
                start, end = n//4, n//2
                bias.data[start:end].fill_(1.)
            else: # init gru reset gate bias to -1
                end = n//3
                bias.data[:end].fill_(-1.)

# Tried but not so useful
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, ignore_index=-100, reduce=True):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduce = reduce
        self.ignore_index = ignore_index
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=self.ignore_index, reduction='none')

    def forward(self, inputs, targets): 
        ce_loss = self.loss_fn(inputs, targets)

        pt = torch.exp(-ce_loss)
        F_loss = self.alpha * (1-pt)**self.gamma * ce_loss

        if self.reduce:
            return torch.mean(F_loss)
        else:
            return F_loss
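
To see how the focal term reweights examples, the sketch below computes the loss for a single token from the probability assigned to its true class. It follows the same formula as FocalLoss.forward, using plain Python instead of tensors; the helper focal_term and the probabilities are illustrative.

```python
import math

def focal_term(p_true, alpha=1.0, gamma=2.0):
    """Focal loss for one example, given the probability assigned to the true class."""
    ce = -math.log(p_true)   # cross entropy of the example
    pt = math.exp(-ce)       # recovers p_true, as in FocalLoss.forward
    return alpha * (1.0 - pt) ** gamma * ce

for p in (0.9, 0.5, 0.1):
    print(f'p_true={p}: ce={-math.log(p):.3f}, focal={focal_term(p):.3f}')
```

Well-classified tokens (p_true close to 1) are scaled down far more than misclassified ones, which is how the loss shifts attention to hard examples.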

## Training procedure
In the following cells, we define functions that train the network for one epoch on the training set and evaluate it on the validation (or test) set. We also implement an EarlyStopping class that saves model checkpoints while tracking a score (we use the validation loss) to decide when to stop early.

In [6]:
class EarlyStopping:
    def __init__(self, patience, model, delta=0, path='res'):
        """ Implements early stopping.
            Params:
                patience: int, number of epochs without score improvement before early stopping
                model: torch.nn.Module that is being trained
                delta: float, minimum change in score to detect improvement
                path: str, path to save checkpoints
        """
        self.patience = patience
        self.delta = delta
        self.model = model
        self.path = path
        if not os.path.isdir(self.path):
            os.mkdir(self.path)
        self.best_score = float('inf')
        self.counter = 0

    def step(self, epoch, score):
        """ Updates ES tracker after one epoch.
            Params:
                epoch: current epoch
                score: validation loss
            Returns tuple (stop, checkpoint),
                where stop is True if early stopping has occurred and False otherwise,
                and checkpoint is last best checkpoint
        """
        if score < self.best_score:
            # print('Validation loss decreasing, storing new checkpoint')
            self.best_score = score
            self.counter = 0
            checkpoint = {'model': self.model.state_dict(), 'epoch': epoch}
            # save under self.path rather than a hard-coded directory
            torch.save(checkpoint, os.path.join(self.path, 'checkpoint.pth'))
            return False, checkpoint
        elif abs(score - self.best_score) > self.delta:
            self.counter += 1
            if self.counter >= self.patience:
                print(f'Early stopping occured at epoch {epoch} with patience {self.patience}')
                checkpoint = torch.load(os.path.join(self.path, 'checkpoint.pth'), map_location='cpu')
                return True, checkpoint
            if self.counter == (self.patience//2):
                print(f'Validation loss increasing for {self.counter} epochs')
        return False, None
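
The patience logic can be followed on a list of validation losses without any model. The sketch below mirrors the counter updates of EarlyStopping.step but leaves out checkpointing; the function early_stop_epoch and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience, delta=0.0):
    """Return the epoch at which early stopping triggers, or None if it never does."""
    best, counter = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # improvement: remember the new best score and reset the counter
            best, counter = loss, 0
        elif abs(loss - best) > delta:
            counter += 1
            if counter >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]  # hypothetical validation losses
stop_epoch = early_stop_epoch(losses, patience=3)
print(stop_epoch)
```

The best loss is reached at epoch 2, and the three following epochs fail to improve on it, so stopping triggers at epoch 5.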

In [7]:
def train_one_epoch(model, optimizer, loss_fn, data_loader, device):
    """ 
        Trains model for one epoch on the given dataloader.
        Parameters:
            model: torch.nn.Module to train
            optimizer: torch.optim optimizer object
            loss_fn: torch.nn criterion to use to compute loss, given outputs and targets
            data_loader: torch.utils.data.DataLoader 
            device: torch.device where training is performed
        Returns log dict {'train/loss' : list(loss values for each batch)} 
    """
    model.train()
    log_dict = {'train/loss': []}

    for inputs, targets, seq_lens in data_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        scores = model(inputs, seq_lens).transpose(1, 2)
        loss = loss_fn(scores, targets)
        loss_value = loss.item()

        if not math.isfinite(loss_value):
            # raise instead of calling exit(1), which would kill the notebook kernel
            raise RuntimeError(f"Loss is {loss_value}, stopping training")

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        log_dict['train/loss'].append(loss_value)

    return log_dict

def evaluate(model, loss_fn, data_loader, device, split, ret_f1_classes=False):
    """ 
        Evaluate model on the given dataloader.
        Parameters:
            model: torch.nn.Module to evaluate
            loss_fn: torch.nn criterion to use to compute loss, given outputs and targets
            data_loader: torch.utils.data.DataLoader 
            device: torch.device where evaluation is performed
            split: either 'valid' or 'test'
            ret_f1_classes: if True, also returns per-class f1 scores
        Returns log dict {'valid/loss' : mean loss, 'valid/{metric}': mean metric} 
    """
    model.eval()
    assert len(data_loader) == 1 # must be a single batch
    with torch.no_grad():
        inputs, targets, seq_lens = next(iter(data_loader))
        inputs = inputs.to(device)
        targets = targets.to(device)

        scores = model(inputs, seq_lens).transpose(1, 2)
        losses = loss_fn(scores, targets).item()
        preds = torch.argmax(scores, 1)

        targets = targets.cpu().numpy()
        preds = preds.cpu().numpy()
        mask = [targets != class2idx[c] for c in punctuation_cls]
        mask = np.array(reduce(lambda a,b: a & b, mask)).reshape(targets.shape)
        acc = np.where(mask, targets==preds, False).sum() / mask.sum()
        cls = [class2idx[c] for c in (classes - punctuation_cls)]
        f1_classes = f1_score(targets.reshape(-1), preds.reshape(-1),
                      labels=cls, average=None, zero_division=0)

    log_dict = {f'{split}/loss': losses,
                f'{split}/accuracy': acc,
                f'{split}/f1': np.mean(f1_classes)}
    if ret_f1_classes:
        return log_dict, {c:s for c,s in zip(cls, f1_classes)}
    else:
        return log_dict
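
The masked accuracy computed in evaluate can be illustrated on a tiny example. This sketch uses np.isin as a compact alternative to combining one mask per ignored class; the label indices and predictions are hypothetical.

```python
import numpy as np

# Hypothetical label indices: 0 = '<PAD>', 1 = '.', 2 = 'NN', 3 = 'VB'.
ignored = [0, 1]
targets = np.array([[2, 3, 1, 0],
                    [3, 2, 2, 1]])
preds = np.array([[2, 2, 1, 2],
                  [3, 2, 3, 1]])

mask = ~np.isin(targets, ignored)           # True where the token is scored
accuracy = (targets == preds)[mask].mean()  # accuracy over scored tokens only
print(mask.sum(), accuracy)
```

Five tokens are scored and three of them are predicted correctly, so the accuracy is 0.6 no matter what is predicted at padding or punctuation positions.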

## Main training loop
Finally, we define one function that fully trains one model. During tuning and model selection, we train on the training set and evaluate (every epoch) on the validation set; for the last two runs (test=True), we train on the union of the training and validation sets and evaluate only once on the test set at the end of training.

In [8]:
def train(tags=None, verbose=False, test=False, emb_size=100, number_token=False, weighted_ce=False, use_wandb=False):
    """ Fully trains one model, based on cfg parameters, on training set and performs evaluation on validation set.
        Params:
            tags: list of str, wandb tags
            verbose: bool, if True verbose output
            test: if False, train on train set and evaluate at each epoch on the validation set;
                  if True, train on train + validation set and evaluate on test set once at the end of training
            emb_size: one of (50, 100, 200, 300), size of embedding representation
            number_token: bool, if True use number token for numeric strings
            weighted_ce: bool, if True use weighted cross entropy
            use_wandb: bool, if True enables wandb logging
        Returns trained model and wandb run
    """
    idx2classes = {i: c for c, i in class2idx.items()}
    cfg_dict = {
        'epochs': cfg.EPOCHS, 'batch_size': cfg.BATCH_SIZE, 'number_token': number_token,
        'model': cfg.TYPE, 'rec_size': cfg.REC_SIZE, 'units': cfg.UNITS, 'hid_size': cfg.HID_SIZE,
        'optim': cfg.OPTIM, 'lr': cfg.LR, 'alpha': cfg.ALPHA, 'betas': cfg.BETAS, 'momentum': cfg.MOMENTUM,
        'weighted_ce': weighted_ce, 'emb_size': emb_size
    }
    if verbose:
        print('CONFIG PARAMETERS:')
        pprint(cfg_dict)

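    if use_wandb:
        if wandb is None: # the import at the top of the notebook is optional
            raise ImportError('use_wandb=True requires the wandb package')
        wandb.login(key=get_wandbkey())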
        run = wandb.init(project="assignment-one", entity="nlpetroni", group=f'{"testing" if test else "validation"}',
                        reinit=True, config=cfg_dict, tags=tags)
        wandb.define_metric("train_step")
        wandb.define_metric("epoch")
        wandb.define_metric('train/loss', step_metric="train_step", summary="min")
        wandb.define_metric(f"valid/loss", step_metric="epoch", summary="min")
        wandb.define_metric(f"valid/accuracy", step_metric="epoch", summary="max")
        wandb.define_metric(f"valid/f1", step_metric="epoch", summary="max")
        wandb.define_metric(f"test/accuracy", step_metric="epoch", summary="max")
        wandb.define_metric(f"test/f1", step_metric="epoch", summary="max")
    else:
        run = None

    glove_voc, embedding_matrix = get_glove(number_token=number_token, emb_size=emb_size)
    if not test:
        split = 'valid'
        train_set, train_labels, train_voc, embedding_matrix = load_data(1, 100, glove_voc, embedding_matrix, number_token=number_token, drop_punctuation=False)
        valid_set, valid_labels, valid_voc, embedding_matrix = load_data(101, 150, train_voc, embedding_matrix, number_token=number_token, drop_punctuation=False)
        train_ds = POSDataset(train_set, train_labels, train_voc)
        valid_ds = POSDataset(valid_set, valid_labels, valid_voc)
        train_dl = torch.utils.data.DataLoader(train_ds, batch_size=cfg.BATCH_SIZE, collate_fn=collate_fn, shuffle=True)
        valid_dl = torch.utils.data.DataLoader(valid_ds, batch_size=len(valid_ds), collate_fn=collate_fn)
    else:
        split = 'test'
        train_set, train_labels, train_voc, embedding_matrix = load_data(1, 150, glove_voc, embedding_matrix, number_token=number_token, drop_punctuation=False)
        test_set, test_labels, test_voc, embedding_matrix = load_data(151, 199, train_voc, embedding_matrix, number_token=number_token, drop_punctuation=False)
        train_ds = POSDataset(train_set, train_labels, train_voc)
        test_ds = POSDataset(test_set, test_labels, test_voc)
        train_dl = torch.utils.data.DataLoader(train_ds, batch_size=cfg.BATCH_SIZE, collate_fn=collate_fn, shuffle=True)
        test_dl = torch.utils.data.DataLoader(test_ds, batch_size=len(test_ds), collate_fn=collate_fn)


    model = POSTagger(embedding_matrix, type=cfg.TYPE, rec_size=cfg.REC_SIZE, units=cfg.UNITS, hid_size=cfg.HID_SIZE).to(device)
    if use_wandb:
        wandb.watch(model, log_graph=True)
    if verbose:
        print(summary(model))

    params = [p for p in model.parameters() if p.requires_grad]
    if cfg.OPTIM == 'rmsprop':
        optimizer = torch.optim.RMSprop(params, lr=cfg.LR, alpha=cfg.ALPHA, momentum=cfg.MOMENTUM, weight_decay=cfg.WEIGHT_DECAY)
    elif cfg.OPTIM == 'adam':
        optimizer = torch.optim.Adam(params, lr=cfg.LR, betas=cfg.BETAS, weight_decay=cfg.WEIGHT_DECAY)
    else:
        raise ValueError(f'wrong optim {cfg.OPTIM}, either rmsprop or adam')

    if weighted_ce:
        cls, counts = np.unique([w for s in train_labels for w in s], return_counts=True)
        weights = torch.ones(len(classes))
        for c,n in zip(cls, counts):
            if n <= 150 and c not in punctuation_cls:
                weights[class2idx[c]] += 1-(n/150)
        weights[class2idx['<PAD>']] = 0
        weights = (weights - weights.min())/(weights.max()-weights.min())
        weights = weights.to(device)
        loss_fn = nn.CrossEntropyLoss(ignore_index=class2idx['<PAD>'], weight=weights)
    else:
        loss_fn = nn.CrossEntropyLoss(ignore_index=class2idx['<PAD>'])

    train_step = 0
    es_tracker = EarlyStopping(20, model)
    epoch = 0
    stop = False
    print('STARTING TRAINING')

    while epoch < cfg.EPOCHS and not stop:
        train_log_dict = train_one_epoch(model, optimizer, loss_fn, train_dl, device)
        if not test:
            valid_log_dict, f1_classes = evaluate(model, loss_fn, valid_dl, device, split=split, ret_f1_classes=True)
            stop, checkpoint = es_tracker.step(epoch, valid_log_dict['valid/loss'])
            if use_wandb:
                wandb.log({'epoch': epoch, 'valid/loss': valid_log_dict['valid/loss'],
                        'valid/accuracy': valid_log_dict['valid/accuracy'], 'valid/f1': valid_log_dict['valid/f1'],
                        'valid/f1_distribution': wandb.Histogram(np_histogram=np.histogram(list(f1_classes.values())))})
            if stop:
                model.load_state_dict(checkpoint['model'])
                # re-evaluate on the best checkpoint for logging purposes
                valid_log_dict, f1_classes = evaluate(model, loss_fn, valid_dl, device, split=split, ret_f1_classes=True)
            if (epoch % 25) == 0:
                print(f'[{epoch:03d}/{cfg.EPOCHS:03d}] train loss: {np.mean(train_log_dict["train/loss"]):.3f}, valid loss: {valid_log_dict["valid/loss"]:.3f}, f1: {valid_log_dict["valid/f1"]:.2f} accuracy: {valid_log_dict["valid/accuracy"]:.2f}')
        if use_wandb:
            for batch_loss in train_log_dict['train/loss']:
                wandb.log({'train_step': train_step, 'epoch': epoch, 'train/loss': batch_loss})
                train_step += 1
        epoch += 1

    # log per-class f1 scores (wandb calls only when logging is enabled)
    if not test:
        if use_wandb:
            data = [[idx2classes[i], score] for i, score in f1_classes.items()]
            table = wandb.Table(data=data, columns=["class", "f1_score"])
            wandb.log({'valid/f1_per_class': wandb.plot.bar(table, "class", "f1_score", title="F1 per class bar chart")})
    else:
        log_dict, f1_classes = evaluate(model, loss_fn, test_dl, device, split=split, ret_f1_classes=True)
        if use_wandb:
            data = [[idx2classes[i], score] for i,score in f1_classes.items()]
            table = wandb.Table(data=data, columns = ["class", "f1_score"])
            wandb.log({'test/loss': log_dict['test/loss'], 'test/accuracy': log_dict['test/accuracy'], 'test/f1': log_dict['test/f1'],
                       'test/f1_per_class': wandb.plot.bar(table, "class", "f1_score", title="F1 per class bar chart")})
        print(f"TEST SET METRICS \nloss: {log_dict['test/loss']:.3f}, accuracy: {log_dict['test/accuracy']:.2f}, f1: {log_dict['test/f1']:.1f}")
    
    if use_wandb:
        run.finish()

    return model, run

def fix_random(seed):
    """Fix all the possible sources of randomness
        Params:
        seed: the seed to use
    """
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
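
The class weights built in the weighted_ce branch of train can be reproduced on plain dictionaries. The sketch below follows the same steps (a boost for non-punctuation classes with at most 150 occurrences, zero weight for padding, then min-max normalization); the helper class_weights and the label counts are invented for illustration.

```python
def class_weights(counts, ignored, threshold=150):
    """Class weights computed as in the weighted_ce branch of train, on plain dicts."""
    weights = {c: 1.0 for c in counts}
    for c, n in counts.items():
        if n <= threshold and c not in ignored:
            weights[c] += 1 - n / threshold  # the rarer the class, the larger the boost
    weights['<PAD>'] = 0.0                   # padding never contributes to the loss
    lo, hi = min(weights.values()), max(weights.values())
    return {c: (w - lo) / (hi - lo) for c, w in weights.items()}

# Made-up label counts, for illustration only.
weights = class_weights({'NN': 5000, 'FW': 75, 'UH': 3, '<PAD>': 0}, ignored={'<PAD>'})
print(weights)
```

After normalization the rarest class gets weight 1, frequent classes end up near 0.5, and padding stays at 0.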

## Hyperparameter tuning

In [10]:
use_wandb = False
fix_random(42)
tuning = False
if tuning: # takes a while to run, see https://wandb.ai/nlpetroni/assignment-one for results
    for (tag, type, rec_size, units) in (('lstm_1L', 'lstm', 1, None), ('lstm_2L', 'lstm', 2, None),
                                         ('fc_2L', 'lstm', 1, 128), ('gru', 'gru', 1, None)):
        for wce in (True, False):
            for lr in (1e-3, 7.5e-4, 2.5e-4, 1e-4):
                cfg.TYPE = type
                cfg.REC_SIZE = rec_size
                cfg.LR = lr
                cfg.UNITS = units
                _, run = train(tags=['tuning', tag], weighted_ce=wce, use_wandb=use_wandb)
else:
    print(f'Skipping tuning, see https://wandb.ai/nlpetroni/assignment-one to view the tuning results')
    print('Set tuning = True to perform tuning')

Skipping tuning, see https://wandb.ai/nlpetroni/assignment-one to view the tuning results
Set tuning = True to perform tuning
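
For a sense of the cost of the tuning loop above, the size of the grid can be computed directly. This is a small sketch using the same tags, loss options, and learning rates as the tuning cell.

```python
from itertools import product

architectures = ('lstm_1L', 'lstm_2L', 'fc_2L', 'gru')
weighted_ce = (True, False)
learning_rates = (1e-3, 7.5e-4, 2.5e-4, 1e-4)

# One training run per (architecture, loss, learning rate) combination.
grid = list(product(architectures, weighted_ce, learning_rates))
print(len(grid))
```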


## Model selection

In [12]:
use_wandb = True
fix_random(42)
cfg.EPOCHS = 200
for (tag, type, rec_size, units, lr, wl) in (('lstm_1L', 'lstm', 1, None, 7.5e-4, True),
                                            ('lstm_2L', 'lstm', 2, None, 1e-3, True),
                                            ('fc_2L', 'lstm', 1, 128, 2.5e-4, True),
                                            ('gru', 'gru', 1, None, 7.5e-4, True)):
    cfg.TYPE = type
    cfg.REC_SIZE = rec_size
    cfg.LR = lr
    cfg.UNITS = units
    _, run = train(tags=['model selection', tag], weighted_ce=wl, use_wandb=use_wandb)

STARTING TRAINING
[000/200] train loss: 2.327, valid loss: 1.593, f1: 0.22 accuracy: 0.57
[025/200] train loss: 0.086, valid loss: 0.277, f1: 0.79 accuracy: 0.91
Validation loss increasing for 10 epochs
Early stopping occured at epoch 41 with patience 20

wandb run summary: epoch 41, train_step 2603

STARTING TRAINING
[000/200] train loss: 2.466, valid loss: 1.654, f1: 0.19 accuracy: 0.54
Validation loss increasing for 10 epochs
[025/200] train loss: 0.018, valid loss: 0.355, f1: 0.79 accuracy: 0.91
Early stopping occured at epoch 33 with patience 20

wandb run summary: epoch 33, train_step 2107

STARTING TRAINING
[000/200] train loss: 2.914, valid loss: 2.563, f1: 0.04 accuracy: 0.30
[025/200] train loss: 0.232, valid loss: 0.343, f1: 0.70 accuracy: 0.89
Validation loss increasing for 10 epochs
[050/200] train loss: 0.093, valid loss: 0.315, f1: 0.75 accuracy: 0.90
Early stopping occured at epoch 60 with patience 20

wandb run summary: epoch 60, train_step 3781

STARTING TRAINING
[000/200] train loss: 1.837, valid loss: 1.030, f1: 0.37 accuracy: 0.72
[025/200] train loss: 0.048, valid loss: 0.295, f1: 0.77 accuracy: 0.91
Validation loss increasing for 10 epochs
Early stopping occured at epoch 38 with patience 20


Run summary: epoch 38, train_step 2417


## Final evaluation of the two best architectures on the test set
Based on validation performance, the two best architectures are $LSTM\_1L$ and $LSTM\_2L$, both LSTM-based. In this last cell we retrain them on the train + validation set, for a fixed number of epochs since no validation data is left for early stopping, and evaluate them on the test set.
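The epoch budgets used in the cell below (41 and 33) coincide with epochs at which early stopping fired in the validation runs logged above. As a rough illustration of how such a stopping epoch comes about, here is a minimal patience-based sketch. The function name `early_stopping_epoch` and the toy loss curve are hypothetical, and the notebook's own training loop may track improvement differently.

```python
def early_stopping_epoch(valid_losses, patience=20):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # new best validation loss: reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Toy curve (made-up values): improves for 5 epochs, then plateaus slightly higher.
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 30
stop_epoch = early_stopping_epoch(losses, patience=20)
never_stops = early_stopping_epoch([1.0, 0.9, 0.8], patience=20)
print(stop_epoch, never_stops)
```

Once the stopping epoch is known from a validation run, it can be reused as a fixed epoch count when the validation data is merged into the training set.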

In [13]:
use_wandb = True
fix_random(42)
os.makedirs('res/models', exist_ok=True)  # make sure the checkpoint directory exists

# first best model (LSTM_1L)
cfg.EPOCHS = 41
cfg.TYPE = 'lstm'
cfg.LR = 7.5e-4
cfg.REC_SIZE = 1
cfg.UNITS = None
# test=True: report metrics on the test set (see TEST SET METRICS in the output)
lstm_1l, _ = train(tags=['test', 'lstm_1L'], test=True, weighted_ce=True, use_wandb=use_wandb)
torch.save({'model': lstm_1l.state_dict()}, 'res/models/lstm_1l.pth')

# second best model (LSTM_2L); cfg.UNITS keeps the value set above
cfg.EPOCHS = 33
cfg.TYPE = 'lstm'
cfg.LR = 1e-3
cfg.REC_SIZE = 2
lstm_2l, _ = train(tags=['test', 'lstm_2L'], test=True, weighted_ce=True, use_wandb=use_wandb)
torch.save({'model': lstm_2l.state_dict()}, 'res/models/lstm_2l.pth')

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/diego/.netrc


[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`


STARTING TRAINING
TEST SET METRICS 
loss: 0.259, accuracy: 0.93, f1: 0.8


Run summary: epoch 40, test/loss 0.25912, train_step 4181


STARTING TRAINING
TEST SET METRICS 
loss: 0.308, accuracy: 0.93, f1: 0.8


Run summary: epoch 32, test/loss 0.30819, train_step 3365


## Conclusion and error analysis
*Note: to look at the plots please click [here](https://wandb.ai/nlpetroni/assignment-one/reports/Results--VmlldzoxMzU2NDU4?accessToken=fyze87urj6r5b2ok4p9xnfu26ouwmu7pkmge1yw838z07ohxtyf9mdla5n13s6fr)*

The per-class F1 plots on the validation set for the two best models show that none of the models can correctly classify tokens tagged FW or UH, because these two classes are very rare in the dataset (UH has only 3 examples in the train + validation set, FW only 4). Test performance on these two classes cannot be evaluated, as the test set contains no examples of them.

Other tags with poor validation performance are LS, WP\$, PDT and NNPS, although their scores vary across models. LS, WP\$ and PDT are underrepresented: LS does not appear in the test set at all, WP\$ has only 4 of its 14 examples in the test set, and PDT has only 4 of its 23. One reason for the poor performance on NNPS (plural proper nouns) is that it has fewer examples than its singular counterpart NNP (singular proper nouns). Moreover, many of the words tagged NNPS are probably OOV, and contextual embeddings may not be the best representation for such words. A possibly better approach would be to take into account the embedding of the singular form, if it is present in the vocabulary, when computing the contextual embedding of the plural.

Results on the test set partially confirm these findings. NNPS improves markedly on the test set, but its F1 score is still below 0.5. The baseline model (the best one, LSTM 1L) improves from validation to test on several classes, such as PDT, NNPS and RBS. The second best model (LSTM 2L) seems more flawed: it improves some of the poor scores on WP\$, NNPS and RBS, but at the expense of the simple TO tag, whose F1 score drops drastically from 0.99 in validation to 0.03 in test.
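To make the link between rare tags and the averaged score concrete, the sketch below computes per-class F1 together with each tag's support, then averages the per-class scores. It assumes the reported F1 is a macro average over tags. The helper `per_class_f1` and the toy tag sequences are hypothetical and are not results from the notebook.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {tag: (f1, support)} for every tag seen in y_true or y_pred."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            true_pos[gold] += 1
        else:
            false_pos[pred] += 1   # predicted tag was wrong
            false_neg[gold] += 1   # gold tag was missed
    support = Counter(y_true)
    scores = {}
    for tag in sorted(set(y_true) | set(y_pred)):
        tp, fp, fn = true_pos[tag], false_pos[tag], false_neg[tag]
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the tag never matches
        f1 = 2 * tp / denom if denom else 0.0
        scores[tag] = (f1, support[tag])
    return scores


# Toy example with made-up gold tags and predictions.
gold = ['NN', 'NN', 'NN', 'NNP', 'NNP', 'NNPS', 'TO', 'TO']
pred = ['NN', 'NN', 'NNP', 'NNP', 'NNP', 'NNP', 'TO', 'TO']

scores = per_class_f1(gold, pred)
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

for tag, (f1, n) in scores.items():
    print(f'{tag:5s} f1={f1:.2f} support={n}')
print(f'macro F1 = {macro_f1:.2f}, accuracy = {accuracy:.2f}')
```

In this toy case a single misclassified rare tag pulls the macro average well below the accuracy, which is the same pattern as the gap between F1 and accuracy reported for the test set. With scikit-learn, `f1_score(gold, pred, average=None)` returns the same per-class values.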