# Assignment 1

**Due date**: 23/12/2021 (dd/mm/yyyy)

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Summary**: Part-of-Speech (POS) tagging as Sequence Labelling using Recurrent Neural Architectures

# Intro

In this assignment we will ask you to perform POS tagging using neural architectures.

You are asked to follow these steps:
*   Download the corpus and split it into training, validation and test sets, structuring a dataframe.
*   Embed the words using GloVe embeddings
*   Create a baseline model, using a simple neural architecture
*   Experiment with small modifications to the baseline model and choose hyperparameters using the validation set
*   Evaluate your two best models
*   Analyze the errors of your models


**Task**: given a corpus of documents, predict the POS tag for each word

**Corpus**:
Ignore the numeric value in the third column; use only the words/symbols and their labels.
The corpus is available at:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip

**Splits**: documents 1-100 are the train set, 101-150 validation set, 151-199 test set.
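
As an illustration, the split boundaries above could be mapped to file names as in the minimal sketch below. It assumes the `wsj_XXXX.dp` naming of the files in the extracted archive; `SPLITS` and `split_files` are hypothetical names introduced only for this example.

```python
# Document ranges are inclusive, as stated in the assignment text.
SPLITS = {"train": (1, 100), "validation": (101, 150), "test": (151, 199)}

def split_files(split, folder="dependency_treebank"):
    """Return the file paths belonging to a split (hypothetical helper)."""
    first, last = SPLITS[split]
    return [f"{folder}/wsj_{doc:04d}.dp" for doc in range(first, last + 1)]

train_files = split_files("train")
print(len(train_files), train_files[0], train_files[-1])
```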


**Features**: you MUST use GloVe embeddings as the only input features to the model.

**Splitting**: you can decide to split documents into sentences or not, the choice is yours.

**I/O structure**: The input data will have three dimensions: (1) documents/sentences, (2) tokens, (3) features. For the output there are two possibilities: with one-hot encoding it has three dimensions, (1) documents/sentences, (2) token labels, (3) classes; with a single integer class index per token it has two dimensions, (1) documents/sentences, (2) token labels.
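
The two output layouts can be illustrated with toy nested lists. This is only a sketch of the shapes involved: the sizes (2 sentences, 3 tokens, 4 features, 3 classes) and the `shape` helper are assumptions made for the example, not part of the assignment.

```python
def shape(nested):
    """Return the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

n_classes = 3  # toy number of POS classes (assumption)

# Input: 2 sentences, 3 tokens each, 4 features per token (toy embedding size)
inputs = [[[0.1] * 4 for _ in range(3)] for _ in range(2)]

# Option 1: one integer class index per token -> 2 dimensions
int_labels = [[0, 2, 1], [1, 1, 0]]

# Option 2: one one-hot vector per token -> 3 dimensions
onehot_labels = [[[1.0 if c == idx else 0.0 for c in range(n_classes)]
                  for idx in sentence] for sentence in int_labels]

print(shape(inputs))         # (2, 3, 4)
print(shape(int_labels))     # (2, 3)
print(shape(onehot_labels))  # (2, 3, 3)
```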

**Baseline**: two-layer architecture: a Bidirectional LSTM layer and a Dense/Fully-Connected layer on top; the choice of hyper-parameters is yours.

**Architectures**: experiment using a GRU instead of the LSTM, adding an additional LSTM layer, and adding an additional dense layer; do not mix these variations.


**Training and Experiments**: all the experiments must involve only the training and validation sets.

**Evaluation**: in the end, only the two best models of your choice (according to the validation set) must be evaluated on the test set. The main metric must be F1-Macro computed over the various parts of speech. DO NOT CONSIDER THE PUNCTUATION CLASSES.

**Metrics**: the metric you must use to evaluate your final model is the F1-macro, WITHOUT considering punctuation/symbol classes. During training you can use accuracy instead: F1 scores computed on individual mini-batches cannot be aggregated into the F1 score of the whole set, so computing F1 would require a single (gigantic) batch.
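
One way to obtain a correct F1-macro while still predicting in mini-batches is to accumulate all gold and predicted labels and compute the score once at the end. The sketch below is a plain-Python illustration under stated assumptions: `macro_f1` is a hypothetical helper, the toy tags are made up, and tokens whose gold label is ignored can still count as false positives for a kept class, which is only one possible reading of the requirement.

```python
def macro_f1(y_true, y_pred, ignored=()):
    """Macro-averaged F1 over the classes found in y_true, skipping `ignored`."""
    classes = sorted(set(y_true) - set(ignored))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Toy (gold, predicted) label pairs for two mini-batches (made-up data)
batches = [(["NN", "VB", "."], ["NN", "NN", "."]),
           (["DT", "NN", ","], ["DT", "NN", "."])]

# Accumulate labels batch by batch, then score once on the whole set.
all_true, all_pred = [], []
for true_batch, pred_batch in batches:
    all_true.extend(true_batch)
    all_pred.extend(pred_batch)

score = macro_f1(all_true, all_pred, ignored={".", ","})
print(round(score, 3))
```

In practice a library implementation (for instance a scikit-learn style `f1_score` restricted to the non-punctuation labels) would normally replace the hand-written helper.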

**Discussion and Error Analysis**: verify and discuss whether the results on the test set are coherent with those on the validation set; analyze the errors made by your model, try to understand their possible causes, and think about how to improve it.

**Report**: you are asked to deliver the code of your experiments and a small pdf report of about 2 pages; the pdf must begin with the names of the people of your team and a small abstract (4-5 lines) that sums up your findings.

# Out Of Vocabulary (OOV) terms

How should you handle words that are not in the GloVe vocabulary?
You can handle them as you want (random embedding, placeholder, whatever!), but they must be STATIC embeddings (you cannot train them).

But there is a very important caveat! As usual, the elements of the test set must not influence the elements of the other splits!

So, when you compute new embeddings for train+validation, you must forget about test documents.
The motivation is to emulate a real-world scenario, where you select and train a model in the first stage, without knowing anything about the testing environment.

For implementation convenience, you CAN use a single vocabulary file/matrix/whatever. The principle of the previous point is that the embeddings inside that file/matrix must be generated independently for train and test splits.

Basically in a real-world scenario, this is what would happen:
1. Starting vocabulary V1 (in this assignment, GloVe vocabulary)
2. Compute embeddings for terms out of vocabulary V1 (OOV1) of the training split 
3. Add embeddings to the vocabulary, so to obtain vocabulary V2=V1+OOV1
4. Training of the model(s)
5. Compute embeddings for terms OOV2 of the validation split 
6. Add embeddings to the vocabulary, so to obtain vocabulary V3=V1+OOV1+OOV2
7. Validation of the model(s)
8. Compute embeddings for terms OOV3 of the test split 
9. Add embeddings to the vocabulary, so to obtain vocabulary V4=V1+OOV1+OOV2+OOV3
10. Testing of the final model

In this case, where we already have all the documents, we can simplify the process a bit, but the procedure must remain rigorous.

1. Starting vocabulary V1 (in this assignment, GloVe vocabulary)
2. Compute embeddings for terms out of vocabulary V1 (OOV1) of the training split 
3. Add embeddings to the vocabulary, so to obtain vocabulary V2=V1+OOV1
4. Compute embeddings for terms OOV2 of the validation split 
5. Add embeddings to the vocabulary, so to obtain vocabulary V3=V1+OOV1+OOV2
6. Compute embeddings for terms OOV3 of the test split 
7. Add embeddings to the vocabulary, so to obtain vocabulary V4=V1+OOV1+OOV2+OOV3
8. Training of the model(s)
9. Validation of the model(s)
10. Testing of the final model

Step 2 and step 6 must be completely independent of each other with respect to both the method and the documents, but they can rely on the previous vocabulary (V1 for step 2 and V3 for step 6).
THEREFORE, if a word is present in both the training split and the test split and not in the starting vocabulary, its embedding is computed in step 2 and it is no longer considered OOV in step 6.
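
The simplified procedure can be sketched as follows. This is a toy illustration, not a prescribed implementation: the tiny `embeddings` dictionary stands in for the GloVe vocabulary, random vectors are just one admissible static strategy, and `extend_vocabulary` is a hypothetical helper.

```python
import random

def extend_vocabulary(embeddings, split_tokens, dim, rng):
    """Add a static random vector for every token missing from `embeddings`.

    Returns the sorted list of tokens that were OOV for this split.
    """
    oov = sorted(set(split_tokens) - set(embeddings))
    for token in oov:
        embeddings[token] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return oov

dim = 4
rng = random.Random(0)
# V1: toy stand-in for the GloVe vocabulary (assumption)
embeddings = {"the": [0.1] * dim, "cat": [0.2] * dim}

train = ["the", "cat", "zorgle"]
valid = ["the", "blip", "zorgle"]
test = ["zorgle", "quux"]

oov1 = extend_vocabulary(embeddings, train, dim, rng)  # V2 = V1 + OOV1
oov2 = extend_vocabulary(embeddings, valid, dim, rng)  # V3 = V2 + OOV2
oov3 = extend_vocabulary(embeddings, test, dim, rng)   # V4 = V3 + OOV3

print(oov1, oov2, oov3)
```

Here `zorgle` appears in both the training and test splits, so it receives its vector while processing the training split and is no longer OOV when the test split is processed.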

# Report
The report must not be just a copy and paste of graphs and tables!

The report must not be longer than 2 pages and must contain:
* The names of the members of your team
* A short abstract (4-5 lines) that sums up everything
* A general description of the task you have addressed and how you have addressed it
* A short description of the models you have used
* Some tables that sum up your findings in validation and test and a discussion of those results
* The most relevant findings of your error analysis

# Evaluation Criterion

The goal of this assignment is not to prove you can find the best model ever, but to face a common task, structure it correctly, and follow a correct and rigorous experimental procedure.
In other words, we don't care if your final models are awful as long as you have followed the correct procedure and written a decent report.

The score of the assignment will be computed roughly as follows:
* 1 point for the general setting of the problem
* 1 point for the handling of OOV terms
* 1 point for the models
* 1 point for train-validation-test procedure
* 2 points for the discussion of the results, error analysis, and report

This distribution of scores is tentative and we may decide to alter it at any moment.
We also reserve the right to assign a small bonus (0.5 points) to any assignment that is particularly worthy. Similarly, in case of grave errors, we may decide to assign an equivalent malus (-0.5 points).

# Contacts

In case of any doubt, question, or issue, or if you need help, we highly recommend that you check the [course useful material](https://virtuale.unibo.it/pluginfile.php/1036039/mod_resource/content/2/NLP_Course_Useful_Material.pdf) for additional information and use the Virtuale forums to discuss with other students.

You can always contact us at the following email addresses. To increase the probability of a prompt response, we recommend that you write to both teaching assistants.

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it


# FAQ
* You can use a non-trainable Embedding layer to load the GloVe embeddings
* You can use any library of your choice to implement the networks. Two options are TensorFlow/Keras and PyTorch. Both these libraries have all the classes you need to implement these simple architectures and there are plenty of tutorials around, where you can learn how to use them.

In [1]:
# ! pip install wandb # colab only

import math
from collections import defaultdict, OrderedDict
import numpy as np
import torch
from torch import nn
from torchinfo import summary
import wandb
import config as cfg

def download_and_unzip(url, save_dir='.'):
  # downloads and unzips url, if not already downloaded
  import os
  from urllib.request import urlopen
  from io import BytesIO
  from zipfile import ZipFile
  fname = url.split('/')[-1][:-4] if save_dir == '.' else save_dir
  if fname not in os.listdir():
    print(f'downloading and unzipping {fname}...', end=' ')
    r = urlopen(url)
    zipf = ZipFile(BytesIO(r.read()))
    zipf.extractall(path=save_dir)
    print(f'completed')
  else:
    print(f'{fname} already downloaded')

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# GLOVE EMBEDDINGS

OOV strategy: use the average of all word vectors (or of a subset of them); see https://groups.google.com/g/globalvectors/c/9w8ZADXJclA/m/hRdn4prm-XUJ
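
As a toy illustration of the averaging idea, the snippet below computes the component-wise mean of a few made-up 2-dimensional vectors. `mean_vector` and the `known` dictionary are assumptions for the example; the notebook does the same thing on the real GloVe matrix with NumPy.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up embeddings standing in for known words
known = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "bird": [2.0, 2.0]}

# The same static vector would be shared by every unknown word
unk_vector = mean_vector(list(known.values()))
print(unk_vector)  # [1.0, 1.0]
```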

In [2]:
n_tokens = 400000 + 2 # glove vocabulary size + UNK + PAD

def get_glove(emb_size=100):
  """
    Download and load GloVe embeddings.
    Parameters:
      emb_size: embedding size (50/100/200/300-dimensional vectors).
    Returns:
      vocabulary: dict mapping each word to its row in the embedding matrix
        (words not in GloVe map to the <UNK> row).
      embedding_matrix: numpy array of shape (n_tokens, emb_size).
  """
  if emb_size not in (50, 100, 200, 300):
    raise ValueError(f'wrong size parameter: {emb_size}')
  
  download_and_unzip('http://nlp.stanford.edu/data/glove.6B.zip', save_dir='glove')
  unk_idx, pad_idx = n_tokens - 2, n_tokens - 1
  vocabulary = defaultdict(lambda: unk_idx) # when word not present, it's UNK
  embedding_matrix = np.ones((n_tokens, emb_size))

  with open(f'glove/glove.6B.{emb_size}d.txt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embedding_matrix[i] = coefs
        vocabulary[word] = i
  
  # add embedding for OOV terms (mean of the GloVe rows only) and padding (zeros)
  embedding_matrix[unk_idx] = embedding_matrix[:unk_idx].mean(axis=0)
  embedding_matrix[pad_idx] = 0
  vocabulary['<UNK>'] = unk_idx
  vocabulary['<PAD>'] = pad_idx

  return vocabulary, embedding_matrix

# DATA PREPROCESSING

In [3]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels
    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]
    def __len__(self):
        return self.inputs.shape[0]

In [4]:
def load_data(start, end, start_voc, class2idx,
              drop_punctuation=True, split_docs=True, ret_counts=False):
  """
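    Downloads the dataset and preprocesses the data.
    Params:
      start: index of the first document to load (inclusive).
      end: index of the last document to load (inclusive).
      start_voc: mapping from word to row index in the embedding matrix.
      class2idx: mapping from POS tag to class index.
      drop_punctuation: if True, drop punctuation/symbol tokens.
      split_docs: if True, split documents into sentences (on empty lines).
      ret_counts: if True, also return the word counts.
    Returns the padded input and label index tensors and the set of words seen
    (plus the word counts when ret_counts is True).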
  """
  # download dataset
  download_and_unzip('https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip')
  
  inputs, labels = [], []
  vocabulary = set()
  counts = defaultdict(int)
  max_seq_len = 0
  
  # build dataset
  for doc in range(start, end+1):
    with open(f'dependency_treebank/wsj_{doc:04d}.dp') as f:
      
      input_seq, label_seq = [], []
      
      for line in f:
        if line.strip(): # check for empty lines
          word, label, _ = line.split('\t')
          word = word.lower()
          # optionally drop punctuation; isalpha() would also drop PRP$ and WP$
          if not drop_punctuation or label in class2idx:
            vocabulary.add(word)
            input_seq.append(word)
            label_seq.append(label)
            counts[word] += 1
        elif split_docs: # sentence over, add to input if splitting documents
          max_seq_len = max(max_seq_len, len(input_seq))
          inputs.append(input_seq)
          labels.append(label_seq)
          input_seq, label_seq = [], []

      if input_seq: # add the last (or only) sequence of the document, if any
        max_seq_len = max(max_seq_len, len(input_seq))
        inputs.append(input_seq)
        labels.append(label_seq)
  
  for i_seq, l_seq in zip(inputs, labels):
    i_seq += ['<PAD>'] * (max_seq_len - len(i_seq))
    l_seq += ['<PAD>'] * (max_seq_len - len(l_seq))

  inputs = torch.as_tensor([[start_voc[word] for word in sequence] for sequence in inputs])
  labels = torch.as_tensor([[class2idx[label] for label in sequence] for sequence in labels])

  if ret_counts:
    return inputs, labels, vocabulary, counts
  else:
    return inputs, labels, vocabulary

In [5]:
class POSTagger(torch.nn.Module):

  def __init__(self, embedding_matrix, type, rec_size=1, units=None, hid_size=50):
    """
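      A recurrent network performing multiclass classification (POS tagging).
      Params:
        embedding_matrix: array of shape (vocab_size, emb_size) with the pretrained, frozen embeddings.
        type: recurrent cell to use, either 'lstm' or 'gru'.
        rec_size: number of stacked recurrent layers.
        units: size of the optional additional dense layer (None for no extra layer).
        hid_size: hidden size of each direction of the recurrent layer.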
    """
    super().__init__()

    emb_size = embedding_matrix.shape[1]
    self.emb_layer = nn.Embedding.from_pretrained(torch.as_tensor(embedding_matrix))

    if type == 'lstm':
      rec_module = nn.LSTM
    elif type == 'gru':
      rec_module = nn.GRU
    else:
      raise ValueError(f'wrong type {type}, either lstm or gru')
    self.rec_modules = rec_module(input_size=emb_size, hidden_size=hid_size, bidirectional=True, batch_first=True, num_layers=rec_size)

    fc_params = [2 * hid_size] + ([units, 37] if units is not None else [37])
    self.fc_modules = nn.Sequential(
      OrderedDict([(f'fc_{i}', nn.Linear(in_shape, out_shape)) 
      for i, (in_shape, out_shape) in enumerate(zip(fc_params[:-1], fc_params[1:]))]))
      
    # normalize over the class dimension (last one), not over the sequence dimension
    self.logsoftmax = nn.LogSoftmax(dim=-1)

  def forward(self, x):
    vecs = self.emb_layer(x).float()
    rec_out, _ = self.rec_modules(vecs)
    fc_out = self.fc_modules(rec_out)
    return self.logsoftmax(fc_out)

In [6]:
def train_one_epoch(model, optimizer, loss_fn, data_loader, device):
    model.train()
    log_dict = {'train/loss': []}

    for inputs, targets in data_loader:
        # Tensor.to is not in-place: keep the returned tensors
        inputs = inputs.to(device)
        targets = targets.to(device)

        logprobs = model(inputs).transpose(1, 2)
        loss = loss_fn(logprobs, targets)
        loss_value = loss.item()

        if not math.isfinite(loss_value):
            # raise instead of exit(1), which would kill the notebook kernel
            raise RuntimeError(f"Loss is {loss_value}, stopping training")

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        log_dict['train/loss'].append(loss_value)

    return log_dict

def evaluate(model, loss_fn, data_loader, device, metric='accuracy'):
    model.eval()
    batch_losses = []
    batch_metrics = []

    with torch.no_grad():
        for inputs, targets in data_loader:
            # Tensor.to is not in-place: keep the returned tensors
            inputs = inputs.to(device)
            targets = targets.to(device)

            logprobs = model(inputs).transpose(1, 2)
            loss_value = loss_fn(logprobs, targets).item()
            preds = torch.argmax(logprobs, 1)

            if metric == 'accuracy':
                # fraction of correct tokens in this batch (padding included), as a Python float
                metric_value = (targets == preds).float().mean().item()
            elif metric == 'f1':
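                # F1 cannot be averaged over mini-batches, so it is not computed here
                raise NotImplementedError('f1 must be computed on the whole set, not per batch')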
            else:
                raise ValueError(f'wrong metric {metric}, either accuracy or f1')

            batch_losses.append(loss_value)
            batch_metrics.append(metric_value)

    log_dict = {'valid/loss': np.mean(batch_losses), 
               f'valid/{metric}': np.mean(batch_metrics)}
    return log_dict


In [7]:
classes = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 
           'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 
           'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '<PAD>']
classes = {c: i for i, c in enumerate(classes)}
glove_voc, embedding_matrix = get_glove()

glove already downloaded


In [9]:
train_set, train_labels, train_voc, counts = load_data(1, 100, glove_voc, classes, ret_counts=True)
valid_set, valid_labels, valid_voc = load_data(101, 150, glove_voc, classes)
test_set, test_labels, test_voc = load_data(151, 199, glove_voc, classes)

train_dl = torch.utils.data.DataLoader(Dataset(train_set, train_labels), batch_size=cfg.BATCH_SIZE)
valid_dl = torch.utils.data.DataLoader(Dataset(valid_set, valid_labels), batch_size=cfg.BATCH_SIZE)
test_dl = torch.utils.data.DataLoader(Dataset(test_set, test_labels))

model = POSTagger(embedding_matrix, type=cfg.TYPE, rec_size=cfg.REC_SIZE, units=cfg.UNITS, hid_size=cfg.HID_SIZE).to(device)
summary(model)

dependency_treebank already downloaded
dependency_treebank already downloaded
dependency_treebank already downloaded


Layer (type:depth-idx)                   Param #
POSTagger                                --
├─Embedding: 1-1                         (40,000,200)
├─LSTM: 1-2                              19,520
├─Sequential: 1-3                        --
│    └─Linear: 2-1                       1,517
├─LogSoftmax: 1-4                        --
Total params: 40,021,237
Trainable params: 21,037
Non-trainable params: 40,000,200

In [13]:
cfg_dict = {
    'epochs': cfg.EPOCHS, 'batch_size': cfg.BATCH_SIZE, 
    'model': cfg.TYPE, 'rec_size': cfg.REC_SIZE, 'units': cfg.UNITS, 'hid_size': cfg.HID_SIZE,
    'optim': cfg.OPTIM, 'lr': cfg.LR, 'alpha': cfg.ALPHA, 'betas': cfg.BETAS, 'momentum': cfg.MOMENTUM, 'weight_decay': cfg.WEIGHT_DECAY
}

wandb.login()  # do not hard-code the API key: set the WANDB_API_KEY environment variable or log in interactively
wandb.init(project="assignment-one", entity="nlpetroni", config=cfg_dict)
wandb.watch(model, log_graph=True)
wandb.define_metric("train_step")
wandb.define_metric("epoch")
wandb.define_metric('train/loss', step_metric="train_step", summary="min")
wandb.define_metric("valid/loss", step_metric="epoch", summary="min")
wandb.define_metric("valid/accuracy", step_metric="epoch", summary="max");

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/diegochine/.netrc
  warn("The `IPython.html` package has been deprecated since IPython 4.0. "


[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`


In [14]:
params = [p for p in model.parameters() if p.requires_grad]
if cfg.OPTIM == 'rmsprop':
    optimizer = torch.optim.RMSprop(params, lr=cfg.LR, alpha=cfg.ALPHA, momentum=cfg.MOMENTUM, weight_decay=cfg.WEIGHT_DECAY)
elif cfg.OPTIM == 'adam':
    optimizer = torch.optim.Adam(params, lr=cfg.LR, betas=cfg.BETAS, weight_decay=cfg.WEIGHT_DECAY)
else:
    raise ValueError(f'wrong optim {cfg.OPTIM}, either rmsprop or adam')
loss = nn.NLLLoss()

train_step = 0

for epoch in range(cfg.EPOCHS):
    print(f'EPOCH {epoch:03d}/{cfg.EPOCHS:03d}')
    log_dict = train_one_epoch(model, optimizer, loss, train_dl, device)
    for batch_loss in log_dict['train/loss']:
        wandb.log({'train_step': train_step, 'epoch': epoch, 'train/loss': batch_loss})
        train_step += 1  # advance the step counter so each batch gets its own x value
    log_dict = evaluate(model, loss, valid_dl, device)
    wandb.log( {'epoch': epoch, 'valid/loss': log_dict['valid/loss'], 'valid/accuracy': log_dict['valid/accuracy']})


EPOCH 000/200
EPOCH 001/200
EPOCH 002/200


KeyboardInterrupt: 

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/diegochine/miniconda3/envs/nlp/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/diegochine/miniconda3/envs/nlp/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/diegochine/miniconda3/envs/nlp/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 149, in check_network_status
    status_response = self._interface.communicate_network_status()
  File "/home/diegochine/miniconda3/envs/nlp/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 120, in communicate_network_status
    resp = self._communicate_network_status(status)
  File "/home/diegochine/miniconda3/envs/nlp/lib/python3.9/site-packages/wandb/sdk/interface/interface_queue.py", line 411, in _communicate_network_status
    resp = self._communicate(req, local=True)
  File "/home/diegochine/miniconda3/envs/nlp/lib/python3.9/site-packages/wandb/