# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: POS tagging, Sequence labelling, RNNs


# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are tasked to address the task of POS tagging.

<center>
    <img src="./images/pos_tagging.png" alt="POS tagging" />
</center>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -rf /content/drive/MyDrive/UNIBO/NLP/Assignments/Assignment-1/data ./
!cp -rf /content/drive/MyDrive/UNIBO/NLP/Assignments/Assignment-1/images ./
!cp /content/drive/MyDrive/UNIBO/NLP/Assignments/Assignment-1/data.csv ./

# [Task 1 - 0.5 points] Corpus

You are going to work with the [Penn TreeBank corpus](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip).

**Ignore** the numeric value in the third column, use **only** the words/symbols and their POS label.

### Example

```Pierre	NNP	2
Vinken	NNP	8
,	,	2
61	CD	5
years	NNS	6
old	JJ	2
,	,	2
will	MD	0
join	VB	8
the	DT	11
board	NN	9
as	IN	9
a	DT	15
nonexecutive	JJ	15
director	NN	12
Nov.	NNP	9
29	CD	16
.	.	8
```

### Splits

The corpus contains 200 documents.

   * **Train**: Documents 1-100
   * **Validation**: Documents 101-150
   * **Test**: Documents 151-199

### Instructions

* **Download** the corpus.
* **Encode** the corpus into a pandas.DataFrame object.
* **Split** it in training, validation, and test sets.

In [5]:
import os
import pandas as pd
import numpy as np

if not os.path.exists('data.csv'):
    print('building dataset')
    path = './data/'
    files = os.listdir(path)
    # read sorted files
    files.sort()
    all_data = []
    for idx, file in enumerate(files):
        raw_data = pd.read_csv(path + file, header=None, sep='\t', dtype=str)
        tokens = raw_data.iloc[:,0].values
        pos_tags = raw_data.iloc[:,1].values
        split = None
        if idx < 100:
            split = 'train'
        elif idx < 150:
            split = 'validation'
        else:
            split = 'test'
        data = (tokens, pos_tags, split)
        all_data.append(data)

    df = pd.DataFrame(all_data)
    df.columns = ['token', 'pos', 'split']
    df.index = df.index + 1
    # save to csv
    df.to_csv('data.csv', index=False)
else:
    print('loading existing dataset')
    df = pd.read_csv('data.csv', dtype=str)
    df.index = df.index + 1

loading existing dataset


In [6]:
df.groupby('split').head()

Unnamed: 0,token,pos,split
1,"['Pierre' 'Vinken' ',' '61' 'years' 'old' ',' ...","['NNP' 'NNP' ',' 'CD' 'NNS' 'JJ' ',' 'MD' 'VB'...",train
2,"['Rudolph' 'Agnew' ',' '55' 'years' 'old' 'and...","['NNP' 'NNP' ',' 'CD' 'NNS' 'JJ' 'CC' 'JJ' 'NN...",train
3,['A' 'form' 'of' 'asbestos' 'once' 'used' 'to'...,['DT' 'NN' 'IN' 'NN' 'RB' 'VBN' 'TO' 'VB' 'NNP...,train
4,['Yields' 'on' 'money-market' 'mutual' 'funds'...,"['NNS' 'IN' 'JJ' 'JJ' 'NNS' 'VBD' 'TO' 'VB' ',...",train
5,"['J.P.' 'Bolduc' ',' 'vice' 'chairman' 'of' 'W...","['NNP' 'NNP' ',' 'NN' 'NN' 'IN' 'NNP' 'NNP' 'C...",train
101,['A' 'House-Senate' 'conference' 'approved' 'm...,['DT' 'NNP' 'NN' 'VBD' 'JJ' 'NNS' 'IN' 'DT' 'N...,validation
102,['Beauty' 'Takes' 'Backseat' 'To' 'Safety' 'on...,['NN' 'VBZ' 'NN' 'TO' 'NNP' 'IN' 'NNPS' 'NN' '...,validation
103,['The' 'Labor' 'Department' 'cited' 'USX' 'Cor...,['DT' 'NNP' 'NNP' 'VBD' 'NNP' 'NNP' 'IN' 'JJ' ...,validation
104,"['Due' 'to' 'an' 'editing' 'error' ',' 'a' 'le...","['JJ' 'TO' 'DT' 'NN' 'NN' ',' 'DT' 'NN' 'TO' '...",validation
105,['Your' 'Oct.' '6' 'editorial' '``' 'The' 'Ill...,['PRP$' 'NNP' 'CD' 'NN' '``' 'NNP' 'NNP' 'NNP'...,validation


In [7]:
all_pos_labels = [i for x in df.pos for i in x]
unique_pos_labels = set(all_pos_labels)

In [8]:
pos2id = {}
id2pos = {}
i = 0
for pos in sorted(unique_pos_labels):
    pos2id[pos] = i
    id2pos[i] = pos
    i+=1

# [Task 2 - 0.5 points] Text encoding

To train a neural POS tagger, you first need to encode text into numerical format.

### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.
* [Optional] You are free to experiment with text pre-processing: **make sure you do not delete any token!**

In [9]:
import torch
from torchtext.vocab import GloVe

embedding_dimension = 50

embedding = GloVe(name='6B', dim=embedding_dimension)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.38MB/s]                           
100%|█████████▉| 399999/400000 [00:15<00:00, 25046.19it/s]


In [10]:
import re
from functools import reduce
import nltk

def lower(texts):
    return [text.lower() for text in texts]

def strip_text(texts):
    return [text.strip() for text in texts]

In [11]:
df['token'] = df['token'].apply(lambda x: strip_text(lower(x)))

# [Task 3 - 1.0 points] Model definition

You are now tasked to define your neural POS tagger.

### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.
* **Model 2**: add an additional Dense layer to the Baseline model.

* **Do not mix Model 1 and Model 2**. Each model has its own instructions.

**Note**: if a document contains many tokens, you are **free** to split them into chunks or sentences to define your mini-batches.

### Baseline

In [None]:
class VariableLengthLSTM(nn.Module):
  def __init__(self, lstm):
    super().__init__()
    self.lstm = lstm

  def forward(self, elements):
    return torch.stack([self.lstm(x)[0] for x in elements])

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Baseline(nn.Module):
    def __init__(self, lstm_dimension, dense_dimension):
        super().__init__()
        bidirectional_layer = nn.LSTM(bidirectional=True, input_size=embedding_dimension, hidden_size=lstm_dimension, batch_first=True)
        self.variablelength_lstm = VariableLengthLSTM(bidirectional_layer)
        self.dense_layer = nn.Linear(in_features=lstm_dimension, out_features=dense_dimension)

    def forward(self, sentences):
        embeds = [embedding.get_vecs_by_tokens(sentence) for sentence in sentences]
        lstm_out, _ = self.variablelength_lstm(embeds)
        dense_out, _ = self.dense_layer(lstm_out.view(len(lstm_out), -1))
        tag_scores = F.log_softmax(dense_out, dim=1)
        return tag_scores

### Model 1

In [None]:
class Model1(nn.Module):
    def __init__(self, lstm_dimension, dense_dimension):
        super().__init__()
        self.num_lstm = 2
        self.bidirectional_layer_1 = nn.LSTM(bidirectional=True, input_size=embedding_dimension, hidden_size=lstm_dimension, batch_first=True)
        self.bidirectional_layer_2 = nn.LSTM(bidirectional=True, input_size=lstm_dimension, hidden_size=lstm_dimension, batch_first=True)
        self.dense_layer = nn.Linear(in_features=lstm_dimension, out_features=dense_dimension)

    def forward(self, sentence):
        embeds = embedding.get_vecs_by_tokens(sentence)
        lstm_out_1, _ = self.bidirectional_layer_1(embeds)
        lstm_out_2, _ = self.bidirectional_layer_2(embeds)
        dense_out, _ = self.dense_layer(lstm_out_2)
        tag_scores = F.log_softmax(dense_out, dim=1)
        return tag_scores

### Model 2

In [None]:
class Model2(nn.Module):
    def __init__(self, lstm_dimension, dense_dimension):
        super().__init__()
        self.bidirectional_layer = nn.LSTM(bidirectional=True, input_size=embedding_dimension, hidden_size=lstm_dimension, batch_first=True)
        self.dense_layer_1 = nn.Linear(in_features=lstm_dimension, out_features=dense_dimension)
        self.dense_layer_2 = nn.Linear(in_features=dense_dimension, out_features=dense_dimension)

    def forward(self, sentence):
        embeds = embedding.get_vecs_by_tokens(sentence)
        lstm_out, _ = self.bidirectional_layer(embeds.view(len(sentence), 1, -1))
        dense_out_1, _ = self.dense_layer_1(lstm_out.view(len(lstm_out), -1))
        dense_out_2, _ = self.dense_layer_2(dense_out_1.view(len(dense_out_1), -1))
        tag_scores = F.log_softmax(dense_out_2, dim=1)
        return tag_scores

# [Task 4 - 1.0 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using macro F1-score, compute over **all** tokens.
* **Concatenate** all tokens in a data split to compute the F1-score. (**Hint**: accumulate FP, TP, FN, TN iteratively)
* **Do not consider punctuation and symbol classes** $\rightarrow$ [What is punctuation?](https://en.wikipedia.org/wiki/English_punctuation)

**Note**: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe are **not** considered as OOV
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)

In [None]:
from torcheval.metrics.functional import multiclass_f1_score

def filter_punctuation(sentence):
    mask = sentence != '.'
    return sentence[mask]

def compute_metrics(ground_truths, predicteds):
    pass

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline, Model 1, and Model 2.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.

In [None]:
from torch.utils.data import Dataset, DataLoader

class PosDataset(Dataset):
    def __init__(self, text, labels):
        self.labels = labels
        self.text = text
    def __len__(self):
            return len(self.labels)
    def __getitem__(self, idx):
            label = self.labels[idx]
            text = self.text[idx]
            sample = (text, label)
            return sample

In [None]:
sentences_train = df[df['split'] == 'train'].token.values
pos_labels_train = df[df['split'] == 'train'].pos.values
pos_labels_train = [[pos2id[pos] for pos in pos_tags] for pos_tags in pos_labels_train]

sentences_test = df[df['split'] == 'test'].token.values
pos_labels_test = df[df['split'] == 'test'].pos.values
pos_labels_test = [[pos2id[pos] for pos in pos_tags] for pos_tags in pos_labels_test]

sentences_validation = df[df['split'] == 'validation'].token.values
pos_labels_validation = df[df['split'] == 'validation'].pos.values

In [None]:
def collate_fn(data):
    return ([x[0] for x in data], [x[1] for x in data])

In [None]:
batch_size = 32

pos_dataset_train = PosDataset(sentences_train, pos_labels_train)
pos_dataloader_train = DataLoader(pos_dataset_train, batch_size=batch_size, shuffle=True, collate_fn= collate_fn)

In [None]:
i = 0
for sentences, pos in pos_dataloader_train:
    print(embedding.get_vecs_by_tokens(sentences[0]))
    break

tensor([[-0.6575, -0.8316, -0.2622,  ...,  0.3867, -0.0968,  0.5784],
        [ 0.1512, -0.0656,  0.4108,  ...,  0.1737, -0.0909,  0.9407],
        [ 0.7085,  0.5709, -0.4716,  ..., -0.2256, -0.0939, -0.8037],
        ...,
        [ 1.1414,  0.0452,  1.8586,  ...,  0.3565, -0.4603, -0.1377],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1516,  0.3018, -0.1676,  ..., -0.3565,  0.0164,  0.1022]])


In [None]:
from torch.nn import CrossEntropyLoss
from torch.optim import Adam

def train(model, epochs, loss_function, dataloader):
    model.train()
    optimizer = Adam(model.parameters())
    for epoch in range(epochs):
        for sentence, pos in dataloader:
            optimizer.zero_grad()
            predicted = model(sentence)
            loss = loss_function(predicted, pos)
            loss.backward()
            optimizer.step()
        print(f'Train epoch [{epoch}/{epochs}] loss: {loss.item()}')

In [None]:
epochs = 10
loss_function = CrossEntropyLoss()

lstm_dimension = 16
dense_dimension = len(unique_pos_labels)

baseline_model = Baseline(lstm_dimension, dense_dimension)
train(baseline_model, epochs, loss_function, pos_dataloader_train)

RuntimeError: ignored

# [Task 6 - 1.0 points] Error Analysis

You are tasked to evaluate your best performing model.

### Instructions

* Compare the errors made on the validation and test sets.
* Aggregate model errors into categories (if possible)
* Comment the about errors and propose possible solutions on how to address them.

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check this frequently asked questions before contacting us

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Error Analysis

Some topics for discussion include:
   * Model performance on most/less frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

### Punctuation

**Do not** remove punctuation from documents since it may be helpful to the model.

You should **ignore** it during metrics computation.

If you are curious, you can run additional experiments to verify the impact of removing punctuation.

# The End