# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: POS tagging, Sequence labelling, RNNs


# Contact

For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

In this assignment, you will address the task of part-of-speech (POS) tagging.

<center>
    <img src="./images/pos_tagging.png" alt="POS tagging" />
</center>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!cp -rf /content/drive/MyDrive/UNIBO/NLP/Assignments/Assignment-1/data ./
!cp -rf /content/drive/MyDrive/UNIBO/NLP/Assignments/Assignment-1/images ./
!cp /content/drive/MyDrive/UNIBO/NLP/Assignments/Assignment-1/data.csv ./

# [Task 1 - 0.5 points] Corpus

You are going to work with the [Penn TreeBank corpus](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip).

**Ignore** the numeric value in the third column; use **only** the words/symbols and their POS labels.

### Example

```
Pierre	NNP	2
Vinken	NNP	8
,	,	2
61	CD	5
years	NNS	6
old	JJ	2
,	,	2
will	MD	0
join	VB	8
the	DT	11
board	NN	9
as	IN	9
a	DT	15
nonexecutive	JJ	15
director	NN	12
Nov.	NNP	9
29	CD	16
.	.	8
```
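A minimal sketch of reading this format, assuming every non-empty line holds exactly three tab-separated fields and that blank lines only act as separators. The helper name `parse_lines` is hypothetical and not part of the assignment.

```python
from typing import Iterable, List, Tuple


def parse_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (token, pos) pairs, discarding the third numeric column."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: blank lines are separators and carry no token.
            continue
        token, pos, _ignored = line.split("\t")
        pairs.append((token, pos))
    return pairs


sample = ["Pierre\tNNP\t2", "Vinken\tNNP\t8", ",\t,\t2", "", "61\tCD\t5"]
print(parse_lines(sample))
```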

### Splits

The corpus contains 199 documents.

   * **Train**: Documents 1-100
   * **Validation**: Documents 101-150
   * **Test**: Documents 151-199

### Instructions

* **Download** the corpus.
* **Encode** the corpus into a pandas.DataFrame object.
* **Split** it in training, validation, and test sets.

In [23]:
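import csv
import os
import pandas as pd

if not os.path.exists('data.csv'):
    path = './data/'
    # sort file names so that document ids follow the corpus order
    files = sorted(os.listdir(path))
    all_data = []
    for idx, file in enumerate(files):
        # QUOTE_NONE keeps quote characters as literal tokens;
        # keep_default_na=False prevents tokens such as "null" from becoming NaN
        data = pd.read_csv(os.path.join(path, file), header=None, sep='\t', dtype=str,
                           quoting=csv.QUOTE_NONE, keep_default_na=False)
        data['doc_id'] = idx + 1
        # documents 1-100: train, 101-150: validation, 151-199: test
        if idx < 100:
            data['split'] = 'train'
        elif idx < 150:
            data['split'] = 'validation'
        else:
            data['split'] = 'test'
        all_data.append(data)

    df = pd.concat(all_data, ignore_index=True)
    # drop the third column (numeric value), which must be ignored
    df.drop(columns=[2], inplace=True)
    df.columns = ['token', 'pos', 'doc_id', 'split']

    # save to csv
    df.to_csv('data.csv', index=False)
else:
    df = pd.read_csv('data.csv', dtype=str, keep_default_na=False)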

In [18]:
df.groupby('split').head()

Unnamed: 0,token,pos,doc_id,split
0,Pierre,NNP,1,train
1,Vinken,NNP,1,train
2,",",",",1,train
3,61,CD,1,train
4,years,NNS,1,train
47356,A,DT,101,validation
47357,House-Senate,NNP,101,validation
47358,conference,NN,101,validation
47359,approved,VBD,101,validation
47360,major,JJ,101,validation


# [Task 2 - 0.5 points] Text encoding

To train a neural POS tagger, you first need to encode text into numerical format.

### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.
* [Optional] You are free to experiment with text pre-processing: **make sure you do not delete any token!**

In [11]:
import torch
from torchtext.vocab import GloVe

embedding_dimension = 50

embedding = GloVe(name='6B', dim=embedding_dimension)



In [19]:
def lower(text: str) -> str:
    # lowercase the token
    return text.lower()

def strip_text(text: str) -> str:
    # remove leading and trailing whitespace
    return text.strip()

In [24]:
# the GloVe 6B vocabulary is lowercase, so normalise tokens before the lookup
df['token'] = df['token'].apply(lambda x: strip_text(lower(x)))

In [37]:
embedding_test = embedding.get_vecs_by_tokens(df['token'][0])

# [Task 3 - 1.0 points] Model definition

You are now tasked to define your neural POS tagger.

### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.
* **Model 2**: add an additional Dense layer to the Baseline model.

* **Do not mix Model 1 and Model 2**. Each model has its own instructions.

**Note**: if a document contains many tokens, you are **free** to split them into chunks or sentences to define your mini-batches.
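One possible way to follow this note is sketched here, assuming fixed-size chunks are acceptable. The function name `chunk_document` and the default `max_len=50` are hypothetical choices, and splitting on sentence boundaries would be an equally valid alternative.

```python
from typing import List, Sequence, Tuple


def chunk_document(
    tokens: Sequence[str], tags: Sequence[str], max_len: int = 50
) -> List[Tuple[List[str], List[str]]]:
    """Split one document into consecutive chunks of at most max_len tokens."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    chunks = []
    for start in range(0, len(tokens), max_len):
        end = start + max_len
        chunks.append((list(tokens[start:end]), list(tags[start:end])))
    return chunks


doc_tokens = ["pierre", "vinken", ",", "61", "years"]
doc_tags = ["NNP", "NNP", ",", "CD", "NNS"]
for chunk_tokens, chunk_tags in chunk_document(doc_tokens, doc_tags, max_len=2):
    print(chunk_tokens, chunk_tags)
```

Chunks are contiguous and cover the whole document, so no token is deleted and the per-token labels stay aligned.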

### Baseline

In [39]:
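import torch.nn as nn
import torch.nn.functional as F

class Baseline(nn.Module):
    def __init__(self, lstm_dimension, dense_dimension):
        super().__init__()
        self.bidirectional_layer = nn.LSTM(bidirectional=True, input_size=embedding_dimension, hidden_size=lstm_dimension)
        # a bidirectional LSTM concatenates forward and backward states,
        # so its output has 2 * lstm_dimension features per token
        self.dense_layer = nn.Linear(in_features=2 * lstm_dimension, out_features=dense_dimension)

    def forward(self, sentence):
        # sentence: list of tokens -> embeds: (seq_len, embedding_dimension)
        embeds = embedding.get_vecs_by_tokens(sentence)
        # nn.LSTM expects (seq_len, batch, features); use a batch of size 1
        lstm_out, _ = self.bidirectional_layer(embeds.view(len(sentence), 1, -1))
        # nn.Linear returns a single tensor, so there is nothing to unpack
        dense_out = self.dense_layer(lstm_out.view(len(lstm_out), -1))
        tag_scores = F.log_softmax(dense_out, dim=1)
        return tag_scores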
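
Model 1 and Model 2 from the instructions could be derived from the same skeleton. The sketch below is one possible reading, not the reference solution: it assumes that "an additional LSTM layer" means stacking a second bidirectional layer via `num_layers=2`, and that "an additional Dense layer" means a hidden linear layer with a ReLU before the output layer. The class name `TaggerSketch`, the dimensions, and the tag count of 45 are placeholders.

```python
import torch
import torch.nn as nn


class TaggerSketch(nn.Module):
    """BiLSTM tagger operating on pre-computed embeddings (illustrative only)."""

    def __init__(self, embedding_dim, hidden_dim, num_tags,
                 num_lstm_layers=1, extra_dense_dim=None):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=num_lstm_layers,
            bidirectional=True,
            batch_first=True,
        )
        lstm_out_dim = 2 * hidden_dim  # forward and backward states concatenated
        if extra_dense_dim is None:
            self.head = nn.Linear(lstm_out_dim, num_tags)
        else:
            self.head = nn.Sequential(
                nn.Linear(lstm_out_dim, extra_dense_dim),
                nn.ReLU(),
                nn.Linear(extra_dense_dim, num_tags),
            )

    def forward(self, embeds):
        # embeds: (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embeds)
        # returns unnormalised scores of shape (batch, seq_len, num_tags)
        return self.head(lstm_out)


num_tags = 45  # placeholder, use the number of POS labels found in the corpus
model_1 = TaggerSketch(50, 64, num_tags, num_lstm_layers=2)
model_2 = TaggerSketch(50, 64, num_tags, extra_dense_dim=32)

batch = torch.randn(2, 7, 50)  # 2 chunks of 7 tokens, 50-dimensional embeddings
print(model_1(batch).shape, model_2(batch).shape)
```

Keeping the two variants behind separate constructor arguments makes it harder to mix them by accident.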

# [Task 4 - 1.0 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using macro F1-score, computed over **all** tokens.
* **Concatenate** all tokens in a data split to compute the F1-score. (**Hint**: accumulate FP, TP, FN, TN iteratively)
* **Do not consider punctuation and symbol classes** $\rightarrow$ [What is punctuation?](https://en.wikipedia.org/wiki/English_punctuation)
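
The hint about accumulating counts iteratively could be implemented roughly as follows. This is a sketch under stated assumptions: the `PUNCT_TAGS` set is an illustrative guess at the punctuation and symbol labels and should be checked against the corpus, tokens whose gold label is in that set are skipped, and the macro average is taken over every remaining class seen in the gold or predicted labels. TN is not needed for F1, so it is not tracked.

```python
from collections import Counter

# Assumption: illustrative list of punctuation/symbol tags, verify on the data.
PUNCT_TAGS = {".", ",", ":", "``", "''", "-LRB-", "-RRB-", "$", "#", "SYM"}


class MacroF1:
    """Accumulates per-class TP/FP/FN over batches, then averages F1."""

    def __init__(self, ignored=PUNCT_TAGS):
        self.ignored = set(ignored)
        self.tp, self.fp, self.fn = Counter(), Counter(), Counter()

    def update(self, gold, pred):
        for g, p in zip(gold, pred):
            if g in self.ignored:
                continue  # punctuation tokens do not contribute
            if g == p:
                self.tp[g] += 1
            else:
                self.fn[g] += 1
                if p not in self.ignored:
                    self.fp[p] += 1

    def compute(self):
        classes = set(self.tp) | set(self.fp) | set(self.fn)
        if not classes:
            return 0.0
        scores = []
        for c in classes:
            denom = 2 * self.tp[c] + self.fp[c] + self.fn[c]
            scores.append(2 * self.tp[c] / denom if denom else 0.0)
        return sum(scores) / len(scores)


gold = ["NNP", "NNP", ",", "CD", "NNS"]
pred = ["NNP", "NN", ",", "CD", "NNS"]
metric = MacroF1()
metric.update(gold, pred)  # call once per batch, then compute() at the end
print(metric.compute())
```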

**Note**: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe are **not** considered as OOV
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)
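
The OOV policy above could look like the following sketch. It uses a toy dictionary in place of the real GloVe vectors, gives each training token missing from GloVe its own random vector, and maps every other unknown token to a single static vector. The names `build_embedding_table`, `lookup`, and the `[OOV]` key are hypothetical, and the random initialisation is only one of the allowed strategies.

```python
import random
from typing import Dict, Iterable, List

OOV_KEY = "[OOV]"  # hypothetical key for the shared static OOV vector


def build_embedding_table(
    pretrained: Dict[str, List[float]],
    train_tokens: Iterable[str],
    dim: int,
    seed: int = 42,
) -> Dict[str, List[float]]:
    rng = random.Random(seed)
    table = dict(pretrained)
    # Training tokens missing from GloVe get their own vector: not OOV.
    for token in train_tokens:
        if token not in table:
            table[token] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    # One fixed vector shared by all validation/test OOV tokens.
    table[OOV_KEY] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return table


def lookup(table: Dict[str, List[float]], token: str) -> List[float]:
    return table.get(token, table[OOV_KEY])


toy_glove = {"the": [0.1, 0.2], "board": [0.3, 0.4]}
table = build_embedding_table(toy_glove, ["the", "vinken"], dim=2)
print(lookup(table, "vinken") is table["vinken"])
print(lookup(table, "unseen-word") is table[OOV_KEY])
```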

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline, Model 1, and Model 2.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.
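
A possible shape for the multi-seed comparison is sketched below. The function `select_best_model` is hypothetical, and `train_and_evaluate` stands for whatever routine trains a model with a given seed and returns its validation macro F1. The `dummy_run` function returns made-up numbers purely so that the sketch executes; it does not reflect any real result.

```python
import statistics
from typing import Callable, Dict, Sequence, Tuple


def select_best_model(
    model_names: Sequence[str],
    seeds: Sequence[int],
    train_and_evaluate: Callable[[str, int], float],
) -> Tuple[str, Dict[str, Tuple[float, float]]]:
    """Return the model with the highest mean validation score and a summary."""
    summary = {}
    for name in model_names:
        scores = [train_and_evaluate(name, seed) for seed in seeds]
        spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
        summary[name] = (statistics.mean(scores), spread)
    best = max(summary, key=lambda name: summary[name][0])
    return best, summary


def dummy_run(model_name: str, seed: int) -> float:
    # Placeholder values, NOT real results. A real implementation would seed
    # the libraries in use, train on the train set, and score on validation.
    made_up = {"baseline": 0.70, "model_1": 0.74, "model_2": 0.72}
    return made_up[model_name] + 0.001 * seed


best, summary = select_best_model(
    ["baseline", "model_1", "model_2"], seeds=[1, 2, 3], train_and_evaluate=dummy_run
)
print(best, summary)
```

Reporting the mean together with the standard deviation shows whether the gap between two models is larger than the variation across seeds.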

# [Task 6 - 1.0 points] Error Analysis

You are tasked to evaluate your best performing model.

### Instructions

* Compare the errors made on the validation and test sets.
* Aggregate model errors into categories (if possible)
* Comment on the errors and propose possible solutions to address them.
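
As a starting point for grouping errors into categories, one option is to count (gold, predicted) confusion pairs on each split and compare the most frequent ones. The helper `confusion_pairs` and the toy label lists below are hypothetical.

```python
from collections import Counter
from typing import Sequence


def confusion_pairs(gold: Sequence[str], pred: Sequence[str]) -> Counter:
    """Count how often each gold label is mistaken for each predicted label."""
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)


# Toy labels for illustration only.
val_gold = ["NN", "NN", "JJ", "VBD", "NNP"]
val_pred = ["NN", "JJ", "NN", "VBN", "NNP"]
test_gold = ["NN", "VBD", "VBD", "JJ"]
test_pred = ["JJ", "VBN", "VBN", "JJ"]

val_errors = confusion_pairs(val_gold, val_pred)
test_errors = confusion_pairs(test_gold, test_pred)
print(val_errors.most_common(3))
print(test_errors.most_common(3))
# Confusions that occur on both splits are likely systematic.
print(set(val_errors) & set(test_errors))
```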

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.
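
In PyTorch, for instance, the two options differ only in the `freeze` argument of `nn.Embedding.from_pretrained`. The sketch below uses a random matrix as a stand-in for the GloVe vectors; with torchtext, the real matrix would presumably come from the `vectors` attribute of the loaded `GloVe` object.

```python
import torch
import torch.nn as nn

# Stand-in for the pre-trained matrix: 5 tokens, 50 dimensions.
pretrained = torch.randn(5, 50)

# Non-trainable: the weights are excluded from gradient updates.
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Trainable: the weights start from the same values and are fine-tuned.
trainable = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

token_ids = torch.tensor([[0, 2, 4]])  # one sequence of three token indices
print(frozen(token_ids).shape)
print(frozen.weight.requires_grad, trainable.weight.requires_grad)
```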

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

### Punctuation

**Do not** remove punctuation from documents since it may be helpful to the model.

You should **ignore** it during metrics computation.

If you are curious, you can run additional experiments to verify the impact of removing punctuation.

# The End