<a href="https://colab.research.google.com/github/Gabry23/NLP-assignment-1/blob/main/NLP2024_Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: POS tagging, Sequence labelling, RNNs


# Contact

For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

Your task is to build a part-of-speech (POS) tagger: a model that assigns a POS label to every token in a sentence.

<center>
    <img src="images/pos_tagging.png" alt="POS tagging" />
</center>

# [Task 1 - 0.5 points] Corpus

You are going to work with the [Penn TreeBank corpus](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip).

**Ignore** the numeric value in the third column, use **only** the words/symbols and their POS label.

### Example

```
Pierre	NNP	2
Vinken	NNP	8
,	,	2
61	CD	5
years	NNS	6
old	JJ	2
,	,	2
will	MD	0
join	VB	8
the	DT	11
board	NN	9
as	IN	9
a	DT	15
nonexecutive	JJ	15
director	NN	12
Nov.	NNP	9
29	CD	16
.	.	8
```
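
As a minimal sketch of how one such block can be read, assuming whitespace-separated columns as in the example above (the helper name `parse_sentence` is hypothetical and not part of the assignment):

```python
def parse_sentence(block):
    """Return (words, tags) from a block of 'word<TAB>tag<TAB>head' lines."""
    words, tags = [], []
    for line in block.splitlines():
        columns = line.split()
        if len(columns) < 2:
            continue  # skip blank or malformed lines
        words.append(columns[0])
        tags.append(columns[1])  # the third column is ignored
    return words, tags


example = "Pierre\tNNP\t2\nVinken\tNNP\t8\n,\t,\t2\n61\tCD\t5"
words, tags = parse_sentence(example)
print(words)  # ['Pierre', 'Vinken', ',', '61']
print(tags)   # ['NNP', 'NNP', ',', 'CD']
```

Punctuation tokens such as `,` are kept, since they are only excluded later, at metric computation time.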

### Splits

The corpus contains 199 documents.

   * **Train**: Documents 1-100
   * **Validation**: Documents 101-150
   * **Test**: Documents 151-199

### Instructions

* **Download** the corpus.
* **Encode** the corpus into a pandas.DataFrame object.
* **Split** it into training, validation, and test sets.

In [None]:
import os
import urllib.request
import zipfile
import pandas as pd
import numpy as np
import random
import tensorflow as tf

from keras.models import Sequential
from keras.layers import Dense, LSTM, GRU, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam
from keras.regularizers import l2
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import normalize


In [None]:
dataset_name = "dependency_treebank"
url = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip'
dataset_path = os.path.join("Datasets", "dependency_treebank.zip")
dataset_folder = os.path.join(os.getcwd(), "Datasets")

if not os.path.exists(dataset_folder):
    os.makedirs(dataset_folder)

if not os.path.exists(dataset_path):
    urllib.request.urlretrieve(url, dataset_path)
    print("Download done for dataset!")

with zipfile.ZipFile(dataset_path,"r") as zip_ref:
    zip_ref.extractall("Datasets")
    print("Extraction done for dataset!")



url_glove = 'https://nlp.stanford.edu/data/glove.6B.zip'
glove_path = os.path.join("Datasets", "glove.6B.zip")
# Use the same "glove" folder that the archive is extracted to and later read from.
dataset_folder_glove = os.path.join(os.getcwd(), "glove")

if not os.path.exists(dataset_folder_glove):
    os.makedirs(dataset_folder_glove)

if not os.path.exists(glove_path):
    urllib.request.urlretrieve(url_glove, glove_path)
    print("Download done for glove!")

with zipfile.ZipFile(glove_path,"r") as zip_ref:
    zip_ref.extractall("glove")
    print("Extraction done for glove!")

Download done for dataset!
Extraction done for dataset!
Download done for glove!
Extraction done for glove!


In [None]:

EMBEDDING_DIMENSION = 50

glove_file = os.path.join(os.getcwd(), "glove", f"glove.6B.{str(EMBEDDING_DIMENSION)}d.txt")
with open(glove_file, encoding = "utf8" ) as text_file:
  lines = text_file.readlines()


embedding_vocabulary = {}

for line in lines:
  splits = line.split()
  embedding_vocabulary[splits[0]] = np.array([float(val) for val in splits[1:]])

In [None]:
def get_embeddings(sentence, vocabulary, embedding_size):

  embeddings = []
  for word in sentence:
    embedding = vocabulary.get(word.lower())
    if embedding is not None:
      embeddings.append(embedding)
    else:
      embeddings.append(list(np.zeros(embedding_size)))

  return embeddings

In [None]:
dataframe_rows = []
row_words = []
row_tags = []
folder = os.path.join("Datasets", "dependency_treebank")


for filename in sorted(os.listdir(folder)):
  file_path = os.path.join(folder, filename)
  if os.path.isfile(file_path):


    with open(file_path, mode = "r") as text_file:
      while True:
        line = text_file.readline()


        if line and line != "\n":
          row_words.append(line.split()[0])
          row_tags.append(line.split()[1])

        else:
          dataframe_row = {"file_id": int(filename.split(".")[0].split("_")[1]),
                           "sentence": row_words,
                           "tags": row_tags,
                           "features": get_embeddings(row_words, embedding_vocabulary, EMBEDDING_DIMENSION)}
          dataframe_rows.append(dataframe_row)
          row_words = []
          row_tags = []

          if not line: break

dataframe = pd.DataFrame(dataframe_rows)
dataframe.head()

|   | file_id | sentence | tags | features |
|---|---|---|---|---|
| 0 | 1 | [Pierre, Vinken, ,, 61, years, old, ,, will, j... | [NNP, NNP, ,, CD, NNS, JJ, ,, MD, VB, DT, NN, ... | [[0.23568, 0.39638, -0.60135, -0.52681, 0.1587... |
| 1 | 1 | [Mr., Vinken, is, chairman, of, Elsevier, N.V.... | [NNP, NNP, VBZ, NN, IN, NNP, NNP, ,, DT, NNP, ... | [[0.006008, 0.57028, -0.064426, -0.044687, 0.8... |
| 2 | 2 | [Rudolph, Agnew, ,, 55, years, old, and, forme... | [NNP, NNP, ,, CD, NNS, JJ, CC, JJ, NN, IN, NNP... | [[0.86274, 0.056588, -0.081828, -0.35318, -0.0... |
| 3 | 3 | [A, form, of, asbestos, once, used, to, make, ... | [DT, NN, IN, NN, RB, VBN, TO, VB, NNP, NN, NNS... | [[0.21705, 0.46515, -0.46757, 0.10082, 1.0135,... |
| 4 | 3 | [The, asbestos, fiber, ,, crocidolite, ,, is, ... | [DT, NN, NN, ,, NN, ,, VBZ, RB, JJ, IN, PRP, V... | [[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -... |


In [None]:
val_num=50
test_num=49

len_doc = dataframe["file_id"].nunique()

df_train = None
df_test = None
df_val = None
train_num = len_doc - test_num - val_num

df_train = dataframe.loc[dataframe["file_id"].isin(range(train_num + 1))]
df_val = dataframe.loc[dataframe["file_id"].isin(range(train_num + 1, train_num + val_num + 1))]
df_test = dataframe.loc[dataframe["file_id"].isin(range(train_num + val_num + 1, len_doc + 1))]

print('Length of training dataset: ', len(df_train))
print('Length of validation dataset: ', len(df_val))
print('Length of testing dataset: ', len(df_test))

Length of training dataset:  1963
Length of validation dataset:  1299
Length of testing dataset:  652


In [None]:

from keras.utils import pad_sequences

MAX_LENGTH = len(max(df_train["sentence"].tolist(), key = len))

train_features_padded = pad_sequences(df_train["features"].tolist(), maxlen = MAX_LENGTH, padding = "post", dtype = "float32")
validation_features_padded = pad_sequences(df_val["features"].tolist(), maxlen = MAX_LENGTH, padding = "post", dtype = "float32")
test_features_padded = pad_sequences(df_test["features"].tolist(), maxlen = MAX_LENGTH, padding = "post", dtype = "float32")

print(f"The maximum length is {MAX_LENGTH}, thus:")
print(f" - The shape of train_features is: ({len(train_features_padded)}, {len(train_features_padded[0])}, {len(train_features_padded[0, 0])}).")
print(f" - The shape of validation_features is: ({len(validation_features_padded)}, {len(validation_features_padded[0])}, {len(validation_features_padded[0, 0])}).")
print(f" - The shape of test_features is: ({len(test_features_padded)}, {len(test_features_padded[0])}, {len(test_features_padded[0, 0])}).")


The maximum length is 249, thus:
 - The shape of train_features is: (1963, 249, 50).
 - The shape of validation_features is: (1299, 249, 50).
 - The shape of test_features is: (652, 249, 50).


In [None]:
train_tags = [item for sublist in df_train["tags"].tolist() for item in sublist]
validation_tags = [item for sublist in df_val["tags"].tolist() for item in sublist]
test_tags = [item for sublist in df_test["tags"].tolist() for item in sublist]

tags = list(dict.fromkeys(train_tags))

tag_to_index = {}
tag_to_index["PAD"] = 0
for i, tag in enumerate(list(tags)):
  tag_to_index[tag] = i + 1

train_tags = [[tag_to_index[tag] for tag in tags_list] for tags_list in list(df_train["tags"])]
validation_tags = [[tag_to_index[tag] for tag in tags_list] for tags_list in list(df_val["tags"])]
test_tags = [[tag_to_index[tag] for tag in tags_list] for tags_list in list(df_test["tags"])]


train_tags_padded = pad_sequences(train_tags, maxlen = MAX_LENGTH, padding = "post")
validation_tags_padded = pad_sequences(validation_tags, maxlen = MAX_LENGTH, padding = "post")
test_tags_padded = pad_sequences(test_tags, maxlen = MAX_LENGTH, padding = "post")

print(f"The shape of train_tags is: ({len(train_tags_padded)}, {len(train_tags_padded[0])}).")
print(f"The shape of validation_tags is: ({len(validation_tags_padded)}, {len(validation_tags_padded[0])}).")
print(f"The shape of test_tags is: ({len(test_tags_padded)}, {len(test_tags_padded[0])}).")

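# Count tag occurrences on the original string tags: test_tags now holds lists of integer ids.
test_counts = [sum(tags_list.count(tag) for tags_list in df_test["tags"]) for tag in tags]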

The shape of train_tags is: (1963, 249).
The shape of validation_tags is: (1299, 249).
The shape of test_tags is: (652, 249).


# [Task 2 - 0.5 points] Text encoding

To train a neural POS tagger, you first need to encode text into numerical format.

### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.
* [Optional] You are free to experiment with text pre-processing: **make sure you do not delete any token!**

In [None]:
punctuation_tag_list = ["PAD", ",", ".", "``", "''", ":", "$", "-LRB-", "-RRB-", "SYM", "LS", "#"]

figures_path = os.path.join(os.getcwd(), "figures")
if not os.path.exists(figures_path):
  os.makedirs(figures_path)

def plot_classes_distribution(classes, counts, filename, figures_path = figures_path):

  fig, ax = plt.subplots(1, 1, figsize = (9, 4))
  bars = ax.bar(np.arange(0, len(classes), 1), counts)


  for i in range(len(classes)):
    if classes[i] in punctuation_tag_list: bars[i].set_alpha(0.5)

  ax.set_xlabel("Class")
  ax.set_ylabel("Count")
  ax.set_yscale("log")
  ax.set_xticks(np.arange(0, len(classes), 1))
  ax.set_xticklabels(classes, rotation = 90)

  fig.savefig(f"{figures_path}/{filename}_classes_distribution.pdf", bbox_inches = "tight")
  plt.show()

# [Task 3 - 1.0 points] Model definition

You are now tasked to define your neural POS tagger.

### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.
* **Model 2**: add an additional Dense layer to the Baseline model.

* **Do not mix Model 1 and Model 2**. Each model has its own instructions.

**Note**: if a document contains many tokens, you are **free** to split them into chunks or sentences to define your mini-batches.
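
One way to read the note above is sketched below: a long token sequence is cut into consecutive fixed-size chunks, keeping words and tags aligned. The helper `split_into_chunks` and the chunk size are illustrative assumptions, not requirements of the assignment.

```python
def split_into_chunks(tokens, tags, chunk_size):
    """Split parallel token/tag lists into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [
        (tokens[start:start + chunk_size], tags[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


tokens = ["The", "board", "will", "meet", "Nov.", "29", "."]
tags = ["DT", "NN", "MD", "VB", "NNP", "CD", "."]
chunks = split_into_chunks(tokens, tags, chunk_size=3)
for chunk_tokens, chunk_tags in chunks:
    print(chunk_tokens, chunk_tags)
```

No token is dropped: concatenating the chunks gives back the original sequence, and only the last chunk may be shorter than `chunk_size`.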

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Bidirectional, LSTM, TimeDistributed, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau


models_name = ["m_0", "m_1", "m_2"]
descriptions_dict = {models_name[0]: (f"Baseline model ({models_name[0]}): \n"
                                      " - Bi-directional LSTM layer. \n"
                                      " - Time-distributed dense layer. \n"
                                      " - Softmax activation function."),
                     models_name[1]: (f"Additional bi-directional LSTM model ({models_name[1]}): \n"
                                      " - Bi-directional LSTM layer. \n"
                                      " - Bi-directional LSTM layer. \n"
                                      " - Time-distributed dense layer. \n"
                                      " - Softmax activation function."),
                     models_name[2]: (f"Additional dense layer model ({models_name[2]}): \n"
                                      " - Bi-directional LSTM layer. \n"
                                      " - Time-distributed dense layer. \n"
                                      " - ReLU activation function. \n"
                                      " - Time-distributed dense layer. \n"
                                      " - Softmax activation function.")}


models = {}

BATCH_SIZE = 128
EPOCHS = 100
LR = 0.01
REG = 0.01
early_stopping = EarlyStopping(monitor = "val_loss", patience = 5, restore_best_weights = True)
reduce_lr = ReduceLROnPlateau(monitor = "val_loss", patience = 3, factor = 0.1)


def get_model(name, layers, input_shape):

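  # Pass the name to the constructor instead of setting the private `_name` attribute.
  model = Sequential(name = name)
  model.add(InputLayer(input_shape = input_shape))
  for layer in layers:
    model.add(layer)
  # Output layer: one softmax over the tag vocabulary (including PAD) per token.
  model.add(TimeDistributed(Dense(len(tag_to_index), activation = "softmax")))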


  return model

def grid_search(model_name, units, best_baseline_LSTM_units = None):


  models = []
  histories = []

  print(f"Grid-search, {model_name} model.")


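  for n in units:
    # Baseline: a single bidirectional LSTM layer with n units.
    if model_name == "m_0":
      layers = [Bidirectional(LSTM(n, return_sequences = True, recurrent_regularizer = l2(REG)))]
    # Model 1: the best baseline LSTM followed by a second bidirectional LSTM with n units.
    elif model_name == "m_1" and best_baseline_LSTM_units is not None:
      layers = [Bidirectional(LSTM(best_baseline_LSTM_units, return_sequences = True, recurrent_regularizer = l2(REG))),
                Bidirectional(LSTM(n, return_sequences = True, recurrent_regularizer = l2(REG)))]
    # Model 2: the best baseline LSTM followed by an extra dense layer with n units.
    elif model_name == "m_2" and best_baseline_LSTM_units is not None:
      layers = [Bidirectional(LSTM(best_baseline_LSTM_units, return_sequences = True, recurrent_regularizer = l2(REG))),
                TimeDistributed(Dense(n, activation = "relu"))]
    else:
      # Fail early rather than continuing with undefined or stale layers.
      raise ValueError(f"Cannot build {model_name}: unknown model name or missing baseline units.")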

    model = get_model(name = model_name, layers = layers, input_shape = (MAX_LENGTH, EMBEDDING_DIMENSION))
    model.compile(loss = "sparse_categorical_crossentropy", optimizer = Adam(LR), metrics = ["accuracy"])
    models.append(model)
    print(f"\nNumber of units: {n}.\n")
    models[-1].summary()


    history = models[-1].fit(train_features_padded, train_tags_padded, batch_size = BATCH_SIZE, epochs = EPOCHS, validation_data = (validation_features_padded, validation_tags_padded), callbacks = [early_stopping, reduce_lr])
    histories.append(history)


  return models, histories

# [Task 4 - 1.0 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using the macro F1-score, computed over **all** tokens.
* **Concatenate** all tokens in a data split to compute the F1-score. (**Hint**: accumulate FP, TP, FN, TN iteratively)
* **Do not consider punctuation and symbol classes** $\rightarrow$ [What is punctuation?](https://en.wikipedia.org/wiki/English_punctuation)
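
The hint about accumulating counts can be sketched as follows. This is a minimal NumPy illustration under assumed names (`update_counts`, `macro_f1`, and the toy class ids are hypothetical): per-class TP, FP, and FN are accumulated batch by batch, and the macro F1-score is then averaged over the classes that are not ignored. Classes with no true or predicted occurrences are scored 0 here, which is one possible convention.

```python
import numpy as np


def update_counts(counts, y_true, y_pred, num_classes):
    """Accumulate per-class TP, FP, FN from one batch of flat label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    for c in range(num_classes):
        counts["tp"][c] += np.sum((y_true == c) & (y_pred == c))
        counts["fp"][c] += np.sum((y_true != c) & (y_pred == c))
        counts["fn"][c] += np.sum((y_true == c) & (y_pred != c))
    return counts


def macro_f1(counts, ignored_classes):
    """Average per-class F1 over all classes that are not ignored."""
    scores = []
    for c in range(len(counts["tp"])):
        if c in ignored_classes:
            continue
        tp, fp, fn = counts["tp"][c], counts["fp"][c], counts["fn"][c]
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator > 0 else 0.0)
    return float(np.mean(scores))


# Toy setup: class 0 is padding, class 3 is a punctuation tag; both are ignored.
num_classes = 4
counts = {key: np.zeros(num_classes, dtype=int) for key in ("tp", "fp", "fn")}
batches = [
    ([1, 2, 1, 3, 0], [1, 1, 1, 3, 0]),
    ([2, 2, 1, 0], [2, 1, 1, 0]),
]
for y_true, y_pred in batches:
    update_counts(counts, y_true, y_pred, num_classes)

score = macro_f1(counts, ignored_classes={0, 3})
print(score)  # 0.625
```

In this toy run class 1 has F1 = 0.75 and class 2 has F1 = 0.5, so the macro average is 0.625.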

In [None]:
punctuation_tag_list = ["PAD", ",", ".", "``", "''", ":", "$", "-LRB-", "-RRB-", "SYM", "LS", "#"]

In [None]:
def compute_F1_score(model, X, y, tag_to_index_vocabulary):

  pred = model.predict(X)
  report = classification_report(y.flatten(),
                                 np.argmax(pred, axis = 2).flatten(),
                                 labels = np.arange(0, len(tag_to_index_vocabulary), 1),
                                 target_names = list(tag_to_index_vocabulary.keys()),
                                 zero_division = 0,
                                 output_dict = True)

  macro_f1 = 0
  for tag in list(tag_to_index_vocabulary.keys()):
    if tag not in punctuation_tag_list:
      macro_f1 = macro_f1 + report[tag]["f1-score"]
  macro_f1 = macro_f1 / (len(list(tag_to_index_vocabulary.keys())) - len(punctuation_tag_list))
  return macro_f1, pred, report

def get_best_model(models, units):

  f1_scores = [compute_F1_score(model, validation_features_padded, validation_tags_padded, tag_to_index)[0] for model in models]
  best = units[np.argmax(f1_scores)]
  print(f"The best number of units is: {best}.")

  return best, f1_scores, models[np.argmax(f1_scores)]

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline, Model 1, and Model 2.

In [None]:
baseline_units = [32, 64, 128, 256]
baseline_models, baseline_model_histories = grid_search(models_name[0], baseline_units)

Grid-search, m_0 model.

Number of units: 32.

Model: "m_0"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional (Bidirection  (None, 249, 64)           21248     
 al)                                                             
                                                                 
 time_distributed (TimeDist  (None, 249, 46)           2990      
 ributed)                                                        
                                                                 
Total params: 24238 (94.68 KB)
Trainable params: 24238 (94.68 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 

In [None]:
baseline_best_units, baseline_f1_scores, models[models_name[0]] = get_best_model(baseline_models, baseline_units)


The best number of units is: 256.


In [None]:
double_lstm_units = [32, 64, 128, 256]
double_lstm_models, double_lstm_model_histories = grid_search(models_name[1], double_lstm_units, baseline_best_units)


Grid-search, m_1 model.

Number of units: 32.

Model: "m_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_4 (Bidirecti  (None, 249, 512)          628736    
 onal)                                                           
                                                                 
 bidirectional_5 (Bidirecti  (None, 249, 64)           139520    
 onal)                                                           
                                                                 
 time_distributed_4 (TimeDi  (None, 249, 46)           2990      
 stributed)                                                      
                                                                 
Total params: 771246 (2.94 MB)
Trainable params: 771246 (2.94 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/10

In [None]:
_, double_lstm_f1_scores, models[models_name[1]] = get_best_model(double_lstm_models, double_lstm_units)


The best number of units is: 128.


In [None]:
double_dense_units = [32, 64, 128, 256]
double_dense_models, double_dense_model_histories = grid_search(models_name[2], double_dense_units, baseline_best_units)



Grid-search, m_2 model.

Number of units: 32.

Model: "m_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_12 (Bidirect  (None, 249, 512)          628736    
 ional)                                                          
                                                                 
 time_distributed_8 (TimeDi  (None, 249, 32)           16416     
 stributed)                                                      
                                                                 
 time_distributed_9 (TimeDi  (None, 249, 46)           1518      
 stributed)                                                      
                                                                 
Total params: 646670 (2.47 MB)
Trainable params: 646670 (2.47 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/10

In [None]:
_, double_dense_f1_scores, models[models_name[2]] = get_best_model(double_dense_models, double_dense_units)


The best number of units is: 256.


### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.

In [None]:
models_val_score = {}
best_val_pred = {}
models_val_report = {}


for model in models:
  print(f"{descriptions_dict[model]}\n")
  models_val_score[model], best_val_pred[model], models_val_report[model] = compute_F1_score(models[model], validation_features_padded, validation_tags_padded, tag_to_index)
  print(f"The macro F1-score for model {model} is: {models_val_score[model]}.\n")


best_models = sorted(models_val_score, key = models_val_score.get, reverse = True)[:2]

Baseline model (m_0): 
 - Bi-directional LSTM layer. 
 - Time-distributed dense layer. 
 - Softmax activation function.

The macro F1-score for model m_0 is: 0.7667518693664673.

Additional bi-directional LSTM model (m_1): 
 - Bi-directional LSTM layer. 
 - Bi-directional LSTM layer. 
 - Time-distributed dense layer. 
 - Softmax activation function.

The macro F1-score for model m_1 is: 0.7750726757833377.

Additional dense layer model (m_2): 
 - Bi-directional LSTM layer. 
 - Time-distributed dense layer. 
 - ReLU activation function. 
 - Time-distributed dense layer. 
 - Softmax activation function.

The macro F1-score for model m_2 is: 0.7605632972975537.



In [None]:
models_test_score = {}
best_test_pred = {}
models_test_report = {}

for model in best_models:
  print(f"{descriptions_dict[model]}\n")
  models_test_score[model], best_test_pred[model], models_test_report[model] = compute_F1_score(models[model], test_features_padded, test_tags_padded, tag_to_index)
  print(f"The macro F1-score, on the test set, for model {model} is: {models_test_score[model]}.\n")

Additional bi-directional LSTM model (m_1): 
 - Bi-directional LSTM layer. 
 - Bi-directional LSTM layer. 
 - Time-distributed dense layer. 
 - Softmax activation function.

The macro F1-score, on the test set, for model m_1 is: 0.7735930620632316.

Baseline model (m_0): 
 - Bi-directional LSTM layer. 
 - Time-distributed dense layer. 
 - Softmax activation function.

The macro F1-score, on the test set, for model m_0 is: 0.7764972851995102.



# [Task 6 - 1.0 points] Error Analysis

You are tasked to evaluate your best performing model.

In [None]:
def get_masked_labels(true_labels, pred_labels, tag_to_index, punctuation_tag_list):

  punctuation_indexes = [tag_to_index[tag] for tag in punctuation_tag_list]

  mask = np.isin(true_labels, punctuation_indexes)
  true = np.delete(true_labels, mask)
  pred = np.delete(pred_labels, mask)

  return true, pred, punctuation_indexes

for model in best_models:

  model_val_pred = np.argmax(best_val_pred[model], axis = 2).flatten()
  true, pred, _ = get_masked_labels(validation_tags_padded.flatten(), model_val_pred, tag_to_index, punctuation_tag_list)
  print("The error rate, on the validation set, for model {} is: {}%.".format(model, np.sum(true != pred) * 100 / len(true)))

  model_test_pred = np.argmax(best_test_pred[model], axis = 2).flatten()
  true, pred, _ = get_masked_labels(test_tags_padded.flatten(), model_test_pred, tag_to_index, punctuation_tag_list)
  print("The error rate, on the test set, for model {} is: {}%.".format(model, np.sum(true != pred) * 100 / len(true)))


The error rate, on the validation set, for model m_1 is: 10.410359292358198%.
The error rate, on the test set, for model m_1 is: 9.54226381983036%.
The error rate, on the validation set, for model m_0 is: 11.161772752142987%.
The error rate, on the test set, for model m_0 is: 9.93711611582334%.


In [None]:
for model in best_models:

  val_precision = 0
  test_precision = 0
  val_recall = 0
  test_recall = 0

  for tag in list(tag_to_index.keys()):

    if tag not in punctuation_tag_list:
      val_precision = val_precision + models_val_report[model][tag]["precision"]
      test_precision = test_precision + models_test_report[model][tag]["precision"]
      val_recall = val_recall + models_val_report[model][tag]["recall"]
      test_recall = test_recall + models_test_report[model][tag]["recall"]

  val_precision = val_precision / (len(list(tag_to_index.keys())) - len(punctuation_tag_list))
  test_precision = test_precision / (len(list(tag_to_index.keys())) - len(punctuation_tag_list))
  val_recall = val_recall / (len(list(tag_to_index.keys())) - len(punctuation_tag_list))
  test_recall = test_recall / (len(list(tag_to_index.keys())) - len(punctuation_tag_list))

  print(f"The macro precision, on the validation set, for model {model} is: {val_precision}.")
  print(f"The macro precision, on the test set, for model {model} is: {test_precision}.")
  print(f"The macro recall, on the validation set, for model {model} is: {val_recall}.")
  print(f"The macro recall, on the test set, for model {model} is: {test_recall}.")

The macro precision, on the validation set, for model m_1 is: 0.7839906594391554.
The macro precision, on the test set, for model m_1 is: 0.7755735456258689.
The macro recall, on the validation set, for model m_1 is: 0.7779439873104167.
The macro recall, on the test set, for model m_1 is: 0.7826815825372172.
The macro precision, on the validation set, for model m_0 is: 0.7698285926465419.
The macro precision, on the test set, for model m_0 is: 0.7810172057964292.
The macro recall, on the validation set, for model m_0 is: 0.7756780444207914.
The macro recall, on the test set, for model m_0 is: 0.7806723478763761.


In [None]:
TOP_ERROR_RATES = 5

for model in best_models:

  print(f"Model {model}:\n")

  # Error rates.
  error_rates = []
  for i in range(len(test_tags)):
    true, pred, punctuation_indexes = get_masked_labels(test_tags_padded[i], np.argmax(best_test_pred[model], axis = 2)[i], tag_to_index, punctuation_tag_list)
    error_rates.append(np.sum(true != pred) * 100 / len(true))

  most_mistakes = np.argpartition(error_rates, -TOP_ERROR_RATES)[-TOP_ERROR_RATES:]

  for i in range(TOP_ERROR_RATES):
    true, pred, _ = get_masked_labels(test_tags_padded[most_mistakes[i]], np.argmax(best_test_pred[model], axis = 2)[most_mistakes[i]], tag_to_index, punctuation_tag_list)
    true = [tag for index in true for tag, value in tag_to_index.items() if value == index]
    pred = [tag for index in pred for tag, value in tag_to_index.items() if value == index]
    print("The true tags, without punctuation, are {}, while the predicted ones are {}.\n".format(true, pred))
  print()

Model m_1:

The true tags, without punctuation, are ['NNP', 'VBZ', 'JJ', 'NNS', 'CC', 'VBZ', 'NN', 'NNS', 'CC', 'JJ', 'NN', 'NNS'], while the predicted ones are ['NN', 'NN', 'NN', 'NNS', 'CC', 'NN', 'NN', 'NNS', 'CC', 'NN', 'NN', 'NN'].

The true tags, without punctuation, are ['NNP', 'VBZ', 'DT', 'JJ', 'JJ', 'NN', 'NN'], while the predicted ones are ['VBZ', 'VBZ', 'DT', 'NNP', 'NNP', 'NNP', 'NN'].

The true tags, without punctuation, are ['NN', 'NNS', 'CC', 'NN'], while the predicted ones are ['NNP', 'NNPS', 'CC', 'NNP'].

The true tags, without punctuation, are ['NNP'], while the predicted ones are ['NN'].

The true tags, without punctuation, are ['NNPS', 'NNP', 'NNPS'], while the predicted ones are ['NNS', 'CC', 'NNP'].


Model m_0:

The true tags, without punctuation, are ['NNP', 'NNPS', 'NNP', 'NNS', 'VBD', 'RB', 'VBN', 'NN', 'CC', 'NN', 'VBD', 'IN', 'DT', 'NNS', 'NNS'], while the predicted ones are ['JJ', 'NNS', 'NNS', 'NNS', 'RB', 'IN', 'VBD', 'NN', 'CC', 'NN', 'VBD', 'IN', 'DT'

### More about OOV

For a given token:

* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).
* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.

Your vocabulary **should**:

* Contain all tokens in train set; or
* Union of tokens in train set and in GloVe $\rightarrow$ we make use of existing knowledge!


**Note**: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., [UNK]) and a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)
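
The OOV rules above can be sketched as follows. This is an illustration only: the toy `glove` dictionary, the random initialisation for training tokens missing from GloVe, and the all-zero `[UNK]` vector are assumptions, and lowercasing is left out for brevity.

```python
import numpy as np

EMBEDDING_DIM = 4  # tiny dimension, for illustration only
rng = np.random.default_rng(42)

# Toy stand-in for the GloVe dictionary (word -> vector).
glove = {"the": np.ones(EMBEDDING_DIM), "board": np.full(EMBEDDING_DIM, 0.5)}

train_tokens = ["the", "board", "nonexecutive"]
UNK = "[UNK]"

# Start from the GloVe vectors, then add training tokens that GloVe lacks.
embeddings = dict(glove)
for token in train_tokens:
    if token not in embeddings:
        embeddings[token] = rng.uniform(-0.05, 0.05, EMBEDDING_DIM)

# One static vector shared by every OOV token seen at validation/test time.
embeddings[UNK] = np.zeros(EMBEDDING_DIM)


def embed(tokens):
    """Map tokens to vectors, falling back to the [UNK] vector."""
    return np.stack([embeddings.get(token, embeddings[UNK]) for token in tokens])


features = embed(["the", "nonexecutive", "crocidolite"])
print(features.shape)  # (3, 4)
```

Here `nonexecutive` gets its own (random) vector because it occurs in the training tokens, while `crocidolite` is unseen and falls back to `[UNK]`.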

### Token to embedding mapping

You can follow two approaches for encoding tokens in your POS tagger.

### Work directly with embeddings

- Compute the embedding of each input token
- Feed the mini-batches of shape (batch_size, # tokens, embedding_dim) to your model

### Work with Embedding layer

- Encode input tokens to token ids
- Define an Embedding layer as the first layer of your model
- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)
- Initialize the Embedding layer with the computed embedding matrix
- You are **free** to set the Embedding layer trainable or not
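
The Embedding-layer approach can be sketched with plain NumPy, assuming a toy vocabulary in which id 0 is reserved for padding and id 1 for `[UNK]` (both choices, and all names below, are illustrative):

```python
import numpy as np

EMBEDDING_DIM = 3
PAD, UNK = "[PAD]", "[UNK]"

# Toy pretrained vectors standing in for GloVe.
pretrained = {"the": np.array([0.1, 0.2, 0.3]), "board": np.array([0.4, 0.5, 0.6])}
vocabulary = [PAD, UNK, "the", "board"]
token_to_id = {token: index for index, token in enumerate(vocabulary)}

# Row i holds the vector of the token with id i; PAD and UNK rows stay zero here.
embedding_matrix = np.zeros((len(vocabulary), EMBEDDING_DIM), dtype="float32")
for token, index in token_to_id.items():
    if token in pretrained:
        embedding_matrix[index] = pretrained[token]


def encode(tokens):
    """Map tokens to ids, using the [UNK] id for unknown tokens."""
    return [token_to_id.get(token, token_to_id[UNK]) for token in tokens]


ids = encode(["the", "board", "asbestos"])
print(ids)  # [2, 3, 1]
vectors = embedding_matrix[ids]  # the same row lookup an Embedding layer performs
print(vectors.shape)  # (3, 3)
```

In Keras, a matrix like this would typically be used to initialise the `Embedding` layer, and reserving id 0 for padding makes it possible to enable the layer's `mask_zero` option.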

### Padding

Pay attention to padding tokens!

Your model **should not** be penalized on those tokens.

#### How to?

There are two main ways.

However, their implementation depends on the neural library you are using.

- Embedding layer
- Custom loss to compute average cross-entropy on non-padding tokens only

**Note**: This is a **recommendation**, but we **do not penalize** for missing workarounds.
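
The second option, a loss averaged over non-padding tokens only, can be illustrated independently of any neural library. The NumPy sketch below assumes that padding positions carry label 0; the function name `masked_cross_entropy` is hypothetical, and a real training loss would be written with the tensor operations of the chosen library.

```python
import numpy as np


def masked_cross_entropy(probabilities, labels, pad_index=0):
    """Mean negative log-likelihood over tokens whose label is not pad_index.

    probabilities: array of shape (batch, tokens, classes), rows summing to 1.
    labels: integer array of shape (batch, tokens).
    """
    probabilities = np.asarray(probabilities, dtype="float64")
    labels = np.asarray(labels)
    mask = labels != pad_index
    # Probability assigned to the gold class of every token.
    gold = np.take_along_axis(probabilities, labels[..., None], axis=-1)[..., 0]
    losses = -np.log(np.clip(gold, 1e-12, 1.0))
    # Padding positions contribute nothing to the sum or to the token count.
    return float(np.sum(losses * mask) / max(np.sum(mask), 1))


probabilities = np.array([[[0.1, 0.8, 0.1],
                           [0.2, 0.2, 0.6],
                           [0.9, 0.05, 0.05]]])
labels = np.array([[1, 2, 0]])  # the last position is padding
loss = masked_cross_entropy(probabilities, labels)
print(loss)
```

Only the first two tokens count, so the result equals the average of `-log(0.8)` and `-log(0.6)`, regardless of what the model predicts at the padded position.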

### Instructions

* Compare the errors made on the validation and test sets.
* Aggregate model errors into categories (if possible)
* Comment on the errors and propose possible solutions to address them.

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Execution Order

You are **free** to address tasks in any order (if multiple orderings are available).

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Robust Evaluation

Each model is trained with at least 3 random seeds.

Task 4 requires you to compute the average performance over the 3 seeds and its corresponding standard deviation.
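
A minimal sketch of the seed protocol is shown below. The function `run_experiment` is a hypothetical placeholder that returns a dummy score; a real run would seed the neural library as well (for TensorFlow, `tf.keras.utils.set_random_seed` is one option), train the model, and return its validation macro F1-score. The seed values are arbitrary.

```python
import random

import numpy as np


def run_experiment(seed):
    """Hypothetical placeholder: seed everything, train, return validation macro F1."""
    random.seed(seed)
    np.random.seed(seed)
    # A real run would also seed the neural library and train the model here.
    return 0.75 + 0.01 * np.random.rand()  # dummy value, not a real result


seeds = [42, 1337, 2024]
scores = np.array([run_experiment(seed) for seed in seeds])
print(f"macro F1: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Reporting the mean together with the standard deviation makes it easier to tell whether a gap between two models is larger than the seed-to-seed variation.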

### Model Selection for Analysis

To carry out the error analysis you are **free** to either

* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)
* Perform ensembling via, for instance, majority voting to obtain a single model.
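
Majority voting over seed runs can be sketched as follows, assuming each run has already been reduced to one predicted label per token (the helper `majority_vote` and the tie-breaking rule are illustrative choices):

```python
import numpy as np


def majority_vote(predictions, num_classes):
    """Combine label predictions of shape (models, tokens) into one label per token."""
    predictions = np.asarray(predictions)
    # votes[c, t] = number of models that predicted class c for token t.
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(num_classes)])
    # argmax returns the lowest class index when several classes are tied.
    return votes.argmax(axis=0)


seed_predictions = [
    [1, 2, 2, 3],
    [1, 2, 3, 3],
    [2, 2, 3, 0],
]
ensemble = majority_vote(seed_predictions, num_classes=4)
print(ensemble.tolist())  # [1, 2, 3, 3]
```

An alternative with the same spirit is to average the predicted probabilities of the seed models before taking the argmax, which avoids ties in most cases.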

### Error Analysis

Some topics for discussion include:
   * Model performance on the most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.
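
As one possible starting point for the confusion-matrix analysis, the NumPy sketch below builds a confusion matrix and lists the most frequent confusions. The helper names and the toy labels are assumptions made for illustration.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix


def top_confusions(matrix, index_to_tag, k=3):
    """Return up to k (true tag, predicted tag, count) pairs with the most errors."""
    errors = matrix.copy()
    np.fill_diagonal(errors, 0)  # correct predictions are not errors
    order = np.argsort(errors, axis=None)[::-1][:k]
    result = []
    for flat_index in order:
        t, p = np.unravel_index(flat_index, errors.shape)
        if errors[t, p] > 0:
            result.append((index_to_tag[int(t)], index_to_tag[int(p)], int(errors[t, p])))
    return result


index_to_tag = {0: "NN", 1: "NNP", 2: "JJ"}
y_true = [1, 1, 1, 0, 2, 2, 0]
y_pred = [0, 0, 1, 0, 0, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
print(top_confusions(matrix, index_to_tag))  # [('NNP', 'NN', 2), ('JJ', 'NN', 1)]
```

Grouping the resulting pairs (for example, proper versus common nouns) is one way to aggregate errors into categories.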

### Punctuation

**Do not** remove punctuation from documents since it may be helpful to the model.

You should **ignore** it during metrics computation.

If you are curious, you can run additional experiments to verify the impact of removing punctuation.

# The End