# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: POS tagging, Sequence labelling, RNNs


# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are tasked to address the task of POS tagging.

<center>
    <img src="images/pos_tagging.png" alt="POS tagging" />
</center>

# [Task 1 - 0.5 points] Corpus

You are going to work with the [Penn TreeBank corpus](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip).

**Ignore** the numeric value in the third column, use **only** the words/symbols and their POS label.

### Example

```Pierre	NNP	2
Vinken	NNP	8
,	,	2
61	CD	5
years	NNS	6
old	JJ	2
,	,	2
will	MD	0
join	VB	8
the	DT	11
board	NN	9
as	IN	9
a	DT	15
nonexecutive	JJ	15
director	NN	12
Nov.	NNP	9
29	CD	16
.	.	8
```

### Splits

The corpus contains 200 documents.

   * **Train**: Documents 1-100
   * **Validation**: Documents 101-150
   * **Test**: Documents 151-199

### Instructions

* **Download** the corpus.
* **Encode** the corpus into a pandas.DataFrame object.
* **Split** it in training, validation, and test sets.

#### Preliminaries

#### Out of Vocabulary (OOV) words in training set
We see words in the training set that are not alredy embedded through Glove (50) model, in addition we define the set oov_terms with all those words.

In [45]:
# file management
import sys
import shutil
import urllib
import zipfile
from pathlib import Path

# dataframe management
import pandas as pd

# data manipulation
import numpy as np

# for readability
from typing import Iterable
from tqdm import tqdm

#### Download

In [46]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)


def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")

    with zipfile.ZipFile(download_path) as loaded_tar:
        loaded_tar.extractall(path=extract_path, pwd=None)
    print("Extraction completed!")

In [47]:
url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip"
dataset_name = "dependency_treebank"

print(f"Current work directory: {Path.cwd()}")
dataset_folder = Path.cwd().joinpath("Datasets")

if not dataset_folder.exists():
    dataset_folder.mkdir(parents=True)

dataset_tar_path = dataset_folder.joinpath("dependency_treebank.zip")
dataset_path = dataset_folder.joinpath(dataset_name)

if not dataset_tar_path.exists():
    download_dataset(dataset_tar_path, url)

if not dataset_path.exists():
    extract_dataset(dataset_tar_path, dataset_folder)

Current work directory: /content


#### Encode and Split

The aim of the code below is to find a way to create a dataframe starting from all the files downloaded before.
For every downloaded file, we check the number through the function find_number(), we decide if it belongs to train, validation or test given that number, we then split it into rows to get the word and the POS and to check where a phrase ends. Given all this informations we can create a list whose columns are:
1. num_file: the number of the file
2. phrase_id: the id of the phrase contained in a file
3. text: the text that has to be analyzed
4. pos: the tag assigned to the text
5. split: the split to which the text belongs

In [48]:
import re

def find_number(string):
    """
    This function finds the number written in a string.
    """
    return re.findall(r'\d+', string)


In [49]:
dataframe_rows = []
id = 0

folder = dataset_folder.joinpath(dataset_name)
for file_path in folder.glob('*.dp'):
    num_file = int(find_number(file_path.name)[0])
    id = 1

    with file_path.open(mode='r', encoding='utf-8') as text_file:

        if num_file < 101:
            split = "train"
        elif num_file >= 101 and num_file < 151:
            split = "validation"
        else:
            split = "test"

        for row in text_file.readlines():
            if row=='\n' or row=='':
                id += 1

            else:
                text, pos, _ = row.split('\t')

                dataframe_row = {
                    "num_file": num_file,
                    "phrase_id": str(num_file) + "_" + str(id),
                    "text": text,
                    "pos": pos,
                    "split": split
                }

                dataframe_rows.append(dataframe_row)

In [50]:
df = pd.DataFrame(dataframe_rows)
df.head(20)

Unnamed: 0,num_file,phrase_id,text,pos,split
0,146,146_1,Rep.,NNP,validation
1,146,146_1,John,NNP,validation
2,146,146_1,Dingell,NNP,validation
3,146,146_1,",",",",validation
4,146,146_1,an,DT,validation
5,146,146_1,important,JJ,validation
6,146,146_1,sponsor,NN,validation
7,146,146_1,of,IN,validation
8,146,146_1,President,NNP,validation
9,146,146_1,Bush,NNP,validation


# [Task 2 - 0.5 points] Text encoding

To train a neural POS tagger, you first need to encode text into numerical format.

### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.
* [Optional] You are free to experiment with text pre-processing: **make sure you do not delete any token!**

In [51]:
# typing
from typing import List, Callable, Dict

#### Text pre-processing
In the code below we pre-processed the df dataframe in order to reduce the number of different words. Our text pre-processing consist just in lowering the text of words. <br>
**NB: should we add somenthing to the pre processing?**

In [52]:
import re
from functools import reduce
import nltk
from nltk.corpus import stopwords

In [53]:
def lower(text: str) -> str:
    """
    Transforms given text to lower case.
    """
    return text.lower()

In [54]:
PREPROCESSING_PIPELINE = [
                          lower
                          ]

def text_prepare(text: str,
                 filter_methods: List[Callable[[str], str]] = None) -> str:
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """
    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)

In [55]:
print('Pre-processing text...')

print()
print(f'[Debug] Before:\n{df.text.values[0]}')
print()

# Replace each sentence with its pre-processed version
df['text'] = df['text'].apply(lambda txt: text_prepare(txt))

print(f'[Debug] After:\n{df.text.values[0]}')
print()

print("Pre-processing completed!")

Pre-processing text...

[Debug] Before:
Rep.

[Debug] After:
rep.

Pre-processing completed!


#### Vocabulary creation for training set
We define a vocabulary for the training set assigning to each word a random index, the building_vocabulary function returns a list containing:<br>
- word vocabulary: vocabulary index to word
- inverse word vocabulary: word to vocabulary index
- word listing: set of unique terms that build up the vocabulary


In [56]:
df_train = df[df['split']=='train']
df_val = df[df['split']=='validation']
df_test = df[df['split']=='test']

#### GloVe embeddings (50)
Download GloVe 50 embedding where most of the words are alredy embedded in an embedding model that associate each word to a vector of dimension 50.

In [57]:
# !pip install gensim

In [58]:
import gensim
import gensim.downloader as gloader

def load_embedding_model(model_type: str,
                         embedding_dimension: int = 50) -> gensim.models.keyedvectors.KeyedVectors:
    """
    Loads a pre-trained word embedding model via gensim library.

    :param model_type: name of the word embedding model to load.
    :param embedding_dimension: size of the embedding space to consider

    :return
        - pre-trained word embedding model (gensim KeyedVectors object)
    """
    download_path = ""

    if model_type.strip().lower() == 'glove':
        download_path = "glove-wiki-gigaword-{}".format(embedding_dimension)
    else:
        raise AttributeError("Unsupported embedding model type! Available one: glove")

    try:
        emb_model = gloader.load(download_path)
    except ValueError as e:
        print("Invalid embedding model name! Check the embedding dimension:")
        print("Glove: 50, 100, 200, 300")
        raise e

    return emb_model

In [59]:
embedding_dim = 50

embedding_model = load_embedding_model(model_type="glove",
                                       embedding_dimension=embedding_dim)

In [60]:
vocab = {} # word to idx
embedding_matrix_glove = np.zeros((400001, embedding_dim))

for i in range(0, 400000):
    vocab[embedding_model.index_to_key[i]] = i+1
    embedding_matrix_glove[i+1] = embedding_model.vectors[i]

print(f'Embedding matrix shape: {embedding_matrix_glove.shape}')

Embedding matrix shape: (400001, 50)


In [61]:
def check_OOV_terms(embedding_model: gensim.models.keyedvectors.KeyedVectors,
                    word_listing: List[str]):
    """
    Checks differences between pre-trained embedding model vocabulary
    and dataset specific vocabulary in order to highlight out-of-vocabulary terms.

    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param word_listing: dataset specific vocabulary (list)

    :return
        - list of OOV terms
    """
    embedding_vocabulary = set(embedding_model.key_to_index.keys())
    oov = set(word_listing).difference(embedding_vocabulary)
    return list(oov)

In [62]:
word_listing = set(df_train['text'])
oov_terms = check_OOV_terms(embedding_model, word_listing)
oov_percentage = float(len(oov_terms)) * 100 / len(word_listing)
print(f"Total OOV terms: {len(oov_terms)} ({oov_percentage:.2f}%)")

Total OOV terms: 359 (4.85%)


Here we add the OOV in the train set to the vocabulary and the embedded matrix </br>
*NB: maybe is better to define the OOV as the mean of all the other word, link for report (https://stackoverflow.com/questions/49239941/what-is-unk-in-the-pretrained-glove-vector-files-e-g-glove-6b-50d-txt)* <br>
Our embedding matrix has the following columns:
- column 0 is all zeros, represents the embedding vector for padding
- columns 1 to 400001 are the embedding vectors for the words in GloVe
- columns 400002 to 400360 are the embedding vectors for the words OOV in the training set (random vector)
- column 400361 is the embedding vector for the words OOV in the final vocabulary

In [63]:
for word in oov_terms:
    vocab[word] = 400002 + oov_terms.index(word)
    embedding_matrix_glove = np.append(embedding_matrix_glove, np.random.uniform(-0.25, 0.25, 50).reshape(1, 50), axis=0)

vocab['[OOV]'] = len(vocab) + 1
average_oov = np.mean(embedding_matrix_glove, axis=0)
embedding_matrix = np.append(embedding_matrix_glove, average_oov.reshape(1, 50), axis=0)

vocab['[PAD]'] = 0

print(len(vocab))
print(f"New embedding matrix size: {embedding_matrix.shape}")

400361
New embedding matrix size: (400361, 50)


#### Embedding for training set
We create the embedding matrix for all the training set:
- using GloVe embeddings for alredy known words
- assigning to each OOV word a random value.

**NB: maybe instead of random we can define OOV with the mean of its neighbour word embeddings (tutorial 2)** <br>
**NB: we can even add all embedding in GloVe to the embedding matrix, even if they are not in train set**

# [Task 3 - 1.0 points] Model definition

You are now tasked to define your neural POS tagger.

### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.
* **Model 2**: add an additional Dense layer to the Baseline model.

* **Do not mix Model 1 and Model 2**. Each model has its own instructions.

**Note**: if a document contains many tokens, you are **free** to split them into chunks or sentences to define your mini-batches.

In [64]:
list_of_pos = list(set(df_train['pos']))

def pos_to_int(string):
    if string == '[PAD]':
        return np.zeros((45,), dtype=int).tolist()
    length = len(list_of_pos)
    for i in range(length):
        if list_of_pos[i] == string:
            return [1 if j == i else 0 for j in range(length)]

def int_to_pos(phrase):
    return [list_of_pos[np.argmax(phrase[w])] if max(phrase[w]) != 0 else '[PAD]' for w in range(pad)]

In [91]:
print(pos_to_int('.'))
print(pos_to_int('[PAD]'))

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [66]:
print(len(list_of_pos), list_of_pos)

45 ['.', 'RBR', 'UH', '``', 'PRP$', 'TO', 'WP', ',', 'JJ', 'NNS', 'WRB', 'WDT', 'IN', 'PDT', 'SYM', 'MD', 'LS', '-RRB-', 'VB', 'PRP', '$', 'VBG', 'NNPS', '-LRB-', '#', 'RP', 'WP$', 'VBP', 'RBS', 'EX', 'JJS', 'CC', 'VBZ', 'CD', 'DT', "''", 'POS', 'VBD', ':', 'RB', 'NNP', 'JJR', 'FW', 'NN', 'VBN']


#### Baseline
**NB: reference slides 08 pag 38** <br>
https://analyticsindiamag.com/complete-guide-to-bidirectional-lstm-with-python-codes/

In [84]:
# ! pip install tensorflow
import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras as keras

In [68]:
batch_size = 64
GloVe_dim = 50 # GloVe embedding
units_bi = 64

n_unique_words = len(vocab) # input and output layer
outputs_dim = len(list_of_pos)

In [69]:
baseline = tf.keras.Sequential(name='baseline')

baseline.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), mask_zero=True, trainable=False))
baseline.add(layers.Masking(mask_value=0))
baseline.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
baseline.add(layers.TimeDistributed(layers.Dense(outputs_dim, activation='softmax')))

baseline.summary()



Model: "baseline"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, None, 50)          20018050  
                                                                 
 masking_3 (Masking)         (None, None, 50)          0         
                                                                 
 bidirectional_4 (Bidirecti  (None, None, 128)         58880     
 onal)                                                           
                                                                 
 time_distributed_4 (TimeDi  (None, None, 45)          5805      
 stributed)                                                      
                                                                 
Total params: 20082735 (76.61 MB)
Trainable params: 64685 (252.68 KB)
Non-trainable params: 20018050 (76.36 MB)
_________________________________________________________________


#### Model 1

In [70]:
model_1 = tf.keras.Sequential(name='Model_1')

model_1.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), mask_zero=True, trainable=False))
model_1.add(layers.Masking(mask_value=0))
model_1.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_1.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_1.add(layers.TimeDistributed(layers.Dense(outputs_dim, activation='softmax')))


model_1.summary()



Model: "Model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, None, 50)          20018050  
                                                                 
 masking_4 (Masking)         (None, None, 50)          0         
                                                                 
 bidirectional_5 (Bidirecti  (None, None, 128)         58880     
 onal)                                                           
                                                                 
 bidirectional_6 (Bidirecti  (None, None, 128)         98816     
 onal)                                                           
                                                                 
 time_distributed_5 (TimeDi  (None, None, 45)          5805      
 stributed)                                                      
                                                           

#### Model 2

In [71]:
units_dense = 64

In [72]:
model_2 = tf.keras.Sequential(name='Model_2')

model_2.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), mask_zero=True, trainable=False))
model_2.add(layers.Masking(mask_value=0))
model_2.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_2.add(layers.TimeDistributed(layers.Dense(units_dense, activation='softmax')))
model_2.add(layers.TimeDistributed(layers.Dense(outputs_dim, activation='softmax')))


model_2.summary()



Model: "Model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, None, 50)          20018050  
                                                                 
 masking_5 (Masking)         (None, None, 50)          0         
                                                                 
 bidirectional_7 (Bidirecti  (None, None, 128)         58880     
 onal)                                                           
                                                                 
 time_distributed_6 (TimeDi  (None, None, 64)          8256      
 stributed)                                                      
                                                                 
 time_distributed_7 (TimeDi  (None, None, 45)          2925      
 stributed)                                                      
                                                           

#### try models

https://medium.com/analytics-vidhya/author-multi-class-text-classification-using-bidirectional-lstm-keras-c9a533a1cc4a
an example of a bi LSTM implementation with padding <br>

In [73]:
def assign_idx(word):
    try:
        idx = vocab[word]
    except(KeyError):
        idx = vocab['[OOV]']
    return idx

In [74]:
# df_train_reduction = df_train.loc[df_train['num_file'] <= 50]
# df_val_reduction = df_val.loc[df_val['num_file'] <= 115]
# df_test_reduction = df_test.loc[df_test['num_file'] <= 165]

In [75]:
x_train = [ [assign_idx(word) for word in df_train[df_train['phrase_id']==nr_phrase]['text']] for nr_phrase in set(df_train['phrase_id']) ]
x_val   = [ [assign_idx(word) for word in df_val[df_val['phrase_id']==nr_phrase]['text']] for nr_phrase in set(df_val['phrase_id']) ]
x_test   = [ [assign_idx(word) for word in df_test[df_test['phrase_id']==nr_phrase]['text']] for nr_phrase in set(df_test['phrase_id']) ]

y_train = [ [pos_to_int(pos) for pos in df_train[df_train['phrase_id']==nr_phrase]['pos']] for nr_phrase in set(df_train['phrase_id']) ]
y_val   = [ [pos_to_int(pos) for pos in df_val[df_val['phrase_id']==nr_phrase]['pos']] for nr_phrase in set(df_val['phrase_id']) ]
y_test   = [ [pos_to_int(pos) for pos in df_test[df_test['phrase_id']==nr_phrase]['pos']] for nr_phrase in set(df_test['phrase_id']) ]

In [76]:
#In order to put the data data in data in the model we need to pad the array representing the words / pos

pad = max(max([len(phrase) for phrase in x_train]), max([len(phrase) for phrase in x_val]), max([len(phrase) for phrase in x_test]))
print("The length of all phrases after padding will be ", pad)

x_train_pad = [phrase + np.zeros((pad-len(phrase),), dtype=int).tolist() for phrase in x_train]
x_val_pad = [phrase + np.zeros((pad-len(phrase),), dtype=int).tolist() for phrase in x_val]
x_test_pad = [phrase + np.zeros((pad-len(phrase),), dtype=int).tolist() for phrase in x_test]

y_train_pad = [phrase + np.zeros((pad-len(phrase), 45)).tolist() for phrase in y_train]
y_val_pad = [phrase + np.zeros((pad-len(phrase), 45)).tolist() for phrase in y_val]
y_test_pad = [phrase + np.zeros((pad-len(phrase), 45)).tolist() for phrase in y_test]

The length of all phrases after padding will be  249


In [77]:
print(len(x_train), len(x_val), len(x_test))

1963 1299 652


In [78]:
print(len(x_train[0]))
print(len(x_train_pad[0]))
print(x_train_pad[0])

24
249
[1, 21640, 6, 16350, 2309, 33, 288, 5, 466, 3126, 200, 183, 63, 2, 6, 1, 79607, 6, 400009, 2309, 169, 183, 63, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [79]:
reverse_word_index = dict([(value, key) for (key, value) in vocab.items()])
def decode_phrase(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_phrase(x_train_pad[0]))
print(int_to_pos(y_train_pad[0]))

the perch and dolphin fields are expected to start producing early next year , and the seahorse and tarwhine fields later next year . [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P

# [Task 4 - 1.0 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using macro F1-score, compute over **all** tokens.
* **Concatenate** all tokens in a data split to compute the F1-score. (**Hint**: accumulate FP, TP, FN, TN iteratively)
* **Do not consider punctuation and symbol classes** $\rightarrow$ [What is punctuation?](https://en.wikipedia.org/wiki/English_punctuation)

**Note**: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe are **not** considered as OOV
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)

In [80]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np
from keras.callbacks import Callback

ignore_classes = ['[PAD]', '#', '``', '-RRB-', "''", '$', 'SYM', ':', '.', ',', '-LRB-']

class MetricsCallback(Callback):
    def __init__(self, x, y_true, ignore_classes=None):
        super().__init__()
        self.x = x
        self.y_true = y_true
        self.ignore_classes = ignore_classes if ignore_classes else []
        self.metrics_dict = {}

    def on_epoch_end(self, epoch, logs=None):
        y_pred = self.model.predict(self.x)
        y_true_pos = [int_to_pos(true) for true in self.y_true]
        y_pred_pos = [int_to_pos(pred) for pred in y_pred]

        exclude_indices = [i for i, tag in enumerate(y_true_pos) if tag in self.ignore_classes]
        y_true_pos_clean = np.delete(y_true_pos, exclude_indices)
        y_pred_pos_clean = np.delete(y_pred_pos, exclude_indices)

        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true_pos_clean, y_pred_pos_clean, average='macro', zero_division=1
        )

        accuracy = accuracy_score(y_true_pos_clean, y_pred_pos_clean)

        self.metrics_dict['macro_f1'] = f1
        self.metrics_dict['precision'] = precision
        self.metrics_dict['recall'] = recall
        self.metrics_dict['accuracy'] = accuracy

        print(f"Macro F1-Score: {f1:.4f} - Precision: {precision:.4f} - Recall: {recall:.4f} - Accuracy: {accuracy:.4f}")

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline, Model 1, and Model 2.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.

**NB: todo fine tuning on embedding layer https://stackoverflow.com/questions/40345607/how-does-fine-tuning-word-embeddings-work**

In [86]:
train = True

early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=3, factor=0.1, min_lr=0.001)

In [87]:
model = model_1
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=tf.keras.optimizers.experimental.Adadelta(learning_rate=1))

if not train:
    model.load_weights("model.h5")
    print("Loaded model from disk")
else:
    history = model.fit(x_train_pad, y_train_pad, batch_size=batch_size, epochs=10,
                        validation_data=(x_val_pad, y_val_pad), verbose=1,
                        callbacks=[MetricsCallback(x_val_pad, y_val_pad, ignore_classes), early_stopping, reduce_lr])

Epoch 1/10
Macro F1-Score: 0.0195 - Precision: 0.8894 - Recall: 0.0443 - Accuracy: 0.0187
Epoch 2/10
Macro F1-Score: 0.0210 - Precision: 0.8506 - Recall: 0.0552 - Accuracy: 0.0208
Epoch 3/10
Macro F1-Score: 0.0288 - Precision: 0.7790 - Recall: 0.0584 - Accuracy: 0.0207
Epoch 4/10
Macro F1-Score: 0.1051 - Precision: 0.6476 - Recall: 0.1375 - Accuracy: 0.0391
Epoch 5/10
Macro F1-Score: 0.1190 - Precision: 0.7297 - Recall: 0.1377 - Accuracy: 0.0388
Epoch 6/10
Macro F1-Score: 0.1659 - Precision: 0.6596 - Recall: 0.2140 - Accuracy: 0.0459
Epoch 7/10
Macro F1-Score: 0.2779 - Precision: 0.7663 - Recall: 0.3092 - Accuracy: 0.0584
Epoch 8/10
Macro F1-Score: 0.2825 - Precision: 0.7691 - Recall: 0.2912 - Accuracy: 0.0576
Epoch 9/10
Macro F1-Score: 0.3286 - Precision: 0.7907 - Recall: 0.3416 - Accuracy: 0.0618
Epoch 10/10
Macro F1-Score: 0.3712 - Precision: 0.8069 - Recall: 0.3803 - Accuracy: 0.0651


In [88]:
model.save_weights("model.h5")
print("Saved model to disk")

Saved model to disk


In [None]:
test_scores = model.evaluate(x_test_pad, y_test_pad)

In [90]:
pred = model.predict(x_test_pad)
print(decode_phrase(x_test_pad[4]))
print(int_to_pos(y_test_pad[4]))
print(int_to_pos(pred[4]))

but some analysts questioned how much of an impact the retirement package will have , because few jobs will end up being eliminated . [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P

In [None]:
def merge_history(history1, history2):
    final_history = {}

    for key in history1.keys():
        final_history[key] = history1[key] + history2[key]

    return final_history

In [None]:
# final_history = merge_history(history.history, history_2.history)

In [None]:
import matplotlib.pyplot as plt

def graph_plots(history, string):
  plt.plot(history[string])
  plt.plot(history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

graph_plots(history.history, "accuracy")
graph_plots(history.history, "loss")

# [Task 6 - 1.0 points] Error Analysis

You are tasked to evaluate your best performing model.

### Instructions

* Compare the errors made on the validation and test sets.
* Aggregate model errors into categories (if possible)
* Comment the about errors and propose possible solutions on how to address them.

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check this frequently asked questions before contacting us

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Error Analysis

Some topics for discussion include:
   * Model performance on most/less frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

### Punctuation

**Do not** remove punctuation from documents since it may be helpful to the model.

You should **ignore** it during metrics computation.

If you are curious, you can run additional experiments to verify the impact of removing punctuation.

# The End