# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: POS tagging, Sequence labelling, RNNs


# Contact

For any doubt, question, issue, or request for help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

Your task is to build a part-of-speech (POS) tagger.

<center>
    <img src="images/pos_tagging.png" alt="POS tagging" />
</center>

# [Task 1 - 0.5 points] Corpus

You are going to work with the [Penn TreeBank corpus](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip).

**Ignore** the numeric value in the third column, use **only** the words/symbols and their POS label.

### Example

```
Pierre	NNP	2
Vinken	NNP	8
,	,	2
61	CD	5
years	NNS	6
old	JJ	2
,	,	2
will	MD	0
join	VB	8
the	DT	11
board	NN	9
as	IN	9
a	DT	15
nonexecutive	JJ	15
director	NN	12
Nov.	NNP	9
29	CD	16
.	.	8
```

### Splits

The corpus contains 200 documents.

   * **Train**: Documents 1-100
   * **Validation**: Documents 101-150
   * **Test**: Documents 151-199
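
The split boundaries listed above can be written as a small helper. This is only a minimal sketch: the function name `split_for_document` is an assumption made for illustration and is not part of the assignment.

```python
def split_for_document(doc_number: int) -> str:
    """Map a 1-based document number to its split, following the ranges listed above."""
    if 1 <= doc_number <= 100:
        return "train"
    if 101 <= doc_number <= 150:
        return "validation"
    if 151 <= doc_number <= 199:
        return "test"
    # Anything outside the listed ranges is rejected rather than silently assigned
    raise ValueError(f"Document number out of range: {doc_number}")


print(split_for_document(1), split_for_document(101), split_for_document(199))
```

Raising on out-of-range numbers is a design choice of this sketch: it makes a misparsed file name visible immediately.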

### Instructions

* **Download** the corpus.
* **Encode** the corpus into a pandas.DataFrame object.
* **Split** it into training, validation, and test sets.

#### Preliminaries

In [11]:
# file management
import sys
import shutil
import urllib
import zipfile
from pathlib import Path

# dataframe management
import pandas as pd

# data manipulation
import numpy as np

# for readability
from typing import Iterable
from tqdm import tqdm

#### Download

In [12]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)

        
def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")
    
    with zipfile.ZipFile(download_path) as loaded_tar:
        loaded_tar.extractall(path=extract_path, pwd=None)
    print("Extraction completed!")

In [13]:
url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip"
dataset_name = "dependency_treebank"

print(f"Current work directory: {Path.cwd()}")
dataset_folder = Path.cwd().joinpath("Datasets")

if not dataset_folder.exists():
    dataset_folder.mkdir(parents=True)

dataset_tar_path = dataset_folder.joinpath("dependency_treebank.zip")
dataset_path = dataset_folder.joinpath(dataset_name)

if not dataset_tar_path.exists():
    download_dataset(dataset_tar_path, url)

if not dataset_path.exists():
    extract_dataset(dataset_tar_path, dataset_folder)

Current work directory: c:\Users\Utente\Desktop\UNIVERSITA'\AI\2 Anno\Natural Language Processing\_ Esame\Assignment 1\NLP_POS-tagging


#### Encode and Split

The code below builds a DataFrame from the files downloaded above.
For each file, we read its number with `find_number()` and use it to assign the file to the train, validation, or test split. We then read the file line by line to extract each word and its POS tag; an empty line marks the end of a sentence. From this information we build a list of rows with the following columns:
1. num_file: the number of the file
2. phrase_id: the id of the sentence within the file, in the form `<num_file>_<sentence number>`
3. text: the token (word or symbol) to be tagged
4. pos: the POS tag assigned to the token
5. split: the split to which the token belongs

In [14]:
import re

def find_number(string):
    """
    This function finds the number written in a string.
    """
    return re.findall(r'\d+', string)


In [15]:
dataframe_rows = []

folder = dataset_folder.joinpath(dataset_name)
# Sort the paths so that the row order does not depend on the file system
for file_path in sorted(folder.glob('*.dp')):
    num_file = int(find_number(file_path.name)[0])
    sentence_idx = 1  # sentence counter, restarted for every file

    # Assign the split from the document number
    if num_file < 101:
        split = "train"
    elif num_file < 151:
        split = "validation"
    else:
        split = "test"

    with file_path.open(mode='r', encoding='utf-8') as text_file:
        for row in text_file:
            if row.strip() == '':
                # An empty line separates two sentences
                sentence_idx += 1
            else:
                # The third column (numeric value) is ignored
                text, pos, _ = row.rstrip('\n').split('\t')

                dataframe_rows.append({
                    "num_file": num_file,
                    "phrase_id": f"{num_file}_{sentence_idx}",
                    "text": text,
                    "pos": pos,
                    "split": split
                })

In [16]:
df = pd.DataFrame(dataframe_rows)
df.head()

| | num_file | phrase_id | text | pos | split |
|---|---|---|---|---|---|
| 0 | 1 | 1_1 | Pierre | NNP | train |
| 1 | 1 | 1_1 | Vinken | NNP | train |
| 2 | 1 | 1_1 | , | , | train |
| 3 | 1 | 1_1 | 61 | CD | train |
| 4 | 1 | 1_1 | years | NNS | train |
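
Each row of the DataFrame holds a single token, while a sequence model is usually fed whole sentences. The sketch below shows one way to regroup tokens by `phrase_id`. It works on a plain list of dictionaries that mirrors the DataFrame columns; the helper name `group_by_sentence` and the toy rows are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple

# Toy rows mirroring the DataFrame columns above (illustrative only)
toy_rows = [
    {"phrase_id": "1_1", "text": "Pierre", "pos": "NNP"},
    {"phrase_id": "1_1", "text": "Vinken", "pos": "NNP"},
    {"phrase_id": "1_2", "text": "Mr.", "pos": "NNP"},
    {"phrase_id": "1_2", "text": "Vinken", "pos": "NNP"},
    {"phrase_id": "1_2", "text": "is", "pos": "VBZ"},
]


def group_by_sentence(rows: List[Dict[str, str]]) -> "OrderedDict[str, Tuple[List[str], List[str]]]":
    """Collect tokens and tags per phrase_id, keeping the original token order."""
    sentences: "OrderedDict[str, Tuple[List[str], List[str]]]" = OrderedDict()
    for row in rows:
        tokens, tags = sentences.setdefault(row["phrase_id"], ([], []))
        tokens.append(row["text"])
        tags.append(row["pos"])
    return sentences


toy_sentences = group_by_sentence(toy_rows)
for phrase_id, (tokens, tags) in toy_sentences.items():
    print(phrase_id, tokens, tags)
```

Note that this only preserves sentence order if the rows are grouped before any shuffling of individual tokens.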


# [Task 2 - 0.5 points] Text encoding

To train a neural POS tagger, you first need to encode text into numerical format.

### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.
* [Optional] You are free to experiment with text pre-processing: **make sure you do not delete any token!**

In [17]:
# typing
from typing import List, Callable, Dict

#### Text pre-processing
In the code below we pre-process the `df` DataFrame to reduce the number of distinct words. Our pre-processing consists only of lowercasing each token. <br>
**NB: should we add something else to the pre-processing?**

In [18]:
import re
from functools import reduce
import nltk
from nltk.corpus import stopwords

# REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
# GOOD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
# try:
#     STOPWORDS = set(stopwords.words('english'))
# except LookupError:
#     nltk.download('stopwords')
#     STOPWORDS = set(stopwords.words('english'))

In [19]:
def lower(text: str) -> str:
    """
    Transforms given text to lower case.
    """
    return text.lower()

# def replace_special_characters(text: str) -> str:
#     """
#     Replaces special characters, such as paranthesis, with spacing character
#     """
#     return REPLACE_BY_SPACE_RE.sub(' ', text)

# def replace_br(text: str) -> str:
#     """
#     Replaces br characters
#     """
#     return text.replace('br', '')

# def filter_out_uncommon_symbols(text: str) -> str:
#     """
#     Removes any special character that is not in the good symbols list (check regular expression)
#     """
#     return GOOD_SYMBOLS_RE.sub('', text)

# def remove_stopwords(text: str) -> str:
#     return ' '.join([x for x in text.split() if x and x not in STOPWORDS])

# def strip_text(text: str) -> str:
#     """
#     Removes any left or right spacing (including carriage return) from text.
#     """
#     return text.strip()

In [20]:
PREPROCESSING_PIPELINE = [
                          lower,
                          # replace_special_characters,
                        #   replace_br,
                          # filter_out_uncommon_symbols,
                        #   strip_text
                        #   remove_stopwords
                          ]

def text_prepare(text: str,
                 filter_methods: List[Callable[[str], str]] = None) -> str:
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """
    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)

In [21]:
print('Pre-processing text...')

print()
print(f'[Debug] Before:\n{df.text.values[50]}')
print()

# Replace each sentence with its pre-processed version
df['text'] = df['text'].apply(lambda txt: text_prepare(txt))

print(f'[Debug] After:\n{df.text.values[50]}')
print()

print("Pre-processing completed!")

Pre-processing text...

[Debug] Before:
director

[Debug] After:
director

Pre-processing completed!


#### Vocabulary creation for training set
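We define a vocabulary for the training set, assigning each word an index in order of first appearance (the training rows are shuffled before the vocabulary is built, so this order is effectively random). The `build_vocabulary` function returns a tuple containing:<br>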
- word vocabulary: vocabulary index to word
- inverse word vocabulary: word to vocabulary index
- word listing: set of unique terms that build up the vocabulary
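
The vocabulary described above starts at index 0 and reserves no index for padding or for words never seen in training, both of which a sequence model usually needs. The sketch below shows one possible variant. The token names `<PAD>` and `<UNK>`, their indices, and the function name are assumptions, not part of the assignment.

```python
from typing import Dict, Iterable, Tuple

PAD_TOKEN = "<PAD>"  # hypothetical padding token, reserved index 0
UNK_TOKEN = "<UNK>"  # hypothetical unknown-word token, reserved index 1


def build_vocabulary_with_specials(tokens: Iterable[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build index <-> word maps, keeping indices 0 and 1 for the special tokens."""
    word_to_idx = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for token in tokens:
        if token not in word_to_idx:
            # The next free index is simply the current vocabulary size
            word_to_idx[token] = len(word_to_idx)
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return idx_to_word, word_to_idx


toy_idx_to_word, toy_word_to_idx = build_vocabulary_with_specials(
    ["the", "board", "the", "director"]
)
print(toy_word_to_idx)
```

With this layout, the embedding matrix needs two extra rows, and unseen validation or test tokens can be mapped to the `<UNK>` index.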


In [22]:
df_train = df[df['split']=='train']
df_val = df[df['split']=='validation']
df_test = df[df['split']=='test']

In [23]:
from collections import OrderedDict
from typing import Tuple

def build_vocabulary(df: pd.DataFrame) -> Tuple[Dict[int, str],
                                                Dict[str, int],
                                                List[str]]:
    """
    Given a dataset, builds the corresponding word vocabulary.

    :param df: dataset from which we want to build the word vocabulary (pandas.DataFrame)
    :return:
      - word vocabulary: vocabulary index to word
      - inverse word vocabulary: word to vocabulary index
      - word listing: set of unique terms that build up the vocabulary
    """
    idx_to_word = OrderedDict()
    word_to_idx = OrderedDict()
    
    curr_idx = 0
    for sentence in tqdm(df.text.values):
        tokens = sentence.split()
        for token in tokens:
            if token not in word_to_idx:
                word_to_idx[token] = curr_idx
                idx_to_word[curr_idx] = token
                curr_idx += 1

    word_listing = list(idx_to_word.values())
    return idx_to_word, word_to_idx, word_listing

In [24]:
# Shuffling the rows this way is not mandatory, but it is sufficient for our purposes
np.random.seed(42)

random_indexes = np.random.choice(np.arange(df_train.shape[0]),
                                  size=len(df_train),
                                  replace=False)

df_train = df_train.iloc[random_indexes]
print(f'New dataset size: {df_train.shape}')
idx_to_word, word_to_idx, word_listing = build_vocabulary(df_train)

New dataset size: (47356, 5)


100%|██████████| 47356/47356 [00:00<00:00, 731316.13it/s]


In [25]:
df_train.head(10)

| | num_file | phrase_id | text | pos | split |
|---|---|---|---|---|---|
| 7155 | 29 | 29_8 | interbank | NN | train |
| 7497 | 32 | 32_4 | offer | NN | train |
| 15806 | 43 | 43_38 | . | . | train |
| 36501 | 85 | 85_41 | is | VBZ | train |
| 38803 | 89 | 89_26 | hole | NN | train |
| 18332 | 44 | 44_110 | to | TO | train |
| 25430 | 59 | 59_11 | a | DT | train |
| 47033 | 100 | 100_24 | hahn | NNP | train |
| 9672 | 36 | 36_47 | , | , | train |
| 36210 | 85 | 85_29 | squeezed | VBN | train |


#### GloVe embeddings (50)
Download the 50-dimensional GloVe model, a pre-trained embedding model that maps each word in its vocabulary to a vector of dimension 50. Most of the words in our corpus are already covered by it.

In [26]:
# !pip install gensim

In [27]:
import gensim
import gensim.downloader as gloader

def load_embedding_model(model_type: str,
                         embedding_dimension: int = 50) -> gensim.models.keyedvectors.KeyedVectors:
    """
    Loads a pre-trained word embedding model via gensim library.

    :param model_type: name of the word embedding model to load.
    :param embedding_dimension: size of the embedding space to consider

    :return
        - pre-trained word embedding model (gensim KeyedVectors object)
    """
    download_path = ""

    if model_type.strip().lower() == 'glove':
        download_path = "glove-wiki-gigaword-{}".format(embedding_dimension)
    else:
        raise AttributeError("Unsupported embedding model type! Available one: glove")
        
    try:
        emb_model = gloader.load(download_path)
    except ValueError as e:
        print("Invalid embedding model name! Check the embedding dimension:")
        print("Glove: 50, 100, 200, 300")
        raise e

    return emb_model

In [28]:
embedding_model = load_embedding_model(model_type="glove",
                                       embedding_dimension=50)

#### Out of Vocabulary (OOV) words in training set
We look for the words in the training set that are not covered by the GloVe (50) model and collect them in the list `oov_terms`.

In [29]:
def check_OOV_terms(embedding_model: gensim.models.keyedvectors.KeyedVectors,
                    word_listing: List[str]):
    """
    Checks differences between pre-trained embedding model vocabulary
    and dataset specific vocabulary in order to highlight out-of-vocabulary terms.

    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param word_listing: dataset specific vocabulary (list)

    :return
        - list of OOV terms
    """
    embedding_vocabulary = set(embedding_model.key_to_index.keys())
    oov = set(word_listing).difference(embedding_vocabulary)
    return list(oov)

In [30]:
oov_terms = check_OOV_terms(embedding_model, word_listing)
oov_percentage = float(len(oov_terms)) * 100 / len(word_listing)
print(f"Total OOV terms: {len(oov_terms)} ({oov_percentage:.2f}%)")

Total OOV terms: 359 (4.85%)


#### Embedding for training set
We create the embedding matrix for the whole training vocabulary:
- using GloVe embeddings for already known words
- assigning each OOV word a random vector.

**NB: instead of a random vector, we could embed each OOV word as the mean of its neighbouring word embeddings (tutorial 2).**
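
One possible reading of the neighbour-based idea is sketched below: an OOV word is embedded as the mean of the known vectors found in a small context window around it. This is an assumption about what "neighbour" means here, not necessarily the strategy of the tutorial. The function name, the `window` parameter, and the toy vectors are hypothetical; a plain dictionary stands in for the GloVe model.

```python
from typing import Dict, List

import numpy as np


def neighbour_mean_embedding(sentence: List[str],
                             position: int,
                             known_vectors: Dict[str, np.ndarray],
                             dim: int,
                             window: int = 2) -> np.ndarray:
    """Average the known embeddings within `window` tokens of sentence[position]."""
    start = max(0, position - window)
    end = min(len(sentence), position + window + 1)
    neighbours = [known_vectors[sentence[i]]
                  for i in range(start, end)
                  if i != position and sentence[i] in known_vectors]
    if not neighbours:
        # No known neighbour: fall back to a small random vector
        return np.random.uniform(low=-0.05, high=0.05, size=dim).astype(np.float32)
    return np.mean(neighbours, axis=0).astype(np.float32)


# Toy 2-dimensional vectors, for illustration only
toy_vectors = {
    "the": np.array([1.0, 0.0], dtype=np.float32),
    "board": np.array([0.0, 1.0], dtype=np.float32),
}
toy_sentence = ["the", "nonexecutive", "board"]
oov_vector = neighbour_mean_embedding(toy_sentence, 1, toy_vectors, dim=2)
print(oov_vector)  # [0.5 0.5]
```

If the same OOV word occurs in several sentences, the per-occurrence vectors would still have to be combined (for example averaged) to obtain a single static embedding per word.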

In [31]:
def build_embedding_matrix(embedding_model: gensim.models.keyedvectors.KeyedVectors,
                           embedding_dimension: int,
                           word_to_idx: Dict[str, int],
                           vocab_size: int,
                           oov_terms: List[str]) -> np.ndarray:
    """
    Builds the embedding matrix of a specific dataset given a pre-trained word embedding model

    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param word_to_idx: vocabulary map (word -> index) (dict)
    :param vocab_size: size of the vocabulary
    :param oov_terms: list of OOV terms (list)

    :return
        - embedding matrix that assigns a high dimensional vector to each word in the dataset specific vocabulary (shape |V| x d)
    """
    embedding_matrix = np.zeros((vocab_size, embedding_dimension), dtype=np.float32)
    for word, idx in tqdm(word_to_idx.items()):
        try:
            embedding_vector = embedding_model[word]
        except (KeyError, TypeError):
            embedding_vector = np.random.uniform(low=-0.05, high=0.05, size=embedding_dimension)

        embedding_matrix[idx] = embedding_vector

    return embedding_matrix

In [32]:
# Testing
embedding_dimension = 50
embedding_matrix = build_embedding_matrix(embedding_model, embedding_dimension, word_to_idx, len(word_to_idx), oov_terms)
print(f"Embedding matrix shape: {embedding_matrix.shape}")

100%|██████████| 7404/7404 [00:00<00:00, 183099.71it/s]

Embedding matrix shape: (7404, 50)





# [Task 3 - 1.0 points] Model definition

You are now tasked to define your neural POS tagger.

### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.
* **Model 2**: add an additional Dense layer to the Baseline model.

* **Do not mix Model 1 and Model 2**. Each model has its own instructions.

**Note**: if a document contains many tokens, you are **free** to split them into chunks or sentences to define your mini-batches.

In [162]:
def pos_to_int(string):
    length = len(set(df['pos']))
    for i in range(length):
        if list(set(df['pos']))[i]==string:
            return [1 if j == i else 0 for j in range(length)]
        
# list(set(df['pos']))




In [163]:
print(pos_to_int('JJ'))

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
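
`pos_to_int` rebuilds `set(df['pos'])` on every call, and the iteration order of a set of strings can change between interpreter runs, so the index of a tag is not guaranteed to be stable. A minimal sketch of a deterministic alternative follows; the helper names `build_tag_index` and `one_hot` and the toy tag list are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def build_tag_index(tags: Iterable[str]) -> Dict[str, int]:
    """Map each distinct tag to a fixed index, using sorted order for reproducibility."""
    return {tag: idx for idx, tag in enumerate(sorted(set(tags)))}


def one_hot(tag: str, tag_to_idx: Dict[str, int]) -> List[int]:
    """Return a one-hot list with a 1 at the position of `tag`."""
    vector = [0] * len(tag_to_idx)
    vector[tag_to_idx[tag]] = 1
    return vector


toy_tags = ["NNP", "NNP", ",", "CD", "NNS", "JJ"]
toy_tag_to_idx = build_tag_index(toy_tags)
print(toy_tag_to_idx)                 # {',': 0, 'CD': 1, 'JJ': 2, 'NNP': 3, 'NNS': 4}
print(one_hot("JJ", toy_tag_to_idx))  # [0, 0, 1, 0, 0]
```

Building the mapping once also avoids recomputing the tag set for every token.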


#### Baseline
**NB: open question: how can we generate the output? Reference: slides 08, page 38.** <br>
https://analyticsindiamag.com/complete-guide-to-bidirectional-lstm-with-python-codes/

In [46]:
# ! pip install tensorflow
import tensorflow as tf
import tensorflow.keras.layers as layers


In [147]:
batch_size = 128
GloVe_dim = 50 # GloVe embedding
units_bi = 100
n_unique_words = len(word_to_idx) # input and output layer
outputs_dim = len(set(df['pos']))


In [148]:
baseline = tf.keras.Sequential(name='baseline')

baseline.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))) #trainable=False
baseline.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
baseline.add(layers.Dense(outputs_dim, activation='softmax'))  # softmax: one probability distribution over tags per token

baseline.summary()

Model: "baseline"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_43 (Embedding)    (None, None, 50)          370200    
                                                                 
 bidirectional_76 (Bidirect  (None, None, 200)         120800    
 ional)                                                          
                                                                 
 dense_76 (Dense)            (None, None, 45)          9045      
                                                                 
Total params: 500045 (1.91 MB)
Trainable params: 500045 (1.91 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


#### Model 1

In [149]:
model_1 = tf.keras.Sequential(name='Model 1')

model_1.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)))
model_1.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_1.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_1.add(layers.Dense(outputs_dim, activation='softmax'))  # softmax: one probability distribution over tags per token


model_1.summary()

Model: "Model 1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_44 (Embedding)    (None, None, 50)          370200    
                                                                 
 bidirectional_77 (Bidirect  (None, None, 200)         120800    
 ional)                                                          
                                                                 
 bidirectional_78 (Bidirect  (None, None, 200)         240800    
 ional)                                                          
                                                                 
 dense_77 (Dense)            (None, None, 45)          9045      
                                                                 
Total params: 740845 (2.83 MB)
Trainable params: 740845 (2.83 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


#### Model 2

In [115]:
units_dense = 100

In [150]:
model_2 = tf.keras.Sequential(name='Model 2')

model_2.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)))
model_2.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_2.add(layers.Dense(units_dense, activation='sigmoid'))
model_2.add(layers.Dense(outputs_dim, activation='softmax'))  # softmax: one probability distribution over tags per token


model_2.summary()

Model: "Model 2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_45 (Embedding)    (None, None, 50)          370200    
                                                                 
 bidirectional_79 (Bidirect  (None, None, 200)         120800    
 ional)                                                          
                                                                 
 dense_78 (Dense)            (None, None, 100)         20100     
                                                                 
 dense_79 (Dense)            (None, None, 45)          4545      
                                                                 
Total params: 515645 (1.97 MB)
Trainable params: 515645 (1.97 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


#### Trying the models

In [136]:
import numpy as np

In [135]:
df_train['text'][1]

'vinken'

In [170]:
df_train_red = df_train[:10]
df_val_red = df_val[:10]

In [172]:
baseline.compile(loss='categorical_crossentropy', optimizer='Adadelta')


# x_train = np.vectorize(np.array([[s] for s in batch_size])).numpy()
# x_val = np.vectorize(np.array([[s] for s in batch_size])).numpy()

# y_train = np.array(train_labels)
# y_val = np.array(val_labels)

x_train = df_train_red['text']
x_val = df_val_red['text']

y_train = [pos_to_int(el) for el in df_train_red['pos']]
y_val = [pos_to_int(el) for el in df_val_red['pos']]

baseline.fit(x_train, y_train, batch_size=batch_size, epochs=20, validation_data=(x_val, y_val))

ValueError: Failed to find data adapter that can handle input: <class 'numpy.ndarray'>, (<class 'list'> containing values of types {'(<class \'list\'> containing values of types {"<class \'int\'>"})'})
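
The `ValueError` above appears to come from passing raw token strings and nested Python lists to `fit`. An `Embedding` layer expects an integer array of shape (batch, sequence length), and with `categorical_crossentropy` the labels are expected as an array of shape (batch, sequence length, number of tags). The sketch below builds such arrays with NumPy only. It assumes a vocabulary that reserves index 0 for padding and index 1 for unknown words, which the vocabulary built earlier does not do; all names are hypothetical.

```python
from typing import Dict, List, Tuple

import numpy as np


def encode_sentences(sentences: List[Tuple[List[str], List[str]]],
                     word_to_idx: Dict[str, int],
                     tag_to_idx: Dict[str, int],
                     max_len: int,
                     pad_idx: int = 0,
                     unk_idx: int = 1) -> Tuple[np.ndarray, np.ndarray]:
    """Turn (tokens, tags) pairs into padded index and one-hot label arrays."""
    x = np.full((len(sentences), max_len), pad_idx, dtype=np.int32)
    y = np.zeros((len(sentences), max_len, len(tag_to_idx)), dtype=np.float32)
    for row, (tokens, tags) in enumerate(sentences):
        # Sentences longer than max_len are truncated in this sketch
        for col, (token, tag) in enumerate(zip(tokens[:max_len], tags[:max_len])):
            x[row, col] = word_to_idx.get(token, unk_idx)
            y[row, col, tag_to_idx[tag]] = 1.0
    return x, y


# Toy vocabulary and tag map, for illustration only
toy_words = {"<PAD>": 0, "<UNK>": 1, "pierre": 2, "vinken": 3, ",": 4}
toy_tag_map = {",": 0, "NNP": 1}
toy_batch = [
    (["pierre", "vinken", ","], ["NNP", "NNP", ","]),
    (["vinken"], ["NNP"]),
]
x_toy, y_toy = encode_sentences(toy_batch, toy_words, toy_tag_map, max_len=4)
print(x_toy)        # [[2 3 4 0] [3 0 0 0]]
print(y_toy.shape)  # (2, 4, 2)
```

Padded positions keep an all-zero label vector here; they would still need to be masked or otherwise excluded from the loss and the metrics.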

# [Task 4 - 1.0 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using macro F1-score, computed over **all** tokens.
* **Concatenate** all tokens in a data split to compute the F1-score. (**Hint**: accumulate FP, TP, FN, TN iteratively) 
* **Do not consider punctuation and symbol classes** $\rightarrow$ [What is punctuation?](https://en.wikipedia.org/wiki/English_punctuation)

**Note**: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe are **not** considered as OOV
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)
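
The metric described above can be sketched with per-class counters that are updated batch by batch and turned into a macro F1-score at the end. The set of punctuation and symbol tags below is an assumption based on the Penn Treebank tag set and should be checked against the tags present in the corpus. The function names are hypothetical, and averaging only over classes that occur in the gold labels is a design choice of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# Assumption: Penn Treebank tags treated as punctuation or symbols
PUNCTUATION_TAGS = {".", ",", ":", "``", "''", "-LRB-", "-RRB-", "$", "#", "SYM"}


def update_counts(counts: Dict[str, Counter],
                  gold: Iterable[str],
                  predicted: Iterable[str],
                  ignored: Set[str] = PUNCTUATION_TAGS) -> None:
    """Accumulate per-class TP, FP and FN for one batch or document."""
    for g, p in zip(gold, predicted):
        if g in ignored:
            continue  # tokens whose gold tag is punctuation are not scored
        if g == p:
            counts["tp"][g] += 1
        else:
            counts["fn"][g] += 1
            if p not in ignored:
                counts["fp"][p] += 1


def macro_f1(counts: Dict[str, Counter]) -> float:
    """Average the per-class F1 over the classes seen in the gold labels."""
    classes = set(counts["tp"]) | set(counts["fn"])
    if not classes:
        return 0.0
    scores = []
    for c in classes:
        tp, fp, fn = counts["tp"][c], counts["fp"][c], counts["fn"][c]
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


toy_counts = {"tp": Counter(), "fp": Counter(), "fn": Counter()}
update_counts(toy_counts, ["NN", "NN", "VB"], ["NN", "VB", "VB"])  # first batch
update_counts(toy_counts, [".", "JJ"], [".", "NN"])                # second batch
toy_score = macro_f1(toy_counts)
print(round(toy_score, 4))  # 0.3889
```

True negatives are not tracked because the F1-score does not use them.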

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline, Model 1, and Model 2.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.
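
Once each model has been trained with several seeds, the validation scores can be summarised by mean and standard deviation before picking the best model. The sketch below covers only this aggregation step. The function names are hypothetical, and the numbers are placeholders chosen for illustration; they are not experimental results.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_runs(scores_per_model: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute mean and sample standard deviation of the scores of each model."""
    summary = {}
    for name, scores in scores_per_model.items():
        summary[name] = {
            "mean": mean(scores),
            # stdev needs at least two values
            "std": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


def pick_best(summary: Dict[str, Dict[str, float]]) -> str:
    """Return the model name with the highest mean score."""
    return max(summary, key=lambda name: summary[name]["mean"])


# Placeholder values, one per seed: NOT real results
placeholder_scores = {
    "baseline": [0.1, 0.2, 0.3],
    "model_1": [0.2, 0.3, 0.4],
    "model_2": [0.1, 0.1, 0.1],
}
placeholder_summary = summarize_runs(placeholder_scores)
print(pick_best(placeholder_summary))  # model_1
```

Reporting the standard deviation next to the mean shows whether the gap between two models is larger than the variation across seeds.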

# [Task 6 - 1.0 points] Error Analysis

You are tasked to evaluate your best performing model.

### Instructions

* Compare the errors made on the validation and test sets.
* Aggregate model errors into categories (if possible) 
* Comment on the errors and propose possible solutions to address them.
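
One simple way to aggregate errors into categories is to count how often each (gold tag, predicted tag) pair occurs among the misclassified tokens. The sketch below does this with a `Counter`; the function name and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_common_confusions(gold: Iterable[str],
                           predicted: Iterable[str],
                           top_k: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Return the most frequent (gold, predicted) pairs among the errors."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common(top_k)


# Toy labels, for illustration only
toy_gold = ["NN", "NN", "JJ", "VBN", "VBD", "NN"]
toy_pred = ["NN", "JJ", "NN", "VBD", "VBD", "JJ"]
toy_confusions = most_common_confusions(toy_gold, toy_pred, top_k=3)
for (gold_tag, predicted_tag), count in toy_confusions:
    print(f"{gold_tag} -> {predicted_tag}: {count}")
```

Running the same function on the validation and test predictions gives two rankings that can be compared directly.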

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

### Punctuation

**Do not** remove punctuation from documents since it may be helpful to the model.

You should **ignore** it during metrics computation.

If you are curious, you can run additional experiments to verify the impact of removing punctuation.

# The End