
# Assignment 1

**Due date**: 23/12/2021 (dd/mm/yyyy)

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Summary**: Part-of-Speech (POS) tagging as Sequence Labelling using Recurrent Neural Architectures

# Intro

In this assignment we will ask you to perform POS tagging using neural architectures.

You are asked to follow these steps:
*   Download the corpus and split it into training, validation, and test sets, structuring it as a dataframe.
*   Embed the words using GloVe embeddings
*   Create a baseline model, using a simple neural architecture
*   Experiment with small modifications to the baseline model, and choose hyperparameters using the validation set
*   Evaluate your two best models
*   Analyze the errors of your models


**Task**: given a corpus of documents, predict the POS tag for each word

**Corpus**:
Ignore the numeric value in the third column; use only the words/symbols and their labels.
The corpus is available at:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip

**Splits**: documents 1-100 are the train set, 101-150 validation set, 151-199 test set.
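
The split rule above can be sketched as follows. This is a minimal illustration that assumes documents are numbered from 1 in sorted file-name order; `split_for_document` is a hypothetical helper name, not something the assignment prescribes.

```python
def split_for_document(doc_number: int) -> str:
    """Map a 1-based document number to its split name."""
    if not 1 <= doc_number <= 199:
        raise ValueError(f"document number out of range: {doc_number}")
    if doc_number <= 100:
        return "train"
    if doc_number <= 150:
        return "validation"
    return "test"


# Count how many documents fall into each split.
counts = {"train": 0, "validation": 0, "test": 0}
for n in range(1, 200):
    counts[split_for_document(n)] += 1
print(counts)
```

Keeping the rule in one function makes it easy to check the boundaries (100/101 and 150/151) before any data is loaded.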


**Features**: you MUST use GloVe embeddings as the only input features to the model.

**Splitting**: you can decide to split documents into sentences or not, the choice is yours.

**I/O structure**: the input data will have three dimensions: (1) documents/sentences, (2) tokens, (3) features. For the output there are two possibilities: if you use one-hot encoding it will have three dimensions, (1) documents/sentences, (2) token labels, (3) classes; if you use a single integer that indicates the index of the class it will have two dimensions, (1) documents/sentences, (2) token labels.
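
To make the shapes concrete, here is a small NumPy sketch. The sizes (2 sequences, 4 tokens, 3 features, 5 classes) are arbitrary toy values chosen for illustration, not values from the corpus.

```python
import numpy as np

num_sequences, max_tokens, num_features, num_classes = 2, 4, 3, 5

# Input: one feature vector (e.g. a word embedding) per token.
X = np.zeros((num_sequences, max_tokens, num_features))

# Output option 1: one integer class id per token.
y_int = np.array([[1, 3, 0, 0],
                  [2, 2, 4, 1]])

# Output option 2: one one-hot vector per token,
# built by indexing the identity matrix with the class ids.
y_onehot = np.eye(num_classes)[y_int]

print(X.shape, y_int.shape, y_onehot.shape)
```

The two output encodings carry the same information: taking the argmax of the one-hot tensor over its last axis recovers the integer labels.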

**Baseline**: two-layer architecture: a Bidirectional LSTM layer and a Dense/Fully-Connected layer on top; the choice of hyper-parameters is yours.

**Architectures**: experiment using a GRU instead of the LSTM, adding an additional LSTM layer, and adding an additional dense layer; do not mix these variations.


**Training and Experiments**: all the experiments must involve only the training and validation sets.

**Evaluation**: in the end, only the two best models of your choice (according to the validation set) must be evaluated on the test set. The main metric must be F1-Macro computed over the various parts of speech. DO NOT CONSIDER THE PUNCTUATION CLASSES.

**Metrics**: the metric you must use to evaluate your final model is F1-macro, WITHOUT considering the punctuation/symbol classes. During training you can use accuracy: F1 cannot be used unless you use a single (gigantic) batch, because there is no way to aggregate "partial" F1 scores computed on mini-batches.

**Discussion and Error Analysis**: verify and discuss whether the results on the test set are coherent with those on the validation set; analyze the errors made by your model, try to understand what may cause them, and think about how to improve it.
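
The sketch below illustrates both points on made-up toy labels: macro F1 restricted to non-punctuation classes, and why averaging per-batch F1 scores does not reproduce the score computed on the whole set. The `macro_f1` helper is a hand-written illustration, not a library function; in practice a library implementation such as scikit-learn's `f1_score` would normally be used.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, computed over `labels` only."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy gold and predicted tags (illustrative only).
y_true = ["NN", "VB", ".", "NN", "VB", ",", "NN", "NN"]
y_pred = ["NN", "NN", ".", "NN", "VB", ",", "VB", "NN"]

# Punctuation stays in the data but is left out of the evaluated classes.
punctuation = {".", ","}
labels = sorted(set(y_true) - punctuation)

full_score = macro_f1(y_true, y_pred, labels)

# Same data split into two mini-batches of four tokens each.
batch_scores = [macro_f1(y_true[i:i + 4], y_pred[i:i + 4], labels)
                for i in range(0, len(y_true), 4)]
batch_mean = sum(batch_scores) / len(batch_scores)

print(full_score, batch_mean)
```

The two printed numbers differ because F1 is a ratio of counts: the per-class counts must be pooled over the whole set before the ratio is taken.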

**Report**: you are asked to deliver the code of your experiments and a small pdf report of about 2 pages; the pdf must begin with the names of the people of your team and a small abstract (4-5 lines) that sums up your findings.

# Out Of Vocabulary (OOV) terms

How to handle words that are not in GloVe vocabulary?
You can handle them as you want (random embedding, placeholder, whatever!), but they must be STATIC embeddings (you cannot train them).

But there is a very important caveat! As usual, the elements of the test set must not influence the elements of the other splits!

So, when you compute new embeddings for train+validation, you must forget about test documents.
The motivation is to emulate a real-world scenario, where you select and train a model in the first stage, without knowing anything about the testing environment.

For implementation convenience, you CAN use a single vocabulary file/matrix/whatever. The principle of the previous point is that the embeddings inside that file/matrix must be generated independently for train and test splits.

Basically in a real-world scenario, this is what would happen:
1. Starting vocabulary V1 (in this assignment, GloVe vocabulary)
2. Compute embeddings for terms out of vocabulary V1 (OOV1) of the training split 
3. Add embeddings to the vocabulary, so to obtain vocabulary V2=V1+OOV1
4. Training of the model(s)
5. Compute embeddings for terms OOV2 of the validation split 
6. Add embeddings to the vocabulary, so to obtain vocabulary V3=V1+OOV1+OOV2
7. Validation of the model(s)
8. Compute embeddings for terms OOV3 of the test split 
9. Add embeddings to the vocabulary, so to obtain vocabulary V4=V1+OOV1+OOV2+OOV3
10. Testing of the final model

In this case, where we already have all the documents, we can simplify the process a bit, but the procedure must remain rigorous.

1. Starting vocabulary V1 (in this assignment, GloVe vocabulary)
2. Compute embeddings for terms out of vocabulary V1 (OOV1) of the training split 
3. Add embeddings to the vocabulary, so to obtain vocabulary V2=V1+OOV1
4. Compute embeddings for terms OOV2 of the validation split 
5. Add embeddings to the vocabulary, so to obtain vocabulary V3=V1+OOV1+OOV2
6. Compute embeddings for terms OOV3 of the test split 
7. Add embeddings to the vocabulary, so to obtain vocabulary V4=V1+OOV1+OOV2+OOV3
8. Training of the model(s)
9. Validation of the model(s)
10. Testing of the final model

Steps 2 and 6 must be completely independent of each other, both in the method and in the documents they use, but each can rely on the previous vocabulary (V1 for step 2 and V3 for step 6).
THEREFORE, if a word is present in both the training split and the test split but not in the starting vocabulary, its embedding is computed in step 2 and it is no longer considered OOV in step 6.
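
The simplified procedure can be sketched as follows. This is only one possible reading: it assumes random static vectors for OOV terms (one of the options the text allows), uses a tiny made-up dictionary as a stand-in for the GloVe vocabulary, and the helper name `add_oov_embeddings` is hypothetical.

```python
import random


def add_oov_embeddings(vocabulary, tokens, dim, rng):
    """Add a fixed random vector for every token missing from `vocabulary`.

    Returns the tokens that were added, i.e. the OOV terms of this split.
    """
    added = []
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            added.append(token)
    return added


dim = 3
# V1: toy stand-in for the GloVe vocabulary (values are made up).
vocabulary = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}

train_tokens = ["the", "cat", "zorp"]
valid_tokens = ["the", "blik"]
test_tokens = ["zorp", "cat", "quux"]

rng = random.Random(0)
# Each call only looks at its own split and at the vocabulary built so far.
oov1 = add_oov_embeddings(vocabulary, train_tokens, dim, rng)  # V2 = V1 + OOV1
oov2 = add_oov_embeddings(vocabulary, valid_tokens, dim, rng)  # V3 = V2 + OOV2
oov3 = add_oov_embeddings(vocabulary, test_tokens, dim, rng)   # V4 = V3 + OOV3

print(oov1, oov2, oov3)
```

Note that `"zorp"` appears in both the training and test tokens: it receives its vector when the training split is processed and is therefore not OOV any more when the test split is processed.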

# Report
The report must not be just a copy and paste of graphs and tables!

The report must not be longer than 2 pages and must contain:
* The names of the members of your team
* A short abstract (4-5 lines) that sums up everything
* A general description of the task you have addressed and how you have addressed it
* A short description of the models you have used
* Some tables that sum up your findings in validation and test and a discussion of those results
* The most relevant findings of your error analysis

# Evaluation Criterion

The goal of this assignment is not to prove you can find the best model ever, but to face a common task, structure it correctly, and follow a correct and rigorous experimental procedure.
In other words, we don't care if your final models are awful as long as you have followed the correct procedure and written a decent report.

The score of the assignment will be computed roughly as follows
* 1 point for the general setting of the problem
* 1 point for the handling of OOV terms
* 1 point for the models
* 1 point for train-validation-test procedure
* 2 points for the discussion of the results, error analysis, and report

This distribution of scores is tentative and we may decide to alter it at any moment.
We also reserve the right to assign a small bonus (0.5 points) to any assignment that is particularly worthy. Similarly, in case of grave errors, we may decide to assign an equivalent malus (-0.5 points).

# Contacts

In case of any doubt, question, or issue, or if you need help, we highly recommend that you check the [course useful material](https://virtuale.unibo.it/pluginfile.php/1036039/mod_resource/content/2/NLP_Course_Useful_Material.pdf) for additional information, and use the Virtuale forums to discuss with other students.

You can always contact us at the following email addresses. To increase the probability of a prompt response, we recommend that you write to both teaching assistants.

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it


# FAQ
* You can use a non-trainable Embedding layer to load the GloVe embeddings
* You can use any library of your choice to implement the networks. Two options are TensorFlow/Keras or PyTorch. Both these libraries have all the classes you need to implement these simple architectures and there are plenty of tutorials around, where you can learn how to use them.

In [43]:
import os, shutil  #  file management
import sys 
import pandas as pd  #  dataframe management
import numpy as np  #  data manipulation

In [160]:
import urllib.request  #  download files
import zipfile  #  unzip files
import nltk
import tensorflow as tf
from tensorflow import keras

In [45]:
dataset_folder = os.path.join(os.getcwd(), "Datasets", "Original")

if not os.path.exists(dataset_folder):
    os.makedirs(dataset_folder)

url = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip'

dataset_path = os.path.join(dataset_folder, "treebank.zip")

if not os.path.exists(dataset_path):
    urllib.request.urlretrieve(url, dataset_path)
    print("Successful download")

with zipfile.ZipFile(dataset_path, 'r') as zip_ref:
    zip_ref.extractall(dataset_folder)
print("Successful extraction")

Successful extraction


In [46]:
dataset_name = "dependency_treebank"


folder = os.path.join(os.getcwd(), "Datasets", "Original", dataset_name)

pre_train = []
pre_valid = []
pre_test = []
i = 1

file_list = sorted(os.listdir(folder))

for filename in file_list:
    file_path = os.path.join(folder, filename)
    if os.path.isfile(file_path):
        # read the whole document
        with open(file_path, mode='r', encoding='utf-8') as text_file:
            text = text_file.read()
        # documents 1-100: train, 101-150: validation, 151-199: test
        if i <= 100:
            pre_train.append(text)
        elif i <= 150:
            pre_valid.append(text)
        else:
            pre_test.append(text)
    i += 1

tr = []
val = []
tes = []

for paragraph in pre_train:
   tr.append(paragraph.split('\n'))
for paragraph in pre_valid:
   val.append(paragraph.split('\n'))
for paragraph in pre_test:
   tes.append(paragraph.split('\n'))


In [47]:
train = []
valid = []
test = []

for i in tr:
  for j in i:
    train.append(j.split('\t'))

for i in val:
  for j in i:
    valid.append(j.split('\t'))

for i in tes:
  for j in i:
    test.append(j.split('\t'))

In [48]:
train_sentences = []
valid_sentences = []
test_sentences = []
train_tags = []
valid_tags = []
test_tags = []

s = []
t = []
for i in train:
  if i[0] != '':
    s.append(i[0])
    t.append(i[1])
  else:
    train_sentences.append(s)
    train_tags.append(t)
    s = []
    t = []

s = []
t = []
for i in valid:
  if i[0] != '':
    s.append(i[0])
    t.append(i[1])
  else:
    valid_sentences.append(s)
    valid_tags.append(t)
    s = []
    t = []

s = []
t = []
for i in test:
  if i[0] != '':
    s.append(i[0])
    t.append(i[1])
  else:
    test_sentences.append(s)
    test_tags.append(t)
    s = []
    t = []

In [49]:
flat_train = [item for sublist in train_sentences for item in sublist]
flat_valid = [item for sublist in valid_sentences for item in sublist]
flat_test = [item for sublist in test_sentences for item in sublist]

In [116]:
from keras.preprocessing.text import Tokenizer

train_tokenizer = Tokenizer()                                        # instantiate tokeniser
train_tokenizer.fit_on_texts(train_sentences)                        # fit tokeniser on data
train_encoded = train_tokenizer.texts_to_sequences(train_sentences)  # use the tokeniser to encode input sequence

valid_tokenizer = Tokenizer()                                        # instantiate tokeniser
valid_tokenizer.fit_on_texts(valid_sentences)                        # fit tokeniser on data
valid_encoded = valid_tokenizer.texts_to_sequences(valid_sentences)  # use the tokeniser to encode input sequence

test_tokenizer = Tokenizer()                                         # instantiate tokeniser
test_tokenizer.fit_on_texts(test_sentences)                          # fit tokeniser on data
test_encoded = test_tokenizer.texts_to_sequences(test_sentences)     # use the tokeniser to encode input sequence

In [117]:
tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(train_tags)
tags_encoded = tag_tokenizer.texts_to_sequences(train_tags)

In [132]:
voc = list(set(train_tokenizer.word_index.keys()))
print(len(voc))
voc += list(set(valid_tokenizer.word_index.keys()) - set(train_tokenizer.word_index.keys()))
print(len(voc))
voc += list(set(test_tokenizer.word_index.keys()) - set(voc))
print(len(voc))

7404
9901
10947


In [136]:
word_index = dict(zip(voc, range(2, len(voc)+2)))  # indices 0 and 1 are reserved for -PAD- and -OOV-
word_index['-PAD-'] = 0
word_index['-OOV-'] = 1

tags = set([item for sublist in train_tags for item in sublist])

tag2index = {t: i + 1 for i, t in enumerate(tags)}
tag2index['-PAD-'] = 0 

In [53]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip


In [54]:
path_to_glove_file = os.path.join(os.getcwd(), "glove.6B.100d.txt")

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))


Found 400000 word vectors.


In [137]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0


# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Word found in GloVe: copy its pre-trained vector
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Word not in GloVe (this includes "-PAD-" and "-OOV-"):
        # assign a static random vector to this row only
        embedding_matrix[i] = np.random.uniform(low=-0.05, high=0.05, size=embedding_dim)
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 10271 words (678 misses)


In [138]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

In [139]:
MAX_LENGTH = len(max(train_sentences, key=len))
print(MAX_LENGTH)

249


In [140]:
train_sentences_X, valid_sentences_X, test_sentences_X, train_tags_y, valid_tags_y, test_tags_y = [], [], [], [], [], []
 
for s in train_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word_index[w.lower()])
        except KeyError:
            s_int.append(word_index['-OOV-'])
 
    train_sentences_X.append(s_int)

for s in valid_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word_index[w.lower()])
        except KeyError:
            s_int.append(word_index['-OOV-'])
 
    valid_sentences_X.append(s_int)

for s in test_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word_index[w.lower()])
        except KeyError:
            s_int.append(word_index['-OOV-'])
 
    test_sentences_X.append(s_int)
 
for s in train_tags:
    train_tags_y.append([tag2index[t] for t in s])

for s in valid_tags:
    valid_tags_y.append([tag2index[t] for t in s])
 
for s in test_tags:
    test_tags_y.append([tag2index[t] for t in s])

In [141]:
from keras.preprocessing.sequence import pad_sequences
 
train_sentences_X = pad_sequences(train_sentences_X, maxlen=MAX_LENGTH, padding='post')
valid_sentences_X = pad_sequences(valid_sentences_X, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(test_sentences_X, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
valid_tags_y = pad_sequences(valid_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')

In [60]:
!pip install -U keras



In [148]:
def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)

In [None]:
tag2index

In [None]:
point = [tag2index['.']]
virg = [tag2index[',']]
weird_apex = [tag2index['``']]
single_apex = [tag2index["''"]]
two_dots = [tag2index[':']]

punct_cat_classes = to_categorical([point, virg, weird_apex, single_apex, two_dots], len(tag2index))
punct_cat_classes.shape

In [149]:
cat_train_tags_y = to_categorical(train_tags_y, len(tag2index))
cat_val_tags_y = to_categorical(valid_tags_y, len(tag2index))
print(len(cat_train_tags_y), len(cat_val_tags_y))
print(len(train_sentences_X), len(valid_sentences_X))

1963 1299
1963 1299


In [164]:
!pip install tensorflow.keras.metrics.Metrics

ERROR: Could not find a version that satisfies the requirement tensorflow.keras.metrics.Metrics (from versions: none)
ERROR: No matching distribution found for tensorflow.keras.metrics.Metrics


In [168]:
from keras import metrics 
from tensorflow.keras.metrics import Metric
from sklearn.metrics import f1_score

In [178]:
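class Val_F1(tf.keras.metrics.Metric):
    '''Macro F1 computed with scikit-learn, only while the flag is on.
    Pass an instance (Val_F1()) to compile() and set run_eagerly=True,
    since f1_score needs NumPy arrays rather than symbolic tensors.'''

    def __init__(self, **kwargs):
        # Initialise as normal and add flag variable for when to run computation
        super(Val_F1, self).__init__(**kwargs)
        self.metric_variable = self.add_weight(name='metric_variable', initializer='zeros')
        self.update_metric = tf.Variable(False, trainable=False)

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Use conditional to determine if computation is done
        if self.update_metric:
            # convert one-hot labels and class probabilities to flat class ids
            true_ids = tf.reshape(tf.argmax(y_true, axis=-1), [-1]).numpy()
            pred_ids = tf.reshape(tf.argmax(y_pred, axis=-1), [-1]).numpy()
            # scores are summed over batches, so the value is a true F1
            # only when the evaluated data is passed as a single batch
            self.metric_variable.assign_add(f1_score(true_ids, pred_ids, average='macro'))

    def result(self):
        return self.metric_variable

    def reset_states(self):
        self.metric_variable.assign(0.)

class ToggleMetrics(tf.keras.callbacks.Callback):
    '''On test begin (i.e. when evaluate() is called or 
     validation data is run during fit()) toggle metric flag '''
    def on_test_begin(self, logs=None):
        for metric in self.model.metrics:
            if isinstance(metric, Val_F1):
                metric.update_metric.assign(True)
    def on_test_end(self, logs=None):
        for metric in self.model.metrics:
            if isinstance(metric, Val_F1):
                metric.update_metric.assign(False)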


In [181]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from tensorflow.keras.optimizers import Adam

gru = GRU(64, return_sequences=True)
 
model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(embedding_layer)
model.add(gru)
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))
 
# Pass a metric instance (not the class); run eagerly because
# Val_F1 calls scikit-learn, which works on NumPy arrays
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy', Val_F1()],
              run_eagerly=True)
 
model.summary()

In [180]:
model.fit(train_sentences_X, cat_train_tags_y, batch_size=128, epochs=40, validation_data=(valid_sentences_X, cat_val_tags_y), callbacks=[ToggleMetrics()])

Epoch 1/40


ValueError: ignored

40 epochs, batch 128
- loss: 0.0893 - accuracy: 0.9754 - val_loss: 0.1006 - val_accuracy: 0.9721


https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

You don't have to compute the whole set of new GloVe embeddings; you must load the pre-trained ones.

The baseline must have only two trainable layers: the BiLSTM and the Dense/FC one. The Dense layer is the "classification head" with softmax activation. You must not add an additional dense layer on top of the baseline. You can use the embedding layer before the BiLSTM, but it must not be trainable.

For the application of the Dense Layer, it is recommended to use a Time-Distributed Dense. In any case, doing otherwise is NOT considered an error.
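
A small NumPy sketch of what "time-distributed" means here: the same dense weights are applied independently at every time step. All sizes and values are arbitrary toy numbers; this illustrates the idea and is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, hidden, classes = 2, 4, 3, 5

H = rng.normal(size=(batch, steps, hidden))  # recurrent outputs, one vector per token
W = rng.normal(size=(hidden, classes))       # dense weights, shared across time steps
b = rng.normal(size=classes)                 # dense bias, shared across time steps

# Time-distributed view: apply the same dense layer to each time step in turn.
per_step = np.stack([H[:, t, :] @ W + b for t in range(steps)], axis=1)

# Equivalent single matrix product over the last axis.
all_at_once = H @ W + b

print(per_step.shape, np.allclose(per_step, all_at_once))
```

Both computations give one score vector per token, which is the shape the softmax classification head needs for sequence labelling.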

Correction regarding the vocabularies for OOV: the final vocabulary must be V4=V1+OOV1+OOV2+OOV3.

Since in this specific case we already know the test set, it is possible to use an embedding matrix with all the words from each split, and without any word that is not present in the documents. It's important to generate the OOV embeddings separately, but it's not a problem to put them in the same matrix: the test set OOVs will never be used during training and so they won't affect the rest of the network in any way.

Evaluation: for the early stopping you need to use accuracy since it's not possible to distribute and aggregate F1 across batches, but for any other evaluation, including choosing the best model on validation set, you must use F1.
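
As a rough sketch of this two-level use of metrics: early stopping watches validation accuracy epoch by epoch, while the choice among trained models uses validation F1. All numbers and model names below are made up for illustration, and `early_stopping_epoch` is a hypothetical helper, not a library function.

```python
def early_stopping_epoch(val_accuracy, patience):
    """Return the 0-based index of the best epoch, stopping once accuracy
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_value, waited = 0, float("-inf"), 0
    for epoch, value in enumerate(val_accuracy):
        if value > best_value:
            best_epoch, best_value, waited = epoch, value, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Made-up validation accuracy per epoch for one model.
val_accuracy = [0.80, 0.85, 0.88, 0.87, 0.88, 0.86, 0.90]
stopped_at = early_stopping_epoch(val_accuracy, patience=3)

# Made-up validation macro F1 for three hypothetical trained models.
validation_f1 = {"model_a": 0.61, "model_b": 0.66, "model_c": 0.64}
two_best = sorted(validation_f1, key=validation_f1.get, reverse=True)[:2]

print(stopped_at, two_best)
```

With a small patience the loop stops before the late improvement in the last epoch, which is the usual trade-off between training time and the risk of stopping too early.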

Punctuation: you must keep the punctuation in the documents, since it may be helpful for the model. You must simply ignore it when you evaluate the model, by leaving the punctuation classes out of the ones you use to compute the F1 macro score.
