# NLP2022 - Homework 2


This notebook contains code for fast data processing and experiment execution for the second homework of the course Natural Language Processing 2022. It has been written entirely by Dennis Rotondi 1834864 using the methodologies learned throughout the course.

In [1]:
# imports and deterministic stuff
import os, sys
sys.path.append(os.path.join("..")) #to access hw2 functions
sys.path.append(os.path.join("../..")) #to access model folder
os.environ['WANDB_NOTEBOOK_NAME'] = './nlp_hw2.ipynb'

import torch
import numpy as np
import random
import pytorch_lightning as pl
from collections import OrderedDict, Counter

from utils import read_dataset
import wandb
from pytorch_lightning.loggers.wandb import WandbLogger

np.random.seed(0)
random.seed(0)
torch.cuda.manual_seed(0)
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True  # Note that this Deterministic mode can have a performance impact
torch.backends.cudnn.benchmark = False
_ = pl.seed_everything(0)

Global seed set to 0


## Dataset Analysis

As for the bonus exercise and hw1, I want to start by looking at the data to better understand how to proceed with the pre-processing. I've read that there are problems with some (sentence, ground truth) pairs; since we are not allowed to change them, I'll directly discard them in the training phase if needed. I'll do my analysis mostly on the English dataset, since it is mandatory and larger.

In [None]:
en_file = "../../data/EN/train.json"

sentences, labels = read_dataset(en_file)
print("Number of training sentences (EN): "+ str(len(sentences.keys())))
# I'm just playing with the field of a sentence_id to understand our data samples.
sentence_id = '1996/a/50/18_supp__323:5'
print("## SENTENCE {} ##".format(sentence_id))
for key in sentences[sentence_id]:
    print(key)
    print(sentences[sentence_id][key])
print("## LABEL ##")
for key in labels[sentence_id]:
    print(key)
    print(labels[sentence_id][key])

# let's check and count the different frames and roles
verbatlas_frames = Counter()
predicate_roles = Counter()

for k in labels:
    verbatlas_frames.update(labels[k]['predicates'])
    for idx in labels[k]['roles']:
        predicate_roles.update(labels[k]['roles'][idx])

In [None]:
print("## VF ##")
print(verbatlas_frames)
# list of frames in the training dataset
l_vf = list(verbatlas_frames.keys())
print(l_vf)
print(len(l_vf))
print("## RL ##")
print(predicate_roles)
p_r = list(predicate_roles.keys())
print(p_r)
print(len(p_r))

We are clearly not using all the 466 [VerbAtlas](https://verbatlas.org/) frames, but fewer than 3/4 of them: 303. Working with fewer classes should make the task easier, because the system only has to discriminate among a subset of them. In the next code cell I check whether the dev set contains frames that never appear in the training set.

In [None]:
dev_sentences, dev_labels = read_dataset("../../data/EN/dev.json")
print("Number of dev sentences (EN): "+ str(len(dev_sentences.keys())))
for k in dev_labels:
    verbatlas_frames.update(dev_labels[k]['predicates'])
    for idx in dev_labels[k]['roles']:
        predicate_roles.update(dev_labels[k]['roles'][idx])

l_vf_dev = list(verbatlas_frames.keys())
print(len(l_vf_dev))

So the dev set contains only 4 frames that never appear in the training set. This information will be useful later, when I deal with the optional part of this homework.
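
The cell above only prints the size of the union of the train and dev frames. A minimal sketch of how the frames that occur only in the dev set could be listed explicitly, assuming the same `labels[k]['predicates']` layout used above and assuming that `_` marks tokens that are not predicates. The toy dictionaries, the frame names in them and the helper `unseen_in_train` are illustrative and not part of the homework code:

```python
from collections import Counter


def count_frames(labels):
    """Count every entry of the 'predicates' lists, as done in the cells above."""
    frames = Counter()
    for sentence_id in labels:
        frames.update(labels[sentence_id]["predicates"])
    return frames


def unseen_in_train(train_labels, dev_labels):
    """Return the sorted frames that occur in dev but never in train."""
    train_frames = count_frames(train_labels)
    dev_frames = count_frames(dev_labels)
    return sorted(set(dev_frames) - set(train_frames))


# Toy data (hypothetical): "_" is assumed to mark non-predicate tokens.
toy_train = {"s1": {"predicates": ["_", "EAT_BITE", "_"]}}
toy_dev = {"s2": {"predicates": ["_", "EAT_BITE", "_", "MOVE-ONESELF"]}}

print(unseen_in_train(toy_train, toy_dev))
```

Keeping two separate counters also avoids the side effect of the cell above, where `verbatlas_frames` is updated in place with the dev labels.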

Now that I'm starting to understand the samples, it's clear that our dataset does not need much pre-processing, since we already have word tokens and the associated lemmas for each sentence. Other useful statistics are how long the sentences are on average, how many predicates they have and how the distribution of POS tags correlates with roles and predicates. I'll quickly compute them in what follows.

In [None]:
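import matplotlib.pyplot as plt  # needed by the plots below

# NOTE: tokens_s, labels_s and vocab are not defined in the cells above, they must already be in scope
token_size=len(tokens_s)  # number of sentences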
sentences_size=list()
k=0
for s,l in zip(tokens_s,labels_s):
    sentences_size.append(len(s))
    for w,lab in zip(s,l):
        if not w in vocab and lab!="O":
            k+=1
print("important words lost",k)
k=0
for s,l in zip(tokens_s,labels_s):
    for w,lab in zip(s,l):
        if not w in vocab and lab!="O":
            k+=1
            break
print("percentage of dirty sentences",k/token_size) #sentences that contain at least one OOV word with a significant label (!= "O")

sent_np=np.asarray(sentences_size)
print("mean", sent_np.mean())
print("std", sent_np.std())
print("min", sent_np.min())
print("max", sent_np.max())

plt.figure(figsize=(8,8)) #to increase the plot size
_ = plt.hist(sent_np, bins = 'auto') 
plt.title("Histogram of sentences size available") 
plt.show() #it's possible to notice that most of them are between size 7 and 30

flat_labels = sum(labels_s,[]) #to flatten the list
count = Counter(flat_labels)
plt.figure(figsize=(10,10))
_ = plt.bar(count.keys(),count.values()) 
plt.title("Bar Plot of labels frequency") 
plt.show()
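
The cell above relies on `tokens_s`, `labels_s` and `vocab`, which are not created in the cells shown so far. A minimal sketch of the same kind of length statistics computed directly from a `sentences`-like dictionary, assuming each entry exposes a `words` list and a `predicates` list in which `_` marks non-predicate tokens (both field names and the toy data are assumptions):

```python
import statistics


def sentence_stats(sentences):
    """Length and predicate-count statistics over a dict of sentences."""
    lengths = [len(s["words"]) for s in sentences.values()]
    # number of entries different from "_" in each 'predicates' list
    n_predicates = [sum(p != "_" for p in s["predicates"]) for s in sentences.values()]
    return {
        "mean_len": statistics.mean(lengths),
        "std_len": statistics.pstdev(lengths),  # population std, like np.std
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_predicates": statistics.mean(n_predicates),
    }


# Toy data (hypothetical field names and frames).
toy_sentences = {
    "a": {"words": ["The", "cat", "sleeps"], "predicates": ["_", "_", "SLEEP"]},
    "b": {"words": ["She", "reads", "and", "writes", "."],
          "predicates": ["_", "READ", "_", "WRITE", "_"]},
}

stats = sentence_stats(toy_sentences)
for name, value in stats.items():
    print(name, value)
```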

## TODO

## Training

Now it's time to train our model. PyTorch Lightning makes it easy to modularize everything and to train all the different models with a few lines of code. Moreover, using wandb as logger, the training evolution is plotted automatically in high-quality plots and the training history of the different trials can be saved. This will be very useful for comparing the experiments in the report.

In [2]:
from datasets_srl import SRL_DataModule
from implementation import HParams, SRL_34
from dataclasses import dataclass, asdict
from pprint import pprint
from utils import read_dataset, evaluate_argument_classification, evaluate_argument_identification
from mergedeep import merge

# these are some parameters that, as I said, allow us to modularize the training. We need to store the hyperparameters of the model (lr, wd, ...), the language
# and the task on which we want to perform the training.
hparams = asdict(HParams())
print(hparams)
languages = ["EN", "ES", "FR"]
tasks = ["34", "234", "1234"]
models = {"34": SRL_34}


language = languages[0]
task = tasks[0]
epochs = 20
SRL_Model = models[task]
# after reading the dataset I merge the two dicts (sentences and labels), since they have a field in common (predicates)
# and keeping 2 copies of it in memory is only a waste of space.
sentences = merge(*read_dataset("../../data/"+language+"/train.json"))
sentences_test = merge(*read_dataset("../../data/"+language+"/dev.json"))

working with notebook need an 'absolute' import
{'need_train': True, 'batch_size': 256, 'n_cpu': 8, 'language_model_name': 'bert-base-uncased', 'lr': 0.001, 'wd': 0, 'embedding_dim': 768, 'hidden_dim': 512, 'bidirectional': True, 'num_layers': 2, 'dropout': 0.2, 'trainable_embeddings': False, 'role_classes': 27, 'srl_34_ckpt': 'model/srl_34_EN.ckpt'}
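
The merge of the two dictionaries can be pictured with a small recursive helper. This is only a sketch of the idea in plain Python, not the `mergedeep` implementation; `deep_merge`, the field names and the toy entries are hypothetical:

```python
def deep_merge(destination, source):
    """Recursively copy `source` into `destination` (in place) and return it.

    Nested dicts are merged key by key; any other value in `source`
    overwrites the one already stored in `destination`.
    """
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(destination.get(key), dict):
            deep_merge(destination[key], value)
        else:
            destination[key] = value
    return destination


# Toy data (hypothetical): both dicts share the sentence id and the 'predicates' field.
toy_sentences = {"id1": {"words": ["He", "runs"], "predicates": ["_", "RUN"]}}
toy_labels = {"id1": {"predicates": ["_", "RUN"], "roles": {1: ["AGENT", "_"]}}}

merged = deep_merge(toy_sentences, toy_labels)
print(sorted(merged["id1"].keys()))
```

Like this helper, `merge` from `mergedeep` modifies its first argument in place, so after the call the first dictionary holds a single copy of the shared field together with the label fields.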


In [3]:
data = SRL_DataModule(hparams, task, sentences, sentences_test)
model = SRL_Model(hparams=hparams)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.callbacks import ModelCheckpoint
# Define the logger
# https://www.wandb.com/articles/pytorch-lightning-with-weights-biases.
# NOTE: to use wandb properly you need to log in to wandb (an account is required)
# or use a different logger, e.g. TensorBoard. I'm used to this one, so I'll go for it.
wandb.require("service")
wandb_logger = WandbLogger(project="SRL_"+task, log_model = True)
wandb_logger.experiment.watch(model, log = 'all', log_freq = 1000)
# Define the trainer
metric_to_monitor = 'avg_val_loss'
# we employ the early stopping technique to avoid hours of useless training, pl gives it for free
early_stop_callback = EarlyStopping(monitor = metric_to_monitor, min_delta = 0.00, patience = 3, verbose = True, mode = "min")
# it is also useful to keep track of the best model during the epochs (if you remember, I did all this manually in the last hw),
# we have a callback even for this.
checkpoint_callback = ModelCheckpoint(
                        save_top_k = 1,
                        monitor = metric_to_monitor,
                        mode = "min",
                        dirpath = "../../model",
                        filename = "SRL_"+task+"-{epoch:02d}-{avg_val_loss:.4f}",
                        verbose = True
                    )
# the trainer collects all the useful information defined so far for the training
trainer = pl.Trainer(logger = wandb_logger,
                    max_epochs = epochs, 
                    gpus = 1,
                    callbacks = [early_stop_callback, checkpoint_callback])    
# Start the training
trainer.fit(model, data)
# Log the trained model
trainer.save_checkpoint("../../model/SRL_"+task+"_last.ckpt")
wandb.save("wandb/")
wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33mdenondi[0m. Use [1m`wandb login --relogin`[0m to force relogin


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type      | Params
------------------------------------------------
0 | transformer_model | BertModel | 109 M 
1 | lstm              | LSTM      | 11.6 M
2 | dropout           | Dropout   | 0     
3 | classifier        | Linear    | 27.7 K
------------------------------------------------
11.6 M    Trainable params
109 M     Non-trainable params
121 M     Total params
484.243   Total estimated model params size (MB)


Epoch 0: 100%|██████████| 60/60 [02:01<00:00,  2.03s/it, loss=0.263, v_num=2yqq]

Metric avg_val_loss improved. New best score: 0.216


Epoch 0, global step 50: 'avg_val_loss' reached 0.21583 (best 0.21583), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=00-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 1: 100%|██████████| 60/60 [02:09<00:00,  2.16s/it, loss=0.181, v_num=2yqq]

Metric avg_val_loss improved by 0.059 >= min_delta = 0.0. New best score: 0.156


Epoch 1, global step 100: 'avg_val_loss' reached 0.15633 (best 0.15633), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=01-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 2: 100%|██████████| 60/60 [02:02<00:00,  2.05s/it, loss=0.143, v_num=2yqq]

Metric avg_val_loss improved by 0.033 >= min_delta = 0.0. New best score: 0.123


Epoch 2, global step 150: 'avg_val_loss' reached 0.12291 (best 0.12291), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=02-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 3: 100%|██████████| 60/60 [02:01<00:00,  2.02s/it, loss=0.123, v_num=2yqq]

Metric avg_val_loss improved by 0.016 >= min_delta = 0.0. New best score: 0.107


Epoch 3, global step 200: 'avg_val_loss' reached 0.10728 (best 0.10728), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=03-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 4: 100%|██████████| 60/60 [02:01<00:00,  2.02s/it, loss=0.104, v_num=2yqq]

Metric avg_val_loss improved by 0.015 >= min_delta = 0.0. New best score: 0.092


Epoch 4, global step 250: 'avg_val_loss' reached 0.09221 (best 0.09221), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=04-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 5: 100%|██████████| 60/60 [02:01<00:00,  2.03s/it, loss=0.0878, v_num=2yqq]

Metric avg_val_loss improved by 0.010 >= min_delta = 0.0. New best score: 0.082


Epoch 5, global step 300: 'avg_val_loss' reached 0.08195 (best 0.08195), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=05-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 6: 100%|██████████| 60/60 [02:01<00:00,  2.02s/it, loss=0.0802, v_num=2yqq]

Metric avg_val_loss improved by 0.007 >= min_delta = 0.0. New best score: 0.075


Epoch 6, global step 350: 'avg_val_loss' reached 0.07539 (best 0.07539), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=06-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 7: 100%|██████████| 60/60 [02:06<00:00,  2.11s/it, loss=0.0709, v_num=2yqq]

Metric avg_val_loss improved by 0.003 >= min_delta = 0.0. New best score: 0.072


Epoch 7, global step 400: 'avg_val_loss' reached 0.07190 (best 0.07190), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=07-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 8: 100%|██████████| 60/60 [01:12<00:00,  1.21s/it, loss=0.0653, v_num=2yqq]

Metric avg_val_loss improved by 0.004 >= min_delta = 0.0. New best score: 0.068


Epoch 8, global step 450: 'avg_val_loss' reached 0.06824 (best 0.06824), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=08-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 9: 100%|██████████| 60/60 [00:49<00:00,  1.21it/s, loss=0.0568, v_num=2yqq]

Epoch 9, global step 500: 'avg_val_loss' was not in top 1


Epoch 10: 100%|██████████| 60/60 [00:47<00:00,  1.27it/s, loss=0.0515, v_num=2yqq]

Metric avg_val_loss improved by 0.005 >= min_delta = 0.0. New best score: 0.063


Epoch 10, global step 550: 'avg_val_loss' reached 0.06317 (best 0.06317), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=10-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 11: 100%|██████████| 60/60 [00:48<00:00,  1.25it/s, loss=0.0482, v_num=2yqq]

Metric avg_val_loss improved by 0.001 >= min_delta = 0.0. New best score: 0.062


Epoch 11, global step 600: 'avg_val_loss' reached 0.06212 (best 0.06212), saving model to '/home/dennis/Desktop/nlp2022-hw2/model/SRL_34-epoch=11-avg_val_loss_vae=0.0000.ckpt' as top 1


Epoch 12:   0%|          | 0/60 [00:00<?, ?it/s, loss=0.0482, v_num=2yqq]         
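
The early stopping rule configured in the training cell (mode `min`, `patience = 3`, `min_delta = 0.0`) can be pictured with a few lines of plain Python. This is a simplified sketch, not the pytorch-lightning implementation; the function name and the loss values are hypothetical:

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None.

    A loss counts as an improvement only if it is lower than the best
    value seen so far by more than `min_delta`.
    """
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Toy validation losses (hypothetical): the best value is reached at epoch 2.
toy_losses = [0.22, 0.16, 0.12, 0.13, 0.125, 0.121]
print(early_stopping_epoch(toy_losses))
```

A single epoch without improvement, such as epoch 9 in the log above, only increases the counter; the counter is reset as soon as the monitored loss improves again.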

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
model = SRL_34.load_from_checkpoint("../../"+hparams["srl_34_ckpt"]).to(device)

In [None]:
predict = model.predict(sentences_test, require_ids=True)

In [None]:
print("Argument Classification")
print(evaluate_argument_classification(sentences_test, predict))
print("Argument Identification")
print(evaluate_argument_identification(sentences_test, predict))
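
How the two scores relate can be sketched as follows, under the assumption that identification counts an argument as correct when its token position matches the gold one, while classification also requires the same role label. This is an illustration with a hypothetical helper `prf1` and toy role names, not the code of `evaluate_argument_identification` or `evaluate_argument_classification`:

```python
def prf1(gold, predicted):
    """Precision, recall and F1 between a gold set and a predicted set."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy arguments of one predicate, stored as (token position, role) pairs.
gold_roles = {(0, "AGENT"), (2, "THEME"), (4, "LOCATION")}
pred_roles = {(0, "AGENT"), (2, "PATIENT"), (5, "TIME")}

# Identification: only the positions are compared.
identification = prf1({pos for pos, _ in gold_roles}, {pos for pos, _ in pred_roles})
# Classification: position and role must both match.
classification = prf1(gold_roles, pred_roles)

print("identification", identification)
print("classification", classification)
```

Under this assumption the classification score can never exceed the identification score, since every correctly classified argument is also correctly identified.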

In [None]:
a = torch.load("../../model/srl_34_EN.ckpt")

In [None]:
print(a)

## TODO: NOW the confusion matrix 

## TOREMOVE

In [None]:
from transformers import AutoModel
from transformers import AutoTokenizer

auto_model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
sequence = "Using a Transformer network is simple"
sequence2 = "we ciao"
# tokens = tokenizer.tokenize(sequence)

# # print(tokens)

# ids = tokenizer.convert_tokens_to_ids(tokens)
# # print(ids)

tokenized_inputs = tokenizer([[sequence, "simple"],[sequence2, "ciao"]],padding=True, return_tensors="pt")  # "pt" -> return PyTorch torch.Tensor objects, rather than a list of tokens

print(tokenized_inputs)
print(tokenized_inputs['input_ids'].shape)
print(tokenized_inputs.word_ids(0))
print(tokenized_inputs.word_ids(1))
# NOTE: in the dataset use those word ids to average and simply filter for example... MAY NOT WORK:...
sequence3 = ["we", "ciao"]
print("second - use this one!!")
tokenized_inputs2 = tokenizer.batch_encode_plus([([sequence.split(), ["simple"]]),(sequence3,["ciao"])], add_special_tokens=True, is_split_into_words=True, padding=True, return_tensors="pt")  # "pt" -> return PyTorch torch.Tensor objects, rather than a list of tokens
print(tokenized_inputs2)
print(tokenized_inputs2['input_ids'].shape)
print(tokenized_inputs2.word_ids(0))
a, b, c = tokenizer.batch_encode_plus([([sequence.split(), ["simple"]]),(sequence3,["ciao"])], add_special_tokens=True, is_split_into_words=True, padding=True, return_tensors="pt")  # "pt" -> return PyTorch torch.Tensor objects, rather than a list of tokens 
print("aaa")
print(b)
print("aaa")
# print(tokenized_inputs2.word_ids(1))
# sequence_a = "This is a short sequence."
# sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
# print("test")
# padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
# print(padded_sequences)

transformers_outputs = auto_model(**tokenized_inputs)#['input_ids']
# print(transformers_outputs)
transformers_outputs_sum = torch.stack(transformers_outputs.hidden_states[-4:], dim=0).sum(dim=0)
print(transformers_outputs_sum.shape)
# I should remove 2 sep and 1 cls, 1 additional token -> final size 7



# filter_toke = tokenized_inputs['input_ids'][:, 1:-3, ...]
# print(filter_toke.shape)
# labels  = tokenized_inputs.word_ids()[1:-3]
# samp_size = filter_toke.shape[1]
# M = torch.zeros(max(labels)+1, samp_size)
# M[labels, torch.arange(samp_size)] = 1
# print(M)
# M = torch.nn.functional.normalize(M, p=1, dim=1)
# print(M)
# torch.mm(M, filter_toke[0]).shape

# i want to have an ID to unde
# item["role_id"] = (item["role_labels"] == self.labels_to_id["_"]).long()

In [None]:
# NOTE: the group / class comments follow the `labels` tensor defined below
samples = torch.Tensor([[
                     [0.1, 0.1],    #-> group / class 1
                     [0.2, 0.2],    #-> group / class 2
                     [0.4, 0.4],    #-> group / class 2
                     [0.0, 0.0]     #-> group / class 2
              ],
              [
                     [0.1, 0.1],    #-> group / class 1
                     [0.2, 0.2],    #-> group / class 2
                     [0.0, 0.0],    #-> group / class 0
                     [12.0, 12.0]   #-> group / class 2
              ]])

from transformers_embedder.embedder import TransformersEmbedder
labels = torch.LongTensor([[1, 2, 2, 2],[1,2,0,2]])


print(TransformersEmbedder.merge_scatter(samples, labels))
print(TransformersEmbedder.merge_scatter(samples, labels).shape)


In [None]:
# code taken from Riccardo Orlando's transformers-embedder https://github.com/Riccorl/transformers-embedder
# it is needed to average the wordpieces after the tokenization, to have more reliable embeddings. This is
# useful because for OOV words (or other languages) we can capture more information than by simply using
# the first token.
def merge_scatter(embeddings: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    """
    Minimal version of ``scatter_mean``, from `pytorch_scatter
    <https://github.com/rusty1s/pytorch_scatter/>`_
    library, that is compatible for ONNX but works only for our case.
    It is used to compute word level embeddings from the transformer output.
    Args:
        embeddings (:obj:`torch.Tensor`):
            The embeddings tensor.
        indices (:obj:`torch.Tensor`):
            The sub-word indices.
    Returns:
        :obj:`torch.Tensor`
    """

    def broadcast(src: torch.Tensor, other: torch.Tensor):
        """
        Broadcast ``src`` to match the shape of ``other``.
        Args:
            src (:obj:`torch.Tensor`):
                The tensor to broadcast.
            other (:obj:`torch.Tensor`):
                The tensor to match the shape of.
        Returns:
            :obj:`torch.Tensor`: The broadcasted tensor.
        """
        for _ in range(src.dim(), other.dim()):
            src = src.unsqueeze(-1)
        src = src.expand_as(other)
        return src

    def scatter_sum(src: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
        """
        Sums the elements in ``src`` that have the same indices as in ``index``.
        Args:
            src (:obj:`torch.Tensor`):
                The tensor to sum.
            index (:obj:`torch.Tensor`):
                The indices to sum.
        Returns:
            :obj:`torch.Tensor`: The summed tensor.
        """
        index = broadcast(index, src)
        size = list(src.size())
        size[1] = index.max() + 1
        print(size)
        print(src.dtype)
        out = torch.zeros(size, dtype=src.dtype, device=src.device)
        return out.scatter_add_(1, index, src)

    # replace padding indices with the maximum value inside the batch
    indices[indices == -1] = torch.max(indices)
    merged = scatter_sum(embeddings, indices)
    ones = torch.ones(
        indices.size(), dtype=embeddings.dtype, device=embeddings.device
    )
    count = scatter_sum(ones, indices)
    count.clamp_(1)
    count = broadcast(count, merged)
    merged.true_divide_(count)
    return merged[:,:-1,:] # added by me: drop the last slot along dim 1, where the padding indices were merged
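
What the averaging step computes can be checked on a tiny example without tensors. A minimal sketch in plain Python, assuming a single sentence, a list of word-piece vectors and, for each of them, the index of the word it belongs to; `average_wordpieces` and the toy values are hypothetical and the padding handling of `merge_scatter` is left out:

```python
def average_wordpieces(vectors, word_ids):
    """Average the word-piece vectors that share the same word index."""
    n_words = max(word_ids) + 1
    dim = len(vectors[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vector, word_id in zip(vectors, word_ids):
        counts[word_id] += 1
        for j, value in enumerate(vector):
            sums[word_id][j] += value
    # max(count, 1) mirrors the clamp used above to avoid dividing by zero
    return [[value / max(count, 1) for value in row] for row, count in zip(sums, counts)]


# Toy input: word 1 is split into two word pieces, words 0 and 2 into one each.
pieces = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [6.0, 0.0]]
word_ids = [0, 1, 1, 2]

print(average_wordpieces(pieces, word_ids))
```

The output has one vector per word instead of one per word piece, which is the shape needed to align the embeddings with the word-level labels.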

In [None]:
torch.tensor([1,2,None],dtype=torch.float)

In [None]:
pip install transformers-embedder

In [None]:
torch.cuda.is_available()

self.vae=VAE.load_from_checkpoint(hparams.vae.pth_folder)
self.vae.freeze() #we do not want to train it