<a href="https://colab.research.google.com/github/DomMcOyle/NLP-Assigments-22-23/blob/Assignment-2/Assignment2DP(G)Rwith%20history.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Team Members**: Domenico Dell'Olio, Giovanni Pio Delvecchio, Raffaele Disabato


## Overview

### Problem

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

### Task

Given a question $Q$, a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable;

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair is paired with a rationale $R$: it is a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]" % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```
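
As a rough illustration of how these fields fit together, the sketch below pairs each question with its answer and rationale through `turn_id`. The toy `dialogue` dictionary and the helper name `iter_turns` are assumptions made for this example; they are not part of the dataset tooling.

```python
# Hypothetical toy dialogue mirroring the JSON layout of a CoQA entry.
dialogue = {
    "source": "mctest",
    "story": "Once upon a time, in a barn near a farm house, "
             "there lived a little white kitten named Cotton.",
    "questions": [
        {"input_text": "What color was Cotton?", "turn_id": 1},
    ],
    "answers": [
        {"span_start": 59, "span_end": 93,
         "span_text": "a little white kitten named Cotton",
         "input_text": "white", "turn_id": 1},
    ],
}


def iter_turns(dialogue):
    """Yield (turn_id, question, answer, rationale) tuples for one dialogue."""
    # Index the answers by turn so that list order does not matter.
    answers_by_turn = {a["turn_id"]: a for a in dialogue["answers"]}
    for question in dialogue["questions"]:
        answer = answers_by_turn[question["turn_id"]]
        yield (question["turn_id"], question["input_text"],
               answer["input_text"], answer["span_text"])


turns = list(iter_turns(dialogue))
for turn_id, q, a, r in turns:
    print(f"Q{turn_id}: {q} -> A{turn_id}: {a} (rationale: {r})")
```

In this toy entry the rationale can also be recovered by slicing the passage with `span_start` and `span_end`.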

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one groundtruth answer $A$ and text rationale $R$.

CoQA only contains 1.3% of unanswerable questions. For simplicity, we **ignore** those QA pairs.

# Assignment Evaluation

Assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-paste terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean and modular code that is easy to read and debug (notebooks can be quite tricky from time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include the data tokenization and conversion steps?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
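from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
# Alternatively, truncate explicitly (calling the tokenizer returns a dict-like
# encoding, whereas tokenizer.tokenize only returns the list of tokens)
inputs = tokenizer(text, add_special_tokens=True, truncation=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']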
```

**Suggestion**: Hugging Face's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting an **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length
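
The second workaround can be sketched as follows. This is a minimal, standard-library-only illustration under stated assumptions: the `lengths` values are made up, and `nearest_rank_quantile` and `pad_or_truncate` are hypothetical helper names (a numerical library's quantile function would serve equally well).

```python
import math

# Hypothetical token counts for the sequences of one split (one long outlier).
lengths = [12, 15, 18, 20, 22, 25, 30, 35, 40, 400]


def nearest_rank_quantile(values, q):
    """Return the smallest value with at least a fraction q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def pad_or_truncate(token_ids, target_len, pad_id=0):
    """Cut token_ids to target_len, then right-pad with pad_id if it is shorter."""
    clipped = list(token_ids[:target_len])
    return clipped + [pad_id] * (target_len - len(clipped))


max_len = max(lengths)                        # pad everything to the longest sequence
q_len = nearest_rank_quantile(lengths, 0.9)   # pad to the 90th percentile instead
print(f"max-length padding: {max_len} tokens, quantile padding: {q_len} tokens")
```

With quantile-based padding only the outlier is truncated, while every batch tensor becomes much narrower, which is usually what relieves the memory pressure.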

---
---

## Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

In [None]:
# required libraries to execute the notebook
!pip install transformers
!pip install tensorflow-addons
!pip install datasets
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, http

## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.

The following cell contains all the needed imports for the assignment.

In [None]:
import json
import random
import urllib.request
import pickle
import re
import os
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa

from google.colab import drive
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from datasets import Dataset, load_from_disk
from evaluate import load
from transformers import TFAutoModel, AutoTokenizer
from transformers import logging

logging.set_verbosity_error() # removes warning related to transformers

We also make use of Google Drive to save all the results and to cope with the Colab time restrictions.

In [None]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Dataset Download
The following two cells contain the provided helper functions to download the dataset.

In [None]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [None]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test') 

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:06, 8.14MB/s]


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:01, 7.01MB/s]                            

Download completed!





#### Dataset extraction and cleaning
In the following cell we define the function that sets the seed for the environment, to allow reproducibility, and the function that formats the dataset as (Question + Passage[+History], Answer) pairs. The latter also removes unanswerable QA pairs automatically and can append the history of the previous dialogue turns to the passage. The history is added in inverted order: $Q_i$, $P$ [SEP] $Q_{i-1}$ [SEP] $A_{i-1}$ [SEP] $Q_{i-2}$ [SEP] $A_{i-2}$ ... [SEP] $A_{1}$

In this way, when the input is truncated, we have a higher probability of retaining information related to the current question, since questions that refer back to the history are usually linked to the immediately preceding turns.

This function also extracts the text source of each QA pair, which will be useful during the analysis step.
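
The inverted ordering can be illustrated in isolation with a small sketch. The helper name `build_context`, the `<passage>` placeholder and the second answer are assumptions made for this example only.

```python
def build_context(passage, history, sep_token="[SEP]"):
    """Append previous (question, answer) turns to the passage, most recent first."""
    context = passage
    for question, answer in reversed(history):
        context += sep_token + question + sep_token + answer
    return context


# Hypothetical history of two earlier turns, in chronological order.
history = [("What color was Cotton?", "white"),
           ("Where did she live?", "in a barn")]
context = build_context("<passage>", history)
print(context)
```

Since the oldest turns end up at the far right of the string, they are the first to be dropped when the tokenizer truncates the sequence from the right.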

In [None]:
def set_seed(SEED):
  """
  Function to set the random seed and ensure reproducibility of results.
  :params:
    SEED: integer representing the seed for pseudorandom generation to be set
  """
  random.seed(SEED) # if you're using random
  np.random.seed(SEED) # if you're using numpy
  tf.random.set_seed(SEED) # setting the seed for tensorflow too
  os.environ['TF_DETERMINISTIC_OPS'] = '1'

def extract_data(split_dataset,add_history=False,sep_char="[SEP]"):
  """
  Function extracting data from the list of dictionaries in the CoQA dataset. It
  removes unanswerable pairs and optionally adds history in inverted order:
  Pi sep_char Qi-1 sep_char Ai-1 ... A1. It also returns the source for each input
  :params:
    split_dataset: list of dictionaries from which to extract the question-passage pairs and the corresponding answers
    add_history: boolean flag that allows adding the history to the XQA output (default=False)
    sep_char: string or character used as separation token by the input tokenizer for the encoder.
      It is used to separate the different parts of the input
  :returns:
    XQA: list containing sublists [question, passage(+history)]
    YQA: list containing answers
    story_source: list containing the story source for each pair
  """  
  XQA = [] # list that will contain the [question, passage(+history)] pairs
  YQA = [] # list that will contain the Answers
  story_source = [] #list that will contain the category/source for each example
  for d in split_dataset: # scan each document
    for i in range(len(d["questions"])): # scan each question
      if d["answers"][i]["span_end"]!=-1: # discard unanswerable questions
        single_example = [] # prepare the single example...
        single_example.append(d["questions"][i]["input_text"]) #... with the question ...
        single_example.append(d["story"]) # ...and the passage
        if add_history:
          for j in range(i-1,-1,-1): # add the history from the last pairs
            if d["answers"][j]["span_end"]!=-1: # excluding unanswerable questions
              single_example[1] = single_example[1] + sep_char + d["questions"][j]["input_text"]+ sep_char + d["answers"][j]["input_text"]
              
        XQA.append(single_example) # and append it
        YQA.append(d["answers"][i]["input_text"]) # add the answer
        story_source.append(d["source"]) # add the source
  return XQA, YQA, story_source

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data in train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
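
A minimal sketch of a dialogue-level split, using only the standard library, is given below. The helper name `split_dialogues` and the toy `dialogues` list are assumptions for illustration; a library splitter applied to the list of dialogues (rather than to the individual QA pairs) gives the same guarantee, although the resulting partition will differ for the same seed.

```python
import random


def split_dialogues(dialogues, train_fraction=0.8, seed=42):
    """Shuffle whole dialogues with a fixed seed and cut the list once."""
    rng = random.Random(seed)      # local generator, does not touch global state
    shuffled = list(dialogues)     # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: 10 dialogues with 3 questions each.
dialogues = [{"id": f"d{i}", "questions": [f"q{i}-{t}" for t in range(3)]}
             for i in range(10)]

train, val = split_dialogues(dialogues)
train_ids = {d["id"] for d in train}
val_ids = {d["id"] for d in val}

# Every QA pair of a dialogue lands in the same split.
print(len(train), len(val), train_ids.isdisjoint(val_ids))
```

Splitting before the QA pairs are flattened is what prevents questions from the same passage from leaking across splits.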

#### Reproducibility Memo

Check back tutorial 2 on how to fix a specific random seed for reproducibility!

We set the seed before splitting with the function previously defined.

In [None]:
seed = 42 
set_seed(seed)

In the following block data is split as required (80% train, 20% validation) and extracted with or without history. Some examples from each split are printed below.

In [None]:
# HISTORY FLAG
add_history=True

with open('coqa/train.json') as f:
  # loading the training json
  train_json = json.load(f)

with open('coqa/test.json') as f:
  # loading the test json
  test_json = json.load(f)

# splitting training data
train_data, val_data = train_test_split(train_json["data"],
                                        train_size=0.8,
                                        shuffle=True,
                                        random_state=seed)
# extracting X as list of pairs [Question, Passage] and Y as a list of strings (Answers) 
XQA_train, YQA_train, source_train = extract_data(train_data, add_history)
XQA_val, YQA_val, source_val = extract_data(val_data, add_history)
XQA_test, YQA_test, source_test = extract_data(test_json["data"], add_history)
del(train_json)
del(test_json)

print("Fourth training example:")
print(XQA_train[3])
print(YQA_train[3])
print(source_train[3])
print("Fourth validation example:")
print(XQA_val[3])
print(YQA_val[3])
print(source_val[3])
print("Fourth test example:")
print(XQA_test[3])
print(YQA_test[3])
print(source_test[3])

Fourth training example:
['When was the last one held?', 'TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis\'s upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation\'s first national elections since the country\'s independence in 1956. \n\n"It\'s a wonderful day. It\'s the first time we can choose our own representatives," said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could "get used to freedom and democracy." \n\nTunisia\'s election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political par

During the data exploration phase, we noticed a clearly wrong example in the dataset, which stood out because of its length. We decided to fix it before training the network.

In [None]:
## broken example fix:
print(XQA_train[61])
print(YQA_train[61])
YQA_train[61] = 'October'
print(YQA_train[61])

['what month?', 'Microsoft Word is a word processor developed by Microsoft. It was first released on October 25, 1983 under the name "Multi-Tool Word" for Xenix systems. Subsequent versions were later written for several other platforms including IBM PCs running DOS (1983), Apple Macintosh running Classic Mac OS (1985), AT&T Unix PC (1985), Atari ST (1988), OS/2 (1989), Microsoft Windows (1989), SCO Unix (1994), and macOS (2001). Commercial versions of Word are licensed as a standalone product or as a component of Microsoft Office, Windows RT or the discontinued Microsoft Works suite. Microsoft Word Viewer and Office Online are freeware editions of Word with limited features. \n\nIn 1981, Microsoft hired Charles Simonyi, the primary developer of Bravo, the first GUI word processor, which was developed at Xerox PARC. Simonyi started work on a word processor called "Multi-Tool Word" and soon hired Richard Brodie, a former Xerox intern, who became the primary software engineer. \n\nMicros

In the following block the two tokenizers (one for the encoder and one for the decoder) are created. The encoder tokenizer (<code>input_tokenizer</code>) is pre-trained, while the decoder tokenizer (<code>output_tokenizer</code>) is fit on the training set answers alone (with the addition of the tokens <code>\<start></code> and <code>\<end></code>). Moreover, to simplify generation, the decoder tokenizer lowercases the text and filters out punctuation, keeping the <code>'</code>, <code>\<</code> and <code>\></code> characters (the hyphen is also left out of the filter string).


In order to keep the computational cost low and allow training under Colab restrictions, we limited the maximum input length to 512 tokens (the maximum for the two BERT variants considered), truncating the passage string, and the maximum output length to 20 tokens, which is slightly above the $99^{th}$ percentile of the answer lengths. Both inputs and outputs are padded.
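
The output-side preparation can be approximated by the sketch below. It is not the Keras tokenizer: the helper name `prepare_answer`, the toy `word_index` and the choice of keeping the first `max_len` tokens when truncating are assumptions of this example.

```python
import re

# Same characters as the filter string used for the output tokenizer.
FILTER_CHARS = '!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n'


def prepare_answer(answer, word_index, max_len=20, oov_id=1, pad_id=0):
    """Lowercase, strip filtered characters, add <start>/<end>, map to ids and pad."""
    text = re.sub("[" + re.escape(FILTER_CHARS) + "]", "", answer.lower())
    tokens = ["<start>"] + text.split() + ["<end>"]
    ids = [word_index.get(token, oov_id) for token in tokens][:max_len]
    return ids + [pad_id] * (max_len - len(ids))   # right ("post") padding


# Hypothetical vocabulary: 0 is reserved for padding, 1 for unknown words.
word_index = {"<start>": 2, "<end>": 3, "white": 4}

known = prepare_answer("White.", word_index, max_len=6)
unknown = prepare_answer("Bright white!", word_index, max_len=6)
print(known, unknown)
```

Reserving id 0 for padding is what later allows the loss to mask padded positions by comparing the target with 0.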

In [None]:
## MODEL NAME
# choose the model name for the input tokenizer
#model_name = 'distilroberta-base'
model_name = 'prajjwal1/bert-tiny'

## FILTER
filter = '!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n'

def filter_string(x):
  """
  function required to apply filtering on a string in the same way it was applied by the tokenizer.
  it is mainly used when storing the dataset splits.
  """
  return re.sub('[' + filter + ']',"",x)
  
#create and fit the tokenizer
output_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filter, oov_token='<UNK>')
output_tokenizer.fit_on_texts(["<start> " + i + " <end>" for i in YQA_train])

# import the required tokenizer
input_tokenizer = AutoTokenizer.from_pretrained(model_name)

print("Max input output found: " + str(max([len(i) for i in output_tokenizer.texts_to_sequences(YQA_train)])))
print(np.argmax([len(i) for i in YQA_train]))
print(XQA_train[7529])
print(YQA_train[7529])

print("99° percentile of training set answer length:" + str(np.quantile([len(i) for i in output_tokenizer.texts_to_sequences(YQA_train)], 0.99)))

# actual percentile is 17, given that each string has the beginning and ending token
max_sequence_length = 20
max_input_length = 512

# this suffix is used when saving the pre-processed dataset split
dataset_suffix = "_hist" if add_history else ""

Max input output found: 83
7529
["What symptoms of addiction does Orzack's center list?", 'Caught in the Web A few months ago, it wasn\'t unusual for 47-year-old Carla Toebe to spend 15 hours per day online. She\'d wake up early, turn on her laptop and chat on Internet dating sites and instant-messaging programs - leaving her bed for only brief intervals. Her household bills piled up, along with the dishes and dirty laundry, but it took near-constant complaints from her four daughters before she realized she had a problem. "I was starting to feel like my whole world was falling apart - kind of slipping into a depression," said Carla. "I knew that if I didn\'t get off the dating sites, I\'d just keep going," detaching herself further from the outside world. Toebe\'s conclusion: She felt like she was "addicted" to the Internet. She\'s not alone. Concern about excessive Internet use isn\'t new. As far back as 1995, articles in medical journals and the establishment of a Pennsylvania treat

The following three cells create a HuggingFace <code>Dataset</code> for each split, once the model to be used has been chosen (tiny BERT and DistilRoBERTa have different input tokenizers) and once it has been decided whether the input also contains the history. The <code>Dataset</code> class allows managing big datasets by reading from disk only what is needed, when it is needed.

The test and validation splits have an additional column ("references") holding a dictionary with the keys "answers" and "id". The value of "answers" is another dictionary containing the text of the answer (lowercased and stripped of punctuation, as done by the tokenizer) and a placeholder "answer_start". The value of "id" is the index of the row in the dataset, cast to a string, which uniquely identifies the answer in the split. This organization is needed to use the SQuAD F1 implementation in HuggingFace's <code>evaluate</code> library.
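
To make the layout concrete, the sketch below builds references and predictions in that shape and scores them with a simplified token-overlap F1. The `token_f1` helper and the toy answers are assumptions for illustration; the actual SQuAD metric additionally normalizes the text (for instance by removing articles), so its scores can differ from the ones computed here.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Simplified token-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold answers, already lowercased and stripped of punctuation.
answers = ["white", "in a barn"]
references = [{"answers": {"text": [a], "answer_start": [42]}, "id": str(i)}
              for i, a in enumerate(answers)]
predictions = [{"prediction_text": "white", "id": "0"},
               {"prediction_text": "a barn", "id": "1"}]

# Match predictions and references through the string id, then average.
ref_by_id = {r["id"]: r for r in references}
scores = [max(token_f1(p["prediction_text"], text)
              for text in ref_by_id[p["id"]]["answers"]["text"])
          for p in predictions]
mean_f1 = 100 * sum(scores) / len(scores)
print(f"mean F1: {mean_f1:.1f}")
```

Lists shaped like `predictions` and `references` are what the metric's `compute(predictions=..., references=...)` call expects; the `answer_start` value is never used for scoring, which is why a placeholder is enough.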

In [None]:
# Generate training split dataset
# EXECUTE ONLY IF NEEDED

# create dictionary
train_ds = Dataset.from_dict({"xqa": XQA_train, 
                              "yqa": ["<start> " + i + " <end>" for i in YQA_train],
                              "source":source_train})
# tokenize input
train_ds = train_ds.map(lambda x: input_tokenizer(x["xqa"], 
                                                  return_tensors="tf", 
                                                  padding="max_length", 
                                                  truncation="longest_first", 
                                                  max_length=max_input_length), 
                        batched=True)
# tokenize output
train_ds = train_ds.map(lambda x: {"y_token": output_tokenizer.texts_to_sequences(x["yqa"])}, 
                        batched=True)
# pad output
train_ds = train_ds.map(lambda x: {"y_padded": tf.keras.preprocessing.sequence.pad_sequences(x["y_token"],
                                                                     padding='post',
                                                                     maxlen=max_sequence_length)},
                         batched=True)
# remove working columns
train_ds = train_ds.remove_columns(["xqa", "yqa", "y_token"])
# format tensors
train_ds = train_ds.with_format(type="tensorflow")

# save on drive
if model_name == 'prajjwal1/bert-tiny':
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds" + dataset_suffix)
else:
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds_rob" + dataset_suffix)

  0%|          | 0/86 [00:00<?, ?ba/s]

  0%|          | 0/86 [00:00<?, ?ba/s]

  0%|          | 0/86 [00:00<?, ?ba/s]

Saving the dataset (0/1 shards):   0%|          | 0/85807 [00:00<?, ? examples/s]

In [None]:
# Generate validation split dataset
# EXECUTE ONLY IF NEEDED

# create dictionary
val_ds = Dataset.from_dict({"xqa": XQA_val,
                            "yqa": [filter_string(i.lower()) for i in YQA_val],
                            "id_placeholder": list(range(len(YQA_val))),
                            "source":source_val})
# tokenize input
val_ds = val_ds.map(lambda x: input_tokenizer(x["xqa"],
                                              return_tensors="tf",
                                              padding="max_length",
                                              truncation="longest_first",
                                              max_length=max_input_length),
                     batched=True)
# add references
val_ds = val_ds.map(lambda x:{"references": 
                              {'answers':{'text':[x["yqa"]], 'answer_start': [42]},
                                'id': str(x["id_placeholder"]) } })

# remove working columns
val_ds = val_ds.remove_columns(["xqa","yqa", "id_placeholder"])

# save to drive
if model_name == 'prajjwal1/bert-tiny':  
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask","token_type_ids"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds" + dataset_suffix)
else:
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds_rob" + dataset_suffix)

  0%|          | 0/22 [00:00<?, ?ba/s]

  0%|          | 0/21479 [00:00<?, ?ex/s]

Saving the dataset (0/1 shards):   0%|          | 0/21479 [00:00<?, ? examples/s]

In [None]:
# Generate test split dataset
# EXECUTE ONLY IF NEEDED

# create dictionary
test_ds = Dataset.from_dict({"xqa": XQA_test,
                             "yqa": [filter_string(i.lower()) for i in YQA_test],
                             "id_placeholder": list(range(len(YQA_test))),
                             "source":source_test})
# tokenize input
test_ds = test_ds.map(lambda x: input_tokenizer(x["xqa"],
                                                return_tensors="tf",
                                                padding="max_length",
                                                truncation="longest_first",
                                                max_length=max_input_length),
                       batched=True)
# add references
test_ds = test_ds.map(lambda x:{"references":
                                {'answers':{'text':[x["yqa"]], 'answer_start': [42]},
                                  'id': str(x["id_placeholder"]) } })

# remove working columns
test_ds = test_ds.remove_columns(["xqa","yqa", "id_placeholder"])

# save to drive
if model_name == 'prajjwal1/bert-tiny':  
  test_ds = test_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask","token_type_ids"], output_all_columns=True)
  test_ds.save_to_disk("gdrive/MyDrive/ckpt/test_ds" + dataset_suffix)
else:
  test_ds = test_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask"], output_all_columns=True)
  test_ds.save_to_disk("gdrive/MyDrive/ckpt/test_ds_rob" + dataset_suffix)

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/7918 [00:00<?, ?ex/s]

Saving the dataset (0/1 shards):   0%|          | 0/7918 [00:00<?, ? examples/s]

## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

The following cell defines the model we propose for the task. It is a Bert2LSTM (seq2seq) model with attention, adapted from the tutorial [<code>tf_seq2seq_lstm.py</code>](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt). Unlike the tutorial, the input is handled by a frozen BERT encoder, which outputs the last-layer encoding of the input as well as a further processed encoding of the first token. The latter is discarded and substituted with two newly learned tensors, used as the initial hidden and cell states of the LSTM decoder. The first is obtained by applying a dense layer to the embedding of the first token; the second is obtained by applying 1D average pooling to the whole output and passing the result to a dense layer.

Also, the number of decoder units has been raised to 128 for the tiny BERT model and 256 for the DistilRoBERTa model (tiny BERT outputs a 128-dimensional tensor for each token, DistilRoBERTa a 768-dimensional one).

Finally, the model can generate answers either with a greedy sampling strategy or with beam search. The latter was tuned after some tests on the validation set by setting the number of beams to 3 and the penalty weight for long sequences to 1.5.
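
The difference between the two decoding strategies can be seen on a toy example. The sketch below is a generic illustration, not the decoder defined in this notebook: the `LOG_PROBS` table is made up, the next-token distribution depends only on the previous token, and the length penalty is omitted for brevity.

```python
import math

# Hypothetical next-token log-probabilities, keyed by the previous token.
LOG_PROBS = {
    "<start>": {"a": math.log(0.6), "white": math.log(0.4)},
    "a": {"kitten": math.log(0.5), "barn": math.log(0.5)},
    "white": {"<end>": math.log(0.9), "kitten": math.log(0.1)},
    "kitten": {"<end>": math.log(1.0)},
    "barn": {"<end>": math.log(1.0)},
}


def greedy_decode(max_len=5):
    """Always follow the single most likely next token."""
    tokens, score = ["<start>"], 0.0
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        token, log_prob = max(LOG_PROBS[tokens[-1]].items(), key=lambda kv: kv[1])
        tokens.append(token)
        score += log_prob
    return tokens, score


def beam_search(beam_width=3, max_len=5):
    """Keep the beam_width best partial sequences at every step."""
    beams = [(["<start>"], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "<end>":            # finished hypotheses are carried over
                candidates.append((tokens, score))
                continue
            for token, log_prob in LOG_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + log_prob))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]


greedy_tokens, greedy_score = greedy_decode()
beam_tokens, beam_score = beam_search()
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

Greedy decoding commits to the locally best first token and cannot recover, whereas the beam keeps the alternative alive and ends up with a more probable complete sequence.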

In [None]:
# ---------------- #
# MODEL DEFINITION #
# ---------------- #


# check if training can be performed on GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)


class MyTrainer(object):
    """
    Simple wrapper class to train the model

    train_op -> uses tf.GradientTape to compute the loss
    batch_fit -> receives a batch and performs forward-backward passes (gradient included) 
    """

    def __init__(self, encoder, decoder, max_length):
      """
      init function for the model. It requires the encoder and decoder instances, as well as
      the maximum allowed length for the output to be generated. It also sets the loss function
      and the optimizer
      :params:
        encoder: Encoder class instance
        decoder: Decoder class instance
        max_length: maximum allowed length for the output
      """
      self.encoder = encoder
      self.decoder = decoder
      self.max_length = max_length
      self.ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, 
                                                                reduction='none') # from_logits=True means that the loss expects raw,
                                                                                  # unnormalized scores and applies the softmax itself, so
                                                                                  # the decoder must not end with a softmax activation layer
                                                                                  # (it would be applied twice, squashing the values)
      self.optimizer = tf.keras.optimizers.Adam(learning_rate=1e-03)

    @tf.function
    def compute_loss(self, logits, target):
      """
      Function to compute the loss.
      :params:
        logits: values computed by the network
        target: labels to compare with logits
      """
      loss = self.ce(y_true=target, y_pred=logits)
      mask = tf.logical_not(tf.math.equal(target, 0))
      mask = tf.cast(mask, dtype=loss.dtype)
      loss *= mask # pointwise product
      return tf.reduce_mean(loss)

    @tf.function
    def train_op(self, inputs):
      """
      Function executing a single step of training.
      :param:
        inputs: dictionary containing the tokenized inputs ('input_ids'), the 
          attention mask ('attention_mask') and the token type indexes ('token_type_ids')
          if the model is tiny bert-based.
      """
      with tf.GradientTape() as tape:
            
          if self.encoder.use_token_type_ids:
            encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask'],
                                                                  'token_type_ids': inputs['token_type_ids']})
          else:
            encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask']})
          decoder_input = inputs['y_padded'][:, :-1]  # ignore <end>
          real_target = inputs['y_padded'][:, 1:]  # ignore <start>

          # setup in order to perform attention queries over the embedding space
          self.decoder.attention.setup_memory(encoder_output) 

          # decoder initialization, check build_initial_state for additional insights
          decoder_initial_state = self.decoder.build_initial_state(self.decoder.batch_size, [encoder_h, encoder_s])

          # the input is then passed to the initialized decoder and we obtain predictions
          # in rnn_output format because the model is BERT-embedding-sequence-sequence, so the
          # last layer is still a sequence of cells (an RNN)
          predicted = self.decoder({'input_ids': decoder_input,
                                      'initial_state': decoder_initial_state}).rnn_output

          # we compute the losses over the computed predictions
          loss = self.compute_loss(logits=predicted, target=real_target)
      # gradients of the loss computed for this minibatch considering trainable
      # parameters of encoder and decoder
      grads = tape.gradient(loss, self.encoder.trainable_variables + self.decoder.trainable_variables)
      return loss, grads

    @tf.function
    def batch_fit(self, inputs):
      """
      function executing a single step of training and applying the computed gradients to the weights.
      :param:
        inputs: dictionary containing the tokenized inputs ('input_ids'), the 
          attention mask ('attention_mask') and the token type indexes ('token_type_ids')
          if the model is tiny bert-based.
      """
      loss, grads = self.train_op(inputs=inputs)
      # applies gradients to the trainable variables using Adam
      self.optimizer.apply_gradients(zip(grads, self.encoder.trainable_variables + self.decoder.trainable_variables))
      return loss

    def generate(self, output_tokenizer, input_ids,token_type_ids, attention_mask=None):
      """
      function to generate (a batch of) answers with a greedy sampling technique
      :params:
        output_tokenizer: the tokenizer instance used to split and analyze output sequences
        input_ids: indexes of the input tokens
        token_type_ids: indexes indicating which part of the input a token belongs to (only tiny bert)
        attention_mask: indexes indicating which part of the input is padding
      """
      batch_size = input_ids.shape[0] # input_ids is the minibatch

      if self.encoder.use_token_type_ids:
        encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask,
                                                                  'token_type_ids': token_type_ids})
      else:
        encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask})
      # start tokens (one per batch element) and end token used by the sampler
      start_tokens = tf.fill([batch_size], output_tokenizer.word_index['<start>'])
      end_token = output_tokenizer.word_index['<end>']


      # We could not do this at training time, since the Sampler used at training
      # is not designed to project the token in an embedding space before computing
      # the next one. The aforementioned embedding space
      # is changing at each backpropagation step anyways, thus we stick with
      # the computation of the argmax of the logits using TrainingSampler.
      greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler() 

      # the decoder used for training (built with TrainingSampler) cannot be reused here,
      # thus we need to define an inference decoder each time we want to
      # generate the answers for a new batch
      decoder_instance = tfa.seq2seq.BasicDecoder(cell=self.decoder.wrapped_decoder_cell,
                                                    sampler=greedy_sampler,
                                                    output_layer=self.decoder.generation_dense,
                                                    maximum_iterations=self.max_length)
      
      self.decoder.attention.setup_memory(encoder_output)

      # decoder_initial_state is still an output of the encoder, we pass it to
      # the decoder_instance in order to get the outputs
      decoder_initial_state = self.decoder.build_initial_state(batch_size, [encoder_h, encoder_s])
        
      decoder_embedding_matrix = self.decoder.embedding.variables[0]
      outputs, _, _ = decoder_instance(decoder_embedding_matrix,
                                         start_tokens=start_tokens,
                                         end_token=end_token,
                                         initial_state=decoder_initial_state)
      return outputs

    def translate(self, generated, output_tokenizer):
      """
      function to translate the sequence of indexes produced by the greedy decoder
      in sentences.
      :params:
        generated: generated sequences of token indexes
        output_tokenizer: the tokenizer instance used to split and analyze output sequences
      """
      return output_tokenizer.sequences_to_texts(generated.sample_id.numpy())

    def beam_translate(self, generated, output_tokenizer):
      """
      function to translate the sequence of indexes produced by the beam search decoder
      in sentences. it recovers only the best scoring sequence
      :params:
        generated: generated sequences of token indexes
        output_tokenizer: the tokenizer instance used to split and analyze output sequences
      """
      return output_tokenizer.sequences_to_texts(generated[0][:,0,:])

    def beam_generate(self, output_tokenizer, input_ids,token_type_ids, attention_mask=None, beam_width=3, length_penalty=1.5):
      """
      function to generate (a batch of) answers with a beam search technique, allowing to control the number
      of beams and the length penalty used to reweight the sequences
      :params:
        output_tokenizer: the tokenizer instance used to split and analyze output sequences
        input_ids: indexes of the input tokens
        token_type_ids: indexes indicating which part of the input a token belongs to (only tiny bert)
        attention_mask: indexes indicating which part of the input is padding
        beam_width: number of sequences to consider at each step (default=3)
        length_penalty: length penalty parameter used to reweight the different sequences generated (default=1.5)
      """
      batch_size = input_ids.shape[0] # input_ids is the minibatch

      if self.encoder.use_token_type_ids:
        encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask,
                                                                  'token_type_ids': token_type_ids})
      else:
        encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask})
      # start tokens (one per batch element) and end token used by the beam search decoder
      start_tokens = tf.fill([batch_size], output_tokenizer.word_index['<start>'])
      end_token = output_tokenizer.word_index['<end>']
        
      # From official documentation:
      # NOTE If you are using the BeamSearchDecoder with a cell wrapped in AttentionWrapper, then you must ensure that:
      # The encoder output has been tiled to beam_width via tfa.seq2seq.tile_batch (NOT tf.tile).
      # The batch_size argument passed to the get_initial_state method of this wrapper is equal to true_batch_size * beam_width.
      # The initial state created with get_initial_state above contains a cell_state value containing properly tiled final state from the encoder.

      encoder_output = tfa.seq2seq.tile_batch(encoder_output, multiplier=beam_width)
      self.decoder.attention.setup_memory(encoder_output)

      # set decoder_inital_state which is an AttentionWrapperState considering beam_width
      hidden_state = tfa.seq2seq.tile_batch([encoder_h, encoder_s], multiplier=beam_width)
      decoder_initial_state = self.decoder.build_initial_state(beam_width*batch_size, hidden_state)

      # Instantiate BeamSearchDecoder
      decoder_instance = tfa.seq2seq.BeamSearchDecoder(self.decoder.wrapped_decoder_cell,
                                                          beam_width=beam_width,
                                                          output_layer=self.decoder.generation_dense,
                                                          length_penalty_weight=length_penalty,
                                                          maximum_iterations=self.max_length)
      decoder_embedding_matrix = self.decoder.embedding.variables[0]

      # The BeamSearchDecoder object's call() function takes care of everything.
      outputs, final_state, sequence_lengths = decoder_instance(decoder_embedding_matrix, 
                                                                  start_tokens=start_tokens,
                                                                  end_token=end_token,
                                                                  initial_state=decoder_initial_state)
      # outputs is a tfa.seq2seq.FinalBeamSearchDecoderOutput object.
      # The final beam predictions are stored in outputs.predicted_ids
      # outputs.beam_search_decoder_output is a tfa.seq2seq.BeamSearchDecoderOutput object which keeps track of beam scores and parent_ids at each beam decoding step
      # final_state is a tfa.seq2seq.BeamSearchDecoderState object.
      # sequence_lengths has shape (inference_batch_size, beam_width) and holds the length of each generated beam


      # outputs.predicted_ids.shape = (inference_batch_size, time_step_outputs, beam_width)
      # outputs.beam_search_decoder_output.scores.shape = (inference_batch_size, time_step_outputs, beam_width)
      # Convert the shape of outputs and beam_scores to (inference_batch_size, beam_width, time_step_outputs)
      final_outputs = tf.transpose(outputs.predicted_ids, perm=(0,2,1))
      beam_scores = tf.transpose(outputs.beam_search_decoder_output.scores, perm=(0,2,1))

      return final_outputs.numpy(), beam_scores.numpy()

      


class Encoder(tf.keras.Model):
    """
    Wrapper class for the Bert Encoder
    """
    def __init__(self, model_name, decoder_units):
      """
      Constructor method for the encoder. It requires the string indicating the model
      to be loaded and the number of decoder units used in the decoder, in order
      to correctly resize its outputs.
      :params:
        model_name: string containing the name of the bert encoder to load
        decoder_units: number of RNN decoder units used in the decoder
      """    
      super(Encoder, self).__init__()
      self.model = TFAutoModel.from_pretrained(model_name, from_pt=True, trainable=False)
      self.model.trainable=False # the encoder model is frozen
      self.reducer = tf.keras.layers.Dense(decoder_units) # reducer used to obtain the hidden state
      self.reducer2 = tf.keras.layers.Dense(decoder_units) # reducer used to obtain the cell state
      self.avg_pool = tf.keras.layers.AveragePooling1D(pool_size = 512) # used to create the cell state
      self.use_token_type_ids = model_name=='prajjwal1/bert-tiny' # flag used to see if token type ids must be considered.

    def call(self, inputs, training=False, **kwargs):
      """
      call function of the encoder model. 
      It takes the inputs, passes them to the bert encoder, then it takes the output of the last layer.
      From there, it takes the encoding of the first token and passes it to the "reducer" layer 
      to obtain the hidden state encoding; then
      it takes all the token encodings and passes them through an average pooling and the "reducer2" layer to obtain 
      the cell state encoding.
      :params:
        inputs: inputs to the model
        training: boolean flag required by the call function
      """
      model_output = self.model(inputs)
        
      # all_outputs has shape (batch_size * 512 * 128/768)
      all_outputs = model_output[0] # output of the last layer of the model
        
      # cls coding
      hidden_pooled = all_outputs[:, 0, :]
      cell_state = self.avg_pool(all_outputs)
      cell_state = tf.reshape(cell_state, [all_outputs.shape[0], all_outputs.shape[2]])
        
      # pooled output has shape (batch_size * 128/256)
      hidden_state = self.reducer(hidden_pooled)
      cell_state = self.reducer2(cell_state)

      return all_outputs, hidden_state, cell_state


class Decoder(tf.keras.Model):
    """
    Wrapper class for the LSTM decoder
    """
    def __init__(self, vocab_size, max_sequence_length, embedding_dim, decoder_units, batch_size):
      """
      Constructor method for the decoder. It requires the vocabulary size to create the last
      layer, the maximum length of a sequence to be generated, the dimension of the decoder
      embeddings, the number of decoder units and the batch_size. It also sets up the attention mechanism, 
      a basic decoder and the training sampler.
      :param:
        vocab_size: number of different tokens in the output vocabulary
        max_sequence_length: maximum length to be considered when generating an answer
        embedding_dim: dimension of the embedding vectors in the decoder
        decoder_units: number of units of the LSTM cell used for decoding
        batch_size: dimension of the batch to be considered during training
      """
      super(Decoder, self).__init__()
      self.max_sequence_length = max_sequence_length
      self.batch_size = batch_size
      self.decoder_units = decoder_units
      self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                                 output_dim=embedding_dim)
      self.decoder_lstm_cell = tf.keras.layers.LSTMCell(self.decoder_units)
      self.attention = tfa.seq2seq.BahdanauAttention(units=self.decoder_units,
                                                       memory=None,
                                                       memory_sequence_length=self.batch_size * [max_sequence_length])

      self.wrapped_decoder_cell = tfa.seq2seq.AttentionWrapper(self.decoder_lstm_cell,
                                                               self.attention,
                                                               attention_layer_size=self.decoder_units) 
      # dense layer needed to generate the distribution values over 
      # the size of the vocabulary (probability for each word)
      self.generation_dense = tf.keras.layers.Dense(vocab_size)

      self.sampler = tfa.seq2seq.sampler.TrainingSampler()
      self.decoder = tfa.seq2seq.BasicDecoder(self.wrapped_decoder_cell,
                                              sampler=self.sampler,
                                              output_layer=self.generation_dense)

    def build_initial_state(self, batch_size, encoder_state):
      """
      function used to build the initial state of the decoder.
      after initializing the tensors within the attention wrapper to 0 we replace
      its cell state with the states derived from the encoder,
      which are passed as encoder_state.
      :params:
        batch_size: size of the batch of data that is being used
        encoder_state: list [hidden_state, cell_state] computed by the encoder
      """
      initial_state = self.wrapped_decoder_cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)
      initial_state = initial_state.clone(cell_state=encoder_state) 
      return initial_state

    def call(self, inputs, training=False, **kwargs):
      """
      call function of the decoder. 
      :params:
        inputs: it is a dictionary with entries: 
          - "input_ids" : token ids of the decoder input (the target sequence without its last token)
          - "initial_state" : result of build_initial_state
        training: boolean flag required by the call function
      """
      input_ids = inputs['input_ids']
      input_emb = self.embedding(input_ids)
      decoder_output, _, _ = self.decoder(input_emb,
                                            initial_state=inputs['initial_state'],
                                            sequence_length=self.batch_size * [self.max_sequence_length - 1])
      return decoder_output
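
The padding mask applied in `compute_loss` can be illustrated without TensorFlow. The sketch below is only an approximation of that method: it assumes that `self.ce` returns one cross-entropy value per token computed from logits (its definition is not shown in this part of the notebook), and the helper names `token_cross_entropy` and `masked_mean_loss`, as well as the toy numbers, are made up for the example. As in the original, index 0 is treated as padding and the mean is taken over every position, padded ones included.

```python
import math

def token_cross_entropy(logits, target):
    """Cross-entropy of a single token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def masked_mean_loss(batch_logits, batch_targets, pad_id=0):
    """Zero the loss at padded positions, then average over all positions."""
    losses = []
    for seq_logits, seq_targets in zip(batch_logits, batch_targets):
        for logits, target in zip(seq_logits, seq_targets):
            mask = 0.0 if target == pad_id else 1.0  # assumption: pad_id is 0
            losses.append(token_cross_entropy(logits, target) * mask)
    return sum(losses) / len(losses)

# one sequence, three time steps, vocabulary of size 3; the last step is padding
logits = [[[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]]
targets = [[1, 2, 0]]
print(masked_mean_loss(logits, targets))
```

Because the denominator also counts the padded positions, batches with a lot of padding yield a smaller mean loss than a mean restricted to real tokens would.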




Apart from the tuning trials (changing the number of neurons, adjusting the beam search method, changing the decoder embedding dimension...), we experimented with different variations, which however either worsened the results or did not bring any remarkable improvement:
- Using the <code>GRUCell</code> instead of the <code>LSTMCell</code>.
- Increasing the number of layers in the decoder with the <code>StackedRNN</code>.
- Using Luong attention instead of Bahdanau attention.
- Adding a time-distributed layer over the encoder output.
- Adding word-level POS tags to the input as additional information.
- Training the encoder (either for all three epochs or just for the last two).
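
Since the beam search is among the tuned components, and the previous cell defines both `generate` (greedy) and `beam_generate`, a toy comparison of the two strategies may help. This is a minimal sketch in plain Python, not the `tfa.seq2seq.BeamSearchDecoder` used in the notebook: the transition table `NEXT` and its probabilities are invented for the example, and no length penalty is applied.

```python
import math

# Toy next-token probabilities keyed by the previous token (hypothetical numbers).
NEXT = {
    "<start>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"<end>": 1.0},
    "y": {"<end>": 1.0},
}

def beam_search(beam_width, max_steps=5):
    """Keep the beam_width best partial sequences by cumulative log-probability."""
    beams = [(["<start>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "<end>":
                candidates.append((tokens, score))  # finished beams are carried over
                continue
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens[-1] == "<end>" for tokens, _ in beams):
            break
    return beams[0][0]

print(beam_search(beam_width=1))  # greedy decoding is the special case of width 1
print(beam_search(beam_width=2))
```

With width 1 the search commits to `a`, the locally best first token, and ends with a sequence of probability 0.33; with width 2 it keeps `b` alive and finds the sequence `b x` of probability 0.36.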

In the following cells, utility functions for testing and training the model are created.

In [None]:
def train_loop(trainer, dataset, epochs, batch_size, ckpt_manager):
  """
  Function executing a full training loop with the model. It automatically manages
  batches, but it discards the elements in the last batch if its dimension is different
  from batch_size. At the end of each epoch it saves a checkpoint of the model and prints
  the current mean loss
  :params:
    trainer: MyTrainer class instance to use for training
    dataset: split of the dataset to be used for training
    epochs: number of epochs to execute during training
    batch_size: dimension of the batches to be used
    ckpt_manager: instance of CheckpointManager, to incrementally save a checkpoint at each epoch
  """
  steps_per_epoch = len(dataset)//batch_size
  
  for epoch in tqdm(range(epochs)):
    cumulative_loss = 0

    for batch_index in tqdm(range(steps_per_epoch), position=0, leave=True):
      loss = trainer.batch_fit(dataset[batch_index*batch_size:batch_index*batch_size+batch_size])
      cumulative_loss += loss

    ckpt_manager.save()
    mean_loss = cumulative_loss / steps_per_epoch  # batch_index would be off by one
    print(f"Current mean {mean_loss}")


def predict_loop(trainer, dataset, inference_batch_size,model_name,output_tokenizer, beam_search=False):
  """
  Function executing a prediction loop over a given dataset. It automatically
  manages batches, without discarding the last batch. It also allows to choose between
  beam search and greedy decoding. The output is formatted so that it can directly be used
  to compute the SQUAD-F1 score with the evaluate package function.
  :params:
    trainer: MyTrainer class instance to use for prediction
    dataset: split of the dataset to be used for prediction
    inference_batch_size: dimension of the batch considered during inference
    model_name: name of the model used for the encoder. It allows to use
      token type ids when available
    output_tokenizer: tokenizer used to tokenize the answers for the decoder
    beam_search: boolean indicating whether to use (True) or not the beam_search
      (default=False)
  """
  ttids=None # ttids stands for "token type ids"
  
  if beam_search: # here we discriminate between the translate/generate function to use
    generation_func = trainer.beam_generate
    translation_func = trainer.beam_translate
  else:
    generation_func = trainer.generate
    translation_func = trainer.translate
  
  inference_step = len(dataset) // inference_batch_size
  predictions = []
  for step_index in tqdm(range(inference_step)):
    starting_index = step_index*inference_batch_size # for the batch
    ending_index = step_index*inference_batch_size + inference_batch_size # for the batch

    if model_name == 'prajjwal1/bert-tiny':  # if the ttids are available, they are set
      ttids = dataset["token_type_ids"][starting_index : ending_index]

    generated = generation_func(output_tokenizer=output_tokenizer, 
                                  input_ids=dataset["input_ids"][starting_index : ending_index],
                                  token_type_ids=ttids,
                                  attention_mask=dataset["attention_mask"][starting_index : ending_index])
    
    translated = translation_func(generated, output_tokenizer=output_tokenizer)

    #this transformation on indexes is needed in order to have coherent ids in the field "id"
    list_to_add = [{'prediction_text': translated[i - starting_index].split("<end>")[0],
                    'id':str(i)} for i in range(starting_index, ending_index)]

    predictions.extend(list_to_add)

  # this part of the function replicates the previous part for the last (incomplete) batch;
  # it is skipped when the dataset size is a multiple of inference_batch_size
  last_start = inference_step*inference_batch_size
  if last_start < len(dataset):
    if model_name == 'prajjwal1/bert-tiny':  
      ttids = dataset["token_type_ids"][last_start :]
  
    generated = generation_func(output_tokenizer = output_tokenizer, 
                               input_ids=dataset["input_ids"][last_start :],
                               token_type_ids=ttids,
                               attention_mask=dataset["attention_mask"][last_start :])
    translated = translation_func(generated, output_tokenizer=output_tokenizer)

    predictions.extend([{'prediction_text': translated[i - last_start].split("<end>")[0], 
                      'id':str(i)} for i in range(last_start, len(dataset))])
  
  return predictions
  
def save_prediction(prediction, filename):
  """
  Function used to save the predictions as a pickle serialized file.
  :params:
    prediction: dictionary containing predictions to be serialized
    filename: name of the file to be used when saving
  """
  with open(filename, "wb") as f:
    pickle.dump(prediction, f)
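
The two loops above slice the dataset differently: `train_loop` drops the trailing incomplete batch, whereas `predict_loop` processes it separately so that every example receives a prediction. A minimal sketch of that index arithmetic follows; `batch_ranges` is a hypothetical helper that is not part of the notebook.

```python
def batch_ranges(n_examples, batch_size, drop_last=False):
    """Return the (start, end) index pairs that cover the dataset in order."""
    full_batches = n_examples // batch_size
    ranges = [(i * batch_size, (i + 1) * batch_size) for i in range(full_batches)]
    # the remainder forms a shorter final batch, unless it is explicitly dropped
    if not drop_last and n_examples % batch_size:
        ranges.append((full_batches * batch_size, n_examples))
    return ranges

print(batch_ranges(10, 4))                  # [(0, 4), (4, 8), (8, 10)]
print(batch_ranges(10, 4, drop_last=True))  # [(0, 4), (4, 8)]
```

The `drop_last=True` case corresponds to training, where the decoder is built for a fixed batch size; the default case corresponds to inference.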

In [None]:
# loading the squad metric from the evaluate package
squad_metric = load("squad")

def train_and_val(model_name,train_ds, val_ds, epochs, batch_size, decoder_units, max_sequence_length, output_tokenizer, pred_file_name, checkpoint_dir):
  """
  Function replicating the training and validation loop for each of the three seeds required.
  At the end of the training for each seed it saves the predictions made with both the greedy sampler and 
  the beam search decoder. Moreover, at the end of the whole training process it prints the results
  for each seed and their mean, considering both types of answer generation.
  :params:
    model_name: name of the model to be used as encoder
    train_ds: training split of the dataset
    val_ds: validation split of the dataset
    epochs: number of epochs to be done for each seed
    batch_size: dimension of the batch during training
    decoder_units: number of decoder units
    max_sequence_length: maximum length for an output sequence
    output_tokenizer: the tokenizer instance used to split and analyze output sequences
    pred_file_name: prefix for the filename used when saving predictions
    checkpoint_dir: directory where checkpoints and predictions will be saved

  """
  INF_BS = 64 #Inference batch_size
  results = []
  results_beam = []
  for train_seed in [42,1337,2022]:
    set_seed(train_seed) # setting the seed

    # creation of the model and util classes
    encoder = Encoder(model_name=model_name,
                          decoder_units=decoder_units)
      
    decoder = Decoder(vocab_size=len(output_tokenizer.word_index) + 1,
                          embedding_dim=100,
                          decoder_units=decoder_units,
                          batch_size=batch_size,
                          max_sequence_length=max_sequence_length)
    trainer = MyTrainer(encoder=encoder,
                          decoder=decoder,
                          max_length=max_sequence_length)
    
    checkpoint = tf.train.Checkpoint(optimizer=trainer.optimizer,
                                  encoder=encoder,
                                  decoder=decoder)
    manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir + f"/{train_seed}", max_to_keep=1)

    # training
    train_loop(trainer, train_ds, epochs, batch_size, manager)

    # prediction
    prediction = predict_loop(trainer, val_ds, INF_BS, model_name,output_tokenizer)
    save_prediction(prediction, checkpoint_dir + pred_file_name + "_" + str(train_seed) + "_pred.pickle")

    prediction_beam = predict_loop(trainer, val_ds, INF_BS, model_name,output_tokenizer, beam_search=True)
    save_prediction(prediction_beam, checkpoint_dir + pred_file_name + "_" + str(train_seed) + "_beampred.pickle")

    results.append(squad_metric.compute(predictions=prediction, references=val_ds['references']))
    results_beam.append(squad_metric.compute(predictions=prediction_beam, references=val_ds['references']))

    # garbage collector directives to avoid memory cluttering
    del(manager)
    del(checkpoint)
    del(trainer)
    del(encoder)
    del(decoder) 

  # printing results
  print("***VALIDATION RESULTS***")
  print(results)
  print(results_beam)
  print(f"greedy exact match:{sum([res['exact_match'] for res in results])/len(results)}" )
  print(f"greedy SQUAD-F1:{sum([res['f1'] for res in results])/len(results)}" )
  print(f"beam exact match:{sum([res['exact_match'] for res in results_beam])/len(results_beam)}" )
  print(f"beam SQUAD-F1:{sum([res['f1'] for res in results_beam])/len(results_beam)}" )

def test_model(model_name, test_ds, batch_size, decoder_units, max_sequence_length, output_tokenizer, pred_file_name, checkpoint_dir, pred_file_suffix = ["_testpred.pickle", "_testbeampred.pickle"]):
  """
  Function allowing the test of a model for each of the three seeds requested.
  The models must be already trained. At the end it prints the results for each
  seed and their mean, both for greedy sampling and beam search decoding.
  :params:
    model_name: name of the model to be used as encoder
    test_ds: test split of the dataset
    batch_size: dimension of the batch during training
    decoder_units: number of decoder units
    max_sequence_length: maximum length for an output sequence
    output_tokenizer: the tokenizer instance used to split and analyze output sequences
    pred_file_name: prefix for the filename used when saving predictions
    checkpoint_dir: directory where checkpoints and predictions will be saved
    pred_file_suffix: list of two strings to be used when saving the predictions
      for a model. The first string is used for greedy sampling generation and the second
      for beam search.
  """
  INF_BS = 64 #Inference batch_size
  
  
  results = []
  results_beam = []
  for train_seed in [42,1337,2022]:
    # creation of the model and util classes
    encoder = Encoder(model_name=model_name,
                        decoder_units=decoder_units)
      
    decoder = Decoder(vocab_size=len(output_tokenizer.word_index) + 1,
                          embedding_dim=100,
                          decoder_units=decoder_units,
                          batch_size=batch_size,
                          max_sequence_length=max_sequence_length)
  
    trainer = MyTrainer(encoder=encoder,
                          decoder=decoder,
                          max_length=max_sequence_length)
    
    checkpoint = tf.train.Checkpoint(optimizer=trainer.optimizer,
                                  encoder=encoder,
                                  decoder=decoder)
    
  
    # required step in order to load correctly the decoder embedding matrix
    decoder.embedding.build(input_shape=None)
    checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir+f"/{train_seed}")).expect_partial()

    prediction = predict_loop(trainer, test_ds, INF_BS, model_name, output_tokenizer)
    save_prediction(prediction, checkpoint_dir + pred_file_name + "_" + str(train_seed) +  pred_file_suffix[0])

    prediction_beam = predict_loop(trainer, test_ds, INF_BS, model_name ,output_tokenizer, beam_search=True)
    save_prediction(prediction_beam, checkpoint_dir + pred_file_name + "_" + str(train_seed) + pred_file_suffix[1])

    results.append(squad_metric.compute(predictions=prediction, references=test_ds['references']))
    results_beam.append(squad_metric.compute(predictions=prediction_beam, references=test_ds['references']))

  print("***TEST RESULTS***")
  print(results)
  print(results_beam)
  print(f"greedy exact match:{sum([res['exact_match'] for res in results])/len(results)}" )
  print(f"greedy SQUAD-F1:{sum([res['f1'] for res in results])/len(results)}" )
  print(f"beam exact match:{sum([res['exact_match'] for res in results_beam])/len(results_beam)}" )
  print(f"beam SQUAD-F1:{sum([res['f1'] for res in results_beam])/len(results_beam)}" )
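
To make the reported numbers easier to read, the sketch below shows how a SQuAD-style token-level F1 can be computed and how per-seed results are averaged. It is a simplified approximation written for illustration, not the code of the `squad` metric loaded above: the helpers `normalize`, `token_f1` and `mean_over_seeds` are hypothetical, and scores are returned in the range 0-1, whereas the values printed by the notebook appear to be on a 0-100 scale.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_over_seeds(results, key):
    """Average one metric over the list of per-seed result dictionaries."""
    return sum(res[key] for res in results) / len(results)

print(token_f1("the white cat", "a white dog"))
print(mean_over_seeds([{"f1": 10.0}, {"f1": 20.0}], "f1"))
```

In the first call only `white` is shared between the two normalized answers, so precision and recall are both 0.5.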


## [Task 4] Question generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.
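
As a rough illustration of this formulation, each dialogue turn becomes an independent example that shares the passage. The sketch below uses made-up data and hypothetical field names (`question`, `passage`, `answer`); it does not reproduce the preprocessing of the notebook.

```python
def build_examples(passage, turns):
    """One example per turn: the input is (Q_i, P) and the target is A_i."""
    return [{"question": question, "passage": passage, "answer": answer}
            for question, answer in turns]

# toy dialogue, invented for the example
passage = "Anna has a red kite."
turns = [("What does Anna have?", "a kite"),
         ("What color is it?", "red")]

for example in build_examples(passage, turns):
    print(example["question"], "->", example["answer"])
```

The second question ("What color is it?") can only be resolved by looking at the previous turn, which is exactly the information that a model without the history $H$ does not receive.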

In the following cells, training and validation are set up using the previously defined functions. Before training, some hyperparameters are set:
- Batch size to 14. This is a small batch size that allows training under Colab restrictions while retaining as many examples as possible (only 1 is discarded).
- Epochs to 3. This is one of the requirements of the assignment.
- Maximum output sequence length to 20, as previously explained.
- Number of decoder units for tiny bert to 128. Other tested configurations: 32, 64, 256.
- Number of decoder units for distilroberta to 256. Other tested configurations: 64, 128, 512.
 
_Note_: **We re-computed the validation results because we later tweaked the beam search parameters.**
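
The effect of the batch size on the number of discarded examples can be checked with a few lines of arithmetic. Note that the training set size used below (85807) is not stated in the notebook: it is inferred from the 6129 steps per epoch shown in the training logs together with the single discarded example mentioned above, and should be read as an assumption. The helper `epoch_stats` is hypothetical.

```python
def epoch_stats(n_examples, batch_size):
    """Return (steps per epoch, examples discarded with the last partial batch)."""
    steps = n_examples // batch_size
    return steps, n_examples - steps * batch_size

N_TRAIN = 85807  # inferred value, see the assumption stated above

for candidate in (8, 14, 16, 32):
    steps, discarded = epoch_stats(N_TRAIN, candidate)
    print(f"batch size {candidate}: {steps} steps, {discarded} discarded")
```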

In [None]:
BATCH_SIZE = 14
EPOCHS = 3
MAX_SEQUENCE_LENGTH = 20
TINY_DEC_UNITS = 128
ROB_DEC_UNITS = 256

In [None]:
# BERT TINY NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny'


train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds")

train_and_val('prajjwal1/bert-tiny',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=TINY_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='tiny',
              checkpoint_dir=checkpoint_dir)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.embeddings.position_ids', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the 

Current mean 1.064605712890625


100%|██████████| 6129/6129 [11:14<00:00,  9.08it/s]
 67%|██████▋   | 2/3 [22:34<11:16, 676.95s/it]

Current mean 0.8742226958274841


100%|██████████| 6129/6129 [11:13<00:00,  9.10it/s]
100%|██████████| 3/3 [33:48<00:00, 676.17s/it]


Current mean 0.7598474025726318


100%|██████████| 335/335 [02:32<00:00,  2.20it/s]
100%|██████████| 335/335 [04:44<00:00,  1.18it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.904651054518366, 'f1': 14.289189542326932}]
[{'exact_match': 12.123469435262349, 'f1': 14.27187375444946}]
greedy exact match:11.904651054518366
greedy SQUAD-F1:14.289189542326932
beam exact match:12.123469435262349
beam SQUAD-F1:14.27187375444946



Current mean 1.0730599164962769


100%|██████████| 6129/6129 [11:11<00:00,  9.12it/s]
 67%|██████▋   | 2/3 [22:35<11:16, 676.64s/it]

Current mean 0.8743070363998413


100%|██████████| 6129/6129 [11:12<00:00,  9.11it/s]
100%|██████████| 3/3 [33:48<00:00, 676.17s/it]


Current mean 0.7575758695602417


100%|██████████| 335/335 [02:29<00:00,  2.24it/s]
100%|██████████| 335/335 [04:39<00:00,  1.20it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.904651054518366, 'f1': 14.289189542326932}, {'exact_match': 11.960519577261511, 'f1': 14.549653043162628}]
[{'exact_match': 12.123469435262349, 'f1': 14.27187375444946}, {'exact_match': 12.197960798919874, 'f1': 14.561944356953724}]
greedy exact match:11.932585315889938
greedy SQUAD-F1:14.41942129274478
beam exact match:12.160715117091112
beam SQUAD-F1:14.416909055701591



Current mean 1.065598726272583


100%|██████████| 6129/6129 [11:08<00:00,  9.17it/s]
 67%|██████▋   | 2/3 [22:26<11:12, 672.46s/it]

Current mean 0.8747037649154663


100%|██████████| 6129/6129 [11:08<00:00,  9.16it/s]
100%|██████████| 3/3 [33:35<00:00, 671.89s/it]


Current mean 0.7593579292297363


100%|██████████| 335/335 [02:27<00:00,  2.28it/s]
100%|██████████| 335/335 [04:38<00:00,  1.20it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.904651054518366, 'f1': 14.289189542326932}, {'exact_match': 11.960519577261511, 'f1': 14.549653043162628}, {'exact_match': 11.904651054518366, 'f1': 14.643605437027999}]
[{'exact_match': 12.123469435262349, 'f1': 14.27187375444946}, {'exact_match': 12.197960798919874, 'f1': 14.561944356953724}, {'exact_match': 12.095535173890777, 'f1': 14.485326311156506}]
greedy exact match:11.923273895432748
greedy SQUAD-F1:14.494149340839186
beam exact match:12.138988469357665
beam SQUAD-F1:14.439714807519897


In [None]:
# ----------------------------------- #
# VALIDATION RESULTS TO BE CONSIDERED #
# ----------------------------------- #
# We tuned the beam search parameters, thus we need to redo the prediction
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny'

val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds")

test_model('prajjwal1/bert-tiny',
            val_ds,
            batch_size=BATCH_SIZE,
            decoder_units=TINY_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='tiny',
            checkpoint_dir=checkpoint_dir,
            pred_file_suffix=["_pred.pickle", "_beampred.pickle"])

100%|██████████| 335/335 [02:14<00:00,  2.49it/s]
100%|██████████| 335/335 [04:27<00:00,  1.25it/s]
100%|██████████| 335/335 [02:05<00:00,  2.66it/s]
100%|██████████| 335/335 [04:29<00:00,  1.24it/s]
100%|██████████| 335/335 [02:15<00:00,  2.46it/s]
100%|██████████| 335/335 [04:31<00:00,  1.24it/s]


***TEST RESULTS***
[{'exact_match': 11.904651054518366, 'f1': 14.289189542326932}, {'exact_match': 11.960519577261511, 'f1': 14.549653043162628}, {'exact_match': 11.904651054518366, 'f1': 14.643605437027999}]
[{'exact_match': 12.086223753433586, 'f1': 14.345524377010411}, {'exact_match': 12.188649378462685, 'f1': 14.766801701750634}, {'exact_match': 12.048978071604823, 'f1': 14.652631003217861}]
greedy exact match:11.923273895432748
greedy SQUAD-F1:14.494149340839186
beam exact match:12.107950401167031
beam SQUAD-F1:14.588319027326301


In [None]:
# DISTILROBERTA NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'


train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds_rob")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_rob")

train_and_val('distilroberta-base',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=ROB_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='rob',
              checkpoint_dir=checkpoint_dir)




100%|██████████| 6129/6129 [39:19<00:00,  2.60it/s]
 33%|███▎      | 1/3 [39:21<1:18:42, 2361.43s/it]

Current mean 1.0559253692626953


100%|██████████| 6129/6129 [39:09<00:00,  2.61it/s]
 67%|██████▋   | 2/3 [1:18:32<39:15, 2355.59s/it]

Current mean 0.8417209386825562


100%|██████████| 6129/6129 [39:08<00:00,  2.61it/s]
100%|██████████| 3/3 [1:57:43<00:00, 2354.53s/it]


Current mean 0.6954158544540405


100%|██████████| 335/335 [15:30<00:00,  2.78s/it]
TensorFlow Addons has compiled its custom ops against TensorFlow 2.11.0, and there are no compatibility guarantees between the two versions. 
This means that you might get segfaults when loading the custom op, or other kind of low-level errors.
 If you do, do not file an issue on Github. This is a known limitation.

It might help you to fallback to pure Python ops by setting environment variable `TF_ADDONS_PY_OPS=1` or using `tfa.options.disable_custom_kernel()` in your code. To do that, see https://github.com/tensorflow/addons#gpucpu-custom-ops 

You can also change the TensorFlow version installed on your system. You would need a TensorFlow version equal to or above 2.11.0 and strictly below 2.12.0.
 Note that nightly versions of TensorFlow, as well as non-pip TensorFlow like `conda install tensorflow` or compiled from source are not supported.

The last solution is to find the TensorFlow Addons version that has custom ops compatible 

***VALIDATION RESULTS***
[{'exact_match': 14.241817589273243, 'f1': 17.78899494962537}]
[{'exact_match': 14.497881651845988, 'f1': 17.887112794009507}]
greedy exact match:14.241817589273243
greedy SQUAD-F1:17.78899494962537
beam exact match:14.497881651845988
beam SQUAD-F1:17.887112794009507


In [None]:
# ----------------------------------- #
# VALIDATION RESULTS TO BE CONSIDERED #
# ----------------------------------- #
# Training with DistilRoBERTa was interrupted by Colab limits,
# so here we print the validation results.
# We also needed to tweak the beam search decoder.
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'

val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_rob")

test_model('distilroberta-base',
            val_ds,
            batch_size=BATCH_SIZE,
            decoder_units=ROB_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='rob',
            checkpoint_dir=checkpoint_dir,
            pred_file_suffix=["_pred.pickle", "_beampred.pickle"])




100%|██████████| 335/335 [09:07<00:00,  1.63s/it]

***TEST RESULTS***
[{'exact_match': 13.995064947157688, 'f1': 17.371109268490425}, {'exact_match': 13.627263839098655, 'f1': 17.008573844493572}, {'exact_match': 14.241817589273243, 'f1': 17.78899494962537}]
[{'exact_match': 14.227850458587458, 'f1': 17.520259280371043}, {'exact_match': 13.841426509614042, 'f1': 17.019186607021123}, {'exact_match': 14.442013129102845, 'f1': 17.962340146201353}]
greedy exact match:13.954715458509861
greedy SQUAD-F1:17.389559354203122
beam exact match:14.170430032434782
beam SQUAD-F1:17.500595344531174


## [Task 5] Question generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.

The cells below are the same as the ones used without history, with small tweaks so that the dialogue history is also passed to the model. The hyperparameters are left unchanged.
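To make the formulation concrete, the sketch below shows one possible way of assembling the textual input for turn $i$ from $P$, $Q_i$ and $H$. The function name `build_history_input`, the `sep` separator string and the optional `max_turns` truncation are assumptions introduced for this example only; they do not necessarily match the preprocessing used to build our `*_hist` datasets.

```python
from typing import List, Optional, Tuple


def build_history_input(passage: str,
                        questions: List[str],
                        answers: List[str],
                        turn: int,
                        sep: str = " [SEP] ",
                        max_turns: Optional[int] = None) -> Tuple[str, str]:
    """Return (question_with_history, passage) for dialogue turn `turn`.

    H = {Q_0, A_0, ..., Q_{turn-1}, A_{turn-1}} is flattened in dialogue order
    and prepended to the current question Q_turn.
    """
    history = []
    for q, a in zip(questions[:turn], answers[:turn]):
        history.extend([q, a])
    if max_turns is not None:
        # keep only the most recent `max_turns` (Q, A) pairs (hypothetical option)
        history = history[-2 * max_turns:] if max_turns > 0 else []
    return sep.join(history + [questions[turn]]), passage


# Toy dialogue with placeholder strings
toy_questions = ["Q0", "Q1", "Q2"]
toy_answers = ["A0", "A1", "A2"]
text, passage = build_history_input("P", toy_questions, toy_answers, turn=2)
print(text)
```

At turn 0 the history is empty, so the input reduces to the one used for $f_\theta(P, Q)$. Truncating to the most recent turns is one way to stay within a fixed maximum sequence length, at the cost of dropping older context.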

In [None]:
# BERT TINY WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny/hist'


train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds_hist")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_hist")

train_and_val('prajjwal1/bert-tiny',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=TINY_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='tiny_hist',
              checkpoint_dir=checkpoint_dir)

100%|██████████| 6129/6129 [11:40<00:00,  8.75it/s]
 33%|███▎      | 1/3 [11:40<23:21, 700.90s/it]

Current mean 1.0670831203460693


100%|██████████| 6129/6129 [11:24<00:00,  8.95it/s]
 67%|██████▋   | 2/3 [23:06<11:31, 691.80s/it]

Current mean 0.8746263980865479


100%|██████████| 6129/6129 [11:24<00:00,  8.95it/s]
100%|██████████| 3/3 [34:31<00:00, 690.45s/it]


Current mean 0.759445309638977


100%|██████████| 335/335 [02:36<00:00,  2.14it/s]
100%|██████████| 335/335 [04:51<00:00,  1.15it/s]
100%|██████████| 6129/6129 [11:28<00:00,  8.90it/s]
 33%|███▎      | 1/3 [11:29<22:58, 689.14s/it]

Current mean 1.0788750648498535


100%|██████████| 6129/6129 [11:23<00:00,  8.97it/s]
 67%|██████▋   | 2/3 [22:53<11:26, 686.14s/it]

Current mean 0.8751189112663269


100%|██████████| 6129/6129 [11:23<00:00,  8.97it/s]
100%|██████████| 3/3 [34:17<00:00, 685.77s/it]


Current mean 0.7595735192298889


100%|██████████| 335/335 [02:33<00:00,  2.19it/s]
100%|██████████| 335/335 [04:47<00:00,  1.16it/s]
100%|██████████| 6129/6129 [11:25<00:00,  8.94it/s]
 33%|███▎      | 1/3 [11:25<22:51, 685.87s/it]

Current mean 1.0686047077178955


100%|██████████| 6129/6129 [11:22<00:00,  8.99it/s]
 67%|██████▋   | 2/3 [22:48<11:23, 683.97s/it]

Current mean 0.8764885663986206


100%|██████████| 6129/6129 [11:21<00:00,  8.99it/s]
100%|██████████| 3/3 [34:10<00:00, 683.57s/it]


Current mean 0.7604389190673828


100%|██████████| 335/335 [02:32<00:00,  2.20it/s]
100%|██████████| 335/335 [04:49<00:00,  1.16it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.913962474975557, 'f1': 14.435647357982468}, {'exact_match': 12.00242096931887, 'f1': 14.831662593488241}, {'exact_match': 11.81153684994646, 'f1': 14.194647495643768}]
[{'exact_match': 12.216583639834257, 'f1': 14.615724631732936}, {'exact_match': 12.309697844406164, 'f1': 14.890212870467654}, {'exact_match': 12.104846594347968, 'f1': 14.181929087624674}]
greedy exact match:11.909306764746963
greedy SQUAD-F1:14.487319149038157
beam exact match:12.21037602619613
beam SQUAD-F1:14.562622196608421


In [None]:
# ----------------------------------- #
# VALIDATION RESULTS TO BE CONSIDERED #
# ----------------------------------- #
# We tuned the beam search parameters, thus we need to redo the prediction
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny/hist'

val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_hist")

test_model('prajjwal1/bert-tiny',
            val_ds,
            batch_size=BATCH_SIZE,
            decoder_units=TINY_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='tiny_hist',
            checkpoint_dir=checkpoint_dir,
            pred_file_suffix=["_pred.pickle", "_beampred.pickle"])

100%|██████████| 335/335 [02:07<00:00,  2.63it/s]
100%|██████████| 335/335 [04:29<00:00,  1.24it/s]
100%|██████████| 335/335 [02:04<00:00,  2.69it/s]
100%|██████████| 335/335 [04:28<00:00,  1.25it/s]
100%|██████████| 335/335 [02:04<00:00,  2.70it/s]
100%|██████████| 335/335 [04:26<00:00,  1.26it/s]


***TEST RESULTS***
[{'exact_match': 11.913962474975557, 'f1': 14.435647357982468}, {'exact_match': 12.00242096931887, 'f1': 14.831662593488241}, {'exact_match': 11.81153684994646, 'f1': 14.194647495643768}]
[{'exact_match': 12.188649378462685, 'f1': 14.725357473942974}, {'exact_match': 12.291075003491782, 'f1': 15.05921752376658}, {'exact_match': 12.076912332976395, 'f1': 14.256192295950903}]
greedy exact match:11.909306764746963
greedy SQUAD-F1:14.487319149038157
beam exact match:12.18554557164362
beam SQUAD-F1:14.680255764553486


In [None]:
# DISTILROBERTA WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob/hist'

train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds_rob_hist")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_rob_hist")

train_and_val('distilroberta-base',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=ROB_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='rob_hist',
              checkpoint_dir=checkpoint_dir)

100%|██████████| 6129/6129 [41:01<00:00,  2.49it/s]
 33%|███▎      | 1/3 [41:03<1:22:07, 2463.72s/it]

Current mean 1.0596537590026855


100%|██████████| 6129/6129 [40:57<00:00,  2.49it/s]
 67%|██████▋   | 2/3 [1:22:02<41:01, 2461.09s/it]

Current mean 0.8442547917366028


100%|██████████| 6129/6129 [40:57<00:00,  2.49it/s]
100%|██████████| 3/3 [2:03:02<00:00, 2460.73s/it]


Current mean 0.6984569430351257


100%|██████████| 335/335 [08:57<00:00,  1.60s/it]
100%|██████████| 335/335 [11:07<00:00,  1.99s/it]


***VALIDATION RESULTS***
[{'exact_match': 13.841426509614042, 'f1': 17.282489916713295}]
[{'exact_match': 14.19526048698729, 'f1': 17.43092894424413}]
greedy exact match:13.841426509614042
greedy SQUAD-F1:17.282489916713295
beam exact match:14.19526048698729
beam SQUAD-F1:17.43092894424413


In [None]:
# ----------------------------------- #
# VALIDATION RESULTS TO BE CONSIDERED #
# ----------------------------------- #
# Training with DistilRoBERTa with history was interrupted by Colab limits,
# so here we print the validation results.
# We also needed to tweak the beam search.
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob/hist'

val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_rob_hist")

test_model('distilroberta-base',
            val_ds,
            batch_size=BATCH_SIZE,
            decoder_units=ROB_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='rob_hist',
            checkpoint_dir=checkpoint_dir,
            pred_file_suffix=["_pred.pickle", "_beampred.pickle"])

100%|██████████| 335/335 [09:12<00:00,  1.65s/it]
100%|██████████| 335/335 [11:24<00:00,  2.04s/it]
100%|██████████| 335/335 [09:19<00:00,  1.67s/it]
100%|██████████| 335/335 [11:24<00:00,  2.04s/it]
100%|██████████| 335/335 [09:14<00:00,  1.66s/it]
100%|██████████| 335/335 [11:23<00:00,  2.04s/it]


***TEST RESULTS***
[{'exact_match': 13.841426509614042, 'f1': 17.282489916713295}, {'exact_match': 14.041622049443642, 'f1': 17.43531794791082}, {'exact_match': 13.720378043670562, 'f1': 16.958132144994597}]
[{'exact_match': 14.148703384701337, 'f1': 17.54099227373074}, {'exact_match': 14.26509614041622, 'f1': 17.516095728513942}, {'exact_match': 13.892639322128591, 'f1': 17.059402514653605}]
greedy exact match:13.867808867576082
greedy SQUAD-F1:17.22531333653957
beam exact match:14.102146282415383
beam SQUAD-F1:17.372163505632763


## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metrics: SQUAD F1-score.$^2$
* Fine-tune each transformer-based model for **3 epochs**.
* Report evaluation SQUAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use ```allennlp``` python package for a quick implementation of SQUAD F1-score: ```from allennlp_models.rc.tools import squad```. 

The following four cells compute the SQUAD F1-score on the test set. In the end, we used the metric provided by the <code>evaluate</code> package, because <code>allennlp</code> gave us version compatibility issues.
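For reference, the sketch below illustrates what the SQUAD exact match and F1 metrics measure, following the usual SQuAD v1.1 conventions (lowercasing, removal of punctuation and English articles, token-level overlap). It is a plain Python approximation written for illustration, not the code of the <code>evaluate</code> package; the names `normalize_answer`, `exact_match`, `token_f1` and `squad_scores` are chosen for this example.

```python
import collections
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings coincide, 0.0 otherwise."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_scores(predictions, references):
    """predictions: list of str; references: list of lists of acceptable answers.

    Each example is scored against its best-matching reference, and the
    averages are reported on a 0-100 scale.
    """
    em = f1 = 0.0
    for pred, refs in zip(predictions, references):
        em += max(exact_match(pred, r) for r in refs)
        f1 += max(token_f1(pred, r) for r in refs)
    n = len(predictions)
    return {"exact_match": 100.0 * em / n, "f1": 100.0 * f1 / n}


print(squad_scores(["white cat", "no"], [["white dog"], ["no"]]))
```

Since the F1-score rewards partial token overlap while exact match does not, the F1 values reported in the cells below are always at least as high as the corresponding exact match values.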

In [None]:
# BERT TINY NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds")

test_model('prajjwal1/bert-tiny',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=TINY_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='tiny',
            checkpoint_dir=checkpoint_dir)

100%|██████████| 123/123 [00:36<00:00,  3.38it/s]
100%|██████████| 123/123 [01:23<00:00,  1.47it/s]
100%|██████████| 123/123 [00:32<00:00,  3.80it/s]
100%|██████████| 123/123 [01:23<00:00,  1.47it/s]
100%|██████████| 123/123 [00:32<00:00,  3.83it/s]
100%|██████████| 123/123 [01:24<00:00,  1.46it/s]


***TEST RESULTS***
[{'exact_match': 12.086385450871433, 'f1': 14.29309767458751}, {'exact_match': 12.136903258398586, 'f1': 14.5009777728258}, {'exact_match': 12.225309421571104, 'f1': 14.74764799280808}]
[{'exact_match': 12.237938873452892, 'f1': 14.208540419671078}, {'exact_match': 12.313715584743623, 'f1': 14.663591861382345}, {'exact_match': 12.301086132861833, 'f1': 14.786576170032186}]
greedy exact match:12.149532710280374
greedy SQUAD-F1:14.51390781340713
beam exact match:12.284246863686116
beam SQUAD-F1:14.552902817028537


In [None]:
# DISTILROBERTA NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob")

test_model('distilroberta-base',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=ROB_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='rob',
            checkpoint_dir=checkpoint_dir)

100%|██████████| 123/123 [03:13<00:00,  1.57s/it]
100%|██████████| 123/123 [03:58<00:00,  1.94s/it]
100%|██████████| 123/123 [03:13<00:00,  1.58s/it]
100%|██████████| 123/123 [03:58<00:00,  1.94s/it]
100%|██████████| 123/123 [03:14<00:00,  1.58s/it]
100%|██████████| 123/123 [03:58<00:00,  1.94s/it]


***TEST RESULTS***
[{'exact_match': 14.561758019701944, 'f1': 17.68935145499627}, {'exact_match': 14.170245011366507, 'f1': 17.32424936434831}, {'exact_match': 15.016418287446324, 'f1': 18.295156490867786}]
[{'exact_match': 14.839605961101288, 'f1': 17.945499231389874}, {'exact_match': 14.30916898206618, 'f1': 17.25142225187429}, {'exact_match': 15.218489517554938, 'f1': 18.448232837953014}]
greedy exact match:14.582807106171591
greedy SQUAD-F1:17.76958577007079
beam exact match:14.789088153574134
beam SQUAD-F1:17.88171810707239


In [None]:
# BERT TINY WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny/hist'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_hist")

test_model('prajjwal1/bert-tiny',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=TINY_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='tiny_hist',
            checkpoint_dir=checkpoint_dir)




100%|██████████| 123/123 [00:33<00:00,  3.70it/s]
100%|██████████| 123/123 [01:28<00:00,  1.39it/s]
100%|██████████| 123/123 [00:32<00:00,  3.76it/s]
100%|██████████| 123/123 [01:23<00:00,  1.47it/s]
100%|██████████| 123/123 [00:32<00:00,  3.80it/s]
100%|██████████| 123/123 [01:25<00:00,  1.44it/s]


***TEST RESULTS***
[{'exact_match': 12.124273806516797, 'f1': 14.392046008515898}, {'exact_match': 12.086385450871433, 'f1': 14.668835115967326}, {'exact_match': 12.073755998989643, 'f1': 14.23343061197177}]
[{'exact_match': 12.427380651679718, 'f1': 14.822903986602807}, {'exact_match': 12.414751199797928, 'f1': 14.988061621881704}, {'exact_match': 12.225309421571104, 'f1': 14.261585098102486}]
greedy exact match:12.09480508545929
greedy SQUAD-F1:14.431437245484998
beam exact match:12.355813757682917
beam SQUAD-F1:14.690850235528998


In [None]:
# DISTILROBERTA WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob/hist'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob_hist")

test_model('distilroberta-base',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=ROB_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='rob_hist',
            checkpoint_dir=checkpoint_dir)




100%|██████████| 123/123 [03:10<00:00,  1.55s/it]

***TEST RESULTS***
[{'exact_match': 14.587016923465521, 'f1': 17.751458741508817}, {'exact_match': 14.75119979792877, 'f1': 17.969615907436243}, {'exact_match': 14.776458701692347, 'f1': 17.758830650349083}]
[{'exact_match': 14.864864864864865, 'f1': 17.909845840523555}, {'exact_match': 14.90275322051023, 'f1': 17.98198616842104}, {'exact_match': 15.003788835564537, 'f1': 17.846164446972164}]
greedy exact match:14.704891807695546
greedy SQUAD-F1:17.826635099764715
beam exact match:14.923802306979878
beam SQUAD-F1:17.912665485305585


Looking at the F1-scores on both the validation set and the test set, we can highlight some trends:
- BERT-Tiny models consistently perform worse than their DistilRoBERTa counterparts (roughly 14.5 versus 17.2 to 17.9 F1). This follows from the fact that the latter has more capacity than the former.
- With beam search, history models perform slightly better than their counterparts without history in three of the four comparisons (the exception is DistilRoBERTa on the validation set), although the differences never exceed about 0.15 F1 points. A small gain is expected, as the history provides additional information on which the answer can be based.
- Beam search slightly improves on greedy decoding in every configuration, especially by removing nonsensical answers made of repeated words or blocks of words.
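The aggregate figures compared in these trends are means over the three seeds. As a small sanity check, the sketch below recomputes the mean for the BERT-Tiny model without history on the test set, using the per-seed scores printed in the corresponding test cell above, and also reports the sample standard deviation across seeds. The standard deviation is not part of our original output: it is added here only as a rough indication of how much the scores vary with the seed. The helper name `summarize` is an assumption for this example.

```python
from statistics import mean, stdev

# Per-seed scores copied from the "BERT TINY NO HISTORY" test output
greedy = [
    {"exact_match": 12.086385450871433, "f1": 14.29309767458751},
    {"exact_match": 12.136903258398586, "f1": 14.5009777728258},
    {"exact_match": 12.225309421571104, "f1": 14.74764799280808},
]
beam = [
    {"exact_match": 12.237938873452892, "f1": 14.208540419671078},
    {"exact_match": 12.313715584743623, "f1": 14.663591861382345},
    {"exact_match": 12.301086132861833, "f1": 14.786576170032186},
]


def summarize(per_seed):
    """Return {metric: (mean, sample standard deviation)} across seeds."""
    return {
        metric: (mean(run[metric] for run in per_seed),
                 stdev(run[metric] for run in per_seed))
        for metric in per_seed[0]
    }


for name, runs in [("greedy", greedy), ("beam", beam)]:
    for metric, (avg, std) in summarize(runs).items():
        print(f"{name} {metric}: mean={avg:.3f} std={std:.3f}")
```

When the spread across seeds is of the same order as the gap between two configurations, that gap should be read as a tendency rather than as a clear-cut improvement.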

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQUAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 

The following functions load the previously produced predictions and analyze them by source, in order to print either the 5 worst errors or the 5 best predictions. The F1-score considered for an answer is the average, across seeds, of the scores obtained by the **beam search predictions**. Finally, we also added the possibility to discard "Yes"/"No" questions, as most of them are easier to predict.

In [None]:
import pickle

import numpy as np
from evaluate import load
from tqdm import tqdm

def find_worst_errors(prediction_prefix, prediction_suffix, ref_dataset):
  """
  Function producing a dictionary where the keys are the different sources
  while the values are lists of pairs (prediction id, average F1-score across seeds).
  :params:
    prediction_prefix: prefix of the name of the file containing the predictions (up to the seed)
    prediction_suffix: suffix of the name of the file containing the predictions (after the seed)
    ref_dataset: dataset containing the references used to compute the F1-score
  """
  squad = load("squad")
  predictions = [] # list of lists containing the predictions from the different seeds

  for seed in [42,1337,2022]: # for each seed
    with open(prediction_prefix + str(seed) + prediction_suffix, "rb") as f:
      # load the respective predictions
      new_list = pickle.load(f)
      # sort predictions by id (to align them across seeds)
      new_list.sort(key=lambda x: int(x["id"]))
      predictions.append(new_list)

  # extract the sources
  categories = np.unique(ref_dataset["source"])

  source_dict = {cat:[] for cat in categories}
  refd = ref_dataset["references"] # load the column containing the references
  sources = ref_dataset["source"] # load the source column once, outside the loop
  for pred in tqdm(range(len(predictions[0]))):
    # ATTENTION: the following instructions are based on the assumption that
    # the id of each example is also its row index, as it follows
    # from our dataset construction
    pred_id = predictions[0][pred]["id"]
    ref = refd[int(pred_id)]

    if any(ref["id"] != seed_preds[pred]["id"] for seed_preds in predictions):
      # small check to ensure the correctness of the process
      print("error with ids: example " + str(ref["id"]))

    # mean F1-score of this example across the seeds
    f1 = sum(squad.compute(predictions=[seed_preds[pred]],
                           references=[ref])["f1"]
             for seed_preds in predictions) / len(predictions)
    # add to the list of the respective source the pair (id, mean F1-score)
    source_dict[sources[int(pred_id)]].append((pred_id, f1))

  return source_dict

def print_orderered_predictions(source_dict, prediction_prefix, prediction_suffix, ref_dataset, question_dataset, kind="worst", qty=5, skip_yn=False):
  """
  Function taking as input a dictionary as produced by find_worst_errors and printing a number of best or worst predictions by source.
  Note that the lists in source_dict are sorted in place.
  :params:
    source_dict: dictionary produced by find_worst_errors
    prediction_prefix: prefix of the name of the file containing the predictions (up to the seed)
    prediction_suffix: suffix of the name of the file containing the predictions (after the seed)
    ref_dataset: dataset containing the references to print together with the predictions
    question_dataset: dataset containing the questions and the passages
    kind: kind of report to print. If "worst" it prints the worst predictions, if "best" the best ones (default="worst")
    qty: maximum number of predictions to print per source (default=5)
    skip_yn: flag allowing to skip "yes/no" questions (default=False)
  """
  predictions = []  # list of lists containing the predictions from the different seeds
  refd = ref_dataset["references"]
  for seed in [42,1337,2022]:
    with open(prediction_prefix + str(seed) + prediction_suffix, "rb") as f:
      # load the predictions
      new_list = pickle.load(f)
      # sort them by id to align them
      new_list.sort(key=lambda x: int(x["id"]))
      predictions.append(new_list)

  for key in source_dict.keys(): # for each source
    # sort by mean F1-score: descending for "best", ascending otherwise
    source_dict[key].sort(key=lambda x: x[1], reverse=(kind == "best"))
    print("-------------" + key + "-------------")
    i = 0 # number of examples printed so far
    j = 0 # position in the sorted list
    while i<qty and j<len(source_dict[key]): # print qty or fewer questions + answers
      ex_id = source_dict[key][j][0]
      if skip_yn and refd[int(ex_id)]["answers"]["text"][0] in ["yes", "no"]:
        j = j + 1
      else:
        print("question + passage: " + str(question_dataset[int(ex_id)]))
        print("true answer: " + str(refd[int(ex_id)]["answers"]["text"]))
        print("answer with seed 42: " + predictions[0][int(ex_id)]["prediction_text"])
        print("answer with seed 1337: " + predictions[1][int(ex_id)]["prediction_text"])
        print("answer with seed 2022: " + predictions[2][int(ex_id)]["prediction_text"])
        print("f1: " + str(source_dict[key][j][1]))
        i = i + 1
        j = j + 1


In [None]:
# BERT TINY NO HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds")
source_dict_tnh = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/tinytiny_", "_testbeampred.pickle", test_ds)

100%|██████████| 7918/7918 [02:46<00:00, 47.42it/s]


In [None]:
print_orderered_predictions(source_dict_tnh, "/content/gdrive/MyDrive/ckpt/dom/tinytiny_", "_testbeampred.pickle", test_ds, XQA_test)

-------------cnn-------------
question + passage: ['Whom?', '(CNN) -- Dennis Farina, the dapper, mustachioed cop-turned-actor best known for his tough-as-nails work in such TV series as "Law & Order," "Crime Story," and "Miami Vice," has died. He was 69. \n\n"We are deeply saddened by the loss of a great actor and a wonderful man," said his publicist, Lori De Waal, in a statement Monday. "Dennis Farina was always warmhearted and professional, with a great sense of humor and passion for his profession. He will be greatly missed by his family, friends and colleagues." \n\nFarina, who had a long career as a police officer in Chicago, got into acting through director Michael Mann, who used him as a consultant and cast him in his 1981 movie, "Thief." That role led to others in such Mann-created shows as "Miami Vice" (in which Farina played a mobster) and "Crime Story" (in which he starred as Lt. Mike Torello). \n\nFarina also had roles, generally as either cops or gangsters, in a number of 

In [None]:
print_orderered_predictions(source_dict_tnh, "/content/gdrive/MyDrive/ckpt/dom/tinytiny_", "_testbeampred.pickle", test_ds, XQA_test, kind="best", skip_yn=True)

-------------cnn-------------
question + passage: ['What news agency does Tony work for?', 'Fort Lauderdale, Florida (CNN) -- Just taking a sip of water or walking to the bathroom is excruciatingly painful for 15-year-old Michael Brewer, who was burned over 65 percent of his body after being set on fire, allegedly by a group of teenagers. \n\n"It hurts my heart to see him in pain, but it enlightens at the same time to know my son is strong enough to make it through on a daily basis," his mother, Valerie Brewer, told CNN on Wednesday. \n\nBrewer and her husband, Michael Brewer, Sr., spoke to CNN\'s Tony Harris, a day after a 13-year-old boy who witnessed last month\'s attack publicly read a written statement: \n\n"I want to express my deepest sympathy to Mikey and his family," Jeremy Jarvis said. "I will pray for Mikey to grow stronger every day and for Mikey\'s speedy recovery." \n\nJarvis\' older brother has been charged in the October 12 attack in Deerfield Beach, Florida. \n\nWhen a

In [None]:
# DISTILROBERTA NO HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob")
source_dict_rnh = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/robrob_", "_testbeampred.pickle", test_ds)

100%|██████████| 7918/7918 [02:37<00:00, 50.14it/s]


In [None]:
print_orderered_predictions(source_dict_rnh, "/content/gdrive/MyDrive/ckpt/dom/robrob_", "_testbeampred.pickle", test_ds, XQA_test)

-------------cnn-------------
question + passage: ['Whom?', '(CNN) -- Dennis Farina, the dapper, mustachioed cop-turned-actor best known for his tough-as-nails work in such TV series as "Law & Order," "Crime Story," and "Miami Vice," has died. He was 69. \n\n"We are deeply saddened by the loss of a great actor and a wonderful man," said his publicist, Lori De Waal, in a statement Monday. "Dennis Farina was always warmhearted and professional, with a great sense of humor and passion for his profession. He will be greatly missed by his family, friends and colleagues." \n\nFarina, who had a long career as a police officer in Chicago, got into acting through director Michael Mann, who used him as a consultant and cast him in his 1981 movie, "Thief." That role led to others in such Mann-created shows as "Miami Vice" (in which Farina played a mobster) and "Crime Story" (in which he starred as Lt. Mike Torello). \n\nFarina also had roles, generally as either cops or gangsters, in a number of 

In [None]:
print_orderered_predictions(source_dict_rnh, "/content/gdrive/MyDrive/ckpt/dom/robrob_", "_testbeampred.pickle", test_ds, XQA_test, kind="best", skip_yn=True)

-------------cnn-------------
question + passage: ['Which country was she visiting?', 'NEW YORK (CNN) -- Natasha Richardson, a film star, Tony-winning stage actress and member of the famed Redgrave acting family, died Wednesday after suffering injuries in a ski accident, according to a family statement. She was 45. \n\nNatasha Richardson fell on a beginners\' slope in Canada. \n\nRichardson, wife of actor Liam Neeson, was injured Monday in a fall on a ski slope at a Quebec resort about 80 miles northwest of Montreal. \n\nRichardson\'s family released a statement saying, "Liam Neeson, his sons, and the entire family are shocked and devastated by the tragic death of their beloved Natasha. They are profoundly grateful for the support, love and prayers of everyone, and ask for privacy during this very difficult time." \n\nAccording to a statement from Mont Tremblant Ski Resort, Richardson fell during a lesson on a beginners\' trail. Watch a report on Richardson\'s life » \n\n"She did not s

In [None]:
# TINYBERT WITH HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_hist")
source_dict_th = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/tiny/histtiny_hist_", "_testbeampred.pickle", test_ds)

100%|██████████| 7918/7918 [02:47<00:00, 47.18it/s]


In [None]:
print_orderered_predictions(source_dict_th, "/content/gdrive/MyDrive/ckpt/dom/tiny/histtiny_hist_", "_testbeampred.pickle", test_ds, XQA_test)

-------------cnn-------------
question + passage: ['Whom?', '(CNN) -- Dennis Farina, the dapper, mustachioed cop-turned-actor best known for his tough-as-nails work in such TV series as "Law & Order," "Crime Story," and "Miami Vice," has died. He was 69. \n\n"We are deeply saddened by the loss of a great actor and a wonderful man," said his publicist, Lori De Waal, in a statement Monday. "Dennis Farina was always warmhearted and professional, with a great sense of humor and passion for his profession. He will be greatly missed by his family, friends and colleagues." \n\nFarina, who had a long career as a police officer in Chicago, got into acting through director Michael Mann, who used him as a consultant and cast him in his 1981 movie, "Thief." That role led to others in such Mann-created shows as "Miami Vice" (in which Farina played a mobster) and "Crime Story" (in which he starred as Lt. Mike Torello). \n\nFarina also had roles, generally as either cops or gangsters, in a number of 

In [None]:
print_orderered_predictions(source_dict_th, "/content/gdrive/MyDrive/ckpt/dom/tiny/histtiny_hist_", "_testbeampred.pickle", test_ds, XQA_test, kind="best", skip_yn=True)

-------------cnn-------------
question + passage: ['What news agency does Tony work for?', 'Fort Lauderdale, Florida (CNN) -- Just taking a sip of water or walking to the bathroom is excruciatingly painful for 15-year-old Michael Brewer, who was burned over 65 percent of his body after being set on fire, allegedly by a group of teenagers. \n\n"It hurts my heart to see him in pain, but it enlightens at the same time to know my son is strong enough to make it through on a daily basis," his mother, Valerie Brewer, told CNN on Wednesday. \n\nBrewer and her husband, Michael Brewer, Sr., spoke to CNN\'s Tony Harris, a day after a 13-year-old boy who witnessed last month\'s attack publicly read a written statement: \n\n"I want to express my deepest sympathy to Mikey and his family," Jeremy Jarvis said. "I will pray for Mikey to grow stronger every day and for Mikey\'s speedy recovery." \n\nJarvis\' older brother has been charged in the October 12 attack in Deerfield Beach, Florida. \n\nWhen a

In [None]:
# DISTILROBERTA WITH HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob_hist")
source_dict_rh = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/rob/histrob_hist_", "_testbeampred.pickle", test_ds)

100%|██████████| 7918/7918 [02:41<00:00, 49.08it/s]


In [None]:
print_orderered_predictions(source_dict_rh, "/content/gdrive/MyDrive/ckpt/dom/rob/histrob_hist_", "_testbeampred.pickle", test_ds, XQA_test)

-------------cnn-------------
question + passage: ['Whom?', '(CNN) -- Dennis Farina, the dapper, mustachioed cop-turned-actor best known for his tough-as-nails work in such TV series as "Law & Order," "Crime Story," and "Miami Vice," has died. He was 69. \n\n"We are deeply saddened by the loss of a great actor and a wonderful man," said his publicist, Lori De Waal, in a statement Monday. "Dennis Farina was always warmhearted and professional, with a great sense of humor and passion for his profession. He will be greatly missed by his family, friends and colleagues." \n\nFarina, who had a long career as a police officer in Chicago, got into acting through director Michael Mann, who used him as a consultant and cast him in his 1981 movie, "Thief." That role led to others in such Mann-created shows as "Miami Vice" (in which Farina played a mobster) and "Crime Story" (in which he starred as Lt. Mike Torello). \n\nFarina also had roles, generally as either cops or gangsters, in a number of 

In [None]:
print_orderered_predictions(source_dict_rh, "/content/gdrive/MyDrive/ckpt/dom/rob/histrob_hist_", "_testbeampred.pickle", test_ds, XQA_test, kind="best", skip_yn=True)

-------------cnn-------------
question + passage: ['Which country was she visiting?', 'NEW YORK (CNN) -- Natasha Richardson, a film star, Tony-winning stage actress and member of the famed Redgrave acting family, died Wednesday after suffering injuries in a ski accident, according to a family statement. She was 45. \n\nNatasha Richardson fell on a beginners\' slope in Canada. \n\nRichardson, wife of actor Liam Neeson, was injured Monday in a fall on a ski slope at a Quebec resort about 80 miles northwest of Montreal. \n\nRichardson\'s family released a statement saying, "Liam Neeson, his sons, and the entire family are shocked and devastated by the tragic death of their beloved Natasha. They are profoundly grateful for the support, love and prayers of everyone, and ask for privacy during this very difficult time." \n\nAccording to a statement from Mont Tremblant Ski Resort, Richardson fell during a lesson on a beginners\' trail. Watch a report on Richardson\'s life » \n\n"She did not s

Besides printing the errors, we also want to show which questions obtain better generated answers once the history is added. In this printout we discard any question whose F1 improves by 33.3 points or less (the code tests for a gain greater than `100/3`), that is, those where at most one of the three answers becomes better.
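
The choice of `100/3` can be illustrated with a minimal sketch. It assumes that the F1 stored for each question is the mean over the three seeds on a 0 to 100 scale; this is suggested by the sentence above but is not checked here, and `mean_f1_gain` is a hypothetical helper, not a function of this notebook.

```python
def mean_f1_gain(f1_before, f1_after):
    """Gain in mean per-seed F1 (0-100 scale) between two model variants."""
    return sum(f1_after) / len(f1_after) - sum(f1_before) / len(f1_before)

THRESHOLD = 100 / 3  # same threshold used by print_enhancements

# Illustrative values only: one seed out of three goes from wrong to exact match.
one_fixed = mean_f1_gain([0, 0, 0], [100, 0, 0])
# Illustrative values only: two seeds out of three go from wrong to exact match.
two_fixed = mean_f1_gain([0, 0, 0], [100, 100, 0])

print(one_fixed > THRESHOLD)  # False: a single fixed answer is not enough
print(two_fixed > THRESHOLD)  # True: the question would be printed
```

Under this assumption a single fully corrected answer yields a gain of exactly `100/3`, so the strict inequality filters it out.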

In [None]:
def print_enhancements(dictw, dictb, prediction_prefix, prediction_suffix, ref_dataset, question_dataset):
  predictionsw = []  # per-seed prediction lists of the first model (prefix/suffix index 0, matching dictw)
  refd = ref_dataset["references"]
  for seed in [42,1337,2022]:
    with open(prediction_prefix[0] + str(seed) + prediction_suffix[0], "rb") as f:
      # load the predictions
      new_list = pickle.load(f)
      # sort them by id to align them
      new_list.sort(key=lambda x: int(x["id"]))
      predictionsw.append(new_list)
  
  predictionsb = []  # per-seed prediction lists of the second model (prefix/suffix index 1, matching dictb)
  for seed in [42,1337,2022]:
    with open(prediction_prefix[1] + str(seed) + prediction_suffix[1], "rb") as f:
      # load the predictions
      new_list = pickle.load(f)
      # sort them by id to align them
      new_list.sort(key=lambda x: int(x["id"]))
      predictionsb.append(new_list)
    
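  enhancements = 0  # number of questions whose F1 improved
  en = 0            # cumulative F1 gain over those questions
  decrements = 0    # number of questions whose F1 got worse
  dec = 0           # cumulative F1 loss (negative) over those questions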
  for key in dictw.keys():
    print("-------------" + key + "-------------")
    dictw[key].sort(key=lambda x : int(x[0])) # sort the list
    dictb[key].sort(key=lambda x : int(x[0])) # sort the list
    for i in range(len(dictw[key])):
      id = dictw[key][i][0]
      if dictw[key][i][0] != dictb[key][i][0]:
        print("error")
      if dictb[key][i][1] - dictw[key][i][1] > 100/3:
        print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
        print("question + passage: " + str(question_dataset[int(id)]))
        print("true answer: " + str(refd[int(id)]["answers"]["text"]))
        print("answer with seed 42: " + predictionsw[0][int(id)]["prediction_text"])
        print("answer with seed 1337: " + predictionsw[1][int(id)]["prediction_text"])
        print("answer with seed 2022: " + predictionsw[2][int(id)]["prediction_text"])
        print("previous f1: " + str(dictw[key][i][1]))
        print("answer with seed 42: " + predictionsb[0][int(id)]["prediction_text"])
        print("answer with seed 1337: " + predictionsb[1][int(id)]["prediction_text"])
        print("answer with seed 2022: " + predictionsb[2][int(id)]["prediction_text"])
        print("new f1: " + str(dictb[key][i][1]))
      if dictb[key][i][1] - dictw[key][i][1] > 0:
        en += dictb[key][i][1] - dictw[key][i][1]
        enhancements +=1
      elif dictb[key][i][1] - dictw[key][i][1] < 0:
        dec += dictb[key][i][1] - dictw[key][i][1]
        decrements +=1
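  # average F1 gain and loss; guard against empty groups to avoid a ZeroDivisionError
  print("enhancements: " + str(en/enhancements if enhancements else 0.0))
  print("decrements: " + str(dec/decrements if decrements else 0.0))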
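
The two loading loops in `print_enhancements` differ only in the index used for the prefix and suffix. A possible refactoring is sketched below; `load_seed_predictions` is a hypothetical helper (it is not defined elsewhere in this notebook), and the round trip uses placeholder records written to a temporary directory rather than our real prediction files.

```python
import os
import pickle
import tempfile

def load_seed_predictions(prefix, suffix, seeds=(42, 1337, 2022)):
    """Load one pickled prediction list per seed and sort each by integer id."""
    all_predictions = []
    for seed in seeds:
        with open(prefix + str(seed) + suffix, "rb") as f:
            preds = pickle.load(f)
        # sort by id so that the lists of different seeds are aligned
        preds.sort(key=lambda x: int(x["id"]))
        all_predictions.append(preds)
    return all_predictions

# Toy round trip: the records are placeholders, not model outputs.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "demo_")
    for seed in (42, 1337, 2022):
        with open(prefix + str(seed) + "_pred.pickle", "wb") as f:
            pickle.dump([{"id": "1", "prediction_text": "b"},
                         {"id": "0", "prediction_text": "a"}], f)
    loaded = load_seed_predictions(prefix, "_pred.pickle")

print([p["id"] for p in loaded[0]])
```

With such a helper, each of the two loops would reduce to a single call, one per model variant.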
        

In [None]:
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_hist")
print_enhancements(source_dict_tnh, source_dict_th, ["/content/gdrive/MyDrive/ckpt/dom/tinytiny_","/content/gdrive/MyDrive/ckpt/dom/tiny/histtiny_hist_"], ["_testbeampred.pickle", "_testbeampred.pickle"],test_ds, XQA_test,)

-------------cnn-------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
question + passage: ['iiiiiiiiiiis that why he did not want to go down ?', '(CNN) -- Marathon man John Isner survived another epic on his return to competitive tennis after his record-breaking 11-hour Wimbledon match. \n\nThe American saved two match points before beating Gilles Muller, from Luxembourg, 4-6 7-6 7-6 to seal his place in the quarterfinals of the Atlanta Tennis Championships. \n\nIt is Isner\'s first tournament since Wimbledon back in June when his opening round victory over Frenchman Nicolas Mahut in London clocked in as the longest match in tennis history. \n\nIsner\'s battle with Mahut stretched over three days and 183 games before he finally triumphed, 6-4 3-6 6-7 7-6 70-68. He later told CNN: "I really didn\'t think it was going to end." \n\nIsner reflects on \'crazy\' Wimbledon match \n\nThe match turned Isner into a household name in the sport, and after his straight sets defeat to Thiemo de Bakker in the

In [None]:
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob_hist")
print_enhancements(source_dict_rnh, source_dict_rh, ["/content/gdrive/MyDrive/ckpt/dom/robrob_","/content/gdrive/MyDrive/ckpt/dom/rob/histrob_hist_"], ["_testbeampred.pickle", "_testbeampred.pickle"],test_ds, XQA_test,)

-------------cnn-------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
question + passage: ['Where was he last week?', '(Financial Times) -- While most consumer goods companies are seeking opportunities in China, domestic condom maker Safedom is going in the opposite direction -- seeking European partners or acquisitions as part of a bid to go global. \n\nThe company, majority-owned by its founder, has grown rapidly in its short life. It will sell 200m condoms this year, all within China, and is targeting sales of 1bn next year; the same number that Durex, the world\'s biggest player, was producing in the country within three years. \n\nBrian Fu, chief executive, was in the UK last week "meeting potential partners and acquisitions". Funding for any deal will either come from existing shareholders, bank loans or possibly through an overseas stock market listing, he said. \n\nDespite the size of the market on its own doorstep -- and added attraction of a government-mandated one-child policy -- Saf

After inspecting some samples of the generated answers, we can sum up the strengths and weaknesses of the models:
- All the models seem to capture the semantic relation between question and answer, that is, they correctly pick the class of words to use depending on the kind of question. This can be seen, for example, in any "Yes/No" question: even when a model gives a wrong answer, it is usually "yes" instead of "no" or vice versa. The same holds for "Date/time" answers, where the models respond with appropriate time-related words.
- The models also seem to capture the semantic relations between words, probably thanks to the BERT encodings. In fact, especially in some wrong answers, we can see that the model still answers with a semantically close word: <br><code> passage: 'CHAPTER XXII Northward, along the leeward coast of Malaita, the _Ariel_ worked her leisurely way, threading the colour-riotous lagoon that lay between the shore-reefs and outer-reefs [...] <br> question: 'What lay between the shore-reefs and outer-reefs?' <br>
true answer: 'lagoon' -> pred: fish, the depth of boats <br>
question: 'What worked her way northward?' <br>
true answer: 'the ariel' -> pred: a ship, a crafty ship, a boat </code> <br>
The first answer was generated by the tiny BERT model without history, the second by the DistilRoBERTa model with history.
- Most of the correct generations are obtained on very short answers (1 to 2 words), and especially on "Yes" answers, which are the majority.
- The models show some problems with "Counting" questions, especially when required to count beyond two or three. This is probably related to the fact that answers containing "two" or "three" are the majority among the counting ones.
- The models also show some problems when answering questions whose true answer is "no", due to the intrinsic complexity of handling negation with a language model. However, they still manage to answer some of them correctly, and the variants trained with the dialogue history seem to improve on these answers.
- Model variants trained with the history do show improvements on some questions that refer back to the conversational history.
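
The observations above come from reading samples. One way to check them quantitatively would be to group the per-question F1 by the kind of gold answer, as in the sketch below. The `answer_type` heuristic and the `mean_f1_by_type` helper are hypothetical and not part of this notebook, and the toy values are for illustration only: they are not results from our models.

```python
from collections import defaultdict

# Hypothetical word list used to spot "Counting" answers.
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def answer_type(gold_answer):
    """Rough bucket for a gold answer: yes, no, counting, short or long."""
    text = gold_answer.strip().lower().rstrip(".")
    if text in {"yes", "no"}:
        return text
    if text.isdigit() or text in NUMBER_WORDS:
        return "counting"
    return "short" if len(text.split()) <= 2 else "long"

def mean_f1_by_type(records):
    """records: iterable of (gold_answer, f1) pairs; returns mean F1 per bucket."""
    buckets = defaultdict(list)
    for gold, f1 in records:
        buckets[answer_type(gold)].append(f1)
    return {kind: sum(values) / len(values) for kind, values in buckets.items()}

# Toy values for illustration only.
toy = [("Yes", 100.0), ("no", 0.0), ("three", 50.0), ("5", 0.0), ("the ariel", 0.0)]
summary = mean_f1_by_type(toy)
print(summary)
```

Running the same grouping on the dictionaries returned by `find_worst_errors` would require mapping each id to its gold answer first.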

Finally, we can say that there is room for improvement, but the results are still satisfactory, as our best model (DistilRoBERTa with history) achieves around 18% SQuAD-F1.
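
To put this score in context, it helps to recall how a token-overlap F1 treats answers that are semantically close but lexically different, such as "a ship" for "the ariel". The sketch below assumes the usual SQuAD v1.1 definition (lowercasing, removal of punctuation and English articles, then precision and recall over shared tokens); it is not the evaluation code used in this notebook, and it returns values in [0, 1] rather than percentages.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer, in [0, 1]."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("a ship", "the ariel"))     # no shared token after normalization
print(token_f1("The Ariel", "the ariel"))  # identical after normalization
```

Under this definition a semantically reasonable answer like "a ship" scores zero, so part of the gap to a higher F1 comes from generations that are plausible but do not reuse the reference wording.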