<a href="https://colab.research.google.com/github/DomMcOyle/NLP-Assigments-22-23/blob/Assignment-2/Assignment2DP(G)Rwith%20history.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

### Task

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable;

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We will experiment with transformer-based architectures to define the following two models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.
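
The two variants differ only in what is fed to the encoder. A minimal sketch, assuming the history is flattened into the passage with a `[SEP]` separator, most recent turn first (this mirrors what `extract_data` does later in this notebook; the helper `build_input` and the toy strings are hypothetical):

```python
def build_input(question, passage, history=None, sep="[SEP]"):
    """Return the [question, context] pair fed to the encoder.

    history: list of (question, answer) tuples for previous turns, oldest first.
    With history=None this is the f(Q, P) variant, otherwise f(Q, P, H).
    """
    context = passage
    # Append the previous turns to the passage, most recent first.
    for past_q, past_a in reversed(history or []):
        context += sep + past_q + sep + past_a
    return [question, context]


passage = "Cotton was a little white kitten."
history = [("What color was Cotton?", "white")]

without_history = build_input("Where did she live?", passage)
with_history = build_input("Where did she live?", passage, history)
print(without_history)
print(with_history)
```

Appending the history after the passage means that, once the input is truncated to the encoder's maximum length, the history is the first thing to be cut; whether that is desirable is a design choice this sketch does not settle.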

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair comes with a rationale $R$: a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a required output, but it can be used as additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]",   % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```
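
To make the field layout concrete, here is a small sketch that walks one dialogue and recovers the rationale by slicing the story with the character offsets. It assumes that questions and answers are aligned by position and that the end offset is exclusive, which holds for the first turn of the snippet above; the key name `span_end` follows the extraction code later in this notebook, and `iter_turns` is a hypothetical helper.

```python
dialogue = {
    "story": "Once upon a time, in a barn near a farm house, there lived "
             "a little white kitten named Cotton.",
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1}],
    "answers": [{
        "span_start": 59,
        "span_end": 93,
        "span_text": "a little white kitten named Cotton",
        "input_text": "white",
        "turn_id": 1,
    }],
}


def iter_turns(dialogue):
    """Yield (question, answer, rationale) for each turn, matched by position."""
    for q, a in zip(dialogue["questions"], dialogue["answers"]):
        # The rationale is the story slice delimited by the two offsets.
        rationale = dialogue["story"][a["span_start"]:a["span_end"]]
        yield q["input_text"], a["input_text"], rationale


turns = list(iter_turns(dialogue))
print(turns)
```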

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one ground-truth answer $A$ and text rationale $R$.

Only 1.3% of CoQA questions are unanswerable. For simplicity, we **ignore** those QA pairs.

## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.
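
One possible shape for such a script, assuming (as the extraction code in this notebook does) that unanswerable turns are marked with `span_end == -1`. The helper `drop_unanswerable` and the toy dialogue are hypothetical and only illustrate the filtering step:

```python
def drop_unanswerable(dialogue):
    """Return a copy of the dialogue without the turns whose answer has span_end == -1."""
    kept = [
        (q, a)
        for q, a in zip(dialogue["questions"], dialogue["answers"])
        if a["span_end"] != -1
    ]
    cleaned = dict(dialogue)  # shallow copy: the original dialogue is left untouched
    cleaned["questions"] = [q for q, _ in kept]
    cleaned["answers"] = [a for _, a in kept]
    return cleaned


toy = {
    "id": "toy-dialogue",
    "questions": [
        {"input_text": "What color was Cotton?", "turn_id": 1},
        {"input_text": "Who owned the barn?", "turn_id": 2},
    ],
    "answers": [
        {"span_start": 59, "span_end": 93, "input_text": "white", "turn_id": 1},
        {"span_start": -1, "span_end": -1, "input_text": "unknown", "turn_id": 2},
    ],
}

cleaned = drop_unanswerable(toy)
print(len(cleaned["questions"]), len(cleaned["answers"]))
```

Filtering questions and answers together keeps the two lists aligned, which matters when earlier turns are later reused as dialogue history.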

In [1]:
!pip install transformers
!pip install tensorflow-addons
!pip install datasets
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us

In [2]:
import json
import random
import urllib.request
import torch
import pickle
import re
import os
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa

from google.colab import drive
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from datasets import Dataset, load_from_disk
from evaluate import load
from transformers import TFAutoModel, AutoTokenizer
from transformers import logging

logging.set_verbosity_error()

## Dataset Download


In [3]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [5]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:08, 5.83MB/s]                            


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:03, 2.72MB/s]                            

Download completed!





#### Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the task's inputs and outputs!
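
A quick way to get a feel for the data is to count dialogues and question turns per source. The sketch below runs on a tiny hand-made list; with the real data, the same function could presumably be applied to the `"data"` list of the loaded JSON. The helper `summarize` and the toy entries are hypothetical.

```python
from collections import Counter


def summarize(dialogues):
    """Count dialogues and question turns per source."""
    dialogues_per_source = Counter(d["source"] for d in dialogues)
    turns_per_source = Counter()
    for d in dialogues:
        turns_per_source[d["source"]] += len(d["questions"])
    return dialogues_per_source, turns_per_source


# Toy stand-in for the list of dialogues; values are illustrative only.
toy_data = [
    {"source": "mctest", "questions": [{"turn_id": 1}, {"turn_id": 2}]},
    {"source": "cnn", "questions": [{"turn_id": 1}]},
    {"source": "mctest", "questions": [{"turn_id": 1}]},
]

per_source, turns = summarize(toy_data)
print(per_source)
print(turns)
```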

In [6]:
def set_seed(SEED):
  random.seed(SEED) # if you're using random
  np.random.seed(SEED) # if you're using numpy
  torch.manual_seed(SEED) # torch.cuda.manual_seed_all(SEED) is not required
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  tf.random.set_seed(SEED) # setting the seed for tensorflow too
  os.environ['TF_DETERMINISTIC_OPS'] = '1'

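def extract_data(split_dataset, add_history=False, sep_char="[SEP]"):
  """
  Extract model inputs and targets from the list of dialogue dictionaries of the CoQA dataset.
  :params:
    split_dataset: list of dialogue dictionaries from which to extract the (question, passage) pairs and the corresponding answers
    add_history: if True, append the previous answerable turns (most recent first) to the passage
    sep_char: separator placed before each appended history question and answer
  :returns:
    XQA (list of [question, passage] pairs), YQA (list of answers), story_source (list of sources)
  """  
  XQA = [] # list that will contain the pairs [Q, P]
  YQA = [] # list that will contain the answers
  story_source = [] # list that will contain the category/source of each example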
  for d in split_dataset: # scan each document
    for i in range(len(d["questions"])): # scan each question
      if d["answers"][i]["span_end"]!=-1: # discard unanswerable questions
        single_example = [] # prepare the single example...
        single_example.append(d["questions"][i]["input_text"]) #... with the question ...
        single_example.append(d["story"]) # ...and the passage
        if add_history:
          for j in range(i-1,-1,-1):
            if d["answers"][j]["span_end"]!=-1:
              single_example[1] = single_example[1] + sep_char + d["questions"][j]["input_text"]+ sep_char + d["answers"][j]["input_text"]
              
        XQA.append(single_example) # and append it
        YQA.append(d["answers"][i]["input_text"]) # add the answer
        story_source.append(d["source"]) # add the source
  return XQA, YQA, story_source

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data into train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
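
The three requirements above can be sketched with the standard library alone. This is an illustration of a dialogue-level split, not the split used in this notebook (which relies on scikit-learn's `train_test_split`), so it will not reproduce the same partition; `split_dialogues` and the toy dialogues are hypothetical.

```python
import random


def split_dialogues(dialogues, train_fraction=0.8, seed=42):
    """Shuffle whole dialogues with a fixed seed and cut them 80/20."""
    shuffled = list(dialogues)  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the list of dialogues: ten entries identified by "id".
dialogues = [{"id": f"dialogue-{i}"} for i in range(10)]
train_part, val_part = split_dialogues(dialogues)

train_ids = {d["id"] for d in train_part}
val_ids = {d["id"] for d in val_part}
print(len(train_part), len(val_part), train_ids.isdisjoint(val_ids))
```

Because the unit being shuffled is the dialogue rather than the QA pair, every turn of a conversation ends up in the same split, and the fixed seed makes the partition repeatable across runs.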

#### Reproducibility Memo

Refer back to tutorial 2 for how to fix a specific random seed for reproducibility!

In [7]:
seed = 42 
set_seed(seed)

In [8]:
## MODEL NAME
#model_name = 'distilroberta-base'
model_name = 'prajjwal1/bert-tiny'
add_history=True

with open('coqa/train.json') as f:
  # loading the training json
  train_json = json.load(f)

with open('coqa/test.json') as f:
  # loading the test json
  test_json = json.load(f)

# splitting training data
train_data, val_data = train_test_split(train_json["data"],
                                        train_size=0.8,
                                        shuffle=True,
                                        random_state=seed)
# extracting X as list of pairs [Question, Passage] and Y as a list of strings (Answers) 
XQA_train, YQA_train, source_train = extract_data(train_data, add_history)
XQA_val, YQA_val, source_val = extract_data(val_data, add_history)
XQA_test, YQA_test, source_test = extract_data(test_json["data"], add_history)
del(train_json)
del(test_json)

print("Fourth training example:")
print(XQA_train[3])
print(YQA_train[3])
print(source_train[3])
print("Fourth validation example:")
print(XQA_val[3])
print(YQA_val[3])
print(source_val[3])
print("Fourth test example:")
print(XQA_test[3])
print(YQA_test[3])
print(source_test[3])
print(YQA_train[61])

Fourth training example:
['When was the last one held?', 'TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis\'s upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation\'s first national elections since the country\'s independence in 1956. \n\n"It\'s a wonderful day. It\'s the first time we can choose our own representatives," said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could "get used to freedom and democracy." \n\nTunisia\'s election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political par

In [9]:
## broken example fix:
print(XQA_train[61])
print(YQA_train[61])
YQA_train[61] = 'October'
print(YQA_train[61])

['what month?', 'Microsoft Word is a word processor developed by Microsoft. It was first released on October 25, 1983 under the name "Multi-Tool Word" for Xenix systems. Subsequent versions were later written for several other platforms including IBM PCs running DOS (1983), Apple Macintosh running Classic Mac OS (1985), AT&T Unix PC (1985), Atari ST (1988), OS/2 (1989), Microsoft Windows (1989), SCO Unix (1994), and macOS (2001). Commercial versions of Word are licensed as a standalone product or as a component of Microsoft Office, Windows RT or the discontinued Microsoft Works suite. Microsoft Word Viewer and Office Online are freeware editions of Word with limited features. \n\nIn 1981, Microsoft hired Charles Simonyi, the primary developer of Bravo, the first GUI word processor, which was developed at Xerox PARC. Simonyi started work on a word processor called "Multi-Tool Word" and soon hired Richard Brodie, a former Xerox intern, who became the primary software engineer. \n\nMicros

In [10]:
def filter_string(x):
  return re.sub('[!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n]',"",x)
  
# the filter list drops '<', '>' and '-' from the Keras default filters, so the
# <start> and <end> markers survive tokenization; all other punctuation is stripped
output_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n', oov_token='<UNK>')
output_tokenizer.fit_on_texts(["<start> " + i + " <end>" for i in YQA_train])

input_tokenizer = AutoTokenizer.from_pretrained(model_name)

max_output_length = max([len(i) for i in YQA_train])
print("Max input output found: " + str(max([len(i) for i in output_tokenizer.texts_to_sequences(YQA_train)])))
#max_sequence_length = max(512, max_output_length)
print("99° percentile of training set answer length:" + str(np.quantile([len(i) for i in output_tokenizer.texts_to_sequences(YQA_train)], 0.99)))
# actual percentile is 17, given that each string has the beginning and ending token
max_sequence_length = 20


print(np.argmax([len(i) for i in YQA_train]))
print(XQA_train[7529])
print(YQA_train[7529])

dataset_suffix = "_hist" if add_history else ""

Downloading:   0%|          | 0.00/285 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Max input output found: 83
99° percentile of training set answer length:13.0
7529
["What symptoms of addiction does Orzack's center list?", 'Caught in the Web A few months ago, it wasn\'t unusual for 47-year-old Carla Toebe to spend 15 hours per day online. She\'d wake up early, turn on her laptop and chat on Internet dating sites and instant-messaging programs - leaving her bed for only brief intervals. Her household bills piled up, along with the dishes and dirty laundry, but it took near-constant complaints from her four daughters before she realized she had a problem. "I was starting to feel like my whole world was falling apart - kind of slipping into a depression," said Carla. "I knew that if I didn\'t get off the dating sites, I\'d just keep going," detaching herself further from the outside world. Toebe\'s conclusion: She felt like she was "addicted" to the Internet. She\'s not alone. Concern about excessive Internet use isn\'t new. As far back as 1995, articles in medical jour

In [None]:
# generate dataset
train_ds = Dataset.from_dict({"xqa": XQA_train, "yqa": ["<start> " + i + " <end>" for i in YQA_train], "source":source_train})
train_ds = train_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
train_ds = train_ds.map(lambda x: {"y_token": output_tokenizer.texts_to_sequences(x["yqa"])}, batched=True)
train_ds = train_ds.map(lambda x: {"y_padded": tf.keras.preprocessing.sequence.pad_sequences(x["y_token"],
                                                                     padding='post',
                                                                     maxlen=max_sequence_length)}, batched=True
)
train_ds = train_ds.remove_columns(["xqa", "yqa", "y_token"])
train_ds = train_ds.with_format(type="tensorflow")
if model_name == 'prajjwal1/bert-tiny':
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds" + dataset_suffix)
else:
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds_rob" + dataset_suffix)

  0%|          | 0/86 [00:00<?, ?ba/s]

  0%|          | 0/86 [00:00<?, ?ba/s]

  0%|          | 0/86 [00:00<?, ?ba/s]

Saving the dataset (0/1 shards):   0%|          | 0/85807 [00:00<?, ? examples/s]

In [None]:
# predictions are in lower case, so we consider the labels in lower case
val_ds = Dataset.from_dict({"xqa": XQA_val,
                            "yqa": [filter_string(i.lower()) for i in YQA_val],
                            "id_placeholder": list(range(len(YQA_val))),
                            "source":source_val})
val_ds = val_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
val_ds = val_ds.map(lambda x:{"references": {'answers':{'text':[x["yqa"]], 'answer_start': [42]},
    'id': str(x["id_placeholder"]) } })

val_ds = val_ds.remove_columns(["xqa","yqa", "id_placeholder"])

if model_name == 'prajjwal1/bert-tiny':  
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask","token_type_ids"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds" + dataset_suffix)
else:
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds_rob" + dataset_suffix)

  0%|          | 0/22 [00:00<?, ?ba/s]

  0%|          | 0/21479 [00:00<?, ?ex/s]

Saving the dataset (0/1 shards):   0%|          | 0/21479 [00:00<?, ? examples/s]

In [None]:
# predictions are in lower case, so we consider the labels in lower case
test_ds = Dataset.from_dict({"xqa": XQA_test,
                             "yqa": [filter_string(i.lower()) for i in YQA_test],
                             "id_placeholder": list(range(len(YQA_test))),
                             "source":source_test})
test_ds = test_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
test_ds = test_ds.map(lambda x:{"references": {'answers':{'text':[x["yqa"]], 'answer_start': [42]},
    'id': str(x["id_placeholder"]) } })

test_ds = test_ds.remove_columns(["xqa","yqa", "id_placeholder"])

if model_name == 'prajjwal1/bert-tiny':  
  test_ds = test_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask","token_type_ids"], output_all_columns=True)
  test_ds.save_to_disk("gdrive/MyDrive/ckpt/test_ds" + dataset_suffix)
else:
  test_ds = test_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask"], output_all_columns=True)
  test_ds.save_to_disk("gdrive/MyDrive/ckpt/test_ds_rob" + dataset_suffix)

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/7918 [00:00<?, ?ex/s]

Saving the dataset (0/1 shards):   0%|          | 0/7918 [00:00<?, ? examples/s]

## [Task 3] Model definition

Write your own script to define the following transformer-based models from [Hugging Face](https://huggingface.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!
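
The two checkpoints do not take exactly the same inputs in this notebook: only the BERT-based one is fed `token_type_ids`. A small sketch of that bookkeeping, using the checkpoint names that appear in the code cells (the dictionary `CHECKPOINTS` and the helper `encoder_input_keys` are hypothetical):

```python
# Checkpoint names as used in the code cells of this notebook.
CHECKPOINTS = {
    "M1": "distilroberta-base",
    "M2": "prajjwal1/bert-tiny",
}


def encoder_input_keys(checkpoint):
    """Names of the tokenizer outputs passed to the encoder for a checkpoint."""
    keys = ["input_ids", "attention_mask"]
    # Only the BERT checkpoint is fed segment ids in this notebook.
    if checkpoint == "prajjwal1/bert-tiny":
        keys.append("token_type_ids")
    return keys


for label, checkpoint in CHECKPOINTS.items():
    print(label, checkpoint, encoder_input_keys(checkpoint))
```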

In [11]:
# THIS IS A SEPARATOR ##########################################################

"""
This was tested with:
tensorflow==2.6
tensorflow-gpu==2.6
tensorflow-addons==0.16.1
transformers==4.18.0
Keras==2.6.0

Note 1: Simple adaptation of tf_seq2seq_lstm.py script
Note 2: make sure Keras and Tensorflow versions match!

"""


# check if training can be performed on GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)


class MyTrainer(object):
    """
    Simple wrapper class

    train_op -> uses tf.GradientTape to compute the loss
    batch_fit -> receives a batch and performs forward-backward passes (gradient included) 
    """

    def __init__(self, encoder, decoder, max_length):
        self.encoder = encoder
        self.decoder = decoder
        self.max_length = max_length
        self.ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, 
                                                                reduction='none') # from_logits=True means the loss expects raw,
                                                                                  # unnormalized scores and applies the softmax itself,
                                                                                  # so the decoder must not end with a softmax activation
                                                                                  # (applying it twice would squash the values)
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=1e-03)            # here it is possible to tweak the learning rate

    @tf.function
    def compute_loss(self, logits, target):
        loss = self.ce(y_true=target, y_pred=logits)
        mask = tf.logical_not(tf.math.equal(target, 0))
        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask # pointwise product
        return tf.reduce_mean(loss)

    @tf.function
    def train_op(self, inputs):
        with tf.GradientTape() as tape:
            # NOTABENE: it is necessary to add token_type_ids to see how it performs
            if self.encoder.use_token_type_ids:
              encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask'],
                                                                  'token_type_ids': inputs['token_type_ids']})
            else:
              encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask']})
            decoder_input = inputs['y_padded'][:, :-1]  # ignore <end>
            real_target = inputs['y_padded'][:, 1:]  # ignore <start>

            # encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs[0][0],
            #                                                    'attention_mask': inputs[0][1]})
            # decoder_input = inputs[1][:, :-1]
            # real_target = inputs[1][:, 1:]

            self.decoder.attention.setup_memory(encoder_output) # setup in order to perform attention queries over the 
                                                           # embedding space

            # decoder initialization, check build_initial_state for additional insights
            decoder_initial_state = self.decoder.build_initial_state(self.decoder.batch_size, [encoder_h, encoder_s])
            # the input is then passed to the initialized decoder and we obtain predictions
            # in rnn_output format because the model is BERT-embedding-sequence-sequence, so the
            # last layer is still a sequence of cells (a RNN)
            predicted = self.decoder({'input_ids': decoder_input,
                                      'initial_state': decoder_initial_state}).rnn_output
            # we compute the losses over the computed predictions
            loss = self.compute_loss(logits=predicted, target=real_target)
        # gradients of the loss computed for this minibatch considering trainable
        # parameters of encoder and decoder
        grads = tape.gradient(loss, self.encoder.trainable_variables + self.decoder.trainable_variables)
        return loss, grads

    @tf.function
    def batch_fit(self, inputs):
        loss, grads = self.train_op(inputs=inputs)
        # applies gradients to the trainable variables using Adam
        self.optimizer.apply_gradients(zip(grads, self.encoder.trainable_variables + self.decoder.trainable_variables))
        return loss

    # @tf.function
    def generate(self, output_tokenizer, input_ids,token_type_ids, attention_mask=None):
        batch_size = input_ids.shape[0] # input_ids is the minibatch
        # run the encoder once, passing token_type_ids only to the models that use them
        if self.encoder.use_token_type_ids:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask,
                                                                  'token_type_ids': token_type_ids})
        else:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask})
        start_tokens = tf.fill([batch_size], output_tokenizer.word_index['<start>'])
        end_token = output_tokenizer.word_index['<end>']

        # samples the possible answer with greedy technique, we could possibly
        # use a variant here such as beam search at inference time 
        # We could not do this at training time, since the Sampler used at training
        # is not designed to project the token in an embedding space before computing
        # the next one. The aforementioned embedding space
        # is changing at each backpropagation step anyways, thus we stick with
        # the computation of the argmax of the logits using TrainingSampler.
        # NOTABENE: we can still change this sampler, find a way to penalize repetitions
        # and perform the beam search
        greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler() 
        # training and inference use different decoders: the inference decoder
        # is rebuilt here, with the greedy sampler, on every call to generate
        decoder_instance = tfa.seq2seq.BasicDecoder(cell=self.decoder.wrapped_decoder_cell,
                                                    sampler=greedy_sampler,
                                                    output_layer=self.decoder.generation_dense,
                                                    maximum_iterations=self.max_length)
        self.decoder.attention.setup_memory(encoder_output)

        # decoder_initial_state is still an output of the encoder, we pass it to
        # the decoder_instance in order to get the outputs
        decoder_initial_state = self.decoder.build_initial_state(batch_size, [encoder_h, encoder_s])
        
        decoder_embedding_matrix = self.decoder.embedding.variables[0]
        outputs, _, _ = decoder_instance(decoder_embedding_matrix,
                                         start_tokens=start_tokens,
                                         end_token=end_token,
                                         initial_state=decoder_initial_state)
        return outputs

    def translate(self, generated, output_tokenizer):
        return output_tokenizer.sequences_to_texts(generated.sample_id.numpy())

    def beam_translate(self, results, output_tokenizer):
        return output_tokenizer.sequences_to_texts(results[0][:,0,:])

    def beam_generate(self, output_tokenizer, input_ids,token_type_ids, attention_mask=None, beam_width=3, length_penalty=0.5):
        batch_size = input_ids.shape[0] # input_ids is the minibatch
        # run the encoder once, passing token_type_ids only to the models that use them
        if self.encoder.use_token_type_ids:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask,
                                                                  'token_type_ids': token_type_ids})
        else:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask})
        start_tokens = tf.fill([batch_size], output_tokenizer.word_index['<start>'])
        end_token = output_tokenizer.word_index['<end>']
        
        # From official documentation
        # NOTE If you are using the BeamSearchDecoder with a cell wrapped in AttentionWrapper, then you must ensure that:
        # The encoder output has been tiled to beam_width via tfa.seq2seq.tile_batch (NOT tf.tile).
        # The batch_size argument passed to the get_initial_state method of this wrapper is equal to true_batch_size * beam_width.
        # The initial state created with get_initial_state above contains a cell_state value containing properly tiled final state from the encoder.

        encoder_output = tfa.seq2seq.tile_batch(encoder_output, multiplier=beam_width)
        self.decoder.attention.setup_memory(encoder_output)

        # set decoder_initial_state, which is an AttentionWrapperState, taking beam_width into account
        hidden_state = tfa.seq2seq.tile_batch([encoder_h, encoder_s], multiplier=beam_width)
        decoder_initial_state = self.decoder.build_initial_state(beam_width*batch_size, hidden_state)

        # Instantiate BeamSearchDecoder
        decoder_instance = tfa.seq2seq.BeamSearchDecoder(self.decoder.wrapped_decoder_cell,
                                                          beam_width=beam_width,
                                                          output_layer=self.decoder.generation_dense,
                                                          length_penalty_weight=length_penalty,
                                                          maximum_iterations=self.max_length)
        decoder_embedding_matrix = self.decoder.embedding.variables[0]

        # The BeamSearchDecoder object's call() function takes care of everything.
        outputs, final_state, sequence_lengths = decoder_instance(decoder_embedding_matrix, 
                                                                  start_tokens=start_tokens,
                                                                  end_token=end_token,
                                                                  initial_state=decoder_initial_state)
        # outputs is tfa.seq2seq.FinalBeamSearchDecoderOutput object. 
        # The final beam predictions are stored in outputs.predicted_ids
        # outputs.beam_search_decoder_output is a tfa.seq2seq.BeamSearchDecoderOutput object which keeps track of beam_scores and parent_ids while performing a beam decoding step
        # final_state = tfa.seq2seq.BeamSearchDecoderState object.
        # Sequence Length = [inference_batch_size, beam_width] details the maximum length of the beams that are generated


        # outputs.predicted_ids.shape = (inference_batch_size, time_step_outputs, beam_width)
        # outputs.beam_search_decoder_output.scores.shape = (inference_batch_size, time_step_outputs, beam_width)
        # Convert the shape of outputs and beam_scores to (inference_batch_size, beam_width, time_step_outputs)
        final_outputs = tf.transpose(outputs.predicted_ids, perm=(0,2,1))
        beam_scores = tf.transpose(outputs.beam_search_decoder_output.scores, perm=(0,2,1))

        return final_outputs.numpy(), beam_scores.numpy()

      


class Encoder(tf.keras.Model):

    def __init__(self, model_name, decoder_units):
        super(Encoder, self).__init__()
        self.model = TFAutoModel.from_pretrained(model_name, from_pt=True, trainable=False)
        self.model.trainable=False
        self.reducer = tf.keras.layers.Dense(decoder_units)
        self.reducer2 = tf.keras.layers.Dense(decoder_units)
        self.avg_pool = tf.keras.layers.AveragePooling1D(pool_size = 512)
        self.use_token_type_ids = model_name=='prajjwal1/bert-tiny'

    def call(self, inputs, training=False, **kwargs):
        model_output = self.model(inputs)
        
        # all_outputs has shape (batch_size, 512, hidden_size);
        # hidden_size is 128 for bert-tiny
        all_outputs = model_output[0] # output of the last layer of the model
        #pooled_output = model_output[1] # last layer but processed by a linear 
                                        # layer and a tanh
        
        # encoding of the first token ([CLS] for BERT)
        hidden_pooled = all_outputs[:, 0, :]
        # average of the encodings over the 512 token positions
        cell_state = self.avg_pool(all_outputs)
        cell_state = tf.reshape(cell_state, [all_outputs.shape[0], all_outputs.shape[2]])

        # NOTE: the encoding could be improved further at this point
        
        # hidden_pooled has shape (batch_size, hidden_size)
        hidden_state = self.reducer(hidden_pooled)
        cell_state = self.reducer2(cell_state)
        #return all_outputs, self.reducer(model_output[1]), self.reducer(model_output[1])
        return all_outputs, hidden_state, cell_state


class Decoder(tf.keras.Model):

    def __init__(self, vocab_size, max_sequence_length, embedding_dim, decoder_units, batch_size):
        super(Decoder, self).__init__()

        self.max_sequence_length = max_sequence_length
        self.batch_size = batch_size

        self.decoder_units = decoder_units
        # NOTE: the embedding dimension and the number of LSTM units can be changed
        self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                                   output_dim=embedding_dim)
        # NOTE: LSTMCell could be swapped for GRUCell
        self.decoder_lstm_cell = tf.keras.layers.LSTMCell(self.decoder_units)
        # NOTE: only one type of attention is used here; it could be replaced
        # to look for different results
        self.attention = tfa.seq2seq.BahdanauAttention(units=self.decoder_units,
                                                       memory=None,
                                                       memory_sequence_length=self.batch_size * [max_sequence_length])

        self.wrapped_decoder_cell = tfa.seq2seq.AttentionWrapper(self.decoder_lstm_cell,
                                                                 self.attention,
                                                                 attention_layer_size=self.decoder_units) # adds the attention mechanism after a single
                                                                                # LSTM cell, because we pass one word at a time
        # dense layer needed to generate one unnormalized score per
        # vocabulary entry (turned into a probability for each word)
        self.generation_dense = tf.keras.layers.Dense(vocab_size)
        # The explanation above describes why this cannot be changed and why it
        # resembles the greedy sampler
        self.sampler = tfa.seq2seq.sampler.TrainingSampler()
        self.decoder = tfa.seq2seq.BasicDecoder(self.wrapped_decoder_cell,
                                                sampler=self.sampler,
                                                output_layer=self.generation_dense)

    def build_initial_state(self, batch_size, encoder_state):
        # After initializing the tensors within the attention layer to 0, we set
        # the designated initialization that allows us to query the embedding space,
        # which is passed as encoder_state.
        # We load the embedding of a single batch and we actually don't freeze 
        # the parameters related to BERT, that are modified and can possibly 
        # overfit. 
        initial_state = self.wrapped_decoder_cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)
        initial_state = initial_state.clone(cell_state=encoder_state) 
        return initial_state

    def call(self, inputs, training=False, **kwargs):
        # as shown in calls, inputs is a dictionary with entries: 
        # "input_ids" : _encoder_output_
        # "initial_state" : _result_of_build_initial_state_
        input_ids = inputs['input_ids']
        input_emb = self.embedding(input_ids)
        decoder_output, _, _ = self.decoder(input_emb,
                                            initial_state=inputs['initial_state'],
                                            sequence_length=self.batch_size * [self.max_sequence_length - 1])
        return decoder_output


# THIS IS A SEPARATOR ##########################################################

In [12]:
def train_loop(trainer, dataset, epochs, batch_size, ckpt_manager):
  steps_per_epoch = len(dataset)//batch_size
  
  for epoch in tqdm(range(epochs)):
    batch_index = 0
    cumulative_loss = 0

    for batch_index in tqdm(range(steps_per_epoch), position=0, leave=True):
      loss = trainer.batch_fit(dataset[batch_index*batch_size:batch_index*batch_size+batch_size])
      cumulative_loss += loss

    ckpt_manager.save()
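    # average over the number of batches processed in this epoch
    # (batch_index ends at steps_per_epoch - 1, so it is not the right divisor)
    mean_loss = cumulative_loss / steps_per_epoch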
    print(f"Current mean {mean_loss}")


def predict_loop(trainer, dataset, inference_batch_size,model_name,output_tokenizer, beam_search=False):
  ttids=None
  if beam_search:
    generation_func = trainer.beam_generate
    translation_func = trainer.beam_translate
  else:
    generation_func = trainer.generate
    translation_func = trainer.translate
  
  inference_step = len(dataset) // inference_batch_size
  predictions = []
  for step_index in tqdm(range(inference_step)):
    starting_index = step_index*inference_batch_size
    ending_index = step_index*inference_batch_size + inference_batch_size
    if model_name == 'prajjwal1/bert-tiny':  
      ttids = dataset["token_type_ids"][starting_index : ending_index]
    generated = generation_func(output_tokenizer=output_tokenizer, 
                                  input_ids=dataset["input_ids"][starting_index : ending_index],
                                  token_type_ids=ttids,
                                  attention_mask=dataset["attention_mask"][starting_index : ending_index])
    translated = translation_func(generated, output_tokenizer=output_tokenizer)
    # the index arithmetic below keeps the "id" field equal to the row index in the dataset
    list_to_add = [{'prediction_text': translated[i - starting_index].split("<end>")[0], 'id':str(i)} for i in range(starting_index, ending_index)]
    predictions.extend(list_to_add)
  # Handle the final, possibly smaller, batch. It is skipped when the dataset
  # size is an exact multiple of inference_batch_size, to avoid an empty batch.
  starting_index = inference_step*inference_batch_size
  if starting_index < len(dataset):
    if model_name == 'prajjwal1/bert-tiny':
      ttids = dataset["token_type_ids"][starting_index :]
    generated = generation_func(output_tokenizer=output_tokenizer,
                                input_ids=dataset["input_ids"][starting_index :],
                                token_type_ids=ttids,
                                attention_mask=dataset["attention_mask"][starting_index :])
    translated = translation_func(generated, output_tokenizer=output_tokenizer)
    predictions.extend([{'prediction_text': translated[i - starting_index].split("<end>")[0],
                         'id':str(i)} for i in range(starting_index, len(dataset))])
  return predictions
  
def save_prediction(prediction, filename):
  with open(filename, "wb") as f:
    pickle.dump(prediction, f)

In [18]:
squad_metric = load("squad")

def train_and_val(model_name,train_ds, val_ds, epochs, batch_size, decoder_units, max_sequence_length, output_tokenizer, pred_file_name, checkpoint_dir):
  INF_BS = 64 #Inference batch_size
  results = []
  results_beam = []
  for train_seed in [42,1337,2022]:
    set_seed(train_seed)

    encoder = Encoder(model_name=model_name,
                          decoder_units=decoder_units)
      
    # Build the decoder
    decoder = Decoder(vocab_size=len(output_tokenizer.word_index) + 1,
                          embedding_dim=100,
                          decoder_units=decoder_units,
                          batch_size=batch_size,
                          max_sequence_length=max_sequence_length)
    # Training
    trainer = MyTrainer(encoder=encoder,
                          decoder=decoder,
                          max_length=max_sequence_length)
    
    checkpoint = tf.train.Checkpoint(optimizer=trainer.optimizer,
                                  encoder=encoder,
                                  decoder=decoder)
    manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir + f"/{train_seed}", max_to_keep=1)

    train_loop(trainer, train_ds, epochs, batch_size, manager)

    prediction = predict_loop(trainer, val_ds, INF_BS, model_name,output_tokenizer)
    save_prediction(prediction, checkpoint_dir + pred_file_name + "_" + str(train_seed) + "_pred.pickle")

    prediction_beam = predict_loop(trainer, val_ds, INF_BS, model_name,output_tokenizer, beam_search=True)
    save_prediction(prediction_beam, checkpoint_dir + pred_file_name + "_" + str(train_seed) + "_beampred.pickle")

    results.append(squad_metric.compute(predictions=prediction, references=val_ds['references']))
    results_beam.append(squad_metric.compute(predictions=prediction_beam, references=val_ds['references']))

    del(manager)
    del(checkpoint)
    del(trainer)
    del(encoder)
    del(decoder) 

  print("***VALIDATION RESULTS***")
  print(results)
  print(results_beam)
  print(f"greedy exact match:{sum([res['exact_match'] for res in results])/len(results)}" )
  print(f"greedy SQUAD-F1:{sum([res['f1'] for res in results])/len(results)}" )
  print(f"beam exact match:{sum([res['exact_match'] for res in results_beam])/len(results_beam)}" )
  print(f"beam SQUAD-F1:{sum([res['f1'] for res in results_beam])/len(results_beam)}" )

def test_model(model_name, test_ds, batch_size, decoder_units, max_sequence_length, output_tokenizer, pred_file_name, checkpoint_dir, pred_file_suffix = ["_testpred.pickle", "_testbeampred.pickle"]):
  INF_BS = 64 #Inference batch_size
  
  
  results = []
  results_beam = []
  for train_seed in [42,1337,2022]:
    encoder = Encoder(model_name=model_name,
                        decoder_units=decoder_units)
      
    # Build the decoder
    decoder = Decoder(vocab_size=len(output_tokenizer.word_index) + 1,
                          embedding_dim=100,
                          decoder_units=decoder_units,
                          batch_size=batch_size,
                          max_sequence_length=max_sequence_length)
  
    # Build the trainer (used here only for inference)
    trainer = MyTrainer(encoder=encoder,
                          decoder=decoder,
                          max_length=max_sequence_length)
    
    checkpoint = tf.train.Checkpoint(optimizer=trainer.optimizer,
                                  encoder=encoder,
                                  decoder=decoder)
    
  
    # required step in order to load correctly the decoder embedding matrix
    decoder.embedding.build(input_shape=None)
    checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir+f"/{train_seed}")).expect_partial()

    prediction = predict_loop(trainer, test_ds, INF_BS, model_name, output_tokenizer)
    save_prediction(prediction, checkpoint_dir + pred_file_name + "_" + str(train_seed) +  pred_file_suffix[0])

    prediction_beam = predict_loop(trainer, test_ds, INF_BS, model_name ,output_tokenizer, beam_search=True)
    save_prediction(prediction_beam, checkpoint_dir + pred_file_name + "_" + str(train_seed) + pred_file_suffix[1])

    results.append(squad_metric.compute(predictions=prediction, references=test_ds['references']))
    results_beam.append(squad_metric.compute(predictions=prediction_beam, references=test_ds['references']))

  print("***TEST RESULTS***")
  print(results)
  print(results_beam)
  print(f"greedy exact match:{sum([res['exact_match'] for res in results])/len(results)}" )
  print(f"greedy SQUAD-F1:{sum([res['f1'] for res in results])/len(results)}" )
  print(f"beam exact match:{sum([res['exact_match'] for res in results_beam])/len(results_beam)}" )
  print(f"beam SQUAD-F1:{sum([res['f1'] for res in results_beam])/len(results_beam)}" )

## [Task 4] Question generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.
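
As a rough illustration of this formulation, the sketch below flattens one dialogue into per-turn examples $(P, Q_i) \rightarrow A_i$. The helper `build_examples`, its dictionary keys, and the toy passage are hypothetical names chosen for illustration; this is not the preprocessing actually used to build `train_ds` and `val_ds` in this notebook.

```python
def build_examples(passage, questions, answers):
    """Pair every question Q_i with the passage P and its gold answer A_i.

    Hypothetical helper: the keys below are illustrative, not the notebook's schema.
    """
    if len(questions) != len(answers):
        raise ValueError("each question needs exactly one answer")
    examples = []
    for turn, (question, answer) in enumerate(zip(questions, answers)):
        examples.append({
            "turn": turn,          # dialogue turn index i
            "question": question,  # Q_i, part of the encoder input
            "passage": passage,    # P, part of the encoder input
            "answer": answer,      # A_i, the target sequence
        })
    return examples


# Toy dialogue, invented for illustration only.
dialogue_examples = build_examples(
    passage="Anna has a white cat named Cotton.",
    questions=["What color is the cat?", "What is its name?"],
    answers=["white", "Cotton"],
)
```

Note that the second question ("What is its name?") cannot be resolved from $Q_i$ alone, which is the motivation for the history-aware variant in Task 5.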

In [15]:
BATCH_SIZE = 14
EPOCHS = 3
MAX_SEQUENCE_LENGTH = 20
TINY_DEC_UNITS = 128
ROB_DEC_UNITS = 512

In [None]:
# BERT TINY NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny'


train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds")

train_and_val('prajjwal1/bert-tiny',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=TINY_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='tiny',
              checkpoint_dir=checkpoint_dir)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.embeddings.position_ids', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the 

Current mean 1.064605712890625


100%|██████████| 6129/6129 [11:14<00:00,  9.08it/s]
 67%|██████▋   | 2/3 [22:34<11:16, 676.95s/it]

Current mean 0.8742226958274841


100%|██████████| 6129/6129 [11:13<00:00,  9.10it/s]
100%|██████████| 3/3 [33:48<00:00, 676.17s/it]


Current mean 0.7598474025726318


100%|██████████| 335/335 [02:32<00:00,  2.20it/s]
100%|██████████| 335/335 [04:44<00:00,  1.18it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.904651054518366, 'f1': 14.289189542326932}]
[{'exact_match': 12.123469435262349, 'f1': 14.27187375444946}]
greedy exact match:11.904651054518366
greedy SQUAD-F1:14.289189542326932
beam exact match:12.123469435262349
beam SQUAD-F1:14.27187375444946



Current mean 1.0730599164962769


100%|██████████| 6129/6129 [11:11<00:00,  9.12it/s]
 67%|██████▋   | 2/3 [22:35<11:16, 676.64s/it]

Current mean 0.8743070363998413


100%|██████████| 6129/6129 [11:12<00:00,  9.11it/s]
100%|██████████| 3/3 [33:48<00:00, 676.17s/it]


Current mean 0.7575758695602417


100%|██████████| 335/335 [02:29<00:00,  2.24it/s]
100%|██████████| 335/335 [04:39<00:00,  1.20it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.904651054518366, 'f1': 14.289189542326932}, {'exact_match': 11.960519577261511, 'f1': 14.549653043162628}]
[{'exact_match': 12.123469435262349, 'f1': 14.27187375444946}, {'exact_match': 12.197960798919874, 'f1': 14.561944356953724}]
greedy exact match:11.932585315889938
greedy SQUAD-F1:14.41942129274478
beam exact match:12.160715117091112
beam SQUAD-F1:14.416909055701591



Current mean 1.065598726272583


100%|██████████| 6129/6129 [11:08<00:00,  9.17it/s]
 67%|██████▋   | 2/3 [22:26<11:12, 672.46s/it]

Current mean 0.8747037649154663


100%|██████████| 6129/6129 [11:08<00:00,  9.16it/s]
100%|██████████| 3/3 [33:35<00:00, 671.89s/it]


Current mean 0.7593579292297363


100%|██████████| 335/335 [02:27<00:00,  2.28it/s]
100%|██████████| 335/335 [04:38<00:00,  1.20it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.904651054518366, 'f1': 14.289189542326932}, {'exact_match': 11.960519577261511, 'f1': 14.549653043162628}, {'exact_match': 11.904651054518366, 'f1': 14.643605437027999}]
[{'exact_match': 12.123469435262349, 'f1': 14.27187375444946}, {'exact_match': 12.197960798919874, 'f1': 14.561944356953724}, {'exact_match': 12.095535173890777, 'f1': 14.485326311156506}]
greedy exact match:11.923273895432748
greedy SQUAD-F1:14.494149340839186
beam exact match:12.138988469357665
beam SQUAD-F1:14.439714807519897


In [16]:
# DISTILROBERTA NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'


train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds_rob")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_rob")

train_and_val('distilroberta-base',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=ROB_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='rob',
              checkpoint_dir=checkpoint_dir)



Downloading:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/331M [00:00<?, ?B/s]

100%|██████████| 6129/6129 [43:59<00:00,  2.32it/s]
 33%|███▎      | 1/3 [44:02<1:28:04, 2642.12s/it]

Current mean 1.0579156875610352


100%|██████████| 6129/6129 [43:22<00:00,  2.36it/s]
 67%|██████▋   | 2/3 [1:27:27<43:40, 2620.37s/it]

Current mean 0.8134294748306274


100%|██████████| 6129/6129 [43:19<00:00,  2.36it/s]
100%|██████████| 3/3 [2:10:48<00:00, 2616.30s/it]


Current mean 0.6510803699493408


100%|██████████| 335/335 [15:39<00:00,  2.80s/it]
TensorFlow Addons has compiled its custom ops against TensorFlow 2.11.0, and there are no compatibility guarantees between the two versions. 
This means that you might get segfaults when loading the custom op, or other kind of low-level errors.
 If you do, do not file an issue on Github. This is a known limitation.

It might help you to fallback to pure Python ops by setting environment variable `TF_ADDONS_PY_OPS=1` or using `tfa.options.disable_custom_kernel()` in your code. To do that, see https://github.com/tensorflow/addons#gpucpu-custom-ops 

You can also change the TensorFlow version installed on your system. You would need a TensorFlow version equal to or above 2.11.0 and strictly below 2.12.0.
 Note that nightly versions of TensorFlow, as well as non-pip TensorFlow like `conda install tensorflow` or compiled from source are not supported.

The last solution is to find the TensorFlow Addons version that has custom ops compatible 

***VALIDATION RESULTS***
[{'exact_match': 13.492248242469389, 'f1': 16.418112559477272}]
[{'exact_match': 13.706410912984776, 'f1': 16.497910645677326}]
greedy exact match:13.492248242469389
greedy SQUAD-F1:16.418112559477272
beam exact match:13.706410912984776
beam SQUAD-F1:16.497910645677326


In [19]:
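# Training with DistilRoBERTa was interrupted by Colab usage limits, so here we
# reload the saved checkpoints and print the validation results
# (test_model labels them "TEST RESULTS", but test_ds is the validation set).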
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_rob")

test_model('distilroberta-base',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=ROB_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='rob',
            checkpoint_dir=checkpoint_dir,
            pred_file_suffix=["_pred.pickle", "_beampred.pickle"])

100%|██████████| 335/335 [15:43<00:00,  2.82s/it]
100%|██████████| 335/335 [17:56<00:00,  3.21s/it]
100%|██████████| 335/335 [15:41<00:00,  2.81s/it]
100%|██████████| 335/335 [18:00<00:00,  3.23s/it]
100%|██████████| 335/335 [15:40<00:00,  2.81s/it]
100%|██████████| 335/335 [17:58<00:00,  3.22s/it]


***TEST RESULTS***
[{'exact_match': 13.082545742352996, 'f1': 16.315722027476525}, {'exact_match': 13.040644350295638, 'f1': 15.798458346275067}, {'exact_match': 13.492248242469389, 'f1': 16.418112559477272}]
[{'exact_match': 13.385166907211696, 'f1': 16.647637451077436}, {'exact_match': 13.184971367382094, 'f1': 15.577594002019113}, {'exact_match': 13.706410912984776, 'f1': 16.497910645677326}]
greedy exact match:13.205146111706007
greedy SQUAD-F1:16.177430977742954
beam exact match:13.425516395859523
beam SQUAD-F1:16.24104736625796


## [Task 5] Question generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.
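
One possible way to materialise $H$ is sketched below: for turn $i$, the previous questions and answers are concatenated in order into a single string. The helpers `build_history` and `build_examples_with_history`, the separator, and the use of gold answers in the history are all assumptions made for illustration; the `*_hist` datasets loaded in this notebook may have been built differently.

```python
def build_history(questions, answers, turn, sep=" "):
    """Return H = {Q_0, A_0, ..., Q_{turn-1}, A_{turn-1}} as one string.

    Hypothetical helper: gold answers are used for the history (an assumption).
    """
    pieces = []
    for question, answer in zip(questions[:turn], answers[:turn]):
        pieces.extend([question, answer])
    return sep.join(pieces)


def build_examples_with_history(passage, questions, answers):
    """Attach the dialogue history to every (P, Q_i, A_i) example."""
    examples = []
    for turn, (question, answer) in enumerate(zip(questions, answers)):
        examples.append({
            "history": build_history(questions, answers, turn),
            "question": question,
            "passage": passage,
            "answer": answer,
        })
    return examples


# Toy dialogue, invented for illustration only.
history_examples = build_examples_with_history(
    passage="Anna has a white cat named Cotton.",
    questions=["What color is the cat?", "What is its name?"],
    answers=["white", "Cotton"],
)
```

The first turn has an empty history, while later turns carry all previous question/answer pairs. Since the encoder input has a fixed maximum length, long histories would likely need to be truncated, for example by keeping only the most recent turns.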

In [None]:
# BERT TINY WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny/hist'


train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds_hist")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_hist")

train_and_val('prajjwal1/bert-tiny',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=TINY_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='tiny_hist',
              checkpoint_dir=checkpoint_dir)

100%|██████████| 6129/6129 [11:40<00:00,  8.75it/s]
 33%|███▎      | 1/3 [11:40<23:21, 700.90s/it]

Current mean 1.0670831203460693


100%|██████████| 6129/6129 [11:24<00:00,  8.95it/s]
 67%|██████▋   | 2/3 [23:06<11:31, 691.80s/it]

Current mean 0.8746263980865479


100%|██████████| 6129/6129 [11:24<00:00,  8.95it/s]
100%|██████████| 3/3 [34:31<00:00, 690.45s/it]


Current mean 0.759445309638977


100%|██████████| 335/335 [02:36<00:00,  2.14it/s]
100%|██████████| 335/335 [04:51<00:00,  1.15it/s]
100%|██████████| 6129/6129 [11:28<00:00,  8.90it/s]
 33%|███▎      | 1/3 [11:29<22:58, 689.14s/it]

Current mean 1.0788750648498535


100%|██████████| 6129/6129 [11:23<00:00,  8.97it/s]
 67%|██████▋   | 2/3 [22:53<11:26, 686.14s/it]

Current mean 0.8751189112663269


100%|██████████| 6129/6129 [11:23<00:00,  8.97it/s]
100%|██████████| 3/3 [34:17<00:00, 685.77s/it]


Current mean 0.7595735192298889


100%|██████████| 335/335 [02:33<00:00,  2.19it/s]
100%|██████████| 335/335 [04:47<00:00,  1.16it/s]
100%|██████████| 6129/6129 [11:25<00:00,  8.94it/s]
 33%|███▎      | 1/3 [11:25<22:51, 685.87s/it]

Current mean 1.0686047077178955


100%|██████████| 6129/6129 [11:22<00:00,  8.99it/s]
 67%|██████▋   | 2/3 [22:48<11:23, 683.97s/it]

Current mean 0.8764885663986206


100%|██████████| 6129/6129 [11:21<00:00,  8.99it/s]
100%|██████████| 3/3 [34:10<00:00, 683.57s/it]


Current mean 0.7604389190673828


100%|██████████| 335/335 [02:32<00:00,  2.20it/s]
100%|██████████| 335/335 [04:49<00:00,  1.16it/s]


***VALIDATION RESULTS***
[{'exact_match': 11.913962474975557, 'f1': 14.435647357982468}, {'exact_match': 12.00242096931887, 'f1': 14.831662593488241}, {'exact_match': 11.81153684994646, 'f1': 14.194647495643768}]
[{'exact_match': 12.216583639834257, 'f1': 14.615724631732936}, {'exact_match': 12.309697844406164, 'f1': 14.890212870467654}, {'exact_match': 12.104846594347968, 'f1': 14.181929087624674}]
greedy exact match:11.909306764746963
greedy SQUAD-F1:14.487319149038157
beam exact match:12.21037602619613
beam SQUAD-F1:14.562622196608421


In [None]:
# DISTILROBERTA WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob/hist'

train_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/train_ds_rob_hist")
val_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/val_ds_rob_hist")

train_and_val('distilroberta-base',
              train_ds,
              val_ds,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              decoder_units=ROB_DEC_UNITS,
              max_sequence_length=MAX_SEQUENCE_LENGTH,
              output_tokenizer=output_tokenizer,
              pred_file_name='rob_hist',
              checkpoint_dir=checkpoint_dir)

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metrics: SQUAD F1-score.$^2$
* Fine-tune each transformer-based model for **3 epochs**.
* Report evaluation SQUAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use the ```allennlp``` Python package for a quick implementation of SQUAD F1-score: ```from allennlp_models.rc.tools import squad```. 
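
To make the metric concrete, here is a minimal pure-Python sketch of a token-level SQUAD-style F1 and of the averaging over seed runs. It is a simplified approximation written for illustration: it handles a single reference per question and returns values in $[0, 1]$, whereas the `squad` metric used in this notebook reports values on a 0-100 scale. The function names (`normalize_answer`, `squad_f1`, `mean_over_seeds`) are assumptions, not part of any library API.

```python
import collections
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction, reference):
    """Token-overlap F1 between one prediction and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_over_seeds(scores_by_seed):
    """Average one score per seed, e.g. {42: ..., 1337: ..., 2022: ...}."""
    return sum(scores_by_seed.values()) / len(scores_by_seed)
```

For example, `squad_f1("white dog", "white cat")` has one shared token out of two on each side, so precision and recall are both 0.5 and the F1 is 0.5.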

In [None]:
# BERT TINY NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds")

test_model('prajjwal1/bert-tiny',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=TINY_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='tiny',
            checkpoint_dir=checkpoint_dir)

100%|██████████| 123/123 [00:40<00:00,  3.04it/s]
100%|██████████| 123/123 [01:29<00:00,  1.37it/s]
100%|██████████| 123/123 [00:39<00:00,  3.09it/s]
100%|██████████| 123/123 [01:27<00:00,  1.41it/s]
100%|██████████| 123/123 [00:39<00:00,  3.10it/s]
100%|██████████| 123/123 [01:29<00:00,  1.37it/s]


***TEST RESULTS***
[{'exact_match': 12.086385450871433, 'f1': 14.29309767458751}, {'exact_match': 12.136903258398586, 'f1': 14.5009777728258}, {'exact_match': 12.225309421571104, 'f1': 14.74764799280808}]
[{'exact_match': 12.250568325334681, 'f1': 14.126559362075886}, {'exact_match': 12.32634503662541, 'f1': 14.498713003437578}, {'exact_match': 12.32634503662541, 'f1': 14.599285738107293}]
greedy exact match:12.149532710280374
greedy SQUAD-F1:14.51390781340713
beam exact match:12.301086132861833
beam SQUAD-F1:14.408186034540252


In [None]:
# DISTILROBERTA NO HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob")

test_model('distilroberta-base',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=ROB_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='rob',
            checkpoint_dir=checkpoint_dir)

100%|██████████| 123/123 [05:35<00:00,  2.73s/it]
 80%|███████▉  | 98/123 [05:07<01:18,  3.14s/it]

In [None]:
# BERT TINY WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny/hist'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_hist")

test_model('prajjwal1/bert-tiny',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=TINY_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='tiny_hist',
            checkpoint_dir=checkpoint_dir)

100%|██████████| 123/123 [00:41<00:00,  2.98it/s]
100%|██████████| 123/123 [01:31<00:00,  1.35it/s]
100%|██████████| 123/123 [00:39<00:00,  3.10it/s]
100%|██████████| 123/123 [01:29<00:00,  1.38it/s]
100%|██████████| 123/123 [00:40<00:00,  3.01it/s]
100%|██████████| 123/123 [01:30<00:00,  1.36it/s]


***TEST RESULTS***
[{'exact_match': 12.124273806516797, 'f1': 14.392046008515898}, {'exact_match': 12.086385450871433, 'f1': 14.668835115967326}, {'exact_match': 12.073755998989643, 'f1': 14.23343061197177}]
[{'exact_match': 12.47789845920687, 'f1': 14.779871710239322}, {'exact_match': 12.452639555443294, 'f1': 14.840252331149317}, {'exact_match': 12.250568325334681, 'f1': 14.217070933285694}]
greedy exact match:12.09480508545929
greedy SQUAD-F1:14.431437245484998
beam exact match:12.393702113328281
beam SQUAD-F1:14.612398324891444


In [None]:
# DISTILROBERTA WITH HISTORY
checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob/hist'

test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob_hist")

test_model('distilroberta-base',
            test_ds,
            batch_size=BATCH_SIZE,
            decoder_units=ROB_DEC_UNITS,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            output_tokenizer=output_tokenizer,
            pred_file_name='rob_hist',
            checkpoint_dir=checkpoint_dir)

In [None]:
#predictions = [{'prediction_text': translated[i - 128].split("<end>")[0], 'id':str(i)} for i in range(128, 256)]
print(type(prediction[0]))
for i in prediction[0:10]:
  print(i["prediction_text"])
  print(YQA_val[int(i["id"])])
  print(XQA_val[int(i["id"])])

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQUAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 
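
The grouping step can be sketched independently of the model as follows. The helper `worst_errors_by_source` and the record layout (`id`, `source`, `f1`) are hypothetical, and the records are toy data; the cell below implements the actual analysis on the saved prediction files.

```python
from collections import defaultdict


def worst_errors_by_source(records, k=5):
    """Group scored examples by source and keep the k lowest-F1 ones per group.

    records: iterable of dicts with 'id', 'source' and 'f1' keys (assumed layout).
    """
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)
    return {
        source: sorted(items, key=lambda r: r["f1"])[:k]
        for source, items in by_source.items()
    }


# Toy scores, invented for illustration only.
records = [
    {"id": "0", "source": "cnn", "f1": 0.0},
    {"id": "1", "source": "cnn", "f1": 100.0},
    {"id": "2", "source": "wikipedia", "f1": 50.0},
    {"id": "3", "source": "cnn", "f1": 20.0},
]
worst = worst_errors_by_source(records, k=2)
```

Because many predictions share an F1 of 0, the "worst 5" are not unique; ties are kept in input order here since Python's sort is stable.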

In [None]:
def find_worst_errors(prediction_prefix, prediction_suffix, ref_dataset):
  squad = load("squad")
  predictions = []
  for seed in [42,1337,2022]:
    with open(prediction_prefix + str(seed) + prediction_suffix, "rb") as f:
      new_list = pickle.load(f)
      new_list.sort(key=lambda x: int(x["id"]))
      predictions.append(new_list)
      
  categories = np.unique(ref_dataset["source"])

  source_dict = {cat:[] for cat in categories}
  refd = ref_dataset["references"]
  for pred in tqdm(range(len(predictions[0]))):
    # ATTENTION: the following instruction relies on the assumption that
    # the id of each example equals its row index, as follows
    # from our dataset construction
    ref = refd[int(predictions[0][pred]["id"])]
    if ref["id"] != predictions[0][pred]["id"] or ref["id"] != predictions[1][pred]["id"] or ref["id"] != predictions[2][pred]["id"]:
      print("error with ids: example" + ref["id"])
    
    f1 = squad.compute(predictions=[predictions[0][pred]],
                       references=[ref])["f1"]

    f1 += squad.compute(predictions=[predictions[1][pred]],
                                     references = [ref])["f1"]

    f1 += squad.compute(predictions=[predictions[2][pred]],
                        references = [ref])["f1"]
    
    f1 = f1/3
    source_dict[ref_dataset["source"][int(predictions[0][pred]["id"])]].append((predictions[0][pred]["id"] , f1))


  return source_dict

def print_orderered_predictions(source_dict, prediction_prefix, prediction_suffix, ref_dataset, question_dataset, kind="worst", qty=5, skip_yn=False):
  """Print the `qty` worst (or best) examples of each source with the answers of the three seeds."""
  predictions = []
  refd = ref_dataset["references"]
  for seed in [42, 1337, 2022]:
    with open(prediction_prefix + str(seed) + prediction_suffix, "rb") as f:
      new_list = pickle.load(f)
      new_list.sort(key=lambda x: int(x["id"]))
      predictions.append(new_list)

  for key in source_dict.keys():
    # ascending F1 for the worst examples, descending F1 for the best ones
    source_dict[key].sort(key=lambda x: x[1], reverse=(kind == "best"))
    print("-------------" + key + "-------------")
    i = 0  # examples printed so far
    j = 0  # position in the sorted list
    while i < qty and j < len(source_dict[key]):
      example_id = int(source_dict[key][j][0])
      # optionally skip the examples whose first reference answer is yes/no
      if skip_yn and refd[example_id]["answers"]["text"][0] in ["yes", "no"]:
        j = j + 1
      else:
        print("question + passage: " + str(question_dataset[example_id]))
        print("true answer: " + str(refd[example_id]["answers"]["text"]))
        print("answer with seed 42: " + predictions[0][example_id]["prediction_text"])
        print("answer with seed 1337: " + predictions[1][example_id]["prediction_text"])
        print("answer with seed 2022: " + predictions[2][example_id]["prediction_text"])
        print("f1: " + str(source_dict[key][j][1]))
        i = i + 1
        j = j + 1



In [None]:
# BERT TINY NO HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds")
source_dict = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/tinytiny_", "_testpred.pickle", test_ds)

100%|██████████| 7918/7918 [03:07<00:00, 42.24it/s]


In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/tinytiny_", "_testpred.pickle", test_ds, XQA_test)

In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/tinytiny_", "_testpred.pickle", test_ds, XQA_test, kind="best", skip_yn=True)

-------------cnn-------------
question + passage: ['how many shakeups have there been since Lugo took office?', "Asuncion, Paraguay (CNN) -- Paraguay installed new top military commanders, but President Fernando Lugo, who had ordered the change in leadership, was not present for the ceremony. \n\nLugo's absence Thursday morning attracted attention given his administration's silence on the sudden change in the leadership of the country's army, air force and navy. \n\nThe president's decision to replace the top brass came a day after he publicly dismissed rumors about a military coup. \n\nBrig. Gen. Bartolome Ramon Pineda Ortiz was named as the new army commander. Brig. Gen. Hugo Gilberto Aranda Chamorro and Rear Adm. Egberto Emerito Orie Benegas took over the top posts at the air force and navy, respectively. \n\nThe announcement came from the armed forces, not the president's office. \n\nCibar Benitez, commander of the armed forces, was the only top leader to retain his post. \n\nOther

In [None]:
# DISTILROBERTA NO HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_rob")
source_dict = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/robrob_", "_testpred.pickle", test_ds)

In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/robrob_", "_testpred.pickle", test_ds, XQA_test)

In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/robrob_", "_testpred.pickle", test_ds, XQA_test, kind="best")

In [None]:
# TINYBERT WITH HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_hist")
source_dict = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/tiny/histtiny_hist", "_testpred.pickle", test_ds)

In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/tiny/histtiny_hist", "_testpred.pickle", test_ds, XQA_test)

In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/tiny/histtiny_hist", "_testpred.pickle", test_ds, XQA_test, kind="best")

In [None]:
# DISTILROBERTA WITH HISTORY
test_ds = load_from_disk("/content/gdrive/MyDrive/ckpt/test_ds_hist")
source_dict = find_worst_errors("/content/gdrive/MyDrive/ckpt/dom/rob/histtiny_hist", "_testpred.pickle", test_ds)

In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/rob/histtiny_hist", "_testpred.pickle", test_ds, XQA_test)

In [None]:
print_orderered_predictions(source_dict, "/content/gdrive/MyDrive/ckpt/dom/rob/histtiny_hist", "_testpred.pickle", test_ds, XQA_test, kind="best")

# Assignment Evaluation

Assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-pasted terminal output $\rightarrow$ **Provide a clean schema** of what you want to show.

# Comments and Organization

Remember to comment your code properly (it is not necessary to comment every single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean, modular code that is easy to read and debug (notebooks can be quite tricky from time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include the data tokenization and conversion steps?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
# Alternatively, set the special tokens and the maximum length explicitly
inputs = tokenizer(text, add_special_tokens=True, truncation=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugging Face's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting an **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length
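
As an illustration of the second workaround, the sketch below picks the padding length from a quantile of the tokenized sequence lengths instead of their maximum, so that a few very long examples do not inflate every batch. The function name `quantile_max_length`, the default quantile `q=0.95`, the cap of 512 tokens, and the toy lengths are assumptions made for this example; sequences longer than the chosen length would then have to be truncated.

```python
import math


def quantile_max_length(token_lengths, q=0.95, hard_cap=512):
    """Return the length that covers a fraction q of the sequences, capped at hard_cap."""
    ordered = sorted(token_lengths)
    # index of the smallest length that is >= a fraction q of all the lengths
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return min(ordered[idx], hard_cap)


# Made-up token counts: one outlier dominates the maximum
lengths = [12, 15, 20, 22, 25, 30, 31, 40, 45, 400]
max_len = quantile_max_length(lengths, q=0.9)
print("max length:", max(lengths), "quantile-based length:", max_len)
```

The resulting value can be passed as `max_length` to the tokenizer together with `truncation=True`; the quantile is a trade-off between memory usage and the number of truncated examples.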

---
---

# Contact

For any doubts, questions, issues, or requests for help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?