
# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

### Task

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable.

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.
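
The two variants can differ only in how the textual input is assembled before tokenization. The sketch below is one possible way to do this, not a prescribed one: the helper name `build_input`, the separator string and the `max_turns` cut-off are all assumptions introduced for illustration.

```python
from typing import List, Optional, Tuple


def build_input(question: str,
                passage: str,
                history: Optional[List[Tuple[str, str]]] = None,
                max_turns: int = 2,
                sep: str = " </s> ") -> Tuple[str, str]:
    """Return the (query, passage) text pair to hand to a tokenizer.

    With no history this corresponds to f(Q, P). With history, the most
    recent (question, answer) turns are prepended to Q, giving f(Q, P, H).
    """
    # keep only the last `max_turns` turns (hypothetical truncation policy)
    recent = history[-max_turns:] if (history and max_turns > 0) else []
    if not recent:
        return question, passage
    flat_history = sep.join(f"{q} {a}" for q, a in recent)
    return flat_history + sep + question, passage


# Usage: the second turn of a dialogue, with the first turn as history
query, passage = build_input("Where did she live?",
                             "Once upon a time, [...]",
                             history=[("What color was Cotton?", "white")])
print(query)
```

Keeping the passage as a separate element lets the tokenizer truncate it independently of the question and its history.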

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair comes with a rationale $R$: a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]" % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,     % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```
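
To make the layout concrete, the following sketch walks one dialogue and pairs each question with its answer and rationale. It uses only the keys shown in the snippet above; the helper name `iter_turns` and the choice to align turns by `turn_id` (rather than by list position) are assumptions.

```python
from typing import Dict, Iterator, Tuple


def iter_turns(dialogue: Dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (question, answer, rationale) for every turn of one dialogue."""
    # index the answers by turn_id so that questions and answers are aligned
    answers = {a["turn_id"]: a for a in dialogue["answers"]}
    for question in dialogue["questions"]:
        answer = answers[question["turn_id"]]
        yield question["input_text"], answer["input_text"], answer["span_text"]


# Toy dialogue built from the first turn of the snippet above
toy_dialogue = {
    "story": "Once upon a time, in a barn near a farm house, "
             "there lived a little white kitten named Cotton.",
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1}],
    "answers": [{"span_start": 59, "span_end": 93,
                 "span_text": "a little white kitten named Cotton",
                 "input_text": "white", "turn_id": 1}],
}

for q, a, r in iter_turns(toy_dialogue):
    print(q, "->", a, "| rationale:", r)
```

In this toy example the rationale can also be recovered by slicing the story with `span_start` and `span_end`.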

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one ground-truth answer $A$ and text rationale $R$.

Only 1.3% of the CoQA questions are unanswerable. For simplicity, we **ignore** those QA pairs.

## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.
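
A minimal sketch of such a filter is shown here. It assumes that unanswerable turns are marked by `span_end == -1`, which is the check used by `extract_data` in this notebook; the helper names `is_answerable` and `drop_unanswerable` are hypothetical.

```python
from typing import Dict


def is_answerable(answer: Dict) -> bool:
    # Assumption: an unanswerable turn has no rationale span, i.e. span_end == -1
    return answer["span_end"] != -1


def drop_unanswerable(dialogue: Dict) -> Dict:
    """Return a copy of the dialogue without its unanswerable turns."""
    keep = {a["turn_id"] for a in dialogue["answers"] if is_answerable(a)}
    return {
        **dialogue,
        "questions": [q for q in dialogue["questions"] if q["turn_id"] in keep],
        "answers": [a for a in dialogue["answers"] if a["turn_id"] in keep],
    }
```

The original dictionary is left untouched, so the unfiltered data remains available for inspection.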

In [1]:
!pip install transformers
!pip install tensorflow-addons
!pip install datasets
!pip install evaluate


## Dataset Download


In [2]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [23]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:08, 5.88MB/s]                            


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:03, 3.02MB/s]                            

Download completed!





#### Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the task's inputs and outputs!

In [3]:
import json
import random
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import tensorflow as tf
from datasets import Dataset, load_from_disk


def set_seed(SEED):
  random.seed(SEED) # if you're using random
  np.random.seed(SEED) # if you're using numpy
  torch.manual_seed(SEED) # torch.cuda.manual_seed_all(SEED) is not required
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  tf.random.set_seed(SEED) # setting the seed for tensorflow too
  os.environ['TF_DETERMINISTIC_OPS'] = '1'

def extract_data(split_dataset):
  """
  Extract the [question, passage] pairs and their answers from the list of dialogue dictionaries of the CoQA dataset.
  :params:
    split_dataset: list of dictionaries from which to extract the question-passage pairs and the corresponding answers
  :returns:
    XQA: list of [question, passage] pairs, YQA: list of answer strings
  """  
  XQA = [] # list that will contain the pairs [Q, P]
  YQA = [] # list that will contain the answers
  for d in split_dataset: # scan each document
    for i in range(len(d["questions"])): # scan each question
      if d["answers"][i]["span_end"]!=-1: # discard unanswerable questions
        single_example = [] # prepare the single example...
        single_example.append(d["questions"][i]["input_text"]) #... with the question ...
        single_example.append(d["story"]) # ...and the passage
        XQA.append(single_example) # and append it
        YQA.append(d["answers"][i]["input_text"]) # add the answer
  return XQA, YQA

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data in train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
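
The requirements above can be sketched with the standard library alone. This is an illustration under assumptions: the notebook itself relies on scikit-learn's `train_test_split`, which yields a different permutation for the same seed, and the helper name `split_dialogues` is hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_dialogues(dialogues: List[Dict],
                    train_fraction: float = 0.8,
                    seed: int = 42) -> Tuple[List[Dict], List[Dict]]:
    """Shuffle whole dialogues with a fixed seed, then cut the list in two.

    Splitting the list of dialogues (not the list of QA pairs) guarantees
    that every dialogue ends up in exactly one split.
    """
    rng = random.Random(seed)      # local generator: global state is untouched
    shuffled = list(dialogues)     # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Usage on toy dialogues identified by their "id" field
toy = [{"id": str(i)} for i in range(10)]
train_part, val_part = split_dialogues(toy)
print(len(train_part), len(val_part))
```

QA pairs should be extracted only after this step, once per split.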

#### Reproducibility Memo

Refer back to tutorial 2 to see how to fix a specific random seed for reproducibility!

In [9]:
seed = 42 
set_seed(seed)

In [24]:
with open('coqa/train.json') as f:
  # loading the training json
  train_json = json.load(f)

with open('coqa/test.json') as f:
  # loading the test json
  test_json = json.load(f)

# splitting training data
train_data, val_data = train_test_split(train_json["data"],
                                        train_size=0.8,
                                        shuffle=True,
                                        random_state=seed)
# extracting X as a list of pairs [Question, Passage] and Y as a list of strings (Answers)
XQA_train, YQA_train = extract_data(train_data)
XQA_val, YQA_val = extract_data(val_data)
XQA_test, YQA_test = extract_data(test_json["data"])
del(train_json)
del(test_json)

print("First training example:")
print(XQA_train[0:17])
print(YQA_train[0:17])
print("First validation example:")
print(XQA_val[0])
print(YQA_val[0])
print("First test example:")
print(XQA_test[0])
print(YQA_test[0])

First training example:
[['Where is this taking place?', 'TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis\'s upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation\'s first national elections since the country\'s independence in 1956. \n\n"It\'s a wonderful day. It\'s the first time we can choose our own representatives," said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could "get used to freedom and democracy." \n\nTunisia\'s election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political par

In [25]:
## broken example fix:
print(XQA_train[61])
print(YQA_train[61])
YQA_train[61] = 'October'
print(YQA_train[61])

['what month?', 'Microsoft Word is a word processor developed by Microsoft. It was first released on October 25, 1983 under the name "Multi-Tool Word" for Xenix systems. Subsequent versions were later written for several other platforms including IBM PCs running DOS (1983), Apple Macintosh running Classic Mac OS (1985), AT&T Unix PC (1985), Atari ST (1988), OS/2 (1989), Microsoft Windows (1989), SCO Unix (1994), and macOS (2001). Commercial versions of Word are licensed as a standalone product or as a component of Microsoft Office, Windows RT or the discontinued Microsoft Works suite. Microsoft Word Viewer and Office Online are freeware editions of Word with limited features. \n\nIn 1981, Microsoft hired Charles Simonyi, the primary developer of Bravo, the first GUI word processor, which was developed at Xerox PARC. Simonyi started work on a word processor called "Multi-Tool Word" and soon hired Richard Brodie, a former Xerox intern, who became the primary software engineer. \n\nMicros

## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [8]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [48]:
# THIS IS A SEPARATOR ##########################################################

"""
This was tested with:
tensorflow==2.6
tensorflow-gpu==2.6
tensorflow-addons==0.16.1
transformers==4.18.0
Keras==2.6.0

Note 1: Simple adaptation of tf_seq2seq_lstm.py script
Note 2: make sure Keras and Tensorflow versions match!

"""

import tensorflow as tf
import tensorflow_addons as tfa
from tqdm import tqdm
from transformers import TFAutoModel, AutoTokenizer
import time

# check if training can be performed on GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)


class MyTrainer(object):
    """
    Simple wrapper class

    train_op -> uses tf.GradientTape to compute the loss
    batch_fit -> receives a batch and performs forward-backward passes (gradient included) 
    """

    def __init__(self, encoder, decoder, max_length):
        self.encoder = encoder
        self.decoder = decoder
        self.max_length = max_length
        self.ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, 
                                                                reduction='none') # from_logits=True means that the loss expects raw,
                                                                                  # unnormalized scores and applies the softmax
                                                                                  # internally, thus the model must not end with a
                                                                                  # softmax activation layer (applying it twice would
                                                                                  # squash the values)
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=1e-03)            # here it is possible to tweak the learning rate

    @tf.function
    def compute_loss(self, logits, target):
        loss = self.ce(y_true=target, y_pred=logits)
        mask = tf.logical_not(tf.math.equal(target, 0))
        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask # pointwise product
        return tf.reduce_mean(loss)

    @tf.function
    def train_op(self, inputs):
        with tf.GradientTape() as tape:
            # NOTABENE: it is necessary to add token_type_ids to see how it performs
            if self.encoder.use_token_type_ids:
              encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask'],
                                                                  'token_type_ids': inputs['token_type_ids']})
            else:
              encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask']})
            decoder_input = inputs['y_padded'][:, :-1]  # ignore <end>
            real_target = inputs['y_padded'][:, 1:]  # ignore <start>

            # encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs[0][0],
            #                                                    'attention_mask': inputs[0][1]})
            # decoder_input = inputs[1][:, :-1]
            # real_target = inputs[1][:, 1:]

            self.decoder.attention.setup_memory(encoder_output) # setup in order to perform attention queries over the 
                                                                # encoder outputs

            # decoder initialization, check build_initial_state for additional insights
            decoder_initial_state = self.decoder.build_initial_state(self.decoder.batch_size, [encoder_h, encoder_s])
            # the input is then passed to the initialized decoder and we obtain predictions
            # in rnn_output format because the model is BERT-embedding-sequence-sequence, so the
            # last layer is still a sequence of cells (an RNN)
            predicted = self.decoder({'input_ids': decoder_input,
                                      'initial_state': decoder_initial_state}).rnn_output
            # we compute the losses over the computed predictions
            loss = self.compute_loss(logits=predicted, target=real_target)
        # gradients of the loss computed for this minibatch considering trainable
        # parameters of encoder and decoder
        grads = tape.gradient(loss, self.encoder.trainable_variables + self.decoder.trainable_variables)
        return loss, grads

    @tf.function
    def batch_fit(self, inputs):
        loss, grads = self.train_op(inputs=inputs)
        # applies gradients to the trainable variables using Adam
        self.optimizer.apply_gradients(zip(grads, self.encoder.trainable_variables + self.decoder.trainable_variables))
        return loss

    # @tf.function
    def generate(self, output_tokenizer, input_ids,token_type_ids, attention_mask=None):
        batch_size = input_ids.shape[0] # input_ids is the minibatch
        # run the encoder once, passing token_type_ids only to the models that use them
        if self.encoder.use_token_type_ids:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask,
                                                                  'token_type_ids': token_type_ids})
        else:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask})
        start_tokens = tf.fill([batch_size], output_tokenizer.word_index['<start>'])
        end_token = output_tokenizer.word_index['<end>']

        # samples the possible answer with greedy technique, we could possibly
        # use a variant here such as beam search at inference time 
        # We could not do this at training time, since the Sampler used at training
        # is not designed to project the token in an embedding space before computing
        # the next one. The aforementioned embedding space
        # is changing at each backpropagation step anyways, thus we stick with
        # the computation of the argmax of the logits using TrainingSampler.
        # NOTABENE: we can still change this sampler, find a way to penalize repetitions
        # and perform the beam search
        greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler() 
        # training and inference use different decoders (TrainingSampler vs.
        # greedy sampling), thus an inference decoder that shares the trained
        # cell and output layer is built at every call
        decoder_instance = tfa.seq2seq.BasicDecoder(cell=self.decoder.wrapped_decoder_cell,
                                                    sampler=greedy_sampler,
                                                    output_layer=self.decoder.generation_dense,
                                                    maximum_iterations=self.max_length)
        self.decoder.attention.setup_memory(encoder_output)

        # decoder_initial_state is still an output of the encoder, we pass it to
        # the decoder_instance in order to get the outputs
        decoder_initial_state = self.decoder.build_initial_state(batch_size, [encoder_h, encoder_s])
        decoder_embedding_matrix = self.decoder.embedding.variables[0]
        outputs, _, _ = decoder_instance(decoder_embedding_matrix,
                                         start_tokens=start_tokens,
                                         end_token=end_token,
                                         initial_state=decoder_initial_state)
        return outputs

    def translate(self, generated, output_tokenizer):
        return output_tokenizer.sequences_to_texts(generated.sample_id.numpy())


class Encoder(tf.keras.Model):

    def __init__(self, model_name, decoder_units):
        super(Encoder, self).__init__()
        self.model = TFAutoModel.from_pretrained(model_name, from_pt=True, trainable=False)
        self.model.trainable=False
        self.reducer = tf.keras.layers.Dense(decoder_units)
        self.reducer2 = tf.keras.layers.Dense(decoder_units)
        self.avg_pool = tf.keras.layers.AveragePooling1D(pool_size = 512)
        self.use_token_type_ids = model_name=='prajjwal1/bert-tiny'

    def call(self, inputs, training=False, **kwargs):
        model_output = self.model(inputs)
        
        # all_outputs has shape (batch_size, 512, hidden_size), where hidden_size is 128 for bert-tiny
        all_outputs = model_output[0] # output of the last layer of the model
        #pooled_output = model_output[1] # last layer but processed by a linear 
                                        # layer and a tanh
        
        # cls coding
        hidden_pooled = all_outputs[:, 0, :]
        cell_state = self.avg_pool(all_outputs)
        cell_state = tf.reshape(cell_state, [all_outputs.shape[0], all_outputs.shape[2]])

        # NOTABENE: it could be possible to add something to improve the encoding
        
        # hidden_pooled has shape (batch_size, hidden_size)
        hidden_state = self.reducer(hidden_pooled)
        cell_state = self.reducer2(cell_state)
        #return all_outputs, self.reducer(model_output[1]), self.reducer(model_output[1])
        return all_outputs, hidden_state, cell_state


class Decoder(tf.keras.Model):

    def __init__(self, vocab_size, max_sequence_length, embedding_dim, decoder_units, batch_size):
        super(Decoder, self).__init__()

        self.max_sequence_length = max_sequence_length
        self.batch_size = batch_size

        self.decoder_units = decoder_units
        # NOTABENE: it is possible to change the embedding dimension and the number of LSTM cells
        self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                                   output_dim=embedding_dim)
        # NOTABENE: It could be possible to swap LSTMCell with GRUCell
        self.decoder_lstm_cell = tf.keras.layers.LSTMCell(self.decoder_units)
        # NOTABENE: Just one type of attention, it could be changed to seek for different
        # results
        self.attention = tfa.seq2seq.BahdanauAttention(units=self.decoder_units,
                                                       memory=None,
                                                       memory_sequence_length=self.batch_size * [max_sequence_length])

        self.wrapped_decoder_cell = tfa.seq2seq.AttentionWrapper(self.decoder_lstm_cell,
                                                                 self.attention,
                                                                 attention_layer_size=self.decoder_units) # adds the attention mechanism after a single
                                                                                # LSTM cell, because we pass a word at the time
        # dense layer needed to generate the distribution values over 
        # the size of the vocabulary (probability for each word)
        self.generation_dense = tf.keras.layers.Dense(vocab_size)
        # Above we describe why this cannot be changed and why it resembles
        # the GreedyEmbeddingSampler
        self.sampler = tfa.seq2seq.sampler.TrainingSampler()
        self.decoder = tfa.seq2seq.BasicDecoder(self.wrapped_decoder_cell,
                                                sampler=self.sampler,
                                                output_layer=self.generation_dense)

    def build_initial_state(self, batch_size, encoder_state):
        # we start from the zero state of the attention wrapper, then we replace
        # the LSTM cell state with the [hidden, cell] pair produced by the encoder
        # (passed as encoder_state), so that decoding is conditioned on the input.
        # Note: the pre-trained transformer is frozen in Encoder (trainable=False),
        # so only the reducer layers and the decoder are updated during training.
        initial_state = self.wrapped_decoder_cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)
        initial_state = initial_state.clone(cell_state=encoder_state) 
        return initial_state

    def call(self, inputs, training=False, **kwargs):
        # as shown in calls, inputs is a dictionary with entries: 
        # "input_ids" : token ids of the (shifted) target answer
        # "initial_state" : _result_of_build_initial_state_
        input_ids = inputs['input_ids']
        input_emb = self.embedding(input_ids)
        decoder_output, _, _ = self.decoder(input_emb,
                                            initial_state=inputs['initial_state'],
                                            sequence_length=self.batch_size * [self.max_sequence_length - 1])
        return decoder_output

## MODEL NAME
model_name = 'distilroberta-base'
#model_name = 'prajjwal1/bert-tiny'
# THIS IS A SEPARATOR ##########################################################
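
`compute_loss` above zeroes the loss of the padded positions and then calls `tf.reduce_mean`, which still divides by the total number of positions, padding included. The pure-Python sketch below contrasts that normalization with an average over the real tokens only. It is an illustration under assumptions, not the notebook's implementation: the function names are hypothetical and padding is assumed to use id 0, as in the mask of `compute_loss`.

```python
import math
from typing import List, Tuple


def token_cross_entropy(logits: List[float], target: int) -> float:
    """Cross-entropy of one token from raw logits (softmax applied inside)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_means(logits_seq: List[List[float]],
                 targets: List[int],
                 pad_id: int = 0) -> Tuple[float, float]:
    """Return (mean over all positions, mean over non-padding positions)."""
    per_token = [0.0 if t == pad_id else token_cross_entropy(l, t)
                 for l, t in zip(logits_seq, targets)]
    n_real = sum(1 for t in targets if t != pad_id)
    mean_all = sum(per_token) / len(per_token)   # what reduce_mean computes
    mean_real = sum(per_token) / max(n_real, 1)  # ignores padding entirely
    return mean_all, mean_real


# Usage: 4 positions, 2 of them padding, uniform logits over 4 classes
uniform = [[0.0, 0.0, 0.0, 0.0]] * 4
print(masked_means(uniform, targets=[1, 2, 0, 0]))
```

With the first normalization, batches containing short answers report a smaller loss simply because they contain more padding; whether that matters depends on how the loss values are compared.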

In [39]:
def custom_analyzer(input_text,
    filters='!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n', # there is no apostrophe, because a word must keep its apostrophe (e.g. "it's" stays "it's"),
                                                   # N.B.: we removed the following default symbols: "-<>"
    lower=True,
    split=" "):
  if lower:
    input_text = input_text.lower()
  # surround each filtered symbol with spaces, so that it becomes a token on its own
  for elem in filters:
    input_text = input_text.replace(elem, f" {elem} ")
  seq = input_text.split(split)
  return [i for i in seq if i] # drop the empty strings produced by consecutive spaces

In [26]:
# this tokenizer strips the punctuation listed in `filters`, but keeps
# apostrophes and "-<>", so "<start>" and "<end>" survive as single tokens
output_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n', oov_token='<UNK>')#, analyzer = custom_analyzer) # here we use a custom analyzer
output_tokenizer.fit_on_texts(["<start> " + i + " <end>" for i in YQA_train])

input_tokenizer = AutoTokenizer.from_pretrained(model_name)

max_output_length = max([len(i) for i in YQA_train])
print("Max input output found: " + str(max_output_length))
#max_sequence_length = max(512, max_output_length)
print("99° percentile of training set answer length:" + str(np.quantile([len(i) for i in output_tokenizer.texts_to_sequences(YQA_train)], 0.99)))
# actual percentile is 17, given that each string has the beginning and ending token
max_sequence_length = 20


print(np.argmax([len(i) for i in YQA_train]))
print(XQA_train[7529])
print(YQA_train[7529])

Max input output found: 539
99° percentile of training set answer length:13.0
7529
["What symptoms of addiction does Orzack's center list?", 'Caught in the Web A few months ago, it wasn\'t unusual for 47-year-old Carla Toebe to spend 15 hours per day online. She\'d wake up early, turn on her laptop and chat on Internet dating sites and instant-messaging programs - leaving her bed for only brief intervals. Her household bills piled up, along with the dishes and dirty laundry, but it took near-constant complaints from her four daughters before she realized she had a problem. "I was starting to feel like my whole world was falling apart - kind of slipping into a depression," said Carla. "I knew that if I didn\'t get off the dating sites, I\'d just keep going," detaching herself further from the outside world. Toebe\'s conclusion: She felt like she was "addicted" to the Internet. She\'s not alone. Concern about excessive Internet use isn\'t new. As far back as 1995, articles in medical jou

In [None]:
# generate dataset
train_ds = Dataset.from_dict({"xqa": XQA_train, "yqa": ["<start> " + i + " <end>" for i in YQA_train]})
train_ds = train_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
train_ds = train_ds.map(lambda x: {"y_token": output_tokenizer.texts_to_sequences(x["yqa"])}, batched=True)
train_ds = train_ds.map(lambda x: {"y_padded": tf.keras.preprocessing.sequence.pad_sequences(x["y_token"],
                                                                     padding='post',
                                                                     maxlen=max_sequence_length)}, batched=True
)
train_ds = train_ds.remove_columns(["xqa", "yqa", "y_token"])
train_ds = train_ds.with_format(type="tensorflow")
if model_name == 'prajjwal1/bert-tiny':
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds")
else:
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds_rob")

Saving the dataset (0/1 shards):   0%|          | 0/85807 [00:00<?, ? examples/s]
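
The `y_padded` column built above is later split in `train_op` into the decoder input (`[:, :-1]`) and the target (`[:, 1:]`). The sketch below reproduces both steps on a single sequence in plain Python. It is an illustration under assumptions: the token ids are made up, the helper names are hypothetical, and the truncation side mirrors what we understand to be the Keras `pad_sequences` default (`truncating='pre'`), since the call above sets only `padding='post'`.

```python
from typing import List, Tuple


def pad_post(seq: List[int], maxlen: int, pad_id: int = 0) -> List[int]:
    """Pad on the right with pad_id; over-long sequences lose their first items."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]  # 'pre' truncation keeps the tail
    return seq + [pad_id] * (maxlen - len(seq))


def teacher_forcing_pair(y_padded: List[int]) -> Tuple[List[int], List[int]]:
    """Decoder input drops the last position, the target drops the first."""
    return y_padded[:-1], y_padded[1:]


# Hypothetical ids: 2 = <start>, 7 = "white", 3 = <end>
y = pad_post([2, 7, 3], maxlen=6)
decoder_input, target = teacher_forcing_pair(y)
print(y)              # padded sequence
print(decoder_input)  # what the decoder reads
print(target)         # what the decoder must predict at each step
```

Under that truncation assumption, an answer longer than `max_sequence_length` would lose its `<start>` token, which is one reason to choose the length from a high percentile of the answer lengths.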

In [10]:
# load dataset
max_sequence_length = 20
if model_name == 'prajjwal1/bert-tiny':
  train_ds = load_from_disk("gdrive/MyDrive/ckpt/train_ds")
  val_ds = load_from_disk("gdrive/MyDrive/ckpt/val_ds")
else:
  train_ds = load_from_disk("gdrive/MyDrive/ckpt/train_ds_rob")
  val_ds = load_from_disk("gdrive/MyDrive/ckpt/val_ds_rob")



In [51]:
BATCH_SIZE = 14

decoder_units = 256

encoder = Encoder(model_name=model_name,
                      decoder_units=decoder_units)
    
# Testing the decoder
decoder = Decoder(vocab_size=len(output_tokenizer.word_index) + 1,
                      embedding_dim=100,
                      decoder_units=decoder_units,
                      batch_size=BATCH_SIZE,
                      max_sequence_length=max_sequence_length)
# Training
trainer = MyTrainer(encoder=encoder,
                      decoder=decoder,
                      max_length=max_sequence_length)

if model_name == 'prajjwal1/bert-tiny':  
  checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny'
else: 
  checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
print(checkpoint_prefix)
checkpoint = tf.train.Checkpoint(optimizer=trainer.optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


./gdrive/MyDrive/ckpt/dom/rob/ckpt


In [52]:
epochs = 3
steps_per_epoch = len(train_ds)//BATCH_SIZE
print(steps_per_epoch)
for epoch in tqdm(range(epochs)):
    batch_index = 0
    cumulative_loss = 0

    tic = time.time()
    for batch_index in range(steps_per_epoch):
      loss = trainer.batch_fit(train_ds[batch_index*BATCH_SIZE:batch_index*BATCH_SIZE+BATCH_SIZE])
      cumulative_loss += loss
      if batch_index % 1000 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix + "_" + str(batch_index) + "_" + str(epoch))
      if batch_index % 10 == 0:
        print(f'Processed batch {batch_index} - Epoch: {epoch}')
        print(f'time required: {time.time() - tic:.2f}')

    mean_loss = cumulative_loss / steps_per_epoch # average over all the batches of the epoch
    print(f"Current mean {mean_loss}")

checkpoint.save(file_prefix=checkpoint_prefix + "_final")

6129


  0%|          | 0/3 [00:00<?, ?it/s]

Processed batch 0 - Epoch: 0
time required: 9.30
Processed batch 10 - Epoch: 0
time required: 13.30
Processed batch 20 - Epoch: 0
time required: 17.28
Processed batch 30 - Epoch: 0
time required: 21.34
Processed batch 40 - Epoch: 0
time required: 25.40
Processed batch 50 - Epoch: 0
time required: 29.49
Processed batch 60 - Epoch: 0
time required: 33.56
Processed batch 70 - Epoch: 0
time required: 37.59
Processed batch 80 - Epoch: 0
time required: 41.58
Processed batch 90 - Epoch: 0
time required: 45.53
Processed batch 100 - Epoch: 0
time required: 49.44
Processed batch 110 - Epoch: 0
time required: 53.34
Processed batch 120 - Epoch: 0
time required: 57.27
Processed batch 130 - Epoch: 0
time required: 61.15
Processed batch 140 - Epoch: 0
time required: 65.04
Processed batch 150 - Epoch: 0
time required: 68.92
Processed batch 160 - Epoch: 0
time required: 72.83
Processed batch 170 - Epoch: 0
time required: 76.82
Processed batch 180 - Epoch: 0
time required: 80.74
Processed batch 190 - Ep

 33%|███▎      | 1/3 [40:24<1:20:48, 2424.02s/it]

Current mean 1.0564677715301514
Processed batch 0 - Epoch: 1
time required: 2.23
Processed batch 10 - Epoch: 1
time required: 6.18
Processed batch 20 - Epoch: 1
time required: 10.11
Processed batch 30 - Epoch: 1
time required: 14.04
Processed batch 40 - Epoch: 1
time required: 17.93
Processed batch 50 - Epoch: 1
time required: 21.85
Processed batch 60 - Epoch: 1
time required: 25.76
Processed batch 70 - Epoch: 1
time required: 29.69
Processed batch 80 - Epoch: 1
time required: 33.60
Processed batch 90 - Epoch: 1
time required: 37.52
Processed batch 100 - Epoch: 1
time required: 41.42
Processed batch 110 - Epoch: 1
time required: 45.35
Processed batch 120 - Epoch: 1
time required: 49.26
Processed batch 130 - Epoch: 1
time required: 53.17
Processed batch 140 - Epoch: 1
time required: 57.18
Processed batch 150 - Epoch: 1
time required: 61.13
Processed batch 160 - Epoch: 1
time required: 65.06
Processed batch 170 - Epoch: 1
time required: 68.99
Processed batch 180 - Epoch: 1
time required:

 67%|██████▋   | 2/3 [1:20:41<40:20, 2420.03s/it]

Current mean 0.8431016206741333
Processed batch 0 - Epoch: 2
time required: 2.52
Processed batch 10 - Epoch: 2
time required: 6.48
Processed batch 20 - Epoch: 2
time required: 10.40
Processed batch 30 - Epoch: 2
time required: 14.32
Processed batch 40 - Epoch: 2
time required: 18.24
Processed batch 50 - Epoch: 2
time required: 22.17
Processed batch 60 - Epoch: 2
time required: 26.09
Processed batch 70 - Epoch: 2
time required: 30.02
Processed batch 80 - Epoch: 2
time required: 33.96
Processed batch 90 - Epoch: 2
time required: 37.90
Processed batch 100 - Epoch: 2
time required: 41.83
Processed batch 110 - Epoch: 2
time required: 45.76
Processed batch 120 - Epoch: 2
time required: 49.66
Processed batch 130 - Epoch: 2
time required: 53.68
Processed batch 140 - Epoch: 2
time required: 57.65
Processed batch 150 - Epoch: 2
time required: 61.55
Processed batch 160 - Epoch: 2
time required: 65.45
Processed batch 170 - Epoch: 2
time required: 69.37
Processed batch 180 - Epoch: 2
time required:

100%|██████████| 3/3 [2:00:58<00:00, 2419.66s/it]

Current mean 0.6970311403274536





'./gdrive/MyDrive/ckpt/dom/rob/ckpt_final-22'

In [None]:
checkpoint.restore(checkpoint_prefix + "_final-22")

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f6bd4166c70>

In [53]:
import re
def filter_string(x):
  return re.sub('[!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n]',"",x)

In [54]:
# predictions are in lower case, so we consider the labels in lower case
val_ds = Dataset.from_dict({"xqa": XQA_val, "yqa": [filter_string(i.lower()) for i in YQA_val], "id_placeholder": list(range(len(YQA_val)))})


val_ds = val_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)

val_ds = val_ds.map(lambda x:{"references": {'answers':{'text':[x["yqa"]], 'answer_start': [42]},
    'id': str(x["id_placeholder"]) } })
val_ds = val_ds.remove_columns(["xqa","yqa", "id_placeholder"])

if model_name == 'prajjwal1/bert-tiny':  
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask","token_type_ids"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds")
else:
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds_rob")

  0%|          | 0/22 [00:00<?, ?ba/s]

  0%|          | 0/21479 [00:00<?, ?ex/s]

Saving the dataset (0/1 shards):   0%|          | 0/21479 [00:00<?, ? examples/s]

In [55]:
from evaluate import load

ttids=None

inference_batch_size = 64 # temporary value
inference_step = len(val_ds) // inference_batch_size
predictions = []
for step_index in tqdm(range(inference_step)):
  starting_index = step_index*inference_batch_size
  ending_index = step_index*inference_batch_size + inference_batch_size
  if model_name == 'prajjwal1/bert-tiny':  
    ttids = val_ds["token_type_ids"][starting_index : ending_index]
  generated = trainer.generate(output_tokenizer=output_tokenizer, 
                                  input_ids=val_ds["input_ids"][starting_index : ending_index],
                                  token_type_ids=ttids,
                                  attention_mask=val_ds["attention_mask"][starting_index : ending_index])
  translated = trainer.translate(generated, output_tokenizer=output_tokenizer)
  # all this mess with indexes is needed in order to have coherent ids in the field "id"
  list_to_add = [{'prediction_text': translated[i - starting_index].split("<end>")[0], 'id':str(i)} for i in range(starting_index, ending_index)]
  predictions.extend(list_to_add)

100%|██████████| 335/335 [16:04<00:00,  2.88s/it]


In [56]:
# the last batch must be processed even if its size differs from
# inference_batch_size
if model_name == 'prajjwal1/bert-tiny':  
    ttids = val_ds["token_type_ids"][(inference_step)*inference_batch_size :]
  
generated = trainer.generate(output_tokenizer = output_tokenizer, 
                             input_ids=val_ds["input_ids"][(inference_step)*inference_batch_size :],
                             token_type_ids=ttids,
                             attention_mask=val_ds["attention_mask"][(inference_step)*inference_batch_size :])
translated = trainer.translate(generated, output_tokenizer=output_tokenizer)
print((inference_step)*inference_batch_size)
predictions.extend([{'prediction_text': translated[i - (inference_step)*inference_batch_size].split("<end>")[0], 
                    'id':str(i)} for i in range((inference_step)*inference_batch_size, 
                                                len(val_ds))])

21440


In [None]:
#predictions = [{'prediction_text': translated[i - 128].split("<end>")[0], 'id':str(i)} for i in range(128, 256)]
print(type(predictions[0]))
for i in predictions:
  print(i["prediction_text"])
  print(YQA_val[int(i["id"])])
  print(XQA_val[int(i["id"])])

Streaming output truncated to the last 5000 lines.
12 
Fourteen
['How old is Jane Scott?', 'Jane Scott is fourteen and the year before last she began to study in a middle school. She likes dancing and singing and spends a lot of time on them. But she hates math and does not work hard at it. She thinks it difficult to learn. She falls behind her classmates and once failed the math exam. She decides to drop it. Her father is angry with her when he knows about it. It was Sunday. Mr Scott gave a call to his sister, who teaches math in another school. He hoped she would come and tell his daughter how to learn math. The woman came quickly and said. "You\'re a clever girl, Jane. I\'m sure you\'ll soon do well in math if you work hard at it." "I\'m afraid I can\'t, Aunt," said Jane, "Girls can\'t be good at math." "I don\'t think so," said the woman. "I was good at it when was a girl. You must do more exercises and practice a math problem again and again until you master it. Remem

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Streaming output truncated to the last 5000 lines.
yes 
a succession crisis
['What does he say the death has caused?', 'ISLAMABAD, Pakistan (CNN) -- The United States knows that the leader of Pakistan\'s Taliban is dead because he has not appeared in public to prove that he is alive, the top U.S. envoy to the region told CNN on Monday. \n\nBaitullah Mehsud, right, and a bodyguard arrive at a meeting in South Waziristan, Pakistan, in 2004. \n\nRichard Holbrooke said that the Pakistani Taliban have not confirmed the death of Baitullah Mehsud because of an ongoing power struggle over his successor. \n\n"The reason it\'s clear he\'s dead is that if he weren\'t dead, he\'d be giving TV and radio interviews to prove he\'s not dead," Holbrooke told CNN\'s Cal Perry. \n\nMehsud rarely gave news conferences or appeared before the media. There has been conflicting information from the Pakistani Taliban about whether Mehsud died in a suspected U.S. missile strike earlier this month. 




['Had anyone stood in his way?', "CHAPTER XI. \n\nDETERIORATION OF CHARACTER. \n\nB.C. 329 \n\nAlexander at the summit of his ambition.--Sad changes.--Alexander becomes dissipated.--His officers became estranged.--Character of Parmenio.--His services to Alexander.--Parmenio's son, Philotas.--His dissolute character.--Conspiracies.--Plot of Dymnus.--Dymnus destroys himself.--Philotas suspected.--The council of officers.--Philotas accused.--Arrest of Philotas.--The body of Dymnus.--Alexander's address to the army.--Philotas brought to trial.--Defense of Philotas.--He is put to the torture.--Confession of Philotas.--He is stoned to death.--Parmenio condemned to death.--Mission of Polydamas.--Precautions.--Brutal murder of Parmenio.--Story of Clitus.--He saves Alexander's life.--Services of Clitus.--Occurrences at the banquet.--Clitus reproaches Alexander.--Alexander's rage.--Alexander assassinates Clitus.--His remorse. \n\nAlexander was now twenty-six years of age. He had accomplished ful

KeyboardInterrupt: ignored

In [58]:
#help(squad_metric)
print(len(predictions))
print(len(XQA_val))

21479
21479


In [57]:
squad_metric = load("squad")
#predictions = [{'prediction_text': "<start> example answer <end>", 'id': '0'}]
references = val_ds['references']
results = squad_metric.compute(predictions=predictions, references=references)
results

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]

{'exact_match': 14.199916197215884, 'f1': 17.69537892047668}

## [Task 4] Answer generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.

## [Task 5] Answer generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.
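
One possible way to feed $H$ to the model is to flatten the most recent dialogue turns into the input text. The sketch below is only an illustration of that idea, not the required solution: the function name `build_input`, the `max_turns` cut-off, the ordering of the parts and the `SEP` string are all assumptions, and the separator should be replaced with whatever your tokenizer actually uses.

```python
from typing import Sequence, Tuple

# Hypothetical separator: replace it with your tokenizer's own separator token.
SEP = " </s> "


def build_input(passage: str, question: str,
                history: Sequence[Tuple[str, str]] = (),
                max_turns: int = 3) -> str:
    """Join the last `max_turns` (Q, A) pairs, the current question and the passage."""
    # Guard against max_turns == 0: a [-0:] slice would return the whole history.
    recent = list(history)[-max_turns:] if max_turns > 0 else []
    history_text = " ".join(f"{q} {a}" for q, a in recent)
    parts = [history_text, question, passage] if history_text else [question, passage]
    return SEP.join(parts)
```

With an empty `history` the function reduces to the $f_\theta(P, Q)$ input, so the same pre-processing code can serve both model variants. Limiting the number of turns keeps the input short enough to leave room for the passage within the 512-token limit.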

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metric: SQuAD F1-score.$^2$
* Fine-tune each transformer-based model for **3 epochs**.
* Report the SQuAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use the ```allennlp``` Python package for a quick implementation of the SQuAD F1-score: ```from allennlp_models.rc.tools import squad```.
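
For intuition about what the metric measures, here is a minimal sketch of a SQuAD-style token-level F1 for a single prediction and a single reference. It follows the usual recipe (lowercase, strip punctuation and English articles, then compare bags of tokens), but it is a simplified illustration: the function names are our own, scores are on a 0-1 scale rather than 0-100, and with several reference answers you would take the maximum over them. Prefer a library implementation for reported results.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one prediction and one reference (0.0 to 1.0)."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection: each shared token counts as often as it appears in both.
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the score is based on token overlap, a prediction such as "the big crisis" still earns partial credit against "a succession crisis", whereas exact match would give it zero.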

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQuAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 
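
The grouping step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: each evaluated example is taken to be a dictionary with a `source` field and a precomputed per-example `f1` score, and the function name `worst_errors_by_source` and the other field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def worst_errors_by_source(records: List[dict], k: int = 5) -> Dict[str, List[dict]]:
    """Return, for each source, the k records with the lowest F1 score."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["source"]].append(record)
    # sorted() is stable, so ties keep their original order.
    return {source: sorted(items, key=lambda r: r["f1"])[:k]
            for source, items in grouped.items()}


# Toy records with made-up scores, only to show the expected input shape.
records = [
    {"source": "cnn", "question": "q1", "f1": 0.9},
    {"source": "cnn", "question": "q2", "f1": 0.0},
    {"source": "race", "question": "q3", "f1": 0.4},
    {"source": "cnn", "question": "q4", "f1": 0.5},
]
worst = worst_errors_by_source(records, k=2)
```

Keeping the question, the gold answer and the prediction in each record makes it easy to read through the worst cases and look for recurring question types.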

# Assignment Evaluation

Assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given Overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-pasted terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean, modular code that is easy to read and debug (notebooks can be quite tricky from time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include the data tokenization and conversion steps?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
# Alternatively, truncate explicitly to at most 512 tokens
inputs = tokenizer(text, add_special_tokens=True, truncation=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugging Face's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting an **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length
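
As a rough sketch of the second workaround, the padding length can be chosen as a quantile of the tokenized sequence lengths instead of their maximum. The helper below is hypothetical (its name, the default `q=0.95` and the nearest-rank rule are our assumptions); the lengths would come from tokenizing your training set without padding.

```python
import math
from typing import Sequence


def quantile_max_length(lengths: Sequence[int], q: float = 0.95, cap: int = 512) -> int:
    """Return the length that covers a fraction `q` of the sequences, capped at `cap`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Nearest-rank quantile: the smallest length with at least q of the data at or below it.
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return min(ordered[index], cap)
```

The returned value can then be passed as `max_length` to the tokenizer together with truncation, so that the few very long examples are truncated instead of forcing every batch to be padded to the longest sequence.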

---
---

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?