# Assignment 2

**Authors**: Vincenzo Collura, Gianmarco Pappacoda, Anthea Silvia Sasdelli

## Task

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) free-form text or (ii) unanswerable.

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ the dialogue history $H$ may be a valuable input for producing the correct answer $A$.

In [None]:
## Colab-specific
!pip install datasets transformers allennlp-models
!pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

## Preliminary operations

In this section the required libraries are imported, and the dataset is downloaded and briefly explored.

In [6]:
import os
import copy
import json
import pandas as pd
import urllib.request
from tqdm import tqdm

from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import EncoderDecoderModel, AutoTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments
import torch
import random
import numpy as np
from allennlp_models.rc.tools import squad
from pprint import pprint

# paths
data_path = 'data'
dataset_path = os.path.join(data_path, 'coqa')
models_path = 'models'

In [7]:
def set_reproducibility(seed):
    """
    set the seed for reproducibility.

    Parameters
    ----------
    seed : int
        seed value.
    """
    random.seed(seed)
    np.random.seed(seed)
    # tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    torch.manual_seed(seed)

seed = 42
set_reproducibility(seed)

Download functions

In [8]:
class DownloadProgressBar(tqdm):
    """
    progress bar printer 
    """

    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)


def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(
            url, filename=output_path, reporthook=t.update_to)


def download_data(data_path, url_path, suffix):
    """
    download one data split unless it is already on disk.

    Parameters
    ----------
    data_path : string
        directory where you want to save the file you are downloading.
    url_path : string
        url of the file you want to download.
    suffix : string
        name of the split, 'train' or 'test'; used as the output file name.
    """
    if not os.path.exists(data_path):
        os.makedirs(data_path)

    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

Download the dataset

In [9]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path=dataset_path, url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path=dataset_path, url_path=test_url, suffix='test')

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:05, 8.80MB/s]                            


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:00, 10.2MB/s]                            

Download completed!





Opening the dataset, which is in JSON format

In [10]:
# Opening JSON file
with open(os.path.join(dataset_path, 'train.json')) as json_file:
    train_data_raw = json.load(json_file)

    # Print the type of data variable
    print("Type:", type(train_data_raw))

Type: <class 'dict'>


A look at the columns of the dataset

In [11]:
print(train_data_raw['data'][0].keys())

dict_keys(['source', 'id', 'filename', 'story', 'questions', 'answers', 'name'])


A look into the questions feature

In [12]:
sample = 5

In [13]:
pprint(train_data_raw['data'][sample]['questions'])

[{'input_text': 'Was Budapest always one city?', 'turn_id': 1},
 {'input_text': 'How many was it?', 'turn_id': 2},
 {'input_text': 'What was one called?', 'turn_id': 3},
 {'input_text': 'Where was it located?', 'turn_id': 4},
 {'input_text': 'What was the other?', 'turn_id': 5},
 {'input_text': 'Where was it located?', 'turn_id': 6},
 {'input_text': 'When did they combine?', 'turn_id': 7},
 {'input_text': "Is it an important city in it's country?", 'turn_id': 8},
 {'input_text': 'How many people live there?', 'turn_id': 9},
 {'input_text': 'Do other people visit?', 'turn_id': 10},
 {'input_text': 'What do they do?', 'turn_id': 11},
 {'input_text': 'Where?', 'turn_id': 12},
 {'input_text': 'When do people like to go?', 'turn_id': 13},
 {'input_text': 'Why?', 'turn_id': 14},
 {'input_text': 'When was LA started?', 'turn_id': 15},
 {'input_text': 'What is the climate like there?', 'turn_id': 16},
 {'input_text': 'What is it close to?', 'turn_id': 17},
 {'input_text': 'How many people live

A look into the answers feature

In [14]:
pprint(train_data_raw['data'][sample]['answers'][:2])

[{'input_text': 'no',
  'span_end': 150,
  'span_start': 127,
  'span_text': 'Budapest was two cities',
  'turn_id': 1},
 {'input_text': 'two',
  'span_end': 150,
  'span_start': 127,
  'span_text': 'Budapest was two cities',
  'turn_id': 2}]
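
In the sample above, `span_end - span_start` equals the length of `span_text`, which suggests that the two fields are end-exclusive character offsets into `story`. The following is a minimal sketch of that reading on a made-up record: the story text and the helper `rationale` are hypothetical, and real entries may deviate slightly (for example in whitespace).

```python
# Toy record mimicking the CoQA layout; the story text is invented for illustration.
story = "Long ago, Budapest was two cities, called Buda and Pest."
span_text = "Budapest was two cities"

# Assumption: offsets are character positions, with an exclusive end.
span_start = story.find(span_text)
span_end = span_start + len(span_text)

answer = {
    "input_text": "no",       # free-form answer
    "span_text": span_text,   # supporting rationale
    "span_start": span_start,
    "span_end": span_end,
    "turn_id": 1,
}


def rationale(story, answer):
    """Recover the rationale by slicing the story with the stored offsets."""
    return story[answer["span_start"]:answer["span_end"]]


print(rationale(story, answer))  # Budapest was two cities
```

The free-form `input_text` ("no") is what the models in this notebook are trained to generate; the span only provides the supporting evidence.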


In [15]:
# Opening JSON file
with open(os.path.join(dataset_path, 'test.json')) as json_file:
    test_data_raw = json.load(json_file)

## Task 1 - Pre-processing

This section describes a light form of pre-processing. Most notably, the dataset contains "unanswerable" questions (i.e. answer: unknown), which are removed from the training data as per instructions.

Remove unanswerable QA pairs

In [16]:
def remove_unanswerable(dataset):
    """
    remove unanswerable QA pairs.

    Parameters
    ----------
    dataset : dict
        dictionary of the data

    Returns
    -------
    result : dict
        deep copy of the input data without unanswerable QA pairs.
    """
    result = copy.deepcopy(dataset)
    count = 0
    removed = 0
    for entry_index, entry in tqdm(enumerate(result['data'])):
        for question_index, (question, answer) in reversed(list(enumerate(zip(entry['questions'], entry['answers'])))):
            count += 1
            if answer['input_text'] == 'unknown' or answer['span_text'] == 'unknown':
                result['data'][entry_index]['questions'].pop(question_index)
                result['data'][entry_index]['answers'].pop(question_index)
                removed += 1

    print(
        f"Original size: {count}, removed entries: {removed} ({removed*100/count:.2f}%)")

    return result
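
To make the filtering rule concrete, here is a small self-contained sketch on invented data. `drop_unknown_turns` is a hypothetical, more compact variant of the function above (it rebuilds the lists instead of popping in reverse order); it is not the code used for the results in this notebook.

```python
import copy


def drop_unknown_turns(dataset):
    """Hypothetical compact variant: keep only turns whose answer is not 'unknown'."""
    result = copy.deepcopy(dataset)  # leave the input untouched
    for entry in result["data"]:
        kept = [
            (q, a)
            for q, a in zip(entry["questions"], entry["answers"])
            if a["input_text"] != "unknown" and a["span_text"] != "unknown"
        ]
        entry["questions"] = [q for q, _ in kept]
        entry["answers"] = [a for _, a in kept]
    return result


# Invented toy dialogue with one unanswerable turn.
toy = {"data": [{
    "questions": [
        {"input_text": "Who wrote it?", "turn_id": 1},
        {"input_text": "Did they climb any mountains?", "turn_id": 2},
        {"input_text": "When?", "turn_id": 3},
    ],
    "answers": [
        {"input_text": "Ann", "span_text": "Ann wrote it", "turn_id": 1},
        {"input_text": "unknown", "span_text": "unknown", "turn_id": 2},
        {"input_text": "in 1990", "span_text": "in 1990", "turn_id": 3},
    ],
}]}

filtered = drop_unknown_turns(toy)
print([q["turn_id"] for q in filtered["data"][0]["questions"]])  # [1, 3]
```

Note that the remaining turns keep their original `turn_id`, so a removed turn also disappears from the dialogue history of later turns.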

Functions to transform the dataset into a Pandas dataframe.

In [17]:
def dataset_to_tuples(data):
    """
    create tuples from a dict of data.

    Parameters
    ----------
    data : dict
        dictionary of the data
    """
    for entry in data['data']:
        for question, answer in zip(entry['questions'], entry['answers']):
            history = []
            for q, a in zip(entry['questions'], entry['answers']):
                if q['turn_id'] < question['turn_id']:
                    history.append({
                        'question': q['input_text'],
                        'answer': a['input_text']
                    })
            yield (entry['name'], entry['story'], question['turn_id'], question['input_text'], answer['span_start'], answer['span_end'], answer['span_text'], answer['input_text'], history)


def dataset_to_df(data):
    """
    create a df from the tuples generated by `dataset_to_tuples`.

    Parameters
    ----------
    data : dict
        dictionary of the data

    Returns
    -------
    df
        pandas dataframe with one row per QA pair.
    """
    df = pd.DataFrame(dataset_to_tuples(data))
    df.columns = ['source', 'context', 'turn_id', 'question',
                  'span_start', 'span_end', 'span_text', 'answer', 'history']
    return df
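
The history attached to each row consists of all question/answer pairs with a smaller `turn_id` in the same story. A minimal sketch of that rule in isolation, where `build_history` is a hypothetical helper and the third answer is a placeholder invented for the example:

```python
def build_history(entry, turn_id):
    """Collect the (question, answer) pairs of all turns preceding `turn_id`."""
    return [
        {"question": q["input_text"], "answer": a["input_text"]}
        for q, a in zip(entry["questions"], entry["answers"])
        if q["turn_id"] < turn_id
    ]


entry = {
    "questions": [
        {"input_text": "Was Budapest always one city?", "turn_id": 1},
        {"input_text": "How many was it?", "turn_id": 2},
        {"input_text": "What was one called?", "turn_id": 3},
    ],
    "answers": [
        {"input_text": "no", "turn_id": 1},
        {"input_text": "two", "turn_id": 2},
        {"input_text": "Buda", "turn_id": 3},  # placeholder answer, not taken from the dataset
    ],
}

print(build_history(entry, 1))       # first turn: empty history
print(len(build_history(entry, 3)))  # third turn: two previous pairs
```

The history uses the gold answers of earlier turns, so at inference time the model sees reference answers rather than its own previous predictions.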

Remove unanswerable QA pairs from the training data

In [18]:
train_data = remove_unanswerable(train_data_raw)
# test_data = remove_unanswerable(test_data_raw) # Not required by assignment
test_data = test_data_raw

7199it [00:00, 128646.15it/s]

Original size: 108647, removed entries: 1372 (1.26%)





In [19]:
# Sanity check: no 'unknown' answer should remain, so this cell prints nothing
for entry in train_data['data']:
    for index, (question, answer) in enumerate(zip(entry['questions'], entry['answers'])):
        if answer['input_text'] == 'unknown':
            print(answer['turn_id'])
            print(index)
            print(question['input_text'])
            print(answer['input_text'])

Dataframe creation

In [20]:
df = dataset_to_df(train_data)
df_t = dataset_to_df(test_data)

## Task 2 - Training/Validation/Test split

In this section the dataset is split into training, validation and test sets.

Quick look at the data

In [21]:
df.head()

| | source | context | turn_id | question | span_start | span_end | span_text | answer | history |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Vatican_Library.txt | The Vatican Apostolic Library (), more commonl... | 1 | When was the Vat formally opened? | 151 | 179 | Formally established in 1475 | It was formally established in 1475 | [] |
| 1 | Vatican_Library.txt | The Vatican Apostolic Library (), more commonl... | 2 | what is the library for? | 454 | 494 | he Vatican Library is a research library | research | [{'question': 'When was the Vat formally opene... |
| 2 | Vatican_Library.txt | The Vatican Apostolic Library (), more commonl... | 3 | for what subjects? | 457 | 511 | Vatican Library is a research library for hist... | history, and law | [{'question': 'When was the Vat formally opene... |
| 3 | Vatican_Library.txt | The Vatican Apostolic Library (), more commonl... | 4 | and? | 457 | 545 | Vatican Library is a research library for hist... | philosophy, science and theology | [{'question': 'When was the Vat formally opene... |
| 4 | Vatican_Library.txt | The Vatican Apostolic Library (), more commonl... | 5 | what was started in 2014? | 769 | 879 | March 2014, the Vatican Library began an initi... | a project | [{'question': 'When was the Vat formally opene... |


Split the training data into training and validation sets (80% train, 20% validation), using `random_state=seed` (42) for reproducibility.

In [22]:
cut = ['source', 'context', 'question', 'answer', 'history']
df_cut = df[cut]
df_train, df_val = train_test_split(df_cut, test_size=0.2, random_state=seed)
df_test = df_t[cut]
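
The effect of a fixed `random_state` can be illustrated without scikit-learn. The sketch below is not scikit-learn's implementation: `seeded_split` is a hypothetical stand-in that only shows the idea of shuffling with a seeded generator and cutting off a validation slice.

```python
import random


def seeded_split(rows, val_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` with a private RNG and cut off a validation slice."""
    rng = random.Random(seed)  # local generator: does not touch the global random state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


rows = list(range(10))
train_a, val_a = seeded_split(rows)
train_b, val_b = seeded_split(rows)

print(len(train_a), len(val_a))              # 8 2
print(train_a == train_b and val_a == val_b)  # True: same seed, same split
```

As in the cell above, the split is done over individual QA pairs, so turns from the same story can end up in both the training and the validation set.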

## Task 3 - Model definition

In this section the base wrapper is defined in order to carry out subsequent experiments in an orderly manner.

In [23]:
class s2smodel:
    """
    Wrapper class for HF encoder-decoder-based model and tokenizer for Question Answering.
    """

    def __init__(self, model_name : str, tie_encoder_decoder : bool = True):
        """
        Initialize the model.

        Parameters
        ----------
        model_name : str
            name of the chosen model, prajjwal1/bert-tiny or distilroberta-base.
        tie_encoder_decoder: bool
            whether to tie the decoder weights to the encoder weights.
        """
        # DO NOT translate to cuda BEFORE setting parameters, your config shall be cast into void!!!
        # bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder).to("cuda")

        config_name = model_name
        # The two following try/except blocks are an ugly solution to an ugly library issue when loading config/weights:
        # if direct loading fails, the encoder-decoder is assembled from the plain model name.
        try:
            bert2bert = EncoderDecoderModel.from_pretrained(model_name, tie_encoder_decoder=tie_encoder_decoder)
        except Exception:
            bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
                model_name, model_name, tie_encoder_decoder=tie_encoder_decoder)

        # If no tokenizer can be loaded from `model_name`, fall back to the encoder's original tokenizer
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_name)
        except Exception:
            config_name = bert2bert.config.encoder._name_or_path
            tokenizer = AutoTokenizer.from_pretrained(config_name)

        bert2bert.config.decoder.is_decoder = True
        bert2bert.config.decoder.add_cross_attention = True

        if config_name == 'prajjwal1/bert-tiny':
            bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
            bert2bert.config.eos_token_id = tokenizer.sep_token_id
            bert2bert.config.pad_token_id = tokenizer.pad_token_id

        if config_name == 'distilroberta-base':
            tokenizer.bos_token = tokenizer.cls_token
            tokenizer.eos_token = tokenizer.sep_token
            bert2bert.config.decoder_start_token_id = tokenizer.bos_token_id
            bert2bert.config.eos_token_id = tokenizer.eos_token_id
            bert2bert.config.pad_token_id = tokenizer.pad_token_id
        
        bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size


        bert2bert.config.max_length = 512
        bert2bert.config.min_length = 1
        bert2bert.finetuning_task = True
        bert2bert.config.no_repeat_ngram_size = 3
        bert2bert.config.num_beams = 50

        self.model = bert2bert
        self.tokenizer = tokenizer
        self.config_name = config_name
        self.model_name = model_name.split('/')[-1] if '/' in model_name else model_name

    def encode_history(self, history):
        """
        Encode history in text format.

        Parameters
        ----------
        history : list
            list of  the previous questions and answers.

        Returns
        -------
        str
            previous questions and answers with respect to the current question.
        """
        return '; '.join([f"question: {entry['question']} answer: {entry['answer']}" for entry in history])

    def tokenize(self, data, encoder_max_length=512, decoder_max_length=32, include_history=False):
        """
        Tokenize questions, contexts (optionally with history) and answers.

        Parameters
        ----------
        data : dict or dataset
            batch with 'question', 'context', 'answer' and 'history' columns.
        encoder_max_length : int
            encoder max length.
        decoder_max_length : int
            decoder max length.
        include_history : bool
            True to append the encoded dialogue history to the context, False otherwise.

        Returns
        -------
        dataset
            tokenized dataset with 'input_ids', 'attention_mask' and 'labels'.
        """
        if include_history:
            inputs = self.tokenizer(data['question'], [f"{context} {history}" for context, history in zip(data['context'], [self.encode_history(history) for history in data['history']])],
                                    padding="max_length", truncation=True, max_length=encoder_max_length)
        else:
            inputs = self.tokenizer(data['question'], data['context'],
                                    padding="max_length", truncation=True, max_length=encoder_max_length)
        outputs = self.tokenizer(
            data["answer"], padding="max_length", truncation=True, max_length=decoder_max_length)

        input_ids = inputs.input_ids
        attention_mask = inputs.attention_mask
        # data["decoder_input_ids"] = outputs.input_ids
        # data["decoder_attention_mask"] = outputs.attention_mask
        out_labels = outputs.input_ids.copy()

        # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`.
        # We have to make sure that the PAD token is ignored
        if self.config_name == 'prajjwal1/bert-tiny':
            out_labels = [[-100 if token == self.tokenizer.pad_token_id else token for token in labels]
                                for labels in out_labels]
        # For single sample
        #  out_labels = [-100 if token == tokenizer.pad_token_id else token for token in out_labels]
        out = Dataset.from_dict({'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': out_labels})
        out.set_format('torch')
        return out

    def process_data_to_model_inputs(self, batch, encoder_max_length=512, decoder_max_length=32, include_history=False):
        """
        Process data to be fed into the network, modifies batch

        Parameters
        ----------
        batch : list
            input batch.
        encoder_max_length : int
            max length of the encoder.
        decoder_max_length : int
            max length of the decoder.
        
        Returns
        -------
        batch
        """
        # tokenize the inputs and labels
        # concat = 'question: ' + batch['question'] + ' context: ' + batch['context']

        tokenized = self.tokenize(
            batch,
            encoder_max_length=encoder_max_length, 
            decoder_max_length=decoder_max_length,
            include_history=include_history
        )

        batch["input_ids"] = tokenized["input_ids"]
        batch["attention_mask"] = tokenized["attention_mask"]
        batch["labels"] = tokenized["labels"]
        
        return batch

    def process_dataset(self, dataset, batch_size=512, **kwargs):
        """
        Map function to process the dataset

        Parameters
        ----------
        dataset : dataset
            input dataset.
        batch_size : int
            size of the batch.
        """
        # Use batched mapping whenever more than one example is processed per call
        batched = batch_size > 1
        return dataset.map(self.process_data_to_model_inputs, batch_size=batch_size, batched=batched, fn_kwargs=kwargs)

    def train(self, train_data, val_data, num_epochs=3, batch_size=8, learning_rate=1e-3):
        """
        Train the model.

        Parameters
        ----------
        train_data : pandas dataframe
            train dataset.
        val_data : pandas dataframe
            validation dataset.
        num_epochs : int
            number of training epochs
        batch_size : int
            size of the batch.
        learning_rate : float
            starting learning rate
        """
        self.model.train()

        training_args = Seq2SeqTrainingArguments(
            num_train_epochs=num_epochs,
            prediction_loss_only=True,               # Do not compute metrics during train
            predict_with_generate=True,
            evaluation_strategy="epoch",
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            fp16=True,
            output_dir="./training",
            # logging_steps=4,
            # save_steps=10,
            eval_steps=4,
            # optim="adamw_torch",
            learning_rate=learning_rate,
            weight_decay=0.01,
            # logging_steps=1000,
            save_steps=500,
            # eval_steps=7500,
            # warmup_steps=2000,
            save_total_limit=3,
            report_to="none",
        )

        trainer = Seq2SeqTrainer(
            model=self.model,
            tokenizer=self.tokenizer,
            args=training_args,
            compute_metrics=self.compute_metrics,
            train_dataset=train_data,
            eval_dataset=val_data
        )
        trainer.train()

    def compute_metrics(self, pred):
        """
        Compute the SQuAD F1-score during validation.

        Parameters
        ----------
        pred : EvalPrediction
            predictions and label ids provided by the trainer.

        Returns
        -------
        dict
            computed SQuAD F1-score under the key 'squad_f1'.
        """
        
        labels_ids = pred.label_ids
        pred_ids = pred.predictions

        pred_str = self.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
        if self.config_name == 'prajjwal1/bert-tiny':
            labels_ids[labels_ids == -100] = self.tokenizer.pad_token_id
        label_str = self.tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

        # Average the per-example F1 (as in `evaluate`) instead of scoring one concatenated string
        scores = [squad.compute_f1(p, l) for p, l in zip(pred_str, label_str)]

        return {
            "squad_f1": float(np.mean(scores)) if scores else 0.0,
        }

    def evaluate(self, samples, batch_size=32, return_preds=True, **kwargs):
        """
        Evaluate the model.

        Parameters
        ----------
        samples : list
            list of samples
        batch_size : int
            size of batch
        return_preds : bool
            True if you want to include the predictions in the returned data, False otherwise
        
        Returns
        -------
        squad f1-score or (squad f1-score, prediction)
        """
        score = 0
        y_pred = []
        for i in tqdm(range(int(np.ceil(len(samples)/batch_size)))):
            sample = samples.select(range(i*batch_size,(i+1)*batch_size if (i+1)*batch_size < len(samples) else len(samples)))
            y_true = sample['answer']
            pred = self.predict_from_tokenized(sample['input_ids'], sample['attention_mask'], num_return_sequences=1)
            for p, t in zip(pred, y_true):
                pred_score = squad.compute_f1(p, t)
                y_pred.append({"prediction": p, "squad_f1": pred_score})
                score += pred_score
        
        score = score / len(samples)

        if return_preds:
            return score, y_pred
        
        return score

    def predict(self, data, mode='beam', num_return_sequences=1, repetition_penalty=3.0, length_penalty=2.0, include_history=False, batch_size=1):
        """
        Predict method.

        Parameters
        ----------
        data : dataset
            dataset taken into consideration.
        mode : str
            search mode.
        num_return_sequences :  int
            set the number of returned sequences
        repetition_penalty : float
            penalty applied to repeated tokens during generation.
        length_penalty : float
            exponential penalty on the sequence length, used by beam search.
        include_history : bool
            True if you want the model that include the history, False otherwise.
        batch_size : int 
            size of the batch.

        Returns
        -------
        y_pred
        """
        self.model.eval()

        with torch.no_grad():
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            self.model.to(device)

            if batch_size > 1:
                y_pred = []
                for i in tqdm(range(int(np.ceil(len(data)/batch_size)))):
                    sample = data.select(range(i*batch_size,(i+1)*batch_size if (i+1)*batch_size < len(data) else len(data)))

                    tokenized = self.tokenize(sample, include_history=include_history)
                    input_ids = tokenized['input_ids']
                    attention_mask = tokenized['attention_mask']
                    
                    pred = self.predict_from_tokenized(input_ids, attention_mask, mode=mode, num_return_sequences=num_return_sequences, repetition_penalty=repetition_penalty, length_penalty=length_penalty)
                    for p in pred:
                        y_pred.append(p)
                
            else:
                # `tokenize` already appends the encoded history to the context when requested
                tokenized = self.tokenize(data, include_history=include_history)
                input_ids = tokenized['input_ids']
                attention_mask = tokenized['attention_mask']
                
                y_pred = self.predict_from_tokenized(input_ids, attention_mask, mode=mode, num_return_sequences=num_return_sequences, repetition_penalty=repetition_penalty, length_penalty=length_penalty)
            
            return y_pred


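    def predict_from_tokenized(self, input_ids, attention_mask, mode='beam', num_return_sequences=1, repetition_penalty=3.0, length_penalty=2.0):
        """
        Predict from pre-tokenized data.

        Parameters
        ----------
        input_ids : torch.Tensor
            token ids of the input data.
        attention_mask : torch.Tensor
            attention mask matching `input_ids`.
        mode : str
            search mode: 'beam' or 'topk' (sampling); any other value falls back
            to the generation defaults stored in the model config.
        num_return_sequences : int
            number of returned sequences per input.
        repetition_penalty : float
            penalty applied to repeated tokens during generation.
        length_penalty : float
            exponential penalty on the sequence length, used by beam search.

        Returns
        -------
        list of str
            the generated texts.
        """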
        self.model.eval()

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(device)
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)

        with torch.no_grad():
            if mode == 'beam':
                generated_ids = self.model.generate(input_ids,
                                                    attention_mask=attention_mask,
                                                    decoder_start_token_id=self.model.config.decoder.bos_token_id,
                                                    num_beams=15,
                                                    no_repeat_ngram_size=2,
                                                    early_stopping=True,
                                                    repetition_penalty=repetition_penalty,
                                                    length_penalty=length_penalty,
                                                    num_return_sequences=num_return_sequences
                                                    )
            elif mode == 'topk':
                generated_ids = self.model.generate(input_ids,
                                                    attention_mask=attention_mask,
                                                    decoder_start_token_id=self.model.config.decoder.bos_token_id,
                                                    repetition_penalty=repetition_penalty,
                                                    length_penalty=length_penalty,
                                                    num_return_sequences=num_return_sequences,
                                                    do_sample=True,
                                                    top_k=50,
                                                    top_p=0.95,
                                                    )
            else:
                generated_ids = self.model.generate(input_ids,
                                                    attention_mask=attention_mask,
                                                    decoder_start_token_id=self.model.config.decoder.bos_token_id,
                                                    num_return_sequences=num_return_sequences,
                                                    repetition_penalty=repetition_penalty,
                                                    length_penalty=length_penalty
                                                    )
            generated_text = self.tokenizer.batch_decode(
                generated_ids, skip_special_tokens=True)
                
        return generated_text 
    
    def save_pretrained(self, *args, **kwargs):
        """
        Save the model.
        """
        #saving model
        return self.model.save_pretrained(*args, **kwargs)
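
In `tokenize`, padded label positions are replaced with `-100` (for bert-tiny) so that they do not contribute to the loss; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the masking step on its own. The token ids and the pad id `0` are made up for illustration and stand in for the tokenizer's real values.

```python
PAD_TOKEN_ID = 0     # assumption: stands in for tokenizer.pad_token_id
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_batch, pad_token_id=PAD_TOKEN_ID):
    """Replace every padding id with IGNORE_INDEX, leaving real tokens untouched."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in labels]
        for labels in label_batch
    ]


# Two made-up answers padded to length 5.
batch = [
    [101, 2048, 102, 0, 0],
    [101, 2053, 2154, 102, 0],
]

print(mask_padding(batch))
```

`compute_metrics` performs the inverse step (mapping `-100` back to the pad id) before decoding, because the tokenizer cannot decode negative ids.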

## Task 4 - Models with Context+Question

Implementation of $f_\theta(P, Q)$ for both models, using the wrapper class defined above.

In [None]:
bert_tiny = s2smodel('prajjwal1/bert-tiny')
include_history = False

In [None]:
roberta = s2smodel('distilroberta-base')
include_history = False

## Task 5 - Models with Context+Question+History

Implementation of $f_\theta(P, Q, H)$ for both models, using the wrapper class defined above.
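
The only difference between $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ in the wrapper is the second text segment passed to the tokenizer: with history, the encoded previous turns are appended to the passage. A minimal, tokenizer-free sketch of that assembly follows; `build_segments` is a hypothetical helper that mirrors the string handling in `encode_history` and `tokenize`, and the passage text is invented.

```python
def encode_history(history):
    """Flatten previous turns into a single string, as done in the wrapper class."""
    return "; ".join(
        f"question: {turn['question']} answer: {turn['answer']}" for turn in history
    )


def build_segments(question, context, history, include_history=False):
    """Return the (first, second) text segments that would be handed to the tokenizer."""
    if include_history:
        return question, f"{context} {encode_history(history)}"
    return question, context


history = [{"question": "Was Budapest always one city?", "answer": "no"}]
context = "Budapest was two cities."  # invented one-sentence passage

print(build_segments("How many was it?", context, history))
print(build_segments("How many was it?", context, history, include_history=True))
```

Since the history is placed after the passage and inputs are truncated to `encoder_max_length`, it seems likely that for long passages the history is the first part to be cut off; this has not been verified here.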

In [None]:
bert_tiny_history = s2smodel('prajjwal1/bert-tiny')
include_history = True

In [None]:
roberta_history = s2smodel('distilroberta-base')
include_history = True

## Task 6 - Training and Evaluation

In this section the models are trained and evaluated. Please select the model and hyperparameters carefully as the training phase is quite heavy.

In [None]:
# The model to train
model = bert_tiny
# Must match the chosen model: True only for the *_history models
include_history = False
# Use the largest affordable batch_size (suggested: 16 for bert-tiny, 8 or 4 for distilroberta-base)
batch_size = 16
# Use 1e-3 for bert-tiny, 2e-5 for distilroberta-base
learning_rate = 1e-3

### Training

In [None]:
train_dataset = Dataset.from_dict(df_train)
train_dataset

Dataset({
    features: ['context', 'question', 'answer', 'history'],
    num_rows: 85820
})

In [None]:
val_dataset = Dataset.from_dict(df_val)
val_dataset

Dataset({
    features: ['context', 'question', 'answer', 'history'],
    num_rows: 21455
})

In [None]:
test_dataset = Dataset.from_dict(df_test)
test_dataset

In [None]:
dataset_dict = DatasetDict({
    "train": train_dataset,
    "val": val_dataset,
    "test": test_dataset    # test data will be mapped but won't be considered during training
})
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answer', 'history'],
        num_rows: 85820
    })
    val: Dataset({
        features: ['context', 'question', 'answer', 'history'],
        num_rows: 21455
    })
})

In [None]:
dataset_dict = model.process_dataset(dataset_dict, include_history=include_history)

In [None]:
train_data = dataset_dict['train']
val_data = dataset_dict['val']
test_data = dataset_dict['test']
train_data.set_format('torch')
val_data.set_format('torch')
test_data.set_format('torch')

In [None]:
model.train(train_data, val_data, batch_size=batch_size, learning_rate=learning_rate)

### Saving model

In [None]:
# `s2smodel` has no `include_history` attribute, so the notebook-level flag is used
model.save_pretrained(os.path.join(models_path, f"{model.model_name}{'-history' if include_history else ''}-{seed}"))

### Evaluation

In [None]:
# Use the following to use finetuned weights
# model = s2smodel(os.path.join(models_path, f"bert-tiny-{seed}"))
# model = s2smodel(os.path.join(models_path, f"bert-tiny-history-{seed}"))
# model = s2smodel(os.path.join(models_path, f"distilroberta-base-{seed}"))
# model = s2smodel(os.path.join(models_path, f"distilroberta-base-history-{seed}"))

include_history = True

The following encoder weights were not tied to the decoder ['bert/pooler']
The following encoder weights were not tied to the decoder ['bert/pooler']


In [None]:
batch_size = 16

In [None]:
val_dataset = Dataset.from_dict(df_val)
val_dataset

Dataset({
    features: ['source', 'context', 'question', 'answer', 'history'],
    num_rows: 21455
})

In [None]:
test_dataset = Dataset.from_dict(df_test)
test_dataset

Dataset({
    features: ['source', 'context', 'question', 'answer', 'history'],
    num_rows: 7983
})

In [None]:
dataset_dict = DatasetDict({
    "val": val_dataset,
    "test": test_dataset
})
dataset_dict

DatasetDict({
    val: Dataset({
        features: ['source', 'context', 'question', 'answer', 'history'],
        num_rows: 21455
    })
    test: Dataset({
        features: ['source', 'context', 'question', 'answer', 'history'],
        num_rows: 7983
    })
})

In [None]:
dataset_dict = model.process_dataset(dataset_dict, include_history=include_history)

  0%|          | 0/42 [00:00<?, ?ba/s]

  0%|          | 0/16 [00:00<?, ?ba/s]

In [None]:
val_data = dataset_dict['val']
test_data = dataset_dict['test']
val_data.set_format('torch')
test_data.set_format('torch')

In [None]:
score, y_val_pred = model.evaluate(val_data, batch_size=batch_size)
print(f"[val] SQUAD F1: {score}")

100%|██████████| 1341/1341 [19:14<00:00,  1.16it/s]

[val] SQUAD F1: 0.17001658389956673
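
The 1341 iterations in the progress bar above are consistent with $\lceil 21455 / 16 \rceil$, since `evaluate` walks over the dataset in consecutive index ranges. A small sketch of that batching logic, with `batch_ranges` as a hypothetical helper:

```python
import math


def batch_ranges(n_samples, batch_size):
    """Yield consecutive index ranges covering n_samples; the last one may be shorter."""
    n_batches = math.ceil(n_samples / batch_size)
    for i in range(n_batches):
        yield range(i * batch_size, min((i + 1) * batch_size, n_samples))


val_batches = list(batch_ranges(21455, 16))
print(len(val_batches), len(val_batches[-1]))  # 1341 15
```

The same computation gives 499 batches for the 7983 test rows at batch size 16, and 250 at batch size 32.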





In [None]:
score, y_test_pred = model.evaluate(test_data, batch_size=batch_size)
print(f"[test] SQUAD F1: {score}")

100%|██████████| 499/499 [07:00<00:00,  1.19it/s]

[test] SQUAD F1: 0.17098120847211923
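
To help read these scores, the sketch below re-implements a token-overlap F1 in the style of the official SQuAD evaluation script (lowercasing, stripping punctuation and articles, then comparing bags of tokens). It is an illustration only: `token_f1` is a hypothetical name, and it is an assumption that `squad.compute_f1` from `allennlp_models` behaves identically.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(gold, prediction):
    """Harmonic mean of token precision and recall between two answers."""
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(prediction).split()
    if not gold_tokens or not pred_tokens:
        # If either side is empty, the score is 1 only when both are empty
        return float(gold_tokens == pred_tokens)
    common = collections.Counter(gold_tokens) & collections.Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Silk Road", "yes"))
print(token_f1("the Silk Road", "silk road."))
print(token_f1("It was formally established in 1475", "in 1475"))
```

A prediction with no token in common with the reference scores 0, while a short but correct prediction against a verbose reference is penalised through recall. The dataset-level score reported by `evaluate` is the mean of these per-example values.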





## Task 7 - Error analysis

In this section the top 5 errors per source are reported in order to observe specific patterns in wrong answers and their associated questions.

In [24]:
def get_top_k_errors_per_source(data, predictions, k=5, metric='squad_f1', groupby='source'):
    concat = pd.concat([pd.DataFrame(data), pd.DataFrame(predictions)], axis=1)[[groupby, 'question', 'answer', 'prediction', metric]]
    return concat.sort_values(metric).groupby(groupby).head(k).sort_values([groupby, metric])
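
The selection performed by `get_top_k_errors_per_source` (sort by the metric, then keep the `k` lowest-scoring rows of each group) can also be written without pandas. The following dependency-free sketch shows the same idea on invented rows; `top_k_errors` and the scores are hypothetical.

```python
from collections import defaultdict


def top_k_errors(rows, k=5, metric="squad_f1", groupby="source"):
    """Return, for each group, the k rows with the lowest metric value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row)
    return {
        key: sorted(items, key=lambda row: row[metric])[:k]
        for key, items in sorted(groups.items())
    }


# Invented predictions for two sources.
rows = [
    {"source": "a.txt", "question": "q1", "squad_f1": 0.0},
    {"source": "a.txt", "question": "q2", "squad_f1": 1.0},
    {"source": "a.txt", "question": "q3", "squad_f1": 0.5},
    {"source": "b.txt", "question": "q4", "squad_f1": 0.2},
]

worst = top_k_errors(rows, k=2)
print([row["question"] for row in worst["a.txt"]])  # ['q1', 'q3']
```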

In [25]:
batch_size = 32
m1 = s2smodel(os.path.join(models_path, f"distilroberta-base-history-{seed}"))
include_history_m1 = True
m2 = s2smodel(os.path.join(models_path, f"bert-tiny-history-{seed}"))
include_history_m2 = True

The following encoder weights were not tied to the decoder ['roberta/pooler']

The following encoder weights were not tied to the decoder ['bert/pooler']

In [26]:
test_dataset_m1 = Dataset.from_dict(df_test)
test_data_m1 = m1.process_dataset(test_dataset_m1, include_history=include_history_m1)
test_data_m1.set_format('torch')

  0%|          | 0/16 [00:00<?, ?ba/s]

In [27]:
test_dataset_m2 = Dataset.from_dict(df_test)
test_data_m2 = m2.process_dataset(test_dataset_m2, include_history=include_history_m2)
test_data_m2.set_format('torch')

  0%|          | 0/16 [00:00<?, ?ba/s]

In [28]:
_, m1_pred = m1.evaluate(test_data_m1, batch_size=batch_size)

100%|██████████| 250/250 [04:40<00:00,  1.12s/it]


In [29]:
get_top_k_errors_per_source(test_data_m1, m1_pred)

| | source | question | answer | prediction | squad_f1 |
|---|---|---|---|---|---|
| 7510 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | Did they visit any ancient Chinese sites? | Silk Road | yes | 0.0 |
| 7509 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | And did they climb any mountains? | unknown | yes | 0.0 |
| 7508 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | Did they visit any notable landmarks? | Panathinaiko Stadium | yes | 0.0 |
| 7507 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | How many days was the race? | seven | three years | 0.0 |
| 7506 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | Where did they go after? | Athens | in the Beijing Olympics | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 6641 | middle880.txt | Who suggested making the spot easy to return to? | Bruce | Tommy | 0.0 |
| 6640 | middle880.txt | How come? | there were a lot of fish. | to go to the beach | 0.0 |
| 6639 | middle880.txt | Was the spot they found a good spot to cast th... | Yes | no | 0.0 |
| 6638 | middle880.txt | Did they search all day for a good spot to no ... | No. | yes | 0.0 |


In [31]:
_, m2_pred = m2.evaluate(test_data_m2, batch_size=batch_size)

100%|██████████| 250/250 [03:09<00:00,  1.32it/s]


In [32]:
get_top_k_errors_per_source(test_data_m2, m2_pred)

| | source | question | answer | prediction | squad_f1 |
|---|---|---|---|---|---|
| 7510 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | Did they visit any ancient Chinese sites? | Silk Road | yes | 0.0 |
| 7509 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | And did they climb any mountains? | unknown | the journey of harmony | 0.0 |
| 7508 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | Did they visit any notable landmarks? | Panathinaiko Stadium | yes | 0.0 |
| 7507 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | How many days was the race? | seven | two | 0.0 |
| 7506 | 2008_Summer_Olympics_torch_relay2008_Summer_Ol... | Where did they go after? | Athens | beijing | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 6632 | middle880.txt | Where did they live? | England. | in the lake | 0.0 |
| 6631 | middle880.txt | How many friends were there? | Three | two | 0.0 |
| 6633 | middle880.txt | Did they all live in different cities there? | No | yes | 0.0 |
| 6635 | middle880.txt | What did they do there? | Fished | go to the lake | 0.0 |
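
Many rows in the tables above are closed yes/no questions. One way to quantify how much of the score comes from them is to average the metric separately for yes/no and for all other gold answers. The following is only a minimal sketch: the helper name `mean_score_by_answer_type` is hypothetical, and it assumes each record is a dict exposing the `answer` and `squad_f1` fields shown in the tables. The sample records are copied from the rows above.

```python
from collections import defaultdict


def mean_score_by_answer_type(records, metric="squad_f1"):
    """Average `metric` separately for yes/no and for other gold answers."""
    buckets = defaultdict(list)
    for rec in records:
        # Normalise the gold answer so that "No." and "Yes" are recognised.
        gold = rec["answer"].strip().strip(".").lower()
        kind = "yes/no" if gold in {"yes", "no"} else "other"
        buckets[kind].append(rec[metric])
    return {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}


# Rows taken from the error tables above (all of them have F1 = 0.0).
records = [
    {"answer": "Silk Road", "prediction": "yes", "squad_f1": 0.0},
    {"answer": "Athens", "prediction": "beijing", "squad_f1": 0.0},
    {"answer": "Yes", "prediction": "no", "squad_f1": 0.0},
    {"answer": "No.", "prediction": "yes", "squad_f1": 0.0},
]
print(mean_score_by_answer_type(records))
```

To be informative, the helper should be applied to all test predictions rather than only to the worst ones.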


### Sample predictions

In this section a utility function is defined to inspect the model's performance on a single question.

In [None]:
def sample_predict(model, sample, df, include_history=False):
    """Print context, question, gold answer and model prediction for row `sample` of `df`."""
    row = df.iloc[sample]

    # Wrap the single row in a one-row Dataset to pass it to model.predict
    data = Dataset.from_pandas(pd.DataFrame(row).T)

    print(f"Context: {row.context}")
    if include_history:
        print(f"\nHistory: {row.history}")
    print(f"\n\nQuestion: {row.question}")
    print(f"Answer: {row.answer}")

    pred = model.predict(data, mode='beam', include_history=include_history)

    print(f"Predicted answer: {pred[0]}")
    print(f"F1-score {squad.compute_f1(pred[0], row.answer)}")


In [None]:
sample_predict(model, 8, df_train)

Context: Johnny and his class were looking forward to a fun day in art class. The teacher gave the class paint, brushes and other items to use to make their drawings. Johnny's friend Kevin used a straw to blow paint on his paper. It looked very cool. Lisa used markers to make a picture of her and her dog. Lisa has several pets, but her favorite one is her dog, Ben. Tony used a potato to make stars. He then put the potato into different colors of paint and made a nice pattern. Johnny used feathers to make his picture. When they had finished, the class chose which picture was the best. Johnny got second place and was very excited. Then it was time for lunch and the class had a party. They had hamburgers with ketchup and had cake for dessert. It was a very fun day for the whole class. They all went home tired and happy. Johnny took a nap when he went home.


Question: What was eaten?
Answer: hamburgers with ketchup and  cake
Predicted answer: strawberries
F1-score 0.0
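
To make the score above easier to interpret, here is a minimal sketch of a SQuAD-style token-level F1. It is an illustrative re-implementation, not the `allennlp_models` code used in the notebook, and it assumes the usual normalisation (lowercasing, removing punctuation and the articles a/an/the). The names `normalize_answer` and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either answer is empty, score 1 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts tokens shared by both answers.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# The sample above: no token is shared, so the score is 0.
print(token_f1("strawberries", "hamburgers with ketchup and  cake"))  # 0.0
```

Because the metric only counts overlapping tokens, a short but partially correct answer still receives credit, while a fluent but unrelated answer such as "strawberries" receives none.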


## Conclusions

All the given tasks were implemented. The models that use the dialogue history performed slightly better than their non-history variants. However, the limit on the number of epochs and the lack of adequate hardware to train the models severely limited the results that could be obtained. As a consequence, the error analysis did not highlight any specific pattern: both models fail to answer most questions, and most of the correct predictions come from closed yes/no questions.

Moreover, the method used to include the history (i.e. concatenating history and context) proved somewhat useful, but it is limited by the maximum input length, which is in turn limited by the resources available to train the model. A different approach would have required different architectures and/or multiple networks, which, again, was not allowed.
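
The interaction between history concatenation and the input length limit can be sketched as follows. This is not the preprocessing used in the notebook: the function name `build_input`, the whitespace tokenisation (a stand-in for subword tokens), and the policy of keeping only the most recent turns that still fit are all assumptions made for illustration.

```python
def build_input(question, context, history, max_tokens=64):
    """Prepend as many recent (question, answer) turns as fit before the context.

    `history` is a list of (question, answer) pairs, oldest first.
    Returns the question and the passage (history followed by context).
    """
    q_tokens = question.split()
    budget = max_tokens - len(q_tokens)

    # The context takes priority; it is truncated if it exceeds the budget.
    c_tokens = context.split()[:max(budget, 0)]
    budget -= len(c_tokens)

    # Walk the history from the most recent turn backwards.
    kept = []
    for past_q, past_a in reversed(history):
        turn = f"{past_q} {past_a}".split()
        if len(turn) > budget:
            break
        kept.insert(0, turn)
        budget -= len(turn)

    history_tokens = [tok for turn in kept for tok in turn]
    return " ".join(q_tokens), " ".join(history_tokens + c_tokens)


history = [("Who painted with feathers?", "Johnny"), ("Did he win?", "No")]
context = "Johnny got second place and was very excited."
_, passage = build_input("What place did he get?", context, history, max_tokens=20)
print(passage)
```

With a small budget only the latest turn survives, and with long passages no history fits at all, which is the limitation described above.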

Overall, distilroberta-base performed only slightly better than bert-tiny (w.r.t. SQuAD F1) while using many more parameters and resources.