<center>
    <h1>Assignment 2 - Question Answering with Transformers on CoQA</h1>
    <h2>Natural Language Processing</h2>
    <h3>Antonio Politano, Enrico Pittini, Riccardo Spolaor and Samuele Bortolato</h3>
    <h4>antonio.politano2@studio.unibo.it, enrico.pittini@studio.unibo.it, riccardo.spolaor@studio.unibo.it, samuele.bortolato@studio.unibo.it</h4>
</center>



---



Assignment description: see `Assignment.ipynb`.

This notebook addresses the question answering (QA) task.

For more detailed information about the functions used, see the docstrings in the Python files inside the `utils` folder.

In [None]:
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch

In [None]:
# Settings for autoreloading

%load_ext autoreload
%autoreload 2

In [None]:
# Settings for reproducibility
from utils.seeder import set_random_seed

set_random_seed(42)
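
The body of `set_random_seed` lives in `utils/seeder.py` and is not shown in this notebook. As a rough idea of what such a helper usually does, here is a minimal sketch; the name `seed_everything` is hypothetical, and the set of libraries it seeds is an assumption rather than the actual implementation.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the usual sources of randomness (hypothetical helper, not the one in utils.seeder)."""
    # Only affects hash randomization of Python processes started after this point
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is optional in this sketch


# Re-seeding with the same value reproduces the same random draws
seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Seeding alone may not make GPU training fully deterministic, since some CUDA kernels are non-deterministic by default.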

# [Task 1] Remove unanswerable QA pairs

## 1.1 Dataset download

The dataset is downloaded and saved in the `coqa` folder using the snippet of code provided in `Assignment.ipynb`.

In [None]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [None]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

## 1.2 Dataframe Creation

The train and test dataframes (`train_df` and `test_df`) are built. Each row contains a specific question and its answer, along with their chronological position (`turn_id`) in the conversation. Each row also contains the passage that provides the context and the history of the previous questions and answers of the same conversation.

In [None]:
from utils.dataframe_builder import get_dataframe

train_df = get_dataframe(os.path.join('coqa', 'train.json'))
test_df = get_dataframe(os.path.join('coqa', 'test.json'))
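
The implementation of `get_dataframe` is in `utils/dataframe_builder.py` and is not reproduced here. The sketch below only illustrates the idea of flattening conversations into one row per question-answer pair while accumulating the history. The function name `flatten_conversations` and the toy data are hypothetical, and the JSON field names (`data`, `questions`, `answers`, `input_text`, `turn_id`) are assumed to follow the CoQA release.

```python
from typing import Any, Dict, List


def flatten_conversations(coqa: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn CoQA-style conversations into one row per question-answer pair."""
    rows = []
    for conversation in coqa["data"]:
        history: List[str] = []  # [q1, a1, q2, a2, ...] of the previous turns
        for question, answer in zip(conversation["questions"], conversation["answers"]):
            rows.append({
                "id": conversation["id"],
                "source": conversation["source"],
                "story": conversation["story"],
                "turn_id": question["turn_id"],
                "question": question["input_text"],
                "answer": answer["input_text"],
                "history": list(history),  # copy, so later turns do not alter it
            })
            history += [question["input_text"], answer["input_text"]]
    return rows


# Toy conversation with two turns (made-up data)
toy_coqa = {"data": [{
    "id": "conv-0", "source": "toy", "story": "Anna has a red hat.",
    "questions": [{"turn_id": 1, "input_text": "Who has a hat?"},
                  {"turn_id": 2, "input_text": "What colour is it?"}],
    "answers": [{"turn_id": 1, "input_text": "Anna"},
                {"turn_id": 2, "input_text": "red"}],
}]}
rows = flatten_conversations(toy_coqa)
print(rows[1]["history"])
```

The resulting list of dictionaries could be passed to `pd.DataFrame(rows)` to obtain a dataframe with one row per turn.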


## 1.3 Data Inspection

The heads of the train and test dataframes are shown below along with their shapes.

In [None]:
print(f'Train dataframe shape: {train_df.shape}')
train_df.head()

In [None]:
print(f'Test dataframe shape: {test_df.shape}')
test_df.head()

The training dataframe contains $15$ features, while the test dataframe has only $14$. In particular, the train dataframe includes the additional features `question_bad_turn` and `answer_bad_turn`. A quick inspection of the dataframe's head shows that they contain `NaN` values. Since the task only requires removing unanswerable question-answer pairs and does not mention how to handle "bad turns", these two features are dropped.

On the other hand, the test dataframe contains the extra feature `additional_answers`, which can be removed as stated in the assignment specifications.

In [None]:
# Drop non-matching features

train_df.drop(['question_bad_turn', 'answer_bad_turn'], axis=1, inplace=True)
test_df.drop('additional_answers', axis=1, inplace=True)

In addition, the features `name` and `filename` are removed since they are not needed for the task.

In [None]:
# Drop useless columns (`name`, `filename`)

train_df.drop(['name', 'filename'], axis=1, inplace=True)
test_df.drop(['name', 'filename'], axis=1, inplace=True)

Next, inspecting `question_turn_id` and `answer_turn_id` shows that they are identical, since they refer to the same question-answer pair; hence they can be merged into a single feature (`turn_id`).

In [None]:
# Assert that the turn ids of the questions are the same as the respective answers

assert train_df['question_turn_id'].equals(train_df['answer_turn_id']), \
    'Question and answer turn ids are different in the train dataset'
    
assert test_df['question_turn_id'].equals(test_df['answer_turn_id']), \
    'Question and answer turn ids are different in the test dataset'

In [None]:
# Merge `question_turn_id` and `answer_turn_id` into a single `turn_id` column since they are equal
def refactor_turn_id_columns(df):
    return df.drop('question_turn_id', axis=1).rename(columns={'answer_turn_id': 'turn_id'})
    
train_df = refactor_turn_id_columns(train_df)
test_df = refactor_turn_id_columns(test_df)

Finally, the columns `answer_input_text` and `question_input_text` are renamed to `answer` and `question` respectively, for simplicity.

In [None]:
# Rename columns `answer_input_text` and `question_input_text` into `answer` and `question` respectively
column_renames = {'answer_input_text': 'answer', 'question_input_text': 'question'}

train_df.rename(columns=column_renames, inplace=True)
test_df.rename(columns=column_renames, inplace=True)

The dataframes now have the same number of columns and contain no null values.

In [None]:
print(f'Train dataframe shape after the unwanted columns drop: {train_df.shape}')
print(f'Test dataframe shape after the unwanted column drop: {test_df.shape}')

In [None]:
print(f'Null values in the train dataframe: {train_df.isna().sum().sum()}.')
print(f'Null values in the test dataframe: {test_df.isna().sum().sum()}.')

In [None]:
train_df.head()

In [None]:
test_df.head()

## 1.4 Remove Unanswerable Question-Answer Pairs

As required by the task, the unanswerable question-answer pairs are removed from the dataset by dropping the rows of the dataframes where the feature `answer` is equal to "unknown".

In [None]:
# Delete rows with unknown answer

train_df.drop(train_df[train_df['answer'] == 'unknown'].index, inplace=True)
train_df.reset_index(drop=True, inplace=True)

test_df.drop(test_df[test_df['answer'] == 'unknown'].index, inplace=True)
test_df.reset_index(drop=True, inplace=True)
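
The same filtering can be written with a boolean mask. The snippet below is a minimal sketch on a made-up dataframe (`toy_df` and its contents are hypothetical) that shows the effect of the operation, including the index reset.

```python
import pandas as pd

# Made-up question-answer pairs; the second one is unanswerable
toy_df = pd.DataFrame({
    "question": ["Who won?", "What colour was the hat?", "Where?"],
    "answer": ["Anna", "unknown", "Paris"],
})

# Keep only the rows whose answer is not exactly "unknown", then rebuild the index
answerable_df = toy_df[toy_df["answer"] != "unknown"].reset_index(drop=True)
print(answerable_df)
```

Note that the comparison is an exact string match, so variants such as "Unknown" would not be removed by it.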


In [None]:
print(f'Train dataframe shape after the unanswerable question-answer pairs are removed: {train_df.shape}')
print(f'Test dataframe shape after the unanswerable question-answer pairs are removed: {test_df.shape}')

The next cell asserts that the history was properly built for each question-answer pair.

In [None]:
def check_history(df: pd.DataFrame, dataframe_name: str = None):
    """Check that the history is properly built for each Question-Answer pair in each row of the dataframe.

    Parameters
    ----------
    df : DataFrame
        The dataframe on which the history is checked.
    dataframe_name : str, optional
        The name of the dataframe. Defaults to None.
    """
    prev_doc = None
    prev_hist = []
    prev_question = None
    prev_answer = None
    for d, h, q, a in zip(df['id'], df['history'], df['question'], df['answer']):
        if d != prev_doc:
            assert len(h) == 0, 'Error: Initial history of a new conversation is not empty!'
            prev_doc = d
            prev_hist = []
            prev_question = q
            prev_answer = a
        else:
            assert prev_hist + [prev_question, prev_answer] == h, 'Error: The history was not computed properly!'
            prev_question = q
            prev_answer = a
            prev_hist = h

    print(f'The history{f" of {dataframe_name} dataframe" if dataframe_name is not None else ""}', 
          'was properly built for each Question-Answer pair.')

check_history(train_df, 'train')
check_history(test_df, 'test')

## 1.5 Data Analysis

In this section some interesting analyses on the training set are carried out.

In [None]:
# Group dataframe by `id`
grouped_train_df = train_df.groupby(by=['id'])

In [None]:
print(f'Number of train passages: {len(grouped_train_df)}')

In [None]:
from utils.dataset_analisys import *
plot_converstion_length_distribution(grouped_train_df)

In [None]:
plot_passage_length_analysis(grouped_train_df.story.unique())

In [None]:
plot_answer_span_text_percentile(train_df)

## [Task 2] Train, Validation and Test splits

In this section the train dataframe is split into a training dataframe and a validation dataframe.

The split is performed as follows:
1. The random seed is set to $42$ for reproducibility purposes.
2. The training split receives a proportion of $0.8$ of the conversations in the original dataset.
3. The train dataframe is shuffled and divided into the two new dataframes making sure that no conversation is split among them.
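
The steps above can be sketched without any library support. The function below is a simplified illustration of a group-aware split, not the `GroupShuffleSplit` algorithm itself; the name `group_split_indices` and the toy conversation ids are hypothetical, and the rounding of the train size is an assumption.

```python
import random
from typing import List, Sequence, Tuple


def group_split_indices(groups: Sequence[str], train_size: float = 0.8,
                        seed: int = 42) -> Tuple[List[int], List[int]]:
    """Split row indices so that all rows of a group end up in the same split."""
    unique_groups = sorted(set(groups))   # sorted for a reproducible starting order
    rng = random.Random(seed)
    rng.shuffle(unique_groups)            # shuffle conversations, not single rows
    n_train = int(len(unique_groups) * train_size)
    train_groups = set(unique_groups[:n_train])
    # Iterating in the original order keeps the rows of a conversation chronological
    train_idx = [i for i, g in enumerate(groups) if g in train_groups]
    val_idx = [i for i, g in enumerate(groups) if g not in train_groups]
    return train_idx, val_idx


# One id per row; rows of the same conversation share the same id
conversation_ids = ["a", "a", "b", "b", "b", "c", "d", "d", "e"]
train_idx, val_idx = group_split_indices(conversation_ids)
train_ids = {conversation_ids[i] for i in train_idx}
val_ids = {conversation_ids[i] for i in val_idx}
print(train_ids.isdisjoint(val_ids))
```

Because the proportion applies to conversations rather than rows, the ratio of question-answer pairs in each split is only approximately $0.8$ to $0.2$.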

In [None]:
set_random_seed(42)

In [None]:
from sklearn.model_selection import GroupShuffleSplit
from typing import Tuple

def train_validation_split(df: pd.DataFrame, train_size: float = .8, random_seed: int = 42) \
    -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Get train and validation dataframes by splitting an original dataframe according to a given proportion
    and a specific random seed.
    
    Note: The order of the rows of the same conversation is preserved. Moreover, the conversations are never
    split across the two resulting dataframes.

    Parameters
    ----------
    df : DataFrame
        The dataframe from which the train and validation dataframes are obtained.
    train_size : float, optional
        The proportion of conversations assigned to the train split. Defaults to 0.8.
    random_seed : int, optional
        The random seed for the shuffle. Defaults to 42.

    Returns
    -------
    Tuple[pd.DataFrame, pd.DataFrame]
        The train and validation dataframes, each with a fresh index.
    """
    # Only the first split is used, hence a single split is requested
    group_shuffle_split = GroupShuffleSplit(n_splits=1, train_size=train_size, random_state=random_seed)
    # The returned indices are positional, hence `iloc` is used
    train_ix, val_ix = next(group_shuffle_split.split(df, groups=df['id']))

    train_df = df.iloc[train_ix].reset_index(drop=True)
    val_df = df.iloc[val_ix].reset_index(drop=True)

    return train_df, val_df

In [None]:
train_df, val_df = train_validation_split(train_df)

The tail of the obtained train dataframe (`train_df`) and the head of the validation dataframe (`val_df`) are shown below to check that the conversations are not split and that their question-answer pairs are still chronologically ordered.

In [None]:
print(f'Train dataframe shape after the split: {train_df.shape}')
train_df.tail()

In [None]:
print(f'Validation dataframe shape after the split: {val_df.shape}')
val_df.head()

In [None]:
print(f'Train passages count: {len(train_df.groupby(by=["id"]))}')
print(f'Validation passages count: {len(val_df.groupby(by=["id"]))}')

print()

len_tot=len(train_df)+len(val_df)
print(f'Train QaA count: {len(train_df)} \t\t Train QaA ratio: {len(train_df)/len_tot:.2f}')
print(f'Validation QaA count: {len(val_df)} \t Validation QaA ratio: {len(val_df)/len_tot:.2f}')

In addition, the train, validation and test dataloaders are provided for future training purposes.

In [None]:
from utils.dataloader_builder import get_dataloader

train_dataloader = get_dataloader(train_df, batch_size=8)
val_dataloader = get_dataloader(val_df)
test_dataloader = get_dataloader(test_df)

## [Task 3] Model definition

In [None]:
#model_name = 'distilroberta-base'          # distil-roberta pretrained model
model_name = 'prajjwal1/bert-tiny'          # tiny-bert pretrained model

use_history=False
seed = 1337

folder='weigths'
# Build the path with `os.path.join` to avoid invalid escape sequences and OS-specific separators
folder_name = os.path.join(folder, 'PQH' if use_history else 'PQ', f'seed{seed}')

In [None]:
from models.model import Model 

set_random_seed(seed)
model = Model(model_name=model_name, device='cuda')

## [Task 4] Question generation with text passage $P$ and question $Q$

In [None]:
i=5
question_sample = [train_df.iloc[i]['question']]
passage_sample = [train_df.iloc[i]['story']]
answer_sample = train_df.iloc[i]['answer']

print(f'Question sample: "{question_sample[0]}"')
print()
print(f'Predicted answer by the model: "{model.generate(passage_sample, question_sample)[0]}"')
print()
print(f'True answer: "{answer_sample}"')

## [Task 5] Question generation with text passage $P$, question $Q$ and dialogue history $H$

In [None]:
question_sample = [train_df.iloc[5]['question']]
passage_sample = [train_df.iloc[5]['story']]
answer_sample = train_df.iloc[5]['answer']
history_sample = [' <sep> '.join(train_df.iloc[5]['history'])]

print(f'Question sample: "{question_sample[0]}"')
print()
print(f'Predicted answer by the model: "{model.generate(passage_sample, question_sample, history=history_sample)[0]}"')
print()
print(f'True answer: "{answer_sample}"')

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

In [None]:
checkpoint_path = os.path.join(folder_name, f"{model_name.replace('/', '_')}.pth")

try:
    checkpoint = torch.load(checkpoint_path)
    loss_history = checkpoint['loss_history']
    val_loss_history = checkpoint['val_loss_history']
    opt_state_dict = checkpoint['opt_state_dict']
    model.load_state_dict(checkpoint['model_state_dict'])
    print('Loaded saved files')
except Exception as error:  # e.g. missing file or missing checkpoint key
    loss_history = None
    val_loss_history = None
    opt_state_dict = None
    print(f'Unable to load saved files ({error}), default initialization')

In [None]:
from utils.training import train_1
set_random_seed(seed)
train_1(train_dataloader=train_dataloader, val_dataloader=val_dataloader, epochs=3,
        model=model, use_history=use_history, folder_name=folder_name,
        #opt_state_dict = opt_state_dict, 
        loss_history = list(loss_history) if loss_history is not None else None,
        val_loss_history = list(val_loss_history) if val_loss_history is not None else None,
        steps_validate=0.33, steps_save=0.01, device='cuda')

In [None]:
import matplotlib.pyplot as plt

checkpoint = torch.load(os.path.join(folder_name, f"{model_name.replace('/', '_')}.pth"))
lh = checkpoint['loss_history']
vlh = checkpoint['val_loss_history']

N = 100  # Window size of the moving average

plt.figure(figsize=(15, 12))
# Top row: linear scale; bottom row: log scale. Columns: the two columns of the loss history.
for row, scale in enumerate(['linear', 'log']):
    for col in range(2):
        plt.subplot(2, 2, 2 * row + col + 1)
        plt.plot(lh[:, col])
        plt.plot(np.convolve(lh[:, col], np.ones(N) / N, mode='valid'))
        if len(vlh) > 0:
            # The first column of the validation history is used as the x axis
            plt.plot(vlh[:, 0], vlh[:, col + 1], 'r*')
        plt.yscale(scale)

In [None]:
from utils.squad import validate
f1_squad = validate(model, val_dataloader, use_history=False)
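
The function `validate` relies on the helpers in `utils/squad.py`, which are not shown in this notebook. For reference, the sketch below follows the usual SQuAD-style token-level F1 (lowercasing, removal of punctuation and English articles, then token overlap). The names `normalize_answer` and `squad_f1` are hypothetical, and the project's `_compute_squad_f1` may differ in its details.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase the text and remove punctuation, articles and extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(gold: str, predicted: str) -> float:
    """Token-level F1 between a gold answer and a predicted answer."""
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(predicted).split()
    if not gold_tokens or not pred_tokens:
        # If either side is empty, the score is 1 only when both are empty
        return float(gold_tokens == pred_tokens)
    common = collections.Counter(gold_tokens) & collections.Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(squad_f1("The cat.", "cat"))          # articles and punctuation are ignored
print(squad_f1("red blue", "blue green"))   # one shared token out of two on each side
```

A corpus-level score is then typically obtained by averaging the per-sample F1 values over the evaluated set.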

## [Task 7] Error Analysis

Load weights

In [None]:
#model_name = 'distilroberta-base'          # distil-roberta pretrained model
model_name = 'prajjwal1/bert-tiny'          # tiny-bert pretrained model

use_history=False
seed = 42

In [None]:
# Build the path with `os.path.join` to avoid invalid escape sequences and OS-specific separators
folder_name = os.path.join('weigths', 'PQH' if use_history else 'PQ', f'seed{seed}')

set_random_seed(seed)
model = Model(model_name=model_name, device='cuda')

model.load_state_dict(torch.load(f"{folder_name}/{model_name.replace('/','_')}.pt"))

In [None]:
from utils.squad import _compute_squad_f1
from tqdm import tqdm

def get_worst_answers(model: Model, df_source, use_history: bool = False, k=5, min_answer_length=1):
    """Get the `k` samples of `df_source` with the lowest SQuAD F1 score.

    Each result is a tuple (f1, question, passage, history, gold_answer, pred_answer).
    Predictions with fewer than `min_answer_length` words are ignored.
    """
    worst_answers = []

    torch.cuda.empty_cache()

    source_dataloader = get_dataloader(df=df_source, batch_size=16)

    for data in tqdm(source_dataloader):
        with torch.no_grad():
            # Each batch is a pair (inputs, labels)
            (passage, question, history), (answer, _, _) = data

            pred = model.generate(passage, question, history if use_history else None)

        if min_answer_length > 1:
            # Keep only the samples whose prediction is long enough
            mask = np.array([len(predicted.split(' ')) >= min_answer_length for predicted in pred], dtype=bool)
            passage = np.array(passage)[mask]
            question = np.array(question)[mask]
            history = np.array(history)[mask]
            answer = np.array(answer)[mask]
            pred = np.array(pred)[mask]

        f1_scores = np.array([_compute_squad_f1(gold, predicted) for gold, predicted in zip(answer, pred)])
        samples_indices = np.argsort(f1_scores)[:k]

        worst_answers += [(f1_scores[sample_idx], question[sample_idx], passage[sample_idx], history[sample_idx],
                           answer[sample_idx], pred[sample_idx])
                          for sample_idx in samples_indices]
        # Keep only the k lowest scores seen so far
        worst_answers = sorted(worst_answers, key=lambda sample: sample[0])[:k]

    return worst_answers

In [None]:
it = iter(test_df.groupby(by=['source']))
next(it)
#next(it)
df_source = next(it)[1]

In [None]:
df_source.info()

In [None]:
source_dataloader = get_dataloader(df=df_source, batch_size=16)

In [None]:
len(source_dataloader)

In [None]:
a = get_worst_answers(model, df_source)

In [None]:
a
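
`get_worst_answers` keeps a running list of the `k` lowest-scoring samples by merging each batch into the list and truncating it. The same bookkeeping can be sketched in isolation with `heapq.nsmallest`; the function `k_lowest` and the toy scores below are hypothetical and only illustrate the idea.

```python
import heapq
from typing import Iterable, List, Tuple

Scored = Tuple[float, str]  # (score, sample description)


def k_lowest(batches: Iterable[List[Scored]], k: int = 5) -> List[Scored]:
    """Return the k samples with the lowest score across all batches."""
    worst: List[Scored] = []
    for batch in batches:
        # Merge the running list with the new batch and keep the k lowest scores
        worst = heapq.nsmallest(k, worst + batch, key=lambda item: item[0])
    return worst


# Made-up scores split into two batches
toy_batches = [[(0.9, "a"), (0.1, "b")], [(0.5, "c"), (0.0, "d")]]
lowest = k_lowest(toy_batches, k=2)
print(lowest)
```

Only `k` items are retained between batches, so the memory used does not grow with the size of the dataset.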

In [None]:
it = iter(train_df.groupby(by=['source']))
next(it)
df_source = next(it)[1]

In [None]:
#a = get_worst_answers(M2, df_source)

In [None]:
#a

In [None]:
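def show_token_importances(model, question, passage, span_start, span_end, history=None):
    """Plot the token importances computed by the model along with the answer span."""
    token_importances = model.compute_token_importances(passage, question, span_start, span_end, history)

    # Indicator of the answer span
    y = np.zeros(shape=(token_importances.shape[1],))
    y[span_start : span_end] = 1

    plt.plot(token_importances.cpu().detach()[0], label='Token importance')
    plt.plot(y, label='Answer span')
    plt.legend()
    plt.show()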

In [None]:
r = test_df.iloc[0]
show_token_importances(model, r['question'], r['story'], r['answer_span_start'],  r['answer_span_end'])