# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

### Task

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) free-form text or (ii) unanswerable.

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We will experiment with transformer-based architectures to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model to be defined, with parameters $\theta$.

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair comes with a rationale $R$: a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]",   % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,     % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```
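As a minimal sketch of how the fields above fit together, the following pairs each question with its answer through `turn_id` and recovers the rationale by slicing the story with the span indexes. The `dialogue` dictionary is a hand-made toy built from the snippet (the story is cut after its first sentence), and `iter_turns` is a hypothetical helper name, not part of the dataset tooling.

```python
# Toy dialogue built from the snippet above (story cut after its first sentence).
dialogue = {
    "story": "Once upon a time, in a barn near a farm house, "
             "there lived a little white kitten named Cotton.",
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1}],
    "answers": [{
        "span_start": 59,
        "span_end": 93,
        "span_text": "a little white kitten named Cotton",
        "input_text": "white",
        "turn_id": 1,
    }],
}


def iter_turns(dialogue):
    """Yield (turn_id, question, answer, rationale) for each dialogue turn."""
    answers_by_turn = {a["turn_id"]: a for a in dialogue["answers"]}
    for q in dialogue["questions"]:
        a = answers_by_turn[q["turn_id"]]
        # The rationale R is the character span [span_start, span_end) of the story.
        rationale = dialogue["story"][a["span_start"]:a["span_end"]]
        yield q["turn_id"], q["input_text"], a["input_text"], rationale


turns = list(iter_turns(dialogue))
print(turns)
```

Here the sliced rationale matches `span_text` exactly; on the full dataset it is worth checking that this holds before relying on the indexes.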

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one ground-truth answer $A$ and text rationale $R$.

Only 1.3% of CoQA questions are unanswerable. For simplicity, we **ignore** those QA pairs.

## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.
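One possible way to filter at the level of single QA pairs is sketched below. It assumes that an unanswerable turn is marked by a `span_text` equal to `unknown`, which is the same criterion used by the check later in this notebook; `remove_unanswerable` and the toy dialogue are hypothetical and only illustrate the idea.

```python
UNANSWERABLE = "unknown"  # assumed marker, same criterion as the check later in this notebook


def remove_unanswerable(dialogue):
    """Return a copy of `dialogue` without the turns whose rationale is 'unknown'."""
    bad_turns = {
        a["turn_id"] for a in dialogue["answers"]
        if a["span_text"].strip().lower() == UNANSWERABLE
    }
    cleaned = dict(dialogue)  # shallow copy: the original dialogue is left untouched
    cleaned["questions"] = [q for q in dialogue["questions"] if q["turn_id"] not in bad_turns]
    cleaned["answers"] = [a for a in dialogue["answers"] if a["turn_id"] not in bad_turns]
    return cleaned


# Toy dialogue: the second turn is made up to show an unanswerable question.
toy = {
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1},
                  {"input_text": "Who is her vet?", "turn_id": 2}],
    "answers": [{"span_text": "a little white kitten", "input_text": "white", "turn_id": 1},
                {"span_text": "unknown", "input_text": "unknown", "turn_id": 2}],
}
cleaned = remove_unanswerable(toy)
print(len(cleaned["questions"]), len(cleaned["answers"]))
```

Filtering turns rather than whole dialogues keeps the answerable questions of a dialogue available for training.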

## Dataset Download


In [1]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [2]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

In [3]:
# SB Added Code starts here

# Import libraries
import json
import pandas as pd
import random

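# Store the indexes of dialogues flagged as unknown, to exclude them when building the training, validation and test datasets
# Note: only the first answer of each dialogue is checked, so unanswerable questions in later turns are not detected here
cnt_un_train = 0
idx_un_train = []
cnt_un_test = 0
idx_un_test = []

# Training (the context manager closes the file once the JSON is loaded)
with open('./coqa/train.json') as train_data_json:
  train_data = json.load(train_data_json)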

for i in range(0,len(train_data['data'])):
  if (train_data['data'][i]['answers'][0]['span_text'] == 'unknown'):
    cnt_un_train += 1
    idx_un_train.append(i)  
percentage = round((100 * (cnt_un_train / len(train_data['data']))),2)
print("Training unknown answers",cnt_un_train,"over a total of",len(train_data['data']),":",percentage," %")

# Test (the context manager closes the file once the JSON is loaded)
with open('./coqa/test.json') as test_data_json:
  test_data = json.load(test_data_json)

for i in range(0,len(test_data['data'])):
  if (test_data['data'][i]['answers'][0]['span_text'] == 'unknown'):
    cnt_un_test += 1
    idx_un_test.append(i)  
percentage = round((100 * (cnt_un_test / len(test_data['data']))),2)

print("Test unknown answers",cnt_un_test,"over a total of",len(test_data['data']),":",percentage," %")

Training unknown answers 54 over a total of 7199 : 0.75  %
Test unknown answers 2 over a total of 500 : 0.4  %


#### Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the tasks' inputs and outputs!

In [4]:
df_train_mockup = pd.read_json('./coqa/train.json')
print("Training Dataframe shape\n",df_train_mockup.shape)
print("\nExample of one dataframe row:\n",df_train_mockup.iloc[2])
# Focusing on 'data' column
print("\nData dictionary keys\n",df_train_mockup['data'].iloc[2].keys(),'\n')
print("\nStory\n",df_train_mockup['data'].iloc[2]['story'],'\n')
print("\nQuestions\n",df_train_mockup['data'].iloc[2]['questions'],'\n')
print("\nAnswers\n",df_train_mockup['data'].iloc[2]['answers'],'\n')
df_train_mockup.head(10)

Training Dataframe shape
 (7199, 2)

Example of one dataframe row:
 version                                                    1
data       {'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn...
Name: 2, dtype: object

Data dictionary keys
 dict_keys(['source', 'id', 'filename', 'story', 'questions', 'answers', 'name']) 


Story
 CHAPTER VII. THE DAUGHTER OF WITHERSTEEN 

"Lassiter, will you be my rider?" Jane had asked him. 

"I reckon so," he had replied. 

Few as the words were, Jane knew how infinitely much they implied. She wanted him to take charge of her cattle and horse and ranges, and save them if that were possible. Yet, though she could not have spoken aloud all she meant, she was perfectly honest with herself. Whatever the price to be paid, she must keep Lassiter close to her; she must shield from him the man who had led Milly Erne to Cottonwoods. In her fear she so controlled her mind that she did not whisper this Mormon's name to her own soul, she did not even think it. Beside

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
5,1,"{'source': 'race', 'id': '3ftf2t8wlri896r0rn6x..."
6,1,"{'source': 'cnn', 'id': '3qemnnsb2xz5mh3gvv3nj..."
7,1,"{'source': 'race', 'id': '369j354ofdapu1z2ebz3..."
8,1,"{'source': 'race', 'id': '3v0z7ywsiy0kux6wg4mm..."
9,1,"{'source': 'wikipedia', 'id': '3v5q80fxixr0io4..."


## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data into train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
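The requirements above can be sketched with a seeded shuffle over whole dialogues, so that every QA pair follows its dialogue into a single split. `split_dialogues` and the toy list are hypothetical; on the real data the input would be the list of dialogues loaded from the train JSON file.

```python
import random


def split_dialogues(dialogues, train_ratio=0.8, seed=42):
    """Shuffle dialogues with a fixed seed and split them at dialogue level."""
    rng = random.Random(seed)   # local generator: the global random state is untouched
    shuffled = list(dialogues)  # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the list of dialogues: each item is one whole dialogue.
toy_dialogues = [{"id": f"dialogue-{i}"} for i in range(10)]
train_part, val_part = split_dialogues(toy_dialogues)
print(len(train_part), len(val_part))
```

Shuffling once and cutting the list avoids repeated random draws and guarantees that the two splits are disjoint.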

#### Reproducibility Memo

Check Tutorial 2 again to see how to fix a specific random seed for reproducibility!
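The recipe from Tutorial 2 is not reproduced in this notebook, so the following is only a generic sketch of seed fixing and may differ from it. `set_seed` is a hypothetical helper; NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def set_seed(seed=42):
    """Seed the random number generators that are available in this environment."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects newly started Python processes
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when no GPU is available
    except ImportError:
        pass  # PyTorch not installed: nothing to seed


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)
```

Calling the helper at the start of every run (and before every split) makes the random draws repeatable; full determinism on GPU may need further framework-specific settings.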

In [5]:
# Generate training and validation dialogues and the corresponding Q/A dataset splits according to an 80/20 ratio,
# excluding dialogues flagged as unanswerable
# To avoid introducing ordering bias, select the dialogues randomly
# Selection starts from stories, and then the corresponding Q/A are selected to ensure the split between training and validation is done at dialogue level

# Initialization
random.seed(42)
train_split = 0.8
val_split = 1-train_split
train_idx = []
train_dialogue_story = {}
train_dialogue_qa = {}
train_val_tot = len(train_data['data'])
train_val_net_tot = train_val_tot - cnt_un_train
train_rec_tot = int(train_split*train_val_net_tot)
cnt = 0
cnt_q = 0
end_of_train = False

# Iterate over the training data, randomly selecting 80% of the dialogues and storing them in the training story dictionary,
# provided they are not flagged as unanswerable and are not already present.
# Once a dialogue is stored in the stories dictionary, insert the corresponding questions in the QA training dictionary.
# Q/A are linked to the corresponding dialogues by storing the story ID.

# Training
while not end_of_train:
  # random.randint is inclusive on both ends, so idx can equal train_val_tot;
  # the bounds check below skips that out-of-range draw without altering the seeded sequence
  idx = random.randint(0,train_val_tot)
  if ((idx < train_val_tot) and (idx not in idx_un_train) and (idx not in train_idx)):
    train_dialogue_story[cnt] = [train_data['data'][idx]['id'],train_data['data'][idx]['story']]
    for x in range(0,len(train_data['data'][idx]['questions'])):
      train_dialogue_qa[cnt_q] = {'id':x,'question':train_data['data'][idx]['questions'][x],'answer':train_data['data'][idx]['answers'][x],'ref_id':train_data['data'][idx]['id'],'x_id':cnt}
      cnt_q += 1
    cnt += 1
    train_idx.append(idx)
  if cnt == train_rec_tot: end_of_train = True

# Validation
val_dialogue_story = {}
val_dialogue_qa = {}
cnt = 0
cnt_q = 0

for idx in range(0,train_val_tot):
  if ((idx not in idx_un_train) and (idx not in train_idx)):
    val_dialogue_story[cnt] = [train_data['data'][idx]['id'],train_data['data'][idx]['story']]
    for x in range(0,len(train_data['data'][idx]['questions'])):
      val_dialogue_qa[cnt_q] = {'id':x,'question':train_data['data'][idx]['questions'][x],'answer':train_data['data'][idx]['answers'][x],'ref_id':train_data['data'][idx]['id'],'x_id':cnt}
      cnt_q += 1
    cnt += 1

# Test
test_tot = len(test_data['data'])
test_dialogue_story = {}
test_dialogue_qa = {}
cnt = 0
cnt_q = 0

for idx in range(0,test_tot):
  if (idx not in idx_un_test):
    test_dialogue_story[cnt] = [test_data['data'][idx]['id'],test_data['data'][idx]['story']]
    for x in range(0,len(test_data['data'][idx]['questions'])):
      test_dialogue_qa[cnt_q] = {'id':x,'question':test_data['data'][idx]['questions'][x],'answer':test_data['data'][idx]['answers'][x],'ref_id':test_data['data'][idx]['id'],'x_id':cnt}
      cnt_q += 1
    cnt += 1


In [6]:
# Analysis of the training data
print("Analysis of the training data")
print("TOT: Total dialogues in the original training repository: ",train_val_tot)
print("UKN: Unanswerable dialogues in the original training repository: ",cnt_un_train)
print("TOT - UKN: Dialogues to be split between training and validation: ",(train_val_tot-cnt_un_train))
# Training dataset
print("\nTraining dataset")
print("Expected training dialogues ",round((train_split*100),0),"% of (TOT - UKN):",int(train_split*train_val_net_tot))
print("Actual training dialogues: ",len(train_dialogue_story))
print("Actual training q/a pairs: ",len(train_dialogue_qa))
# Validation dataset
print("\nValidation dataset")
print("Expected validation dialogues: ",(train_val_tot-cnt_un_train)-len(train_dialogue_story))
print("Actual validation dialogues",len(val_dialogue_story))
print("Actual validation q/a pairs",len(val_dialogue_qa))

# Analysis of the test data
print("\nTest dataset")
print("TOT: Total dialogues in the original test repository: ",test_tot)
print("UKN: Unanswerable dialogues in the original test repository: ",cnt_un_test)
print("TOT - UKN: Dialogues kept in the test set: ",(test_tot-cnt_un_test))
# Test dataset
print("Expected test dialogues (TOT - UKN):",(test_tot-cnt_un_test))
print("Actual test dialogues: ",len(test_dialogue_story))
print("Actual test q/a pairs: ",len(test_dialogue_qa))

Analysis of the training data
TOT: Total dialogues in the original training repository:  7199
UKN: Unanswerable dialogues in the original training repository:  54
TOT - UKN: Dialogues to be split between training and validation:  7145

Training dataset
Expected training dialogues  80.0 % of (TOT - UKN): 5716
Actual training dialogues:  5716
Actual training q/a pairs:  86618

Validation dataset
Expected validation dialogues:  1429
Actual validation dialogues 1429
Actual validation q/a pairs 21357

Test dataset
TOT: Total dialogues in the original test repository:  500
UKN: Unanswerable dialogues in the original test repository:  2
TOT - UKN: Dialogues kept in the test set:  498
Expected test dialogues (TOT - UKN): 498
Actual test dialogues:  498
Actual test q/a pairs:  7963


In [7]:
# Creation of dataframes only once pre-processing is completed for performance reasons
df_train_dialogue_story = pd.DataFrame.from_dict(data=train_dialogue_story,orient='index',columns=['id','story'])
df_train_dialogue_qa = pd.DataFrame.from_dict(data=train_dialogue_qa,orient='index',columns=['id','question','answer','ref_id','x_id'])
df_val_dialogue_story = pd.DataFrame.from_dict(data=val_dialogue_story,orient='index',columns=['id','story'])
df_val_dialogue_qa = pd.DataFrame.from_dict(data=val_dialogue_qa,orient='index',columns=['id','question','answer','ref_id','x_id'])


In [8]:
# Training dialogue dataframe
df_train_dialogue_story.head()

Unnamed: 0,id,story
0,37m28k1j0qd08516cu1iw1wrtq5jaa,"HYANNIS, Massachusetts (CNN) -- Family and clo..."
1,3ymtujh0dsgfkjhufn5vl4x0zmi4tf,"When I was a little kid, a father was like the..."
2,31lm9edvols7sovvly6ni7grrnrjnw,CHAPTER X: Reddy Fox Is Impudent \n\nA saucy t...
3,3rxpczqmqpbunfy585nmonb8wzag1w,Billy and Sally are brother and sister. Billy ...
4,3amywka6ybmdmeg02ucbosbrvpro6l,Wang Jiaming from Beijing Chenjinglun High Sch...


In [9]:
# Training Q/A dataframe
df_train_dialogue_qa.head()

Unnamed: 0,id,question,answer,ref_id,x_id
0,0,{'input_text': 'When did the Special Olympics ...,"{'span_start': 1275, 'span_end': 1351, 'span_t...",37m28k1j0qd08516cu1iw1wrtq5jaa,0
1,1,"{'input_text': 'Who started it?', 'turn_id': 2}","{'span_start': 287, 'span_end': 355, 'span_tex...",37m28k1j0qd08516cu1iw1wrtq5jaa,0
2,2,"{'input_text': 'Where?', 'turn_id': 3}","{'span_start': 1274, 'span_end': 1343, 'span_t...",37m28k1j0qd08516cu1iw1wrtq5jaa,0
3,3,{'input_text': 'What was the format at first?'...,"{'span_start': 1275, 'span_end': 1327, 'span_t...",37m28k1j0qd08516cu1iw1wrtq5jaa,0
4,4,"{'input_text': 'Has it grown?', 'turn_id': 5}","{'span_start': 1355, 'span_end': 1418, 'span_t...",37m28k1j0qd08516cu1iw1wrtq5jaa,0


In [10]:
# Validation dialogue dataframe
df_val_dialogue_story.head()

Unnamed: 0,id,story
0,3l2is5hsfaig646pxxa1p9p29hjnuj,Once an Englishman named Jack Brown went to Ru...
1,336yqze83vet37vakvnt4i8m51em5g,"A Chinese actor's divorce from his wife, over ..."
2,3wj1oxy92agboo5nlq4r7bndcb68a1,Laura and Graham were having a party for their...
3,3dhe4r9ocwb1c0g1r9n0t6ldp5u2g6,The Oscars ceremony at the 87th Academy Awards...
4,3u5nzhp4lr2b43ciddguaj57fmuhpg,Rochester ( or ) is a city on the southern sho...


In [11]:
# Validation Q/A dataframe
df_val_dialogue_qa.head()

Unnamed: 0,id,question,answer,ref_id,x_id
0,0,"{'input_text': 'What did Jack shoot?', 'turn_i...","{'span_start': 1029, 'span_end': 1068, 'span_t...",3l2is5hsfaig646pxxa1p9p29hjnuj,0
1,1,"{'input_text': 'was it killed?', 'turn_id': 2}","{'span_start': 1029, 'span_end': 1133, 'span_t...",3l2is5hsfaig646pxxa1p9p29hjnuj,0
2,2,{'input_text': 'What did the other wolves do t...,"{'span_start': 1135, 'span_end': 1231, 'span_t...",3l2is5hsfaig646pxxa1p9p29hjnuj,0
3,3,{'input_text': 'How long did the sleigh get aw...,"{'span_start': 1135, 'span_end': 1231, 'span_t...",3l2is5hsfaig646pxxa1p9p29hjnuj,0
4,4,"{'input_text': 'What was shining?', 'turn_id': 5}","{'span_start': 1297, 'span_end': 1342, 'span_t...",3l2is5hsfaig646pxxa1p9p29hjnuj,0


## [Task 3] Model definition

Write your own script to define the following transformer-based models from [Hugging Face](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` Python package!

**Note**: We consider small transformer models for computational reasons!
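The assignment does not prescribe how an encoder-only checkpoint should be turned into a model that generates answers. One possible design, sketched below under that caveat, wraps each checkpoint in an `EncoderDecoderModel`; span extraction would be a different, equally legitimate choice. The Hub identifier `prajjwal1/bert-tiny` is an assumption based on the citation in the next cell, and `build_seq2seq` is a hypothetical helper.

```python
# Hub identifiers; "prajjwal1/bert-tiny" is an assumption based on the citation in the next cell.
MODEL_NAMES = {
    "M1": "distilroberta-base",
    "M2": "prajjwal1/bert-tiny",
}


def build_seq2seq(checkpoint):
    """Build an encoder-decoder model whose two halves both start from `checkpoint`."""
    # Imported lazily so the dictionary above can be used without transformers installed.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)
    # Generation needs to know which tokens start and pad the decoder input.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model


# Usage (downloads the weights on the first call):
# tokenizer, model = build_seq2seq(MODEL_NAMES["M2"])
```

The decoder's cross-attention weights are newly initialized in this setup, so the model only becomes useful after fine-tuning.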

In [12]:
# Install transformers package
!pip install transformers

# BERT Tiny model. Citing:
# @misc{bhargava2021generalization,
#      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics}, 
#      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
#      year={2021},
#      eprint={2110.01518},
#      archivePrefix={arXiv},
#      primaryClass={cs.CL}
# }
# @article{DBLP:journals/corr/abs-1908-08962,
#  author    = {Iulia Turc and
#               Ming{-}Wei Chang and
#               Kenton Lee and
#               Kristina Toutanova},
#  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
#               on Knowledge Distillation},
#  journal   = {CoRR},
#  volume    = {abs/1908.08962},
#  year      = {2019},
#  url       = {http://arxiv.org/abs/1908.08962},
#  eprinttype = {arXiv},
#  eprint    = {1908.08962},
#  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
#  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
#  bibsource = {dblp computer science bibliography, https://dblp.org}
# }

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## [Task 4] Answer generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.
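A hedged sketch of how $Q_i$ and $P$ could be fed to the model: Hugging Face tokenizers accept two texts and encode them as a single sequence pair, and `truncation="only_second"` shortens only the second text. `encode_pair` is a hypothetical helper; the tokenizer is passed in as an argument, so any checkpoint can be used.

```python
def encode_pair(tokenizer, question, passage, max_length=512):
    """Encode (Q_i, P) as one sequence pair, truncating only the passage."""
    return tokenizer(
        question,
        passage,
        truncation="only_second",  # keep the question whole; cut the passage if needed
        max_length=max_length,
    )


# Usage, assuming `tokenizer` was loaded with AutoTokenizer.from_pretrained(...):
# inputs = encode_pair(tokenizer, "What color was Cotton?", story)
# inputs["input_ids"], inputs["attention_mask"]
```

Passages longer than the limit lose their tail with this strategy, so answers located near the end of a long story may become unreachable; splitting the passage into overlapping windows is a common alternative.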

## [Task 5] Answer generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_1, A_1, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.
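One simple way to expose $H$ to the model, sketched here as an assumption rather than the required approach, is to prepend the previous question/answer pairs to the current question. `question_with_history`, the plain-text separator and the `max_turns` cut-off are all hypothetical choices.

```python
def question_with_history(question, previous_turns, max_turns=None, sep=" | "):
    """Prepend the dialogue history H to the current question Q_i.

    previous_turns: list of (question, answer) string pairs, oldest first.
    max_turns: keep only the most recent turns (None keeps all of them).
    sep: hypothetical plain-text separator; a tokenizer's own separator token is another option.
    """
    if max_turns is not None:
        # Guard against max_turns == 0: a [-0:] slice would keep the whole list.
        previous_turns = previous_turns[-max_turns:] if max_turns > 0 else []
    parts = [f"{q} {a}" for q, a in previous_turns]
    parts.append(question)
    return sep.join(parts)


history = [("What color was Cotton?", "white")]
print(question_with_history("Where did she live?", history))
```

Limiting the history to the last few turns keeps the input short, which leaves more of the token budget for the passage.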

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metric: SQuAD F1-score.$^2$
* Fine-tune each transformer-based model for **3 epochs**.
* Report the evaluation SQuAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use the ```allennlp``` Python package for a quick implementation of the SQuAD F1-score: ```from allennlp_models.rc.tools import squad```.
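To make the metric concrete, the sketch below re-implements the token-level F1 in plain Python, following the normalization steps commonly used for SQuAD v1.1 evaluation (lowercasing, removing punctuation and English articles, collapsing whitespace). It is meant as an illustration of the computation, not as a replacement for the suggested package; the function names are hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction, ground_truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(squad_f1("White", "white"))           # identical after normalization
print(squad_f1("a white kitten", "white"))  # partial overlap
```

For the second call, the prediction normalizes to two tokens of which one matches, giving precision 1/2, recall 1 and F1 = 2/3.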

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQuAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 
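The grouping step can be sketched as follows, assuming each prediction has already been scored and stored as a dictionary. The record keys (`source`, `question`, `f1`) and the helper name are hypothetical.

```python
from collections import defaultdict


def worst_errors_by_source(records, k=5):
    """Group scored predictions by source and keep the k lowest-F1 records of each."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["source"]].append(record)
    return {
        source: sorted(items, key=lambda r: r["f1"])[:k]
        for source, items in grouped.items()
    }


# Toy records; the keys "source", "question" and "f1" are hypothetical.
records = [
    {"source": "cnn", "question": "Who?", "f1": 0.9},
    {"source": "cnn", "question": "When?", "f1": 0.1},
    {"source": "race", "question": "Where?", "f1": 0.4},
]
worst = worst_errors_by_source(records, k=1)
print(worst)
```

Storing the question, the gold answer and the prediction in each record makes it easier to inspect the selected errors by hand afterwards.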

# Assignment Evaluation

Assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-pasted terminal output $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This lets you build clean, modular code that is easy to read and debug (notebooks can be quite tricky from time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include the data tokenization and conversion steps?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
# Alternatively, cap the sequence length (truncation must be enabled for max_length to take effect)
inputs = tokenizer(text, add_special_tokens=True, truncation=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugging Face's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting an **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length
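The quantile-based padding idea from the second workaround can be sketched as below: instead of padding every example to the longest sequence, pick a length that covers most examples and truncate the few outliers. `quantile_length`, the 512 cap and the toy token counts are hypothetical.

```python
import math


def quantile_length(lengths, q=0.95, cap=512):
    """Return the sequence length that covers a fraction q of the examples."""
    ordered = sorted(lengths)
    # Nearest-rank quantile: smallest length with at least q of the data at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return min(ordered[rank - 1], cap)


token_lengths = [12, 15, 18, 20, 22, 25, 30, 35, 40, 900]  # toy token counts
print(quantile_length(token_lengths, q=0.9))
```

With these toy counts a single outlier of 900 tokens no longer dictates the padded length; the returned value can then be passed as `max_length` when tokenizing.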

---
---

# Contact

For any doubts, questions, issues, or requests for help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?