# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on [CoQA](https://stanfordnlp.github.io/coqa/) dataset: a conversational QA dataset.

### Task

Given a question $Q$, a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable;

**Note**: an question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair is paired with a rationale $R$: it is a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as an additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]" % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "spand_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text" "white",   % <-- $A_1$      
            "turn_id": 1
        },
        [...]
    ]
}
```

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one groundtruth answer $A$ and text rationale $R$.

CoQA only contains 1.3% of unanswerable questions. For simplicity, we **ignore** those QA pairs.

## [Task 1] Remove unaswerable QA pairs

Write your own script to remove unaswerable QA pairs from both train and validation sets.

## Dataset Download


In [1]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [2]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

In [3]:
# Install transformers package
!pip install transformers

# Import libraries
import json
import pandas as pd
import random
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
# define device as GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# initialize number of epochs
epochs = 1

In [5]:
# Store unknown IDs in a list to exclude them during the creation of the training, validation and test datasets
# Training
train_data_json = open('./coqa/train.json')
train_data = json.load(train_data_json)
train_val_tot = len(train_data['data'])
train_val_tot_q = 0
train_val_cnt = 0      
train_val_cnt_q = 0
idx_un_train = []
idx_un_test = []

for idx in range(train_val_tot):
    all_unasnwered = True
    for x in range(0,len(train_data['data'][idx]['questions'])): # If at least one answer is answerable store it in the training lists
      train_val_tot_q += 1
      if train_data['data'][idx]['answers'][x]['span_text'] != 'unknown':
        all_unasnwered = False
        train_val_cnt_q += 1      
    if all_unasnwered:
      idx_un_train.append(idx) # store the story idx so to exclude it from further processing and from the 80-20 split computation
    else: # if the story at least one answerable question store the idx in the training story list     
      train_val_cnt += 1
      
cnt_un_train_val = train_val_tot - train_val_cnt

percentage = round((100 * (cnt_un_train_val / train_val_tot)),2)
percentage_q = round((100 * ((train_val_tot_q - train_val_cnt_q) / train_val_tot_q)),2)

print("Training and validation unaswerable stats:")
print("Training and validation stories with all unaswerable questions are:",cnt_un_train_val,"vs. a total number of stories of",train_val_tot,":",percentage," %")
print("Test q/a pairs with unaswerable questions are:",(train_val_tot_q - train_val_cnt_q),"vs. a total number of q/a pairs of",train_val_tot_q,":",percentage_q," %")

# Test
test_data_json = open('./coqa/test.json')
test_data = json.load(test_data_json)
test_tot = len(test_data['data'])
test_cnt = 0      
test_cnt_q = 0
test_tot_q = 0      

for idx in range(0,len(test_data['data'])):
    all_unasnwered = True
    for x in range(0,len(train_data['data'][idx]['questions'])): # If at least one answer is answerable store it in the training lists
      test_tot_q += 1
      if train_data['data'][idx]['answers'][x]['span_text'] != 'unknown':
        all_unasnwered = False
        test_cnt_q += 1      
    if all_unasnwered:
      idx_un_test.append(idx) # store the story idx so to exclude it from further processing and from the 80-20 split computation
    else: # if the story at least one answerable question store the idx in the training story list     
      test_cnt += 1
      
cnt_un_test = test_tot - test_cnt

percentage = round((100 * (cnt_un_test / test_tot)),2)
percentage_q = round((100 * ((test_tot_q - test_cnt_q) / test_tot_q)),2)

print("\nTest unaswerable stats:")
print("Test stories with all unaswerable questions are:",cnt_un_test,"vs. a total number of stories of",test_tot,":",percentage," %")
print("Test q/a pairs with unaswerable questions are:",(test_tot_q - test_cnt_q),"vs. a total number of q/a pairs of",test_tot_q,":",percentage_q," %")

print("\nPlease note: the requirement is to split trainig and validation at story level (not q/a)")
print("therefore counts at story level will be used to compute properly the 80-20 split")
print("While the checks required to exclude unaswerable q/a pairs will be done at each q/a pairs level")

Training and validation unaswerable stats:
Training and validation stories with all unaswerable questions are: 6 vs. a total number of stories of 7199 : 0.08  %
Test q/a pairs with unaswerable questions are: 1362 vs. a total number of q/a pairs of 108647 : 1.25  %

Test unaswerable stats:
Test stories with all unaswerable questions are: 2 vs. a total number of stories of 500 : 0.4  %
Test q/a pairs with unaswerable questions are: 92 vs. a total number of q/a pairs of 7421 : 1.24  %

Please note: the requirement is to split trainig and validation at story level (not q/a)
therefore counts at story level will be used to compute properly the 80-20 split
While the checks required to exclude unaswerable q/a pairs will be done at each q/a pairs level


#### Data Inspection

Spend some time in checking accurately the dataset format and how to retrieve the tasks' inputs and outputs!

In [6]:
df_train_mockup = pd.read_json('./coqa/train.json')
print("Training Dataframe shape\n",df_train_mockup.shape)
print("\nExample of one dataframe row:\n",df_train_mockup.iloc[2])
# Focusing on 'data' column
print("\nData dictionary keys\n",df_train_mockup['data'].iloc[2].keys(),'\n')
print("\nStory\n",df_train_mockup['data'].iloc[2]['story'],'\n')
print("\nQuestions\n",df_train_mockup['data'].iloc[2]['questions'],'\n')
print("\nAnswers\n",df_train_mockup['data'].iloc[2]['answers'],'\n')
df_train_mockup.head(10)

Training Dataframe shape
 (7199, 2)

Example of one dataframe row:
 version                                                    1
data       {'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn...
Name: 2, dtype: object

Data dictionary keys
 dict_keys(['source', 'id', 'filename', 'story', 'questions', 'answers', 'name']) 


Story
 CHAPTER VII. THE DAUGHTER OF WITHERSTEEN 

"Lassiter, will you be my rider?" Jane had asked him. 

"I reckon so," he had replied. 

Few as the words were, Jane knew how infinitely much they implied. She wanted him to take charge of her cattle and horse and ranges, and save them if that were possible. Yet, though she could not have spoken aloud all she meant, she was perfectly honest with herself. Whatever the price to be paid, she must keep Lassiter close to her; she must shield from him the man who had led Milly Erne to Cottonwoods. In her fear she so controlled her mind that she did not whisper this Mormon's name to her own soul, she did not even think it. Beside

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
5,1,"{'source': 'race', 'id': '3ftf2t8wlri896r0rn6x..."
6,1,"{'source': 'cnn', 'id': '3qemnnsb2xz5mh3gvv3nj..."
7,1,"{'source': 'race', 'id': '369j354ofdapu1z2ebz3..."
8,1,"{'source': 'race', 'id': '3v0z7ywsiy0kux6wg4mm..."
9,1,"{'source': 'wikipedia', 'id': '3v5q80fxixr0io4..."


## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data in train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42

#### Reproducibility Memo

Check back tutorial 2 on how to fix a specific random seed for reproducibility!

In [7]:
# Generate training and validation dialogues and corresponding Q/A dataset splits accordingly to 80/20 ratio
# excluding dialogues with unaswered questions
# To avoid introducing bias across epochs, select randomly the dialogues
# Selection starts from stories, and then corresponding Q/A are selected to ensure the split between training and validation is done at dialogue level

# Initialization
random.seed(42)
train_split = 0.8
val_split = 1-train_split
train_idx = []
train_ukn = []
train_id = []
train_story = []
train_question = []
train_answer_start = []
train_answer_end = []
val_id = []
val_story = []
val_question = []
val_answer_start = []
val_answer_end = []
test_id = []
test_story = []
test_question = []
test_answer_start = []
test_answer_end = []
train_qa = {}
train_val_tot = len(train_data['data'])
train_val_net_tot = train_val_tot - cnt_un_train_val
train_rec_tot = int(train_split*train_val_net_tot)
train_cnt = 0
train_cnt_q = 0
val_cnt = 0
val_cnt_q = 0
test_cnt = 0
test_cnt_q = 0

# Iterate over training data selecting 80% of the dialogues randomly and store them in the training story dictiornary provided 
# they do not contain unaswerable questions and that are not already present.
# Once a dialogue is stored in the stories dictionary insert the corresponding questions in the QA training dictionary.
# Q/A are linked to the corresponding dialogues, storing the stories ID.

# Training
end_of_train = False
while not end_of_train: 
  idx = random.randint(0,train_val_tot) # Select a random story from training
  if ((idx not in idx_un_train) and (idx not in train_idx)): # Check the story is not already in the training set and that it does not contain all unaswerable questions.
    all_unasnwered = True
    for x in range(0,len(train_data['data'][idx]['questions'])): # If at least one answer is answerable store it in the training lists
      if train_data['data'][idx]['answers'][x]['span_text'] != 'unknown':
        train_id.append(train_data['data'][idx]['id'])
        train_story.append(train_data['data'][idx]['story'])
        train_question.append(train_data['data'][idx]['questions'][x]['input_text'])
        train_answer_start.append(train_data['data'][idx]['answers'][x]['span_start'])
        train_answer_end.append(train_data['data'][idx]['answers'][x]['span_end'])
        all_unasnwered = False
        train_cnt_q += 1      
    if not all_unasnwered: # if the story at least one answerable question store the idx in the training story list     
      train_cnt += 1
      train_idx.append(idx)
  if train_cnt == train_rec_tot: end_of_train = True # Reached the target stories for training, end loop

# Validation
for idx in range(0,train_val_tot): # iterate over all stories  
  if ((idx not in idx_un_train) and (idx not in train_idx)): # if the story has not been already marked as unanswerable or in training
    all_unasnwered = True
    for x in range(0,len(train_data['data'][idx]['questions'])): # If at least one answer is answerable store it in the training lists
      if train_data['data'][idx]['answers'][x]['span_text'] != 'unknown':
        val_id.append(train_data['data'][idx]['id'])
        val_story.append(train_data['data'][idx]['story'])
        val_question.append(train_data['data'][idx]['questions'][x]['input_text'])
        val_answer_start.append(train_data['data'][idx]['answers'][x]['span_start'])
        val_answer_end.append(train_data['data'][idx]['answers'][x]['span_end'])
        all_unasnwered = False
        val_cnt_q += 1      
    if not all_unasnwered: # if the story at least one answerable question store the idx in the training story list
      val_cnt += 1

# Test
test_tot = len(test_data['data'])

for idx in range(0,test_tot): # iterate over all test stories
  if (idx not in idx_un_test): # if the story has not been already marked as unanswerable
    all_unasnwered = True
    for x in range(0,len(train_data['data'][idx]['questions'])): # If at least one answer is answerable store it in the test lists
      if train_data['data'][idx]['answers'][x]['span_text'] != 'unknown':
        test_id.append(train_data['data'][idx]['id'])
        test_story.append(train_data['data'][idx]['story'])
        test_question.append(train_data['data'][idx]['questions'][x]['input_text'])
        test_answer_start.append(train_data['data'][idx]['answers'][x]['span_start'])
        test_answer_end.append(train_data['data'][idx]['answers'][x]['span_end'])
        test_cnt_q += 1
        all_unasnwered = False
    if not all_unasnwered: # if the story at least one answerable question store the idx in the training story list
      test_cnt += 1

In [8]:
# Analysis of the training data
print("Analysis of the training data")
print("TOT: Total dialogues in the original training repository: ",train_val_tot)
print("UKN: Unaswerable dialogues in the original training repository: ",cnt_un_train_val)
print("TOT - UKN: Dialogues to be split between training and validaton: ",(train_val_tot-cnt_un_train_val))
# Training dataset
print("\nTraining dataset")
print("Expected training dialogues ",round((train_split*100),0),"% of (TOT - UKN):",int(train_split*train_val_net_tot))
print("Actual training dialogues: ",train_cnt)
print("Actual training q/a pairs: ",train_cnt_q)
# Validation dataset
print("\nValidation dataset")
print("Expected validation dialogues: ",(train_val_tot-cnt_un_train_val-train_cnt))
print("Actual validation dialogues",val_cnt)
print("Actual validation q/a pairs",val_cnt_q)

# Analysis of the test data
print("\nTest dataset")
print("TOT: Total dialogues in the original test repository: ",test_tot)
print("UKN: Unaswerable dialogues in the original training repository: ",cnt_un_test)
print("TOT - UKN: Dialogues to be split between training and validaton: ",(test_tot-cnt_un_test))
# Test dataset
print("Expected test dialogues (TOT - UKN):",(test_tot-cnt_un_test))
print("Actual test dialogues: ",test_cnt)
print("Actual test q/a pairs: ",test_cnt_q)

Analysis of the training data
TOT: Total dialogues in the original training repository:  7199
UKN: Unaswerable dialogues in the original training repository:  6
TOT - UKN: Dialogues to be split between training and validaton:  7193

Training dataset
Expected training dialogues  80.0 % of (TOT - UKN): 5754
Actual training dialogues:  5754
Actual training q/a pairs:  86032

Validation dataset
Expected validation dialogues:  1439
Actual validation dialogues 1439
Actual validation q/a pairs 21253

Test dataset
TOT: Total dialogues in the original test repository:  500
UKN: Unaswerable dialogues in the original training repository:  2
TOT - UKN: Dialogues to be split between training and validaton:  498
Expected test dialogues (TOT - UKN): 498
Actual test dialogues:  498
Actual test q/a pairs:  7329


## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [9]:
# Tokenization function
def tokenization(tokenizer,story,question,answer_start,answer_end):
  # Initialization
  answer_start_token = []
  answer_end_token = []
  story_clean_none = []
  question_clean_none = []
  answer_start_token_clean_none = []
  answer_end_token_clean_none = []
  idx_none = []

  # Tokenization of stories and questions 
  encodings = tokenizer(story,question,truncation=True,padding=True)

  # Translate answer span start/end pointers from story text indexes to token embedding indexes 
  # managing cases where answer start/end are pointing to 'blanks'
  for i in range(len(answer_start)):
    answer_start_token.append(encodings.char_to_token(i,answer_start[i]))
    if answer_start_token[-1] is None:
      answer_start_token[-1] = encodings.char_to_token(i,answer_start[i]+1)
    if answer_start_token[-1] is None:
      answer_start_token[-1] = encodings.char_to_token(i,answer_start[i]+2)
  for i in range(len(answer_end)):
    answer_end_token.append(encodings.char_to_token(i,answer_end[i]))
    if answer_end_token[-1] is None:
      answer_end_token[-1] = encodings.char_to_token(i,answer_end[i]+1)
    if answer_end_token[-1] is None:
      answer_end_token[-1] = encodings.char_to_token(i,answer_end[i]+2)
  
  # Create lists skippng question/answer pairs where answer span start or span end to tokens are 'None' 
  cnt = 0
  for i in range(len(answer_start_token)):
    if answer_start_token[i] is None:
      idx_none.append(i)
  cnt = 0
  for i in range(len(answer_end_token)):
    if answer_end_token[i] is None:
      idx_none.append(i)

  for i in range(len(story)):
    if i not in idx_none:
      story_clean_none.append(story[i])
      question_clean_none.append(question[i])
      answer_start_token_clean_none.append(answer_start_token[i])
      answer_end_token_clean_none.append(answer_end_token[i])
  
  # Tokenization of the cleaned up lists and adding token based start and end position
  encodings=tokenizer(story_clean_none,question_clean_none,truncation=True,padding=True)
  encodings.update({'start_positions':answer_start_token_clean_none,'end_positions':answer_end_token_clean_none})
  
  print("Q/A pairs tokenized",len(story_clean_none))
  return(encodings)


In [10]:
# Dataset Objects Class
class QADastset(torch.utils.data.Dataset):
  def __init__(self,encodings):
    self.encodings = encodings
  
  def __getitem__(self,idx):
    return {key: torch.tensor(val[idx]) for key,val in self.encodings.items()}
  
  def __len__(self):
    return len(self.encodings.input_ids)

In [11]:
# Model fine-tuning loop function
def fine_tuning_model(model,dataloader,epochs):
  # move model to device
  model.to(device)

  # define optimizer with weight decay to reduce risks of overfitting
  opt = AdamW(model.parameters(), lr=5e-5)

  for epoch in range(epochs): # per epoch training iteration
    model.train()
    loop = tqdm(dataloader,leave=True)
    for batch in loop: #loop # per batch training iteration
      opt.zero_grad()
      
      # pull tensor batches 
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      start_positions = batch['start_positions'].to(device)
      end_positions = batch['end_positions'].to(device)
    
      # train model on batch
      output = model(input_ids,attention_mask=attention_mask,start_positions=start_positions,end_positions=end_positions)
      
      # retrive loss      
      loss = output[0]
      
      # backward pass
      loss.backward()
      
      # optimization step
      opt.step()

      # display results      
      loop.set_description(f'Epoch {epoch}')
      loop.set_postfix(loss=loss.item())
      
    return(model)

In [12]:
# [M1] DistilRoBERTa (distilbert-base). 
# M1 tokenizer and model definition
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering
tokenizer_M1 = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
M1 = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')

# DistillRoberta. Citing:
#  https://huggingface.co/distilroberta-base
#  @article{Sanh2019DistilBERTAD,
#  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
#  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
#  journal={ArXiv},
#  year={2019},
#  volume={abs/1910.01108}
# }
# Reference for code: https://towardsdatascience.com/how-to-fine-tune-a-q-a-transformer-86f91ec92997

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on

In [13]:
# [M2] BERTTiny (bert-tiny)
# M2 tokenizer and model definition
from transformers import AutoModel, AutoTokenizer
model_name = "prajjwal1/bert-tiny"
M2 = AutoModel.from_pretrained(model_name)
tokenizer_M2 = AutoTokenizer.from_pretrained(model_name)

# BERT Tiny model. Citing:
# https://huggingface.co/prajjwal1/bert-tiny
# @misc{bhargava2021generalization,
#      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics}, 
#      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
#      year={2021},
#      eprint={2110.01518},
#      archivePrefix={arXiv},
#      primaryClass={cs.CL}
# }
# @article{DBLP:journals/corr/abs-1908-08962,
#  author    = {Iulia Turc and
#               Ming{-}Wei Chang and
#               Kenton Lee and
#               Kristina Toutanova},
#  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
#               on Knowledge Distillation},
#  journal   = {CoRR},
#  volume    = {abs/1908.08962},
#  year      = {2019},
#  url       = {http://arxiv.org/abs/1908.08962},
#  eprinttype = {arXiv},
#  eprint    = {1908.08962},
#  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
#  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
#  bibsource = {dblp computer science bibliography, https://dblp.org}
# }

Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## [Task 4] Question generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.

In [14]:
# M1 Training tokenization
#train_encodings_M1 = tokenization(tokenizer_M1,train_story,train_question,train_answer_start,train_answer_end)

In [15]:
# M1 Validation tokenization
val_encodings_M1 = tokenization(tokenizer_M1,val_story,val_question,val_answer_start,val_answer_end)

Q/A pairs tokenized 20415


In [16]:
# M1 Test tokenization
test_encodings_M1 = tokenization(tokenizer_M1,test_story,test_question,test_answer_start,test_answer_end)

Q/A pairs tokenized 7028


In [17]:
# M1 model Datasets initalization
#train_dataset_M1 = QADastset(train_encodings_M1) 
val_dataset_M1 = QADastset(val_encodings_M1) 
test_dataset_M1 = QADastset(test_encodings_M1)

# DataLoaders initialization
#train_dataloader_M1 = DataLoader(train_dataset_M1,batch_size=10,shuffle=True) 
val_dataloader_M1 = DataLoader(val_dataset_M1,batch_size=10,shuffle=False) 
test_dataloader_M1 = DataLoader(test_dataset_M1,batch_size=10,shuffle=False) 

In [18]:
# M1 Model fine tuning
M1 = fine_tuning_model(M1,train_dataloader_M1,epochs)

NameError: ignored

In [19]:
# M2 model Training tokenization
#train_encodings_M2 = tokenization(tokenizer_M2,train_story,train_question,train_answer_start,train_answer_end)

In [20]:
# M2 Model Validation tokenization
#val_encodings_M2 = tokenization(tokenizer_M2,val_story,val_question,val_answer_start,val_answer_end)

In [21]:
# M2 Model Test tokenization
test_encodings_M2 = tokenization(tokenizer_M2,test_story,test_question,test_answer_start,test_answer_end)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Q/A pairs tokenized 7060


In [22]:
# Datasets initalization
#train_dataset_M2 = QADastset(train_encodings_M2) 
#val_dataset_M2 = QADastset(val_encodings_M2) 
test_dataset_M2 = QADastset(test_encodings_M2)

# DataLoaders initialization
#train_dataloader_M2 = DataLoader(train_dataset_M2,batch_size=10,shuffle=True) 
#val_dataloader_M2 = DataLoader(val_dataset_M2,batch_size=10,shuffle=False) 
test_dataloader_M2 = DataLoader(test_dataset_M2,batch_size=10,shuffle=False) 

In [None]:
# M2 model fine-tuning
M2 = fine_tuning_model(M2,train_dataloader_M2,epochs)

## [Task 5] Question generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metrics: SQUAD F1-score.$^2$
* Fine-tune each transformer-based models for **3 epochs**.
* Report evaluation SQUAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use ```allennlp``` python package for a quick implementation of SQUAD F1-score: ```from allennlp_models.rc.tools import squad```. 

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQUAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 

# Assignment Evaluation

The following assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-paste terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean and modular code, as well as easy to read and to debug (notebooks can be quite tricky time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include data tokenization and conversion step?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
%% Alternatively
inputs = tokenizer.tokenize(text, add_special_tokens=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugginface's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length

---
---

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?