# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on the [CoQA](https://stanfordnlp.github.io/coqa/) dataset, a conversational QA dataset.

### Task

Given a question $Q$ and a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) free-form text or (ii) unanswerable.

**Note**: a question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based architectures to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define, with parameters $\theta$.
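
As a rough illustration of the difference between the two variants, the sketch below builds the text fed to $f_\theta$ with and without the dialogue history $H$. The helper name `build_input`, the `[SEP]` separator string and the choice of appending history turns after the passage are assumptions made for illustration; the assignment does not prescribe them.

```python
from typing import List, Optional, Tuple


def build_input(question: str,
                passage: str,
                history: Optional[List[Tuple[str, str]]] = None,
                sep: str = "[SEP]") -> Tuple[str, str]:
    """Return the (question, context) pair fed to the model.

    `history` is a list of previous (question, answer) turns, oldest first.
    With history=None this corresponds to A = f(Q, P),
    otherwise to A = f(Q, P, H).
    """
    context = passage
    for prev_q, prev_a in history or []:
        # append each previous turn to the passage, separated by `sep`
        context = f"{context} {sep} {prev_q} {sep} {prev_a}"
    return question, context


passage = "Cotton was a little white kitten."
print(build_input("Where did she live?", passage))
print(build_input("Where did she live?", passage,
                  history=[("What color was Cotton?", "white")]))
```

Keeping the history inside the context string means the same tokenizer call can serve both model variants.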

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair comes with a rationale $R$: a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as additional information at training time!
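
A minimal sketch of how $R$ relates to $P$, assuming the rationale offsets are character indices into the passage with an exclusive end. This matches the sample dialogue in the dataset snippet of this notebook, but it is worth verifying on the real data. The helper `get_rationale` is a hypothetical name.

```python
def get_rationale(story: str, answer: dict) -> str:
    """Slice the rationale span out of the passage using character offsets."""
    return story[answer["span_start"]:answer["span_end"]]


story = ("Once upon a time, in a barn near a farm house, "
         "there lived a little white kitten named Cotton.")
answer = {"span_start": 59, "span_end": 93, "input_text": "white"}

rationale = get_rationale(story, answer)
print(rationale)
```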

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]", % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "span_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text": "white",   % <-- $A_1$
            "turn_id": 1
        },
        [...]
    ]
}
```

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one ground-truth answer $A$ and text rationale $R$.

Only 1.3% of CoQA questions are unanswerable. For simplicity, we **ignore** those QA pairs.

## [Task 1] Remove unanswerable QA pairs

Write your own script to remove unanswerable QA pairs from both train and validation sets.
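
One possible way to do this is sketched below. It assumes, as the extraction code in this notebook does, that an unanswerable question is marked by `span_end == -1` in its answer entry. The function name `drop_unanswerable` and the toy dialogue are hypothetical.

```python
def drop_unanswerable(dialogue: dict) -> dict:
    """Return a copy of the dialogue without unanswerable QA pairs."""
    kept = [(q, a)
            for q, a in zip(dialogue["questions"], dialogue["answers"])
            if a["span_end"] != -1]  # assumption: -1 marks "no rationale span"
    cleaned = dict(dialogue)  # shallow copy, the input is left untouched
    cleaned["questions"] = [q for q, _ in kept]
    cleaned["answers"] = [a for _, a in kept]
    return cleaned


toy = {
    "id": "toy-dialogue",
    "story": "Cotton was a little white kitten.",
    "questions": [
        {"input_text": "What color was Cotton?", "turn_id": 1},
        {"input_text": "How old was she?", "turn_id": 2},
    ],
    "answers": [
        {"span_start": 13, "span_end": 32, "input_text": "white", "turn_id": 1},
        {"span_start": -1, "span_end": -1, "input_text": "unknown", "turn_id": 2},
    ],
}

cleaned = drop_unanswerable(toy)
print(len(cleaned["questions"]), len(cleaned["answers"]))
```

Note that dropping a pair also removes it from any dialogue history built afterwards, which may or may not be what you want for the model that uses $H$.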

In [None]:
!pip install transformers
!pip install tensorflow-addons
!pip install datasets
!pip install evaluate


## Dataset Download


In [None]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [None]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:07, 6.21MB/s]                            


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:02, 3.37MB/s]                            

Download completed!





#### Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the task's inputs and outputs!
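
As a starting point for the inspection, the sketch below pairs each question with its answer and rationale by `turn_id`. The toy dialogue only mimics the field names of the dataset snippet above, and `iter_turns` is a hypothetical helper.

```python
def iter_turns(dialogue: dict):
    """Yield (turn_id, question, answer, rationale) for each dialogue turn."""
    answers_by_turn = {a["turn_id"]: a for a in dialogue["answers"]}
    for q in dialogue["questions"]:
        a = answers_by_turn[q["turn_id"]]
        yield q["turn_id"], q["input_text"], a["input_text"], a["span_text"]


toy = {
    "story": "Once upon a time [...] there lived a little white kitten named Cotton.",
    "questions": [{"input_text": "What color was Cotton?", "turn_id": 1}],
    "answers": [{"span_text": "a little white kitten named Cotton",
                 "input_text": "white", "turn_id": 1}],
}

turns = list(iter_turns(toy))
for turn in turns:
    print(turn)
```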

In [None]:
import json
import random
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import tensorflow as tf
from datasets import Dataset, load_from_disk
import pickle
import re
from evaluate import load


def set_seed(SEED):
  random.seed(SEED) # if you're using random
  np.random.seed(SEED) # if you're using numpy
  torch.manual_seed(SEED) # torch.cuda.manual_seed_all(SEED) is not required
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  tf.random.set_seed(SEED) # setting the seed for tensorflow too
  os.environ['TF_DETERMINISTIC_OPS'] = '1'

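def extract_data(split_dataset,add_history=False,sep_char="[SEP]"):
  """
  Extract ([question, passage], answer) examples from a list of CoQA dialogues.
  :params:
    split_dataset: list of dialogue dictionaries (the "data" field of the CoQA JSON)
    add_history: if True, append the previous answerable turns (most recent first) to the passage
    sep_char: separator string placed before each history question and answer
  :returns: XQA, a list of [question, passage] pairs, and YQA, the list of answer strings
  """  
  XQA = [] # list that will contain pairs [Q, P]
  YQA = [] # list that will contain the answers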
  for d in split_dataset: # scan each document
    for i in range(len(d["questions"])): # scan each question
      if d["answers"][i]["span_end"]!=-1: # discard unanswerable questions
        single_example = [] # prepare the single example...
        single_example.append(d["questions"][i]["input_text"]) #... with the question ...
        single_example.append(d["story"]) # ...and the passage
        if add_history:
          for j in range(i-1,-1,-1):
            if d["answers"][j]["span_end"]!=-1:
              single_example[1] = single_example[1] + sep_char + d["questions"][j]["input_text"]+ sep_char + d["answers"][j]["input_text"]
              
        XQA.append(single_example) # and append it
        YQA.append(d["answers"][i]["input_text"]) # add the answer
  return XQA, YQA

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data into train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42
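
The requirements above can be sketched as follows. This version uses only the standard library (`random.Random(42)`) rather than any particular splitting utility, so the exact partition it produces is specific to this sketch. `split_dialogues` and the toy ids are hypothetical.

```python
import random


def split_dialogues(dialogues, train_frac=0.8, seed=42):
    """Shuffle whole dialogues with a fixed seed and split them 80/20."""
    shuffled = list(dialogues)          # copy, so the input order is preserved
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


dialogues = [{"id": f"dialogue-{i}", "questions": [], "answers": []}
             for i in range(10)]
train, val = split_dialogues(dialogues)

# every dialogue must end up in exactly one split
train_ids = {d["id"] for d in train}
val_ids = {d["id"] for d in val}
print(len(train), len(val), train_ids & val_ids)
```

Splitting the list of dialogues, and only afterwards expanding each dialogue into QA pairs, is what keeps a passage from leaking between train and validation.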

#### Reproducibility Memo

Refer back to Tutorial 2 for how to fix a specific random seed for reproducibility!

In [None]:
seed = 42 
set_seed(seed)

In [None]:
## MODEL NAME
model_name = 'distilroberta-base'
#model_name = 'prajjwal1/bert-tiny'
add_history=True

with open('coqa/train.json') as f:
  # loading the training json
  train_json = json.load(f)

with open('coqa/test.json') as f:
  # loading the test json
  test_json = json.load(f)

# splitting training data
train_data, val_data = train_test_split(train_json["data"],
                                        train_size=0.8,
                                        shuffle=True,
                                        random_state=seed)
# extracting X as list of pairs [Question, Passage] and Y as a list of strings (Answers) 
XQA_train, YQA_train = extract_data(train_data, add_history)
XQA_val, YQA_val = extract_data(val_data, add_history)
XQA_test, YQA_test = extract_data(test_json["data"], add_history)
del(train_json)
del(test_json)

print("Training example (index 3):")
print(XQA_train[3])
print(YQA_train[3])
print("Validation example (index 3):")
print(XQA_val[3])
print(YQA_val[3])
print("Test example (index 3):")
print(XQA_test[3])
print(YQA_test[3])

Training example (index 3):
['When was the last one held?', 'TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis\'s upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation\'s first national elections since the country\'s independence in 1956. \n\n"It\'s a wonderful day. It\'s the first time we can choose our own representatives," said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could "get used to freedom and democracy." \n\nTunisia\'s election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political part

In [None]:
## broken example fix:
print(XQA_train[61])
print(YQA_train[61])
YQA_train[61] = 'October'
print(YQA_train[61])

In [None]:
from transformers import AutoTokenizer

def filter_string(x):
  return re.sub('[!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n]',"",x)
  
# this tokenizer strips the punctuation listed in `filters`, so a word and the
# same word followed by punctuation are mapped to the same token
output_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,./:;=?@[\\]^_`{|}~\t\n', oov_token='<UNK>')
output_tokenizer.fit_on_texts(["<start> " + i + " <end>" for i in YQA_train])

input_tokenizer = AutoTokenizer.from_pretrained(model_name)

max_output_length = max([len(i) for i in YQA_train])
print("Max input output found: " + str(max([len(i) for i in output_tokenizer.texts_to_sequences(YQA_train)])))
#max_sequence_length = max(512, max_output_length)
print("99° percentile of training set answer length:" + str(np.quantile([len(i) for i in output_tokenizer.texts_to_sequences(YQA_train)], 0.99)))
# actual percentile is 17, given that each string has the beginning and ending tokens
max_sequence_length = 20


print(np.argmax([len(i) for i in YQA_train]))
print(XQA_train[7529])
print(YQA_train[7529])

dataset_suffix = "_hist" if add_history else ""

Max input output found: 106
99° percentile of training set answer length:13.0
61
["What symptoms of addiction does Orzack's center list?", 'Caught in the Web A few months ago, it wasn\'t unusual for 47-year-old Carla Toebe to spend 15 hours per day online. She\'d wake up early, turn on her laptop and chat on Internet dating sites and instant-messaging programs - leaving her bed for only brief intervals. Her household bills piled up, along with the dishes and dirty laundry, but it took near-constant complaints from her four daughters before she realized she had a problem. "I was starting to feel like my whole world was falling apart - kind of slipping into a depression," said Carla. "I knew that if I didn\'t get off the dating sites, I\'d just keep going," detaching herself further from the outside world. Toebe\'s conclusion: She felt like she was "addicted" to the Internet. She\'s not alone. Concern about excessive Internet use isn\'t new. As far back as 1995, articles in medical journ

In [None]:
# generate dataset
train_ds = Dataset.from_dict({"xqa": XQA_train, "yqa": ["<start> " + i + " <end>" for i in YQA_train]})
train_ds = train_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
train_ds = train_ds.map(lambda x: {"y_token": output_tokenizer.texts_to_sequences(x["yqa"])}, batched=True)
train_ds = train_ds.map(lambda x: {"y_padded": tf.keras.preprocessing.sequence.pad_sequences(x["y_token"],
                                                                     padding='post',
                                                                     maxlen=max_sequence_length)}, batched=True
)
train_ds = train_ds.remove_columns(["xqa", "yqa", "y_token"])
train_ds = train_ds.with_format(type="tensorflow")
if model_name == 'prajjwal1/bert-tiny':
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds" + dataset_suffix)
else:
  train_ds.save_to_disk("gdrive/MyDrive/ckpt/train_ds_rob" + dataset_suffix)

In [None]:
# predictions are in lower case, so we consider the labels in lower case
val_ds = Dataset.from_dict({"xqa": XQA_val, "yqa": [filter_string(i.lower()) for i in YQA_val], "id_placeholder": list(range(len(YQA_val)))})
val_ds = val_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
val_ds = val_ds.map(lambda x:{"references": {'answers':{'text':[x["yqa"]], 'answer_start': [42]},
    'id': str(x["id_placeholder"]) } })

val_ds = val_ds.remove_columns(["xqa","yqa", "id_placeholder"])

if model_name == 'prajjwal1/bert-tiny':  
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask","token_type_ids"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds" + dataset_suffix)
else:
  val_ds = val_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask"], output_all_columns=True)
  val_ds.save_to_disk("gdrive/MyDrive/ckpt/val_ds_rob" + dataset_suffix)

In [None]:
# predictions are in lower case, so we consider the labels in lower case
test_ds = Dataset.from_dict({"xqa": XQA_test, "yqa": [filter_string(i.lower()) for i in YQA_test], "id_placeholder": list(range(len(YQA_test)))})
test_ds = test_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
test_ds = test_ds.map(lambda x:{"references": {'answers':{'text':[x["yqa"]], 'answer_start': [42]},
    'id': str(x["id_placeholder"]) } })

test_ds = test_ds.remove_columns(["xqa","yqa", "id_placeholder"])

if model_name == 'prajjwal1/bert-tiny':  
  test_ds = test_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask","token_type_ids"], output_all_columns=True)
  test_ds.save_to_disk("gdrive/MyDrive/ckpt/test_ds" + dataset_suffix)
else:
  test_ds = test_ds.with_format(type="tensorflow", columns=["input_ids", "attention_mask"], output_all_columns=True)
  test_ds.save_to_disk("gdrive/MyDrive/ckpt/test_ds_rob" + dataset_suffix)

Augmentation of the dataset using the best POS tagger obtained from Assignment 1.
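
To make the idea concrete, here is a small sketch of how per-sentence POS tag sequences could be turned into a fixed-size integer matrix by indexing, truncating and zero-padding. The tag vocabulary, the sizes and the helper name `encode_pos_tags` are illustrative assumptions. Index 0 is reserved for padding, and this sketch keeps the first `max_len` tags of a long sentence, which is only one of the possible truncation choices.

```python
from typing import Dict, List


def encode_pos_tags(sentences: List[List[str]],
                    tag_to_idx: Dict[str, int],
                    max_sentences: int,
                    max_len: int) -> List[List[int]]:
    """Index, truncate and zero-pad POS tag sequences to a fixed shape."""
    matrix = []
    for tags in sentences[:max_sentences]:
        row = [tag_to_idx[t] for t in tags[:max_len]]
        row += [0] * (max_len - len(row))   # pad short sentences with 0
        matrix.append(row)
    while len(matrix) < max_sentences:      # pad missing sentences with zero rows
        matrix.append([0] * max_len)
    return matrix


tag_to_idx = {"DT": 1, "NN": 2, "VBD": 3, "JJ": 4, ".": 5}
sentences = [["DT", "NN", "VBD", "JJ", "."], ["NN", "VBD"]]
encoded = encode_pos_tags(sentences, tag_to_idx, max_sentences=3, max_len=4)
print(encoded)
```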

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

to_pos_tag_XQ = [XQA_train, XQA_val, XQA_test]
to_pos_tag_YQ = [YQA_train, YQA_val, YQA_test]

def from_sentence_to_token_list(sentence):
  tokens_sentence = [word_tokenize(t) for t in sent_tokenize(sentence)]
  tags_sentence = [nltk.pos_tag(token_list) for token_list in tokens_sentence]
  to_return = []
  for tagged_sentence in tags_sentence:
    to_return.append([tagged_sentence[i][1] for i in range(0, len(tagged_sentence))])
  return to_return

def pos_tag_datasets(XQ, YQ):
  """POS-tag every [question, passage] pair in XQ and the matching answer in YQ."""
  pos_XQ = []
  pos_YQ = []
  for elem, answer in tqdm(zip(XQ, YQ), total=len(XQ)):
    question = elem[0]
    passage = elem[1]
    question_tags = from_sentence_to_token_list(question)
    passage_tags = from_sentence_to_token_list(passage)
    answer_tags = from_sentence_to_token_list(answer)  # answer aligned with the current example
    pos_XQ.append([question_tags, passage_tags])
    pos_YQ.append(answer_tags)
  return pos_XQ, pos_YQ

XQA_train_pos, YQA_train_pos = pos_tag_datasets(XQA_train, YQA_train) 
XQA_val_pos, YQA_val_pos = pos_tag_datasets(XQA_val, YQA_val) 
XQA_test_pos, YQA_test_pos = pos_tag_datasets(XQA_test, YQA_test) 



In [None]:
def save_list(list_to_save, filename):
  """Pickle a Python list to the given path."""
  with open(filename, "wb") as f:
    pickle.dump(list_to_save, f)

save_list(XQA_train_pos, "drive/MyDrive/ckpt/pos/XQA_train_pos.pos")
save_list(YQA_train_pos, "drive/MyDrive/ckpt/pos/YQA_train_pos.pos")

save_list(XQA_val_pos, "drive/MyDrive/ckpt/pos/XQA_val_pos.pos")
save_list(YQA_val_pos, "drive/MyDrive/ckpt/pos/YQA_val_pos.pos")

save_list(XQA_test_pos, "drive/MyDrive/ckpt/pos/XQA_test_pos.pos")
save_list(YQA_test_pos, "drive/MyDrive/ckpt/pos/YQA_test_pos.pos")

In [None]:
def load_list(filename):
  to_return = []
  with open(filename, "rb") as f:
    to_return = pickle.load(f)
  return to_return
  
XQA_train_pos = load_list(r"drive/MyDrive/ckpt/pos/XQA_train_pos.pos")
YQA_train_pos = load_list(r"drive/MyDrive/ckpt/pos/YQA_train_pos.pos")

XQA_val_pos = load_list(r"drive/MyDrive/ckpt/pos/XQA_val_pos.pos")
YQA_val_pos = load_list(r"drive/MyDrive/ckpt/pos/YQA_val_pos.pos")

XQA_test_pos = load_list(r"drive/MyDrive/ckpt/pos/XQA_test_pos.pos")
YQA_test_pos = load_list(r"drive/MyDrive/ckpt/pos/YQA_test_pos.pos")

XQA_train_pos = np.asarray(XQA_train_pos)
XQA_val_pos = np.asarray(XQA_val_pos)
XQA_test_pos = np.asarray(XQA_test_pos)



In [None]:
print(type(XQA_train_pos[0][0]))

<class 'list'>


In [None]:
# generate dataset
#train_ds = Dataset.from_dict({"xqa": XQA_train, "yqa": ["<start> " + i + " <end>" for i in YQA_train]})
#train_ds = train_ds.map(lambda x: input_tokenizer(x["xqa"], return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512), batched=True)
#train_ds = train_ds.map(lambda x: {"y_token": output_tokenizer.texts_to_sequences(x["yqa"])}, batched=True)
#train_ds = train_ds.map(lambda x: {"y_padded": tf.keras.preprocessing.sequence.pad_sequences(x["y_token"],
#                                                                     padding='post',
#       
#                                                              maxlen=max_sequence_length)}, batched=True)
#print(XQA_train_pos[0])
#print(XQA_train_pos[1])
#print(train_ds[0])
quest_to_test = XQA_train_pos[0][0]
pass_to_test = XQA_train_pos[0][1]
#srt_to_test = input_tokenizer(str_to_test, return_tensors="tf", padding="max_length", truncation="longest_first", max_length=512)
#print(srt_to_test)

import tensorflow as tf
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from datasets import concatenate_datasets
from nltk.data import load
import itertools
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download("tagsets")

def get_quantile_length_sentence_pos(XQA_train_pos, XQA_val_pos, XQA_test_pos):
  train_sentences = [[len(XQA_train_pos[i][1][j]) for j in range(0, len(XQA_train_pos[i][1]))] for i in range(0, len(XQA_train_pos))]
  train_sentence_length = list(itertools.chain.from_iterable(train_sentences))
  val_sentences = [[len(XQA_val_pos[i][1][j]) for j in range(0, len(XQA_val_pos[i][1]))] for i in range(0, len(XQA_val_pos))]
  val_sentence_length = list(itertools.chain.from_iterable(val_sentences))
  test_sentences = [[len(XQA_test_pos[i][1][j]) for j in range(0, len(XQA_test_pos[i][1]))] for i in range(0, len(XQA_test_pos))]
  test_sentence_length = list(itertools.chain.from_iterable(test_sentences))

  quantile_sentences_length = np.quantile((train_sentence_length + val_sentence_length + test_sentence_length), 0.75)
  return int(quantile_sentences_length)

def get_quantile_numsentences_pos(XQA_train_pos, XQA_val_pos, XQA_test_pos):
  train_num_sentences = [len(XQA_train_pos[i][0])+len(XQA_train_pos[i][1]) for i in range(0, len(XQA_train_pos))]
  train_nums = train_num_sentences
  val_num_sentences = [len(XQA_val_pos[i][0])+len(XQA_val_pos[i][1]) for i in range(0, len(XQA_val_pos))]
  val_nums = val_num_sentences
  test_num_sentences = [len(XQA_test_pos[i][0])+len(XQA_test_pos[i][1]) for i in range(0, len(XQA_test_pos))]
  test_nums = test_num_sentences
  quantile_numsentences_length = np.quantile((train_nums + val_nums + test_nums), 0.75)
  return int(quantile_numsentences_length)


def pos_tags_to_tensor(pos_tags_list, max_length_sentence_pos, max_numsentences_pos):
  # build the tag -> index mapping from the Penn Treebank tagset
  tag_to_idx = {}
  # call nltk.data.load through the module: the bare name `load` is also imported from `evaluate`
  tagdict = nltk.data.load('help/tagsets/upenn_tagset.pickle')
  for i, elem in enumerate(tagdict.keys()):
    tag_to_idx.update({elem : float(i + 1)}) # no tag is represented as 0
  tag_to_idx.update({"#": float(len(tag_to_idx.keys()) + 2)})

  # keep at most max_numsentences_pos sentences, then pad each one to max_length_sentence_pos
  to_return = []
  for i, pos_tags in enumerate(pos_tags_list):
    pos_tags = [tag_to_idx[tag] for tag in pos_tags]
    if i < max_numsentences_pos:
      to_return.append(pos_tags)
  to_return = tf.keras.preprocessing.sequence.pad_sequences(to_return, padding='post', maxlen=max_length_sentence_pos, dtype="float64")
  return to_return

def append_zero_vecs(where_append, max_length_sentence_pos, max_numsentences_pos):
  while len(where_append) < max_numsentences_pos:
    where_append = np.concatenate((where_append, [np.zeros(max_length_sentence_pos)]), axis = 0)
  return where_append

def create_pos_tensor_from_pos_dataset_entry(pos_entry, max_length_sentence_pos, max_numsentences_pos):
  #print(pos_tags_to_tensor(pos_entry[0], max_length_sentence_pos))
  pos_entry_sentences = []
  for sentence in pos_entry[0]:
    pos_entry_sentences.append(sentence)
  for sentence in pos_entry[1]:
    pos_entry_sentences.append(sentence)
  list_of_pos_vecs = pos_tags_to_tensor(pos_entry_sentences, max_length_sentence_pos, max_numsentences_pos)
  pad_sentence_list = append_zero_vecs(list_of_pos_vecs, max_length_sentence_pos, max_numsentences_pos)
  tensor_to_return = tf.convert_to_tensor(pad_sentence_list) 
  return tensor_to_return

def convert_pos_df_to_tensors(XQA, max_length_sentence_pos, max_numsentences_pos):
  # convert every entry to a (max_numsentences_pos, max_length_sentence_pos) array and
  # stack them once at the end, instead of growing an array with repeated np.concatenate calls
  entries = []
  for elem in tqdm(range(len(XQA))):
    entry = create_pos_tensor_from_pos_dataset_entry(XQA[elem], max_length_sentence_pos, max_numsentences_pos)
    entries.append(np.asarray(entry))
  return np.stack(entries)

def serialize_obj(obj_to_save, filename):
  with open(filename, "wb") as f:
    pickle.dump(obj_to_save, f)

max_length_sentence_pos = get_quantile_length_sentence_pos(XQA_train_pos, XQA_val_pos, XQA_test_pos) # 303 is the maximum, it exceeded the ram so
                                                                                                      # we consider the quantile
print(max_length_sentence_pos)
max_numsentences_pos = get_quantile_numsentences_pos(XQA_train_pos, XQA_val_pos, XQA_test_pos)        # 103 is the maximum, again we use the quantile
print(max_numsentences_pos)


XQA_train_pos_tensors = convert_pos_df_to_tensors(XQA_train_pos, max_length_sentence_pos, max_numsentences_pos)
serialize_obj(XQA_train_pos_tensors, "drive/MyDrive/ckpt/pos/XQA_train_pos_tensors.tens")

XQA_val_pos_tensors = convert_pos_df_to_tensors(XQA_val_pos, max_length_sentence_pos, max_numsentences_pos)
serialize_obj(XQA_val_pos_tensors, "drive/MyDrive/ckpt/pos/XQA_val_pos_tensors.tens")

XQA_test_pos_tensors = convert_pos_df_to_tensors(XQA_test_pos, max_length_sentence_pos, max_numsentences_pos)
serialize_obj(XQA_test_pos_tensors, "drive/MyDrive/ckpt/pos/XQA_test_pos_tensors.tens")

NameError: ignored

In [None]:
def load_list(filename):
  to_return = []
  with open(filename, "rb") as f:
    to_return = pickle.load(f)
  return to_return

XQA_train_pos_tensors = load_list("drive/MyDrive/ckpt/pos/XQA_train_pos_tensors.tens")
XQA_val_pos_tensors = load_list("drive/MyDrive/ckpt/pos/XQA_val_pos_tensors.tens")
XQA_test_pos_tensors = load_list("drive/MyDrive/ckpt/pos/XQA_test_pos_tensors.tens")

In [None]:
print("""Indexed POS vector of the first (question, passage) 
pair of the training set:""")
print(XQA_train_pos_tensors[0]) 
print(type(XQA_train_pos_tensors))

Indexed POS vector of the first (question, passage) 
pair of the training set:
[[38.  9. 13.  7. 12. 22.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.]
 [12. 30. 13.  8. 39. 12. 23. 43. 41. 36. 26. 37. 41. 30.  8. 41. 30. 39.
  23. 41. 29. 22.]
 [ 7. 30. 41.  2. 37. 13. 12. 30. 13. 12. 45.  8.  8. 41. 30. 13. 12. 45.
  12. 30. 44. 22.]
 [24. 14.  9. 13.  8. 12. 22.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.]
 [41. 23. 43.  5. 29. 32. 18.  8. 12. 39. 30. 14. 36. 24. 37.  3.  2. 37.
  43. 12. 22.  4.]
 [39. 39. 39. 39. 43. 29. 13. 12. 30. 41. 15. 29.  2. 30. 13.  8. 12. 15.
  30. 13. 12. 22.]
 [ 8. 39. 39. 23. 19. 36. 37.  3. 30.  7. 13.  8. 12. 43.  7. 13. 12. 30.
  13. 12. 12. 22.]
 [41. 29. 12. 30. 39. 23.  7. 41. 30. 13.  8. 30.  7. 41. 23. 13.  7.  8.
  41. 22.  0.  0.]
 [18. 12. 30. 13. 12. 30. 13.  8. 12. 30. 33. 30. 44. 41.  7. 30. 13.  8.
  12. 30. 39. 22.]
 [24. 30. 14. 26. 26. 29. 13. 12.  2. 37. 41. 45. 43. 45. 22.  4.  4.  0.
   0.  0. 

## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilroberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

ValueError: ignored

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [91]:
# THIS IS A SEPARATOR ##########################################################

"""
This was tested with:
tensorflow==2.6
tensorflow-gpu==2.6
tensorflow-addons==0.16.1
transformers==4.18.0
Keras==2.6.0

Note 1: Simple adaptation of tf_seq2seq_lstm.py script
Note 2: make sure Keras and Tensorflow versions match!

"""

import tensorflow as tf
import tensorflow_addons as tfa
from tqdm import tqdm
from transformers import TFAutoModel, AutoTokenizer
import time

# check if training can be performed on GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)


class MyTrainer(object):
    """
    Simple wrapper class

    train_op -> uses tf.GradientTape to compute the loss
    batch_fit -> receives a batch and performs forward-backward passes (gradient included) 
    """

    def __init__(self, encoder, decoder, max_length):
        self.encoder = encoder
        self.decoder = decoder
        self.max_length = max_length
        self.ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, 
                                                                reduction='none') # from_logits=True means the loss expects raw,
                                                                                  # unnormalized scores and applies the softmax itself,
                                                                                  # so the decoder must not end with a softmax
                                                                                  # activation layer (it would be applied twice)
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=1e-03)            # here it is possible to tweak the learning rate

    @tf.function
    def compute_loss(self, logits, target):
        loss = self.ce(y_true=target, y_pred=logits)
        mask = tf.logical_not(tf.math.equal(target, 0))
        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask # pointwise product
        return tf.reduce_mean(loss)

    @tf.function
    def train_op(self, inputs):
        with tf.GradientTape() as tape:
            # NOTABENE: it is necessary to add token_type_ids to see how it performs
            if self.encoder.use_token_type_ids:
              encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask'],
                                                                  'token_type_ids': inputs['token_type_ids'],
                                                                  "pos_tags": inputs["pos_tags"]})
            else:
              encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs['input_ids'],
                                                                  'attention_mask': inputs['attention_mask'],
                                                                  "pos_tags": inputs["pos_tags"]})
            



            decoder_input = inputs['y_padded'][:, :-1]  # ignore <end>
            real_target = inputs['y_padded'][:, 1:]  # ignore <start>

            # encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': inputs[0][0],
            #                                                    'attention_mask': inputs[0][1]})
            # decoder_input = inputs[1][:, :-1]
            # real_target = inputs[1][:, 1:]

            self.decoder.attention.setup_memory(encoder_output) # setup in order to perform attention queries over the 
                                                                # embedding space

            # decoder initialization, check build_initial_state for additional insights
            decoder_initial_state = self.decoder.build_initial_state(self.decoder.batch_size, [encoder_h, encoder_s])
            # the input is then passed to the initialized decoder and we obtain predictions
            # in rnn_output format because the model is BERT-embedding-sequence-sequence, so the
            # last layer is still a sequence of cells (an RNN)
            predicted = self.decoder({'input_ids': decoder_input,
                                      'initial_state': decoder_initial_state}).rnn_output
            # we compute the losses over the computed predictions
            loss = self.compute_loss(logits=predicted, target=real_target)
        # gradients of the loss computed for this minibatch considering trainable
        # parameters of encoder and decoder
        grads = tape.gradient(loss, self.encoder.trainable_variables + self.decoder.trainable_variables)
        return loss, grads

    @tf.function
    def batch_fit(self, inputs):
        loss, grads = self.train_op(inputs=inputs)
        # applies gradients to the trainable variables using Adam
        self.optimizer.apply_gradients(zip(grads, self.encoder.trainable_variables + self.decoder.trainable_variables))
        return loss

    # @tf.function
    def generate(self, output_tokenizer, input_ids,token_type_ids, attention_mask=None):
        batch_size = input_ids.shape[0] # input_ids is the minibatch
        if self.encoder.use_token_type_ids:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask,
                                                                  'token_type_ids': token_type_ids})
        else:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask})
          

        start_tokens = tf.fill([batch_size], output_tokenizer.word_index['<start>'])
        end_token = output_tokenizer.word_index['<end>']

        # samples the possible answer with greedy technique, we could possibly
        # use a variant here such as beam search at inference time 
        # We could not do this at training time, since the Sampler used at training
        # is not designed to project the token in an embedding space before computing
        # the next one. The aforementioned embedding space
        # is changing at each backpropagation step anyways, thus we stick with
        # the computation of the argmax of the logits using TrainingSampler.
        # NOTABENE: we can still change this sampler, find a way to penalize repetitions
        # and perform the beam search
        greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler() 
        # we have a decoder for training and a decoder for test time, thus
        # we need to define a separate inference decoder (sharing the cell and
        # the output layer with the training one) each time we generate a batch
        decoder_instance = tfa.seq2seq.BasicDecoder(cell=self.decoder.wrapped_decoder_cell,
                                                    sampler=greedy_sampler,
                                                    output_layer=self.decoder.generation_dense,
                                                    maximum_iterations=self.max_length)
        self.decoder.attention.setup_memory(encoder_output)

        # decoder_initial_state is still an output of the encoder, we pass it to
        # the decoder_instance in order to get the outputs
        decoder_initial_state = self.decoder.build_initial_state(batch_size, [encoder_h, encoder_s])
        decoder_embedding_matrix = self.decoder.embedding.variables[0]
        outputs, _, _ = decoder_instance(decoder_embedding_matrix,
                                         start_tokens=start_tokens,
                                         end_token=end_token,
                                         initial_state=decoder_initial_state)
        return outputs

    def translate(self, generated, output_tokenizer):
        return output_tokenizer.sequences_to_texts(generated.sample_id.numpy())

    def beam_translate(self, results, output_tokenizer):
        return output_tokenizer.sequences_to_texts(results[0][:,0,:])

    def beam_generate(self, output_tokenizer, input_ids,token_type_ids, attention_mask=None, beam_width=3, length_penalty=0.5):
        batch_size = input_ids.shape[0] # input_ids is the minibatch
        if self.encoder.use_token_type_ids:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask,
                                                                  'token_type_ids': token_type_ids})
        else:
          encoder_output, encoder_h, encoder_s = self.encoder({'input_ids': input_ids,
                                                                  'attention_mask': attention_mask})
        start_tokens = tf.fill([batch_size], output_tokenizer.word_index['<start>'])
        end_token = output_tokenizer.word_index['<end>']
        
        # From official documentation
        # NOTE If you are using the BeamSearchDecoder with a cell wrapped in AttentionWrapper, then you must ensure that:
        # The encoder output has been tiled to beam_width via tfa.seq2seq.tile_batch (NOT tf.tile).
        # The batch_size argument passed to the get_initial_state method of this wrapper is equal to true_batch_size * beam_width.
        # The initial state created with get_initial_state above contains a cell_state value containing properly tiled final state from the encoder.

        encoder_output = tfa.seq2seq.tile_batch(encoder_output, multiplier=beam_width)
        self.decoder.attention.setup_memory(encoder_output)

        # set decoder_initial_state, which is an AttentionWrapperState, considering beam_width
        hidden_state = tfa.seq2seq.tile_batch([encoder_h, encoder_s], multiplier=beam_width)
        decoder_initial_state = self.decoder.build_initial_state(beam_width*batch_size, hidden_state)

        # Instantiate BeamSearchDecoder
        decoder_instance = tfa.seq2seq.BeamSearchDecoder(self.decoder.wrapped_decoder_cell,
                                                          beam_width=beam_width,
                                                          output_layer=self.decoder.generation_dense,
                                                          length_penalty_weight=length_penalty,
                                                          maximum_iterations=self.max_length)
        decoder_embedding_matrix = self.decoder.embedding.variables[0]

        # The BeamSearchDecoder object's call() function takes care of everything.
        outputs, final_state, sequence_lengths = decoder_instance(decoder_embedding_matrix, 
                                                                  start_tokens=start_tokens,
                                                                  end_token=end_token,
                                                                  initial_state=decoder_initial_state)
        # outputs is a tfa.seq2seq.FinalBeamSearchDecoderOutput object.
        # The final beam predictions are stored in outputs.predicted_ids
        # outputs.beam_search_decoder_output is a tfa.seq2seq.BeamSearchDecoderOutput object which keeps track of beam_scores and parent_ids while performing a beam decoding step
        # final_state is a tfa.seq2seq.BeamSearchDecoderState object.
        # sequence_lengths has shape [inference_batch_size, beam_width] and holds the length of each generated beam

        # outputs.predicted_ids.shape = (inference_batch_size, time_step_outputs, beam_width)
        # outputs.beam_search_decoder_output.scores.shape = (inference_batch_size, time_step_outputs, beam_width)
        # Convert the shape of outputs and beam_scores to (inference_batch_size, beam_width, time_step_outputs)
        final_outputs = tf.transpose(outputs.predicted_ids, perm=(0,2,1))
        beam_scores = tf.transpose(outputs.beam_search_decoder_output.scores, perm=(0,2,1))

        return final_outputs.numpy(), beam_scores.numpy()

      


class Encoder(tf.keras.Model):

    def __init__(self, model_name, decoder_units):
        super(Encoder, self).__init__()
        self.model = TFAutoModel.from_pretrained(model_name, from_pt=True, trainable=False)
        self.model.trainable=False
        self.reducer = tf.keras.layers.Dense(decoder_units)
        self.reducer2 = tf.keras.layers.Dense(decoder_units)
        self.avg_pool = tf.keras.layers.AveragePooling1D(pool_size = 512)
        self.use_token_type_ids = model_name=='prajjwal1/bert-tiny'
        self.emb_layer_postags = tf.keras.layers.Embedding(682, 512)
        self.conv_pos = tf.keras.layers.Conv1D(decoder_units, kernel_size = 682)
        # the projected states must match the number of units of the decoder LSTM cell
        self.dense_pos_hidden = tf.keras.layers.Dense(decoder_units)
        self.dense_pos_cell = tf.keras.layers.Dense(decoder_units)

    def call(self, inputs, training=False, **kwargs):
        inputs_no_pos = {"input_ids":inputs["input_ids"], "attention_mask":inputs["attention_mask"]}
        model_output = self.model(inputs_no_pos)
        
        # all_outputs has shape (batch_size * 512 * 128)
        all_outputs = model_output[0] # output of the last layer of the model
        #pooled_output = model_output[1] # last layer but processed by a linear 
                                        # layer and a tanh
        
        
        
        # cls coding
        hidden_pooled = all_outputs[:, 0, :]
        cell_state = self.avg_pool(all_outputs)
        cell_state = tf.reshape(cell_state, [all_outputs.shape[0], all_outputs.shape[2]])

        # NOTABENE: it could be possible to add something to improve the encoding
        
        # pooled output has shape (batch_size * 128)
        hidden_state = self.reducer(hidden_pooled)
        cell_state = self.reducer2(cell_state)
        #return all_outputs, self.reducer(model_output[1]), self.reducer(model_output[1])

        to_concat = self.emb_layer_postags(inputs["pos_tags"])
        to_concat = self.conv_pos(to_concat)
        # flatten to (batch_size, decoder_units) without hard-coding the batch size
        to_concat = tf.reshape(to_concat, (-1, to_concat.shape[-1]))

        #tf.print(to_concat.shape)
        #tf.print(all_outputs.shape)
        

        pos_hidden = tf.concat([hidden_state, to_concat], axis = 1)
        pos_cell = tf.concat([cell_state, to_concat], axis = 1)
        
        pos_hidden = self.dense_pos_hidden(pos_hidden)
        pos_cell = self.dense_pos_cell(pos_cell)
        #tf.print(pos_hidden.shape)
        #tf.print(pos_cell.shape)

        return all_outputs, pos_hidden, pos_cell


class Decoder(tf.keras.Model):

    def __init__(self, vocab_size, max_sequence_length, embedding_dim, decoder_units, batch_size):
        super(Decoder, self).__init__()

        self.max_sequence_length = max_sequence_length
        self.batch_size = batch_size

        self.decoder_units = decoder_units
        # NOTABENE: it is possible to change the embedding dimension and the number of LSTM cells
        self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                                   output_dim=embedding_dim)
        # NOTABENE: It could be possible to swap LSTMCell with GRUCell
        self.decoder_lstm_cell = tf.keras.layers.LSTMCell(self.decoder_units)
        # NOTABENE: Just one type of attention, it could be changed to seek for different
        # results
        self.attention = tfa.seq2seq.BahdanauAttention(units=self.decoder_units,
                                                       memory=None,
                                                       memory_sequence_length=self.batch_size * [max_sequence_length])

        self.wrapped_decoder_cell = tfa.seq2seq.AttentionWrapper(self.decoder_lstm_cell,
                                                                 self.attention,
                                                                 attention_layer_size=self.decoder_units) # adds the attention mechanism after a single
                                                                                # LSTM cell, because we pass a word at the time
        # dense layer needed to generate the distribution values over 
        # the size of the vocabulary (probability for each word)
        self.generation_dense = tf.keras.layers.Dense(vocab_size)
        # Above we describe why this cannot be changed and why it resembles
        # the GreedyEmbeddingSampler
        self.sampler = tfa.seq2seq.sampler.TrainingSampler()
        self.decoder = tfa.seq2seq.BasicDecoder(self.wrapped_decoder_cell,
                                                sampler=self.sampler,
                                                output_layer=self.generation_dense)

    def build_initial_state(self, batch_size, encoder_state):
        # after initializing the tensors within the attention layer to 0 we add
        # the designated initialization that allow us to query the embedding space,
        # which is passed as encoder_state.
        # We load the embedding of a single batch. Note that the pretrained
        # encoder is frozen (see Encoder.__init__), so only the projection
        # layers and the decoder are updated during training.
        initial_state = self.wrapped_decoder_cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)
        initial_state = initial_state.clone(cell_state=encoder_state) 
        return initial_state

    def call(self, inputs, training=False, **kwargs):
        # as shown in calls, inputs is a dictionary with entries:
        # "input_ids" : _decoder_input_token_ids_
        # "initial_state" : _result_of_build_initial_state_
        input_ids = inputs['input_ids']
        input_emb = self.embedding(input_ids)
        decoder_output, _, _ = self.decoder(input_emb,
                                            initial_state=inputs['initial_state'],
                                            sequence_length=self.batch_size * [self.max_sequence_length - 1])
        return decoder_output


# THIS IS A SEPARATOR ##########################################################

In [None]:
!pip3 install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting keras
  Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
Installing collected packages: keras
Successfully installed keras-2.11.0


In [92]:
from tensorflow_addons.seq2seq import decoder
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow_addons.seq2seq import sampler as sampler_py
import numpy as np
from typing import Tuple, List, Mapping, Union, Optional
import tempfile

class POSLSTMCell(tf.keras.layers.LSTMCell):
  """LSTM cell that appends POS-tag features to its input at every step."""

  def __init__(self, units, **kwargs):
    super(POSLSTMCell, self).__init__(units, **kwargs)

  def call(self, inputs, states, pos_tags, training=None):
    # Concatenate the POS tags to the inputs along the feature axis
    inputs = tf.concat([inputs, pos_tags], axis=-1)

    # LSTMCell.call returns (output, [h, c]), where output is the new h state
    output, new_states = super(POSLSTMCell, self).call(inputs, states, training=training)

    return output, new_states


class NomoreBasicDecoder(decoder.BaseDecoder):
  def __init__(
    self,
    cell: tf.keras.layers.Layer,
    sampler: sampler_py.Sampler,
    output_layer: Optional[tf.keras.layers.Layer] = None,
    **kwargs,):
    """Initialize BasicDecoder.
    Args:
      cell: A layer that implements the `tf.keras.layers.AbstractRNNCell`
        interface.
      sampler: A `tfa.seq2seq.Sampler` instance.
      output_layer: (Optional) An instance of `tf.keras.layers.Layer`, i.e.,
        `tf.keras.layers.Dense`. Optional layer to apply to the RNN output
          prior to storing the result or sampling.
      **kwargs: Other keyword arguments of `tfa.seq2seq.BaseDecoder`.
    """
    #keras_utils.assert_like_rnncell("cell", cell) removing security checks about the cell layer because we want to use a custom one
    self.cell = cell
    self.sampler = sampler
    self.output_layer = output_layer
    super().__init__(**kwargs)

  def initialize(self, inputs, initial_state=None, **kwargs):
    """Initialize the decoder."""
    # Assume the dtype of the cell is the output_size structure
    # containing the input_state's first component's dtype.
    #self._cell_dtype = tf.nest.flatten(initial_state)[0].dtype removing security checks about the cell layer because we want to use a custom one
    return self.sampler.initialize(inputs, **kwargs) + (initial_state,)

  @property
  def batch_size(self):
    return self.sampler.batch_size

  def _rnn_output_size(self):
    size = tf.TensorShape(self.cell.output_size)
    if self.output_layer is None:
        return size
    else:
        # To use layer's compute_output_shape, we need to convert the
        # RNNCell's output_size entries into shapes with an unknown
        # batch size.  We then pass this through the layer's
        # compute_output_shape and read off all but the first (batch)
        # dimensions to get the output size of the rnn with the layer
        # applied to the top.
        output_shape_with_unknown_batch = tf.nest.map_structure(
            lambda s: tf.TensorShape([None]).concatenate(s), size
        )
        layer_output_shape = self.output_layer.compute_output_shape(
            output_shape_with_unknown_batch
        )
        return tf.nest.map_structure(lambda s: s[1:], layer_output_shape)

  @property
  def output_size(self):
    # Return the cell output and the id
    return tfa.seq2seq.BasicDecoderOutput(
        rnn_output=self._rnn_output_size(), sample_id=self.sampler.sample_ids_shape
    )

  @property
  def output_dtype(self):
    # Assume the dtype of the cell is the output_size structure
    # containing the input_state's first component's dtype.
    # Return that structure and the sample_ids_dtype from the helper.
    # _cell_dtype is not set in initialize, so fall back to float32
    # (the dtype used in build_initial_state).
    dtype = getattr(self, "_cell_dtype", tf.float32)
    return tfa.seq2seq.BasicDecoderOutput(
        tf.nest.map_structure(lambda _: dtype, self._rnn_output_size()),
        self.sampler.sample_ids_dtype,
    )
  
  def step(self, time, inputs, state, training=None, **kwargs):
    """Perform a decoding step.
    Args:
      time: scalar `int32` tensor.
      inputs: A (structure of) input tensors.
      state: A (structure of) state tensors and TensorArrays.
      training: Python boolean.
    Returns:
      `(outputs, next_state, next_inputs, finished)`.
    """
    pos_tags = kwargs["pos_tags"]
    cell_outputs, cell_state = self.cell(inputs, state, pos_tags, training=training)
    cell_state = tf.nest.pack_sequence_as(state, tf.nest.flatten(cell_state))
    if self.output_layer is not None:
        cell_outputs = self.output_layer(cell_outputs)
    # sampling happens whether or not an output layer is configured
    sample_ids = self.sampler.sample(
        time=time, outputs=cell_outputs, state=cell_state)
    (finished, next_inputs, next_state) = self.sampler.next_inputs(time=time,
                                                                    outputs=cell_outputs,
                                                                    state=cell_state,
                                                                    sample_ids=sample_ids)
    outputs = tfa.seq2seq.BasicDecoderOutput(cell_outputs, sample_ids)
    return (outputs, next_state, next_inputs, finished)

In [None]:
# load dataset
max_sequence_length = 20
if model_name == 'prajjwal1/bert-tiny':
  train_ds = load_from_disk("drive/MyDrive/ckpt/train_ds")
  val_ds = load_from_disk("drive/MyDrive/ckpt/val_ds")
  test_ds = load_from_disk("drive/MyDrive/ckpt/test_ds")
else:
  train_ds = load_from_disk("drive/MyDrive/ckpt/train_ds_rob")
  val_ds = load_from_disk("drive/MyDrive/ckpt/val_ds_rob")
  test_ds = load_from_disk("drive/MyDrive/ckpt/test_ds_rob")



In [93]:
def train_loop(trainer, dataset, epochs, batch_size, checkpoint, checkpoint_prefix):
  steps_per_epoch = len(dataset)//batch_size

  for epoch in tqdm(range(epochs)):
    cumulative_loss = 0

    for batch_index in tqdm(range(steps_per_epoch), position=0, leave=True):
      loss = trainer.batch_fit(dataset[batch_index*batch_size:batch_index*batch_size+batch_size])
      cumulative_loss += loss

    checkpoint.save(file_prefix=checkpoint_prefix)
    # average over the number of batches actually processed in this epoch
    mean_loss = cumulative_loss / steps_per_epoch
    print(f"Current mean {mean_loss}")


def predict_loop(trainer, dataset, inference_batch_size,model_name,output_tokenizer, beam_search=False):
  ttids=None
  if beam_search:
    generation_func = trainer.beam_generate
    translation_func = trainer.beam_translate
  else:
    generation_func = trainer.generate
    translation_func = trainer.translate
  
  inference_step = len(dataset) // inference_batch_size
  predictions = []
  for step_index in tqdm(range(inference_step)):
    starting_index = step_index*inference_batch_size
    ending_index = step_index*inference_batch_size + inference_batch_size
    if model_name == 'prajjwal1/bert-tiny':  
      ttids = dataset["token_type_ids"][starting_index : ending_index]
    generated = generation_func(output_tokenizer=output_tokenizer, 
                                  input_ids=dataset["input_ids"][starting_index : ending_index],
                                  token_type_ids=ttids,
                                  attention_mask=dataset["attention_mask"][starting_index : ending_index])
    translated = translation_func(generated, output_tokenizer=output_tokenizer)
    # all this index arithmetic is needed in order to have coherent ids in the field "id"
    list_to_add = [{'prediction_text': translated[i - starting_index].split("<end>")[0], 'id':str(i)} for i in range(starting_index, ending_index)]
    predictions.extend(list_to_add)
  # handle the last, possibly incomplete, batch (skipped when the dataset size
  # is an exact multiple of the inference batch size)
  starting_index = inference_step*inference_batch_size
  if starting_index < len(dataset):
    if model_name == 'prajjwal1/bert-tiny':
      ttids = dataset["token_type_ids"][starting_index :]

    generated = generation_func(output_tokenizer=output_tokenizer,
                                input_ids=dataset["input_ids"][starting_index :],
                                token_type_ids=ttids,
                                attention_mask=dataset["attention_mask"][starting_index :])
    translated = translation_func(generated, output_tokenizer=output_tokenizer)

    predictions.extend([{'prediction_text': translated[i - starting_index].split("<end>")[0],
                         'id':str(i)} for i in range(starting_index, len(dataset))])
  return predictions
  
def save_prediction(prediction, filename):
  with open(filename, "wb") as f:
    pickle.dump(prediction, f)
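
The index arithmetic in `predict_loop` (full batches first, then the remainder) can also be expressed with a single helper. The following is only a minimal sketch: the name `batch_ranges` is hypothetical and is not used anywhere else in this notebook.

```python
def batch_ranges(num_examples, batch_size):
    """Yield (start, end) index pairs that cover range(num_examples) in order.

    The last pair is shorter when num_examples is not a multiple of batch_size,
    and nothing is yielded for an empty dataset.
    """
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)


# Example: 10 examples with batches of 4 give two full batches and one partial batch
ranges = list(batch_ranges(10, 4))
print(ranges)
```

With such a helper, prediction ids can be taken directly from `range(start, end)`, so no offset correction is needed when building the `id` field.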

In [None]:
from evaluate import load as lo

BATCH_SIZE = 14
EPOCHS = 3
INF_BS = 64
squad_metric = lo("squad")

if model_name == 'prajjwal1/bert-tiny':  
  checkpoint_dir = './gdrive/MyDrive/ckpt/dom/tiny'
  decoder_units = 128
else: 
  checkpoint_dir = './gdrive/MyDrive/ckpt/dom/rob'
  decoder_units = 512

checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
print(checkpoint_prefix)

./gdrive/MyDrive/ckpt/dom/rob/ckpt


In [88]:
import pandas as pd
to_add_train = [np.asarray(elem).flatten() for elem in XQA_train_pos_tensors]
to_add_val = [np.asarray(elem).flatten() for elem in XQA_val_pos_tensors]


to_add_df = pd.DataFrame({"pos_tags":to_add_train})
to_add_df_val = pd.DataFrame({"pos_tags":to_add_val})
#train_ds = train_ds.remove_columns(["pos_tags"])

#train_ds = train_ds.add_column(name="pos_tags", column=to_add_train)
val_ds = val_ds.add_column(name="pos_tags", column=to_add_val)

In [89]:
print(val_ds[0])

{'input_ids': <tf.Tensor: shape=(512,), dtype=int64, numpy=
array([    0, 12375,    21,   816,    11,     5,   177,   116,     2,
           2,  1640, 16256,    43,   480,  5095,  9005,  1008,  2330,
           6,    39,    78,  1175,    13,  3426,     6,     7,   244,
          39,   950, 15995,  3002,  2361,   412,   155,    12,   288,
          11,   302,    18,  2275,   815,  6376,    23, 18476,     4,
        1437, 50118, 50118, 15075,     6,    54,   956,    10,  1124,
           7,   517,  1065,  3098,    88,   371,   317,    11,     5,
        2103,     6,    58, 12315,   409,    30,    10,  7360,    78,
         457,   819,    31,  3426,     6,    54,    33, 14436,  2958,
         737,    19,    42,   898,     4,  1437, 50118, 50118, 30760,
         880, 28416,     8,   823,   362,    10,  3821,    12,  4530,
         483,    77,  7866, 15612,    18,  2051,  2506,    21, 13402,
        2500,     5,   618,    30,   412,    18,  1156,  7551,  2101,
        4980,     4,  1437, 50

In [None]:
results = []
results_beam = []
for train_seed in [42,1337,2022]:
  set_seed(train_seed)

  encoder = Encoder(model_name=model_name,
                        decoder_units=decoder_units)
    
  # Testing the decoder
  decoder = Decoder(vocab_size=len(output_tokenizer.word_index) + 1,
                        embedding_dim=100,
                        decoder_units=decoder_units,
                        batch_size=BATCH_SIZE,
                        max_sequence_length=max_sequence_length)
  # Training
  trainer = MyTrainer(encoder=encoder,
                        decoder=decoder,
                        max_length=max_sequence_length)
  
  checkpoint = tf.train.Checkpoint(optimizer=trainer.optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
  


  train_loop(trainer, train_ds, EPOCHS, BATCH_SIZE, checkpoint, checkpoint_prefix + str(train_seed))

  prediction = predict_loop(trainer, val_ds, INF_BS, model_name,output_tokenizer)
  save_prediction(prediction, checkpoint_dir + model_name[-4:] + "_" + str(train_seed) + "_pred.pickle")

  prediction_beam = predict_loop(trainer, val_ds, INF_BS, model_name,output_tokenizer, beam_search=True)
  save_prediction(prediction_beam, checkpoint_dir + model_name[-4:] + "_" + str(train_seed) + "_beampred.pickle")

  results.append(squad_metric.compute(predictions=prediction, references=val_ds['references']))
  results_beam.append(squad_metric.compute(predictions=prediction_beam, references=val_ds['references']))

  del(checkpoint)
  del(trainer)
  del(encoder)
  del(decoder) 

print("***VALIDATION RESULTS***")
print(results)
print(results_beam)
print(f"greedy exact match:{sum([res['exact_match'] for res in results])/len(results)}" )
print(f"greedy SQUAD-F1:{sum([res['f1'] for res in results])/len(results)}" )
print(f"beam exact match:{sum([res['exact_match'] for res in results_beam])/len(results_beam)}" )
print(f"beam SQUAD-F1:{sum([res['f1'] for res in results_beam])/len(results_beam)}" )

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.
 96%|█████████▌| 5874/6129 [1:36:12<04:09,  1.02it/s]

In [None]:
encoder = Encoder(model_name=model_name,
                        decoder_units=decoder_units)
    
# Testing the decoder
decoder = Decoder(vocab_size=len(output_tokenizer.word_index) + 1,
                        embedding_dim=100,
                        decoder_units=decoder_units,
                        batch_size=BATCH_SIZE,
                        max_sequence_length=max_sequence_length)
# Training
trainer = MyTrainer(encoder=encoder,
                        decoder=decoder,
                        max_length=max_sequence_length)
  
checkpoint = tf.train.Checkpoint(optimizer=trainer.optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
  
results = []
results_beam = []
for train_seed in [42,1337,2022]:

  checkpoint.restore(checkpoint_prefix + str(train_seed)+"-3")

  prediction = predict_loop(trainer, test_ds, INF_BS, model_name, output_tokenizer)
  save_prediction(prediction, checkpoint_dir + model_name[-4:] + "_" + str(train_seed) + "_testpred.pickle")

  prediction_beam = predict_loop(trainer, test_ds, INF_BS, model_name ,output_tokenizer, beam_search=True)
  save_prediction(prediction_beam, checkpoint_dir + model_name[-4:] + "_" + str(train_seed) + "_testbeampred.pickle")

  results.append(squad_metric.compute(predictions=prediction, references=test_ds['references']))
  results_beam.append(squad_metric.compute(predictions=prediction_beam, references=test_ds['references']))

print("***TEST RESULTS***")
print(results)
print(results_beam)
print(f"greedy exact match:{sum([res['exact_match'] for res in results])/len(results)}" )
print(f"greedy SQUAD-F1:{sum([res['f1'] for res in results])/len(results)}" )
print(f"beam exact match:{sum([res['exact_match'] for res in results_beam])/len(results_beam)}" )
print(f"beam SQUAD-F1:{sum([res['f1'] for res in results_beam])/len(results_beam)}" )

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f59090cd070>

In [None]:
#predictions = [{'prediction_text': translated[i - 128].split("<end>")[0], 'id':str(i)} for i in range(128, 256)]
print(type(prediction[0]))
for i in prediction[0:10]:
  print(i["prediction_text"])
  print(YQA_val[int(i["id"])])
  print(XQA_val[int(i["id"])])

<class 'dict'>
4-2 
the team from Liverpool
['Who was playing in the game?', "(CNN) -- Andy Carroll scored twice, his first goals for Liverpool, to help his club comfortably defeat Manchester City 3-0 in Monday's Premier League encounter at Anfield. \n\nCity, who needed a victory to move above Chelsea into third place in the table, were blown away by a devastating first half performance from Liverpool, who have consolidated sixth position with this result. \n\nLiverpool began brightly and nearly took a seventh-minute lead when Luis Suarez's fine strike was tipped onto the post by City's England goalkeeper Joe Hart. \n\nBut the visiting defense was struggling to cope with Liverpool's wave of attacks and the hosts took a deserved lead six minutes later when Carroll's superbly struck left-footed strike, from just outside the area, swerved past Hart for his first goal since joining the club for a British record transfer fee in January. \n\nLiverpool doubled their lead in the 34th minute wh

## [Task 4] Answer generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.
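
As an illustration of this formulation, the sketch below turns one dialogue into per-turn $(Q_i, P, A_i)$ triples. It assumes a hypothetical flattened layout with the keys `story`, `questions` and `answers` (this is not the raw CoQA JSON schema), and the function name `build_qp_examples` is also an assumption.

```python
from typing import Dict, List, Tuple


def build_qp_examples(dialogue: Dict) -> List[Tuple[str, str, str]]:
    """Return one (question, passage, answer) triple per dialogue turn."""
    passage = dialogue["story"]
    examples = []
    for question, answer in zip(dialogue["questions"], dialogue["answers"]):
        # Every turn shares the same passage P; no dialogue history is used here
        examples.append((question, passage, answer))
    return examples


# Toy dialogue in the hypothetical flattened layout described above
toy = {
    "story": "Anna has a red kite. She flies it in the park.",
    "questions": ["What does Anna have?", "Where does she fly it?"],
    "answers": ["a red kite", "in the park"],
}
examples = build_qp_examples(toy)
for q, p, a in examples:
    print(f"Q: {q} | A: {a}")
```

Each triple can then be tokenized as a (question, passage) pair for the encoder, with the answer as the decoder target.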

## [Task 5] Answer generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.
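
One possible way to feed $H$ to the model is to flatten the previous turns into a single string and prepend it to $Q_i$. The sketch below is only an assumption about the encoding: the helper names `build_history` and `build_input`, the plain-space separator and the optional `max_turns` truncation are illustrative choices, not requirements of the assignment.

```python
from typing import List, Optional, Tuple


def build_history(turns: List[Tuple[str, str]], i: int,
                  sep: str = " ", max_turns: Optional[int] = None) -> str:
    """Flatten H = {Q_0, A_0, ..., Q_{i-1}, A_{i-1}} into one string."""
    previous = turns[:i]
    if max_turns is not None:
        # Keep only the most recent turns (an empty history when max_turns is 0)
        previous = previous[-max_turns:] if max_turns > 0 else []
    return sep.join(f"{q} {a}" for q, a in previous)


def build_input(turns: List[Tuple[str, str]], i: int, sep: str = " ") -> str:
    """Prepend the flattened history to the current question Q_i."""
    history = build_history(turns, i, sep)
    question = turns[i][0]
    return f"{history}{sep}{question}" if history else question


turns = [
    ("Who has a kite?", "Anna"),
    ("What color is it?", "red"),
    ("Where does she fly it?", "in the park"),
]
print(build_input(turns, 2))
```

Truncating the history with `max_turns` is one way to keep the concatenated input within the maximum sequence length of the encoder.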

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation runs, one per seed: [42, 2022, 1337].$^1$
* Evaluate your models with the following metric: SQuAD F1-score.$^2$
* Fine-tune each transformer-based model for **3 epochs**.
* Report the SQuAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use the ```allennlp-models``` Python package (a companion to ```allennlp```) for a quick implementation of the SQuAD F1-score: ```from allennlp_models.rc.tools import squad```.
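If you prefer not to add a dependency, the metric is simple enough to write by hand. The sketch below follows the usual SQuAD v1.1 recipe (lowercase, strip punctuation and English articles, then compute token-overlap F1); it is a re-implementation for illustration, not the official script, and the names `normalize_answer`, `squad_f1` and `mean_f1` are our own.

```python
import re
import string
from collections import Counter
from statistics import mean
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most min(count) times.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: List[str], references: List[str]) -> float:
    """Average F1 over a list of (prediction, reference) pairs."""
    return mean(squad_f1(p, r) for p, r in zip(predictions, references))
```

Computing `mean_f1` once per seed and then reporting the average (and possibly the standard deviation) across the three seeds gives a compact summary of each model.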

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the 5 worst model errors for each source (w.r.t. SQuAD F1-score).
* Inspect the observed results and comment on them (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 
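The grouping step can be sketched as follows, assuming you have already collected one record per prediction. The record keys (`source`, `question`, `f1`), the helper name `worst_errors_by_source`, and the toy records are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def worst_errors_by_source(records: List[Dict], k: int = 5) -> Dict[str, List[Dict]]:
    """Group scored predictions by source and keep the k lowest-F1 ones."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["source"]].append(record)
    return {
        # sorted() is stable, so ties keep their original order.
        source: sorted(items, key=lambda r: r["f1"])[:k]
        for source, items in grouped.items()
    }


# Toy records, for illustration only.
toy_records = [
    {"source": "cnn", "question": "Who was playing in the game?", "f1": 0.9},
    {"source": "cnn", "question": "What was the score?", "f1": 0.1},
    {"source": "wikipedia", "question": "When was it founded?", "f1": 0.5},
]
worst = worst_errors_by_source(toy_records, k=1)
```

Storing the predicted and reference answers in each record as well makes it easier to spot patterns by hand, such as yes/no questions or answers that depend on earlier turns.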

# Assignment Evaluation

Assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given Overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-pasted terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show.

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment every single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections.

This lets you build clean, modular code that is easy to read and debug (notebooks can be quite tricky from time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include the data tokenization and conversion steps?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
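from transformers import AutoTokenizer

text = "Who was playing in the game?"  # placeholder input
max_length = 512  # placeholder length budget
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
# Alternatively, truncate explicitly (BERT accepts at most 512 tokens)
inputs = tokenizer(text, add_special_tokens=True, truncation=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']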
```

**Suggestion**: Hugging Face's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting an **out-of-memory error** when training my models. Do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size.
2. Try applying a different padding strategy (if you are applying padding): e.g., pad to a quantile of the sequence lengths instead of the maximum sequence length.
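The quantile-based padding idea can be sketched as follows. The 0.95 quantile, the 512-token cap, the pad id `0`, and the helper names are assumptions made for illustration; check the values that apply to your tokenizer and model.

```python
import math
from typing import List


def quantile_max_length(lengths: List[int], q: float = 0.95, cap: int = 512) -> int:
    """Return the sequence length at quantile q (nearest rank), capped at `cap`."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q * len(ordered)))  # nearest-rank definition
    return min(ordered[rank - 1], cap)


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = 0) -> List[int]:
    """Truncate sequences longer than max_length and pad shorter ones."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


# Toy token counts: one outlier would force every sequence to length 400.
toy_lengths = [10, 12, 15, 20, 400]
median_length = quantile_max_length(toy_lengths, q=0.5)
padded = pad_or_truncate([1, 2, 3], max_length=5)
```

Sequences longer than the chosen quantile are truncated, so this trades a small loss of context on a few outliers for much smaller tensors on every batch.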

---
---

# Contact

For any doubts, questions, issues, or requests for help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?