<a href="https://colab.research.google.com/github/Danysan1/ai-unibo-nlp-project/blob/main/a2/execution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2 execution

### Test file implementing BERT2BERT using Bert-Tiny model

Adapted the given example. DistilRoBERTa seems too heavy to work with TF, Bert-Tiny can actually be runned with really ugly results, getting out of memory using 1000 samples..
Only 3 epochs used as said in the assignement.

In [1]:
%pip install pandas numpy matplotlib transformers dataset tensorflow_addons

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
[K     |████████████████████████████████| 5.8 MB 9.5 MB/s 
[?25hCollecting dataset
  Downloading dataset-1.5.2-py2.py3-none-any.whl (18 kB)
Collecting tensorflow_addons
  Downloading tensorflow_addons-0.19.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 59.3 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 55.1 MB/s 
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
[K     |████████████████████████████████| 182 kB 59.7 MB/s 
Collecting banal>=1.0.1
  Downloading banal-1.0.6-py2.py3-none-any.whl (6.1 kB)
Collecting alembic>=0.6.2
  Down

## Data loading

### Dataset download

In [2]:
import os
import urllib.request
from tqdm import tqdm

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')
    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [3]:
data_folder = 'Dataset'

In [4]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path=data_folder, url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path=data_folder, url_path=test_url, suffix='test')

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:08, 5.86MB/s]                            


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:03, 2.84MB/s]                            

Download completed!





### Dataset loading

In [5]:
import numpy as np
import pandas as pd
import json
from os import path
from matplotlib import pyplot as plt

In [6]:
def loadDataset(filename):
    with open(path.join(data_folder, filename)) as file_obj:
        df = json.load(file_obj)["data"]
    print(f'{len(df)} stories / {len(df[0]["questions"])} questions in the first row')

    storyDType = pd.CategoricalDtype(pd.unique([story["story"] for story in df]))
    sourceDType = pd.CategoricalDtype(pd.unique([story["source"] for story in df]))
    print(f"Sources: {sourceDType.categories}")

    df = np.array([
        [
            sourceDType.categories.get_loc(story["source"]), # Sources factorization
            storyDType.categories.get_loc(story["story"]), # Sources factorization
            story["questions"][question_index]["input_text"],
            story["answers"][question_index]["input_text"],
            story["answers"][question_index]["span_text"],
        ]
        for story in df
        for question_index in range(len(story["questions"]))
        if story["answers"][question_index]["input_text"] != 'unknown'
    ])
    print(f'{df.shape} question-answer pairs x columns')
    print(f'First row: {df[0]}')
    
    # https://marcobonzanini.com/2021/09/15/tips-for-saving-memory-with-pandas/
    # https://pandas.pydata.org/docs/user_guide/categorical.html
    df = pd.DataFrame({
        "source": pd.Series(pd.Categorical.from_codes(df[:,0].astype(np.int16), dtype=sourceDType)),
        "p": pd.Series(pd.Categorical.from_codes(df[:,1].astype(np.int16), dtype=storyDType)),
        "q": df[:,2],
        "a": df[:,3],
        "span": df[:,4],
    })

    return df

In [7]:
train_df = loadDataset("train.json")
train_df.count()

7199 stories / 20 questions in the first row
Sources: Index(['wikipedia', 'cnn', 'gutenberg', 'race', 'mctest'], dtype='object')
(107276, 5) question-answer pairs x columns
First row: ['0' '0' 'When was the Vat formally opened?'
 'It was formally established in 1475' 'Formally established in 1475']


source    107276
p         107276
q         107276
a         107276
span      107276
dtype: int64

In [8]:
pd.unique(train_df["p"]).size

6605

In [9]:
pd.unique(train_df["span"]).size

99470

In [10]:
pd.unique(train_df["source"]).size

5

In [11]:
train_df.head()

Unnamed: 0,source,p,q,a,span
0,wikipedia,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475,Formally established in 1475
1,wikipedia,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research,he Vatican Library is a research library
2,wikipedia,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law",Vatican Library is a research library for hist...
3,wikipedia,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology",Vatican Library is a research library for hist...
4,wikipedia,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project,"March 2014, the Vatican Library began an initi..."


In [12]:
train_df.memory_usage(deep=True)

Index          128
source      107764
p         14241201
q          9110271
a          7714559
span      12090637
dtype: int64

In [13]:
#test_df = loadDataset("test.json")
#test_df.count()

## Data Pre-Processing

### Check unanswerable questions in the Train Dataset

In [14]:
idx = (train_df.a == 'unknown')
unanswerable = train_df[idx]
unanswerable.q.count()

0

All unanswerable questions in the Train Dataset have been already removed.

## Exploratory Data Analysis

In [15]:
train_df["p"][42]

'CHAPTER VII. THE DAUGHTER OF WITHERSTEEN \n\n"Lassiter, will you be my rider?" Jane had asked him. \n\n"I reckon so," he had replied. \n\nFew as the words were, Jane knew how infinitely much they implied. She wanted him to take charge of her cattle and horse and ranges, and save them if that were possible. Yet, though she could not have spoken aloud all she meant, she was perfectly honest with herself. Whatever the price to be paid, she must keep Lassiter close to her; she must shield from him the man who had led Milly Erne to Cottonwoods. In her fear she so controlled her mind that she did not whisper this Mormon\'s name to her own soul, she did not even think it. Besides, beyond this thing she regarded as a sacred obligation thrust upon her, was the need of a helper, of a friend, of a champion in this critical time. If she could rule this gun-man, as Venters had called him, if she could even keep him from shedding blood, what strategy to play his flame and his presence against the g

In [16]:
train_df["q"][42]

'Was Lassiter impressed with the horse?'

In [17]:
train_df["a"][42]

'Yes'

In [18]:
train_df["span"][42]

'When Jerd led out this slender, beautifully built horse Lassiter suddenly became all eyes.'

In [19]:
train_df["source"][42]

'gutenberg'

In [20]:
# TODO

## Train-Validation-Test split

In [21]:
# TODO

## Model definition

### Utilities

In [22]:
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from typing import List, Dict, Callable
import random

In [23]:
def predict_data(model: keras.Model,
                x: np.ndarray,
                prediction_info: Dict):
    """
    Inference routine of a given input set of examples

    :param model: Keras built and possibly trained model
    :param x: input set of examples in np.ndarray format
    :param prediction_info: dictionary storing model predict() argument information

    :return
        predictions: predicted labels in np.ndarray format
    """
    print(f'Starting prediction: \n{prediction_info}')
    print(f'Predicting on {x.shape[0]} samples')
    predictions = model.predict(x, **prediction_info)
    return predictions

In [24]:
def compute_f1(model: keras.Model, 
             x: np.ndarray, 
             y: np.ndarray):
    """
    Compute F1_score on the given data with corresponding labels

    :param model: Keras built and possibly trained model
    :param x: data in np.ndarray format
    :param y: ground-truth labels in np.ndarray format

    :return
        score: f1_macro_score
    """
    #predictions on the x set
    prediction_info = {
        'batch_size': 64,
        'verbose': 1
    }
    y_pred = predict_data(model=model, x=x, prediction_info=prediction_info)

    #compute argmax to take the best class for each sample
    y_pred = np.argmax(y_pred, axis=1)
    #compute the f1_macro
    score = f1_score(y, y_pred, average ='macro')
    return score

In [25]:
def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

In [26]:
import tensorflow as tf
import tensorflow_addons as tfa
from tqdm import tqdm
from copy import deepcopy
from transformers import TFAutoModel, AutoTokenizer, TFEncoderDecoderModel

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

### Question generation $f_\theta(P, Q)$ with text passage $P$ and question $Q$

BERT2BERT Bert-Tiny

In [27]:
class MyTrainer(object):
    """
    Simple wrapper class

    train_op -> uses tf.GradientTape to compute the loss
    batch_fit -> receives a batch and performs forward-backward passes (gradient included)
    """

    def __init__(self, keras_model):
        self.keras_model = keras_model
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=5e-05)

    @tf.function
    def compute_loss(self, inputs):
        loss = self.keras_model(inputs=inputs)
        return tf.reduce_mean(loss)

    @tf.function
    def train_op(self, inputs):
        with tf.GradientTape() as tape:
            loss = self.compute_loss(inputs=inputs)

        grads = tape.gradient(loss, self.keras_model.trainable_variables)
        return loss, grads

    @tf.function
    def batch_fit(self, inputs):
        loss, grads = self.train_op(inputs=inputs)
        self.optimizer.apply_gradients(zip(grads, self.keras_model.trainable_variables))
        return loss


class MyModel(tf.keras.Model):
    """
    Custom keras model that wraps the TFEncoderDecoderModel
    """

    def __init__(self, model_name, **kwargs):
        super(MyModel, self).__init__(**kwargs)
        self.model_name = model_name

        # tie_encoder_decoder to share weights and half the number of parameters
        self.model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(model_name, model_name,
                                                                           encoder_from_pt=True,
                                                                           decoder_from_pt=True,
                                                                           tie_encoder_decoder=True)

    def call(self, inputs, **kwargs):
        loss = self.model(input_ids=inputs['input_ids'],
                          attention_mask=inputs['input_attention_mask'],
                          decoder_input_ids=inputs['decoder_input_ids'],
                          decoder_attention_mask=inputs['labels_mask'],
                          labels=inputs['labels']).loss
        return loss

    def generate(self, **kwargs):
        return self.model.generate(decoder_start_token_id=self.model.config.decoder.pad_token_id,
                                   **kwargs)


In [28]:
# Download the model only if not already present.
# Saving  it in ./models/

def get_tokenizer(model_name):  
    data_path = "models"  
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, model_name)
    if not os.path.exists(data_path):
        print(f"Downloading model {model_name}... (it may take a while)")
        tokenizer = AutoTokenizer.from_pretrained(model_name, tie_encoder_decoder=True)
        tokenizer.save_pretrained(data_path)
        print(f"Download completed, saved in {data_path}!")
    else:
        print(f'Model already downloaded, loading from {data_path}')
        tokenizer = AutoTokenizer.from_pretrained(data_path, tie_encoder_decoder=True)

    return tokenizer


In [138]:
model_name = 'prajjwal1/bert-tiny'

#tokenizer = get_tokenizer(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, tie_encoder_decoder=True)
model = MyModel(model_name=model_name)

model.model.config.decoder_start_token_id = tokenizer.cls_token_id
model.model.config_eos_token_id = tokenizer.sep_token_id
model.model.config.pad_token_id = tokenizer.pad_token_id
model.model.config.vocab_size = model.model.config.encoder.vocab_size

trainer = MyTrainer(keras_model=model)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['bert.embeddings.position_ids', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the 

In [142]:
samples = 100

sample_questions = list(train_df['q'][:samples])
sample_spans = list(train_df['span'][:samples])
sample_answers = list(train_df['a'][:samples])




In [143]:
# input containing the question and the span
input_qs = tokenizer(sample_questions, sample_spans, add_special_tokens=False, padding=True)
input_ids, input_attention_mask = input_qs['input_ids'], input_qs['attention_mask']

# labels containing the answer
label_values = tokenizer(sample_answers, padding=True)
labels, labels_mask = label_values['input_ids'], label_values['attention_mask']

# Every labels has the same len(16), cause of padding=True
max_length = len(labels[0])

# Assigning id -100 to every padding token in labels
masked_labels = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in labels]

for idx in input_ids[0]:
    print("{}\t{}".format(idx, tokenizer.convert_ids_to_tokens(idx) if idx != -100 else "PAD"))

2043	when
2001	was
1996	the
12436	va
2102	##t
6246	formally
2441	opened
1029	?
6246	formally
2511	established
1999	in
16471	147
2629	##5
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]
0	[PAD]


In [144]:
epochs = 3

for epoch in tqdm(range(epochs)):
    batch = {'input_ids': tf.convert_to_tensor(input_ids, dtype=tf.int32),
                'input_attention_mask': tf.convert_to_tensor(input_attention_mask, dtype=tf.int32),
                'labels': tf.convert_to_tensor(labels_mask, dtype=tf.int32),
                'decoder_input_ids': tf.convert_to_tensor(deepcopy(labels), dtype=tf.int32),
                'labels_mask': tf.convert_to_tensor(labels_mask, dtype=tf.int32)
                }
    loss = trainer.batch_fit(inputs=batch)
    print(f'Epoch {epoch} -- Loss {loss}')
    # You can play with generation arguments to enforce
    #  beam search
    #  repetition penalty
    #  other sampling approaches
    generated = trainer.keras_model.generate(input_ids=tf.convert_to_tensor(input_ids, dtype=tf.int32),
                                                max_length=max_length,
                                                repetition_penalty=3.,
                                                min_length=5,
                                                no_repeat_ngram_size=3,
                                                early_stopping=True,
                                                num_beams=4
                                                )
    generated = tokenizer.batch_decode(generated, skip_special_tokens=True)
    print(f'Generated: {generated}')




Epoch 0 -- Loss 19.424400329589844


 33%|███▎      | 1/3 [01:09<02:18, 69.49s/it]

Generated: [') ( ( water water ) water water and water water or, water water', ') ( ( back north ) east east east - west west west - south', ') ( ( back north ) east east east - south east east west -', ') ( ( back north ) east east east - south west - east east', ') ( ( water water ) water water and water water or, water water', ') ( ( back north ) east east east - west west west - south', ') ( ( water water ) water water and water water or, water water', ') ( ( back north ) west west west - south west - east east', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( water water ) wa

 67%|██████▋   | 2/3 [02:16<01:08, 68.22s/it]

Generated: [') ( ( water water ) water water and water water or, water water', ') ( ( back north ) east east east - south east east west -', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) east east east - south east east west -', ') ( ( water water ) water water and water water or, water water', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west wes

100%|██████████| 3/3 [03:39<00:00, 73.08s/it]

Generated: [') ( ( water water ) water water and water water or, water water', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) east east east - south east east west -', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( back north ) west west west - south west - east east', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( water water ) water water and water water or, water water', ') ( ( back north ) west west west - south west - east east', ') ( ( back north ) west west west - south west - east east', ') ( ( water water ) water water and water water or, water water', ') ( ( water 




# TODO - Model evaluation