# Generative Approach for Open Question Answering

This Jupyter notebook evaluates generative question-answering transformer models on the Pirá dataset after the retrieval step.

Generative models produce the answer from the question together with the retrieved supporting text.

The full repository is available on GitHub: https://github.com/C4AI/Pira

## Imports

In [None]:
from __future__ import print_function  # must precede all other statements in the cell
import pandas as pd
from simpletransformers.t5 import T5Model
from collections import Counter
import string
import re
import argparse
import json
import sys


## Choose whether to train the model or only evaluate it

In [None]:
DO_TRAINING = True

## Dataset information

Run the BM25.ipynb notebook first so that the retrieved supporting texts are saved to disk.

In [1]:
PATH_BASE = 'finetune_PT_PT_100Words_15Passages/'


## Loading Dataset

In [2]:
pd_dataset_train = pd.read_csv(PATH_BASE + "train.csv", index_col = 0)
pd_dataset_train.columns = ["input_text", "target_text"]
pd_dataset_train["prefix"] = "question"


pd_dataset_val = pd.read_csv(PATH_BASE + "val.csv", index_col = 0)
pd_dataset_val.columns = ["input_text", "target_text"]
pd_dataset_val["prefix"] = "question"


pd_dataset_test = pd.read_csv(PATH_BASE + "test.csv", index_col = 0)
pd_dataset_test.columns = ["input_text", "target_text"]
pd_dataset_test["input_text"] = "question: " + pd_dataset_test["input_text"]

test_input = pd_dataset_test["input_text"].values.tolist()
# Wrap each gold answer in a list: the evaluation expects a list of references per question.
test_target = [[answer] for answer in pd_dataset_test["target_text"].values.tolist()]
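
The loading cell follows the input convention of the simpletransformers T5 models: during training the library joins the `prefix` and `input_text` columns as `prefix: input_text`, whereas `predict` receives plain strings, which is why the test inputs get the `question: ` prefix by hand. The sketch below illustrates that convention in plain Python. The helper name `build_model_input` and the sample texts are hypothetical and are not part of the notebook or the library.

```python
def build_model_input(prefix, input_text):
    """Join a task prefix and the input text in the usual T5 style."""
    return f"{prefix}: {input_text}"


# Hypothetical training row in the three-column layout used above.
train_row = {
    "prefix": "question",
    "input_text": "Qual é a pergunta? <texto de apoio>",
    "target_text": "Uma resposta",
}

# What the model sees for a training row (assumed library behaviour).
train_input = build_model_input(train_row["prefix"], train_row["input_text"])

# What the notebook builds by hand for a test row.
test_input_example = "question: " + train_row["input_text"]

# Each gold answer is wrapped in its own list of references.
gold_answers = ["Uma resposta", "Outra resposta"]
wrapped = [[answer] for answer in gold_answers]

print(train_input == test_input_example)
print(wrapped)
```

With this layout the training and test inputs share the same textual form, so the model is evaluated on the kind of string it was fine-tuned on.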

## Model Parameters

`model_checkpoint`: a local path or a Hugging Face model name.

In [3]:
model_checkpoint = "unicamp-dl/ptt5-base-portuguese-vocab"

MAX_SEQ_LENGTH = 1536
MAX_OUTPUT_LENGTH = 16
EPOCHS = 20
BATCH_SIZE = 1
GRADIENT_ACCUMULATION = 32
LEARNING_RATE = 2e-5

In [4]:
model_args = {
    "preprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": MAX_SEQ_LENGTH,
    "train_batch_size": BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION,
    "num_train_epochs": EPOCHS,
    "no_save": True,
    "save_eval_checkpoints": False,
    "learning_rate": LEARNING_RATE,
    "use_multiprocessing": False,
    "save_steps": 2000,
    "save_model_every_epoch": False,
    "evaluate_during_training": True,
    "fp16": True,
    "target_max_len": MAX_OUTPUT_LENGTH
}
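
With a batch size of 1 and 32 gradient-accumulation steps, each optimizer update averages gradients over 32 examples. The sketch below works through that arithmetic. The training-set size of 1806 is an illustrative assumption, the function name `approx_optimizer_updates` is hypothetical, and the handling of a final partial accumulation window is library-dependent (floor division is assumed here).

```python
BATCH_SIZE = 1
GRADIENT_ACCUMULATION = 32
EPOCHS = 20


def approx_optimizer_updates(num_examples, batch_size, accumulation, epochs):
    """Approximate number of optimizer updates over the whole training run."""
    batches_per_epoch = -(-num_examples // batch_size)  # ceiling division
    # Assumption: a trailing partial accumulation window triggers no update.
    updates_per_epoch = batches_per_epoch // accumulation
    return updates_per_epoch * epochs


effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION
total_updates = approx_optimizer_updates(
    num_examples=1806,  # illustrative size, not read from the dataset
    batch_size=BATCH_SIZE,
    accumulation=GRADIENT_ACCUMULATION,
    epochs=EPOCHS,
)

print(effective_batch_size)
print(total_updates)
```

Keeping the per-step batch at 1 limits memory use for 1536-token inputs, while accumulation recovers a larger effective batch.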

## Training the model

The model is fine-tuned only if DO_TRAINING is set to True; otherwise the loaded checkpoint is used as is.

In [5]:
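# Note: this loads the "t5-base" checkpoint; model_checkpoint defined above is not used.
# Pass model_checkpoint as the second argument to fine-tune that checkpoint instead.
model = T5Model("t5", "t5-base", args=model_args)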
if DO_TRAINING:
    model.train_model(pd_dataset_train, eval_data=pd_dataset_val)


## Generating answers

In [7]:
preds = model.predict(test_input)

preds[0]


## Evaluation script

SQuAD evaluation script: https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py

The script is slightly modified for this notebook: articles are not removed during normalization, so that answers are treated consistently in both Portuguese and English.
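
To see why article removal is dropped, recall that the original SQuAD script strips the English articles `a`, `an` and `the` with a regular expression. The sketch below compares normalization with and without that step on one English and one Portuguese string. The function name `normalize` and the sample strings are hypothetical, chosen only for illustration.

```python
import re
import string


def normalize(text, remove_english_articles):
    """Lowercase, strip punctuation, optionally drop a/an/the, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    if remove_english_articles:
        # Same pattern as the original SQuAD v1.1 evaluation script.
        text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


english = "The Atlantic Ocean."
portuguese = "A plataforma e o navio."

for sample in (english, portuguese):
    print(normalize(sample, True), "|", normalize(sample, False))
```

In the Portuguese sample the English pattern removes the article "a" but keeps "o", so applying it would treat the two languages unevenly. Keeping all articles avoids that.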

In [6]:
def normalize_answer(s):
    """Lowercase text and remove punctuation and extra whitespace (articles are kept)."""

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_punc(lower(s)))


def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return (normalize_answer(prediction) == normalize_answer(ground_truth))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


def evaluate(gold_answers, predictions):
    f1 = exact_match = total = 0

    for ground_truths, prediction in zip(gold_answers, predictions):
        total += 1
        exact_match += metric_max_over_ground_truths(
            exact_match_score, prediction, ground_truths)
        f1 += metric_max_over_ground_truths(
            f1_score, prediction, ground_truths)

    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total

    return {'exact_match': exact_match, 'f1': f1}
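
As a worked example of the two metrics, the sketch below scores one hypothetical prediction against one hypothetical gold answer. It re-implements the token-overlap F1 in a self-contained form under the name `token_f1`. It lowercases but skips punctuation stripping for brevity, so it is a simplification of the functions above rather than a replacement.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "oil and gas"
gold = "oil and natural gas"

score = token_f1(prediction, gold)  # precision 3/3, recall 3/4
exact = prediction == gold

print(round(score, 3), exact)
```

Here precision is 1, recall is 0.75, and F1 is 6/7, while exact match is 0. Free-form generated answers often differ from the reference by a word or two, which is why F1 can be far higher than exact match.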

## Performing Evaluation

In [8]:
evaluate(test_target, preds)

{'exact_match': 1.7621145374449338, 'f1': 24.46817166146708}
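
Because both scores are percentages over the test set, the exact-match figure can be converted back into a count of fully correct answers. The sketch below does this under the assumption of 227 test examples; in the notebook, `len(test_target)` gives the actual size. The helper name `percent_to_count` is hypothetical.

```python
def percent_to_count(percent, num_examples):
    """Convert a percentage over num_examples back into a whole-number count."""
    return round(percent * num_examples / 100)


NUM_TEST_EXAMPLES = 227  # assumption; use len(test_target) in the notebook
exact_match = 1.7621145374449338  # value reported by evaluate() above

num_exact = percent_to_count(exact_match, NUM_TEST_EXAMPLES)
print(num_exact, "of", NUM_TEST_EXAMPLES)
```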