<a href="https://colab.research.google.com/github/KhushnurLaboni/Question-Generation/blob/main/T5_End_to_End_Question_Generation_FineTuning_answer_agnostic_squadV2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning the T5 model for end-to-end (answer-agnostic) question generation on SQuAD v2 data

### Example
The process:
- You provide the context (the text you want to generate questions from).
- The model generates multiple questions simultaneously.

`Context: 
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."`

`Questions:`

- `Who created Python?`
- `When was Python first released?`
- `What is Python's design philosophy?`

### Sources
- [Transformer-based End-to-End Question Generation (paper)](https://arxiv.org/pdf/2005.01107v1.pdf)
- [Patil Suraj's work on question generation](https://github.com/patil-suraj/question_generation/tree/bffa0a51e3ecba3922cafd13f424521135677303)


# Download and install the packages

In [None]:
!pip install transformers
!pip install datasets
!pip install sentencepiece

!pip install tqdm

!pip install wandb

!sudo apt-get install git-lfs

In [29]:
# Standard library
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Third-party
import numpy as np
import torch
from tqdm import tqdm

from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollator,
    EvalPrediction,
    T5ForConditionalGeneration,
    T5Tokenizer,
    T5TokenizerFast,
    Trainer,
    TrainingArguments)

# Colab-only helper used to upload the dataset archive
from google.colab import files

- Connect to Weights & Biases:

In [30]:
import wandb
wandb.login()

%env WANDB_PROJECT=t5-small-end-to-end-questions-generation

env: WANDB_PROJECT=t5-small-end-to-end-questions-generation


## Connect to Hugging Face
- To be able to share the model on the Hub, **store the authentication token from the Hugging Face website**.


In [31]:
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


- Then add your email and username to the Git config (Git LFS was already installed in the setup cell above).

In [32]:
!git config --global user.email "khushnur.uiu006@gmail.com"
!git config --global user.name "Khushnur"

## Loading the dataset 📚
- SQuAD v2.0, in a modified version where all the questions for a context are concatenated.


In [28]:
files.upload()

Saving squad_V2_modified_for_t5_qg.zip to squad_V2_modified_for_t5_qg.zip


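The uploaded file is a zip archive, while the loading cell that follows expects the `squad_V2_modified_for_t5_qg.py` script next to the notebook, so the archive has to be extracted first. A minimal sketch using only the standard library is shown below. The helper name `extract_archive` is hypothetical, and the sketch assumes the archive sits in the current working directory.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Only runs when the uploaded archive is actually present (e.g. in Colab after files.upload()).
archive_name = "squad_V2_modified_for_t5_qg.zip"
if Path(archive_name).exists():
    print(extract_archive(archive_name))
```

In Colab, a shell command such as `!unzip squad_V2_modified_for_t5_qg.zip` would serve the same purpose.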

In [33]:
raw_dataset = load_dataset("squad_V2_modified_for_t5_qg.py")



  0%|          | 0/2 [00:00<?, ?it/s]

- Let's look at one example from the dataset:

In [10]:
raw_dataset["train"][0]


{'context': 'generate questions: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'questions': "When did Beyonce start becoming popular? {sep_token} What areas did Beyonce compete in when she was growing up? {sep_token} When did Beyonce leave Destiny's Child and become a solo singer? {sep_token} In what city and state did Beyonce  grow up? {sep_t

## Preprocessing the data 🔧
- Load the model (`"t5-base"` or `"t5-small"`) and the `T5TokenizerFast` tokenizer.


In [35]:
checkpoint = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = T5TokenizerFast.from_pretrained(checkpoint)

- Because we separate the questions with a `<sep>` token, we need to add it to the tokenizer's vocabulary.

In [36]:
tokenizer.sep_token = '<sep>'

In [37]:
tokenizer.add_tokens(['<sep>'])
model.resize_token_embeddings(len(tokenizer))

Embedding(32101, 512)

In [38]:
# Check the sep_token_id to verify that it was added to the tokenizer
tokenizer.sep_token_id

32100

- Now, we need to preprocess the data in 3 steps:
1. `add_eos_examples`: Add `</s>` (the end-of-sequence token) at the end of each context and of each concatenated questions string.
2. `add_special_tokens`: Replace the `{sep_token}` placeholder between questions with the `<sep>` token.
3. `convert_to_features`: Tokenize the examples with the tokenizer, padding and truncating them to the maximum lengths defined below.
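The first two steps are plain string operations, so their effect can be sketched without the tokenizer. The helper `prepare_target` and the sample string are hypothetical and only illustrate the transformation. The notebook applies the two steps in separate functions, in the opposite order, which gives the same result.

```python
SEP_PLACEHOLDER = "{sep_token}"  # placeholder used in the modified dataset
SEP_TOKEN = "<sep>"              # token added to the tokenizer
EOS_TOKEN = "</s>"               # T5 end-of-sequence token


def prepare_target(questions: str) -> str:
    """Swap the placeholder for the separator token, then append the EOS token."""
    return questions.replace(SEP_PLACEHOLDER, SEP_TOKEN) + " " + EOS_TOKEN


raw = "Who created Python? {sep_token} When was Python first released? {sep_token}"
print(prepare_target(raw))
# Who created Python? <sep> When was Python first released? <sep> </s>
```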

In [39]:
max_input_length =  512
max_target_length = 64

In [40]:
# Tokenize the examples
def convert_to_features(example_batch):

    # Pad or truncate every context to max_input_length tokens
    # (padding='max_length' replaces the deprecated pad_to_max_length=True)
    input_encodings = tokenizer.batch_encode_plus(example_batch['context'],
                                                  max_length=max_input_length,
                                                  add_special_tokens=True,
                                                  truncation=True,
                                                  padding='max_length')

    # Pad or truncate every questions string to max_target_length tokens
    target_encodings = tokenizer.batch_encode_plus(example_batch['questions'],
                                                   max_length=max_target_length,
                                                   add_special_tokens=True,
                                                   truncation=True,
                                                   padding='max_length')

    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'decoder_input_ids': target_encodings['input_ids'],
        'decoder_attention_mask': target_encodings['attention_mask']
    }

    return encodings

def add_eos_examples(example):
  example['context'] = example['context'] + " </s>"
  example['questions'] = example['questions'] + " </s>"
  return example


def add_special_tokens(example):
  example['questions'] = example['questions'].replace("{sep_token}", '<sep>')
  return example

In [41]:
tokenized_dataset = raw_dataset.map(add_eos_examples)
tokenized_dataset = tokenized_dataset.map(add_special_tokens)
tokenized_dataset = tokenized_dataset.map(convert_to_features, batched=True)



Map:   0%|          | 0/1204 [00:00<?, ? examples/s]



In [18]:
tokenized_dataset["train"][0]["context"]

'generate questions: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy". </s>'

In [19]:
tokenized_dataset["train"][0]["questions"]

"When did Beyonce start becoming popular? <sep> What areas did Beyonce compete in when she was growing up? <sep> When did Beyonce leave Destiny's Child and become a solo singer? <sep> In what city and state did Beyonce  grow up? <sep> In which decade did Beyonce become famous? <sep> In what R&B group was she the lead singer? <sep> What album made her a worldwide known artist? <sep> Who managed the Destiny's Child group? <sep> When did Beyoncé rise to fame? <sep> What role did Beyoncé have in Destiny's Child? <sep> What was the first album Beyoncé released as a solo artist? <sep> When did Beyoncé release Dangerously in Love? <sep> How many Grammy awards did Beyoncé win for her first solo album? <sep> What was Beyoncé's role in Destiny's Child? <sep> What was the name of Beyoncé's first solo album? <sep> </s>"

- Finally, we remove the `context` and `questions` columns, which are no longer needed, and split `tokenized_dataset` into a training and a validation dataset.

In [42]:
tokenized_dataset = tokenized_dataset.remove_columns(
    ["context", "questions"]
)

train_dataset = tokenized_dataset["train"]
valid_dataset = tokenized_dataset["validation"]

columns = ['input_ids', 'decoder_input_ids', 'attention_mask', 'decoder_attention_mask']
train_dataset.set_format(type='torch', columns=columns)
valid_dataset.set_format(type='torch', columns=columns)

In [43]:
torch.save(train_dataset, 'train_data.pt')
torch.save(valid_dataset, 'valid_data.pt')

## Fine-Tuning the t5 model
- We use a custom data collator. A data collator **forms a batch from a list of dataset elements.**

In [45]:
# This dataclass implementation is taken from Suraj Patil: https://github.com/patil-suraj/question_generation
@dataclass
class T2TDataCollator():
  def __call__(self, batch: List) -> Dict[str, torch.Tensor]:
    """
    Take a list of samples from a Dataset and collate them into a batch.
    Returns:
    A dictionary of tensors
    """
    
    input_ids = torch.stack([example['input_ids'] for example in batch])
    lm_labels = torch.stack([example['decoder_input_ids'] for example in batch])
    # Replace padding (token id 0) with -100 so the loss ignores those positions
    lm_labels[lm_labels == 0] = -100
    attention_mask = torch.stack([example['attention_mask'] for example in batch])
    decoder_attention_mask = torch.stack([example['decoder_attention_mask'] for example in batch])
    
    return {
        'input_ids': input_ids, 
        'attention_mask': attention_mask,
        'labels': lm_labels, 
        'decoder_attention_mask': decoder_attention_mask
    }
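The label masking in the collator can be illustrated without tensors. The sketch below assumes T5's padding token id is 0 and relies on the fact that PyTorch's cross-entropy loss skips targets equal to -100 by default. The helper `mask_padding` is a hypothetical name, not part of the notebook.

```python
PAD_TOKEN_ID = 0      # T5 padding token id
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids):
    """Return a copy of label_ids where padding positions are replaced by IGNORE_INDEX."""
    return [IGNORE_INDEX if token_id == PAD_TOKEN_ID else token_id for token_id in label_ids]


# Illustrative ids only: two content tokens, the EOS id (1), then two padding positions
print(mask_padding([363, 19, 1, 0, 0]))
# [363, 19, 1, -100, -100]
```

Without this masking, the model would also be trained to predict the padding tokens that fill each target up to `max_target_length`.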

In [46]:
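from transformers import TrainingArguments, Trainer, EvalPrediction
from sklearn.metrics import f1_score, accuracy_score

# Define the compute_metrics function (token-level scores, not question-level ones)
def compute_metrics(p: EvalPrediction):
    # For seq2seq models the predictions may be a tuple whose first element holds the logits
    logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    predictions = logits.argmax(-1)
    # Keep only the positions that are not masked with -100 (padding)
    mask = p.label_ids != -100
    labels, predictions = p.label_ids[mask], predictions[mask]
    return {
        "F1": f1_score(labels, predictions, average="weighted"),
        "EM": accuracy_score(labels, predictions)
    }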

- We define the `TrainingArguments` object, which holds all the hyperparameters (learning rate, number of epochs, etc.).

In [48]:
training_args = TrainingArguments(output_dir="./gdrive/My Drive/models", 
                                  per_device_train_batch_size=16, 
                                  per_device_eval_batch_size=16,
                                  gradient_accumulation_steps=64,
                                  learning_rate=1e-4, 
                                  num_train_epochs=1,
                                  logging_steps=100,
                                  run_name="end2end-questions-generation",
                                  evaluation_strategy="steps",
                                  save_steps=300,
                                  report_to="wandb",
                                  push_to_hub=True,
                                  push_to_hub_model_id="t5-small-end2end-questions-generation")
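With `per_device_train_batch_size=16` and `gradient_accumulation_steps=64`, each optimizer step covers 1024 examples on a single device, so one epoch contains only a handful of steps. The sketch below works through that arithmetic. The example count of 19,000 is an assumption (roughly what the logged throughput multiplied by the runtime suggests), and the floor division is an assumption about how partial accumulation windows are handled.

```python
import math


def optimizer_steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps):
    """Approximate number of optimizer updates in one epoch on a single device."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # Assumption: an incomplete accumulation window at the end of the epoch is dropped
    return max(batches_per_epoch // grad_accum_steps, 1)


effective_batch_size = 16 * 64
steps = optimizer_steps_per_epoch(num_examples=19_000,  # hypothetical dataset size
                                  per_device_batch_size=16,
                                  grad_accum_steps=64)
print(effective_batch_size, steps)
# 1024 18
```

Under these assumptions the run has about 18 steps, which matches the logged `train/global_step`. Since `logging_steps=100` is larger than that, this is a plausible reason why no training-loss row was recorded.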

In [49]:
logger = logging.getLogger(__name__)

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
    data_collator=T2TDataCollator()
)


# Training
trainer.train()

# When training is done, we push the fine-tuned model to the Hub
trainer.push_to_hub("t5-small-end2end-questions-generation")

wandb.finish()

/content/./gdrive/My Drive/models is already a clone of https://huggingface.co/Khushnur/t5-small-end2end-questions-generation. Make sure you pull the latest changes with `repo.git_pull()`.


Step,Training Loss,Validation Loss


Upload file training_args.bin: 100%|##########| 3.62k/3.62k [00:00<?, ?B/s]

remote: Scanning LFS files of refs/heads/main for validity...        
remote: LFS file scan complete.        
To https://huggingface.co/Khushnur/t5-small-end2end-questions-generation
   fd154ca..21e7ecf  main -> main


To https://huggingface.co/Khushnur/t5-small-end2end-questions-generation
   21e7ecf..f967bba  main -> main




Run summary:

| Metric | Value |
|---|---|
| train/epoch | 0.97 |
| train/global_step | 18.0 |
| train/total_flos | 2494620084731904.0 |
| train/train_loss | 4.85679 |
| train/train_runtime | 628.5558 |
| train/train_samples_per_second | 30.284 |
| train/train_steps_per_second | 0.029 |


In [None]:
'''
import gc
torch.cuda.empty_cache()
gc.collect() '''

In [39]:
!pip install numba

from numba import cuda 
device = cuda.get_current_device()
device.reset()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [50]:
# Initialize wandb
wandb.init(project="t5-small-end-to-end-questions-generation")

# Evaluate the model
evaluation_results = trainer.evaluate(valid_dataset)
print(evaluation_results)
# Print F1 and EM results
print("F1 Score: {:.3f}".format(evaluation_results["eval_F1"]))
print("EM Score: {:.3f}".format(evaluation_results["eval_EM"]))

# Log the evaluation results to wandb
wandb.log(evaluation_results)

# Finish the wandb run
wandb.finish()

OutOfMemoryError: ignored

## Testing the model 📝
- You can now load the model from the Hugging Face Hub and test it.

In [None]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast

hfmodel = T5ForConditionalGeneration.from_pretrained("Khushnur/t5-end2end-questions-generation")

In [None]:
def hf_run_model(input_string, **generator_args):
  # Default generation settings; keyword arguments passed by the caller override them
  generation_kwargs = {
    "max_length": 256,
    "num_beams": 4,
    "length_penalty": 1.5,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
  }
  generation_kwargs.update(generator_args)
  # Same prefix and EOS token as the training examples
  input_string = "generate questions: " + input_string + " </s>"
  input_ids = tokenizer.encode(input_string, return_tensors="pt")
  res = hfmodel.generate(input_ids, **generation_kwargs)
  output = tokenizer.batch_decode(res, skip_special_tokens=True)
  # Split the generated string into individual questions
  output = [item.split("<sep>") for item in output]
  return output
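Splitting on `<sep>` alone can leave surrounding whitespace, empty strings and repeated questions in the result. A possible post-processing step is sketched below, assuming the decoded string still contains the `<sep>` markers. The helper `split_questions` and the sample string are hypothetical; the sample is not actual model output.

```python
def split_questions(decoded: str, sep: str = "<sep>"):
    """Split a decoded string on sep, dropping empty parts and duplicates (order preserved)."""
    questions = []
    for part in decoded.split(sep):
        question = part.strip()
        if question and question not in questions:
            questions.append(question)
    return questions


# Made-up string in the same shape as the training targets
sample = "Who directed Forrest Gump? <sep> Who wrote Forrest Gump? <sep> Who directed Forrest Gump? <sep>"
print(split_questions(sample))
# ['Who directed Forrest Gump?', 'Who wrote Forrest Gump?']
```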

In [None]:
text = "Forrest Gump is a 1994 American comedy-drama film directed by Robert Zemeckis and written by Eric Roth. \
It is based on the 1986 novel of the same name by Winston Groom and stars Tom Hanks, Robin Wright, Gary Sinise, \
Mykelti Williamson and Sally Field. The story depicts several decades in the life of Forrest Gump (Hanks), \
a slow-witted but kind-hearted man from Alabama who witnesses and unwittingly influences several defining \
historical events in the 20th century United States. The film differs substantially from the novel."

In [None]:
hf_run_model(text)

In [None]:
text= "The abolition of feudal privileges by the National Constituent Assembly on 4 August 1789 and the Declaration \
of the Rights of Man and of the Citizen (La Déclaration des Droits de l'Homme et du Citoyen), drafted by Lafayette \
with the help of Thomas Jefferson and adopted on 26 August, paved the way to a Constitutional Monarchy \
(4 September 1791 – 21 September 1792). Despite these dramatic changes, life at the court continued, while the situation \
in Paris was becoming critical because of bread shortages in September. On 5 October 1789, a crowd from Paris descended upon Versailles \
and forced the royal family to move to the Tuileries Palace in Paris, where they lived under a form of house arrest under \
the watch of Lafayette's Garde Nationale, while the Comte de Provence and his wife were allowed to reside in the \
Petit Luxembourg, where they remained until they went into exile on 20 June 1791."

In [None]:
hf_run_model(text)

## What's next?
- **This notebook is a work in progress.** The first next step is to add an evaluation using ROUGE metrics. If you don't know this metric, check this [article](https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460).
- As explained in [the paper](https://arxiv.org/pdf/2005.01107v1.pdf), most of the questions are closed questions. This is because 88.26% of the questions in the SQuAD training set are identification-type questions => **you can improve the model by adding other datasets, by first trying SQuAD v2**
- What about making a webapp? Check [Spaces](https://huggingface.co/spaces)
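To give an idea of what the ROUGE evaluation mentioned in the first item measures, here is a simplified sketch of ROUGE-1 (unigram overlap) as an F1 score. It is not the official implementation: it lowercases and splits on whitespace only, with no stemming or punctuation handling. The function name `rouge1_f1` and the example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("who created python", "who created the python language")
print(round(score, 2))
# 0.75
```

For a real evaluation, an established ROUGE implementation would be a safer choice than this sketch.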


## My TODO:
- Add a ROUGE evaluation test
- Wandb didn't record the training loss, only the evaluation loss.
- Add SQuAD v2
- Push the SQuAD version for question generation to the HF Hub (instead of uploading a .py loading script, which doesn't scale)
- Solve the issue with Accelerated Inference API => because of the tokenizer

✅ Improve the postprocessing of questions

✅ Make a Spaces web app?
