# Fine-tuning the 🤗 T5 model for end-to-end question generation (answer-agnostic)
In this notebook, we'll learn to fine-tune the 🤗 T5 model to **generate questions without providing answers**, using [Weights & Biases](https://wandb.ai/site) for measurements and logs.

### Dataset 🛢️
As the dataset, we use [SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/): *Stanford Question Answering Dataset is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.* (Unanswerable questions were introduced in SQuAD 2.0; in v1.1, every question has an answer in the passage.)

### Example 🤖
The process:
- You provide the context (the text you want to generate questions from).
- The model generates several questions in a single output sequence.

`Context: 
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."`

`Questions:`

- `Who created Python?`
- `When was Python first released?`
- `What is Python's design philosophy?`

### Requirements
- This is **not an introduction** to the Hugging Face Transformers library; it's a **hands-on guide to fine-tuning T5** for this specific task.
- If you're not familiar with Hugging Face, **you can take the free HF Course on Transformer models** [here](https://huggingface.co/course/chapter1).
- 🏗️ This notebook is a work in progress; some elements (see the TODO at the end) will change.

### Sources 📚
- [Transformer-based End-to-End Question Generation (paper)](https://arxiv.org/pdf/2005.01107v1.pdf)
- [Suraj Patil's work on question generation](https://github.com/patil-suraj/question_generation/tree/bffa0a51e3ecba3922cafd13f424521135677303)

## Download and install the packages 📦

In [1]:
!pip install transformers
!pip install datasets
!pip install sentencepiece

!pip install tqdm

!pip install wandb

!sudo apt-get install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1
Looking in indexes: https://pypi.org/simple

In [2]:
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np
import torch
from datasets import load_dataset, load_metric, list_metrics
from google.colab import files
from huggingface_hub import notebook_login
from tqdm import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollator,
    EvalPrediction,
    T5ForConditionalGeneration,
    T5Tokenizer,
    T5TokenizerFast,
    Trainer,
    TrainingArguments,
)

- Connect to Weights & Biases:

In [22]:
import wandb
wandb.login()

%env WANDB_PROJECT=t5-end-to-end-questions-generation

[34m[1mwandb[0m: Currently logged in as: [33mnateethon04[0m ([33mman_01[0m). Use [1m`wandb login --relogin`[0m to force relogin


env: WANDB_PROJECT=t5-end-to-end-questions-generation
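
The `%env` magic only works inside IPython/Jupyter. If the same code runs as a plain Python script, a minimal sketch of the equivalent is to set the environment variable with `os.environ` before the `Trainer` is created (assuming, as in the cell above, that the W&B integration picks up the project name from `WANDB_PROJECT`):

```python
import os

# Plain-Python equivalent of `%env WANDB_PROJECT=...` from the notebook cell.
os.environ["WANDB_PROJECT"] = "t5-end-to-end-questions-generation"

print(os.environ["WANDB_PROJECT"])
```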


## Connect to Hugging Face 🤗
- To be able to share the model on the Hub, we need to **store our authentication token from the Hugging Face website**.


In [4]:
notebook_login()

- Then add your email and username to the Git config (Git LFS was already installed in the first cell):

In [5]:
!git config --global user.email "youremail@gmail.com"
!git config --global user.name "userName"

## Loading the dataset 📚
- We use [SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/), but a **modified version** where questions for a context are **concatenated**.
- You need to [download the file here](https://www.simoninithomas.com/hfdataset/squad_modified_for_t5_qg.zip), unzip it and upload it in the next cell.

In [6]:
files.upload()

Saving squad_modified_for_t5_qg.py to squad_modified_for_t5_qg.py


In [7]:
raw_dataset = load_dataset("squad_modified_for_t5_qg.py")

Downloading and preparing dataset squad_modified_for_t5_qg/plain_text to /root/.cache/huggingface/datasets/squad_modified_for_t5_qg/plain_text/1.0.0/02ae0815e8483cc76579286179faeb8c8fdbdd328e6741f5c465d9b0bddb8a77...


Dataset squad_modified_for_t5_qg downloaded and prepared to /root/.cache/huggingface/datasets/squad_modified_for_t5_qg/plain_text/1.0.0/02ae0815e8483cc76579286179faeb8c8fdbdd328e6741f5c465d9b0bddb8a77. Subsequent calls will reuse this data.


- Let's look at one example from the dataset:

In [8]:
raw_dataset["train"][0]

{'context': 'generate questions: Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'questions': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? {sep_token} What is in front of the Notre Dame Main Building? {sep_token} The Basilica of the Sacred heart at Notre Dame is beside to which structure? {sep_token} What is the Grotto

## Preprocessing the data 🔧
- We first load the `"t5-base"` model and its `T5TokenizerFast` tokenizer.


In [9]:
checkpoint = "t5-base"
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = T5TokenizerFast.from_pretrained(checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


- Because we separate the questions with a `<sep>` token, we need to add it to the tokenizer's vocabulary.

In [10]:
tokenizer.sep_token = '<sep>'

In [11]:
tokenizer.add_tokens(['<sep>'])
model.resize_token_embeddings(len(tokenizer))

Embedding(32101, 768)

In [12]:
# Check the sep_token_id to verify that it was added to the tokenizer
tokenizer.sep_token_id

32100

- Now, we need to preprocess the data in 3 steps:
1. `add_eos_examples`: Append `</s>` (the end-of-sequence token) to each context and to each concatenated list of questions.
2. `add_special_tokens`: Replace the `{sep_token}` placeholder between questions with the `<sep>` token.
3. `convert_to_features`: Tokenize the examples with the tokenizer, truncating and padding inputs to `max_input_length` and targets to `max_target_length`.
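
To make the first two steps concrete, here is a small sketch on a hypothetical toy record (not taken from the dataset) using plain string operations. The helper name `preview_preprocessing` is an assumption for illustration only, and tokenization (step 3) is left out:

```python
def preview_preprocessing(example):
    """Apply steps 1 and 2 to a copy of one record using plain string operations."""
    out = dict(example)
    # Step 1: append the end-of-sequence marker to the context and the questions.
    out["context"] = out["context"] + " </s>"
    out["questions"] = out["questions"] + " </s>"
    # Step 2: replace the `{sep_token}` placeholder with the `<sep>` token.
    out["questions"] = out["questions"].replace("{sep_token}", "<sep>")
    return out


# Hypothetical toy record in the same shape as the dataset rows.
toy = {
    "context": "generate questions: Python was first released in 1991.",
    "questions": "When was Python first released? {sep_token} What was released in 1991?",
}
processed = preview_preprocessing(toy)
print(processed["questions"])
```

The printed target is the single string the model learns to produce: all questions for a context, joined by `<sep>` and terminated by `</s>`.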

In [13]:
max_input_length =  512
max_target_length = 64

In [14]:
# Tokenize a batch of examples (used with `map(..., batched=True)`)
def convert_to_features(example_batch):
    # Encode the contexts, truncated/padded to max_input_length
    input_encodings = tokenizer.batch_encode_plus(
        example_batch['context'],
        max_length=max_input_length,
        add_special_tokens=True,
        truncation=True,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
    )

    # Encode the concatenated questions, truncated/padded to max_target_length
    target_encodings = tokenizer.batch_encode_plus(
        example_batch['questions'],
        max_length=max_target_length,
        add_special_tokens=True,
        truncation=True,
        padding='max_length',
    )

    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        # Target token ids; the data collator later turns them into `labels`
        'decoder_input_ids': target_encodings['input_ids'],
        'decoder_attention_mask': target_encodings['attention_mask'],
    }

    return encodings


# Append the end-of-sequence token to the context and to the questions
def add_eos_examples(example):
    example['context'] = example['context'] + " </s>"
    example['questions'] = example['questions'] + " </s>"
    return example


# Replace the `{sep_token}` placeholder with the `<sep>` token
def add_special_tokens(example):
    example['questions'] = example['questions'].replace("{sep_token}", '<sep>')
    return example

In [15]:
tokenized_dataset = raw_dataset.map(add_eos_examples)
tokenized_dataset = tokenized_dataset.map(add_special_tokens)
tokenized_dataset = tokenized_dataset.map(convert_to_features, batched=True)

Map:   0%|          | 0/18896 [00:00<?, ? examples/s]

Map:   0%|          | 0/2067 [00:00<?, ? examples/s]


In [16]:
tokenized_dataset["train"][0]["context"]

'generate questions: Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. </s>'

- Finally, we remove the `context` and `questions` columns, which are no longer needed, and split `tokenized_dataset` into a training and a validation dataset.

In [17]:
tokenized_dataset = tokenized_dataset.remove_columns(
    ["context", "questions"]
)

train_dataset = tokenized_dataset["train"]
valid_dataset = tokenized_dataset["validation"]

columns = ['input_ids', 'decoder_input_ids', 'attention_mask', 'decoder_attention_mask']
train_dataset.set_format(type='torch', columns=columns)
valid_dataset.set_format(type='torch', columns=columns)

In [18]:
torch.save(train_dataset, 'train_data.pt')
torch.save(valid_dataset, 'valid_data.pt')

## Fine-Tuning the t5 model 🧮
- We build a custom data collator. A data collator **forms a batch from a list of dataset elements.**

In [19]:
# This dataclass implementation is taken from Suraj Patil: https://github.com/patil-suraj/question_generation
@dataclass
class T2TDataCollator():
  def __call__(self, batch: List) -> Dict[str, torch.Tensor]:
    """
    Take a list of samples from a Dataset and collate them into a batch.
    Returns:
    A dictionary of tensors
    """
    
    # Stack the per-example tensors into batch tensors
    input_ids = torch.stack([example['input_ids'] for example in batch])
    lm_labels = torch.stack([example['decoder_input_ids'] for example in batch])
    # Replace padding (token id 0 for T5) with -100 so the loss ignores those positions
    lm_labels[lm_labels == 0] = -100
    attention_mask = torch.stack([example['attention_mask'] for example in batch])
    decoder_attention_mask = torch.stack([example['decoder_attention_mask'] for example in batch])
    
    return {
        'input_ids': input_ids, 
        'attention_mask': attention_mask,
        'labels': lm_labels, 
        'decoder_attention_mask': decoder_attention_mask
    }
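
The key step in the collator is the label masking: padded positions in the targets are replaced with `-100`, the index that PyTorch's cross-entropy loss ignores by default. Here is a rough sketch of the same idea on plain Python lists, with made-up token ids and assuming (as the collator above does) that the padding id is 0. The helper name `mask_padding` is hypothetical:

```python
PAD_TOKEN_ID = 0      # assumed padding id, as hard-coded in the collator above
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_rows):
    """Return a copy of the label rows with padding ids replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == PAD_TOKEN_ID else token_id for token_id in row]
        for row in label_rows
    ]


# Made-up token ids for two padded target sequences.
labels = [[363, 19, 1, 0, 0], [2645, 1, 0, 0, 0]]
masked = mask_padding(labels)
print(masked)
```

Only the real target tokens then contribute to the training loss; the padding added to reach `max_target_length` does not.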

- We define the `TrainingArguments` object, which contains all the hyperparameters (learning rate, number of epochs, etc.).

In [23]:
training_args = TrainingArguments(
    output_dir="./gdrive/My Drive/models",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=16,  # effective train batch size per device: 4 * 16 = 64
    learning_rate=1e-4,
    num_train_epochs=7,
    logging_steps=100,
    run_name="end2end-questions-generation",
    evaluation_strategy="steps",  # evaluates every `logging_steps` steps, since eval_steps is not set
    save_steps=500,
    report_to="wandb",
    push_to_hub=True,
    hub_model_id="t5-end2end-questions-generation",  # replaces the deprecated push_to_hub_model_id
)
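
As a sanity check on these settings, the number of optimizer steps can be estimated by hand. The sketch below assumes a single device, 18,896 training examples (the size of the train split in this run), and that the `Trainer` counts `len(dataloader) // gradient_accumulation_steps` update steps per epoch. Treat it as back-of-the-envelope arithmetic rather than the library's exact logic:

```python
import math

# Values from the TrainingArguments cell; the dataset size is from the original run.
num_train_examples = 18896
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_train_epochs = 7

# Number of examples contributing to one optimizer update (single device assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = num_train_epochs * updates_per_epoch

print(effective_batch_size, updates_per_epoch, total_updates)
```

The estimate of 2065 updates agrees with the `train/global_step` value reported in the W&B run summary after training.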



In [24]:
logger = logging.getLogger(__name__)

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=T2TDataCollator()
)

# Training
trainer.train()

# When training is done, we push the fine-tuned model to the Hub
# (the string passed here is used as the commit message)
trainer.push_to_hub("t5-end2end-questions-generation")

wandb.finish()

Cloning https://huggingface.co/nateethon/t5-end2end-questions-generation into local empty directory.


| Step | Training Loss | Validation Loss |
|---|---|---|
| 100 | 2.5856 | 1.910648 |
| 200 | 1.9634 | 1.722611 |
| 300 | 1.8421 | 1.663294 |
| 400 | 1.742 | 1.634251 |
| 500 | 1.7129 | 1.612926 |
| 600 | 1.69 | 1.61103 |
| 700 | 1.6315 | 1.596406 |
| 800 | 1.6273 | 1.590179 |
| 900 | 1.6109 | 1.589966 |
| 1000 | 1.571 | 1.584867 |


Run summary:

| Metric | Value |
|---|---|
| eval/loss | 1.56704 |
| eval/runtime | 32.9865 |
| eval/samples_per_second | 62.662 |
| eval/steps_per_second | 15.673 |
| train/epoch | 6.99 |
| train/global_step | 2065.0 |
| train/learning_rate | 0.0 |
| train/loss | 1.479 |
| train/total_flos | 8.04798748164096e+16 |
| train/train_loss | 1.64998 |

## Testing the model 📝
- You can now load the model from the Hugging Face Hub and test it.

In [38]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast

hfmodel = T5ForConditionalGeneration.from_pretrained("nateethon/t5-end2end-questions-generation")

In [39]:
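def hf_run_model(input_string, **generator_args):
    # Default generation settings; keyword arguments passed by the caller override them
    generation_kwargs = {
        "max_length": 256,
        "num_beams": 4,
        "length_penalty": 1.5,
        "no_repeat_ngram_size": 3,
        "early_stopping": True,
    }
    generation_kwargs.update(generator_args)
    # Same task prefix and EOS suffix as used during training
    input_string = "generate questions: " + input_string + " </s>"
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = hfmodel.generate(input_ids, **generation_kwargs)
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    # Split the single generated sequence into individual questions
    output = [item.split("<sep>") for item in output]
    return output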

In [40]:
text = "Forrest Gump is a 1994 American comedy-drama film directed by Robert Zemeckis and written by Eric Roth. \
It is based on the 1986 novel of the same name by Winston Groom and stars Tom Hanks, Robin Wright, Gary Sinise, \
Mykelti Williamson and Sally Field. The story depicts several decades in the life of Forrest Gump (Hanks), \
a slow-witted but kind-hearted man from Alabama who witnesses and unwittingly influences several defining \
historical events in the 20th century United States. The film differs substantially from the novel."

In [41]:
hf_run_model(text)

[['Who directed the 1994 film Forrest Gump?',
  ' Who wrote the book of the same name for the film?',
  ' What is the name of the film based on the novel by Winston Groom?',
  '']]
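
The raw output above keeps the leading spaces left by the split and ends with an empty string. A minimal post-processing sketch (the helper name `clean_questions` is hypothetical and not part of the notebook) strips whitespace and drops empty entries:

```python
def clean_questions(raw_output):
    """Strip whitespace and drop empty strings from each list of split questions."""
    return [
        [question.strip() for question in questions if question.strip()]
        for questions in raw_output
    ]


# The list returned by `hf_run_model(text)` in the cell above.
raw = [[
    "Who directed the 1994 film Forrest Gump?",
    " Who wrote the book of the same name for the film?",
    " What is the name of the film based on the novel by Winston Groom?",
    "",
]]
cleaned = clean_questions(raw)
print(cleaned)
```

This cleanup does not detect questions that were cut off mid-sentence, so truncated entries would still need separate handling.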

In [42]:
text= "The abolition of feudal privileges by the National Constituent Assembly on 4 August 1789 and the Declaration \
of the Rights of Man and of the Citizen (La Déclaration des Droits de l'Homme et du Citoyen), drafted by Lafayette \
with the help of Thomas Jefferson and adopted on 26 August, paved the way to a Constitutional Monarchy \
(4 September 1791 – 21 September 1792). Despite these dramatic changes, life at the court continued, while the situation \
in Paris was becoming critical because of bread shortages in September. On 5 October 1789, a crowd from Paris descended upon Versailles \
and forced the royal family to move to the Tuileries Palace in Paris, where they lived under a form of house arrest under \
the watch of Lafayette's Garde Nationale, while the Comte de Provence and his wife were allowed to reside in the \
Petit Luxembourg, where they remained until they went into exile on 20 June 1791."

In [43]:
hf_run_model(text)

[['When did the National Constituent Assembly abolish feudal privileges?',
  ' Who drafted the Declaration of the Rights of Man and of the Citizen?',
  ' When was the Constitutional Monarchy established?',
  ' What happened to the royal family on 5 October 1789?',
  ' Where did the Comte de']]

In [44]:
text1=''' The signature dish of the state in northeast Mexico is carne asada, meaning “grilled meat.” The Spanish term, however, signifies more than a meal; it’s a beloved social ritual.

The meat-heavy cuisine of Nuevo León reminds actor, producer and TV host Eva Longoria of the kinds of foods she ate during her childhood in Texas, which was once a part of the Spanish Empire and then Mexico.

“I’m Mexican American. We’ve been in Texas for 13 generations,” Longoria said in an episode of the CNN Original Series “Eva Longoria: Searching for Mexico.” “We never crossed the border; the border crossed us. And I think that’s why I have so much in common with Nuevo León and the North. It’s so similar to how I grew up.”

While shooting in Monterrey, the state capital, Longoria joined Alejandro Gutiérrez, founder of the Sociedad Mexicana de Parrilleros, or Mexican Society of Grill Masters, for a feast of carne asada.

Gutiérrez’s tip for extra-juicy aguja norteña steaks, which are similar to chuck eye steaks, is grilling the fillets at a searingly hot temperature and flipping them frequently.'''

In [45]:
hf_run_model(text1)

[['What is the signature dish of Nuevo León?',
  ' What does carne asada mean in Spanish?',
  ' How long has Eva Longoria lived in Texas?',
  ' Who is the founder of Sociedad Mexicana de Parrilleros?',
  '']]

In [46]:
text2="Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum \
and first released in 1991, Python's design philosophy emphasizes code \
readability with its notable use of significant whitespace."

In [47]:
hf_run_model(text2)

[['Who created Python?',
  ' When was Python first released?',
  " What is Python's design philosophy?",
  '']]

## What's next?
- **This notebook is a work in progress.** The next step is to add an evaluation using ROUGE metrics. If you don't know this metric, check this [article](https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460).
- As explained in [the paper](https://arxiv.org/pdf/2005.01107v1.pdf), most of the questions are closed questions. This is because 88.26% of the questions in the SQuAD training set are identification-type questions => **you can improve the model by adding other datasets, starting with SQuAD v2**.
- What about making a webapp? Check [Spaces](https://huggingface.co/spaces)
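
To give an idea of what a ROUGE-based evaluation measures, here is a simplified sketch of ROUGE-1 (unigram overlap) in pure Python. It assumes lowercased whitespace tokenization and is not the official implementation; for real evaluations a maintained ROUGE package is the safer choice. The function name `rouge1` and the sample strings are made up for illustration:

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: precision, recall and F1 over clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each unigram counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up generated question and reference question.
scores = rouge1("who created python", "who created the python language")
print(scores)
```

Here all three generated words appear in the reference (precision 1.0), while the reference has two extra words (recall 0.6), which gives an F1 of 0.75.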


## My TODO:
- Add a ROUGE evaluation
- W&B didn't record the training loss, only the evaluation loss.
- Add SQuAD v2
- Push the SQuAD version for question generation to the HF Hub (instead of uploading the .py file, which doesn't scale)
- Solve the issue with the Accelerated Inference API => because of the tokenizer

✅ Improve the postprocessing of questions

✅ Make a Spaces web app?
