<a href="https://colab.research.google.com/github/Siddhivar/Boston_House_Pricing/blob/main/AutomaticQuestionGenerator_T5Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [30]:
!pip install --quiet  datasets #to access squad dataset
!pip install --quiet pyarrow   #to deal with parquet files for saving dataset if required
!pip install --quiet  tqdm     #for progress bars
!pip install --quiet transformers # for t5 model
!pip install --quiet tokenizers  #tokenizers from HuggingFace
!pip install --quiet sentencepiece #subword tokenizer used by T5
!pip install --quiet pytorch-lightning # pytorch wrapper
!pip install --quiet torchtext # text utilities

**Fetching Datasets**

In [31]:
#imports
import pandas as pd
import torch
from tqdm import tqdm
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from pprint import pprint
import copy

In [32]:
device  = 'cuda' if torch.cuda.is_available() else "cpu"

In [33]:
import pandas as pd

def create_pandas_dataset_from_csv(csv_file, answer_threshold=7, verbose=False):
    '''Create a Pandas DataFrame of short-answer rows from a CSV file.

    Params:
        csv_file: Path to the CSV file containing data.
        answer_threshold: Keep only Question-Answer pairs whose answer has fewer words than this.
        verbose: If True, also return the counts of long (skipped) and short (kept) answers.
    '''
    data = pd.read_csv(csv_file)
    # Word count per answer; astype(str) guards against missing (NaN) answers
    word_counts = data['answer_text'].astype(str).str.split().str.len()
    is_short = word_counts < answer_threshold
    result_df = data.loc[is_short, ['context', 'answer_text', 'question']].reset_index(drop=True)
    count_short = int(is_short.sum())
    count_long = len(data) - count_short
    if verbose:
        return result_df, count_long, count_short
    return result_df
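
The filter keeps a row only when its answer has fewer than `answer_threshold` words. A minimal, pandas-free sketch of that rule follows; the sample rows and the helper name `keep_short_answers` are made up for illustration and are not part of the notebook.

```python
def keep_short_answers(rows, answer_threshold=7):
    """Return the short-answer rows and the number of skipped long answers."""
    kept, count_long = [], 0
    for row in rows:
        # Whitespace word count, the same measure the notebook function uses.
        if len(str(row["answer_text"]).split()) >= answer_threshold:
            count_long += 1
        else:
            kept.append(row)
    return kept, count_long


# Hypothetical rows for illustration only.
rows = [
    {"context": "c1", "question": "q1", "answer_text": "in the late 1990s"},
    {"context": "c2", "question": "q2",
     "answer_text": "one two three four five six seven"},
]
kept, count_long = keep_short_answers(rows)
```

An answer with exactly `answer_threshold` words counts as long, because the comparison is `>=`.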


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [34]:
import pandas as pd
train_dataset = pd.read_csv('/content/drive/MyDrive/train_dataset.csv')
test_dataset = pd.read_csv('/content/drive/MyDrive/test_dataset.csv')
print(f"Total Train Samples: {len(train_dataset)}, Total Test Samples: {len(test_dataset)}")

Total Train Samples: 86819, Total Test Samples: 20302


In [35]:
train_dataset

Unnamed: 0,answer_text,answer_start,question,context,subject
0,in the late 1990s,269,When did Beyonce start becoming popular?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,Beyoncé
1,singing and dancing,207,What areas did Beyonce compete in when she was...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,Beyoncé
2,2003,526,When did Beyonce leave Destiny's Child and bec...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,Beyoncé
3,"Houston, Texas",166,In what city and state did Beyonce grow up?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,Beyoncé
4,late 1990s,276,In which decade did Beyonce become famous?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,Beyoncé
...,...,...,...,...,...
86814,Oregon,229,In what US state did Kathmandu first establish...,"Kathmandu Metropolitan City (KMC), in order to...",Kathmandu
86815,Rangoon,414,What was Yangon previously known as?,"Kathmandu Metropolitan City (KMC), in order to...",Kathmandu
86816,Minsk,476,With what Belorussian city does Kathmandu have...,"Kathmandu Metropolitan City (KMC), in order to...",Kathmandu
86817,1975,199,In what year did Kathmandu create its initial ...,"Kathmandu Metropolitan City (KMC), in order to...",Kathmandu


In [36]:
sample_train_dataset = train_dataset.iloc[0]  # Selecting the first row as an example

# Print the sample training dataset
print(sample_train_dataset)

# Accessing individual fields
context = sample_train_dataset['context']
question = sample_train_dataset['question']
answer = sample_train_dataset['answer_text']

# Print the individual fields
print('---------------' * 9)
print('\nBreaking it Down\n')
print("context:", context)
print("question:", question)
print("answer:", answer)

answer_text                                     in the late 1990s
answer_start                                                  269
question                 When did Beyonce start becoming popular?
context         Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...
subject                                                   Beyoncé
Name: 0, dtype: object
---------------------------------------------------------------------------------------------------------------------------------------

Breaking it Down

context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of B

In [37]:
import pandas as pd

def create_pandas_dataset(data):
    '''Create a Pandas DataFrame from a given dataset.

    Params:
        data: DataFrame containing the dataset.
    '''
    return data

# Load training dataset from CSV
train_dataset = pd.read_csv('/content/drive/MyDrive/train_dataset.csv')

# Load validation dataset from CSV
valid_dataset = pd.read_csv('/content/drive/MyDrive/test_dataset.csv')

# Create pandas datasets
df_train = create_pandas_dataset(train_dataset)
df_validation = create_pandas_dataset(valid_dataset)

# Print the shape of the datasets
print(f"Total Train Samples: {df_train.shape}, Total Validation Samples: {df_validation.shape}")


Total Train Samples: (86819, 5), Total Validation Samples: (20302, 5)


In [38]:
# Saving training dataset as Parquet
df_train.to_parquet('train_squad.parquet')

# Saving validation dataset as Parquet
df_validation.to_parquet('validation_squad.parquet')

**Creating a Pytorch DataSet for T5 Training and Validation**

In [39]:
# AdamW is taken from torch.optim in the fine-tuning cell; the transformers version is deprecated
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)

In [40]:
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small',model_max_length=512)
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')

In [41]:
import pandas as pd

class QuestionGenerationDataset(Dataset):
    def __init__(self, tokenizer, filepath, max_len_inp=512, max_len_out=96):
        self.path = filepath

        self.passage_column = "context"
        self.answer = "answer_text"  # Change to match the column name in your dataset
        self.question = "question"

        self.data = pd.read_parquet(self.path).iloc[:5000,:]  # Read data from Parquet file

        self.max_len_input = max_len_inp
        self.max_len_output = max_len_out
        self.tokenizer = tokenizer
        self.inputs = []
        self.targets = []
        self._build()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        source_ids = self.inputs[index]["input_ids"].squeeze()
        target_ids = self.targets[index]["input_ids"].squeeze()

        src_mask = self.inputs[index]["attention_mask"].squeeze()  # squeeze to get rid of the batch dimension
        target_mask = self.targets[index]["attention_mask"].squeeze()  # convert [batch,dim] to [dim]

        labels = target_ids.clone()  # make a copy of target_ids
        labels[labels == self.tokenizer.pad_token_id] = -100  # padding positions are ignored by the loss

        return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids, "target_mask": target_mask, "labels": labels}

    def _build(self):
        for rownum, val in tqdm(self.data.iterrows()):  # Iterating over the dataframe
            passage, answer, target = val[self.passage_column], val[self.answer], val[self.question]

            input_ = f"context: {passage}  answer: {answer}"  # Input format used here for question generation
            target = f"question: {str(target)}"  # Output format we require

            # tokenize inputs
            tokenized_inputs = self.tokenizer.batch_encode_plus(
                [input_], max_length=self.max_len_input, padding='max_length',
                truncation=True, return_tensors="pt"
            )
            # tokenize targets
            tokenized_targets = self.tokenizer.batch_encode_plus(
                [target], max_length=self.max_len_output, padding='max_length',
                truncation=True, return_tensors="pt"
            )

            self.inputs.append(tokenized_inputs)
            self.targets.append(tokenized_targets)
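
`__getitem__` replaces padding ids in the target with `-100` so that the loss skips those positions (`-100` is the default `ignore_index` of PyTorch's cross-entropy loss). A plain-Python sketch of that step, assuming pad id `0` as in T5's vocabulary; `mask_padding` and the token ids are hypothetical.

```python
def mask_padding(target_ids, pad_id=0, ignore_index=-100):
    """Return a copy of target_ids with padding positions set to ignore_index."""
    return [ignore_index if tok == pad_id else tok for tok in target_ids]


# Hypothetical token ids; the trailing zeros stand for <pad>.
target_ids = [822, 10, 366, 1, 0, 0, 0]
labels = mask_padding(target_ids)
```

The original `target_ids` list is left unchanged, mirroring the `clone()` call in the dataset class.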


In [42]:
train_path = '/content/train_squad.parquet' # change this accordingly
validation_path = '/content/validation_squad.parquet'
train_dataset = QuestionGenerationDataset(t5_tokenizer,train_path)
validation_dataset = QuestionGenerationDataset(t5_tokenizer,validation_path)

5000it [00:26, 187.72it/s]
5000it [00:18, 263.52it/s]


In [43]:
# Data Sample

train_sample = train_dataset[50] # thanks to __getitem__
decoded_train_input = t5_tokenizer.decode(train_sample['source_ids'])
decoded_train_output = t5_tokenizer.decode(train_sample['target_ids'])

print(decoded_train_input)
print(decoded_train_output)

context: Beyoncé Giselle Knowles was born in Houston, Texas, to Celestine Ann "Tina" Knowles (née Beyincé), a hairdresser and salon owner, and Mathew Knowles, a Xerox sales manager. Beyoncé's name is a tribute to her mother's maiden name. Beyoncé's younger sister Solange is also a singer and a former member of Destiny's Child. Mathew is African-American, while Tina is of Louisiana Creole descent (with African, Native American, French, Cajun, and distant Irish and Spanish ancestry). Through her mother, Beyoncé is a descendant of Acadian leader Joseph Broussard. She was raised in a Methodist household. answer: Joseph Broussard.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pa
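
The run of `<pad>` tokens in the decoded sample comes from `padding='max_length'`: every sequence is filled up to the fixed length, and the attention mask records which positions hold real tokens. A rough sketch of that behaviour in plain Python; this is not the tokenizer's implementation, and `pad_and_mask` and the ids are illustrative.

```python
def pad_and_mask(ids, max_length, pad_id=0):
    """Truncate or right-pad ids to max_length and build a matching attention mask."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


# Hypothetical token ids for a short input.
input_ids, attention_mask = pad_and_mask([2625, 10, 27, 1], max_length=8)
```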

**Fine Tuning T5**

In [44]:
import pytorch_lightning as pl
from torch.optim import AdamW
import argparse
from transformers import (
    get_linear_schedule_with_warmup
  )

class T5Tuner(pl.LightningModule):

    def __init__(self,t5model, t5tokenizer,batchsize=4):
        super().__init__()
        self.model = t5model
        self.tokenizer = t5tokenizer
        self.batch_size = batchsize

    def forward(self, input_ids, attention_mask=None,
                decoder_attention_mask=None,
                lm_labels=None):

        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=decoder_attention_mask,
            labels=lm_labels,
        )

        return outputs

    def training_step(self, batch, batch_idx):
        outputs = self.forward(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            decoder_attention_mask=batch['target_mask'],
            lm_labels=batch['labels']
        )

        loss = outputs[0]
        self.log('train_loss',loss)
        return loss

    def validation_step(self, batch, batch_idx):
        outputs = self.forward(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            decoder_attention_mask=batch['target_mask'],
            lm_labels=batch['labels']
        )

        loss = outputs[0]
        self.log("val_loss",loss)
        return loss

    def train_dataloader(self):
        # shuffle so consecutive batches are not all drawn from the same article
        return DataLoader(train_dataset, batch_size=self.batch_size,
                          shuffle=True, num_workers=2)

    def val_dataloader(self):
        return DataLoader(validation_dataset,
                          batch_size=self.batch_size,
                          num_workers=2)

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=3e-4, eps=1e-8)
        return optimizer
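
The cell imports `get_linear_schedule_with_warmup`, but `configure_optimizers` returns only the optimizer, so no schedule is applied to the base rate of `3e-4`. For reference, the multiplier that a linear warmup and decay schedule applies to the base rate can be sketched in plain Python. The step counts below are assumed values, not settings from this notebook, and `linear_warmup_factor` is a hypothetical name.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-4
# Assumed schedule: 100 warmup steps out of 1000 total.
lrs = [base_lr * linear_warmup_factor(s, 100, 1000) for s in (0, 50, 100, 550, 1000)]
```

The rate starts at zero, reaches `base_lr` at the end of warmup, and falls back to zero at the last step.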

In [45]:
model = T5Tuner(t5_model,t5_tokenizer)

trainer = pl.Trainer(max_epochs = 3,accelerator=device)

trainer.fit(model)


INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                       | Params | Mode
------------------------------------------------------------
0 | model | T5ForConditionalGeneration | 60.5 M | eval
------------------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
242.026   Total estimated model params size (MB)
0         Modules in train mode
277       Modules in eval mode


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.
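
With 5,000 training rows (the `iloc[:5000]` slice in the dataset class), the default batch size of 4, and `max_epochs=3`, the length of the run works out as below. This assumes a single device and no gradient accumulation.

```python
import math

num_samples = 5000   # rows kept by the dataset class
batch_size = 4       # T5Tuner default
max_epochs = 3       # value passed to pl.Trainer

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * max_epochs
print(steps_per_epoch, total_steps)
```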


In [47]:
# saving the model
!mkdir -p "t5_tokenizer"       # -p: no error if the directory already exists
!mkdir -p "t5_trained_model"
model.model.save_pretrained('t5_trained_model')
t5_tokenizer.save_pretrained('t5_tokenizer')


('t5_tokenizer/tokenizer_config.json',
 't5_tokenizer/special_tokens_map.json',
 't5_tokenizer/spiece.model',
 't5_tokenizer/added_tokens.json')

**Inference / Predictions**

In [48]:
trained_model_path = 't5_trained_model'
trained_tokenizer = 't5_tokenizer'
device = 'cpu'

In [49]:
import torch
import pickle

# Save model parameters
model_state = model.state_dict()
with open("model_state.pkl", "wb") as f:
    pickle.dump(model_state, f)

# Save tokenizer (optional); at this point the tokenizer object is named t5_tokenizer
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(t5_tokenizer, f)


In [50]:
model = T5ForConditionalGeneration.from_pretrained(trained_model_path)
tokenizer = T5Tokenizer.from_pretrained(trained_tokenizer)

**Text Sample**

In [51]:
context ="President Donald Trump said and predicted that some states would reopen this month."
answer = "Donald Trump"
text = "context: "+context + " " + "answer: " + answer
print(text)

context: President Donald Trump said and predicted that some states would reopen this month. answer: Donald Trump


In [52]:
context ="Since its topping out in 2013, One World Trade Center in New York City has been the tallest skyscraper in the United States."
answer = "World Trade Center"
text = "context: "+context + " " + "answer: " + answer
print(text)

context: Since its topping out in 2013, One World Trade Center in New York City has been the tallest skyscraper in the United States. answer: World Trade Center


In [53]:
encoding = tokenizer.encode_plus(text, max_length=512, padding='max_length',
                                 truncation=True,
                                 return_tensors="pt").to(device)
print(encoding.keys())
# encoding was already moved to `device`, so its tensors need no second .to()
input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

dict_keys(['input_ids', 'attention_mask'])


In [54]:
model.eval()
beam_outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=72, # How long the generated questions should be
    early_stopping=True,
    num_beams=5,
    num_return_sequences=2
)

for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    print(sent)

question: What is the tallest skyscraper in the US?
question: What is the tallest skyscraper in New York City?
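
`num_beams=5` with `num_return_sequences=2` asks `generate` to keep the five best partial sequences at each step and return the top two. The idea can be illustrated with a toy beam search over a hand-written table of next-token probabilities. Everything in this sketch (the table, the function name, scoring by summed log-probability with no length penalty) is an assumption for illustration and is far simpler than what `generate` does.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_PROBS = {
    "<s>": {"what": 0.6, "who": 0.4},
    "what": {"is": 0.9, "</s>": 0.1},
    "who": {"is": 0.7, "</s>": 0.3},
    "is": {"</s>": 1.0},
}


def beam_search(num_beams=2, max_length=4):
    """Return finished sequences as (log-probability, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_length):
        # Expand every live beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for tok, prob in TOY_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep only the best num_beams candidates; set aside those that ended.
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished


results = beam_search()
best_score, best_tokens = results[0]
```

With two beams, both "what is" and "who is" survive to the end, and the higher-probability sequence is ranked first.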


**Deployment Demo**

In [55]:
!pip install --quiet gradio==3.9

In [56]:
!pip install httpx==0.23.0
!pip install httpcore==0.15.0



In [57]:
def get_question(sentence, answer, mdl, tknizer):
  '''Generate one question whose answer is `answer`, given the context `sentence`.'''
  text = "context: {} answer: {}".format(sentence, answer)
  print(text)
  max_len = 256
  # padding=False replaces the deprecated pad_to_max_length=False
  encoding = tknizer.encode_plus(text, max_length=max_len, padding=False, truncation=True, return_tensors="pt")

  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  outs = mdl.generate(input_ids=input_ids,
                      attention_mask=attention_mask,
                      early_stopping=True,
                      num_beams=5,
                      num_return_sequences=1,
                      no_repeat_ngram_size=2,
                      max_length=72)

  dec = [tknizer.decode(ids, skip_special_tokens=True) for ids in outs]

  # Drop the "question:" prefix that the model was trained to emit
  question = dec[0].replace("question:", "").strip()
  return question
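
The string handling around the model call is simple enough to check in isolation: the prompt follows the `context: ... answer: ...` pattern used during training, and the `question:` marker is stripped from the decoded text. A small sketch with hypothetical helper names and a made-up example sentence:

```python
def build_prompt(context, answer):
    """Format the model input the same way get_question does."""
    return "context: {} answer: {}".format(context, answer)


def strip_question_prefix(decoded):
    """Remove the 'question:' marker and surrounding whitespace."""
    return decoded.replace("question:", "").strip()


prompt = build_prompt("Paris is the capital of France.", "Paris")
cleaned = strip_question_prefix("question: What is the capital of France?")
```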

In [58]:
context = "I am Siddhi Varshney. I am doing my Bachelor in Technology in Artificial Intelligence from Aligarh Muslim University."
answer = "Siddhi Varshney"

ques = get_question(context,answer,model,tokenizer)
print ("question: ",ques)

context: I am Siddhi Varshney. I am doing my Bachelor in Technology in Artificial Intelligence from Aligarh Muslim University. answer: Siddhi Varshney
question:  Who is I?


In [59]:
context = "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
answer = "France"

ques = get_question(context,answer,model,tokenizer)
print ("question: ",ques)

context: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (Norman comes from Norseman) raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries. answer: France
question:  In what region did the Normandy region originate from?


In [60]:
import gradio as gr

context = gr.inputs.Textbox(lines=5,placeholder="Enter paragraph/context here...")
answer = gr.inputs.Textbox(lines=3, placeholder="Enter answer/keyword here...")
question = gr.outputs.Textbox( type="auto", label="Question")

def generate_question(context,answer):
  return get_question(context,answer,model,tokenizer)

iface = gr.Interface(
  fn=generate_question,
  inputs=[context,answer],
  outputs=question)

iface.launch(debug=False,share=True)



IMPORTANT: You are using gradio version 3.9, however version 4.44.1 is available, please upgrade.
--------
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()

Could not create share link, please check your internet connection.


(<gradio.routes.App at 0x7cc79a31af80>, 'http://127.0.0.1:7861/', None)