<a href="https://colab.research.google.com/github/NnekaAsuzu/fine-tuning-on-any-question-answering-dataset-from-HuggingFace/blob/main/Finetuning_on_HG_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project**: Fine-tune a model on any question-answering dataset from Hugging Face and save it. Use the saved model to build a Gradio interface that takes a context passage as the first input and the user's question as the second input, and displays the model's answer as the output.



##### Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [47]:
!pip install datasets evaluate "transformers[sentencepiece]" accelerate torch gradio
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


## Set up Git: replace the email and name in the following cell with your own

In [2]:
!git config --global user.email "nnasuzu@gmail.com"
!git config --global user.name "NnekaAsuzu"

## Log in to the Hugging Face Hub

In [3]:
from huggingface_hub import notebook_login

notebook_login()


In [48]:
# Import libraries
import torch
import gradio as gr
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

In [49]:
# Download and load the Stanford Question Answering Dataset (SQuAD)
from datasets import load_dataset

dataset = load_dataset("squad")


In [30]:
dataset  # view the dataset splits and features

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [46]:
#To see the first element of the training set
print("Context: ", dataset["train"][0]["context"])
print("Question: ", dataset["train"][0]["question"])
print("Answer: ",dataset["train"][0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


In [31]:
dataset['train'][0] #to look at the data

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
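
The `answer_start` value in the record above is a character offset into `context`, so slicing the context at that offset should reproduce the answer text. A minimal sketch of that check, using the first training record shown above (context shortened to end at "in 1858."); `extract_answer` is an illustrative helper name, not part of the Datasets library.

```python
# First SQuAD training record, copied from the output above (context truncated after "in 1858.")
context = (
    "Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue of Christ "
    'with arms upraised with the legend "Venite Ad Me Omnes". '
    "Next to the Main Building is the Basilica of the Sacred Heart. "
    "Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. "
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly "
    "appeared to Saint Bernadette Soubirous in 1858."
)
answers = {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]}


def extract_answer(context, answers, index=0):
    """Slice the context at the character offset stored in answer_start."""
    start = answers["answer_start"][index]
    end = start + len(answers["text"][index])
    return context[start:end]


print(extract_answer(context, answers))  # Saint Bernadette Soubirous
```

These character offsets are what a preprocessing step would have to convert into token positions before the model can be trained on them.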

In [93]:
# QA model used: bert-large-uncased-whole-word-masking-finetuned-squad
# Define the model checkpoint and load its tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the pretrained question-answering model
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [77]:
# Load the tokenizer from the checkpoint (the same checkpoint as the model above)
checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [78]:
dataset["train"][0]["context"]  # view the first context without loading the whole column

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [79]:
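# Tokenize the dataset
def tokenize_function(example):
    # Encode each question/context pair; truncation=True cuts pairs that exceed the model's maximum length
    return tokenizer(example["question"], example["context"], truncation=True)

# map applies the tokenizer function to the whole dataset;
# batched=True passes examples in batches, which iterates over the dataset much faster.
# Note: this adds input_ids, token_type_ids, and attention_mask only;
# it does not create start_positions/end_positions labels.
tokenized_datasets = dataset.map(tokenize_function, batched=True)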

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [80]:
# Compare the original with the tokenized dataset (the tokenized one has more features)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [81]:
#To check the tokenized data
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 10570
    })
})
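
The tokenized features listed above do not include `start_positions` or `end_positions`, which are the labels a span-extraction loss needs. A common way to produce them is to map the answer's character span onto token indices using the tokenizer's offset mapping. The following is a minimal sketch of that mapping in plain Python, under stated assumptions: `offsets` stands in for the per-token `(start_char, end_char)` pairs a fast tokenizer can return with `return_offsets_mapping=True`, `sequence_ids` marks question tokens with `0`, context tokens with `1`, and special tokens with `None`, and `char_span_to_token_span` is a hypothetical helper name. It is not the notebook's own preprocessing.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map an answer's character span to (start_token, end_token).

    Returns (0, 0) when the answer is not fully inside the tokenized context,
    for example because truncation removed it.
    """
    # Indices of the tokens that belong to the context (sequence id 1)
    context_tokens = [i for i, sid in enumerate(sequence_ids) if sid == 1]
    if not context_tokens:
        return 0, 0
    ctx_first, ctx_last = context_tokens[0], context_tokens[-1]

    # Answer lies (partly) outside the tokenized context
    if offsets[ctx_first][0] > start_char or offsets[ctx_last][1] < end_char:
        return 0, 0

    # Last context token that starts at or before the answer start
    start_token = ctx_first
    while start_token <= ctx_last and offsets[start_token][0] <= start_char:
        start_token += 1
    start_token -= 1

    # First context token that ends at or after the answer end
    end_token = ctx_last
    while end_token >= ctx_first and offsets[end_token][1] >= end_char:
        end_token -= 1
    end_token += 1

    return start_token, end_token


# Toy example: [CLS] who ? [SEP] saint bernadette appeared [SEP]
# Context is "Saint Bernadette appeared"; the answer "Saint Bernadette" spans chars 0..16.
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 5), (6, 16), (17, 25), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
print(char_span_to_token_span(offsets, sequence_ids, 0, 16))  # (4, 5)
```

Returning `(0, 0)` points truncated answers at the first token, which is one common convention; other choices are possible.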

Training with a custom Trainer: in this method, a `Trainer` subclass overrides `compute_loss`, so the start/end span loss is computed explicitly while `Trainer` still manages the rest of the training process.

In [82]:
# Define a custom loss for the training step
def compute_loss(model, inputs):
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    # Labels are present only if preprocessing added them to the dataset
    start_positions = inputs.get("start_positions")
    end_positions = inputs.get("end_positions")
    loss = None
    if start_positions is not None and end_positions is not None:
        # Average the cross-entropy of the start and end token predictions
        loss_fct = torch.nn.CrossEntropyLoss()
        start_loss = loss_fct(start_logits, start_positions)
        end_loss = loss_fct(end_logits, end_positions)
        loss = (start_loss + end_loss) / 2
    # Without labels this falls back to a constant zero loss, so the model weights are not updated
    return loss if loss is not None else torch.tensor(0.0, device=model.device, requires_grad=True)
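
The averaged start/end cross-entropy in `compute_loss` can be checked by hand on a tiny example. This is a minimal sketch in plain Python rather than PyTorch, assuming a single unbatched example; `cross_entropy` and `span_loss` are illustrative names, not library functions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-token and end-token cross-entropy losses."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# With uniform logits over 4 tokens, each position has probability 1/4,
# so the loss equals log(4) whatever the labels are.
print(round(span_loss([0.0] * 4, [0.0] * 4, 1, 2), 4))  # 1.3863
```

A loss of exactly 0.0 would require the model to put probability 1 on both labels, so a reported training loss of 0.0 usually points at missing labels rather than a perfect model.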


In [83]:
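# Define the trainer with the overridden compute_loss method
class CustomTrainer(Trainer):
    # return_outputs and **kwargs keep the signature compatible with how Trainer calls this method
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        loss = compute_loss(model, inputs)
        # When outputs are requested (evaluation), run one more forward pass to supply them
        return (loss, model(**inputs)) if return_outputs else loss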


In [84]:
# Define training arguments with fewer epochs and a smaller batch size
training_args = TrainingArguments(
    output_dir="output_dir",
    evaluation_strategy="epoch",    # evaluate once per epoch (newer transformers releases call this eval_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=2,  # Decreased batch size
    per_device_eval_batch_size=2,   # Decreased batch size
    num_train_epochs=1,             # Reduced number of epochs
    weight_decay=0.01,
)


In [85]:
# Sample a smaller subset of the training dataset
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
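
The subset size and batch size together determine how many optimizer steps one epoch takes. A minimal sketch of that arithmetic, assuming a single device and no gradient accumulation (neither is set in the training arguments above, so the defaults are assumed); `steps_per_epoch` is an illustrative helper, not a Transformers function, and its rounding is only an approximation once gradient accumulation is involved.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Approximate number of optimizer steps in one pass over the data."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)


num_train_epochs = 1
total_steps = steps_per_epoch(1000, 2) * num_train_epochs
print(total_steps)  # 500
```

This agrees with the `global_step=500` reported in the training output.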

In [56]:
# Create an instance of the custom trainer with the updated training arguments,
# using the 1,000-example training subset (500 steps at batch size 2)
trainer = CustomTrainer(
    model=model,  # the model loaded above (pretrained_model was never defined)
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=tokenized_datasets["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()


Checkpoint destination directory output_dir/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


Epoch,Training Loss,Validation Loss
1,0.0,No log


TrainOutput(global_step=500, training_loss=0.0, metrics={'train_runtime': 6990.2571, 'train_samples_per_second': 0.143, 'train_steps_per_second': 0.072, 'total_flos': 104149212749328.0, 'train_loss': 0.0, 'epoch': 1.0})

In [86]:
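# Save the fine-tuned model and its tokenizer so both can be reloaded from the same directory
model.save_pretrained("fine_tuned_squad_model")
tokenizer.save_pretrained("fine_tuned_squad_model")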

In [87]:
# Load the saved model
model = AutoModelForQuestionAnswering.from_pretrained("fine_tuned_squad_model")

In [91]:
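# Define a function to get the model's response
def get_response(context, question):
    # Encode the question/context pair; truncate only the context if the pair is too long
    inputs = tokenizer(question, context, add_special_tokens=True, truncation="only_second", return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)

    # Take the most likely start and end token positions
    start_index = int(torch.argmax(outputs.start_logits))
    end_index = int(torch.argmax(outputs.end_logits))

    # Decode the tokens from the predicted start to the predicted end (inclusive)
    answer = tokenizer.decode(input_ids[start_index : end_index + 1])

    return answer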

In [92]:
# Example usages
examples = [
    ("Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems.", "What is artificial intelligence?"),
    ("Machine learning is a subset of artificial intelligence that focuses on the development of computer programs that can access data and use it to learn for themselves.", "What is machine learning?"),
    ("Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.", "What is natural language processing?"),
    ("Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data.", "What is deep learning?")
]

for context, question in examples:
    response = get_response(context, question)
    print("Context:", context)
    print("Question:", question)
    print("Model's Response:", response)
    print()


Context: Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems.
Question: What is artificial intelligence?
Model's Response: the simulation of human intelligence processes by machines

Context: Machine learning is a subset of artificial intelligence that focuses on the development of computer programs that can access data and use it to learn for themselves.
Question: What is machine learning?
Model's Response: a subset of artificial intelligence

Context: Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.
Question: What is natural language processing?
Model's Response: a subfield of linguistics, computer science, and artificial intelligence

Context: Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts o

In [90]:
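# Build the Gradio interface: context first, question second, answer as output
def qa_app(context, question):
    answer = get_response(context, question)
    return answer

iface = gr.Interface(
    fn=qa_app,
    inputs=[gr.Textbox(lines=8, label="Context"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
)
iface.launch()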

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://db271e94c1b7fa3af0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


