<a href="https://colab.research.google.com/github/KarthikAlagarsamy/AIML-Final-Project/blob/main/Karthik_AIML_M18.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Building and Deploying Question Answering System with Hugging Face**

In [None]:
!pip install transformers[torch] datasets evaluate   # models and tokenizers (with PyTorch), NLP datasets, and evaluation metrics
!pip install accelerate -U                           # used by the Trainer for PyTorch training, including multi-GPU setups
!pip install gradio                                  # create web interfaces for machine learning models

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets

In [None]:
from huggingface_hub import notebook_login    # interact with the Hugging Face Model Hub for sharing pretrained models

notebook_login()                              # log in to Hugging Face account directly


In [None]:
from datasets import load_dataset, load_metric          # load datasets and evaluation metrics from the Hugging Face hub

squad = load_dataset("squad", split="train[:5000]")     # load the first 5,000 training examples of SQuAD from the Hugging Face Hub

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [None]:
squad = squad.train_test_split(test_size=0.2)       # split the 5,000 examples into 80% train (4,000) and 20% test (1,000)

In [None]:
squad["train"][0]                                   # retrieves first data example from the training subset of SQuAD dataset

{'id': '5733ccbe4776f41900661270',
 'title': 'University_of_Notre_Dame',
 'context': 'In the film Knute Rockne, All American, Knute Rockne (played by Pat O\'Brien) delivers the famous "Win one for the Gipper" speech, at which point the background music swells with the "Notre Dame Victory March". George Gipp was played by Ronald Reagan, whose nickname "The Gipper" was derived from this role. This scene was parodied in the movie Airplane! with the same background music, only this time honoring George Zipp, one of Ted Striker\'s former comrades. The song also was prominent in the movie Rudy, with Sean Astin as Daniel "Rudy" Ruettiger, who harbored dreams of playing football at the University of Notre Dame despite significant obstacles.',
 'question': 'Ronald Reagan had a nickname, what was it?',
 'answers': {'text': ['The Gipper'], 'answer_start': [267]}}
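
The `answer_start` field is a character offset into `context`, so the answer text can be recovered by slicing. The snippet below is a minimal sketch on a shortened, made-up record in the same layout; the `record` dict and the `answer_span` helper are illustrative and not part of the notebook.

```python
def answer_span(record):
    """Return (start_char, end_char) of the first answer; end is exclusive."""
    start = record["answers"]["answer_start"][0]
    end = start + len(record["answers"]["text"][0])
    return start, end

# Shortened, hypothetical record laid out like a SQuAD example.
context = 'George Gipp was played by Ronald Reagan, whose nickname was "The Gipper".'
record = {
    "context": context,
    "question": "Ronald Reagan had a nickname, what was it?",
    "answers": {"text": ["The Gipper"], "answer_start": [context.index("The Gipper")]},
}

start, end = answer_span(record)
recovered = record["context"][start:end]
print(recovered)
```

The preprocessing step later relies on the same two character positions when it converts the answer into token positions.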

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")    # load DistilBERT tokenizer to process the question and context fields

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [None]:
def preprocess_function(examples):                          # truncate and map the start and end tokens of the answer to the context
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=512,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
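
The core of the function above is the conversion from a character span to a token span. The sketch below isolates that step on hand-written offsets so it runs without a tokenizer. The helper name `char_span_to_token_span` and the toy offsets are assumptions for illustration, and here an answer counts as inside the context only when both of its ends fall within the context tokens.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    Returns (0, 0) when the answer is not fully covered by the context tokens.
    """
    context_tokens = [i for i, sid in enumerate(sequence_ids) if sid == 1]
    context_start, context_end = context_tokens[0], context_tokens[-1]

    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Last context token that starts at or before the answer's first character.
    start_token = max(i for i in context_tokens if offsets[i][0] <= start_char)
    # First context token that ends at or after the answer's last character.
    end_token = min(i for i in context_tokens if offsets[i][1] >= end_char)
    return start_token, end_token

# Toy encoding of: [CLS] who ? [SEP] ronald reagan was the gipper [SEP]
# Context string: "Ronald Reagan was The Gipper"; answer "The Gipper" = chars 18..28.
offsets = [(0, 0), (0, 3), (3, 4), (0, 0),
           (0, 6), (7, 13), (14, 17), (18, 21), (22, 28), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]

full = char_span_to_token_span(offsets, sequence_ids, 18, 28)

# Same example with the last context token truncated away.
truncated = char_span_to_token_span(offsets[:8] + [(0, 0)], sequence_ids[:8] + [None], 18, 28)
print(full, truncated)
```

In the toy encoding the answer maps to tokens 7 and 8 ("the", "gipper"); once "gipper" is truncated, the example is labelled `(0, 0)`.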

In [None]:
# apply the preprocessing function over the entire dataset and process multiple elements of the dataset at once
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()             # create a batch of examples

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")      # automatically select the appropriate model for question answering tasks

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
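
The warning names `qa_outputs.weight` and `qa_outputs.bias`: this is the question answering head, which starts from random values and is what fine-tuning has to learn. The head produces one start score and one end score per token. The snippet below is a simplified sketch of how such scores can be turned into an answer span; the `best_span` helper, the `max_answer_len` default, and the logits are made up for illustration, and the real pipeline applies further steps such as masking question tokens and converting scores to probabilities.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = None
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Hypothetical per-token scores for a 5-token input.
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.0, 0.1, 0.4, 2.5, 0.3]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```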


In [None]:
training_args = TrainingArguments(
    output_dir="distilbertfinetuneHS5E8BHLR", # specifies the directory where model checkpoints and outputs will be saved
    evaluation_strategy="epoch",              # evaluation will be performed at the end of each epoch
    learning_rate=2e-5,                       # sets the learning rate for the optimizer
    per_device_train_batch_size=8,            # specifies the batch size for training data
    per_device_eval_batch_size=8,             # specifies the batch size for evaluation data
    num_train_epochs=5,                       # specifies the number of training epochs
    weight_decay=0.01,                        # specifies the weight decay parameter for regularization
    push_to_hub=True,                         # pushed to the Hugging Face Model Hub after training
)

trainer = Trainer(
    model=model,                              # model to be trained
    args=training_args,                       # training arguments defined earlier
    train_dataset=tokenized_squad["train"],   # tokenized training split
    eval_dataset=tokenized_squad["test"],     # tokenized held-out split, evaluated at the end of each epoch
    tokenizer=tokenizer,                      # saved and pushed alongside the model
    data_collator=data_collator,              # collate batches of data
)

trainer.train()                               # fine-tune the model



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.0251        | 1.72679         |
| 2     | 1.4512        | 1.414267        |
| 3     | 0.9326        | 1.43455         |
| 4     | 0.6653        | 1.580358        |
| 5     | 0.5143        | 1.640075        |


TrainOutput(global_step=2500, training_loss=1.3176983764648438, metrics={'train_runtime': 1109.9717, 'train_samples_per_second': 18.018, 'train_steps_per_second': 2.252, 'total_flos': 2613062000640000.0, 'train_loss': 1.3176983764648438, 'epoch': 5.0})
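
The logged numbers can be cross-checked with a little arithmetic. The sketch below assumes a single device (so the effective batch size is 8) and copies the loss values from the training log. It reproduces the 2,500 global steps and shows that validation loss is lowest at epoch 2 while training loss keeps falling, which suggests the later epochs overfit the 4,000 training examples.

```python
import math

train_examples = 4000   # training split size after the 80/20 split
batch_size = 8          # per_device_train_batch_size, assuming one device
epochs = 5              # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)

# (epoch, training loss, validation loss) copied from the training log
history = [
    (1, 3.0251, 1.72679),
    (2, 1.4512, 1.414267),
    (3, 0.9326, 1.43455),
    (4, 0.6653, 1.580358),
    (5, 0.5143, 1.640075),
]
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
print(best_epoch, best_val_loss)
```

One possible follow-up, not tried in this notebook, is to train for fewer epochs or to set `load_best_model_at_end=True` in `TrainingArguments` with matching save and evaluation strategies.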

In [None]:
trainer.push_to_hub()                 # share model to Hugging Face Model Hub

events.out.tfevents.1713086646.d5287ba18518.238.0:   0%|          | 0.00/7.24k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/KarthikAlagarsamy/distilbertfinetuneHS5E8BHLR/commit/7cb1f9112b4ccf967383214acc8361446191bb0d', commit_message='End of training', commit_description='', oid='7cb1f9112b4ccf967383214acc8361446191bb0d', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

In [None]:
from transformers import pipeline       # model inference

question_answerer = pipeline("question-answering", model="distilbertfinetuneHS5E8BHLR")
question_answerer(question=question, context=context)

{'score': 0.3527219891548157,
 'start': 58,
 'end': 95,
 'answer': '46 languages natural languages and 13'}
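
The `start` and `end` values in the result are character offsets into `context`, with `end` exclusive. The short check below copies the context and the pipeline output from the cells above and confirms that slicing reproduces the returned answer. Note that the predicted span overshoots: the context states 13 programming languages, so the ideal answer would be just `13`.

```python
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

# Output copied from the pipeline call above.
prediction = {
    "score": 0.3527219891548157,
    "start": 58,
    "end": 95,
    "answer": "46 languages natural languages and 13",
}

span = context[prediction["start"]:prediction["end"]]
print(span == prediction["answer"])
```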

In [None]:
import evaluate

# Load the evaluation dataset
squad_eval = load_dataset("squad", split="validation[:1000]")

# Load the SQuAD metric (evaluate.load replaces the deprecated datasets.load_metric)
squad_metric = evaluate.load("squad")

# Collect predictions and references in the format the metric expects
formatted_predictions = []
formatted_references = []

# Iterate through the evaluation dataset and make predictions
for example in squad_eval:
    # Make a prediction using the question answering pipeline
    prediction = question_answerer(question=example["question"], context=example["context"])
    formatted_predictions.append({"id": example["id"], "prediction_text": prediction["answer"]})
    # Only the first reference answer of each example is scored
    formatted_references.append({
        "id": example["id"],
        "answers": {
            "text": [example["answers"]["text"][0]],
            "answer_start": [example["answers"]["answer_start"][0]],
        },
    })

# Compute Exact Match (EM) and F1 score
evaluation_result = squad_metric.compute(predictions=formatted_predictions, references=formatted_references)

print("Exact Match (EM):", evaluation_result["exact_match"])
print("F1 Score:", evaluation_result["f1"])


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

Exact Match (EM): 59.1
F1 Score: 67.27261303024456
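
To make the two numbers concrete: Exact Match checks whether the normalized prediction equals the normalized reference, and F1 measures token overlap between them. The snippet below is a simplified re-implementation for a single prediction and a single reference, written as a sketch of the idea rather than the metric's actual code; the official metric also takes the best score over all reference answers and reports averages as percentages.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# The BLOOM prediction from earlier, scored against the intended answer "13".
em = exact_match("46 languages natural languages and 13", "13")
f1 = f1_score("46 languages natural languages and 13", "13")
print(em, round(f1, 4))
```

The overlong BLOOM answer gets no Exact Match credit but a partial F1, because one of its six tokens matches the reference.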


In [None]:
import gradio as gr                         #  create customizable web interfaces using Gradio

model_checkpoint = "KarthikAlagarsamy/distilbertfinetuneHS5E8BHLR"
question_answerer = pipeline("question-answering", model=model_checkpoint)

def answer_question(question, context):
    answer = question_answerer(question=question, context=context)
    return answer['answer']

iface = gr.Interface(                       # creates Gradio interface object
    fn=answer_question,
    inputs=["text", "text"],
    outputs="text",
    title="Question Answering",
    description="Enter a question and context"
)

iface.launch()                               # launches Gradio interface

config.json:   0%|          | 0.00/561 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://64a1ba80931889caf7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
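
A public demo will receive blank or partial input, and the pipeline may raise an error in that case. The sketch below shows one way to guard the callback before handing it to `gr.Interface`. The `make_answer_fn` wrapper and the `fake_pipeline` stand-in are hypothetical names introduced here so the example runs without downloading the model; in the notebook the real `question_answerer` would be passed instead.

```python
def make_answer_fn(qa_pipeline):
    """Wrap a question answering callable so blank inputs return a hint instead of failing."""
    def answer_question(question, context):
        if not question.strip() or not context.strip():
            return "Please provide both a question and a context."
        result = qa_pipeline(question=question, context=context)
        return result["answer"]
    return answer_question

# Stand-in for the real pipeline: returns the first word of the context.
def fake_pipeline(question, context):
    return {"answer": context.split()[0], "score": 1.0, "start": 0, "end": 1}

answer_question = make_answer_fn(fake_pipeline)
blank_reply = answer_question("", "some context")
normal_reply = answer_question("What comes first?", "BLOOM has 176 billion parameters")
print(blank_reply)
print(normal_reply)
```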


