Install the required libraries: transformers for the model and tokenizer, datasets for loading the SQuAD dataset, and torch (PyTorch) to run the model.

In [1]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests (from transformers)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda

Import the necessary classes from the transformers and datasets libraries, along with torch.

In [2]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset
import torch

Load the SQuAD dataset:

SQuAD (Stanford Question Answering Dataset) is a large-scale dataset for training and evaluating QA systems.

In [3]:
dataset = load_dataset("squad")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]
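
Each SQuAD record is a dictionary with a context, a question, and an answers field. The answers field holds parallel lists: text (the reference answers) and answer_start (character offsets into the context). The sketch below illustrates that layout using a shortened copy of the record printed later in this notebook (the context is cut to its first two sentences). The helper name gold_answer is hypothetical and is not part of the datasets library.

```python
# A shortened SQuAD-style record (context truncated to its first two sentences).
record = {
    "title": "God",
    "context": (
        "The earliest written form of the Germanic word God (always, in this "
        "usage, capitalized) comes from the 6th-century Christian Codex "
        "Argenteus. The English word itself is derived from the "
        "Proto-Germanic * ǥuđan."
    ),
    "question": "Where is the English word God derived from?",
    "answers": {"text": ["the Proto-Germanic * ǥuđan"], "answer_start": [182]},
}


def gold_answer(record, i=0):
    """Recover the i-th reference answer by slicing the context at its character offset."""
    start = record["answers"]["answer_start"][i]
    text = record["answers"]["text"][i]
    return record["context"][start:start + len(text)]


print(gold_answer(record))
```

Note that answer_start counts characters, while the model used below predicts token positions, so the two are not directly comparable.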

Define the pre-trained model name and load its tokenizer and model

We are using DistilBERT, a smaller, distilled version of BERT, fine-tuned on the SQuAD dataset.

In [4]:
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

Define a function that takes a context and a question, tokenizes them, runs inference with the model, and decodes the predicted answer.

  The function has four steps:

    Tokenization: converts the input text (question and context) into tokens that the model can process.

    Inference: the model processes the tokenized input and produces logits, which score each token as a possible start or end position of the answer in the context.

    Identifying the answer span: picks the highest-scoring start position and the highest-scoring end position.

    Decoding: converts the selected token IDs back into human-readable text.


In [5]:
def QnA(context, question):
    inputs = tokenizer(question, context, return_tensors="pt")
    # torch.no_grad() disables gradient tracking, which saves memory and time during inference.
    with torch.no_grad():
        # Feed the tokenized input to the model to get the start and end logits.
        outputs = model(**inputs)

    # Index of the token with the highest start score.
    answerstart = outputs.start_logits.argmax()

    # Index of the token with the highest end score; add 1 because Python slices exclude the end index.
    answerend = outputs.end_logits.argmax() + 1

    # Decode the selected token IDs back into text.
    answer = tokenizer.decode(inputs["input_ids"][0][answerstart:answerend])
    return answer
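
Taking the two argmax values independently can yield an end index that comes before the start index, in which case the slice is empty and the function returns an empty string. One common remedy is to score every (start, end) pair and keep the best pair that forms a valid span. The following is a minimal sketch of that idea, using plain Python lists in place of the logits tensors. The names best_span and max_answer_len are illustrative assumptions and are not part of this notebook or of transformers.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end],
    subject to start <= end and a span of at most max_answer_len tokens.
    The returned end index is inclusive."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after s, within the length limit.
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy logits: the independent argmaxes are start=3 and end=1, an invalid span.
start_logits = [0.1, 2.0, 0.3, 2.5, 0.2]
end_logits = [0.0, 3.0, 0.5, 1.0, 2.8]
print(best_span(start_logits, end_logits))
```

Inside QnA, the inputs could be obtained with outputs.start_logits[0].tolist() and outputs.end_logits[0].tolist(), and the answer decoded from input_ids[0][start:end + 1]. This sketch does not exclude question or special tokens from the candidate spans.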

Random sample selection

    Picking a random example on each run shows how the QA system handles a variety of contexts and questions.


User question selection

    The user chooses one of the questions available for the sampled context, which makes the demo interactive.
    The QA system then answers the selected question using that context.

In [6]:
import random
# Pick a random example from the training split.
example = random.choice(dataset["train"])
print('Example-',example)
# Extract the context from the selected example.
context = example["context"]

# Collect every training example that shares this context (this scans the full training split).
questions = [ex for ex in dataset["train"] if ex["context"] == context]

# print the context
print("Context:")
print(context)
print("\nAvailable questions from the context:")

# Print available questions
for i, question in enumerate(questions, 1):
    print(f"{i}. {question['question']}")


Example- {'id': '572839b44b864d190016479a', 'title': 'God', 'context': 'The earliest written form of the Germanic word God (always, in this usage, capitalized) comes from the 6th-century Christian Codex Argenteus. The English word itself is derived from the Proto-Germanic * ǥuđan. The reconstructed Proto-Indo-European form * ǵhu-tó-m was likely based on the root * ǵhau(ə)-, which meant either "to call" or "to invoke". The Germanic words for God were originally neuter—applying to both genders—but during the process of the Christianization of the Germanic peoples from their indigenous Germanic paganism, the words became a masculine syntactic form.', 'question': 'Where is the English word God derived from?', 'answers': {'text': ['the Proto-Germanic * ǥuđan'], 'answer_start': [182]}}
Context:
The earliest written form of the Germanic word God (always, in this usage, capitalized) comes from the 6th-century Christian Codex Argenteus. The English word itself is derived from the Proto-Germanic
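
The list comprehension above scans the entire training split for a single context. If several contexts are going to be sampled, one option is to build a context-to-questions index once and reuse it. The sketch below shows the idea on toy records standing in for dataset["train"]. The function name group_questions_by_context and the toy records are illustrative assumptions.

```python
from collections import defaultdict


def group_questions_by_context(examples):
    """Map each context string to the list of questions asked about it."""
    index = defaultdict(list)
    for ex in examples:
        index[ex["context"]].append(ex["question"])
    return dict(index)


# Toy records standing in for dataset["train"].
examples = [
    {"context": "Paris is the capital of France.",
     "question": "What is the capital of France?"},
    {"context": "Paris is the capital of France.",
     "question": "Which country has Paris as its capital?"},
    {"context": "Water boils at 100 degrees Celsius at sea level.",
     "question": "At what temperature does water boil at sea level?"},
]

index = group_questions_by_context(examples)
for i, q in enumerate(index["Paris is the capital of France."], 1):
    print(f"{i}. {q}")
```

Building the index costs one pass over the data, after which each lookup is a dictionary access.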

In [8]:

# Let the user select a question
while True:
    try:
        choice = int(input("Choose a question number: "))
        if 1 <= choice <= len(questions):
            selected_question = questions[choice - 1]["question"]
            break
        else:
            print("Invalid choice. Please select a valid number.")
    except ValueError:
        print("Please enter a valid number.")

# Answer the selected question
answer = QnA(context, selected_question)

print(f"\nSelected Question: {selected_question}")
print(f"Answer: {answer}")

Choose a question number: 8

Selected Question: What gender where the original Germanic words meaning God in?
Answer: neuter
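
The validation logic inside the selection loop can also be factored into a small pure function, which makes it testable without calling input(). This is a sketch only; the name parse_choice is hypothetical.

```python
def parse_choice(raw, n_options):
    """Return a 0-based index for a 1-based menu entry, or None if raw is invalid."""
    try:
        choice = int(raw)
    except ValueError:
        # Non-numeric input such as "abc" or an empty string.
        return None
    if 1 <= choice <= n_options:
        return choice - 1
    return None


print(parse_choice("8", 10))
print(parse_choice("abc", 10))
```

The loop would then call parse_choice(input("Choose a question number: "), len(questions)) and repeat until the result is not None.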
