In [None]:
!pip install transformers -q
!pip install torch -q

Code Explanation

This cell imports the specific modules and classes we'll need for our script.

from transformers import pipeline, BertForQuestionAnswering, BertTokenizer:

pipeline: A high-level, easy-to-use API from the Hugging Face library that simplifies the process of using models for specific tasks (like question-answering).

BertForQuestionAnswering: The specific BERT model architecture designed for question-answering tasks.

BertTokenizer: The tokenizer that corresponds to the BERT model, responsible for converting text into a format the model can understand (tokens).

import textwrap: A standard Python library used here to format our long context paragraph for cleaner printing.

import time: A standard Python library used to measure the inference time (how long it takes for the model to generate an answer).



---



**Reflections and Observations**

By importing these specific components, we are setting up the building blocks for our Q&A system. While the pipeline function is powerful enough to handle model and tokenizer loading on its own, explicitly importing BertForQuestionAnswering and BertTokenizer helps in understanding the underlying components that make the pipeline work. Measuring performance is crucial, so importing the time library from the start was a key step in planning our model evaluation.

In [None]:
# Import necessary libraries
from transformers import pipeline, BertForQuestionAnswering, BertTokenizer
import textwrap
import time

Code Explanation

This cell downloads and initializes our first question-answering model.

We define model_name as 'deepset/bert-base-cased-squad2'. This is an identifier for a specific model hosted on the Hugging Face Hub. This particular model is a BERT base model that has been fine-tuned on the SQuAD 2.0 dataset, which is a benchmark for question-answering.

BertForQuestionAnswering.from_pretrained(model_name) and BertTokenizer.from_pretrained(model_name) download the necessary model weights and tokenizer files.

pipeline('question-answering', ...) creates a convenient object (qna_pipeline) that takes care of all the preprocessing, model inference, and post-processing steps.


---



Output Explanation

The output shows the progress bars for downloading the model's configuration (config.json), the model weights (model.safetensors), and the tokenizer files (tokenizer_config.json, vocab.txt, etc.). The final line confirms that the model was loaded successfully. The warning about "Some weights of the model checkpoint... were not used" is expected and simply means that parts of the pre-trained model not needed for question-answering were discarded.


---



**Reflections and Observations**

This model serves as our baseline. It's a standard and widely-used model, making it a great starting point for comparison. The "deepset" prefix indicates it's a version provided by the company Deepset, known for their work in NLP. The download size is substantial (~433MB), which is typical for BERT-base models. The real magic is in the pipeline function, which abstracts away a lot of complex code and lets us focus on the task itself.


In [None]:
# Define the model name from the Hugging Face Hub
model_name = 'deepset/bert-base-cased-squad2'

# Load the pre-trained model and its tokenizer
# The from_pretrained() method downloads and caches the model files
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Create the question-answering pipeline
# This pipeline bundles the model and tokenizer for easy use
qna_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)

print(f"Model '{model_name}' loaded successfully.")

Code Explanation

In this cell, we define the "context" that the model will use to find answers.

A multi-line string is used to store a news article about the repatriation of Filipino workers from Lebanon. This text will be the sole source of information for the model.

textwrap.dedent(context).strip() is used to clean up the string by removing any common leading whitespace from each line.

textwrap.fill(...) formats the cleaned text to have a maximum line width of 100 characters, making it easy to read in the output.
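As a minimal sketch of what these two calls do, the example below applies them to a short, indented stand-in string. The `sample` text and the 40-character width are illustrative assumptions; the notebook itself uses the full article and a width of 100.

```python
import textwrap

# Hypothetical stand-in text; the notebook applies the same calls to the full article.
sample = """
    The government is arranging chartered flights.
    More than 200 workers are waiting in Beirut.
"""

# dedent() removes the indentation shared by every line; strip() drops the
# blank first and last lines left behind by the triple quotes.
cleaned = textwrap.dedent(sample).strip()

# fill() re-wraps the text so that no line is longer than `width` characters.
wrapped = textwrap.fill(cleaned, width=40)
print(wrapped)
```

Note that fill() also collapses the original line breaks into spaces before wrapping, so only the layout changes, not the words.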



---



**Reflections and Observations**

I think the quality and content of the context are critical. The model's answers depend entirely on this text; it cannot use any external knowledge. This specific article was chosen by Sir Raga, probably because it contains a good mix of names, numbers, locations, and reasons, which allows for a diverse set of test questions.

In [None]:
# The full text from the article will serve as our context
# Triple quotes (""") are used for multi-line strings in Python
context = """
MANILA – The government is arranging chartered flights for the repatriation of more than 200 overseas Filipino workers in Beirut, Lebanon, the Department of Migrant Workers (DMW) said Wednesday. “We are trying to provide for chartered flights. We're talking to airline companies so that the chartered flights would be able to accommodate for example, no less than 300 overseas Filipino workers from Beirut,” DMW Undersecretary Bernard Olalia said in a Palace press briefing. This was after the scheduled flights of around 15 OFWs on Sept. 25 were cancelled because of the recent bombings in Beirut. Olalia said around 111 OFWs are staying in four temporary shelters in Beirut and waiting for their repatriation. An additional 110 OFWs are applying for exit permits from the Lebanese government, Olalia said. “Apart from the documented OFWs, we have undocumented OFWs who need to secure travel documents and once they're given travel documents, we will help them in securing also exit visas or exit permits from the Immigration of the Lebanese government,” he said. Olalia, however, said the Philippine government is facing several challenges, including securing landing rights for chartered flights. He said land and sea routes are being considered, in case the situation escalates and makes it “impossible” to take the air route. “The DMW is also studying the possibility of other routes. Apart from air route, we will be assessing the sea and the land route, should the case or the situation there worsen,” Olalia said. He said the DMW, the Overseas Workers Welfare Administration (OWWA), and other concerned agencies will adopt a “whole-of-government assistance" upon the directive of President Ferdinand R. Marcos Jr. He said each repatriated OFW will get PHP150,000 in financial assistance from the DMW and OWWA, as well as psychosocial services. Israel has intensified its airstrikes across the northern border into Lebanon, targeting the Iran-backed militant group Hezbollah. 
Iran fired ballistic missiles in Israel on Tuesday night, following the deadly attacks on Gaza and Lebanon and the recent killings of Hamas, Hezbollah, and Islamic Revolutionary Guard Corps leaders. Olalia said no Filipinos were hurt since the attacks were launched. “We have men on the ground. They work around the clock. At 'yung mga staff po natin, dinagdagan na po natin (And we augmented our staff) both in Lebanon at (and) nearby posts to be able to provide safest route, to evacuate and ultimately to facilitate the repatriation of our OFWs both either in Lebanon or in Israel,” he said. (PNA)
"""

# Format the context for clean printing
# .strip() removes any leading/trailing whitespace
dedented_text = textwrap.dedent(context).strip()

# Print the context to verify it's loaded correctly
print("Context Article:\n")
print(textwrap.fill(dedented_text, width=100)) # 'width' adjusts the line length

Code Explanation

This cell creates an interactive loop to test the model.

A while True loop runs continuously.

input() prompts the user to enter a question.

If the user enters *, the break statement exits the loop.

time.time() records the start time right before the model processes the input.

qna_pipeline(...) takes the user's question and the context as input and returns a dictionary containing the predicted answer.

The end time is recorded, and the difference is calculated to get the inference_time.

The answer, its confidence score, and the inference time are printed.
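The timing steps above can be factored into a small helper so they are not repeated in every loop. This is only a sketch: `timed_call` and `fake_qa` are hypothetical names introduced here, and it swaps in `time.perf_counter()`, a standard-library clock intended for measuring short intervals, in place of `time.time()`.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # clock intended for measuring intervals
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical stand-in for a pipeline call, so the sketch runs without a model.
def fake_qa(question, context):
    time.sleep(0.01)
    return {"answer": context.split()[0], "score": 1.0}

answer, seconds = timed_call(fake_qa, question="Who?", context="Olalia spoke on Wednesday.")
print(f"Answer: {answer['answer']}  Inference Time: {seconds:.4f} seconds")
```

The first call to a model is often slower than later ones, so a single measurement per question is best read as a rough indication, not a benchmark.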


---



Output Explanation

The output displays a series of dialogues where a question is asked and the model provides an answer, a confidence score, and the time it took. For example:

Question: Who is the DMW Undersecretary mentioned in the press briefing?

Answer: Bernard Olalia

Score: 0.9981 (Very high confidence)

Inference Time: 3.7034 seconds



---



**Reflections and Observations**

This interactive session is where we see the model's capabilities firsthand.

Accuracy: The model performed quite well, correctly identifying names (Bernard Olalia), numbers (111), locations (Beirut), and reasons (recent bombings). However, it made a notable mistake on the last question, identifying "Bernard Olalia" as the source of the directive instead of "President Ferdinand R. Marcos Jr." This highlights that even powerful models can make errors.

Confidence Score: The scores generally correlate with the answer's quality. The 0.9981 score for "Bernard Olalia" was very high and correct. The score of 0.1991 for the reason for flight cancellations was low, but the answer was still correct, indicating the model was less certain.

Inference Speed: The inference time for this baseline BERT model is around 3.7 to 4.6 seconds per question. This is a crucial metric we will use to compare against other models.

In [None]:
# This loop will continue asking for questions until you enter '*'
while True:
    inquiry = input("\nType your question (or enter '*' to stop): ")

    if inquiry == '*':
        break

    # --- Measure Inference Time ---
    start_time = time.time()

    # Feed the question and context to the pipeline
    answer = qna_pipeline({'question': inquiry, 'context': context})

    end_time = time.time()
    inference_time = end_time - start_time
    # -----------------------------

    # The pipeline returns a dictionary. Let's print the details.
    print(f"\nAnswer: {answer['answer']}")
    print(f"Score (Confidence): {answer['score']:.4f}")
    print(f"Inference Time: {inference_time:.4f} seconds")

Code Explanation

This cell loads our second model, 'distilbert-base-cased-distilled-squad'. Unlike the first cell, we do not call from_pretrained() ourselves: we pass the model name string directly to pipeline(), which then handles loading both the model and the tokenizer.



---


**Reflections and Observations**

The key difference here is the model itself. DistilBERT is a "distilled" version of BERT. This means it's a smaller, faster, and lighter model that was trained to mimic the behavior of the larger BERT model. My hypothesis is that this model will have a significantly faster inference time, but we might see a slight drop in accuracy or confidence as a trade-off. Its download size (~261MB) is much smaller than the first model's, reinforcing its "lightweight" nature.

In [None]:
# --- MODEL 2: distilbert-base-cased-distilled-squad ---

# Define the model name
model_name_2 = 'distilbert-base-cased-distilled-squad'

# Create the pipeline for this model
qna_pipeline_2 = pipeline('question-answering', model=model_name_2, tokenizer=model_name_2)
print(f"Model '{model_name_2}' loaded successfully.")

Code Explanation

This cell uses the same interactive loop as before but now calls qna_pipeline_2 to get answers from the DistilBERT model. This allows for a direct, one-to-one comparison with Model 1 using the same set of questions.
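For a strictly like-for-like comparison, the questions can be kept in a list and replayed against each model, so nothing has to be retyped. The sketch below is one possible way to do that; `run_questions` and `stub_qa` are hypothetical names, and the stub only stands in for a real pipeline so the example runs offline.

```python
import time

def run_questions(qa_fn, questions, context):
    """Ask every question once and collect answer, score, and timing per question."""
    rows = []
    for question in questions:
        start = time.time()
        result = qa_fn(question=question, context=context)
        rows.append({
            "question": question,
            "answer": result["answer"],
            "score": result["score"],
            "seconds": time.time() - start,
        })
    return rows

# Hypothetical stand-in for qna_pipeline / qna_pipeline_2 so the sketch runs offline.
def stub_qa(question, context):
    return {"answer": "Bernard Olalia", "score": 0.5}

questions = [
    "Who is the DMW Undersecretary mentioned in the press briefing?",
    'Whose directive led to the "whole-of-government assistance"?',
]
for row in run_questions(stub_qa, questions, context="(article text)"):
    print(f"{row['question']} -> {row['answer']} ({row['score']:.4f}, {row['seconds']:.4f}s)")
```

Passing `qna_pipeline_2` in place of `stub_qa` should work on the assumption that the pipeline accepts `question=` and `context=` keyword arguments.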



---


Output Explanation

The output shows the answers from DistilBERT for the same 10 questions.

Question: Whose directive led to the "whole-of-government assistance"?

Answer: President Ferdinand R. Marcos Jr

Score: 0.5624

Inference Time: 1.1969 seconds



---



**Reflections and Observations**

This is where our comparison gets interesting.

Accuracy: DistilBERT correctly answered all the questions, including the one that the baseline BERT model got wrong! It correctly identified "President Ferdinand R. Marcos Jr." as the source of the directive. This is an impressive result for a smaller model.

Confidence Score: The scores were generally high. Some scores were displayed as greater than 1.0, which is unexpected: the pipeline's score is normally a probability between 0 and 1, derived from the softmax-normalized start and end logits. Values above 1.0 should therefore not be read as calibrated confidence and are worth investigating, although the relative ordering of scores may still be informative.

Inference Speed: As hypothesized, DistilBERT is significantly faster. The inference times are mostly in the 1.2 to 2.6 second range, which is roughly 2-3 times faster than the first model. This confirms its efficiency.

For applications where speed is critical without a major sacrifice in accuracy, DistilBERT appears to be an excellent choice.
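To make the confidence score less of a black box, here is a simplified sketch of one common way to score an answer span: apply softmax to the start logits and to the end logits, then multiply the two probabilities. This is an assumption about the general scheme, not the library's exact code, and the logits are made-up values. Under this scheme a score cannot exceed 1.0.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a five-token context (illustrative values only).
start_logits = [0.2, 6.1, 0.5, -1.0, 0.3]
end_logits = [-0.5, 0.4, 5.8, 0.1, -2.0]

start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# Score every span whose end is not before its start; keep the best one.
best_score, best_start, best_end = max(
    (start_probs[i] * end_probs[j], i, j)
    for i in range(len(start_probs))
    for j in range(i, len(end_probs))
)
print(f"best span: tokens {best_start}..{best_end}, score {best_score:.4f}")
```

Because the score is a product of two probabilities, a correct answer can still receive a low score when the model spreads probability over several plausible start or end positions.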

In [None]:
# --- Testing Loop for Model 2 ---
print(f"--- Now testing model: {model_name_2} ---")

while True:
    inquiry = input("\nType your question (or enter '*' to stop): ")
    if inquiry == '*':
        break

    start_time = time.time()
    # IMPORTANT: Use the correct pipeline variable for this model
    answer = qna_pipeline_2({'question': inquiry, 'context': context})
    end_time = time.time()

    inference_time = end_time - start_time

    print(f"\nAnswer: {answer['answer']}")
    print(f"Score (Confidence): {answer['score']:.4f}")
    print(f"Inference Time: {inference_time:.4f} seconds")

Code Explanation

These cells load and test our third model, RoBERTa (deepset/roberta-base-squad2). RoBERTa (A Robustly Optimized BERT Pretraining Approach) is a variation of BERT that was trained with an improved methodology, including training on a much larger dataset for a longer time.



---


Output Explanation

The output shows the answers from RoBERTa for our standard set of questions.

Question: According to Undersecretary Olalia, were any Filipinos hurt in the attacks?

Answer: no

Score: 0.2952

Inference Time: 2.2756 seconds



---


**Reflections and Observations**

Accuracy: RoBERTa also performed perfectly, correctly answering all questions. Its answer to the question about Filipino casualties was just "no," which is more concise and natural than the first model's "no Filipinos were hurt."

Confidence Score: The confidence scores were solid. The score for "no" was low (0.2952), but the answer was correct, again showing that low scores don't always mean wrong answers.

Inference Speed: The inference times were mostly in the 2.2 to 3.3 second range. This makes RoBERTa faster than our baseline BERT model but slightly slower than the highly optimized DistilBERT.

RoBERTa seems to offer a great balance of high accuracy and reasonable speed, making it a very strong contender.

In [None]:
# --- MODEL 3: deepset/roberta-base-squad2 ---

# Define the model name
model_name_3 = 'deepset/roberta-base-squad2'

# Create the pipeline for this model
qna_pipeline_3 = pipeline('question-answering', model=model_name_3, tokenizer=model_name_3)

print(f"Model '{model_name_3}' loaded successfully.")

In [None]:
# --- Testing Loop for Model 3 ---
print(f"--- Now testing model: {model_name_3} ---")

# Run the interactive loop to ask your 10 questions
while True:
    inquiry = input("\nType your question (or enter '*' to stop): ")
    if inquiry == '*':
        break

    start_time = time.time()

    # Use the pipeline variable specific to this model: qna_pipeline_3
    answer = qna_pipeline_3({'question': inquiry, 'context': context})

    end_time = time.time()
    inference_time = end_time - start_time

    # Print the results for recording
    print(f"\nAnswer: {answer['answer']}")
    print(f"Score (Confidence): {answer['score']:.4f}")
    print(f"Inference Time: {inference_time:.4f} seconds")

Code Explanation

These final cells load and test our largest model, 'deepset/bert-large-uncased-whole-word-masking-squad2'.

bert-large: This model has more layers and parameters than the bert-base models, making it more powerful but also more computationally expensive.

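whole-word-masking: This refers to a pre-training technique in which all sub-word pieces of a word are masked together, rather than masking individual pieces independently. It can improve a model's understanding of language.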


---



Output Explanation

The output shows the answers from the BERT-large model.

Question: How much financial assistance will each repatriated OFW receive?

Answer: PHP15

Score: 0.9909

Inference Time: 8.3573 seconds



---



**Reflections and Observations**

Accuracy: Surprisingly, this larger model made an error! For the financial assistance question, it answered PHP15 instead of PHP150,000. This is a critical failure and a great lesson: bigger is not always better. The way the tokenizer splits "PHP150,000" might have confused this specific model. Otherwise, its other answers were correct.

Confidence Score: The scores were generally very high, but this can be misleading, as shown by the incorrect answer with a score of 0.9909. This highlights the importance of not relying on confidence scores alone.

Inference Speed: As expected, this model was by far the slowest. Inference times were consistently in the 8 to 13 second range, making it more than twice as slow as the baseline BERT model (3.7 to 4.6 seconds) and roughly five to seven times slower than DistilBERT (1.2 to 2.6 seconds).
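Errors like the truncated amount are easy to miss when answers are checked by eye. The sketch below shows one way to check them automatically with a normalized exact-match comparison, loosely modeled on the answer normalization used in SQuAD-style evaluation. The helper names `normalize` and `exact_match` are introduced here for illustration.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(predicted, expected):
    """True only when the normalized strings are identical."""
    return normalize(predicted) == normalize(expected)

# Answers reported in this notebook, compared against the article's wording.
print(exact_match("PHP15", "PHP150,000"))  # False
print(exact_match("President Ferdinand R. Marcos Jr",
                  "President Ferdinand R. Marcos Jr."))  # True

# A naive substring check would wrongly accept the truncated amount.
print("PHP15" in "PHP150,000")  # True
```

Exact match is strict by design: it tolerates differences in case and punctuation but rejects a partial answer, which is the behavior wanted for amounts and names.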



In [None]:
# --- MODEL 4: deepset/bert-large-uncased-whole-word-masking-squad2 ---

# Define the correct model name, including the 'deepset/' prefix
model_name_4 = 'deepset/bert-large-uncased-whole-word-masking-squad2'

# Create the pipeline for this final model.
# This model is significantly larger than the others, so the download may be over 1GB.
qna_pipeline_4 = pipeline('question-answering', model=model_name_4, tokenizer=model_name_4)

print(f"Model '{model_name_4}' loaded successfully.")

In [None]:
# --- Testing Loop for Model 4 ---
print(f"--- Now testing model: {model_name_4} ---")

# Run the interactive loop to ask your 10 questions
while True:
    inquiry = input("\nType your question (or enter '*' to stop): ")
    if inquiry == '*':
        break

    start_time = time.time()

    # Use the pipeline variable for this model: qna_pipeline_4
    answer = qna_pipeline_4({'question': inquiry, 'context': context})

    end_time = time.time()
    inference_time = end_time - start_time

    # Print the results for recording
    print(f"\nAnswer: {answer['answer']}")
    print(f"Score (Confidence): {answer['score']:.4f}")
    print(f"Inference Time: {inference_time:.4f} seconds")

My Final Conclusion

This activity was a fascinating comparative study of different transformer architectures for question-answering.

Baseline (BERT-base): Performed well but was the second slowest and made one clear error.

Efficient (DistilBERT): The standout performer. It was the fastest by a large margin and was 100% accurate on the test questions. This makes it an ideal choice for production environments where speed is important.

Robust (RoBERTa): Also 100% accurate and faster than the baseline BERT. It provides a fantastic balance of speed and reliability.

Large (BERT-large): The slowest and, surprisingly, not the most accurate in this test. Its failure on a simple number-based question shows that even the most complex models have weaknesses.

Overall, for this specific task and this small set of test questions, DistilBERT provided the best combination of speed and accuracy, suggesting that a lighter, well-optimized model can outperform its larger counterparts.
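The speed comparison can be summarized numerically from the inference-time ranges reported in the sections above. Using the midpoint of each range as a single-number summary is an assumption made here for illustration; it is not a measured average.

```python
# Inference-time ranges (seconds per question) as reported in the sections above.
reported_ranges = {
    "BERT-base": (3.7, 4.6),
    "DistilBERT": (1.2, 2.6),
    "RoBERTa-base": (2.2, 3.3),
    "BERT-large": (8.0, 13.0),
}

# Use the midpoint of each range as a rough single-number summary.
midpoints = {name: (low + high) / 2 for name, (low, high) in reported_ranges.items()}
fastest = min(midpoints, key=midpoints.get)

# Print the models from fastest to slowest, relative to the fastest one.
for name, mid in sorted(midpoints.items(), key=lambda item: item[1]):
    ratio = mid / midpoints[fastest]
    print(f"{name:<12} midpoint {mid:5.2f}s  ({ratio:.1f}x the fastest)")
```

These ratios depend on the hardware and on only about ten questions per model, so they indicate the ordering of the models more reliably than the exact size of the gaps.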

In [None]:
import json

path = "/content/ExerciseMidterm3_Timoteo.ipynb"

with open(path, "r") as f:
    nb = json.load(f)

# Remove broken widget metadata
if "widgets" in nb.get("metadata", {}):
    del nb["metadata"]["widgets"]

with open(path, "w") as f:
    json.dump(nb, f)
