# Unit 1: Model Benchmark Challenge

**Name:** Anshul Banda  
**SRN:** PES2UG23CS081  
**Section:** B

In this notebook, we compare different transformer models to better understand how their architecture affects what they can and cannot do. Instead of just using models for tasks they are good at, this assignment focuses on testing them in situations where they may struggle, and analyzing why that happens.

We experiment with three models: **BERT**, **RoBERTa**, and **BART**. While all three are transformer-based, they are built differently. BERT and RoBERTa are encoder-only models, which are mainly designed for understanding text, while BART uses an encoder–decoder architecture that is better suited for generating text.

The notebook includes three experiments:
1. Text generation  
2. Fill-in-the-blank (masked language modeling)  
3. Question answering  

For each task, all three models are tested and the results are recorded in an observation table. The goal is not just to see which model performs best, but to explain the results based on how each model is designed. This helps build a clearer understanding of how transformer architectures impact model behavior in practice.


In [1]:
# Install the transformers library
!pip install transformers



The cell above installs the `transformers` library, which provides the pre-trained transformer models and the high-level `pipeline` API used throughout this notebook. With it, we can load BERT, RoBERTa, and BART and run them on different NLP tasks. Installing it first ensures that every later cell has the tools it needs.


In [2]:
from transformers import pipeline

# Define the models
bert = "bert-base-uncased"
roberta = "roberta-base"
bart = "facebook/bart-base"

models = [bert, roberta, bart]

The cell above defines the three model checkpoints used throughout the experiments. BERT and RoBERTa are encoder-only models, while BART uses an encoder–decoder architecture. Storing the checkpoint names as variables and grouping them into a list lets us loop over the models and apply the same task to each one consistently.


In [3]:
prompt = "The future of Artificial Intelligence is"

print("--- Experiment 1: Text Generation ---")
for model_id in models:
    print(f"\nTesting Model: {model_id}")
    try:
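        # We use the 'text-generation' pipeline.
        generator = pipeline("text-generation", model=model_id)
        # Note: the pipeline's default max_new_tokens (256) takes precedence over
        # max_length here, as the warnings in the output below report.
        result = generator(prompt, max_length=20, num_return_sequences=1)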
        print(f"Result: {result[0]['generated_text']}")
    except Exception as e:
        print(f"Error/Failure: {model_id} is likely not designed for this task. Error: {str(e)[:100]}")

--- Experiment 1: Text Generation ---

Testing Model: bert-base-uncased



If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is................................................................................................................................................................................................................................................................

Testing Model: roberta-base



If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is

Testing Model: facebook/bart-base



Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is Yoga Yoga Bet Finn Bet YogaICESdos indexed Bet Mal Yoga Yoga Yoga Finn drankidentallyidentallyidentally constructsaidhaust Veteran Cooperativeiniainia caller Yogaidentallyidentally Yoga Yoga totalsidentallyFatidentallysaidarist homelessness Yoga Yoga on Neural unknow logicidentallyidentally specifiedwcsstore Yoga Yoga throb黒identallyidentally Jeffersonarantine Veteranarantinearantinearantine unequivinia awe Yogaarantine questions questions palatearantinearantine Yogaarantine Veteran questionsarantinearantineidentally objectiniaidentallyinia755 unknow PVinducingarantinearantineiniaarantinequestion Veteran Veteran Veteraninia access democraticiniaarantine rooftoparantineiniaichi Veteranwcsstore VeteranCanadaarantineidentally democratic Veteran755755755orousinia democraticwcsstore electorinia Veteranorousarantinearantine vibrarantine Veteranorousperia thumbsoperatoriniainiainia questions田iniainiaperia Veteran Veteran backfieldiniainia Veter
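
The warnings above report that the default `max_new_tokens` (256) took precedence over `max_length=20`, which is why the outputs run far past 20 tokens. A minimal sketch of one way to cap the length is to pass `max_new_tokens` explicitly instead. The helper name `generate_short` and the stand-in `fake_generator` are hypothetical and exist only so the sketch runs without downloading a model; only the `max_new_tokens` argument and the output shape are taken from the log and the cell above.

```python
def generate_short(generator, prompt, max_new_tokens=20):
    """Call a text-generation callable with an explicit cap on new tokens."""
    # Passing max_new_tokens (and no max_length) should avoid the conflict
    # reported in the warnings, where the default of 256 won.
    outputs = generator(
        prompt, max_new_tokens=max_new_tokens, num_return_sequences=1
    )
    return outputs[0]["generated_text"]


def fake_generator(prompt, max_new_tokens, num_return_sequences):
    """Hypothetical stand-in that mimics the pipeline's list-of-dicts output."""
    text = prompt + " tok" * max_new_tokens
    return [{"generated_text": text} for _ in range(num_return_sequences)]


prompt = "The future of Artificial Intelligence is"
print(generate_short(fake_generator, prompt, max_new_tokens=5))
```

In the notebook itself, the `generator` object created by `pipeline("text-generation", model=model_id)` would take the place of `fake_generator`.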

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
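| **Generation** | BERT | Failure | The model appended a long run of repeated periods to the prompt instead of meaningful text. | BERT is an encoder-only model trained with masked language modeling and bidirectional attention, not left-to-right next-token prediction. The loader even warns that standalone generation expects `is_decoder=True`. |
|  | RoBERTa | Failure | The output was the prompt alone, with no visible continuation. | RoBERTa is also encoder-only and trained with masked language modeling, so it was never trained to continue a sequence. |
|  | BART | Partial Success | The model generated tokens, but the output was incoherent, repetitive, and nonsensical. | BART has a decoder, but the `text-generation` pipeline loads it as `BartForCausalLM`, and the log reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized rather than loaded from the checkpoint. With those weights random, the output is essentially noise. |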
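
The architectural reasons in the table come down to which positions each token is allowed to attend to. The toy sketch below (plain Python, not taken from any of the three models' code) contrasts the bidirectional attention pattern of an encoder with the causal pattern of a decoder. The function names are illustrative assumptions.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "future", "of", "AI"]
for name, mask in [
    ("bidirectional", bidirectional_mask(len(tokens))),
    ("causal", causal_mask(len(tokens))),
]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>6}: {row}")
```

A model trained only with the bidirectional pattern sees the tokens to the right of every position during training, so it has not learned to predict a token from its left context alone, which is what open-ended generation requires.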


In [4]:
print("--- Experiment 2: Fill-Mask ---")

# BERT uses "[MASK]" as its mask token, while RoBERTa and BART use "<mask>"
for model_id in models:
    mask_token = "[MASK]" if "bert-base" in model_id else "<mask>"
    text = f"The goal of Generative AI is to {mask_token} new content."

    print(f"\nTesting Model: {model_id}")
    filler = pipeline("fill-mask", model=model_id)
    results = filler(text)

    # Print the top prediction
    print(f"Top Prediction: {results[0]['token_str']} (Score: {results[0]['score']:.4f})")

--- Experiment 2: Fill-Mask ---

Testing Model: bert-base-uncased


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Top Prediction: create (Score: 0.5397)

Testing Model: roberta-base


Device set to use cuda:0


Top Prediction:  generate (Score: 0.3711)

Testing Model: facebook/bart-base


Device set to use cuda:0


Top Prediction:  create (Score: 0.0746)


| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Fill-Mask** | BERT | Success | Predicted “create” as the missing word with the highest score of the three models (0.5397). | BERT is pre-trained with masked language modeling, so fill-mask matches its training objective directly. |
|  | RoBERTa | Success | Predicted “generate” as the missing word with a moderate score (0.3711). | RoBERTa is an optimized encoder-only model that is also pre-trained with masked language modeling. |
|  | BART | Partial Success | Predicted a reasonable word (“create”) but with a much lower score (0.0746). | BART is pre-trained as a denoising sequence-to-sequence model that reconstructs corrupted text, so it can fill masks, but single-token mask prediction is not its primary objective. |
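
The scores in this table are probabilities assigned to candidate tokens for the masked position. The sketch below uses made-up logits (an assumption, not values from any of the models) to illustrate how two models can rank the same word first while reporting very different scores: a peaked distribution concentrates probability on one token, while a flat one spreads it across many.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over four candidate tokens; index 0 wins in both cases.
peaked = softmax([5.0, 1.0, 0.5, 0.0])
flat = softmax([1.2, 1.0, 0.9, 0.8])

print(f"peaked top score: {max(peaked):.4f}")
print(f"flat top score:   {max(flat):.4f}")
```

This is only an analogy for the BERT and BART rows, where both models chose “create” but with scores of 0.5397 and 0.0746 respectively.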

In [5]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

print("--- Experiment 3: Question Answering ---")
for model_id in models:
    print(f"\nTesting Model: {model_id}")
    try:
        qa = pipeline("question-answering", model=model_id)
        result = qa(question=question, context=context)
        print(f"Answer: {result['answer']} (Score: {result['score']:.4f})")
    except Exception as e:
        print(f"Failure: {model_id} encountered an error. Error: {str(e)[:100]}")

--- Experiment 3: Question Answering ---

Testing Model: bert-base-uncased


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Answer: , bias, and deepfakes. (Score: 0.0044)

Testing Model: roberta-base


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Answer: , and deepfakes. (Score: 0.0042)

Testing Model: facebook/bart-base


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Answer: , bias (Score: 0.0242)


| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **QA** | BERT | Partial Success | Returned a fragment of the list of risks (“, bias, and deepfakes.”) that omits “hallucinations”, with a very low score (0.0044). | The base BERT checkpoint is not fine-tuned for question answering. The log shows that `qa_outputs.weight` and `qa_outputs.bias` were newly initialized, so the QA head is untrained. |
|  | RoBERTa | Partial Success | Extracted a shorter fragment (“, and deepfakes.”) that omits most of the answer, with a very low score (0.0042). | The QA head is likewise newly initialized. RoBERTa requires fine-tuning on a QA dataset (e.g., SQuAD) to perform reliable answer extraction. |
|  | BART | Partial Success | Produced a short, incomplete answer (“, bias”) with a slightly higher but still low score (0.0242). | Although BART can be used for QA, the base checkpoint also loads with a newly initialized QA head and lacks task-specific fine-tuning for answer spans. |
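
Extractive question answering works by scoring every token as a possible answer start and a possible answer end, then returning the best-scoring span. The sketch below is a simplified illustration with made-up logits. The real pipeline's post-processing differs in its details, and `best_span` and `max_answer_len` are names assumed here for the example.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) maximizing start + end score, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


tokens = ["risks", "such", "as", "hallucinations", ",",
          "bias", ",", "and", "deepfakes", "."]
# Hypothetical logits, shaped like what a fine-tuned QA head might produce.
start_logits = [0.1, 0.0, 0.2, 3.0, 0.0, 0.5, 0.0, 0.1, 0.4, 0.0]
end_logits = [0.0, 0.1, 0.0, 0.3, 0.0, 0.6, 0.0, 0.0, 2.5, 0.2]

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))
```

When the head that produces these logits is randomly initialized, as the log reports for all three base checkpoints, the selected span is close to arbitrary, which is consistent with the fragmentary answers and low scores recorded above.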
