# Unit 1 Assignment: The Model Benchmark Challenge

**Objective**: Evaluate architectural differences between BERT, RoBERTa, and BART by forcing them to perform tasks they might not be designed for.

**Models to Test**:
1. **BERT** (`bert-base-uncased`): Encoder-only
2. **RoBERTa** (`roberta-base`): Encoder-only
3. **BART** (`facebook/bart-base`): Encoder-Decoder

In [1]:
# @title Student Details
print("Notebook Submission by: Musharraf-PES2UG23CS915")

Notebook Submission by: Musharraf-PES2UG23CS915


In [3]:
import torch, transformers, tensorflow
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("tensorflow:", tensorflow.__version__)
print("GPU available:", torch.cuda.is_available())


torch: 2.9.0+cpu
transformers: 4.57.6
tensorflow: 2.19.0
GPU available: False


In [2]:
# @title Install & Import Dependencies
# !pip install transformers torch sentencepiece

from transformers import pipeline, set_seed
import torch
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")



In [4]:
# List of models to benchmark
models_to_test = [
    "bert-base-uncased",
    "roberta-base",
    "facebook/bart-base"
]

## Experiment 1: Text Generation

**Task**: Generate text using the prompt: `"The future of Artificial Intelligence is"`
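**Hypothesis**: Encoder-only models (BERT, RoBERTa) are not trained for left-to-right generation and should produce degenerate output, whereas the Encoder-Decoder model (BART) has a decoder and should at least be able to emit new tokens.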
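The architectural difference behind this hypothesis can be pictured as two attention masks. The sketch below is a minimal, pure-Python illustration and not code from any of the three models: `bidirectional_mask` and `causal_mask` are hypothetical helper names, and the word-level `tokens` list is a simplification of real subword tokenization.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Word-level stand-in for the prompt (an assumption; real models use subwords).
tokens = ["The", "future", "of", "AI", "is"]
n = len(tokens)

for name, mask in (("bidirectional", bidirectional_mask(n)),
                   ("causal", causal_mask(n))):
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>7}: {row}")
```

A model pre-trained only under the first mask has never had to predict a token without seeing what follows it, which is one way to read the weak generation results expected from BERT and RoBERTa.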

In [6]:
print("=== Experiment 1: Text Generation ===\n")
prompt = "The future of Artificial Intelligence is"

for model_name in models_to_test:
    print(f"\n--- Testing Model: {model_name} ---")
    try:
        # Build a text-generation pipeline. Models without a trained causal LM head
        # may either raise an error here or load with warnings and generate poor text.
        generator = pipeline('text-generation', model=model_name)

        result = generator(prompt, max_new_tokens=500, num_return_sequences=1, truncation=True)
        print(f"Result: {result[0]['generated_text']}")

    except Exception as e:
        print(f"FAILED / ERROR: {str(e)}")
        print("Observation: Model might not support text-generation or has no causal head.")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


=== Experiment 1: Text Generation ===


--- Testing Model: bert-base-uncased ---


Device set to use cpu
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Result: The future of Artificial Intelligence is....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

--- Testing Model: roberta-base ---


Device set to use cpu


Result: The future of Artificial Intelligence is

--- Testing Model: facebook/bart-base ---


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Result: The future of Artificial Intelligence isSetting Hydra13484gie Hydra Hydra Hydra meg:( trillion dialourt poignant poignant comprised maturednergy Bringing matured Bringing Bringing Bringing referee SSHprotect greater Whitneynergy SSH trillion 1932 Ship wrestling Heroes Whitney Hydra SSH orig Ship orig Whitney SSH wrestling professor Bringing Bringing Friday Crusader Whitney SSH orig orig Yam Heroes Heroes Heroes wrestlingIED wrestling professor364 Friday Norm Norm Norm replen { Ship SSH SSH SSH { massage Friday Whitney SSH trillion advertisersInstoreAndOnlinearg launcher Heroes Heroes Hydra phosphate Dukericanericane Ship orig disastrousgie 1964Secret 433 Friday SSH segregation Factors Factors adopted adopted adopted indicVer Lords indic PEOPLE Crusader Whitney38 greaterosphere Whitney 1964protect adopted VomL positioning 1964 Ship Inquisensationensation { Heroes Whitney permissions 433 Norm multi Whitney 1964 Bou SSHmLmL InquismL SSH SSH Whitney multiensation Inquis Heroes Inqu

## Experiment 2: Masked Language Modeling (Missing Word)

**Task**: Predict the missing word in: `"The goal of Generative AI is to [MASK] new content."`
**Note**: Each model has its own mask token (`[MASK]` for BERT, `<mask>` for RoBERTa and BART), so the query is built with the token reported by that model's tokenizer.
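As a minimal sketch of the note above, the query can be built per model from a small lookup table. The `MASK_TOKENS` dictionary and `build_query` function are hypothetical helpers written for illustration; the notebook cell itself reads the token from `fill_mask.tokenizer.mask_token` instead of hard-coding it, which is the more robust choice.

```python
# Mask tokens as listed in the note; the dictionary itself is a hypothetical helper.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}

TEMPLATE = "The goal of Generative AI is to {} new content."


def build_query(model_name):
    """Insert the model-specific mask token into the sentence template."""
    return TEMPLATE.format(MASK_TOKENS[model_name])


for name in MASK_TOKENS:
    print(f"{name}: {build_query(name)}")
```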

In [7]:
print("=== Experiment 2: Masked Language Modeling ===\n")
base_sentence = "The goal of Generative AI is to {} new content."

for model_name in models_to_test:
    print(f"\n--- Testing Model: {model_name} ---")
    try:
        fill_mask = pipeline('fill-mask', model=model_name)

        # Get the correct mask token for the current model
        mask_token = fill_mask.tokenizer.mask_token
        query = base_sentence.format(mask_token)

        print(f"Query: {query}")
        results = fill_mask(query)

        # Print top 3 predictions
        for i, res in enumerate(results[:3]):
            print(f"Prediction {i+1}: {res['token_str']} (Score: {res['score']:.4f})")

    except Exception as e:
        print(f"FAILED / ERROR: {str(e)}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


=== Experiment 2: Masked Language Modeling ===


--- Testing Model: bert-base-uncased ---


Device set to use cpu


Query: The goal of Generative AI is to [MASK] new content.
Prediction 1: create (Score: 0.5397)
Prediction 2: generate (Score: 0.1558)
Prediction 3: produce (Score: 0.0541)

--- Testing Model: roberta-base ---


Device set to use cpu


Query: The goal of Generative AI is to <mask> new content.
Prediction 1:  generate (Score: 0.3711)
Prediction 2:  create (Score: 0.3677)
Prediction 3:  discover (Score: 0.0835)

--- Testing Model: facebook/bart-base ---


Device set to use cpu


Query: The goal of Generative AI is to <mask> new content.
Prediction 1:  create (Score: 0.0746)
Prediction 2:  help (Score: 0.0657)
Prediction 3:  provide (Score: 0.0609)


## Experiment 3: Question Answering

**Task**: Answer `"What are the risks?"` based on context.
**Context**: `"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."`

In [8]:
print("=== Experiment 3: Question Answering ===\n")
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for model_name in models_to_test:
    print(f"\n--- Testing Model: {model_name} ---")
    try:
        qa_pipeline = pipeline('question-answering', model=model_name)

        result = qa_pipeline(question=question, context=context)
        print(f"Answer: {result['answer']}")
        print(f"Score: {result['score']:.4f}")

    except Exception as e:
        print(f"FAILED / ERROR: {str(e)}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


=== Experiment 3: Question Answering ===


--- Testing Model: bert-base-uncased ---


Device set to use cpu
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: , bias, and deepfakes
Score: 0.0137

--- Testing Model: roberta-base ---


Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: Generative AI
Score: 0.0077

--- Testing Model: facebook/bart-base ---


Device set to use cpu


Answer: poses significant
Score: 0.0246


## Deliverable: Observation Table

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | **Failure** | Appended only a long run of dots (`...................`) to the prompt. | **BERT is an Encoder-only model.** It is pre-trained with bidirectional context for understanding, not for autoregressive text generation (predicting the next token left to right). |
| | RoBERTa | **Failure** | Returned the prompt unchanged; no visible new text was generated. | **RoBERTa is an Encoder-only model.** Like BERT, it has no decoder trained to generate text sequentially. |
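| | BART | **Success (Architecturally)** | Generated a long continuation, but the text is incoherent (`Hydra Hydra ... SSH ...`). | **BART is an Encoder-Decoder model**, so it has a decoder capable of autoregressive generation. In this run, however, the pipeline loaded it as `BartForCausalLM` and the log reports `lm_head.weight` and `model.decoder.embed_tokens.weight` as newly initialized, so the output reflects untrained weights rather than BART's pre-trained seq2seq ability. |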
| **Fill-Mask** | BERT | **Success** | Predicted meaningful words: `create` (0.54), `generate` (0.16). | **BERT is trained on Masked Language Modeling (MLM).** This is its native pre-training objective. |
| | RoBERTa | **Success** | Predicted meaningful words: `generate` (0.37), `create` (0.37). | **RoBERTa is also trained on MLM.** It excels at filling in missing information from bidirectional context. |
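| | BART | **Partial Success** | Predicted relevant words (`create`, `help`, `provide`), but with very low confidence (about 0.06 to 0.07 each). | **BART's pre-training includes text infilling.** Although it is a seq2seq model, it is trained to reconstruct masked spans, so it can fill the mask, though less confidently than BERT and RoBERTa here. |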
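| **QA** | BERT | **Partial Success** | Answer: `, bias, and deepfakes` (Score: 0.0137). Overlaps part of the correct span, but with very low confidence. | **Lack of Fine-tuning.** The log reports `qa_outputs` as newly initialized, so the QA head has not been trained on a QA dataset such as SQuAD. |
| | RoBERTa | **Failure** | Answer: `Generative AI` (Score: 0.0077). Incorrect span. | **Lack of Fine-tuning.** Base model with a newly initialized `qa_outputs` head. |
| | BART | **Failure** | Answer: `poses significant` (Score: 0.0246). Incorrect span. | **Lack of Fine-tuning.** Base models generally need task-specific training data to perform precise Question Answering. |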
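
The two generation failures in the table can be pictured with a toy greedy decoding loop. This is a sketch under loud assumptions, not a trace of what the models did internally: `greedy_generate`, `always_dot`, and `immediate_eos` are hypothetical stand-ins, where each "model" is just a function returning the next token.

```python
def greedy_generate(next_token, prompt, max_new_tokens, eos="<eos>"):
    """Append one predicted token at a time until eos or the token budget is hit."""
    output = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(output)
        if token == eos:
            break
        output.append(token)
    return output


def always_dot(context):
    """Toy predictor whose most likely next token is always a dot."""
    return "."


def immediate_eos(context):
    """Toy predictor that ends the sequence straight away."""
    return "<eos>"


prompt = ["The", "future", "of", "Artificial", "Intelligence", "is"]

dots = greedy_generate(always_dot, prompt, max_new_tokens=5)
stopped = greedy_generate(immediate_eos, prompt, max_new_tokens=5)

print(" ".join(dots))
print(" ".join(stopped))
```

A predictor stuck on one token fills the whole `max_new_tokens` budget with it, while one that ends the sequence at once returns only the prompt, which matches the shape of the BERT and RoBERTa outputs above.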