# Unit 1 – Model Benchmark Challenge

## Objective
The goal of this notebook is to compare the behavior of three transformer architectures:
- BERT (Encoder-only)
- RoBERTa (Encoder-only)
- BART (Encoder–Decoder)

Each model is forced to perform tasks it may not be architecturally suited for, in order to observe failures and explain why architecture matters.


In [None]:
!pip install -q transformers torch sentencepiece


In [None]:
from transformers import pipeline


In [None]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}


## Experiment 1: Text Generation

**Prompt:**  
"The future of Artificial Intelligence is"

**Expected Behavior:**  
Encoder-only models (BERT, RoBERTa) should fail or behave poorly because they are not trained for autoregressive text generation.  
BART, which has a decoder, should generate more coherent output.
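
The difference between encoder-style and decoder-style attention can be illustrated with a minimal sketch in plain Python. The helper names `bidirectional_mask` and `causal_mask` are hypothetical and are not part of `transformers`; the sketch only shows which positions each token is allowed to see, not how any of the three models implements attention.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "future", "of", "AI"]
for label, mask in [
    ("bidirectional", bidirectional_mask(len(tokens))),
    ("causal", causal_mask(len(tokens))),
]:
    print(label)
    for token, row in zip(tokens, mask):
        # A 1 means the row's token can see the token in that column.
        print(f"  {token:>6}: {row}")
```

A model trained only with the bidirectional pattern has never had to predict a token from its left context alone, which is the situation autoregressive generation puts it in.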


In [None]:
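prompt = "The future of Artificial Intelligence is"

# Run the text-generation pipeline on every checkpoint, including the
# encoder-only ones, and report errors instead of stopping the loop.
for name, model in models.items():
    print(f"\nModel: {name}")
    try:
        generator = pipeline("text-generation", model=model)
        # Note: the logs below report that the pipeline's default
        # max_new_tokens (=256) takes precedence over max_length=30.
        output = generator(prompt, max_length=30)
        print(output)
    except Exception as e:
        print("FAILED:", e)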



Model: BERT



If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]

Model: RoBERTa



If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is'}]

Model: BART



Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is educate educate educate astronomers festivities reproduction Invention reproduction educateFilename elders revisit Japan educateizer organisersspective elev sideways Fight inserted Magesgenerated elders gaining gaining gainingVERuffed gaining gainingValidValidaughtered FightEarphiaaughteredsounding Wednesday checks gaining stood Behind Gn Behind Behind jug consequences liberateFilenameFilename 2022 Gn Gn Gn 203 jugVERsoundingagles 2022 Gn Behind compan Gn Gn jug Martian Nid senior CompareVERFootnote Gn actslapVERVERlaplapVERlapActivityVERFootnoteFootnoteFootnotelap SolarlapFootnoteVERVERVER captains jugengthVER Fight SolarVER Laugh distinctVERVERmonthsmonths availability FightVERength ParanVERVER jugVERlapmonths Fight spearVER consequences accumulateFootnoteVER captains NidengthengthVER distinctVER Fightrown distinctVER KindleVERVER distinct Kindle accumulate gainingVER FightVERVER accumulate consequencesVERVERspective accum
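
The logs above state that `max_new_tokens` (=256) took precedence over `max_length=30`, so each run generated far more text than intended. One possible way to cap the output, sketched below, is to pass `max_new_tokens` explicitly instead of `max_length`. The helper names `generation_kwargs` and `run_generation` are hypothetical, `do_sample=False` is an added assumption to make runs repeatable, and the sketch has not been run against these checkpoints.

```python
def generation_kwargs(max_new_tokens=30):
    """Keyword arguments for generation; max_length is deliberately omitted."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False}


def run_generation(model_id, prompt):
    """Generate a continuation with an explicit cap on new tokens."""
    # Imported lazily so the helper above can be inspected without transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    return generator(prompt, **generation_kwargs())


print(generation_kwargs())
```

Capping the length only shortens the output; it is not expected to make the generated text any more coherent.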

## Experiment 2: Masked Language Modeling (Fill-Mask)

**Sentence:**  
"The goal of Generative AI is to [MASK] new content."

**Expected Behavior:**  
BERT and RoBERTa should perform well because they are trained using Masked Language Modeling (MLM).  
BART may perform inconsistently because MLM is not its primary training objective.
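
As a rough illustration of how a fill-mask prediction is ranked, the sketch below converts raw scores for candidate tokens into probabilities with a softmax and keeps the top few. The function `top_k_softmax` and the values in `toy_logits` are hypothetical: they are toy numbers chosen for illustration, not output from any of the models in this notebook.

```python
import math


def top_k_softmax(logits, k=3):
    """Convert raw scores to probabilities and return the k most likely tokens."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical scores at the masked position (toy values, not model output).
toy_logits = {"create": 4.0, "generate": 2.8, "produce": 1.7, "destroy": -1.0}

for token, prob in top_k_softmax(toy_logits):
    print(f"{token}: {prob:.3f}")
```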


In [None]:
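# "[MASK]" is BERT's mask token. RoBERTa and BART expect "<mask>", which is
# why their runs below fail with "No mask_token (<mask>) found on the input".
masked_sentence = "The goal of Generative AI is to [MASK] new content."

for name, model in models.items():
    print(f"\nModel: {name}")
    try:
        fill_mask = pipeline("fill-mask", model=model)
        output = fill_mask(masked_sentence)
        print(output[:3])  # top 3 predictions
    except Exception as e:
        print("FAILED:", e)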


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Model: BERT


Device set to use cuda:0


[{'score': 0.5396924614906311, 'token': 3443, 'token_str': 'create', 'sequence': 'the goal of generative ai is to create new content.'}, {'score': 0.15575772523880005, 'token': 9699, 'token_str': 'generate', 'sequence': 'the goal of generative ai is to generate new content.'}, {'score': 0.054054826498031616, 'token': 3965, 'token_str': 'produce', 'sequence': 'the goal of generative ai is to produce new content.'}]

Model: RoBERTa


Device set to use cuda:0


FAILED: No mask_token (<mask>) found on the input

Model: BART


Device set to use cuda:0


FAILED: No mask_token (<mask>) found on the input
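
The two failures above are caused by the input, not by the models: the sentence contains BERT's `[MASK]` token, while the error message shows that RoBERTa and BART look for `<mask>`. A minimal sketch of one way to handle this is to treat `[MASK]` as a placeholder and substitute each model's own mask token. The helper `with_mask_token` is hypothetical and not part of `transformers`.

```python
def with_mask_token(sentence, mask_token, placeholder="[MASK]"):
    """Replace a generic placeholder with the mask token a model expects."""
    if placeholder not in sentence:
        raise ValueError(f"placeholder {placeholder!r} not found in sentence")
    return sentence.replace(placeholder, mask_token)


template = "The goal of Generative AI is to [MASK] new content."

print(with_mask_token(template, "[MASK]"))  # form expected by BERT
print(with_mask_token(template, "<mask>"))  # form expected by RoBERTa and BART

# Inside the notebook loop, one could write (not run here):
#   fill_mask(with_mask_token(masked_sentence, fill_mask.tokenizer.mask_token))
```

Reading the token from the pipeline's tokenizer avoids hard-coding it per model.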


## Experiment 3: Question Answering

**Question:**  
"What are the risks?"

**Context:**  
"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

**Expected Behavior:**  
Since these are base models and not fine-tuned on QA datasets like SQuAD, results may be incomplete, incorrect, or random.
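
To see why an untrained QA head gives arbitrary answers, it helps to recall that extractive QA scores each token as a possible start and end of the answer and returns the best-scoring span. The sketch below is a simplified illustration of that idea, not the pipeline's actual implementation. The function `best_span`, the `max_len` limit, and the toy scores are all hypothetical.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices of the highest-scoring valid span."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider spans that end at or after the start token.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best[1], best[2]


# Toy scores (not model output) for a short token sequence.
tokens = ["risks", "such", "as", "hallucinations", ",", "bias"]
start_scores = [0.1, 0.2, 0.0, 2.5, 0.1, 0.3]
end_scores = [0.0, 0.1, 0.2, 0.4, 0.1, 2.0]

s, e = best_span(start_scores, end_scores)
print(tokens[s : e + 1])
```

When the layer that produces these scores is randomly initialized, the scores carry no information about the question, so the selected span is effectively arbitrary.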


In [None]:
question = "What are the risks?"
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

# The base checkpoints have no trained QA head, so the pipeline adds a
# randomly initialized one (see the "newly initialized" warnings below).
for name, model in models.items():
    print(f"\nModel: {name}")
    try:
        qa = pipeline("question-answering", model=model)
        output = qa(question=question, context=context)
        print(output)
    except Exception as e:
        print("FAILED:", e)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model: BERT


Device set to use cuda:0
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.015631154645234346, 'start': 66, 'end': 81, 'answer': ', and deepfakes'}

Model: RoBERTa


Device set to use cuda:0
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.0046161506325006485, 'start': 11, 'end': 66, 'answer': 'AI poses significant risks such as hallucinations, bias'}

Model: BART


Device set to use cuda:0


{'score': 0.1053952882066369, 'start': 38, 'end': 81, 'answer': 'such as hallucinations, bias, and deepfakes'}
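
The `start` and `end` fields in the outputs above are character offsets into the context, so each answer can be recovered by slicing. The sketch below checks this using only the values printed above; the dictionary name `results` is introduced here for illustration.

```python
context = (
    "Generative AI poses significant risks such as "
    "hallucinations, bias, and deepfakes."
)

# Offsets and answers copied from the recorded outputs above.
results = {
    "BERT": {"start": 66, "end": 81, "answer": ", and deepfakes"},
    "RoBERTa": {
        "start": 11,
        "end": 66,
        "answer": "AI poses significant risks such as hallucinations, bias",
    },
    "BART": {
        "start": 38,
        "end": 81,
        "answer": "such as hallucinations, bias, and deepfakes",
    },
}

for name, r in results.items():
    span = context[r["start"] : r["end"]]
    # Each slice should reproduce the reported answer exactly.
    print(f"{name}: {span!r} matches={span == r['answer']}")
```

All three spans are well-formed slices of the context, but only BART's happens to cover the full list of risks.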


## Observation Table

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Text Generation** | BERT | Failure | The output was the prompt followed only by a long run of periods. | BERT is an encoder-only model trained with bidirectional masked language modeling, not autoregressive next-token prediction. The log also notes that `BertLMHeadModel` was loaded without `is_decoder=True`. |
| | RoBERTa | Failure | The model returned the prompt unchanged, with no new text. | RoBERTa is also encoder-only and trained with MLM. `RobertaLMHeadModel` was likewise loaded without `is_decoder=True`. |
| | BART | Failure | The model produced a long, incoherent string of unrelated tokens (e.g., "educate educate educate astronomers festivities ..."). | BART has a decoder, but the `text-generation` pipeline loaded it as `BartForCausalLM`, and the log reports `lm_head.weight` and `model.decoder.embed_tokens.weight` as newly initialized. Generation therefore ran on partly untrained weights rather than on the pretrained encoder-decoder setup. |
| **Fill-Mask** | BERT | Success | The top predictions were "create" (0.54), "generate" (0.16), and "produce" (0.05). | BERT is trained using Masked Language Modeling (MLM), making it well-suited for this task. |
| | RoBERTa | Failure (input error) | The pipeline raised `No mask_token (<mask>) found on the input`. | The sentence used BERT's `[MASK]` token, while RoBERTa's tokenizer expects `<mask>`. This is an input-format failure, not an architectural one: RoBERTa is pretrained with MLM and would be expected to handle the task given the correct token. |
| | BART | Failure (input error) | The pipeline raised the same `No mask_token (<mask>) found on the input` error. | BART's tokenizer also expects `<mask>`, so this run does not test BART's fill-mask ability. Its pretraining focuses on sequence corruption and reconstruction rather than word-level MLM. |
| **Question Answering** | BERT | Failure | The model returned the fragment ", and deepfakes" with a score of about 0.016. | The `qa_outputs` layer was newly initialized. Question answering requires fine-tuning on QA datasets (e.g., SQuAD), which base BERT lacks. |
| | RoBERTa | Failure | The model returned "AI poses significant risks such as hallucinations, bias" with a score of about 0.005, an imprecise span that omits "deepfakes". | RoBERTa's `qa_outputs` layer was also newly initialized, so the span selection is effectively random despite strong language understanding. |
| | BART | Partial | The model returned "such as hallucinations, bias, and deepfakes" with a score of about 0.105, which covers the relevant text. | BART's `qa_outputs` layer was newly initialized as well, so this relevant span is likely coincidental. It still requires QA-specific fine-tuning for reliable answers. |
