# Unit 1 – Model Benchmark Challenge

## Objective
The goal of this notebook is to compare the behavior of three transformer architectures:

- **BERT (Encoder-only)**
- **RoBERTa (Encoder-only)**
- **BART (Encoder–Decoder)**

Each model is deliberately forced to perform tasks it may not be architecturally suited for, in order to observe failures and understand why architecture matters.


In [None]:
!pip install transformers torch --quiet


In [None]:
from transformers import pipeline




## Models Used

| Model | Hugging Face Identifier | Architecture |
|------|------------------------|--------------|
| BERT | bert-base-uncased | Encoder-only |
| RoBERTa | roberta-base | Encoder-only |
| BART | facebook/bart-base | Encoder–Decoder |

In [12]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}
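
The architecture labels in the table above can be checked against each checkpoint's configuration. What follows is a minimal sketch, not part of the original notebook: `describe_architecture` and `load_config` are hypothetical helper names, and the sketch assumes the `model_type` and `is_encoder_decoder` attributes that `transformers` configuration objects expose.

```python
from types import SimpleNamespace


def describe_architecture(config):
    """Return a short architecture label for a config-like object."""
    # Encoder-decoder checkpoints (such as BART) set is_encoder_decoder=True.
    if getattr(config, "is_encoder_decoder", False):
        return f"{config.model_type}: encoder-decoder"
    # The flag alone cannot tell encoder-only from decoder-only models.
    return f"{config.model_type}: single-stack (encoder-only or decoder-only)"


def load_config(model_id):
    """Download a checkpoint's configuration (requires network access)."""
    # Imported lazily so describe_architecture works without transformers.
    from transformers import AutoConfig
    return AutoConfig.from_pretrained(model_id)


# Offline illustration with a stand-in object instead of a real download.
fake_bart = SimpleNamespace(model_type="bart", is_encoder_decoder=True)
print(describe_architecture(fake_bart))
```

In the notebook, `describe_architecture(load_config(model_id))` could be called for each entry in `models`.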

## Experiment 1: Text Generation

### Task
Generate text given the prompt:

> **"The future of Artificial Intelligence is"**

### Expected Behavior
- Encoder-only models (BERT, RoBERTa) should fail or behave poorly because they are not trained for autoregressive text generation.
- BART, which has a decoder, should generate longer text, though quality may vary.


In [13]:
prompt = "The future of Artificial Intelligence is"

for name, model in models.items():
    print(f"\nModel: {name}")
    try:
        generator = pipeline("text-generation", model=model)
        output = generator(prompt, max_length=30)
        print(output)
    except Exception as e:
        print("FAILED:", e)


Model: BERT


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]

Model: RoBERTa


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is'}]

Model: BART


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is Ji Ji Marriott Ji Ji Ji PHumentbj Ji torrent torrent bent packed Ji Jiacteria disdain Marriott bent playoff TPPacters Marriott Marriottisconsin bentisconsin bent bent Ji Cesisconsin Ji Jiisconsinisconsinisconsin arsenicBAT Prohibition OCT OCT OCT Pyramid rollout bent OCT Prohibition------------- Troll spam spam OCT wells Ji Ji spam OCT OCT Ji fraternityacteria Ji fraternity Jiacteria OCT spam OCT solitude exceedingly Ji exceedingly bent Sunder Ji sperm Ji Ji fraternity injected Ji Ji phosphate Sunder spam provoked injected Combat fraternity injected injected spam injected spam sperm OCT OCT provoked OCT OCTichick provokedobal Sunder Ji Ji Sunder arsenic Ji Prohibition 275 OCT erotic OCT OCT erotic spam OCTatche OCT OCTcas provoked OCT Prohibition solitude spam erotic OCT ISPs spam OCT provoked provoked Judicial spam spam Ji provoked Sunder Sunder sperm Subaru provoked Ji Sunder 275cas OCT spam spam bent OCT Sunder OCT OCT Ma

### Observations (Experiment 1 – Text Generation)

- **BERT** generated a sequence consisting almost entirely of repeated punctuation (dots) and failed to produce any meaningful continuation of the prompt.
- **RoBERTa** simply repeated the original prompt without generating any additional text.
- **BART** generated a long continuation of the prompt; however, the output was highly noisy and incoherent, containing unrelated words and repetitive patterns.
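
The warnings in the output above state that `max_new_tokens` (=256) took precedence over `max_length=30`, which likely explains why the BERT and BART outputs run far past 30 tokens. Below is a hedged sketch of a call that sets the generation length explicitly. `generate_short` and `build_generator` are hypothetical helpers, and the sketch assumes the pipeline forwards `max_new_tokens` and `do_sample` to generation.

```python
def generate_short(generator, prompt, max_new_tokens=30):
    """Call a text-generation pipeline with an explicit new-token budget."""
    # max_new_tokens counts only generated tokens, excluding the prompt.
    # do_sample=False requests greedy decoding so repeated runs are comparable.
    return generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)


def build_generator(model_id):
    """Create the pipeline (downloads the checkpoint; requires network)."""
    # Imported lazily so generate_short can be exercised with any callable.
    from transformers import pipeline
    return pipeline("text-generation", model=model_id)
```

For example, `generate_short(build_generator("facebook/bart-base"), prompt)` would cap the continuation at 30 new tokens.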

### Explanation

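BERT and RoBERTa are encoder-only transformer models pretrained with masked language modeling, which predicts hidden tokens from context on both sides, not the next token from the left context alone. The pipeline still loads them with a language-modeling head (note the `is_decoder=True` hint in the warnings), but their weights were never trained for autoregressive generation. The outputs are therefore degenerate: BERT emits a run of dots, and RoBERTa adds nothing to the prompt.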

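BART uses an encoder–decoder architecture, which does support text generation. However, the `text-generation` pipeline loaded `facebook/bart-base` as `BartForCausalLM`, and the warning shows that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized instead of being loaded from the checkpoint. The decoder could therefore emit tokens, but it did so from partly untrained weights, which explains the incoherent output. BART is pretrained as a sequence-to-sequence denoising model, not as a standalone causal language model.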
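
The difference between bidirectional and autoregressive training comes down to which positions each token may attend to. The toy sketch below is plain Python, not taken from any library, and its function names are illustrative only. It builds both visibility masks for a short sequence.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position (encoder style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions 0..i (decoder style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "future", "of", "AI"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        # Each row lists which positions this token is allowed to see.
        print(f"  {token:>6} sees {row}")
```

A model trained only under the bidirectional mask has never had to predict a token without seeing what follows it, which is the situation it faces during left-to-right generation.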


## Experiment 2: Masked Language Modeling (Fill-Mask)

### Task
Predict the missing word in the sentence:

> **"The goal of Generative AI is to [MASK] new content."**

### Expected Behavior
- BERT and RoBERTa should perform well due to MLM training.
- BART may behave inconsistently as MLM is not its primary training objective.


In [14]:
masked_sentence = "The goal of Generative AI is to [MASK] new content."

for name, model in models.items():
    print(f"\nModel: {name}")
    try:
        fill_mask = pipeline("fill-mask", model=model)
        output = fill_mask(masked_sentence)
        print(output[:3])
    except Exception as e:
        print("FAILED:", e)


Model: BERT


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.5396937131881714, 'token': 3443, 'token_str': 'create', 'sequence': 'the goal of generative ai is to create new content.'}, {'score': 0.15575705468654633, 'token': 9699, 'token_str': 'generate', 'sequence': 'the goal of generative ai is to generate new content.'}, {'score': 0.05405480042099953, 'token': 3965, 'token_str': 'produce', 'sequence': 'the goal of generative ai is to produce new content.'}]

Model: RoBERTa


Device set to use cpu


FAILED: No mask_token (<mask>) found on the input

Model: BART


Device set to use cpu


FAILED: No mask_token (<mask>) found on the input


### Observations (Experiment 2 – Masked Language Modeling)

- **BERT** successfully predicted contextually appropriate words for the masked token: "create" (score ≈ 0.54), "generate" (≈ 0.16), and "produce" (≈ 0.05).
- **RoBERTa** failed to execute the task and raised an error indicating that no `<mask>` token was found in the input sentence.
- **BART** failed with the same error, because it also expects `<mask>` and not `[MASK]`.

### Explanation

BERT is explicitly trained using Masked Language Modeling (MLM), where tokens are replaced with the `[MASK]` symbol during training. As a result, it performs very well on fill-mask tasks using this format.

RoBERTa and BART follow different tokenization conventions and use `<mask>` as their mask token instead of `[MASK]`. Because the input sentence did not contain that token, the pipeline raised an error before either model made a prediction. These two failures therefore reflect an input-format mismatch, not an architectural limitation. This highlights the importance of model-specific input requirements, even for similar NLP tasks.
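
One way to avoid the mismatch is to build the sentence from each tokenizer's own mask token. This is a minimal sketch, not the notebook's original code: `with_model_mask`, `run_fill_mask`, and the `{mask}` placeholder are hypothetical, and the sketch assumes the fill-mask pipeline exposes its tokenizer as `fill_mask.tokenizer` and accepts a `top_k` argument.

```python
def with_model_mask(template, mask_token, placeholder="{mask}"):
    """Substitute a model-specific mask token into a sentence template."""
    return template.replace(placeholder, mask_token)


def run_fill_mask(models, template, top_k=3):
    """Run fill-mask for each model (downloads checkpoints; needs network)."""
    # Imported lazily so with_model_mask can be used without transformers.
    from transformers import pipeline
    results = {}
    for name, model_id in models.items():
        fill_mask = pipeline("fill-mask", model=model_id)
        # Ask the tokenizer which mask token this checkpoint expects.
        sentence = with_model_mask(template, fill_mask.tokenizer.mask_token)
        results[name] = fill_mask(sentence, top_k=top_k)
    return results


template = "The goal of Generative AI is to {mask} new content."
print(with_model_mask(template, "[MASK]"))  # BERT-style input
print(with_model_mask(template, "<mask>"))  # RoBERTa/BART-style input
```

With this change the comparison would measure the models' predictions and not their input formats. The results of such a run are not shown here because it was not part of the original experiment.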


## Experiment 3: Question Answering

### Question
> **"What are the risks?"**

### Context
> **"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."**

### Expected Behavior
Since these are base models and not fine-tuned on QA datasets like SQuAD, outputs may be incomplete or inconsistent.


In [15]:
question = "What are the risks?"
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

for name, model in models.items():
    print(f"\nModel: {name}")
    try:
        qa = pipeline("question-answering", model=model)
        output = qa(question=question, context=context)
        print(output)
    except Exception as e:
        print("FAILED:", e)


Model: BERT


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.004681949038058519, 'start': 20, 'end': 31, 'answer': 'significant'}

Model: RoBERTa


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.004456312395632267, 'start': 32, 'end': 81, 'answer': 'risks such as hallucinations, bias, and deepfakes'}

Model: BART


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.0396338626742363, 'start': 0, 'end': 10, 'answer': 'Generative'}


### Observations (Experiment 3 – Question Answering)

- **BERT** returned an extremely short and incomplete answer ("significant") with a very low confidence score.
- **RoBERTa** extracted a longer and more relevant span mentioning "risks such as hallucinations, bias, and deepfakes", but the confidence score remained very low.
- **BART** produced an incorrect and incomplete answer ("Generative"), failing to capture the actual risks described in the context.
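
The question-answering pipeline reports character offsets into the context, so each answer can be checked directly against its `start` and `end` values. The sketch below uses only the values printed above. `span_matches` and `terms_covered` are hypothetical helpers, and the list of expected risk terms is an assumption drawn from the context sentence.

```python
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

# (start, end, answer) copied from the pipeline outputs above.
reported = {
    "BERT": (20, 31, "significant"),
    "RoBERTa": (32, 81, "risks such as hallucinations, bias, and deepfakes"),
    "BART": (0, 10, "Generative"),
}

# Assumed reference answer: the three risks named in the context.
expected_terms = ["hallucinations", "bias", "deepfakes"]


def span_matches(context, start, end, answer):
    """Check that the reported offsets slice out the reported answer."""
    return context[start:end] == answer


def terms_covered(answer, terms):
    """Return the expected terms that appear in the answer text."""
    return [term for term in terms if term in answer]


for name, (start, end, answer) in reported.items():
    ok = span_matches(context, start, end, answer)
    covered = terms_covered(answer, expected_terms)
    print(f"{name}: span ok={ok}, risks covered={len(covered)}/{len(expected_terms)}")
```

A simple coverage count like this makes the success or failure classification less subjective, although it is only a rough proxy for answer quality.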

### Explanation

All three models used in this experiment are base models and are not fine-tuned on Question Answering datasets such as SQuAD. As a result, the question-answering heads were randomly initialized, leading to unreliable and inconsistent outputs.

Although RoBERTa extracted a more relevant text span than BERT and BART, its question-answering head was also randomly initialized, so this result may be coincidental. The low scores across all three models (all below 0.04) indicate that pretraining alone is insufficient for accurate question answering. This experiment highlights the importance of task-specific fine-tuning for reliable QA performance.


## Final Observation Table

| Task | Model | Classification (Success / Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
|-----|------|-----------------------------------|----------------------------------------|---------------------------------------------|
| Text Generation | BERT | Failure | Generated repetitive punctuation (dots) and failed to produce meaningful text. | BERT is an encoder-only model and is not trained for autoregressive next-token generation. |
| Text Generation | RoBERTa | Failure | Repeated the input prompt without generating any continuation. | RoBERTa is also encoder-only and lacks a decoder for sequential text generation. |
| Text Generation | BART | Partial Success | Generated a long continuation, but the output was noisy and incoherent. | BART has an encoder–decoder architecture enabling generation, but the pipeline loaded `facebook/bart-base` as `BartForCausalLM` with a newly initialized `lm_head`, and BART is not pretrained as a causal language model. |
| Fill-Mask | BERT | Success | Correctly predicted words like "create", "generate", and "produce" with high confidence. | BERT is trained using Masked Language Modeling (MLM), making it well-suited for fill-mask tasks. |
| Fill-Mask | RoBERTa | Failure | Failed with an error indicating that no `<mask>` token was found in the input. | RoBERTa expects the `<mask>` token instead of `[MASK]`, highlighting model-specific input requirements. |
| Fill-Mask | BART | Failure | Failed with the same mask token error and did not produce predictions. | BART expects `<mask>` instead of `[MASK]`, so the failure reflects the input format, not the architecture. |
| Question Answering | BERT | Failure | Returned an extremely short and incomplete answer ("significant") with very low confidence. | The model is not fine-tuned on QA datasets, resulting in unreliable predictions. |
| Question Answering | RoBERTa | Partial Success | Extracted a relevant answer span mentioning hallucinations, bias, and deepfakes, but with very low confidence. | Although RoBERTa has strong language understanding, it lacks QA-specific fine-tuning. |
| Question Answering | BART | Failure | Returned an incorrect and incomplete answer ("Generative"). | BART requires task-specific QA fine-tuning despite having an encoder–decoder architecture. |


## Key Insight

This benchmark demonstrates that how well a transformer model performs depends heavily on the task.
Model architecture and training objectives strongly influence performance, and pretraining alone is insufficient for reliable task execution without proper fine-tuning.
