# Unit 1 Benchmark: Model Architecture Challenge

This notebook runs BERT, RoBERTa, and BART through three tasks (text generation, fill-mask, and question answering) to observe how each model's architecture and pretraining objective affect what it can do out of the box.

**Models**:
- BERT (`bert-base-uncased`) — encoder-only
- RoBERTa (`roberta-base`) — encoder-only
- BART (`facebook/bart-base`) — encoder-decoder
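
To make the encoder versus decoder distinction concrete, here is a minimal sketch of the two attention patterns involved. It is an illustration only, not code from the `transformers` library; the helper name `attention_mask` is hypothetical. An encoder lets every position attend to every other position, while a decoder restricts each position to itself and earlier positions.

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """Return a seq_len x seq_len mask; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Encoder-style attention (BERT, RoBERTa, BART's encoder): all positions visible.
bidirectional = attention_mask(4, causal=False)
# Decoder-style attention (BART's decoder): only current and earlier positions visible.
causal = attention_mask(4, causal=True)

for row in causal:
    print(row)
```

The lower-triangular causal mask is what makes left-to-right generation possible; models pretrained only with the full mask have never learned to predict a token without seeing its right-hand context.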


In [2]:
# Install dependencies (comment out this line if they are already installed)
%pip install -q transformers torch

from transformers import pipeline, set_seed

set_seed(42)


Note: you may need to restart the kernel to use updated packages.




In [3]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base",
}

prompt = "The future of Artificial Intelligence is"
mask_text = "The goal of Generative AI is to [MASK] new content."
question = "What are the risks?"
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."


def try_text_generation(model_name: str):
    try:
        gen = pipeline("text-generation", model=model_name)
        out = gen(prompt, max_new_tokens=20, do_sample=True, temperature=0.9)
        return out
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


def try_fill_mask(model_name: str):
    try:
        fm = pipeline("fill-mask", model=model_name)
        # Use the model's own mask token (BERT: [MASK], RoBERTa/BART: <mask>)
        text = mask_text.replace("[MASK]", fm.tokenizer.mask_token)
        out = fm(text, top_k=5)
        return out
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


def try_qa(model_name: str):
    try:
        qa = pipeline("question-answering", model=model_name)
        out = qa(question=question, context=context)
        return out
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"
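
The three helpers above share the same try/except shape. As one possible refactor (a sketch, not part of the original notebook), the error handling could live in a single wrapper. The name `safe_call` is hypothetical, and the lambdas below are stand-ins so the example runs without downloading any model.

```python
from typing import Any, Callable


def safe_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run fn and return its result, or an 'ERROR: ...' string if it raises."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


# Stand-in callables in place of real pipeline calls.
ok_result = safe_call(lambda x: x * 2, 21)
failed_result = safe_call(lambda: 1 / 0)

print(ok_result)
print(failed_result)
```

Returning the error as a string keeps one failing model from aborting the whole comparison, which is the same design choice the helpers above make.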


In [4]:
# Experiment 1: Text Generation
results_generation = {name: try_text_generation(model) for name, model in models.items()}
results_generation

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use mps:0
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use mps:0
Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0


{'BERT': [{'generated_text': 'The future of Artificial Intelligence is....................'}],
 'RoBERTa': [{'generated_text': 'The future of Artificial Intelligence is'}],
 'BART': [{'generated_text': 'The future of Artificial Intelligence isproduct bankruptcy Bradford Bradford Bradford cc appendix WC bankruptcyOSEabba appendix XIVhetic Yamaha Morocco Morocco NaziGMT'}]}
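
Because `generated_text` includes the prompt, it can help to strip the prompt and look only at what each model appended. The sketch below does that for the BERT and RoBERTa outputs shown above; the helper name `new_text` is hypothetical and the strings are copied from the output cell.

```python
def new_text(prompt: str, generated_text: str) -> str:
    """Return only the text the model appended after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


prompt = "The future of Artificial Intelligence is"
# generated_text values copied from the BERT and RoBERTa results above.
samples = {
    "BERT": "The future of Artificial Intelligence is....................",
    "RoBERTa": "The future of Artificial Intelligence is",
}

for name, text in samples.items():
    added = new_text(prompt, text)
    # A single distinct character, or none at all, signals a degenerate continuation.
    print(name, repr(added), "distinct characters:", len(set(added)))
```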

In [5]:
# Experiment 2: Fill-Mask
results_fill_mask = {name: try_fill_mask(model) for name, model in models.items()}
results_fill_mask

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0
Device set to use mps:0
Device set to use mps:0


{'BERT': [{'score': 0.5397014021873474,
   'token': 3443,
   'token_str': 'create',
   'sequence': 'the goal of generative ai is to create new content.'},
  {'score': 0.15575793385505676,
   'token': 9699,
   'token_str': 'generate',
   'sequence': 'the goal of generative ai is to generate new content.'},
  {'score': 0.054055824875831604,
   'token': 3965,
   'token_str': 'produce',
   'sequence': 'the goal of generative ai is to produce new content.'},
  {'score': 0.04451547563076019,
   'token': 4503,
   'token_str': 'develop',
   'sequence': 'the goal of generative ai is to develop new content.'},
  {'score': 0.01757771521806717,
   'token': 5587,
   'token_str': 'add',
   'sequence': 'the goal of generative ai is to add new content.'}],
 'RoBERTa': [{'score': 0.3711313009262085,
   'token': 5368,
   'token_str': ' generate',
   'sequence': 'The goal of Generative AI is to generate new content.'},
  {'score': 0.36772096157073975,
   'token': 1045,
   'token_str': ' create',
   'sequ

In [6]:
# Experiment 3: Question Answering
results_qa = {name: try_qa(model) for name, model in models.items()}
results_qa

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0


{'BERT': {'score': 0.008298373781144619,
  'start': 66,
  'end': 81,
  'answer': ', and deepfakes'},
 'RoBERTa': {'score': 0.004165984224528074,
  'start': 32,
  'end': 67,
  'answer': 'risks such as hallucinations, bias,'},
 'BART': {'score': 0.01434946060180664,
  'start': 20,
  'end': 60,
  'answer': 'significant risks such as hallucinations'}}
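
The `start` and `end` values are character offsets into `context`, so each answer can be checked against the corresponding slice. The sketch below uses the values copied from the output above. It confirms that every model returned a valid span, even though none of the spans covers the full list of risks.

```python
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

# (start, end, answer) triples copied from the QA results above.
spans = {
    "BERT": (66, 81, ", and deepfakes"),
    "RoBERTa": (32, 67, "risks such as hallucinations, bias,"),
    "BART": (20, 60, "significant risks such as hallucinations"),
}

matches = {}
for name, (start, end, answer) in spans.items():
    # The pipeline's answer should equal the context slice [start:end].
    matches[name] = context[start:end] == answer
    print(name, matches[name])
```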

## Observation Table

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | Failure | Output was mostly dots; no meaningful continuation. | Encoder-only; not trained for autoregressive next-token generation. |
| | RoBERTa | Failure | Returned only the prompt with no continuation. | Encoder-only; pretrained with a bidirectional MLM objective rather than causal next-token prediction, and loaded without `is_decoder=True`. |
| | BART | Failure | Generated long, incoherent text (“product bankruptcy…”). | Loaded as `BartForCausalLM` with newly initialized `lm_head` and decoder embedding weights, so the output is effectively random; no causal-LM fine-tuning. |
| **Fill-Mask** | BERT | Success | Predicted “create” (0.54), “generate” (0.16), “produce” (0.05) as top candidates. | Trained with Masked Language Modeling (MLM). |
| | RoBERTa | Success | Predicted “generate” and “create” as top candidates with nearly equal scores (about 0.37 each). | Optimized MLM objective; strong masked token prediction. |
| | BART | Partial | Predicted reasonable words but with lower confidence. | Denoising seq2seq objective; can fill masks but not specialized MLM. |
| **QA** | BERT | Failure | Low score (0.008); span covers only the tail of the list (“, and deepfakes”). | Base checkpoint not fine-tuned for QA; the `qa_outputs` span head is randomly initialized. |
| | RoBERTa | Failure | Low score; partial span only. | Base model not QA-fine-tuned; span head randomly initialized. |
| | BART | Failure | Low score; partial span only. | Base model not QA-fine-tuned; QA head randomly initialized. |
