# Unit 1 Benchmark: BERT vs RoBERTa vs BART
This notebook runs three quick experiments (text generation, fill-mask, and extractive question answering) to observe how **architecture** affects behaviour. The goal is to note practical differences using the pretrained base checkpoints; nothing is trained or fine-tuned here.

In [3]:
# Imports and model list
from transformers import pipeline, set_seed
set_seed(42)
models = {
    'BERT': 'bert-base-uncased',
    'RoBERTa': 'roberta-base',
    'BART': 'facebook/bart-base'
}
print('Models to test:', models)

Models to test: {'BERT': 'bert-base-uncased', 'RoBERTa': 'roberta-base', 'BART': 'facebook/bart-base'}


## Exp 1 — Text Generation
Prompt: "The future of Artificial Intelligence is"
We try `pipeline('text-generation', model=...)` for each model and capture results or errors. Encoder-only models are not designed for autoregressive generation, so we expect issues for BERT/RoBERTa.

In [4]:
prompt = "The future of Artificial Intelligence is"
for name, m in models.items():
    print("\n---", name, m, '---')
    try:
        gen = pipeline('text-generation', model=m)
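        # Note: the pipeline's default max_new_tokens (256) takes precedence over
        # max_length=40, as the warning in the output below reports.
        out = gen(prompt, max_length=40, num_return_sequences=1)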
        print('Result:', out[0]['generated_text'])
    except Exception as e:
        print('Generation failed or not suitable for this model:', str(e))


--- BERT bert-base-uncased ---


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 2144.74it/s, Materializing param=cls.predictions.transform.dense.weight]                 
BertLMHeadModel LOAD REPORT from: bert-base-uncased
Key                         | Status     |  | 
----------------------------+------------+--+-
bert.pooler.dense.bias      | UNEXPECTED |  | 
cls.seq_relationship.weight | UNEXPECTED |  | 
bert.pooler.dense.weight    | UNEXPECTED |  | 
cls.seq_relationship.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Passing `generation_config` together with generation-related arguments=({'max_length', 'num_return_sequences'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Both `max_new_tokens` (=256) and `max_length`(=40) se

Result: The future of Artificial Intelligence is. the some some some or an a a ( ). of the or and. it it it it it it for for as or some some some some some some some some some some some mostly rock. the more actually um ", - - - - - ) ( " ( ). the so and the so and. " ( " ". it it it it it it it it that a to is'- - - ( ) ( ). it it that. it it it it it it it that some some some some some some some some some some some some some some some some some some some some some some some some some some some those being ".. it not so so van.. it or or and and and and and and " ( ( ( ( ( ". ( ( ( ) on well which salem lost it or in their as fordw a /. it it many way as the and and to her and and and and and and and and and and and and and in and and and and the a iso i to her as that they were as my their on as.........................................................................................................

--- RoBERTa roberta-base ---


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 2433.74it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForCausalLM LOAD REPORT from: roberta-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is

--- BART facebook/bart-base ---


Loading weights: 100%|██████████| 159/159 [00:00<00:00, 2421.01it/s, Materializing param=model.decoder.layers.5.self_attn_layer_norm.weight]   
This checkpoint seem corrupted. The tied weights mapping for this model specifies to tie model.decoder.embed_tokens.weight to lm_head.weight, but both are absent from the checkpoint, and we could not find another related tied weight for those keys
BartForCausalLM LOAD REPORT from: facebook/bart-base
Key                                                           | Status     | 
--------------------------------------------------------------+------------+-
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn.out_proj.weight   | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.fc1.bias                    | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn_layer_norm.bias   | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn_layer_norm.weight | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn.k_proj.weight     | UNEXPECTED | 
encoder.l

Result: The future of Artificial Intelligence is MAD Bradford Bradford Bradford gunmen appendixNPR appendixomething quarter Idol Idol Idol *** Idol boost fungus fungus freeze hopsatherine MoroccoNPRNPR appendix \(\ Cosby Randall Idol Bailey Treasure Treasure Treasure cumbersome Contribut charms boost boost ObamaCare XIV cumbersome cumbersome Morocco cumbersome012 bitten XIV Deal Moroccoousands carbdouble Treasure Nazi Nazi Dani appendix boost boost stereo cumbersome cumbersome welcominghusPrevioushusPreviousPreviousPrevioushus Zerhus Danihus Dani shores boost appendix appendix Dani stereoPrevioushus \(\ pled cumbersome Ahmad cumbersome Morocco appendix Dani cumbersome appendixhus cumbersome stereo appendix DaniPrevious boosthus boostBus cumbersomemeasureshus appendix cumbersome cumbersome XIV appendixsur Dani Online cumbersome appendixsur426 cumbersome cumbersomeccess welcoming appendixsur appendix pled Cosbysur Cosby predators cumbersome XIV cumbersome�� predators Dani Online Danisur 
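
The three results above fail in different ways: repetition, no continuation at all, and unrelated tokens. A rough triage of such outputs could be sketched as follows. The helper names and the `min_distinct` cut-off are hypothetical, and the samples are short excerpts copied from the outputs above, not the full generations.

```python
def continuation(prompt: str, generated: str) -> str:
    """Return only the text the model added after the prompt."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    return generated.strip()


def distinct_ratio(text: str) -> float:
    """Share of unique whitespace-separated tokens (1.0 means no repeats)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def diagnose(prompt: str, generated: str, min_distinct: float = 0.5) -> str:
    """Rough triage of a generation result; min_distinct is an arbitrary cut-off."""
    added = continuation(prompt, generated)
    if not added:
        return "no new text"
    if distinct_ratio(added) < min_distinct:
        return "repetitive"
    return "needs manual review"


prompt = "The future of Artificial Intelligence is"
# Short excerpts from the notebook outputs (assumption: representative enough).
samples = {
    "BERT": prompt + " it it it it it it for for as or some some some some some",
    "RoBERTa": prompt,
    "BART": prompt + " MAD Bradford Bradford Bradford gunmen appendixNPR"
                     " appendixomething quarter Idol Idol Idol",
}

for name, text in samples.items():
    print(name, "->", diagnose(prompt, text))
```

A repetition ratio only catches one failure mode. The BART excerpt is varied enough to pass the cut-off even though it is not a meaningful continuation, so it still has to be read by a person.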

## Exp 2 — Masked Language Modeling
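Sentence: "The goal of Generative AI is to create new [MASK]."
We use `pipeline('fill-mask', model=...)`. BERT and RoBERTa were trained with MLM and should do well; BART was trained with a denoising seq2seq objective rather than plain MLM, so expect weaker or unexpected behaviour. Note that `[MASK]` is BERT's mask token, while the RoBERTa and BART tokenizers use `<mask>`.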

In [5]:
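# Note: [MASK] is BERT's mask token. RoBERTa and BART expect <mask>, so they
# reject this sentence (see the errors in the output below).
masked_sentence = "The goal of Generative AI is to create new [MASK]."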
for name, m in models.items():
    print("\n---", name, m, '---')
    try:
        filler = pipeline('fill-mask', model=m)
        preds = filler(masked_sentence)
        for p in preds[:5]:
            print(p['token_str'].strip(), f"({p['score']:.2f})")
    except Exception as e:
        print('Fill-mask failed or not suited for this model:', str(e))


--- BERT bert-base-uncased ---


Loading weights: 100%|██████████| 202/202 [00:00<00:00, 2023.64it/s, Materializing param=cls.predictions.transform.dense.weight]                 
BertForMaskedLM LOAD REPORT from: bert-base-uncased
Key                         | Status     |  | 
----------------------------+------------+--+-
bert.pooler.dense.bias      | UNEXPECTED |  | 
cls.seq_relationship.weight | UNEXPECTED |  | 
bert.pooler.dense.weight    | UNEXPECTED |  | 
cls.seq_relationship.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


applications (0.06)
ideas (0.05)
problems (0.05)
systems (0.04)
information (0.03)

--- RoBERTa roberta-base ---


Loading weights: 100%|██████████| 202/202 [00:00<00:00, 2349.25it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForMaskedLM LOAD REPORT from: roberta-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Fill-mask failed or not suited for this model: No mask_token (<mask>) found on the input

--- BART facebook/bart-base ---


Loading weights: 100%|██████████| 259/259 [00:00<00:00, 2365.88it/s, Materializing param=model.shared.weight]                                  


Fill-mask failed or not suited for this model: No mask_token (<mask>) found on the input
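
The two `No mask_token (<mask>) found on the input` errors above come from the input string, not from the models: the sentence hard-codes BERT's `[MASK]`. One way to avoid this, sketched below, is to keep a neutral placeholder and substitute each tokenizer's own mask token. The names `TEMPLATE`, `build_masked_sentence`, and `run_fill_mask` are hypothetical helpers, not part of the original notebook.

```python
TEMPLATE = "The goal of Generative AI is to create new {mask}."


def build_masked_sentence(template: str, mask_token: str) -> str:
    """Swap the neutral {mask} placeholder for a tokenizer-specific mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one {mask} placeholder")
    return template.replace("{mask}", mask_token)


def run_fill_mask(model_name: str, template: str = TEMPLATE):
    """Run fill-mask with the mask token taken from the model's own tokenizer."""
    # Imported lazily so build_masked_sentence can be used without transformers.
    from transformers import pipeline

    filler = pipeline("fill-mask", model=model_name)
    sentence = build_masked_sentence(template, filler.tokenizer.mask_token)
    return filler(sentence)[:5]


# The substitution step on its own, with the two mask tokens seen in this notebook:
print(build_masked_sentence(TEMPLATE, "[MASK]"))
print(build_masked_sentence(TEMPLATE, "<mask>"))
```

Calling `run_fill_mask(m)` inside the existing `for name, m in models.items()` loop would give all three models a sentence they accept. Their predictions are not shown here because that run was not part of the notebook.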


## Exp 3 — Question Answering (Extractive)
Context: "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
Question: "What are the risks?"
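We use `pipeline('question-answering', model=...)`. Note: none of these base checkpoints was fine-tuned on a QA dataset, so the span-prediction head is newly initialized (reported as `qa_outputs ... MISSING` in the load reports) and answers are expected to be poor.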

In [6]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"
for name, m in models.items():
    print("\n---", name, m, '---')
    try:
        qa = pipeline('question-answering', model=m)
        ans = qa(question=question, context=context)
        print('Answer:', ans)
    except Exception as e:
        print('QA failed or not ideal for this model:', str(e))


--- BERT bert-base-uncased ---


Loading weights: 100%|██████████| 197/197 [00:00<00:00, 2352.16it/s, Materializing param=bert.encoder.layer.11.output.dense.weight]              
BertForQuestionAnswering LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.seq_relationship.weight                | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
bert.pooler.dense.weight                   | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
bert.pooler.dense.bias                     | UNEXPECTED | 
qa_outputs.bias                            | MISSING    | 
qa_outputs.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can 

Answer: {'score': 0.008658115286380053, 'start': 46, 'end': 81, 'answer': 'hallucinations, bias, and deepfakes'}

--- RoBERTa roberta-base ---


Loading weights: 100%|██████████| 197/197 [00:00<00:00, 2389.67it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForQuestionAnswering LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.dense.weight            | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
qa_outputs.bias                 | MISSING    | 
qa_outputs.weight               | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Answer: {'score': 0.008385971188545227, 'start': 0, 'end': 67, 'answer': 'Generative AI poses significant risks such as hallucinations, bias,'}

--- BART facebook/bart-base ---


Loading weights: 100%|██████████| 259/259 [00:00<00:00, 2219.67it/s, Materializing param=model.shared.weight]                                  
BartForQuestionAnswering LOAD REPORT from: facebook/bart-base
Key               | Status  | 
------------------+---------+-
qa_outputs.bias   | MISSING | 
qa_outputs.weight | MISSING | 

Notes:
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Answer: {'score': 0.01259099692106247, 'start': 66, 'end': 82, 'answer': ', and deepfakes.'}
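
To compare the three answers above by more than eye, one option is a SQuAD-style token-overlap F1 against the span a reader would expect. This is a minimal sketch: the reference answer is an assumption chosen for illustration, and `normalize` and `token_f1` are hypothetical helper names.

```python
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Assumed reference span; the notebook itself does not define a gold answer.
reference = "hallucinations, bias, and deepfakes"

# Answers copied from the pipeline outputs above.
answers = {
    "BERT": "hallucinations, bias, and deepfakes",
    "RoBERTa": "Generative AI poses significant risks such as hallucinations, bias,",
    "BART": ", and deepfakes.",
}

scores = {name: token_f1(ans, reference) for name, ans in answers.items()}
for name, f1 in scores.items():
    print(f"{name}: F1 = {f1:.2f}")
```

A high F1 here does not mean the model is reliable. All three pipeline scores are around 0.01 and come from a newly initialized `qa_outputs` head, so a matching span from a base checkpoint is largely a matter of chance.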


## Observation Table (filled)
| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
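| **Generation** | BERT | Failure | No error was raised, but the output is incoherent and highly repetitive ("it it it it ...", "some some some ..."). | BERT is encoder-only and trained for masked-token prediction, not autoregressive next-token prediction; the pipeline loaded it as `BertLMHeadModel` without `is_decoder=True`. |
|  | RoBERTa | Failure | No error was raised, but the pipeline returned the prompt unchanged, with no continuation. | RoBERTa is also encoder-only (MLM objective), not an autoregressive generator. |
|  | BART | Failure | Produced a long run of unrelated tokens ("MAD Bradford Bradford ..."), not a coherent continuation. | BART is an encoder-decoder (seq2seq) model, but `text-generation` loaded it as the decoder-only `BartForCausalLM`. The load report lists the encoder weights as UNEXPECTED and the tied `lm_head` weights as absent from the checkpoint, which likely explains the noise. BART is meant to generate through its seq2seq interface. |
| **Fill-Mask** | BERT | Success | Predicted plausible tokens ("applications" 0.06, "ideas" 0.05, "problems" 0.05), though with low probability. | BERT was trained with the MLM objective, so fill-mask is its natural task. |
|  | RoBERTa | Failure (input error) | Raised `No mask_token (<mask>) found on the input`. | The sentence uses BERT's `[MASK]` token, but RoBERTa's tokenizer expects `<mask>`. This is an input-format problem, not an architectural limit: RoBERTa is an MLM model. |
|  | BART | Failure (input error) | Raised `No mask_token (<mask>) found on the input`. | Same cause: BART's tokenizer also expects `<mask>`. BART's denoising (seq2seq) objective was therefore not actually tested in this run. |
| **QA** | BERT | Partial | Returned the span "hallucinations, bias, and deepfakes", but with a score of only 0.009. | The `qa_outputs` head is MISSING from the checkpoint and newly initialized; extractive QA needs fine-tuning (e.g. on SQuAD), so this span is not a reliable prediction. |
|  | RoBERTa | Failure | Returned "Generative AI poses significant risks such as hallucinations, bias," (score 0.008), which is too long and cuts off before "deepfakes". | The `qa_outputs` head is newly initialized; the model requires QA fine-tuning for good extractive performance. |
|  | BART | Failure | Returned ", and deepfakes." (score 0.013), only a fragment of the answer. | Encoder-decoder models can be adapted to QA, but the `qa_outputs` head is newly initialized here, so fine-tuning is still required. |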

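## Summary

- Architecture matters: encoder-only models (BERT/RoBERTa) suit understanding tasks such as fill-mask and embeddings, but not autoregressive generation.
- Seq2seq models (BART) can generate text and can be adapted for QA/summarization, but only when loaded with the matching task head; through `text-generation` the base checkpoint produced noise in this run.
- Input format matters too: the mask token is tokenizer-specific (`[MASK]` for BERT, `<mask>` for RoBERTa and BART).
- Practical note: always check whether a model was fine-tuned for your exact task. A base checkpoint with a newly initialized head gives essentially arbitrary results.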
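
One cheap way to act on the practical note is to look at the `architectures` field of a checkpoint's config, which records the class the checkpoint was saved with. The sketch below is a heuristic only. The helper names are hypothetical, and the literal architecture lists are illustrative assumptions rather than values read in this notebook.

```python
def has_task_head(architectures, task_suffix: str) -> bool:
    """True if any listed architecture class name ends with the task suffix."""
    return any(name.endswith(task_suffix) for name in (architectures or []))


def checkpoint_has_task_head(model_name: str, task_suffix: str) -> bool:
    """Fetch the checkpoint's config and apply the same check to it."""
    # Imported lazily so has_task_head can be used without transformers.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(model_name)
    return has_task_head(config.architectures, task_suffix)


# Illustrative inputs (assumed): an MLM checkpoint versus a QA fine-tuned one.
print(has_task_head(["BertForMaskedLM"], "ForQuestionAnswering"))
print(has_task_head(["RobertaForQuestionAnswering"], "ForQuestionAnswering"))
print(has_task_head(None, "ForQuestionAnswering"))
```

The MISSING entries in the load reports remain the more direct signal: if the task head is listed there, it was newly initialized regardless of what the config says.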

