# Unit 1 Benchmark: The Model Benchmark Challenge

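This notebook runs the base checkpoints of **BERT**, **RoBERTa**, and **BART** through three pipeline tasks (text generation, masked language modeling, and question answering) to show how architecture and pretraining objective determine what each model can do without fine-tuning.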

In [1]:
# Install dependencies (run once)
!pip -q install transformers torch

In [2]:
# Imports
from transformers import pipeline



## Models to Test
- **BERT**: `bert-base-uncased` (Encoder-only)
- **RoBERTa**: `roberta-base` (Encoder-only)
- **BART**: `facebook/bart-base` (Encoder-Decoder)

In [3]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

## Experiment 1: Text Generation
**Prompt**: `"The future of Artificial Intelligence is"`

In [4]:
prompt = "The future of Artificial Intelligence is"
for name, model_id in models.items():
    print(f"\n=== {name} ===")
    try:
        generator = pipeline("text-generation", model=model_id)
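        # max_length counts the prompt tokens as well; the warning in the output below
        # shows that a default max_new_tokens (=256) took precedence in this run.
        out = generator(prompt, max_length=30, num_return_sequences=1)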
        print(out[0]["generated_text"])
    except Exception as e:
        print(f"Failed: {e}")


=== BERT ===


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is................................................................................................................................................................................................................................................................

=== RoBERTa ===


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is

=== BART ===


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is humble humble humble UID tion batAnimal Dress sitavid Jill Cup Front minced mincedorc impoverished western Notice sit sitIntegInteg measure measure impoverishedrates impoverished impoverished Lore Noticeavidvon impoverishediotic mixture massivelyinduced driveinducedinducedinducedDVDencia Patient Grab impoverishedumption)- burner superficial changing Instant Patientenciaenciaencia334umption signs Allaah)-encia404 circuitsenciaencia Patient Instant Nephenciaencia impoverished camer Grain Allaahenciaencia Avalon Principles Instant577DVDvon CLioticvon measure Koran KoranUFCumptionioticenciavon impoverishedvon Allaah Athe Allaah Allaah measureerver Koran Enterpriseumption Koran massively Koran inscribedprivate changing reinforced restructuring reinforced reinforced reinforced Avalon changing Koranratesipt reinforced)- Continue Koran circuitsencia Avalon driverates Protestants reinforced impoverished404 Koranvon Avalon reinforced)-encia pleasures Aval
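
The three outputs above fail in different ways (a run of periods, an unchanged prompt, and a stream of unrelated tokens), which is easy to miss when skimming raw logs. The following is a minimal sketch of a helper that labels each output. It assumes the pipeline echoes the prompt at the start of `generated_text`, as it does in the outputs above. `describe_generation` and its three labels are hypothetical names chosen for illustration, not part of `transformers`.

```python
def describe_generation(prompt: str, generated: str) -> str:
    """Roughly label what a text-generation call added to the prompt.

    Assumption: the generated text starts with the prompt. If it does not,
    the whole string is treated as the continuation.
    """
    if generated.startswith(prompt):
        continuation = generated[len(prompt):]
    else:
        continuation = generated

    stripped = continuation.strip()
    if not stripped:
        # Nothing was appended (the RoBERTa case above).
        return "no new text"
    if len(set(stripped)) == 1:
        # One character repeated over and over (the BERT case above).
        return "degenerate repetition"
    # Something else was appended; this says nothing about its quality.
    return "new text"


prompt = "The future of Artificial Intelligence is"
print(describe_generation(prompt, prompt))
print(describe_generation(prompt, prompt + "........"))
```

The label is only a coarse filter: BART's output above would be labeled `new text` even though it is incoherent, so it still has to be read by hand.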

## Experiment 2: Masked Language Modeling
**Prompt**: `"The goal of Generative AI is to [MASK] new content."`

In [5]:
masked_sentence = "The goal of Generative AI is to [MASK] new content."
for name, model_id in models.items():
    print(f"\n=== {name} ===")
    try:
        filler = pipeline("fill-mask", model=model_id)
        preds = filler(masked_sentence)
        for p in preds[:3]:
            print(f"{p['token_str']}: {p['score']:.2f}")
    except Exception as e:
        print(f"Failed: {e}")


=== BERT ===


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


create: 0.54
generate: 0.16
produce: 0.05

=== RoBERTa ===


Device set to use cpu


Failed: No mask_token (<mask>) found on the input

=== BART ===


Device set to use cpu


Failed: No mask_token (<mask>) found on the input
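
The two failures above come from the prompt rather than from the models: the error message shows that the RoBERTa and BART tokenizers expect `<mask>`, while the prompt contains BERT's `[MASK]`. A minimal sketch of one way to make the prompt model-agnostic is to read the mask token from each pipeline's tokenizer. `with_model_mask` and `run_fill_mask` are hypothetical helper names, and the `transformers` import is deferred so the string helper can be used on its own.

```python
def with_model_mask(template: str, mask_token: str, placeholder: str = "[MASK]") -> str:
    """Swap a generic placeholder for the mask token a given tokenizer expects."""
    if placeholder not in template:
        raise ValueError(f"placeholder {placeholder!r} not found in template")
    return template.replace(placeholder, mask_token)


def run_fill_mask(model_id: str, template: str, top_k: int = 3):
    """Run fill-mask with the model's own mask token (downloads the model)."""
    from transformers import pipeline  # deferred: only needed for real runs

    filler = pipeline("fill-mask", model=model_id)
    text = with_model_mask(template, filler.tokenizer.mask_token)
    # For a single mask, the pipeline returns a list of prediction dicts.
    return [(p["token_str"], p["score"]) for p in filler(text, top_k=top_k)]


template = "The goal of Generative AI is to [MASK] new content."
print(with_model_mask(template, "<mask>"))
```

Calling `run_fill_mask(model_id, template)` inside the loop over `models` would give every model a prompt in its own format, so any remaining differences reflect the models rather than the input string.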


## Experiment 3: Question Answering
**Question**: `"What are the risks?"`
**Context**: `"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."`

In [6]:
question = "What are the risks?"
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
for name, model_id in models.items():
    print(f"\n=== {name} ===")
    try:
        qa = pipeline("question-answering", model=model_id)
        res = qa(question=question, context=context)
        print(f"Answer: {res['answer']} (score={res['score']:.2f})")
    except Exception as e:
        print(f"Failed: {e}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



=== BERT ===


Device set to use cpu
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: hallucinations, bias, and deepfakes (score=0.00)

=== RoBERTa ===


Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: hallucinations, bias, and deepfakes (score=0.01)

=== BART ===


Device set to use cpu


Answer: Generative AI poses (score=0.03)
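
The warnings above report that `qa_outputs.weight` and `qa_outputs.bias` were newly initialized for all three models, which is why the scores stay near zero. A minimal sketch of a programmatic check for this follows. It assumes `from_pretrained(..., output_loading_info=True)` returns a `(model, info)` pair whose `info["missing_keys"]` lists the weights absent from the checkpoint, as described in the `transformers` documentation; this should be verified against the installed version. `untrained_head_keys` and `check_qa_checkpoint` are hypothetical helper names.

```python
def untrained_head_keys(missing_keys, head_prefixes=("qa_outputs",)):
    """Return the missing weights that belong to the task head."""
    prefixes = tuple(head_prefixes)
    return [key for key in missing_keys if key.startswith(prefixes)]


def check_qa_checkpoint(model_id: str):
    """List QA-head weights a checkpoint does not provide (downloads the model)."""
    from transformers import AutoModelForQuestionAnswering  # deferred import

    _, info = AutoModelForQuestionAnswering.from_pretrained(
        model_id, output_loading_info=True
    )
    return untrained_head_keys(info["missing_keys"])


# Keys copied from the warnings printed above.
print(untrained_head_keys(["qa_outputs.bias", "qa_outputs.weight"]))
```

A non-empty result means the head is randomly initialized, so any answer span, including one that happens to look correct, should be treated as arbitrary until the model is fine-tuned on a QA dataset.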


## Observation Table

The table below summarizes what each run above actually produced:

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
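| **Generation** | BERT | Failure | Appended a long run of periods to the prompt. | Encoder-only and pretrained with masked language modeling, not next-token prediction. The log also notes that `BertLMHeadModel` needs `is_decoder=True` for standalone use. |
|  | RoBERTa | Failure | Returned the prompt unchanged, with no new text. | Encoder-only; pretrained with MLM, not autoregressive generation. |
|  | BART | Failure | Produced a long string of unrelated tokens ("humble humble humble UID ..."). | The pipeline loaded `BartForCausalLM`, and the warning shows `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized, so the output comes from untrained weights. The encoder-decoder design can generate, but `bart-base` is not set up for open-ended generation through this task. |
| **Fill-Mask** | BERT | Success | Predicted "create" (0.54), "generate" (0.16), "produce" (0.05). | BERT is pretrained with Masked Language Modeling (MLM). |
|  | RoBERTa | Failure | Pipeline raised `No mask_token (<mask>) found on the input`. | Not architectural: RoBERTa's tokenizer expects `<mask>`, but the prompt uses BERT's `[MASK]`. RoBERTa is pretrained with MLM, so it is expected to succeed once the prompt uses `<mask>`. |
|  | BART | Failure | Pipeline raised `No mask_token (<mask>) found on the input`. | Same prompt issue: BART's tokenizer also expects `<mask>`. BART is a denoising autoencoder that supports mask infilling rather than pure MLM. |
| **QA** | BERT | Failure | Returned "hallucinations, bias, and deepfakes" with score 0.00. | The `qa_outputs` head is newly initialized (base model not fine-tuned on a QA dataset such as SQuAD), so the span is effectively arbitrary even though it looks right. |
|  | RoBERTa | Failure | Returned "hallucinations, bias, and deepfakes" with score 0.01. | Same as BERT: the `qa_outputs` head is newly initialized. |
|  | BART | Failure | Returned "Generative AI poses" with score 0.03. | Same cause: base model not fine-tuned for extractive QA, so the `qa_outputs` head is newly initialized. |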