# Unit 1 – Model Benchmark Challenge

**Objective:**  
Compare BERT, RoBERTa, and BART by running each model on tasks it was not architecturally designed or trained for, and record what happens.

In [1]:
from transformers import pipeline

## Experiment 1: Text Generation

**Prompt:**  
"The future of Artificial Intelligence is"

**Hypothesis:**  
Encoder-only models like BERT and RoBERTa will fail because they are not designed to generate the next token.  
BART, being an encoder-decoder model, should perform better.
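
The architectural difference behind this hypothesis can be illustrated with attention masks. The following is a minimal sketch in plain Python, assuming the usual textbook description: an encoder lets every token attend to every other token, while a decoder lets each token attend only to itself and earlier tokens. The function names `bidirectional_mask` and `causal_mask` are illustrative and are not part of the `transformers` API.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Illustrative word-level tokens; real tokenizers split text differently.
tokens = ["The", "future", "of", "AI", "is"]

for label, mask in [
    ("bidirectional (encoder)", bidirectional_mask(len(tokens))),
    ("causal (decoder)", causal_mask(len(tokens))),
]:
    print(label)
    for token, row in zip(tokens, mask):
        print(f"  {token:>7}: {row}")
```

Next-token training relies on the causal pattern, where each position is predicted without seeing what follows. A model pretrained only with the bidirectional pattern has never been asked to continue a sequence, which is the intuition the hypothesis rests on.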

In [2]:
prompt = "The future of Artificial Intelligence is"

models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

for name, model in models.items():
    print(f"\n{name} Output:")
    try:
        generator = pipeline("text-generation", model=model)
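        # Note: max_length counts the prompt tokens as well. In the recorded run the
        # pipeline logged that its default max_new_tokens (=256) took precedence
        # over max_length=30, so the outputs below are longer than 30 tokens.
        output = generator(prompt, max_length=30)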
        print(output[0]["generated_text"])
    except Exception as e:
        print("Error:", e)


BERT Output:


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is................................................................................................................................................................................................................................................................

RoBERTa Output:



If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cpu


The future of Artificial Intelligence is

BART Output:



Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cpu


The future of Artificial Intelligence isonial WaltonUFF Damcorn stolenCheckRGB grad lava lava Walton Walton WaltonUFF stolen Waltonきankedcorn lavaedience grad Alert181 Walton lavaAd Guam reapp lava elsewhere Waltonlisedienceedience animosity lavaedience incor Againstjavaedience reapp reappedience piedienceedienceetts reappedienceediencejava 331edience fearing animosity inputsjavaediencejavaedience Againstedience hungry animosity Nero Neroedience animosity Against Against Box NeroedienceBottom std Short Againstjava NeroRGB inputs inputs Nero reappediencejava econom Againstき Nero tonguesedience inputs animosityjava tongues Against animosity serpent 331 econom Nero Vanderbiltjavajavareek Nero Against Against econom� Against tongues tongues Nero Nero Nero scoreboard Nero grad Nero]=]=edience Island econom 331 inputs serpent Nero Feeling Against Nero Against Nero tongues serpentsol Nero Feeling tongues Nero]= Nero econom Nero Against]= Nero Nero Against tonguesgil Nero Nero serpent std mani

## Experiment 2: Masked Language Modeling

**Input Sentence:**  
"The goal of Generative AI is to [MASK] new content."

**Hypothesis:**  
BERT and RoBERTa should succeed because they are trained using Masked Language Modeling.

In [3]:
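# Note: "[MASK]" is BERT's mask token. The RoBERTa and BART tokenizers use "<mask>",
# so passing this sentence unchanged to those models raises the error shown below.
sentence = "The goal of Generative AI is to [MASK] new content."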

for name, model in models.items():
    print(f"\n{name} Output:")
    try:
        fill = pipeline("fill-mask", model=model)
        results = fill(sentence)
        for r in results[:3]:
            print(r["token_str"])
    except Exception as e:
        print("Error:", e)


BERT Output:


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


create
generate
produce

RoBERTa Output:


Device set to use cpu


Error: No mask_token (<mask>) found on the input

BART Output:


Device set to use cpu


Error: No mask_token (<mask>) found on the input
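
The two errors above come from the input string rather than from the models: the sentence hard-codes BERT's `[MASK]` token, while the RoBERTa and BART tokenizers expect `<mask>`. A minimal sketch of one way to make the sentence model-agnostic follows. The helper `with_model_mask` is hypothetical (it is not part of `transformers`), and this variant was not part of the recorded run, so no model output is claimed for it.

```python
PLACEHOLDER = "[MASK]"  # placeholder used in the notebook's input sentence


def with_model_mask(sentence, mask_token, placeholder=PLACEHOLDER):
    """Return `sentence` with the placeholder replaced by a model's mask token."""
    if placeholder not in sentence:
        raise ValueError(f"placeholder {placeholder!r} not found in sentence")
    return sentence.replace(placeholder, mask_token)


sentence = "The goal of Generative AI is to [MASK] new content."

# "<mask>" is the mask token of the roberta-base and facebook/bart-base tokenizers.
print(with_model_mask(sentence, "<mask>"))
# prints: The goal of Generative AI is to <mask> new content.
```

Inside the experiment loop, the mask token can be read from the pipeline itself, for example `fill(with_model_mask(sentence, fill.tokenizer.mask_token))`, so the same sentence works for all three checkpoints.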


## Experiment 3: Question Answering

**Context:**  
"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

**Question:**  
"What are the risks?"

**Observation Note:**  
Base checkpoints that have not been fine-tuned on QA load with a newly initialized answer head (`qa_outputs`), so their answers may be weak or effectively random.
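
To see why an untrained QA head gives arbitrary answers, it helps to look at how an extractive QA head typically picks its answer: it scores every token as a possible start and a possible end, then returns the highest-scoring valid span. The sketch below is a simplified illustration of that selection step, not the actual `question-answering` pipeline code (which adds probability normalization and other handling). The function name `best_span` and the `max_answer_len` default are assumptions made for this example.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most `max_answer_len` tokens are considered.
    """
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up logits for a three-token context, for illustration only.
print(best_span([0.1, 2.0, 0.3], [0.2, 0.1, 1.5]))  # prints: (1, 2)
```

When the start and end scores come from randomly initialized weights, the selected span depends on that random initialization rather than on the question, so a plausible-looking answer from a base checkpoint is not evidence of QA ability.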

In [4]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for name, model in models.items():
    print(f"\n{name} Output:")
    try:
        qa = pipeline("question-answering", model=model)
        result = qa(question=question, context=context)
        print(result["answer"])
    except Exception as e:
        print("Error:", e)


BERT Output:


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


hallucinations, bias, and deepfakes

RoBERTa Output:


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


risks such as hallucinations, bias, and deepfakes

BART Output:


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Generative AI poses


## Observation Table

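| Task | Model | Outcome | Observation | Reason |
|----|------|----------------------------------|-------------|----------------------|
| Generation | BERT | Failure | Returned the prompt followed by a long run of periods | Encoder-only model pretrained with masked language modeling, not next-token prediction; the pipeline warned that `BertLMHeadModel` needs `is_decoder=True` |
| Generation | RoBERTa | Failure | Returned the prompt with no new text | Encoder-only model; the same `is_decoder=True` warning appeared for `RobertaLMHeadModel` |
| Generation | BART | Failure | Appended a long string of incoherent, repetitive tokens | The `text-generation` pipeline loaded `BartForCausalLM` with `lm_head.weight` and `model.decoder.embed_tokens.weight` newly initialized, so this run does not reflect BART's pretrained encoder-decoder generation |
| Fill-Mask | BERT | Success | Predicted "create", "generate", "produce" | Pretrained with masked language modeling; `[MASK]` is its mask token |
| Fill-Mask | RoBERTa | Failure (input error) | `No mask_token (<mask>) found on the input` | The sentence used BERT's `[MASK]`; RoBERTa's tokenizer expects `<mask>`, so the model itself was never exercised |
| Fill-Mask | BART | Failure (input error) | `No mask_token (<mask>) found on the input` | Same input issue; BART's tokenizer also expects `<mask>` |
| QA | BERT | Partial Success | Returned "hallucinations, bias, and deepfakes" | `qa_outputs` weights were newly initialized (no QA fine-tuning), so the plausible span is not reliable |
| QA | RoBERTa | Partial Success | Returned "risks such as hallucinations, bias, and deepfakes" | `qa_outputs` weights were newly initialized (no QA fine-tuning) |
| QA | BART | Failure | Returned "Generative AI poses", which does not answer the question | `qa_outputs` weights were newly initialized (no QA fine-tuning) |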