# Unit 1 Assignment: The Model Benchmark Challenge

**Objective:** Evaluate architectural differences between BERT, RoBERTa, and BART by testing them on tasks they may not be designed for.

**Models to Test:**
1. BERT (`bert-base-uncased`) - Encoder-only
2. RoBERTa (`roberta-base`) - Encoder-only (BERT architecture, optimized pretraining)
3. BART (`facebook/bart-base`) - Encoder-Decoder

## Setup: Install and Import Libraries

In [1]:
!pip install transformers torch -q

In [15]:
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


---
## Experiment 1: Text Generation

**Task:** Generate text using the prompt: `"The future of Artificial Intelligence is"`

**Hypothesis:** BERT and RoBERTa will fail because they are encoder-only models pretrained to encode text, not to generate it token by token. BART might do better since it has a decoder, but it probably won't produce good text without task-specific training.
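To make the hypothesis a bit more concrete, here is a minimal sketch of the attention patterns involved. It is only an illustration, not how the libraries implement attention: encoder-style models let every token attend to every other token, while a decoder used for generation sees only the tokens to its left. The word-level token list and the helper names `bidirectional_mask` and `causal_mask` are assumptions made for this example.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Hypothetical word-level tokens; real tokenizers split text differently.
tokens = ["The", "future", "of", "AI", "is"]
n = len(tokens)

for name, mask in (("bidirectional", bidirectional_mask(n)),
                   ("causal", causal_mask(n))):
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>6}: {row}")
```

A model pretrained only with the bidirectional pattern has never had to predict the next token from left context alone, which is what a text-generation loop asks for.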

In [16]:
prompt = "The future of Artificial Intelligence is"
print(f"Prompt: '{prompt}'\n")

Prompt: 'The future of Artificial Intelligence is'



### Test 1.1: BERT

In [17]:
print("\n Testing BERT for Text Generation...")
try:
    bert_generator = pipeline('text-generation', model='bert-base-uncased')
    result = bert_generator(prompt, max_length=20, num_return_sequences=1)
    print(f"Result: {result[0]['generated_text']}")
except Exception as e:
    print(f"Error: {str(e)}")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`



 Testing BERT for Text Generation...


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is................................................................................................................................................................................................................................................................
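The warning in the output above says that both `max_new_tokens` (=256) and `max_length` (=20) were set and that `max_new_tokens` takes precedence, which is probably why the result runs far past 20 tokens. A possible way to avoid the conflict is to pass `max_new_tokens` explicitly instead of `max_length`. The sketch below was not run as part of this notebook, and the names `GEN_KWARGS` and `generate` are hypothetical.

```python
# Explicit generation settings: max_new_tokens replaces max_length so the
# requested limit does not compete with the pipeline's default value.
GEN_KWARGS = {"max_new_tokens": 20, "num_return_sequences": 1}


def generate(model_name, prompt, gen_kwargs=None):
    """Run the text-generation pipeline for one model and return its text."""
    # Imported lazily so that defining this helper does not load transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    result = generator(prompt, **(gen_kwargs or GEN_KWARGS))
    return result[0]["generated_text"]
```

Calling `generate('bert-base-uncased', prompt)` should then cap the output at 20 new tokens. It would not make the text any more coherent, since the model is still being used outside its pretraining objective.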


### Test 1.2: RoBERTa

In [18]:
print("\nTesting RoBERTa for Text Generation...")
try:
    roberta_generator = pipeline('text-generation', model='roberta-base')
    result = roberta_generator(prompt, max_length=20, num_return_sequences=1)
    print(f"Result: {result[0]['generated_text']}")
except Exception as e:
    print(f"Error: {str(e)}")

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`



Testing RoBERTa for Text Generation...


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is


### Test 1.3: BART

In [19]:
print("\nTesting BART for Text Generation...")
try:
    bart_generator = pipeline('text-generation', model='facebook/bart-base')
    result = bart_generator(prompt, max_length=20, num_return_sequences=1)
    print(f"Result: {result[0]['generated_text']}")
except Exception as e:
    print(f" Error: {str(e)}")


Testing BART for Text Generation...


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is mythology ampUBAntiAntiAntiExt83jandroprisonprisonprison UNESCOprisonExtExt doctrinesListprisonprison rallying definitelyExt Deputy Deputy extracted extracted insultsjerprisonExtCVprisonExtprison concedesprisonprison ruininglaws Motorsport Motorsport necessprisonCSExtExt ampprisonprison taxedprison necessExtExtprison extracted extracted accomplishedprison doctrinesExt accomplished necess necessthatprison doctrines padded necesshodExt AsExt Deputyhodprisonprison doctrinesprisonGROUP glorious As nervprison AsExt Customs conjectureprison extracted Customs Ancients CustomsERSON extracted extracted extracted conjectureision Conc distribution necess necess fractures Customs CustomsAnti Customs Customs sock doctrines vulnerability extracted extracted Customs Customs Customs accomplished Customseneg Customs Customsclair Customs Ancientseneg oslaws vulnerability vulnerability As Customs necessenegERSON extractedclairprison CustomsAnti eventual vu

---
## Experiment 2: Masked Language Modeling (Fill-Mask)

**Task:** Predict the missing word in: `"The goal of Generative AI is to [MASK] new content."` (RoBERTa and BART use `<mask>` in place of `[MASK]`.)

**Hypothesis:** BERT and RoBERTa should do very well here, since predicting masked words is exactly what they were pretrained on. BART might work but probably won't be as good.

### Test 2.1: BERT

In [20]:
print("\nTesting BERT for Fill-Mask...")
try:
    bert_mask = pipeline('fill-mask', model='bert-base-uncased')
    sentence = "The goal of Generative AI is to [MASK] new content."
    result = bert_mask(sentence)
    print(f"Top 3 Predictions:")
    for i, pred in enumerate(result[:3], 1):
        print(f"   {i}. {pred['token_str']} (score: {pred['score']:.4f})")
except Exception as e:
    print(f" Error: {str(e)}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Testing BERT for Fill-Mask...


Device set to use cpu


Top 3 Predictions:
   1. create (score: 0.5397)
   2. generate (score: 0.1558)
   3. produce (score: 0.0541)


### Test 2.2: RoBERTa

In [21]:
print("\nTesting RoBERTa for Fill-Mask...")
try:
    roberta_mask = pipeline('fill-mask', model='roberta-base')
    # RoBERTa uses <mask> instead of [MASK]
    sentence = "The goal of Generative AI is to <mask> new content."
    result = roberta_mask(sentence)
    print(f"Top 3 Predictions:")
    for i, pred in enumerate(result[:3], 1):
        print(f"   {i}. {pred['token_str']} (score: {pred['score']:.4f})")
except Exception as e:
    print(f" Error: {str(e)}")


Testing RoBERTa for Fill-Mask...


Device set to use cpu


Top 3 Predictions:
   1.  generate (score: 0.3711)
   2.  create (score: 0.3677)
   3.  discover (score: 0.0835)
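Since the mask token differs between models, one option is to build the test sentence from a template rather than typing it by hand for each model. The sketch below hard-codes the mask tokens used in this notebook. With a loaded pipeline the same value should be available as `pipe.tokenizer.mask_token`. The names `MASK_TOKENS`, `TEMPLATE`, and `masked_sentence` are hypothetical.

```python
# Mask tokens as used in this notebook's fill-mask cells.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}

TEMPLATE = "The goal of Generative AI is to {mask} new content."


def masked_sentence(model_name, template=TEMPLATE):
    """Fill the template with the mask token that the given model expects."""
    return template.format(mask=MASK_TOKENS[model_name])


for name in MASK_TOKENS:
    print(f"{name}: {masked_sentence(name)}")
```

This keeps the three fill-mask tests identical apart from the mask token, so differences in the predictions come from the models and not from the input.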


### Test 2.3: BART

In [22]:
print("\nTesting BART for Fill-Mask...")
try:
    bart_mask = pipeline('fill-mask', model='facebook/bart-base')
    # BART uses <mask> token
    sentence = "The goal of Generative AI is to <mask> new content."
    result = bart_mask(sentence)
    print(f"Top 3 Predictions:")
    for i, pred in enumerate(result[:3], 1):
        print(f"   {i}. {pred['token_str']} (score: {pred['score']:.4f})")
except Exception as e:
    print(f" Error: {str(e)}")


Testing BART for Fill-Mask...


Device set to use cpu


Top 3 Predictions:
   1.  create (score: 0.0746)
   2.  help (score: 0.0657)
   3.  provide (score: 0.0609)
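To compare the three fill-mask results side by side, the recorded scores can be summed into a "top-3 probability mass" per model. This is only a rough summary of one sentence, not a benchmark. The scores are copied from the outputs above, and the helper name `top_k_mass` is hypothetical.

```python
# Top-3 predictions and scores copied from the three fill-mask outputs above.
recorded = {
    "BERT": [("create", 0.5397), ("generate", 0.1558), ("produce", 0.0541)],
    "RoBERTa": [("generate", 0.3711), ("create", 0.3677), ("discover", 0.0835)],
    "BART": [("create", 0.0746), ("help", 0.0657), ("provide", 0.0609)],
}


def top_k_mass(preds):
    """Sum the probabilities of the listed predictions."""
    return sum(score for _, score in preds)


for model, preds in recorded.items():
    best_token, best_score = preds[0]
    print(f"{model:8s} top-1: {best_token} ({best_score:.4f})  "
          f"top-3 mass: {top_k_mass(preds):.4f}")
```

BERT and RoBERTa put roughly 75% and 82% of their probability on the top three words, while BART puts only about 20% there, so BART spreads its probability over many more candidates.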


---
## Experiment 3: Question Answering

**Task:** Answer the question `"What are the risks?"` based on the context:
`"Generative AI poses significant risks such as hallucinations, bias, and deepfakes."`

**Hypothesis:** All three base models will probably struggle, since none of them is fine-tuned for QA.

In [23]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

print(f"Context: {context}")
print(f"Question: {question}\n")

Context: Generative AI poses significant risks such as hallucinations, bias, and deepfakes.
Question: What are the risks?



### Test 3.1: BERT

In [24]:
print("\n Testing BERT for Question Answering...")
try:
    bert_qa = pipeline('question-answering', model='bert-base-uncased')
    result = bert_qa(question=question, context=context)
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.4f}")
except Exception as e:
    print(f" Error: {str(e)}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



 Testing BERT for Question Answering...


Device set to use cpu


Answer: deepfakes.
Confidence: 0.0112


### Test 3.2: RoBERTa

In [25]:
print("\nTesting RoBERTa for Question Answering...")
try:
    roberta_qa = pipeline('question-answering', model='roberta-base')
    result = roberta_qa(question=question, context=context)
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.4f}")

except Exception as e:
    print(f" Error: {str(e)}")

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Testing RoBERTa for Question Answering...


Device set to use cpu


Answer: deepfakes
Confidence: 0.0119


### Test 3.3: BART

In [26]:
print("\n Testing BART for Question Answering...")
try:
    bart_qa = pipeline('question-answering', model='facebook/bart-base')
    result = bart_qa(question=question, context=context)
    print(f" Answer: {result['answer']}")
    print(f"   Confidence: {result['score']:.4f}")

except Exception as e:
    print(f" Error: {str(e)}")


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



 Testing BART for Question Answering...


Device set to use cpu


 Answer: deepfakes.
   Confidence: 0.0286
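The very low confidence values are what one would expect from an untrained answer head. Extractive QA models score each token as a possible answer start and end. The sketch below assumes the span score is the product of the start and end probabilities, which is how I understand the pipeline's `score`. The logits are made up for illustration, and `softmax` and `best_span` are hypothetical helpers.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end, score) of the best span with start <= end."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best


n = 12  # hypothetical number of context tokens

# Untrained head: made-up logits with no preference for any position.
print("flat logits:  ", best_span([0.0] * n, [0.0] * n))

# Trained-looking head: made-up logits that peak at start=7 and end=11.
start = [0.0] * n
end = [0.0] * n
start[7] = 6.0
end[11] = 6.0
print("peaked logits:", best_span(start, end))
```

With flat logits every span gets the same tiny score, so the chosen answer is essentially arbitrary. A fine-tuned head would produce peaked logits and a score close to 1.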


---
## Summary & Observation Table

Based on the experiments above, fill in your observations below:

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
|------|-------|----------------------------------|---------------------------------------|---------------------------------------------|
| **Generation** | BERT | Failure | Generated repeated dots/periods instead of coherent text | BERT is an Encoder-only model trained for understanding (MLM), not for generating sequential text |
| | RoBERTa | Failure | Simply repeated the input prompt without generating new text | RoBERTa is an Encoder-only model (optimized BERT), not designed for sequential text generation |
| | BART | Failure | Generated completely incoherent text dominated by repeated tokens ('prison', 'Ext', 'Customs', 'extracted', etc.) | BART is pretrained as a seq2seq denoiser, not as a standalone causal LM. Loaded as `BartForCausalLM`, its `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized, resulting in nonsense output |
| **Fill-Mask** | BERT | Success | Predicted 'create' (53.97%), 'generate' (15.58%), 'produce' (5.41%) - all semantically correct | BERT was trained specifically on Masked Language Modeling (MLM), making this its core strength |
| | RoBERTa | Success | Predicted 'generate' (37.11%), 'create' (36.77%), 'discover' (8.35%) - all semantically plausible, with the top two nearly tied | RoBERTa keeps BERT's MLM objective but trains longer on more data, with dynamic masking and without next-sentence prediction; one sentence is too little to say which model is better calibrated |
| | BART | Partial Success | Predicted 'create' (7.46%), 'help' (6.57%), 'provide' (6.09%) - much lower confidence than BERT/RoBERTa, and 'help' and 'provide' fit the sentence less well | BART's denoising pretraining includes filling in masked spans, but it is optimized for reconstructing whole sequences in seq2seq fashion, not for single-token MLM; less specialized for fill-mask |
| **QA** | BERT | Failure | Extracted only 'deepfakes.' (1.12%) with very low confidence; incomplete answer missing 'hallucinations' and 'bias' | Base BERT not fine-tuned for QA; randomly initialized QA head lacks specialized training for complete answer extraction |
| | RoBERTa | Failure | Extracted only 'deepfakes' (1.19%) with very low confidence; incomplete answer, same as BERT | Base RoBERTa not fine-tuned for QA; despite better representations, randomly initialized QA head performs similarly poorly |
| | BART | Failure | Extracted only 'deepfakes.' (2.86%) with very low confidence; incomplete answer, though the highest score among the three models | Base BART is not fine-tuned for extractive QA; its QA head is randomly initialized, so the slightly higher score is not evidence of better QA ability |