NAME: ANANYAA SREE SP

SRN: PES2UG23CS066

SECTION: A

# Unit 1 Assignment: Model Benchmark Challenge

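This notebook compares BERT, RoBERTa, and BART on three tasks (text generation, masked language modeling, and question answering) and relates each result to the model's architecture.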

In [6]:
!pip install transformers torch









In [7]:
from transformers import pipeline

## Experiment 1: Text Generation

In [8]:

text_gen_models = [
    "bert-base-uncased",
    "roberta-base",
    "facebook/bart-base"
]

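for model in text_gen_models:
    print(f"Model: {model}")
    try:
        # Forcing the text-generation task loads each checkpoint with a causal LM head.
        gen = pipeline("text-generation", model=model)
        # Note: per the logged warning, the default max_new_tokens (256) takes
        # precedence over max_length=30, so outputs can run past 30 tokens.
        print(gen("The future of Artificial Intelligence is", max_length=30))
    except Exception as e:
        print("Error:", e)
    print("-"*50)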


Model: bert-base-uncased


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]
--------------------------------------------------
Model: roberta-base


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu


[{'generated_text': 'The future of Artificial Intelligence is'}]
--------------------------------------------------
Model: facebook/bart-base


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


[{'generated_text': 'The future of Artificial Intelligence is Style demonstrates Style Style Style convictionsthemeatively UTC Jobsativelyalks GN Styleativelythememouth Jobsiddlesiddlesiddles900900 deals Style Styleiddlesiddles struiddles presidents camel rebuilding rebuildingiddlesiddlesmouth Style fo Styleκ demonstrates ordained conveyed replicaitting Forumitting Style StyleHost StyleProfileProfile demonstratesiddles HM HM GUN GUN GUN HMjaniddlesiddlesittingiddles Forum GUNitting footholdiddles replicaiddlesiddlesProfileiddles GUNiddles GeneticHostiddles55Host GUNiddlesiddles GUN assailiddlesiddles foolishiddlesiddles Styleiddles GUN replica GUN fo GUN HM rebuilding GUN GUNiddles footholdiddles GUN Style GUN GUNHost Periodiddlesiddles55iddles presidents HMProfileiddlesiddles Period presidentsiddlesiddles308 GUN Arriiddlesiddles Reliefiddles GUN GUN Style Recovery593Hostiddlesiddles reinvent GUN really reallyiddlesiddles ventilationiddlesiddlesPlPliddlesiddles593iddlesiddles Voldemort

## Experiment 2: Masked Language Modeling

In [9]:

fill_mask_models = [
    "bert-base-uncased",
    "roberta-base",
    "facebook/bart-base"
]

sentences = {
    "bert-base-uncased": "The goal of Generative AI is to [MASK] new content.",
    "roberta-base": "The goal of Generative AI is to <mask> new content.",
    "facebook/bart-base": "The goal of Generative AI is to <mask> new content."
}

for model in fill_mask_models:
    print(f"Model: {model}")
    fm = pipeline("fill-mask", model=model)
    print(fm(sentences[model]))
    print("-"*50)


Model: bert-base-uncased


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.539692759513855, 'token': 3443, 'token_str': 'create', 'sequence': 'the goal of generative ai is to create new content.'}, {'score': 0.15575766563415527, 'token': 9699, 'token_str': 'generate', 'sequence': 'the goal of generative ai is to generate new content.'}, {'score': 0.05405496060848236, 'token': 3965, 'token_str': 'produce', 'sequence': 'the goal of generative ai is to produce new content.'}, {'score': 0.044515229761600494, 'token': 4503, 'token_str': 'develop', 'sequence': 'the goal of generative ai is to develop new content.'}, {'score': 0.017577484250068665, 'token': 5587, 'token_str': 'add', 'sequence': 'the goal of generative ai is to add new content.'}]
--------------------------------------------------
Model: roberta-base


Device set to use cpu


[{'score': 0.37113118171691895, 'token': 5368, 'token_str': ' generate', 'sequence': 'The goal of Generative AI is to generate new content.'}, {'score': 0.3677138090133667, 'token': 1045, 'token_str': ' create', 'sequence': 'The goal of Generative AI is to create new content.'}, {'score': 0.08351466804742813, 'token': 8286, 'token_str': ' discover', 'sequence': 'The goal of Generative AI is to discover new content.'}, {'score': 0.02133519947528839, 'token': 465, 'token_str': ' find', 'sequence': 'The goal of Generative AI is to find new content.'}, {'score': 0.01652175933122635, 'token': 694, 'token_str': ' provide', 'sequence': 'The goal of Generative AI is to provide new content.'}]
--------------------------------------------------
Model: facebook/bart-base


Device set to use cpu


[{'score': 0.07461544126272202, 'token': 1045, 'token_str': ' create', 'sequence': 'The goal of Generative AI is to create new content.'}, {'score': 0.06571853160858154, 'token': 244, 'token_str': ' help', 'sequence': 'The goal of Generative AI is to help new content.'}, {'score': 0.060880184173583984, 'token': 694, 'token_str': ' provide', 'sequence': 'The goal of Generative AI is to provide new content.'}, {'score': 0.035935722291469574, 'token': 3155, 'token_str': ' enable', 'sequence': 'The goal of Generative AI is to enable new content.'}, {'score': 0.03319481760263443, 'token': 1477, 'token_str': ' improve', 'sequence': 'The goal of Generative AI is to improve new content.'}]
--------------------------------------------------
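
The three prompts in this experiment differ only in the mask placeholder: `[MASK]` for BERT and `<mask>` for RoBERTa and BART. A minimal sketch of building them from a single template is shown below. The `MASK_TOKENS` table, `TEMPLATE`, and the `build_prompt` helper are hypothetical names introduced for illustration; the token strings are copied from the prompts used above.

```python
# Mask placeholders as used in the prompts of this experiment.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}

TEMPLATE = "The goal of Generative AI is to {mask} new content."


def build_prompt(model_name: str, template: str = TEMPLATE) -> str:
    """Fill the template with the mask token expected by the given model."""
    return template.format(mask=MASK_TOKENS[model_name])


prompts = {name: build_prompt(name) for name in MASK_TOKENS}
for name, prompt in prompts.items():
    print(f"{name}: {prompt}")
```

When a pipeline is already loaded, reading the placeholder from its tokenizer's `mask_token` attribute should avoid the hard-coded table.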


## Experiment 3: Question Answering

In [10]:

qa_models = [
    "bert-base-uncased",
    "roberta-base",
    "facebook/bart-base"
]

context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for model in qa_models:
    print(f"Model: {model}")
    qa = pipeline("question-answering", model=model)
    print(qa(question=question, context=context))
    print("-"*50)


Model: bert-base-uncased


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.013356235111132264, 'start': 46, 'end': 60, 'answer': 'hallucinations'}
--------------------------------------------------
Model: roberta-base


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.00838578911498189, 'start': 68, 'end': 81, 'answer': 'and deepfakes'}
--------------------------------------------------
Model: facebook/bart-base


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.0733608528971672, 'start': 0, 'end': 81, 'answer': 'Generative AI poses significant risks such as hallucinations, bias, and deepfakes'}
--------------------------------------------------


| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| --- | --- | --- | --- | --- |
| Generation | BERT | Failure | Appended a long run of dots instead of meaningful text. | Encoder-only model pretrained with masked language modeling, not next-token prediction. |
| Generation | RoBERTa | Failure | Returned the prompt unchanged, with no continuation. | Encoder-only architecture with no autoregressive decoder. |
| Generation | BART | Success | Generated a long continuation, but the output was incoherent. | Encoder–decoder architecture supports generation; the log shows `lm_head.weight` and the decoder embeddings were newly initialized when loaded as `BartForCausalLM`, which explains the incoherent text. |
| Fill-Mask | BERT | Success | Top predictions were “create” (0.54) and “generate” (0.16). | Pretrained with masked language modeling. |
| Fill-Mask | RoBERTa | Success | Top predictions were “generate” (0.37) and “create” (0.37). | Encoder pretrained with an optimized MLM objective. |
| Fill-Mask | BART | Partial Success | Top prediction “create” was reasonable but scored only 0.07; candidates such as “help” fit poorly. | Pretrained as a sequence-to-sequence denoiser; single-token mask prediction is not its primary objective. |
| QA | BERT | Partial Success | Extracted “hallucinations”, one of the three risks, with a score of 0.013. | QA head (`qa_outputs`) is newly initialized, not fine-tuned for question answering. |
| QA | RoBERTa | Failure | Returned the fragment “and deepfakes” with a score of 0.008. | QA head is newly initialized; lacks task-specific QA fine-tuning. |
| QA | BART | Failure | Returned almost the entire context sentence instead of a focused span (score 0.073). | QA head is newly initialized; requires fine-tuning for extractive QA. |


## Observations

- Encoder-only models fail at text generation.
- BERT and RoBERTa excel at masked language modeling.
- All base models perform poorly at QA without fine-tuning.

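#### Experiment 1

BERT and RoBERTa fail to generate meaningful text. Both run without crashing, but BERT appends only a run of dots and RoBERTa returns the prompt unchanged, because encoder-only models are pretrained to fill in masked tokens using bidirectional context rather than to predict the next token autoregressively. BART produces a long continuation because its encoder–decoder architecture includes an autoregressive decoder, which shows that generation is architecturally possible. The output is incoherent, however: the log reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized when the checkpoint was loaded as `BartForCausalLM`, so the model needs task-specific training before its generations are usable.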
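
The difference between bidirectional and autoregressive attention can be sketched with plain attention masks. This is an illustrative toy, not the code inside any of the three models: `bidirectional_mask` and `causal_mask` are hypothetical helpers, and a `1` at row i, column j means position i may attend to position j.

```python
def bidirectional_mask(n: int) -> list[list[int]]:
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> list[list[int]]:
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "future", "of", "AI", "is"]
n = len(tokens)

print("Bidirectional (encoder-style):")
for row in bidirectional_mask(n):
    print(row)

print("Causal (decoder-style):")
for row in causal_mask(n):
    print(row)
```

Under the causal mask each position sees only its left context, which is what lets a decoder extend a prompt one token at a time. BERT and RoBERTa were pretrained with the bidirectional pattern, so they have not learned that left-to-right prediction.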

#### Experiment 2

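BERT and RoBERTa perform well at masked language modeling. BERT ranks “create” first with a score of 0.54, and RoBERTa ranks “generate” and “create” almost equally at about 0.37 each. This aligns with their training objective, which explicitly involves predicting masked tokens using bidirectional context. BART also ranks “create” first, but with a score of only 0.07, and its other candidates (such as “help”) fit the sentence less well. A likely reason is that BART is pretrained as a sequence-to-sequence denoiser, so single-token mask prediction is not its primary training task.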
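
The confidence gap can be made concrete by summarizing the scores recorded in the fill-mask output. The sketch below copies those scores (rounded to four decimals) into a hypothetical `recorded_scores` dictionary and reports the top-1 score and the probability mass covered by the five listed predictions. It only re-reads the notebook's output and does not rerun any model.

```python
# Top-5 fill-mask scores copied from the notebook output, rounded to 4 decimals.
recorded_scores = {
    "bert-base-uncased": [0.5397, 0.1558, 0.0541, 0.0445, 0.0176],
    "roberta-base": [0.3711, 0.3677, 0.0835, 0.0213, 0.0165],
    "facebook/bart-base": [0.0746, 0.0657, 0.0609, 0.0359, 0.0332],
}


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (top-1 score, total probability mass of the listed predictions)."""
    return max(scores), sum(scores)


summary = {name: summarize(scores) for name, scores in recorded_scores.items()}
for name, (top1, mass) in summary.items():
    print(f"{name:20s} top-1={top1:.2f}  top-5 mass={mass:.2f}")
```

BERT and RoBERTa place over 80% of the probability mass on their five listed candidates, whereas BART places under 30%, so BART's distribution over the masked position is much flatter.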

#### Experiment 3

None of the three base models answers the question reliably. BERT extracts “hallucinations”, which is only one of the three listed risks, with a score of 0.013. RoBERTa returns the fragment “and deepfakes” (0.008), and BART returns almost the entire context sentence (0.073) instead of a focused span. This behavior is expected: the logs show that the `qa_outputs` weights of every model were newly initialized, so the answer heads have not been fine-tuned on a question answering dataset such as SQuAD. Task-specific fine-tuning is therefore critical for reliable QA performance.
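
The recorded `start` and `end` offsets can be checked against the context directly. The sketch below slices the context with the offsets from the QA output and counts how many of the three listed risks each span contains. `RISKS`, `recorded_spans`, and `describe_span` are hypothetical names, and the offsets and rounded scores are copied from the notebook output rather than recomputed.

```python
context = (
    "Generative AI poses significant risks such as "
    "hallucinations, bias, and deepfakes."
)
RISKS = ["hallucinations", "bias", "deepfakes"]

# (start, end, score) copied from the question-answering output.
recorded_spans = {
    "bert-base-uncased": (46, 60, 0.0134),
    "roberta-base": (68, 81, 0.0084),
    "facebook/bart-base": (0, 81, 0.0734),
}


def describe_span(start: int, end: int) -> tuple[str, int, float]:
    """Return the span text, the number of risks it names, and its share of the context."""
    answer = context[start:end]
    covered = sum(risk in answer for risk in RISKS)
    return answer, covered, (end - start) / len(context)


results = {name: describe_span(s, e) for name, (s, e, _) in recorded_spans.items()}
for name, (answer, covered, fraction) in results.items():
    print(f"{name}: {covered}/{len(RISKS)} risks, "
          f"{fraction:.0%} of context -> {answer!r}")
```

BART's span names all three risks only because it covers nearly the whole sentence, so its higher score should not be read as better extraction.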

