# Unit 1 – Model Benchmark Challenge

**Name:** Neha Patil  
**SRN:** PES2UG23CS379  
**SEC:** F

## Objective
To evaluate how model architecture affects task performance by forcing BERT, RoBERTa, and BART to perform tasks they may not be designed for.


In [8]:
!pip install -q transformers torch

In [9]:
from transformers import pipeline

## Models Used

In [10]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}
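
The Observation Table at the end attributes each result to architecture. As a rough reference, the sketch below records the architecture family and pretraining objective usually documented for each of the three checkpoints. `ARCHITECTURES` and `has_decoder` are illustrative names introduced here, not part of `transformers`.

```python
# Illustrative lookup; these names are hypothetical helpers, not library API.
ARCHITECTURES = {
    "BERT": {
        "family": "encoder-only",
        "pretraining": "masked language modeling + next sentence prediction",
    },
    "RoBERTa": {
        "family": "encoder-only",
        "pretraining": "masked language modeling",
    },
    "BART": {
        "family": "encoder-decoder",
        "pretraining": "denoising (text infilling, sentence permutation)",
    },
}


def has_decoder(name: str) -> bool:
    """Return True when the model family includes an autoregressive decoder."""
    return ARCHITECTURES[name]["family"] in {"encoder-decoder", "decoder-only"}


for name, info in ARCHITECTURES.items():
    print(f"{name}: {info['family']}, decoder={has_decoder(name)}")
```

Only BART has a decoder, which is why it is the one candidate here that could plausibly generate text, provided it is loaded with a head that matches its checkpoint.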

## Experiment 1: Text Generation

In [11]:
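prompt = "Generative AI is a revolutionary technology that"

# Run the same text-generation pipeline on every checkpoint, including the
# encoder-only ones, and report failures instead of stopping the loop.
for name, model in models.items():
    print(f"\n{name} ({model})")
    try:
        generator = pipeline("text-generation", model=model, truncation=True)
        # Sampling is enabled, so the continuation differs from run to run.
        output = generator(prompt, max_new_tokens=40, do_sample=True)
        print(output[0]["generated_text"])
    except Exception as e:
        print("Generation failed:", e)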

BERT (bert-base-uncased)

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu

Generative AI is a revolutionary technology that........................................

RoBERTa (roberta-base)

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu

Generative AI is a revolutionary technology that.

BART (facebook/bart-base)


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Generative AI is a revolutionary technology thatagged249andAcross cry aqu Knightdevdevdevodox tunnel braces braces Favor Bar crykees Salvador Bar249 braces artif cryRNA Jets cry braces braces249Issue Baruranceuranceurance Barurance Bar
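
The BART warning above reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized when the checkpoint was loaded as `BartForCausalLM`, which is consistent with the unreadable continuation. One possible alternative, sketched below and not run as part of this notebook, is to load the same checkpoint with `BartForConditionalGeneration` and ask it to fill a `<mask>` appended to the prompt. The helper names `build_infill_prompt` and `bart_infill` are hypothetical, and since `facebook/bart-base` is not fine-tuned for open-ended continuation, the quality of the result is not guaranteed.

```python
def build_infill_prompt(prefix: str, mask_token: str = "<mask>") -> str:
    """Append a mask token so a denoising model has a span to fill in."""
    return f"{prefix.rstrip()} {mask_token}"


def bart_infill(prefix: str, model_name: str = "facebook/bart-base",
                max_new_tokens: int = 40) -> str:
    """Generate with BART's encoder-decoder head (downloads the checkpoint)."""
    # Imported lazily so build_infill_prompt works without transformers installed.
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(build_infill_prompt(prefix), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


# Building the prompt needs no model download.
print(build_infill_prompt("Generative AI is a revolutionary technology that"))
```

Calling `bart_infill(prompt)` would then run the model itself. This is an assumption about what a fairer test of BART might look like, not a result from this experiment.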


## Experiment 2: Fill-Mask

In [12]:
# Each tokenizer has its own mask token: BERT uses [MASK], while RoBERTa and
# BART use <mask>. Using the wrong token makes the pipeline raise an error.
sentences = {
    "BERT": "The goal of Generative AI is to [MASK] new content.",
    "RoBERTa": "The goal of Generative AI is to <mask> new content.",
    "BART": "The goal of Generative AI is to <mask> new content."
}

for name, model in models.items():
    print(f"\n{name} ({model})")
    try:
        fill = pipeline("fill-mask", model=model)
        results = fill(sentences[name])
        # Show the three highest-scoring candidates for the masked position.
        for r in results[:3]:
            print(f"{r['token_str']} → score: {r['score']:.4f}")
    except Exception as e:
        print("Fill-mask failed:", e)


BERT (bert-base-uncased)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


create → score: 0.5397
generate → score: 0.1558
produce → score: 0.0541

RoBERTa (roberta-base)


Device set to use cpu


 generate → score: 0.3711
 create → score: 0.3677
 discover → score: 0.0835

BART (facebook/bart-base)


Device set to use cpu


 create → score: 0.0746
 help → score: 0.0657
 provide → score: 0.0609
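
To compare the three models on more than the top word, it can help to look at how much probability each one places on its top three candidates. The sketch below only reuses the scores printed above; `recorded` and `summarize` are illustrative names.

```python
# Top-3 fill-mask results copied from the run above: (token, score).
recorded = {
    "BERT": [("create", 0.5397), ("generate", 0.1558), ("produce", 0.0541)],
    "RoBERTa": [("generate", 0.3711), ("create", 0.3677), ("discover", 0.0835)],
    "BART": [("create", 0.0746), ("help", 0.0657), ("provide", 0.0609)],
}


def summarize(results):
    """Return the top-1 score and the combined score of the listed candidates."""
    scores = [score for _, score in results]
    return max(scores), sum(scores)


summary = {name: summarize(results) for name, results in recorded.items()}
for name, (top1, top3) in summary.items():
    print(f"{name}: top-1 = {top1:.4f}, top-3 mass = {top3:.4f}")
```

BERT and RoBERTa put roughly 0.75 and 0.82 of their probability on three sensible verbs, while BART puts only about 0.20 there, so its predictions are plausible but far less confident.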


## Experiment 3: Question Answering

In [13]:
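from transformers import pipeline

# Repeated from the earlier cells so this cell can be rerun on its own.
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

# NOTE: `question` and `context` must already be defined before this cell runs;
# their definitions are not included in this export.
for name, model in models.items():
    print(f"\n{name} ({model})")
    try:
        qa = pipeline(
            "question-answering",
            model=model,
            # Allow the pipeline to return an empty answer when no span fits.
            handle_impossible_answer=True
        )
        result = qa(
            question=question,
            context=context
        )
        print("Answer:", result["answer"])
        print("Score:", result["score"])
    except Exception as e:
        print("QA failed:", e)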


BERT (bert-base-uncased)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Answer: Generative
Score: 0.0090933449100703

RoBERTa (roberta-base)


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Answer: Generative AI poses significant risks such
Score: 0.007348828483372927

BART (facebook/bart-base)


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Answer: such
Score: 0.012322017922997475
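
All three scores above are close to 0.01, and every model warns that its `qa_outputs` weights were newly initialized. A simple way to avoid over-reading such answers is to filter them by score, as sketched below. The cut-off `MIN_SCORE = 0.5` is a hypothetical value chosen for illustration, not a library default, and `reliable_answers` is an illustrative helper.

```python
# Answers and scores copied from the run above (scores rounded to 4 decimals).
qa_results = {
    "BERT": {"answer": "Generative", "score": 0.0091},
    "RoBERTa": {"answer": "Generative AI poses significant risks such", "score": 0.0073},
    "BART": {"answer": "such", "score": 0.0123},
}

MIN_SCORE = 0.5  # hypothetical cut-off, chosen for illustration only


def reliable_answers(results, min_score=MIN_SCORE):
    """Keep only the answers whose pipeline score reaches the threshold."""
    return {name: r for name, r in results.items() if r["score"] >= min_score}


kept = reliable_answers(qa_results)
print(f"{len(kept)} of {len(qa_results)} answers pass the {MIN_SCORE} threshold")
```

None of the three answers passes this threshold, which fits the warnings that the QA heads were never trained.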


## Observation Table

| Task | Model | Classification | Observation | Architectural Reason |
|------|-------|---------------|-------------|----------------------|
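| Text Generation | BERT | Failure | Emits only a run of periods after the prompt | Encoder-only model pretrained with MLM, not left-to-right generation |
|  | RoBERTa | Failure | Emits a single period and adds no content | Encoder-only model pretrained with MLM, not left-to-right generation |
|  | BART | Failure | Unreadable string of unrelated tokens after the prompt | Encoder–Decoder architecture, but the pipeline loads it as `BartForCausalLM` with a newly initialized `lm_head`, so the pretrained weights are not fully used |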
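| Fill-Mask | BERT | Success | Sensible verbs; top prediction "create" at 0.54 | Pretrained with MLM |
|  | RoBERTa | Success | Sensible verbs; "generate" and "create" nearly tied at 0.37 | Pretrained with MLM only (dynamic masking, no NSP) |
|  | BART | Partial | Plausible top word "create" but low confidence (0.07) | Pretrained by denoising (text infilling) rather than single-token MLM |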
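| QA | BERT | Partial | Single-word span "Generative", score about 0.009 | QA head (`qa_outputs`) newly initialized; not QA fine-tuned |
|  | RoBERTa | Partial | Truncated span ending in "such", score about 0.007 | QA head (`qa_outputs`) newly initialized; not QA fine-tuned |
|  | BART | Partial | Single-word span "such", score about 0.012 | QA head (`qa_outputs`) newly initialized; not QA fine-tuned |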
