# Task
Compare the performance of BERT, RoBERTa, and BART models across text generation, masked language modeling, and question answering tasks, and summarize the observations.

Name: Bhavana Ramkumar

SRN: PES2UG23CS905

Section: B

In [1]:
pip install transformers



In [2]:
from transformers import pipeline
print("pipeline function imported successfully.")

pipeline function imported successfully.


**Experiment 1**


In [3]:
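from transformers import pipeline

# --------------------------------------------------
# Experiment 1: Text Generation
# --------------------------------------------------

prompt = "The future of Artificial Intelligence is"

# Base checkpoints only; none of them is fine-tuned for causal generation.
models = [
    "bert-base-uncased",
    "roberta-base",
    "facebook/bart-base"
]

for model_name in models:
    print(f"\n--- {model_name} ---")
    try:
        # max_new_tokens caps the continuation at 50 tokens beyond the prompt.
        generator = pipeline(
            "text-generation",
            model=model_name,
            tokenizer=model_name,
            max_new_tokens=50
        )
        output = generator(prompt, num_return_sequences=1)
        # generated_text contains the prompt followed by the continuation.
        print("Generated Text:", output[0]["generated_text"])
    except Exception as e:
        # Report the failure and continue with the next model.
        print("Generation failed due to:", e)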



--- bert-base-uncased ---


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cuda:0


Generated Text: The future of Artificial Intelligence is..................................................

--- roberta-base ---


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cuda:0


Generated Text: The future of Artificial Intelligence is

--- facebook/bart-base ---


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cuda:0


Generated Text: The future of Artificial Intelligence is screws Rory Rory Diaz Kimberly Diaz iT abolition abolition abolition Diaz Diaz Sly Sly disp moduleshov constructirlwind env 1901irlwindirlwindUL Rory Rory abolition modulesirlwindirlwind healsirlwindirlwindDatesole Rory 1901 1901 Sly Sly Appro Rory 1901Origin Rory Rorysecondary 1901 env
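One rough way to compare the three outputs above is to strip the prompt from each result and measure how much new text was added and how varied it is. The helper below is a minimal sketch: `continuation_stats` is a hypothetical name, whitespace tokenization is an assumption, and the BART string is only the first six tokens of the output printed above.

```python
def continuation_stats(prompt: str, generated: str) -> dict:
    """Summarize the text a model added after the prompt."""
    # The text-generation pipeline returns the prompt followed by the continuation.
    continuation = generated[len(prompt):] if generated.startswith(prompt) else generated
    tokens = continuation.split()
    distinct_ratio = len(set(tokens)) / len(tokens) if tokens else 0.0
    return {
        "new_chars": len(continuation),
        "new_tokens": len(tokens),
        "distinct_ratio": round(distinct_ratio, 3),
    }


prompt = "The future of Artificial Intelligence is"

# RoBERTa returned the prompt unchanged.
print(continuation_stats(prompt, prompt))

# Excerpt: first six tokens of the BART continuation.
bart_excerpt = prompt + " screws Rory Rory Diaz Kimberly Diaz"
print(continuation_stats(prompt, bart_excerpt))
```

A zero token count or a low distinct ratio flags degenerate output. A high ratio does not imply coherence, though: the BART text is varied yet meaningless, so the numbers only supplement reading the output.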


**Experiment 2**

In [4]:
from transformers import pipeline

# --------------------------------------------------
# Experiment 2: Masked Language Modeling
# --------------------------------------------------

sentence_bert = "The goal of Generative AI is to [MASK] new content."
sentence_roberta = "The goal of Generative AI is to <mask> new content."
sentence_bart = "The goal of Generative AI is to <mask> new content."

# 1. BERT
print("\n--- BERT (bert-base-uncased) | Fill-Mask ---")
bert_fill = pipeline(
    "fill-mask",
    model="bert-base-uncased",
    tokenizer="bert-base-uncased"
)

bert_output = bert_fill(sentence_bert)
for pred in bert_output[:3]:
    print(f"Prediction: {pred['token_str']} | Score: {pred['score']:.4f}")


# 2. RoBERTa
print("\n--- RoBERTa (roberta-base) | Fill-Mask ---")
roberta_fill = pipeline(
    "fill-mask",
    model="roberta-base",
    tokenizer="roberta-base"
)

roberta_output = roberta_fill(sentence_roberta)
for pred in roberta_output[:3]:
    print(f"Prediction: {pred['token_str']} | Score: {pred['score']:.4f}")

# 3. BART
print("\n--- BART (facebook/bart-base) | Fill-Mask ---")
bart_fill = pipeline(
    "fill-mask",
    model="facebook/bart-base",
    tokenizer="facebook/bart-base"
)

bart_output = bart_fill(sentence_bart)
for pred in bart_output[:3]:
    print(f"Prediction: {pred['token_str']} | Score: {pred['score']:.4f}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



--- BERT (bert-base-uncased) | Fill-Mask ---


Device set to use cuda:0


Prediction: create | Score: 0.5397
Prediction: generate | Score: 0.1558
Prediction: produce | Score: 0.0541

--- RoBERTa (roberta-base) | Fill-Mask ---


Device set to use cuda:0


Prediction:  generate | Score: 0.3711
Prediction:  create | Score: 0.3677
Prediction:  discover | Score: 0.0835

--- BART (facebook/bart-base) | Fill-Mask ---


Device set to use cuda:0


Prediction:  create | Score: 0.0746
Prediction:  help | Score: 0.0657
Prediction:  provide | Score: 0.0609
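The fill-mask scores are probabilities over the vocabulary at the masked position, so summing the top three shows how much probability mass each model concentrates on its best guesses. The sketch below uses only the numbers printed above; the dictionary layout and the name `top3_mass` are illustrative.

```python
# Top-3 fill-mask predictions copied from the output above.
predictions = {
    "bert-base-uncased": [("create", 0.5397), ("generate", 0.1558), ("produce", 0.0541)],
    "roberta-base": [("generate", 0.3711), ("create", 0.3677), ("discover", 0.0835)],
    "facebook/bart-base": [("create", 0.0746), ("help", 0.0657), ("provide", 0.0609)],
}


def top3_mass(preds):
    """Sum the scores of the listed predictions, rounded to 4 decimals."""
    return round(sum(score for _, score in preds), 4)


for model_name, preds in predictions.items():
    best_token, best_score = preds[0]
    print(
        f"{model_name}: top-1 = {best_token} ({best_score:.4f}), "
        f"top-3 mass = {top3_mass(preds):.4f}"
    )
```

BERT and RoBERTa place roughly 75% and 82% of their probability on three tokens, while BART places about 20%, which is the sense in which BART's predictions are less confident.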


**Experiment 3**

In [5]:
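from transformers import pipeline

# --------------------------------------------------
# Experiment 3: Question Answering
# --------------------------------------------------

context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

# Base checkpoints only; their QA heads are newly initialized (see the warnings in the output).
models = [
    "bert-base-uncased",
    "roberta-base",
    "facebook/bart-base"
]

for model_name in models:
    print(f"\n--- {model_name} ---")
    try:
        qa_pipeline = pipeline(
            "question-answering",
            model=model_name,
            tokenizer=model_name
        )
        # The pipeline returns the extracted span and its score.
        result = qa_pipeline(
            question=question,
            context=context
        )
        print("Answer:", result["answer"])
        print("Score:", round(result["score"], 4))
    except Exception as e:
        # Report the failure and continue with the next model.
        print("QA failed due to:", e)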


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- bert-base-uncased ---


Device set to use cuda:0
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: significant risks such as hallucinations, bias, and deepfakes
Score: 0.0099

--- roberta-base ---


Device set to use cuda:0
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: ,
Score: 0.0079

--- facebook/bart-base ---


Device set to use cuda:0


Answer: such as
Score: 0.0142
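An extractive QA head scores every token as a possible answer start and as a possible answer end, and the reported score is derived from those two distributions. The sketch below is a simplified illustration, not the pipeline's actual code: it assumes the score is the product of the softmax start and end probabilities for the best span with start <= end, and it uses random logits to stand in for an untrained `qa_outputs` layer. `best_span`, `max_len`, and the toy token list are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits to probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end, score) maximizing p_start * p_end with start <= end."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0, 0, 0.0)
    for s, ps in enumerate(p_start):
        for e in range(s, min(s + max_len, len(p_end))):
            score = ps * p_end[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = "Generative AI poses significant risks such as hallucinations , bias , and deepfakes .".split()

# Random logits stand in for a randomly initialized QA head (assumption).
rng = random.Random(0)
start_logits = [rng.gauss(0, 1) for _ in tokens]
end_logits = [rng.gauss(0, 1) for _ in tokens]

start, end, score = best_span(start_logits, end_logits)
print("Span:", " ".join(tokens[start:end + 1]))
print("Score:", round(score, 4))
```

With untrained logits the selected span is arbitrary and no span receives a high probability, which matches the fragments and near-zero scores above. A head fine-tuned for QA would instead produce logits that peak at the answer boundaries.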


**Observation Table**

| Experiment | Task                     | Model                     | Classification (Success / Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| ---------- | ------------------------ | ------------------------- | ---------------------------------- | ------------------------------------- | ------------------------------------------- |
| **Exp 1**  | Text Generation          | BERT (bert-base-uncased)  | Failure                            | Generated only a run of repeated dots after the prompt; no meaningful continuation. | BERT is an **encoder-only** model trained with masked language modeling, not autoregressive next-token prediction. The pipeline also warned that `BertLMHeadModel` needs `is_decoder=True` for standalone use. |
| **Exp 1**  | Text Generation          | RoBERTa (roberta-base)    | Failure                            | Returned the prompt unchanged; no new text was generated. | RoBERTa is also **encoder-only** and pretrained for understanding tasks, not left-to-right generation. |
| **Exp 1**  | Text Generation          | BART (facebook/bart-base) | Failure                            | Generated a long continuation, but it was incoherent and repetitive (random tokens such as *Rory*, *Diaz*, *1901*). | BART has a decoder and supports generation, but loading it as `BartForCausalLM` left `lm_head.weight` and `model.decoder.embed_tokens.weight` **randomly initialized**, so the output is noise. |
| **Exp 2**  | Masked Language Modeling | BERT (bert-base-uncased)  | Success                            | Top predictions were *create* (0.5397), *generate* (0.1558), and *produce* (0.0541), all sensible in context. | BERT is pretrained with **Masked Language Modeling (MLM)** using bidirectional context. |
| **Exp 2**  | Masked Language Modeling | RoBERTa (roberta-base)    | Success                            | Predicted *generate* (0.3711) and *create* (0.3677) almost equally, followed by *discover* (0.0835). | RoBERTa keeps the MLM objective but trains on more data and removes the NSP objective. |
| **Exp 2**  | Masked Language Modeling | BART (facebook/bart-base) | Partial Success                    | Predicted reasonable words (*create*, *help*, *provide*) but with much **lower confidence scores** (top score 0.0746) than BERT/RoBERTa. | BART is pretrained as a **denoising sequence-to-sequence** model, not as a pure token-level MLM like the encoder-only models. |
| **Exp 3**  | Question Answering       | BERT (bert-base-uncased)  | Partial Success                    | Extracted a span that contains the answer (*significant risks such as hallucinations, bias, and deepfakes*) but with a **very low score** (0.0099). | Base BERT is **not fine-tuned on a QA dataset such as SQuAD**; the `qa_outputs` head is randomly initialized, so the match is probably coincidental. |
| **Exp 3**  | Question Answering       | RoBERTa (roberta-base)    | Failure                            | Returned only a comma (`,`) with a score of 0.0079. | The `qa_outputs` layer is randomly initialized and has had no task-specific fine-tuning. |
| **Exp 3**  | Question Answering       | BART (facebook/bart-base) | Failure                            | Returned the fragment *such as* with a score of 0.0142; none of the listed risks were extracted. | `BartForQuestionAnswering` was loaded with a randomly initialized `qa_outputs` head; without **QA-specific fine-tuning** the predicted span is essentially arbitrary. |
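
The architectural reasons in the table largely come down to which positions a token is allowed to attend to. As a minimal illustration in pure Python (the function names are hypothetical and padding is ignored), an encoder such as BERT or RoBERTa uses a full bidirectional mask, while an autoregressive decoder uses a lower-triangular causal mask so that position i sees only positions up to i.

```python
def bidirectional_mask(n: int) -> list[list[int]]:
    """Every position may attend to every other position (encoder-style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> list[list[int]]:
    """Position i may attend only to positions 0..i (decoder-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for name, mask in [("bidirectional", bidirectional_mask(4)), ("causal", causal_mask(4))]:
    print(name)
    for row in mask:
        print(" ", row)
```

Masked language modeling relies on the first pattern, since the model sees context on both sides of the mask, whereas left-to-right generation relies on the second. BART combines a bidirectional encoder with a causal decoder.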
