# Task
Compare the performance of BERT, RoBERTa, and BART models on text generation, masked language modeling, and question answering tasks, and summarize the observations in a markdown table.

In [None]:
# 1. Setup & Imports
# !pip install transformers
from transformers import pipeline
import textwrap

# Global models list
models = ["bert-base-uncased", "roberta-base", "facebook/bart-base"]

def print_header(title):
    print(f"\n{'='*80}")
    print(f"{title.center(80)}")
    print(f"{'='*80}")



## Experiment 1: Text Generation

This experiment tests how different architectures handle generating new text.

In [None]:
print_header("EXPERIMENT 1: TEXT GENERATION")
prompt_gen = "The future of Artificial Intelligence is"

for m in models:
    print(f"\n[MODEL]: {m}")
    try:
        # Note: max_length=50 is passed below, but the pipeline's default
        # max_new_tokens (256) takes precedence, as the warning in the output shows
        gen = pipeline("text-generation", model=m, device=-1)
        output = gen(prompt_gen, max_length=50, num_return_sequences=1, truncation=True)
        text = output[0]['generated_text']

        # Wrapping text for readability
        wrapped_text = textwrap.fill(text, width=80)
        print(f"OUTPUT:\n{wrapped_text}")
    except Exception as e:
        print(f"STATUS: Failed/Limited Support - {str(e)[:70]}...")


                         EXPERIMENT 1: TEXT GENERATION                          

[MODEL]: bert-base-uncased


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


OUTPUT:
The future of Artificial Intelligence is........................................
................................................................................
................................................................................
........................................................

[MODEL]: roberta-base


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


OUTPUT:
The future of Artificial Intelligence is

[MODEL]: facebook/bart-base


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


OUTPUT:
The future of Artificial Intelligence is exposed heats artist January January
Devon cured Abdul bothered bothered Spoon freshmanirainning Aad January
Aadiraira January January Januarykgicipated cured cured unworthyuggle tacos
consult tradition teenager reunion Januaryicipated January
AadaliationFinishserving Booster Januaryaliation Tit Investigatoruggle infuri
guiding guiding guiding relate Mongolia Spoon guiding JanuaryTu
relatealiationservingolit guiding guiding cured guiding guiding
accuratelyserving infuriird guiding guidingserving guidinghers guiding asteroids
guiding guiding January guiding guiding literature relate guiding guiding
couragealiation Tit guidingaliation guiding Tit guiding guiding
hairstaliationaliationaliation millennials guiding accuratelyaliation guiding
guidingird guiding asteroids infuri infuri unavoid guiding Mongolia guiding
courage guiding Tit Still guiding Mongoliaaliation62 Tit guidingdLHi guiding
guidingBeckserving guiding accurately accurately ap
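
One way to picture why the encoder-only checkpoints struggle in this experiment is to compare attention masks. The sketch below is illustrative only: `causal_mask` and `bidirectional_mask` are hypothetical helper names, not part of `transformers`. A decoder trained for generation lets position i attend only to positions up to i, whereas BERT and RoBERTa were pretrained with every position attending to every other position, so they never learned to predict a token from its left context alone.

```python
def causal_mask(n):
    """Lower-triangular mask: token i may attend to tokens 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full mask: every token may attend to every other token."""
    return [[1] * n for _ in range(n)]


if __name__ == "__main__":
    # Print both masks for a 4-token sequence to compare their shapes
    for name, mask in [("causal", causal_mask(4)),
                       ("bidirectional", bidirectional_mask(4))]:
        print(name)
        for row in mask:
            print(" ".join(str(v) for v in row))
```

The causal mask is what next-token prediction relies on; the full mask is what masked language modeling relies on.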

## Experiment 2: Masked Language Modeling (MLM)

This task uses the native training objective of Encoder models to predict a hidden word.

In [None]:
# --- EXPERIMENT 2: FILL-MASK (FIXED) ---
print_header("EXPERIMENT 2: FILL-MASK")
# We use a base template instead of a hardcoded string
prompt_template = "The goal of Generative AI is to {} new content."

for m in models:
    print(f"\n[MODEL]: {m}")
    try:
        # Load the pipeline
        filler = pipeline("fill-mask", model=m, device=-1)

        # FIX: Dynamically get the specific mask token for this model (e.g., [MASK] or <mask>)
        mask_token = filler.tokenizer.mask_token
        current_prompt = prompt_template.format(mask_token)

        results = filler(current_prompt)

        print(f"Using Mask Token: {mask_token}")
        print(f"{'Rank':<5} | {'Predicted Token':<15} | {'Score'}")
        print("-" * 35)
        for i, res in enumerate(results[:3]):
            print(f"{i+1:<5} | {res['token_str'].strip():<15} | {res['score']:.4f}")

    except Exception as e:
        print(f"STATUS: Error - {str(e)[:100]}...")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



                            EXPERIMENT 2: FILL-MASK                             

[MODEL]: bert-base-uncased


Device set to use cpu


Using Mask Token: [MASK]
Rank  | Predicted Token | Score
-----------------------------------
1     | create          | 0.5397
2     | generate        | 0.1558
3     | produce         | 0.0541

[MODEL]: roberta-base


Device set to use cpu


Using Mask Token: <mask>
Rank  | Predicted Token | Score
-----------------------------------
1     | generate        | 0.3711
2     | create          | 0.3677
3     | discover        | 0.0835

[MODEL]: facebook/bart-base


Device set to use cpu


Using Mask Token: <mask>
Rank  | Predicted Token | Score
-----------------------------------
1     | create          | 0.0746
2     | help            | 0.0657
3     | provide         | 0.0609
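
The scores above can be compared more directly by looking at how much probability each model puts on its top three candidates and how far apart the first two are. This is a small sketch that uses only the numbers recorded in the output above; `summarize` is a hypothetical helper name.

```python
# Top-3 fill-mask scores copied from the recorded output above
top3_scores = {
    "bert-base-uncased": [0.5397, 0.1558, 0.0541],
    "roberta-base": [0.3711, 0.3677, 0.0835],
    "facebook/bart-base": [0.0746, 0.0657, 0.0609],
}


def summarize(scores):
    """Return (top-3 probability mass, margin between rank 1 and rank 2)."""
    return sum(scores), scores[0] - scores[1]


summary = {name: summarize(s) for name, s in top3_scores.items()}

for name, (mass, margin) in summary.items():
    print(f"{name:<20} top-3 mass={mass:.4f}  rank1-rank2 margin={margin:.4f}")
```

On these numbers, BERT and RoBERTa place roughly 75% and 82% of their probability on the top three candidates, while BART's top three cover only about 20%. RoBERTa's first two candidates are separated by just 0.0034.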


## Experiment 3: Question Answering

This experiment tests extractive question answering. These are "base" checkpoints that were not fine-tuned on a QA dataset such as SQuAD, so the QA head is randomly initialized and the answers are expected to be poor or essentially random.

In [None]:
print_header("EXPERIMENT 3: QUESTION ANSWERING")
qa_context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
qa_question = "What are the risks?"

for m in models:
    print(f"\n[MODEL]: {m}")
    try:
        qa = pipeline("question-answering", model=m, device=-1)
        res = qa(question=qa_question, context=qa_context)
        print(f"ANSWER: '{res['answer']}'")
        print(f"CONFIDENCE SCORE: {res['score']:.4f}")
    except Exception as e:
        # Report the actual error rather than assuming a missing QA head
        print(f"STATUS: Error - {str(e)[:100]}...")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



                        EXPERIMENT 3: QUESTION ANSWERING                        

[MODEL]: bert-base-uncased


Device set to use cpu


ANSWER: 'hallucinations, bias,'
CONFIDENCE SCORE: 0.0170

[MODEL]: roberta-base


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ANSWER: 'deepfakes.'
CONFIDENCE SCORE: 0.0127

[MODEL]: facebook/bart-base


Device set to use cpu


ANSWER: 'Generative AI poses significant'
CONFIDENCE SCORE: 0.0640
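
The very low confidence scores are what an uninformative QA head would be expected to produce. The sketch below assumes the pipeline scores a span as the product of softmaxed start and end probabilities; that is an assumption about its behaviour, not something this notebook verifies. `softmax` and `best_span_score` are hypothetical helpers, and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span_score(start_logits, end_logits):
    """Highest p_start[i] * p_end[j] over spans with i <= j."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    return max(
        p_start[i] * p_end[j]
        for i in range(len(p_start))
        for j in range(i, len(p_end))
    )


# An uninformative head: identical logits for each of 10 tokens
n_tokens = 10
flat = [0.0] * n_tokens
print(f"uninformative head: {best_span_score(flat, flat):.4f}")

# A confident head: one start and one end position clearly stand out
peaked_start = [8.0 if i == 2 else 0.0 for i in range(n_tokens)]
peaked_end = [8.0 if i == 4 else 0.0 for i in range(n_tokens)]
print(f"confident head:     {best_span_score(peaked_start, peaked_end):.4f}")
```

With flat logits over 10 tokens the best score is 1/10 × 1/10 = 0.01, the same order of magnitude as the 0.0127 to 0.0640 scores recorded above. A head that had actually learned to locate answers would score close to 1.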


### Deliverable: Observation Table


| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | *Failure* | *Appended only a long run of periods to the prompt.* | *BERT is an encoder-only model pretrained with masked language modeling and bidirectional attention; it was never trained to predict the next token (the log notes that `is_decoder=True` is needed for standalone use).* |
| | RoBERTa | *Failure* | *Generated no new text; returned only the input prompt.* | *RoBERTa is also an encoder-only model, so it is not designed to predict the next token.* |
| | BART | *Failure* | *Generated text, but it was meaningless (unrelated words such as 'January' and 'guiding' repeated).* | *BART is an encoder-decoder model and can generate text, but the `text-generation` pipeline loaded it as `BartForCausalLM` with `lm_head.weight` and `model.decoder.embed_tokens.weight` newly initialized (see the warning), so its output layer was effectively untrained.* |
| **Fill-Mask** | BERT | *Success* | *Predicted 'create', 'generate', 'produce'; highest top-1 confidence (0.5397).* | *BERT is pretrained on masked language modeling (MLM), so this is its native objective.* |
| | RoBERTa | *Success* | *Predicted 'generate', 'create', 'discover'; the two top synonyms scored almost equally (0.3711 vs. 0.3677).* | *RoBERTa uses the same MLM objective as BERT but was pretrained on more data, for longer, with dynamic masking.* |
| | BART | *Success* | *Predicted 'create', 'help', 'provide', but with much lower confidence (top score 0.0746).* | *BART is a seq2seq model pretrained as a denoising autoencoder; it can fill masks, but single-token mask prediction is likely not what it is best tuned for.* |
| **QA** | BERT | *Partial success* | *Extracted 'hallucinations, bias,' (score: 0.0170).* | *The pretrained encoder is intact, but the QA head (`qa_outputs`) was newly initialized, so span selection is close to random; the partly correct answer is most likely chance, as the very low score suggests.* |
| | RoBERTa | *Failure* | *Extracted only 'deepfakes.' (score: 0.0127).* | *Same cause: the `qa_outputs` head was newly initialized, so span selection is close to random.* |
| | BART | *Failure* | *Extracted 'Generative AI poses significant' (score: 0.0640).* | *Same cause: the `qa_outputs` head was newly initialized; the slightly higher score does not indicate a better answer.* |
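
If the observations are collected in code rather than typed by hand, a table in the same format can be generated from them. This is a minimal sketch, assuming each observation is stored as a dict; `to_markdown_table` and the column names are hypothetical, and the sample rows reuse the fill-mask results recorded above.

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a markdown table with left-aligned columns."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join(":---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


columns = ["Task", "Model", "Top answer", "Score"]
rows = [
    {"Task": "Fill-Mask", "Model": "BERT", "Top answer": "create", "Score": "0.5397"},
    {"Task": "Fill-Mask", "Model": "RoBERTa", "Top answer": "generate", "Score": "0.3711"},
    {"Task": "Fill-Mask", "Model": "BART", "Top answer": "create", "Score": "0.0746"},
]
table = to_markdown_table(rows, columns)
print(table)
```

Missing keys render as empty cells, which matches the blank "Task" cells used for continuation rows in the table above.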

---

