In [1]:
from transformers import pipeline
import warnings
warnings.filterwarnings("ignore")



In [2]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

# Experiment 1: Text Generation

In [3]:
prompt = "The future of Artificial Intelligence is"

for name, model_id in models.items():
    print(f"\n {name}")
    try:
        generator = pipeline("text-generation", model=model_id)
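        # Note: the pipeline's default max_new_tokens (256) takes precedence over
        # max_length here, as the warning in the output below reports.
        output = generator(prompt, max_length=30, truncation=True)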
        print(output)
    except Exception as e:
        print("Failed:", e)


 BERT


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use mps:0
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]

 RoBERTa


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use mps:0
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is'}]

 BART


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "The future of Artificial Intelligence isartist rub Interstellar Fr mailConnell Interstellar sparkling Doctorsomach CE InterstellarJusticeJusticeither mail mail undertaking ColtsJusticeJustice anguish verticallyburst TicketSTEMoard Interstellar mailither Interstellar Interstellar Interstellar mail datingall CE veterallallMotorall Lieberman StillMotoritherburstall CEall Interstellarpodpod 244allall dominanceolithic女allapeakeJusticeJustice Interstellarall vertically dominance Life vertically ran Life violationsallallall pigeallallirlingallalltrackingallall harming headsets Ticket Interstellarolithic Interstellarpodallall ah Lieberman GH Ticket Life Life Life ranallall GH Gram LifeMotorallParametersallall ill Life Life Interstellarallall 747 LiebermanSTEM Jerome CarsonallSTEM.''ed 747anche Liebermanburstall violationsall Ticket Gram Lieberman Liebermanazine.''all Interstellar pige Life Liebermanazineall 747STEM Carson Lieberman Lieberman Lieberman Factorallall Gram Lif

## Observations
**BERT**

BERT outputs the prompt followed by a long run of periods. BERT is an encoder-only model pretrained with masked language modeling, where every token attends to both its left and right context, so it was never trained to predict the next token from left context alone. When the `text-generation` pipeline forces it to decode step by step, it keeps picking a high-probability filler token such as "." at every step.
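The difference between an encoder-style and a decoder-style attention pattern can be sketched in a few lines of plain Python. This is only an illustration: the function names `bidirectional_mask` and `causal_mask` are made up for this example and do not come from `transformers`.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "future", "of", "AI"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>6}: {row}")
```

A model pretrained under the first pattern has always seen the tokens to the right of the position it predicts, which is exactly the information that is missing during left-to-right generation.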

**RoBERTa**

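RoBERTa returns the prompt unchanged. It is also an encoder-only model with no next-token training. The most likely explanation is that it emitted an end-of-sequence token (or only special tokens that are stripped during decoding) at the first step, so no visible text was added. The output alone does not tell us which of these happened.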

**BART**

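BART produces a long stream of unrelated tokens with no grammatical or semantic structure. BART is an encoder-decoder model, so it does have a decoder that can generate text, but `facebook/bart-base` was pretrained as a sequence-to-sequence denoiser, not as a standalone causal language model. The `text-generation` pipeline loads it as `BartForCausalLM`, and the warning above shows that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized. With randomly initialized decoder embeddings and output head, the generated tokens are essentially noise.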
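The three failure modes above can be imitated with a toy greedy decoder. This is a minimal sketch under stated assumptions, not what the pipeline does internally: the vocabulary and the three scoring functions are hypothetical stand-ins for a model that always prefers ".", a model that ranks end-of-sequence first, and a model whose output layer is untrained.

```python
import random

EOS = "</s>"
VOCAB = [".", "is", "bright", "mail", "Life", EOS]  # hypothetical toy vocabulary


def greedy_decode(score_fn, prompt, max_new_tokens=8):
    """Append the highest-scoring token until EOS or the token budget is hit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = score_fn(tokens)
        next_token = max(scores, key=scores.get)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens


def always_period(tokens):
    # Hypothetical scorer that puts most of its mass on "." at every step.
    return {tok: (0.9 if tok == "." else 0.02) for tok in VOCAB}


def immediate_eos(tokens):
    # Hypothetical scorer that ranks end-of-sequence first straight away.
    return {tok: (0.9 if tok == EOS else 0.02) for tok in VOCAB}


def make_random_scorer(seed=0):
    # Hypothetical stand-in for an untrained output layer:
    # the scores carry no information about the context.
    rng = random.Random(seed)

    def random_scores(tokens):
        return {tok: rng.random() for tok in VOCAB}

    return random_scores


prompt = ["The", "future", "of", "AI", "is"]
print(" ".join(greedy_decode(always_period, prompt)))         # prompt + repeated "."
print(" ".join(greedy_decode(immediate_eos, prompt)))         # prompt only
print(" ".join(greedy_decode(make_random_scorer(), prompt)))  # prompt + arbitrary tokens
```

The decoding loop is identical in all three cases. Only the quality of the next-token scores changes, which is why swapping in a model trained for causal language modeling is what fixes the output, not a different generation setting.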

# Experiment 2: Masked Language Modeling (Missing Word)

In [4]:
sentences = {
    "BERT": "The goal of Generative AI is to [MASK] new content.",
    "RoBERTa": "The goal of Generative AI is to <mask> new content.",
    "BART": "The goal of Generative AI is to <mask> new content."
}

for name, model_id in models.items():
    print(f"\n {name}")
    try:
        fill_mask = pipeline("fill-mask", model=model_id)
        results = fill_mask(sentences[name])
        for r in results[:3]:
            print(r["token_str"], "->", round(r["score"], 3))
    except Exception as e:
        print("Failed:", e)


 BERT


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


create -> 0.54
generate -> 0.156
produce -> 0.054

 RoBERTa


Device set to use mps:0


 generate -> 0.371
 create -> 0.368
 discover -> 0.084

 BART


Device set to use mps:0


 create -> 0.075
 help -> 0.066
 provide -> 0.061


## Observations
**BERT**

The top predictions are highly relevant verbs: "create" (0.54), "generate" (0.156), and "produce" (0.054). All three fit the sentence grammatically, and the top candidate clearly dominates. This is expected: BERT is pretrained with Masked Language Modeling, so the task matches its pretraining objective of predicting a missing token from bidirectional context.

**RoBERTa**

Its top two predictions, "generate" (0.371) and "create" (0.368), are almost evenly split, so it commits less strongly to a single token than BERT does. RoBERTa is also MLM-trained, but on more data than BERT and with dynamic masking, which may explain why it spreads probability across two equally valid verbs instead of concentrating on one.

**BART**

The predictions ("create", "help", "provide") are plausible, but each has a very low score (below 0.08) and no candidate dominates. BART is not pretrained with single-token MLM. It is trained as a denoising autoencoder that reconstructs corrupted text, including masked spans that cover several tokens. For this task it therefore produces syntactically valid words, but the task does not align well with its training objective.
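To compare how peaked the three distributions are, the short sketch below summarizes the top-3 scores printed above. The helper `summarize` and the two statistics it reports (margin between the top two candidates, and total probability mass of the top three) are choices made for this illustration, not something the pipeline returns.

```python
# Top-3 scores copied from the fill-mask outputs above.
top3 = {
    "BERT": [0.54, 0.156, 0.054],
    "RoBERTa": [0.371, 0.368, 0.084],
    "BART": [0.075, 0.066, 0.061],
}


def summarize(scores):
    """Return (top-1 minus top-2 margin, total top-3 probability mass)."""
    ranked = sorted(scores, reverse=True)
    margin = round(ranked[0] - ranked[1], 3)
    mass = round(sum(ranked[:3]), 3)
    return margin, mass


for name, scores in top3.items():
    margin, mass = summarize(scores)
    print(f"{name:8s} margin={margin:.3f}  top3_mass={mass:.3f}")
# BERT     margin=0.384  top3_mass=0.750
# RoBERTa  margin=0.003  top3_mass=0.823
# BART     margin=0.009  top3_mass=0.202
```

Read this way, RoBERTa is not less certain overall: its top three candidates carry slightly more probability mass than BERT's, but that mass is shared between two near-tied verbs. BART's top three candidates hold only about 20% of the mass, with the rest spread across the remaining vocabulary.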

# Experiment 3: Question Answering

In [5]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for name, model_id in models.items():
    print(f"\n {name}")
    try:
        qa = pipeline("question-answering", model=model_id)
        answer = qa(question=question, context=context)
        print(answer)
    except Exception as e:
        print("Failed:", e)


 BERT


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0


{'score': 0.004337908700108528, 'start': 32, 'end': 82, 'answer': 'risks such as hallucinations, bias, and deepfakes.'}

 RoBERTa


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0


{'score': 0.007780194049701095, 'start': 0, 'end': 67, 'answer': 'Generative AI poses significant risks such as hallucinations, bias,'}

 BART


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0


{'score': 0.061262412928044796, 'start': 38, 'end': 81, 'answer': 'such as hallucinations, bias, and deepfakes'}


## Observations
**BERT**

The extracted span is correct, but the score is very low (about 0.004). The warning above shows that the `qa_outputs` head is randomly initialized, because `bert-base-uncased` has not been fine-tuned on QA data. The encoder representations may help align the question with the context, but with an untrained head the start/end classifiers are not calibrated, so the correct span here could be partly coincidental.

**RoBERTa**

The extracted span starts at the beginning of the context and stops after "bias,", so it is both too broad and incomplete. The score is again very low (about 0.008). As with BERT, the QA head is randomly initialized and the model has had no QA-specific training, so the chosen span is largely arbitrary.

**BART**

The extracted span is correct and well scoped, and the score (about 0.061) is higher than for BERT and RoBERTa, although still low in absolute terms. One possible explanation is that the encoder-decoder attention in BART helps align the question with the relevant region of the context. However, the QA head is still randomly initialized, so neither the span nor the score should be treated as reliable without QA fine-tuning.
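Why an untrained head yields such low scores can be seen from a simplified sketch of extractive span selection. It assumes the score of a span is the product of a start probability and an end probability, which is typical for extractive QA heads. The function `best_span` and the logits are hypothetical and are not the pipeline's actual implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing p_start[start] * p_end[end]."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, 0.0)
    for s, ps in enumerate(p_start):
        # Only consider spans that end at or after their start.
        for e in range(s, min(s + max_answer_len, len(p_end))):
            score = ps * p_end[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical logits for a 6-token context.
# A trained head produces sharp peaks at the answer boundaries.
trained_like = best_span([0.1, 0.2, 6.0, 0.1, 0.0, 0.2],
                         [0.0, 0.1, 0.2, 0.1, 6.5, 0.3])
# An untrained head produces nearly flat logits (exactly flat here).
untrained_like = best_span([0.0] * 6, [0.0] * 6)

print(trained_like)    # span (2, 4) with a score close to 1
print(untrained_like)  # score is 1/36, since every boundary is equally likely
```

With flat logits over n tokens, every span scores 1/n², so the score shrinks quickly as the context grows and the selected span depends on small, uninformative differences between logits.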

# Observations Table

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | *Failure* | *Generated a long sequence of dots only.* | *BERT is an encoder-only model and isn't trained for next-token generation.* |
| | RoBERTa | *Failure* | *It just returned the prompt itself.* | *RoBERTa is also encoder-only and lacks a causal language modeling decoder.* |
| | BART | *Failure* | *Generated long but incoherent and meaningless text.* | *BART can generate text, but bart-base is not trained for causal language modeling and some decoder weights (`lm_head`, decoder embeddings) are randomly initialized.* |
| **Fill-Mask** | BERT | *Success* | *Predicted 'create', 'generate', etc. with high confidence.* | *BERT is trained on Masked Language Modeling (MLM), which matches the task.* |
| | RoBERTa | *Success* | *Produced accurate predictions like 'create' and 'generate' with balanced scores.* | *It is also MLM trained and has dynamic masking + larger pretraining data.* |
| | BART | *Failure* | *Returned plausible predictions with very low confidence and no dominant answer.* | *BART is trained as a denoising autoencoder on masked spans and not for single-token MLM prediction.* |
| **QA** | BERT | *Partial Failure* | *Extracted the correct answer span but with very low confidence.* | *Encoder representations are strong, but the QA head is randomly initialized as there is no QA fine-tuning.* |
| | RoBERTa | *Failure* | *Returned a span from the start of the context, which was incomplete.* | *The QA head is randomly initialized, as there is no QA fine-tuning.* |
| | BART | *Partial Success* | *Extracted correct answer span with slightly higher confidence compared to other models.* | *Encoder-decoder attention helps alignment of question to context, but QA head is still untrained itself.* |