In [1]:
from transformers import pipeline
import warnings
warnings.filterwarnings("ignore")



In [2]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

In [3]:
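prompt = "The future of Artificial Intelligence is"

# Try free-form generation with each checkpoint. None of them is a
# decoder-only causal language model, so poor output is expected.
for name, model_id in models.items():
    print(f"\n {name}")
    try:
        generator = pipeline("text-generation", model=model_id)
        # Note: the pipeline's default max_new_tokens (=256) takes precedence
        # over max_length=30, as the warning in the output reports.
        output = generator(prompt, max_length=30, truncation=True)
        print(output)
    except Exception as e:
        print("Failed:", e)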


 BERT


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is................................................................................................................................................................................................................................................................'}]

 RoBERTa


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is'}]

 BART


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence isookyXX Kiringin Planetary Planetaryooluddy antagonist oppressutinggingingin inspection popup inspectionMikeookyooky Metroid celebratedutingginmyra Info molooky Milan Milanmyra crackooky learners ● aimed Milan aimed Ox Malone learners learnersgin0000YNookyookyooky learners securelyooky chosen chosengin asserts Malone MaloneookyestroREDRepublicankilling Fighter kale laud laud learners Respectooky chosen learnersRED honestly crack chosen chosen chosensurface Fighter honestly Anon chosen Malone Maloneheader chosen honestly Ok chosenll chosen reproductiveansky chosenolia Tags chosen chosenestro crack chosenestroRED FighterRepublicanheaderestroestroheaderansky endorsingll wideansky settlesll chosen Fighter chosenRepublicanestro Fighter chosenheaderestro honestlyheader Tags chosenheader reproductiveheader Fighterookyheaderheader Fighterheaderheader � Fighteroliaestro endorsing Maloneanskyheaderestroheaderheader SpearsheaderheaderRep

In [4]:
# Each tokenizer has its own mask token: BERT uses [MASK],
# while RoBERTa and BART use <mask>.
sentences = {
    "BERT": "The goal of Generative AI is to [MASK] new content.",
    "RoBERTa": "The goal of Generative AI is to <mask> new content.",
    "BART": "The goal of Generative AI is to <mask> new content."
}

for name, model_id in models.items():
    print(f"\n {name}")
    try:
        fill_mask = pipeline("fill-mask", model=model_id)
        results = fill_mask(sentences[name])
        for r in results[:3]:
            print(r["token_str"], "->", round(r["score"], 3))
    except Exception as e:
        print("Failed:", e)


 BERT


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


create -> 0.54
generate -> 0.156
produce -> 0.054

 RoBERTa


Device set to use cpu


 generate -> 0.371
 create -> 0.368
 discover -> 0.084

 BART


Device set to use cpu


 create -> 0.075
 help -> 0.066
 provide -> 0.061


In [5]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for name, model_id in models.items():
    print(f"\n {name}")
    try:
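        # The base checkpoints have no trained QA head; see the
        # "newly initialized" warnings in the output.
        qa = pipeline("question-answering", model=model_id)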
        answer = qa(question=question, context=context)
        print(answer)
    except Exception as e:
        print("Failed:", e)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



 BERT


Device set to use cpu
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.010130776558071375, 'start': 32, 'end': 81, 'answer': 'risks such as hallucinations, bias, and deepfakes'}

 RoBERTa


Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.007283675251528621, 'start': 72, 'end': 82, 'answer': 'deepfakes.'}

 BART


Device set to use cpu


{'score': 0.03296162933111191, 'start': 14, 'end': 37, 'answer': 'poses significant risks'}
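
The `qa_outputs` warnings in the output above point to a likely remedy: use a checkpoint whose QA head has already been fine-tuned. The following is a minimal sketch, not part of the original experiment. It assumes the publicly available `distilbert-base-cased-distilled-squad` checkpoint, and the helper name `answer_with_finetuned_model` is hypothetical. Calling the helper downloads the model.

```python
# Assumption: a SQuAD-fine-tuned checkpoint from the Hugging Face Hub.
# It was not used in the original experiment.
QA_CHECKPOINT = "distilbert-base-cased-distilled-squad"


def answer_with_finetuned_model(question, context, model_id=QA_CHECKPOINT):
    """Run extractive QA with a checkpoint that already has a trained QA head."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import pipeline

    qa = pipeline("question-answering", model=model_id)
    return qa(question=question, context=context)


# Example call (downloads the checkpoint on first use):
# answer = answer_with_finetuned_model("What are the risks?", context)
# print(answer)
```

Comparing the score from such a checkpoint with the scores printed above would show how much of the low confidence comes from the untrained head rather than from the architecture.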


## Observation Table

Based on the experimental results, here's the completed observation table:

| **Task** | **Model** | **Classification (Success/Failure)** | **Observation (What actually happened?)** | **Why did this happen? (Architectural Reason)** |
|----------|-----------|--------------------------------------|-------------------------------------------|------------------------------------------------|
| **Generation** | BERT | Failure | Generated nonsense: the prompt followed by a long run of periods | BERT is an encoder-only model pre-trained with Masked Language Modeling (MLM) and bidirectional attention, not next-token prediction. The pipeline loaded it as `BertLMHeadModel` without `is_decoder=True`, so it was never set up for autoregressive generation. |
| | RoBERTa | Failure | Returned the prompt unchanged, with no new text | RoBERTa is also encoder-only and pre-trained with MLM, so it is built for understanding tasks rather than generation. |
| | BART | Failure | Generated gibberish tokens ("Fighter", "Republican", "Malone", "Tags", etc.) | The `text-generation` pipeline loaded only BART's decoder as `BartForCausalLM`, and the loading warning reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized. With untrained output weights and no encoder input, the decoder produces nonsensical tokens. |
| **Fill-Mask** | BERT | Success | Predicted: 'create' (0.54), 'generate' (0.156), 'produce' (0.054) | BERT is trained on Masked Language Modeling (MLM). This is its core training objective, so it excels at this task. |
| | RoBERTa | Success | Predicted: 'generate' (0.371), 'create' (0.368), 'discover' (0.084) | RoBERTa is an optimized BERT variant, also trained on MLM. It performs well on fill-mask tasks. |
| | BART | Success | Predicted: 'create' (0.075), 'help' (0.066), 'provide' (0.061) | BART is pre-trained as a denoising autoencoder, and its noising schemes include token masking and text infilling, so it can reconstruct a `<mask>` token. Its confidence scores are much lower than those of BERT and RoBERTa. |
| **QA** | BERT | Partial Success | Answer: "risks such as hallucinations, bias, and deepfakes" (score: 0.010) | `bert-base-uncased` has no fine-tuned QA head (for example, from SQuAD training): the loading warning reports that `qa_outputs.weight` and `qa_outputs.bias` were newly initialized. The span is plausible, but the very low score shows the model has not learned the task, so the match is not reliable. |
| | RoBERTa | Partial Success | Answer: "deepfakes." (score: 0.007) | Same as BERT: the QA head is newly initialized, so the extracted span is incomplete and has very low confidence. |
| | BART | Partial Success | Answer: "poses significant risks" (score: 0.033) | BART's encoder-decoder architecture can be adapted to QA, but the base checkpoint's QA head is also newly initialized. It extracts a partially relevant span with low confidence. |
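
The gap in fill-mask confidence noted in the table can be made concrete by summing each model's top-3 scores. This is a small sketch that uses only the scores printed in the notebook output; the helper name `top_k_mass` is introduced here for illustration.

```python
# Top-3 fill-mask scores copied from the notebook output.
fill_mask_scores = {
    "BERT": {"create": 0.54, "generate": 0.156, "produce": 0.054},
    "RoBERTa": {"generate": 0.371, "create": 0.368, "discover": 0.084},
    "BART": {"create": 0.075, "help": 0.066, "provide": 0.061},
}


def top_k_mass(scores):
    """Return the total probability assigned to the listed candidates."""
    return round(sum(scores.values()), 3)


summary = {name: top_k_mass(s) for name, s in fill_mask_scores.items()}

for name, mass in summary.items():
    scores = fill_mask_scores[name]
    best = max(scores, key=scores.get)  # candidate with the highest score
    print(f"{name}: top-3 mass = {mass}, best candidate = {best!r}")
```

BERT and RoBERTa place roughly 0.75 and 0.82 of their probability mass on three candidates, while BART places about 0.20, which matches the "lower confidence" observation for BART.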

---

### Key Insights:

1. **Encoder-only models (BERT, RoBERTa)**: Excel at understanding tasks (MLM, classification) but fail at free-form text generation.
2. **Encoder-decoder models (BART)**: Designed for sequence-to-sequence generation, but the base checkpoint needs fine-tuning before it is useful on a specific task, and loading it as a decoder-only causal LM leaves key weights untrained.
3. **Fine-tuning matters**: Every base model performed poorly on tasks that need a task-specific head it was never trained with (here, generation and QA).
4. **Task alignment**: Performance is best when the task matches the model's pre-training objective (e.g., BERT on fill-mask).
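
The first insight comes down to the attention pattern each architecture is trained with. The sketch below is a simplified illustration, not the `transformers` implementation: `attention_mask` is a hypothetical helper that builds a matrix in which 1 means position `i` may attend to position `j`.

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len mask; 1 means row position may attend to column position."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Encoder-style (bidirectional): every token sees every other token.
bidirectional = attention_mask(4, causal=False)

# Decoder-style (causal): each token sees only itself and earlier tokens.
causal = attention_mask(4, causal=True)

print("bidirectional:")
for row in bidirectional:
    print(row)

print("causal:")
for row in causal:
    print(row)
```

A model trained with the bidirectional pattern learns to fill in a token using context on both sides, which suits fill-mask. Left-to-right generation needs the causal pattern, where each prediction depends only on earlier tokens.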