In [None]:
!pip install transformers torch
from transformers import pipeline
import torch

# Define our three models
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}





In [None]:
print("--- EXPERIMENT 1: TEXT GENERATION ---")
prompt = "The future of Artificial Intelligence is"

for name, model_path in models.items():
    print(f"\nTesting {name}...")
    try:
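        # Use the text-generation pipeline. Note: in the run logged below, the
        # pipeline's default max_new_tokens (256) took precedence over max_length=20.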
        generator = pipeline("text-generation", model=model_path)
        result = generator(prompt, max_length=20, num_return_sequences=1)
        print(f"Result: {result[0]['generated_text']}")
    except Exception as e:
        print(f"FAILURE: {name} cannot generate text easily. Error: {str(e)[:100]}...")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


--- EXPERIMENT 1: TEXT GENERATION ---

Testing BERT...


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Result: The future of Artificial Intelligence is................................................................................................................................................................................................................................................................

Testing RoBERTa...


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is

Testing BART...


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Result: The future of Artificial Intelligence is audiences Overviewessor381essoressoressorCRECREessor381Ops Slaveessoressor38188essorOps381381Opsessoressor Janeiro Janeiro Episode rejuvenRPG childishessoratefulOps381essor AllaahessorCRE hurled Highly shrinkulo0101essorinea0101Ops hurledineainea01 bottle locally terrifying locallycue sched ToolsWOR0101 Slave rejuven Receasonable rejuveninea rejuvenRPG rejuven rejuveninea651 bizarre rejuven rejuvenasonable bottle rejuven rejuvenWild rejuven rejuven rejuven Fiber rejuven locally rejuven rejuvenovie rejuvenWOR rejuven bottle rejuven ignorance rejuven rejuven preferring sleeper rejuven rejuven locally Cologne rejuvenTY rejuvenWOR fullerWOR rejuven rejuvenerno rejuvenaires rejuven locally ignorance Thumbnails rejuven rejuven ignorance preferring rejuven rejuven Rights rejuven rejuven sleeper rejuvenHCR locally locally rejuven bottle locally rejuvenHCR rejuven rejuven Thumbnails rejuven locally weight rejuven locally Brigham rejuven rejuven s
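
The warnings above report that `max_new_tokens` (=256) took precedence over `max_length=20`, which is why the outputs run far past 20 tokens. The following is a minimal sketch, not part of the original run, that passes `max_new_tokens` explicitly and separates the model's continuation from the echoed prompt. The helper names (`split_continuation`, `describe_continuation`, `generate_continuation`) are hypothetical. The `transformers` import is deferred so the two text helpers can be used without loading a model.

```python
def split_continuation(prompt: str, generated_text: str) -> str:
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def describe_continuation(continuation: str) -> str:
    """Roughly label a continuation as 'empty', 'repetitive', or 'text'.

    This is a crude check: it does not judge whether the text is coherent.
    """
    stripped = continuation.strip()
    if not stripped:
        return "empty"
    if len(set(stripped)) == 1:
        # A single character repeated, e.g. a long run of periods.
        return "repetitive"
    return "text"


def generate_continuation(model_path: str, prompt: str, max_new_tokens: int = 20) -> str:
    """Generate with an explicit max_new_tokens and return only the new text."""
    from transformers import pipeline  # deferred: only needed when a model is run

    generator = pipeline("text-generation", model=model_path)
    result = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return split_continuation(prompt, result[0]["generated_text"])
```

Applied to the outputs above, `describe_continuation` would label the BERT result as repetitive and the RoBERTa result as empty. The BART result would count as text, so judging coherence still requires reading the output.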

In [None]:
print("\n--- EXPERIMENT 2: FILL-MASK ---")

for name, model_path in models.items():
    # Pick the mask token each tokenizer expects: BERT uses [MASK]; RoBERTa and BART use <mask>
    mask_token = "[MASK]" if name == "BERT" else "<mask>"
    text = f"The goal of Generative AI is to {mask_token} new content."

    print(f"\nTesting {name}...")
    try:
        filler = pipeline("fill-mask", model=model_path)
        result = filler(text)
        # Show top 2 predictions
        preds = [r['token_str'] for r in result[:2]]
        print(f"Top Predictions: {preds}")
    except Exception as e:
        print(f"FAILURE: {e}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



--- EXPERIMENT 2: FILL-MASK ---

Testing BERT...


Device set to use cpu


Top Predictions: ['create', 'generate']

Testing RoBERTa...


Device set to use cpu


Top Predictions: [' generate', ' create']

Testing BART...


Device set to use cpu


Top Predictions: [' create', ' help']
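
The `name == "BERT"` check works for these three checkpoints, but the mask token can also be read from the pipeline's tokenizer. The sketch below assumes a template containing a `{mask}` placeholder; that convention and the helper names are hypothetical, not part of the original experiment. It also strips the leading space that appears in the RoBERTa and BART predictions above.

```python
def build_masked_text(template: str, mask_token: str) -> str:
    """Insert the model-specific mask token into a template containing '{mask}'."""
    return template.replace("{mask}", mask_token)


def top_tokens(results, k: int = 2):
    """Return the top-k predicted tokens with surrounding whitespace removed."""
    return [r["token_str"].strip() for r in results[:k]]


def run_fill_mask(model_path: str, template: str, k: int = 2):
    """Run fill-mask using whatever mask token the loaded tokenizer defines."""
    from transformers import pipeline  # deferred: only needed when a model is run

    filler = pipeline("fill-mask", model=model_path)
    text = build_masked_text(template, filler.tokenizer.mask_token)
    return top_tokens(filler(text), k)
```

For example, `run_fill_mask("roberta-base", "The goal of Generative AI is to {mask} new content.")` would build the same sentence as the loop above without hard-coding `<mask>`.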


In [None]:
print("\n--- EXPERIMENT 3: QUESTION ANSWERING ---")
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for name, model_path in models.items():
    print(f"\nTesting {name}...")
    try:
        qa_bot = pipeline("question-answering", model=model_path)
        result = qa_bot(question=question, context=context)
        print(f"Answer: {result['answer']}")
        print(f"Confidence Score: {round(result['score'], 4)}")
    except Exception as e:
        print(f"FAILURE: {e}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- EXPERIMENT 3: QUESTION ANSWERING ---

Testing BERT...


Device set to use cpu


Answer: Generative AI poses significant risks such as hallucinations
Confidence Score: 0.0085

Testing RoBERTa...


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: ,
Confidence Score: 0.0051

Testing BART...


Device set to use cpu


Answer: deepfakes
Confidence Score: 0.0198
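
The very low confidence scores are consistent with the warnings that `qa_outputs.weight` and `qa_outputs.bias` were newly initialized. The sketch below is a simplified illustration, assuming a span's score is the product of the softmax probability of its start token and of its end token. That is a simplification of extractive QA scoring, not the pipeline's exact code. The logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span_score(start_logits, end_logits):
    """Highest start_prob * end_prob over all spans with start <= end."""
    start_p = softmax(start_logits)
    end_p = softmax(end_logits)
    best = 0.0
    for i, ps in enumerate(start_p):
        for j in range(i, len(end_p)):
            best = max(best, ps * end_p[j])
    return best


n = 20  # assumed number of tokens in the context (illustrative)

# An untrained head gives roughly flat logits, so every span scores about 1/n^2.
untrained = best_span_score([0.0] * n, [0.0] * n)

# A fine-tuned head would put most of its mass on one start and one end position.
start_logits = [0.0] * n
end_logits = [0.0] * n
start_logits[5] = 8.0
end_logits[9] = 8.0
confident = best_span_score(start_logits, end_logits)

print(f"untrained-like score: {untrained:.4f}")
print(f"fine-tuned-like score: {confident:.4f}")
```

Under these assumptions, flat logits over 20 tokens give a best score of about 0.0025. That is the same order of magnitude as the 0.005 to 0.02 scores logged above.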


| Task           | Model   | Classification (Success/Poor/Failure) | Observation (What actually happened?)                                                  | Why did this happen? (Architectural Reason)                                                                                                                                            |
| :------------- | :------ | :------------------------------------ | :------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Generation** | BERT    | Failure                               | It appended a long run of periods to the prompt instead of meaningful text.            | BERT is an **encoder-only** model pretrained with masked language modeling, not left-to-right next-token prediction.                                                                   |
|                | RoBERTa | Failure                               | It returned the prompt unchanged, with no continuation.                                | RoBERTa is also **encoder-only** and pretrained with masked language modeling, so it has not learned to continue text left to right.                                                   |
|                | BART    | Failure                               | It generated a large block of random, nonsensical tokens.                              | BART has a decoder, but the pipeline loaded it as `BartForCausalLM`, and the log shows `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized, so the output is incoherent. |
| **Fill-Mask**  | BERT    | Success                               | It correctly predicted words like “create” and “generate.”                             | BERT is trained using **Masked Language Modeling (MLM)**, making it ideal for fill-mask tasks.                                                                                         |
|                | RoBERTa | Success                               | It accurately predicted masked words such as “generate” and “create.”                  | RoBERTa uses the same MLM objective as BERT, trained on more data and without next-sentence prediction (NSP).                                                                          |
|                | BART    | Success                               | It successfully suggested words like “create” and “help.”                              | BART is pretrained to reconstruct corrupted text, which helps it infer missing tokens well.                                                                                            |
| **QA**         | BERT    | Poor                                  | It returned a long span ending at “hallucinations” with a very low score (0.0085).     | The architecture supports extractive QA, but the log shows the `qa_outputs` head was newly initialized: the **base model lacks QA fine-tuning**.                                       |
|                | RoBERTa | Failure                               | It output only a single comma instead of an answer (score 0.0051).                     | Without **extractive QA fine-tuning**, RoBERTa's newly initialized QA head cannot locate the answer span in the context.                                                               |
|                | BART    | Poor                                  | It picked only the single word “deepfakes” from the end of the passage (score 0.0198). | BART was not fine-tuned for QA either; its newly initialized QA head selects a span without real comprehension, so the partially correct answer is likely coincidental.               |


I experimented with BERT, RoBERTa, and BART in this lab. I found that BERT and RoBERTa are very good at understanding context (Fill-Mask), but as encoder-only models they could not generate meaningful text. BART is an encoder-decoder, so it is capable of producing text, but the base checkpoint needs additional fine-tuning before it is useful for tasks such as generation and Q&A. This experiment taught me that choosing the right architecture is just as crucial as the size of the model.
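
One way to picture the encoder/decoder difference mentioned above is through attention masks. The sketch below is a simplified illustration with hypothetical helper names; it ignores padding and BART's cross-attention. In an encoder such as BERT or RoBERTa, every token can attend to every other token. In a decoder, each token can attend only to itself and earlier tokens, which is what allows left-to-right generation.

```python
def bidirectional_mask(n: int):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for label, mask in (("bidirectional", bidirectional_mask(4)), ("causal", causal_mask(4))):
    print(label)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

In the causal mask, the zeros above the diagonal hide future tokens from each position, whereas the bidirectional mask lets a masked word be predicted from context on both sides.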