In [1]:
# Install the transformers library
!pip install transformers -q

import torch
from transformers import pipeline

# Define the models for the assignment
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}



## Experiment 1: Text Generation


In [2]:
print("--- Experiment 1: Text Generation ---")
prompt = "The future of Artificial Intelligence is"

for name, model_id in models.items():
    print(f"\nTesting {name}...")
    try:
        # Note: BERT and RoBERTa are encoder-only; the pipeline still loads them with an
        # LM head, but they were never trained for left-to-right generation
        gen = pipeline("text-generation", model=model_id)
        # Using a small max_new_tokens for a quick test
        output = gen(prompt, max_new_tokens=15, num_return_sequences=1)
        print(f"Result: {output[0]['generated_text']}")
    except Exception as e:
        # Report the real error: a failure here may also be a download or configuration problem
        print(f"Result: FAILED for {name}. Error: {e}")

--- Experiment 1: Text Generation ---

Testing BERT...


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu


Result: The future of Artificial Intelligence is...............

Testing RoBERTa...


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu


Result: The future of Artificial Intelligence is

Testing BART...


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


Result: The future of Artificial Intelligence is planted games games gamesRevolution planted planted planted games planted unden planted descend planted


## Inference of Experiment 1
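The results highlight a fundamental architectural divide. BERT and RoBERTa are encoder-only models pre-trained for bidirectional representation. The pipeline did not raise an error for either: it loaded each with a language-modeling head (note the `is_decoder=True` warnings), but because neither model was trained to predict tokens left to right, BERT emitted only a run of periods and RoBERTa emitted nothing beyond the prompt. BART is an encoder-decoder architecture and can generate in principle. Here, however, the `text-generation` pipeline loaded it as `BartForCausalLM`, and the log shows that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized rather than loaded from the checkpoint, so the repetitive gibberish most likely reflects those untrained weights. In addition, `facebook/bart-base` was pre-trained on denoising (reconstructing corrupted text), not on open-ended continuation. Fluent generation requires a model trained with a causal, left-to-right objective whose weights are loaded intact.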
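
The divide between bidirectional and autoregressive models comes down to which positions each token is allowed to attend to. The following is a minimal illustrative sketch in plain Python, not the `transformers` implementation; the helper names `bidirectional_mask` and `causal_mask` are hypothetical and exist only for this example. A `1` means "row token may attend to column token".

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Toy tokenization of the prompt, for display only
tokens = ["The", "future", "of", "AI", "is"]

for label, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                    ("causal", causal_mask(len(tokens)))]:
    print(label)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>6}: {row}")
```

A model trained only under the first pattern has always seen tokens on both sides of each position, which is one way to understand why BERT and RoBERTa struggle when asked to continue a prompt that has nothing to its right.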

## Experiment 2: Masked Language Modeling

In [4]:
print("\n--- Experiment 2: Fill-Mask (Corrected) ---")

for name, model_path in models.items():
    print(f"\nTesting {name}...")

    # Logic to select the correct token automatically
    if "roberta" in model_path or "bart" in model_path:
        mask_token = "<mask>"
    else:
        mask_token = "[MASK]"

    masked_prompt = f"The goal of Generative AI is to {mask_token} new content."

    try:
        filler = pipeline("fill-mask", model=model_path)
        results = filler(masked_prompt)
        print(f"  Top Prediction: '{results[0]['token_str']}' (Score: {results[0]['score']:.4f})")
    except Exception as e:
        print(f"  Result: FAILED. Error: {e}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



--- Experiment 2: Fill-Mask (Corrected) ---

Testing BERT...


Device set to use cpu


  Top Prediction: 'create' (Score: 0.5397)

Testing RoBERTa...


Device set to use cpu


  Top Prediction: ' generate' (Score: 0.3711)

Testing BART...


Device set to use cpu


  Top Prediction: ' create' (Score: 0.0746)


## Inference of Experiment 2
In this task, the encoder-only models (BERT and RoBERTa) performed markedly better, with top-prediction scores of 0.54 and 0.37 respectively. Both were pre-trained on the Masked Language Modeling (MLM) objective: their primary "job" is to predict a hidden token from its surrounding context. BART's top score was much lower (0.07), although its top prediction, 'create', matched BERT's. A likely reason is that BART's pre-training objective is to reconstruct entire corrupted sequences rather than fill a single slot, so its probability mass is spread across more candidates than that of the specialized encoders.
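
The scores reported by the `fill-mask` pipeline are probabilities, so a lower top score means the model's distribution at the masked position is flatter, not necessarily that its top choice is wrong. The sketch below illustrates this with a plain-Python softmax; the five-word vocabulary and both sets of logits are made-up values for illustration, not outputs of any of the three models.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and logits at the masked position
vocab = ["create", "generate", "produce", "make", "build"]
peaked_logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # one candidate dominates
flat_logits = [1.2, 1.1, 1.0, 0.9, 0.8]    # many near-equal candidates

for label, logits in [("peaked", peaked_logits), ("flat", flat_logits)]:
    probs = softmax(logits)
    best = max(range(len(vocab)), key=probs.__getitem__)
    print(f"{label}: top='{vocab[best]}' score={probs[best]:.4f}")
```

Both distributions rank the same word first, yet the flat one assigns it a much smaller probability, which mirrors the pattern of BERT and BART agreeing on 'create' with very different scores.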

## Experiment 3: Question Answering

In [5]:
print("\n--- Experiment 3: Question Answering ---")
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for name, model_id in models.items():
    print(f"\nTesting {name}...")
    try:
        # Warning: Base models are not fine-tuned for QA, so results may be "noisy"
        qa = pipeline("question-answering", model=model_id)
        res = qa(question=question, context=context)
        print(f"Answer: '{res['answer']}' (Confidence: {res['score']:.4f})")
    except Exception as e:
        # Report the real error instead of assuming the cause
        print(f"Result: FAILED for {name}. Error: {e}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- Experiment 3: Question Answering ---

Testing BERT...


Device set to use cpu


Answer: 'Generative AI poses significant risks such as hallucinations' (Confidence: 0.0100)

Testing RoBERTa...


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Answer: 'poses significant risks such as hallucinations, bias, and deepfakes' (Confidence: 0.0079)

Testing BART...


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Answer: 'risks such as hallucinations, bias, and deepfakes.' (Confidence: 0.0208)


## Inference of Experiment 3
All three models yielded extremely low confidence scores (roughly 0.008 to 0.02), and the warnings indicate that the `qa_outputs` weights were newly initialized. This occurs because these are base checkpoints, not fine-tuned versions. While the models encode the language in the context, the "QA head" (the final layer that scores each token as a possible start or end of the answer) has not been trained on a dataset such as SQuAD. Without fine-tuning, the models are essentially making random guesses through an untrained output layer, so any plausible-looking span is a coincidence. This illustrates that architecture provides the potential for a task, but fine-tuning provides the capability.
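
To make the role of the QA head concrete, here is a simplified sketch of extractive span selection: the head produces one start logit and one end logit per token, and the answer is the span whose start and end probabilities have the highest product. This is an assumption-laden simplification (the real pipeline also handles the question tokens, subword pieces, and other details), and `best_span`, the whitespace tokenization, and all logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing P(start) * P(end) with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for s, ps in enumerate(start_probs):
        for e in range(s, min(s + max_answer_len, len(end_probs))):
            score = ps * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = ("Generative AI poses significant risks such as "
          "hallucinations , bias , and deepfakes .").split()
n = len(tokens)

# Hypothetical logits: an untrained head gives nearly constant scores,
# a trained head gives sharp peaks at the answer boundaries.
untrained_start = [0.01 * (i % 3) for i in range(n)]
untrained_end = [0.01 * (i % 2) for i in range(n)]
trained_start = [0.0] * n
trained_end = [0.0] * n
trained_start[7] = 8.0   # "hallucinations"
trained_end[12] = 8.0    # "deepfakes"

for label, (sl, el) in [("untrained", (untrained_start, untrained_end)),
                        ("trained", (trained_start, trained_end))]:
    s, e, score = best_span(sl, el)
    print(f"{label}: '{' '.join(tokens[s:e + 1])}' (score: {score:.4f})")
```

With near-uniform logits the best score is on the order of one over the squared number of tokens, which is the same order of magnitude as the confidences observed for the three base checkpoints.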

| Task | Model | Outcome | Observation (What happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | **Failure** | Generated only a run of periods. | **Encoder-only**. BERT is bidirectional and was not trained for autoregressive next-token prediction. |
| | RoBERTa | **Failure** | Generated nothing beyond the prompt. | **Encoder-only**. Like BERT, it is optimized for context understanding, not sequence generation. |
| | BART | **Runs / Poor** | Generated repetitive gibberish. | **Encoder-Decoder**. It has a decoder, but the pipeline loaded it as `BartForCausalLM` with a newly initialized `lm_head`, and the base checkpoint was trained on denoising rather than open-ended generation. |
| **Fill-Mask** | BERT | **Success** | Predicted 'create' (score 0.54). | **Encoder-only**. BERT was specifically pre-trained on the Masked Language Modeling (MLM) task. |
| | RoBERTa | **Success** | Predicted 'generate' (score 0.37). | **Encoder-only**. Optimized for bidirectional token prediction using context. |
| | BART | **Success** | Predicted 'create' (score 0.07). | **Encoder-Decoder**. Designed for sequence-to-sequence denoising; its probability is spread across many reconstruction options. |
| **QA** | BERT | **Poor / Random** | Confidence ~0.01; extracted a truncated span. | **Base Model**. It lacks the fine-tuned "QA Head" (output layer) required to accurately locate answers in a context. |
| | RoBERTa | **Poor / Random** | Confidence ~0.008; extracted an over-long span. | **Base Model**. The QA Head was initialized with random weights because the base model was not fine-tuned on a QA dataset such as SQuAD. |
| | BART | **Poor / Random** | Confidence ~0.02; extracted a plausible span. | **Base Model**. Despite having an Encoder-Decoder structure, it still requires task-specific training to perform extractive QA reliably. |