In [5]:
from transformers import pipeline, set_seed

models_to_test = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

set_seed(42)  # For reproducibility

print("Setup Complete. Models ready to test.")

Setup Complete. Models ready to test.


In [6]:
print("--- Experiment 1: Text Generation ---")
prompt = "The future of Artificial Intelligence is"

for name, model_id in models_to_test.items():
    print(f"\nTesting {name} ({model_id})...")
    try:
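        # Note: the text-generation pipeline loads each checkpoint with a causal-LM head.
        # BERT and RoBERTa are encoder-only, and BART's causal-LM head is not in the checkpoint,
        # so garbage output or an outright failure is expected behavior for the assignment.
        generator = pipeline('text-generation', model=model_id)
        # The pipeline's default max_new_tokens (256) overrides max_length=30; see the warning in the output.
        result = generator(prompt, max_length=30, num_return_sequences=1)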
        print(f"Output: {result[0]['generated_text']}")
    except Exception as e:
        print(f"FAILED: {e}")

--- Experiment 1: Text Generation ---

Testing BERT (bert-base-uncased)...


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Output: The future of Artificial Intelligence is................................................................................................................................................................................................................................................................

Testing RoBERTa (roberta-base)...


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Output: The future of Artificial Intelligence is

Testing BART (facebook/bart-base)...


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Output: The future of Artificial Intelligence isOtherwise ShakOtherwise sure Shak Shak323208 df empir squat Shak chuckms healer df MarxismOtherwisePatrick Walkdy Shak slipsreleased df df32 slips debugger Walk slips Shak denim Shak df df df Walk Drawn Drawn Drawn 361 slipsino Person Princeton Shak chuckSpoilerSpoiler slips df df initially dismant Drawn Crkas Drawn Drawn spotsenc Shak Drawn Drawnino Drawn Drawn opposition Shak workload slips Shak slips df Drawn DrawnEngland Drawn32 Drawn DrawnSel Drawn workload Drawn Drawn Output Drawn Drawn communism Drawn Drawn debugger spots spots dfPost df df workload Drawn df Drawn dfJe Drawn DrawnaxeFrames df kingdoms df Drawnlein Drawn Drawn slips Drawn Drawneller Drawn Drawn df spots spots game Drawn Drawn LT Drawn Drawn Alvin Drawn origins sure Drawn Drawnaze spots Drawn Drawn princip Drawn gameStatus Alvin df df Alvin df impacting impacting spots spotsaily princip futures df principAlert skysc df spots principStatus Drawn Drawn Beet Drawn skysc
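
The warnings above report that `max_length=30` was overridden by a default `max_new_tokens` of 256, which is why the outputs run far past 30 tokens. Below is a minimal sketch of bounding the output with `max_new_tokens` alone. It assumes `generator` behaves like the `text-generation` pipeline built in the cell above; `generate_once` is a hypothetical helper name, and the change only limits length. It would not make these checkpoints produce meaningful text.

```python
def generate_once(generator, prompt, new_tokens=30):
    """Run one generation and return the generated text (hypothetical helper).

    `generator` is assumed to behave like a transformers text-generation
    pipeline: callable with a prompt plus generation keyword arguments,
    returning a list of dicts that each hold a 'generated_text' key.
    """
    # Pass max_new_tokens instead of max_length so only one length limit is set.
    result = generator(prompt, max_new_tokens=new_tokens, num_return_sequences=1)
    return result[0]["generated_text"]
```

In the loop above this could be called as `generate_once(pipeline('text-generation', model=model_id), prompt)`.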

In [7]:
print("--- Experiment 2: Fill-Mask ---")
sentence = "The goal of Generative AI is to <mask> new content." # BART/RoBERTa use <mask>
sentence_bert = "The goal of Generative AI is to [MASK] new content." # BERT uses [MASK]

for name, model_id in models_to_test.items():
    print(f"\nTesting {name} ({model_id})...")

    # Switch mask token based on model type
    input_text = sentence_bert if name == "BERT" else sentence

    unmasker = pipeline('fill-mask', model=model_id)
    result = unmasker(input_text)

    # Print top prediction
    print(f"Top Prediction: {result[0]['token_str']} (Score: {result[0]['score']:.4f})")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


--- Experiment 2: Fill-Mask ---

Testing BERT (bert-base-uncased)...


Device set to use cpu


Top Prediction: create (Score: 0.5397)

Testing RoBERTa (roberta-base)...


Device set to use cpu


Top Prediction:  generate (Score: 0.3711)

Testing BART (facebook/bart-base)...


Device set to use cpu


Top Prediction:  create (Score: 0.0746)
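
The cell above hardcodes one sentence per mask style and switches on the model name. A possible alternative, sketched here under the assumption that the sentence is kept as a template with a `{mask}` placeholder, is to substitute whatever mask token the loaded tokenizer reports. `mask_sentence` and `template` are hypothetical names introduced for illustration.

```python
def mask_sentence(template, mask_token):
    """Fill a '{mask}' placeholder with a model-specific mask token."""
    if "{mask}" not in template:
        raise ValueError("template must contain a '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


template = "The goal of Generative AI is to {mask} new content."

# With a real pipeline, the token would come from its tokenizer, e.g.:
#   unmasker = pipeline('fill-mask', model=model_id)
#   input_text = mask_sentence(template, unmasker.tokenizer.mask_token)
bert_input = mask_sentence(template, "[MASK]")     # mask token used by BERT
roberta_input = mask_sentence(template, "<mask>")  # mask token used by RoBERTa and BART
```

This keeps the loop free of per-model branches, so adding another checkpoint to `models_to_test` would not require a new sentence variant.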


In [8]:
print("--- Experiment 3: Question Answering ---")
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

for name, model_id in models_to_test.items():
    print(f"\nTesting {name} ({model_id})...")

    qa_pipeline = pipeline('question-answering', model=model_id)
    result = qa_pipeline(question=question, context=context)

    print(f"Answer: {result['answer']} (Score: {result['score']:.4f})")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--- Experiment 3: Question Answering ---

Testing BERT (bert-base-uncased)...


Device set to use cpu
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: hallucinations, bias, and deepfakes (Score: 0.0076)

Testing RoBERTa (roberta-base)...


Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: as hallucinations, bias, and deepfakes. (Score: 0.0041)

Testing BART (facebook/bart-base)...


Device set to use cpu


Answer: Generative AI poses (Score: 0.0278)
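
All three scores above are below 0.03, which fits the warnings that each `qa_outputs` head was newly initialized. One way to avoid treating such answers as reliable is to gate on the score, sketched below. The `accept_answer` helper and the 0.5 threshold are assumptions chosen for illustration, not values from the assignment.

```python
def accept_answer(result, min_score=0.5):
    """Return the answer only if its score reaches min_score, else None.

    min_score=0.5 is an arbitrary illustrative threshold.
    """
    return result["answer"] if result["score"] >= min_score else None


# Answers and scores copied from the output above.
recorded = {
    "BERT": {"answer": "hallucinations, bias, and deepfakes", "score": 0.0076},
    "RoBERTa": {"answer": "as hallucinations, bias, and deepfakes.", "score": 0.0041},
    "BART": {"answer": "Generative AI poses", "score": 0.0278},
}

# Every recorded answer falls below the threshold, so each maps to None.
accepted = {name: accept_answer(result) for name, result in recorded.items()}
```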


# Observation Table

| Task | Model | Classification | Observation | Why did this happen? |
| :--- | :--- | :--- | :--- | :--- |
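| **Generation** | BERT | Failure | *Prompt followed by a long run of periods.* | BERT is an **encoder-only** model pretrained with masked language modeling, not as a causal LM; the pipeline warns that `BertLMHeadModel` needs `is_decoder=True` for standalone use. |
| | RoBERTa | Failure | *Prompt returned with no new text.* | Like BERT, RoBERTa is an **encoder-only** model and was not trained to predict the next token left to right. |
| | BART | Failure | *Prompt followed by unrelated token fragments ("Shak", "Drawn", "df", ...).* | BART is an **encoder-decoder**, but the pipeline loaded it as `BartForCausalLM` with `lm_head.weight` and `model.decoder.embed_tokens.weight` newly initialized, so the output comes from untrained weights. |
| **Fill-Mask** | BERT | Success | *Predicted 'create' (score 0.5397).* | BERT is pretrained on Masked Language Modeling (MLM), which is exactly this task. |
| | RoBERTa | Success | *Predicted 'generate' (score 0.3711).* | RoBERTa is also pretrained on MLM (an optimized BERT training recipe). |
| | BART | Success | *Predicted 'create' (score 0.0746).* | BART's denoising pretraining includes reconstructing masked text, so it can fill the mask, though with far lower confidence than BERT or RoBERTa. |
| **QA** | BERT | Mixed | *Returned "hallucinations, bias, and deepfakes" (score 0.0076).* | The span is correct, but the `qa_outputs` head is newly initialized (base BERT is not fine-tuned for QA, e.g. on SQuAD), so the near-zero score shows the result is not reliable. |
| | RoBERTa | Mixed | *Returned "as hallucinations, bias, and deepfakes." (score 0.0041).* | Same untrained `qa_outputs` head: the span overlaps the right answer but has imprecise boundaries and a near-zero score. |
| | BART | Failure | *Returned "Generative AI poses" (score 0.0278).* | BART's `qa_outputs` head is also newly initialized, so it selected the wrong span; seq2seq pretraining alone does not provide extractive QA ability. |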