In [1]:
from transformers import pipeline, set_seed, GPT2Tokenizer

In [2]:
import os
import nltk

In [3]:
# Define the models and prompt
models_to_test = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

### Task - 1: Text Generation

In [4]:
results = {}
prompt = "The future of Artificial Intelligence is"


In [7]:
for name, model_path in models_to_test.items():
    print(f"--- Running Experiment with {name} ---")
    try:
        # Task: Text Generation
        generator = pipeline('text-generation', model=model_path)

        # Setting max_new_tokens for a fair comparison
        output = generator(prompt, max_new_tokens=100, num_return_sequences=1)
        results[name] = output[0]['generated_text']
        print(f"Output: {results[name]}\n")

    except Exception as e:
        results[name] = f"Error: {str(e)}"
        print(f"Result: {name} failed as expected or produced an error.\n")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


--- Running Experiment with BERT ---


Device set to use cuda:0
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Output: The future of Artificial Intelligence is....................................................................................................

--- Running Experiment with RoBERTa ---


Device set to use cuda:0


Output: The future of Artificial Intelligence is

--- Running Experiment with BART ---


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Output: The future of Artificial Intelligence isersonerson Brend different different differentraine synchronized perceive Chat synchronized salvation hazeshall salvation synchronized Cache vaccination different vaccination thresholdsFactor paradox Tennis vacuum thresholds thresholds Theseushing thresholds thresholds thresholds Laws thresholds thresholdsode thresholds thresholds Hospital thresholds paradox thresholds revoked Laws thresholdsshall Cache Ah unsureorthy Ah Ah Laws Laws thresholdsDun Laws vaccination vaccinationochemistry Elkwat Ah Ah Ahï¿½ thresholds Ah Laws videot Laws Lawsshall thresholdsochemistryochemistry thresholds thresholds Due Ah Ahshall thresholds mosquitoesshallshallersonerson sliced Ah Nothing thresholds Ah temper Ahochemistryerson heroic thresholds



In [8]:
import pandas as pd
df = pd.DataFrame.from_dict(results, orient='index', columns=['Generated Text'])
print(df)

                                            Generated Text
BERT     The future of Artificial Intelligence is.........
RoBERTa           The future of Artificial Intelligence is
BART     The future of Artificial Intelligence isersone...


**Observations**  
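BERT and RoBERTa are encoder-only models. They were pretrained with masked language modelling and bidirectional attention, so they excel at understanding a prompt but were never trained to predict the next token. Left-to-right text generation is the role of the decoder in a transformer. The pipeline still loads them (as `BertLMHeadModel` and `RobertaLMHeadModel`, with a warning about `is_decoder=True`), but the output is degenerate: BERT repeats `.` until it reaches `max_new_tokens`, and RoBERTa returns the prompt unchanged.

BART is an encoder-decoder model and is capable of text generation in principle. In this run, however, the `text-generation` pipeline loaded it as `BartForCausalLM`, and the warning above reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized instead of being loaded from the checkpoint. The incoherent output therefore most likely reflects those untrained weights rather than the quality of BART itself, so this run is not a fair comparison with GPT-style decoder-only models.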
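
The difference between encoder-style and decoder-style attention can be illustrated with a small sketch. This is a simplified picture written for this notebook, not the way the `transformers` library builds its masks; the helper names `bidirectional_mask` and `causal_mask` are hypothetical. A `1` at row `i`, column `j` means token `i` may attend to token `j`.

```python
def bidirectional_mask(n):
    """Encoder-style attention: every token can see every other token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style attention: token i can only see tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Illustrative tokens only; a real tokenizer would split the prompt differently.
tokens = ["The", "future", "of", "AI", "is"]

print("Bidirectional (BERT / RoBERTa pretraining):")
for token, row in zip(tokens, bidirectional_mask(len(tokens))):
    print(f"{token:>8} {row}")

print("Causal (GPT-style generation):")
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>8} {row}")
```

A model pretrained with the first pattern has always seen context on both sides of a position, which is why asking it to continue a prompt from the left context alone falls outside what it was trained to do.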

### Task - 2: Masked Language Modelling

In [9]:
mask_prompt_bert = "The goal of Generative AI is to [MASK] new content."
mask_prompt_others = "The goal of Generative AI is to <mask> new content."

In [10]:
results_mask = {}
for name, model_path in models_to_test.items():
    print(f"--- Running Fill-Mask with {name} ---")
    try:
        # Initialize fill-mask pipeline
        mask_filler = pipeline('fill-mask', model=model_path)

        # Select the correct prompt format
        current_prompt = mask_prompt_bert if name == "BERT" else mask_prompt_others

        # Get top prediction
        output = mask_filler(current_prompt)

        # Store the top 1 result
        top_prediction = output[0]['token_str']
        score = round(output[0]['score'], 4)

        results_mask[name] = f"{top_prediction} (Confidence: {score})"
        print(f"Top Prediction: {top_prediction}\n")

    except Exception as e:
        results_mask[name] = f"Error: {str(e)}"
        print(f"Result: {name} failed.\n")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


--- Running Fill-Mask with BERT ---


Device set to use cuda:0


Top Prediction: create

--- Running Fill-Mask with RoBERTa ---


Device set to use cuda:0


Top Prediction:  generate

--- Running Fill-Mask with BART ---


Device set to use cuda:0


Top Prediction:  create



In [11]:
mask_df = pd.DataFrame.from_dict(results_mask, orient='index', columns=['Predicted Word'])
print(mask_df)

                         Predicted Word
BERT        create (Confidence: 0.5397)
RoBERTa   generate (Confidence: 0.3711)
BART        create (Confidence: 0.0746)


**Observations**  
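BERT and RoBERTa were pretrained specifically on masked language modelling. Because they encode context from both sides of the mask, they predict a plausible word with good confidence: `create` (0.5397) for BERT and `generate` (0.3711) for RoBERTa.

BART also predicts a plausible word (`create`), but with much lower confidence (0.0746). Its pretraining objective is sequence-to-sequence denoising, in which a `<mask>` can stand for a whole span of text, rather than single-token masked language modelling. This may explain why its probability mass is spread over more candidates.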
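
The cell above picks the mask string by model name. A slightly more robust variant, sketched below under the assumption that the template uses a `{mask}` placeholder (a convention introduced here, not part of the original notebook), reads the mask token from the pipeline's own tokenizer. It also strips the leading space that RoBERTa and BART include in `token_str`. The helper names `build_mask_prompt` and `top_fill_mask` are hypothetical.

```python
def build_mask_prompt(template, mask_token, placeholder="{mask}"):
    """Insert a model-specific mask token into a prompt template."""
    if placeholder not in template:
        raise ValueError(f"template must contain {placeholder!r}")
    return template.replace(placeholder, mask_token)


def top_fill_mask(model_path, template):
    """Return (token, score) for the top fill-mask prediction of a checkpoint."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    mask_filler = pipeline("fill-mask", model=model_path)
    # Each tokenizer knows its own mask token: "[MASK]" for BERT, "<mask>" for RoBERTa and BART.
    prompt = build_mask_prompt(template, mask_filler.tokenizer.mask_token)
    best = mask_filler(prompt)[0]
    # Byte-level BPE tokenizers return the token with a leading space, so strip it.
    return best["token_str"].strip(), round(best["score"], 4)
```

With this in place, `top_fill_mask(model_path, "The goal of Generative AI is to {mask} new content.")` could replace the body of the loop for each entry in `models_to_test`.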

### Task - 3: Question Answering

In [15]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

In [16]:
results_qa = {}

for name, model_path in models_to_test.items():
    print(f"--- Running Question Answering with {name} ---")
    try:
        # Task: Question Answering
        qa_pipeline = pipeline('question-answering', model=model_path)

        # Get answer
        output = qa_pipeline(question=question, context=context)

        answer = output['answer']
        score = round(output['score'], 4)
        results_qa[name] = f"'{answer}' (Score: {score})"
        print(f"Answer: {answer}\n")

    except Exception as e:
        # A missing QA head does not raise: the pipeline initializes it randomly and only warns.
        # This branch therefore only catches genuine loading or runtime errors.
        results_qa[name] = f"Error: {str(e)}"
        print(f"Result: {name} failed.\n")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--- Running Question Answering with BERT ---


Device set to use cuda:0
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: , and deepfakes

--- Running Question Answering with RoBERTa ---


Device set to use cuda:0
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: Generative AI poses significant risks such as hallucinations, bias, and deepfakes

--- Running Question Answering with BART ---


Device set to use cuda:0


Answer: as hallucinations, bias, and



In [17]:
qa_df = pd.DataFrame.from_dict(results_qa, orient='index', columns=['Extracted Answer'])
print(qa_df)

                                          Extracted Answer
BERT                     ', and deepfakes' (Score: 0.0156)
RoBERTa  'Generative AI poses significant risks such as...
BART         'as hallucinations, bias, and' (Score: 0.028)
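
The printed table truncates long cells, which hides the score for RoBERTa. One way to see the full text is to render the frame with `DataFrame.to_string()`, which does not apply the default column-width limit. The sketch below uses a hypothetical helper `results_table` and placeholder data, not values from this run.

```python
import pandas as pd


def results_table(results, column):
    """Render a {model_name: text} dict as a table without truncating long cells."""
    df = pd.DataFrame.from_dict(results, orient="index", columns=[column])
    # to_string() ignores the display.max_colwidth limit that print(df) applies.
    return df.to_string()


# Placeholder data for illustration only.
example = {"model_a": "'" + "a long extracted answer " * 4 + "' (Score: 0.1234)"}
print(results_table(example, "Extracted Answer"))
```

Calling `print(results_table(results_qa, 'Extracted Answer'))` in the cell above would show every answer and score in full.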


**Observations**  
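None of these answers is reliable. All three checkpoints are base models without a fine-tuned question-answering head: the warnings above show that `qa_outputs.weight` and `qa_outputs.bias` were newly initialized for each model, so the predicted spans come from random weights. BERT returned `, and deepfakes` (score 0.0156), BART returned `as hallucinations, bias, and` (score 0.028), and RoBERTa returned the entire context sentence. RoBERTa's span contains the answer only because it covers the whole context, so it is not evidence that RoBERTa is better at this task. Meaningful answers would require checkpoints fine-tuned on a QA dataset such as SQuAD.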
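
Whether a task head was actually loaded from the checkpoint can be checked in code instead of by reading warnings. The sketch below assumes the `output_loading_info=True` option of `from_pretrained`, which returns a dictionary that includes `missing_keys`; the helper names and the list of head prefixes are hypothetical and should be adapted to the model in question.

```python
def untrained_head_keys(missing_keys, head_prefixes=("qa_outputs", "lm_head", "classifier")):
    """Return the missing weights that belong to a task head.

    Any key listed here was randomly initialized instead of loaded,
    so predictions that depend on it are not meaningful.
    """
    prefixes = tuple(head_prefixes)
    return [key for key in missing_keys if key.startswith(prefixes)]


def check_qa_checkpoint(model_path):
    """Load a QA model and report head weights that were not in the checkpoint."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForQuestionAnswering

    _, info = AutoModelForQuestionAnswering.from_pretrained(
        model_path, output_loading_info=True
    )
    return untrained_head_keys(info["missing_keys"])
```

A non-empty result from `check_qa_checkpoint(model_path)` would indicate that the checkpoint needs fine-tuning before its QA scores can be compared.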

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | Failure | Repeated `.` until `max_new_tokens` was reached | BERT is an encoder-only model trained with masked language modelling; it is not trained to predict the next token. |
| | RoBERTa | Failure | Returned the prompt with no new text | RoBERTa uses the same encoder-only architecture as BERT with a revised pretraining recipe, so it has the same limitation. |
| | BART | Failure | Generated 100 tokens of incoherent text | BART has a decoder and can generate text in principle, but loaded as `BartForCausalLM` its `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized, so the output is essentially random. |
| **Fill-Mask** | BERT | Success | Predicted `create` (0.5397) | BERT is pretrained on Masked Language Modelling (MLM). |
| | RoBERTa | Success | Predicted `generate` (0.3711) | Like BERT, RoBERTa is pretrained on masked language modelling, but on a larger dataset. |
| | BART | Success | Predicted `create` (0.0746) | BART's denoising pretraining includes filling in masked spans, so it predicts a sensible word, although with lower confidence than the MLM-trained encoders. |
| **QA** | BERT | Failure | Returned the fragment `, and deepfakes` (score 0.0156) | The base checkpoint has no fine-tuned QA head; `qa_outputs` was newly initialized, so the extracted span is essentially random. |
| | RoBERTa | Failure (inconclusive) | Returned the entire context sentence | Same untrained `qa_outputs` head; the span contains the answer only because it covers the whole context. |
| | BART | Failure | Returned the fragment `as hallucinations, bias, and` (score 0.028) | Same untrained `qa_outputs` head; having a decoder does not help, because the QA pipeline extracts a span instead of generating text. |

