Name : Ankana Mandal

SRN  : PES2UG23CS076

# Task
Compare the performance of BERT, RoBERTa, and BART models on text generation, masked language modeling, and question answering tasks, and summarize the observations in a markdown table.

In [None]:
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')



In [None]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}


## Experiment 1: Text Generation

In [None]:
prompt = "The future of Artificial Intelligence is"

print("="*70)
print("EXPERIMENT 1: TEXT GENERATION")
print("="*70)

for name, model_id in models.items():
    print(f"\n{'='*70}")
    print(f"Model: {name}")
    print(f"{'='*70}")
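    try:
        generator = pipeline('text-generation', model=model_id, max_length=40)
        result = generator(prompt)
        # "SUCCESS" only means the pipeline ran without raising an exception;
        # the quality of the generated text still has to be judged by reading it.
        print("SUCCESS")
        print(f"Generated: {result[0]['generated_text']}")
    except Exception as e:
        print("FAILURE")
        print(f"Error: {str(e)[:100]}...")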


EXPERIMENT 1: TEXT GENERATION

Model: BERT



If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cpu


SUCCESS
Generated: The future of Artificial Intelligence is................................

Model: RoBERTa



If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`



Device set to use cpu


SUCCESS
Generated: The future of Artificial Intelligence is

Model: BART



Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cpu


SUCCESS
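
The `is_decoder=True` warnings in the log above point at the underlying issue: BERT and RoBERTa are encoder-only models whose attention is bidirectional, while next-token generation relies on a causal (left-to-right) mask. The sketch below only illustrates the difference between the two mask shapes. It is not the Transformers implementation, and `attention_mask` is a hypothetical helper introduced here for illustration.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len 0/1 matrix; row i marks the positions token i may attend to."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Encoder-style (BERT, RoBERTa): every token sees the whole sentence.
bidirectional = attention_mask(4, causal=False)

# Decoder-style: token i sees only tokens 0..i, which is what
# next-token prediction needs.
causal = attention_mask(4, causal=True)

for row in causal:
    print(row)
```

A model pretrained only with the bidirectional mask has never been asked to predict a token from its left context alone, which is consistent with the empty or degenerate continuations shown above.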


## Experiment 2: Masked Language Modeling (MLM)

In [None]:
sentences = {
    "BERT": "The goal of Generative AI is to [MASK] new content.",
    "RoBERTa": "The goal of Generative AI is to <mask> new content.",
    "BART": "The goal of Generative AI is to <mask> new content."
}

print("="*70)
print("EXPERIMENT 2: FILL-MASK (MASKED LANGUAGE MODELING)")
print("="*70)

for name, model_id in models.items():
    print(f"\n{'='*70}")
    print(f"Model: {name}")
    print(f"Input: {sentences[name]}")
    print(f"{'='*70}")
    try:
        mask_filler = pipeline("fill-mask", model=model_id)
        preds = mask_filler(sentences[name])
        print("SUCCESS - Top 3 Predictions:")
        for i, p in enumerate(preds[:3], 1):
            print(f"  {i}. '{p['token_str'].strip()}' (confidence: {p['score']*100:.1f}%)")
    except Exception as e:
        print(f"FAILURE")
        print(f"Error: {str(e)}")


EXPERIMENT 2: FILL-MASK (MASKED LANGUAGE MODELING)

Model: BERT
Input: The goal of Generative AI is to [MASK] new content.


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


SUCCESS - Top 3 Predictions:
  1. 'create' (confidence: 54.0%)
  2. 'generate' (confidence: 15.6%)
  3. 'produce' (confidence: 5.4%)

Model: RoBERTa
Input: The goal of Generative AI is to <mask> new content.


Device set to use cpu


SUCCESS - Top 3 Predictions:
  1. 'generate' (confidence: 37.1%)
  2. 'create' (confidence: 36.8%)
  3. 'discover' (confidence: 8.4%)

Model: BART
Input: The goal of Generative AI is to <mask> new content.


Device set to use cpu


SUCCESS - Top 3 Predictions:
  1. 'create' (confidence: 7.5%)
  2. 'help' (confidence: 6.6%)
  3. 'provide' (confidence: 6.1%)
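
Each model needs its own mask token in the input (`[MASK]` for BERT, `<mask>` for RoBERTa and BART), which is why the experiment keeps three near-identical sentences. A minimal sketch of building them from one template follows. `MASK_TOKENS`, `build_masked_sentence`, and the `{mask}` placeholder are hypothetical names introduced here. With a loaded tokenizer, the same value should be available as `tokenizer.mask_token`.

```python
# Mask tokens as used in the sentences of this experiment.
MASK_TOKENS = {"BERT": "[MASK]", "RoBERTa": "<mask>", "BART": "<mask>"}

def build_masked_sentence(template, model_name, placeholder="{mask}"):
    """Replace the single placeholder in `template` with the model's mask token."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, MASK_TOKENS[model_name])

template = "The goal of Generative AI is to {mask} new content."
sentences = {name: build_masked_sentence(template, name) for name in MASK_TOKENS}

for name, sentence in sentences.items():
    print(f"{name}: {sentence}")
```

Keeping one template avoids the three sentences drifting apart when the prompt is edited.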


## Experiment 3: Question Answering

In [None]:
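# NOTE: qa_question and qa_context are not defined in any cell shown here;
# they must be set in an earlier cell before this loop runs.
for name, model_id in models.items():
    print(f"\n[MODEL]: {name}")
    try:
        qa = pipeline("question-answering", model=model_id)
        res = qa(question=qa_question, context=qa_context)
        print(f"ANSWER: '{res['answer']}'")
        print(f"CONFIDENCE SCORE: {res['score']:.4f}")
    except Exception as e:
        print("STATUS: No QA head / poor output (expected for base models)")
        # Show the actual error so a real bug is not mistaken for an expected failure.
        print(f"Error: {str(e)[:100]}...")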




[MODEL]: BERT


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


ANSWER: ', and deepfakes'
CONFIDENCE SCORE: 0.0136

[MODEL]: RoBERTa


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


ANSWER: 'deepfakes'
CONFIDENCE SCORE: 0.0039

[MODEL]: BART


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


ANSWER: ','
CONFIDENCE SCORE: 0.0364
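
All three logs report that `qa_outputs.weight` and `qa_outputs.bias` were newly initialized, so the answer span is chosen by an untrained head. The toy sketch below, written in plain Python with made-up sizes and random stand-in hidden states, illustrates why such a head produces a nearly flat distribution over positions. It is an assumption-laden illustration, not the pipeline's actual computation.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

SEQ_LEN, HIDDEN = 20, 16  # toy sizes, far smaller than a real model

# Stand-ins for the encoder's per-token hidden states.
hidden_states = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]
# An untrained "start position" head: small random weights, zero bias.
start_weights = [random.gauss(0, 0.02) for _ in range(HIDDEN)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# One logit per token: dot product of its hidden state with the head weights.
logits = [sum(h * w for h, w in zip(token, start_weights)) for token in hidden_states]
start_probs = softmax(logits)

print(f"uniform baseline: {1 / SEQ_LEN:.3f}")
print(f"highest start probability: {max(start_probs):.3f}")
```

The pipeline's score is built from the start and end probabilities, so two nearly flat distributions give a very small product, which is consistent with the scores below 0.04 in the outputs above.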


### Deliverable: Observation Table


| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | *Failure* | *Returned the prompt followed by a run of periods, with no meaningful continuation.* | *BERT is an encoder-only model pretrained with masked language modeling and bidirectional attention; it was not trained to predict the next token.* |
| | RoBERTa | *Failure* | *Generated no new text; returned only the input prompt.* | *RoBERTa is also an encoder-only model, so it is not designed to predict the next token.* |
| | BART | *Failure* | *Generated text, but it was meaningless.* | *BART is an encoder-decoder model and can generate text in principle, but the `text-generation` pipeline loaded it as `BartForCausalLM`, and the log reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized, so the output is unreliable without fine-tuning.* |
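| **Fill-Mask** | BERT | *Success* | *Predicted 'create' (54.0%), 'generate' (15.6%), 'produce' (5.4%); the highest top-1 confidence of the three models.* | *BERT is pretrained on Masked Language Modeling (MLM), so this is its native task.* |
| | RoBERTa | *Success* | *Predicted 'generate' (37.1%), 'create' (36.8%), 'discover' (8.4%); confidence was split almost evenly between the two synonyms.* | *RoBERTa is also pretrained with MLM, using more data and longer training than BERT.* |
| | BART | *Success* | *Predicted 'create' (7.5%), 'help' (6.6%), 'provide' (6.1%); much lower confidence.* | *BART's denoising pretraining includes filling in masked spans, so it can do the task, but it is designed for seq2seq generation rather than single-token prediction, which likely spreads its confidence.* |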
| **QA** | BERT | *Failure* | *Extracted ', and deepfakes' (score: 0.0136).* | *The log shows that the `qa_outputs` head was newly initialized, so the span is picked by an untrained layer and is close to random. NSP (next sentence prediction) pretraining does not train this head.* |
| | RoBERTa | *Failure* | *Extracted only 'deepfakes' (score: 0.0039).* | *The cause is the same newly initialized `qa_outputs` head, not the removal of NSP from pretraining; the very low score reflects an untrained head.* |
| | BART | *Failure* | *Extracted only ',' (score: 0.0364).* | *The QA head on top of BART was also newly initialized, so the encoder-decoder architecture gives no advantage here and the output is close to random.* |

---