### **Assignment - 1: The Model Benchmark Challenge**

**Objective:** In this assignment, you will go beyond simply using a model and evaluate the architectural differences between BERT, RoBERTa, and BART. You will force these models to perform tasks they were not designed for and observe why architecture matters.

In [35]:
from transformers import pipeline, set_seed

In [28]:
set_seed(27)

## Text Generation:
Requires decoder-based architectures
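
A minimal sketch of why generation needs a decoder-style model: an encoder lets every position attend to every other position, while a decoder restricts each position to itself and earlier positions, which is what makes left-to-right generation well defined. The function names below are illustrative and are not part of `transformers`.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: token i may attend only to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    tokens = ["The", "future", "of", "AI", "is"]
    masks = [("encoder", bidirectional_mask(len(tokens))),
             ("decoder", causal_mask(len(tokens)))]
    for name, mask in masks:
        print(name)
        for tok, row in zip(tokens, mask):
            print(f"  {tok:>6}: {row}")
```

In the decoder mask the row for "is" sees the whole prompt but nothing after it, so the next token can be predicted and appended one step at a time.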

In [29]:
prompt = "The future of Artificial Intelligence is"

**Using BERT**

In [31]:
generator = pipeline("text-generation", model="bert-base-uncased")
output = generator(prompt, max_length=80, num_return_sequences=1)
print(output[0]['generated_text'])

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is................................................................................................................................................................................................................................................................


**Using RoBERTa**

In [32]:
generator = pipeline("text-generation", model="roberta-base")
output = generator(prompt, max_length=30, num_return_sequences=1)
print(output[0]['generated_text'])

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is


**Using BART**

In [33]:
generator = pipeline("text-generation", model="facebook/bart-base")
output = generator(prompt, max_length=30, num_return_sequences=1)
print(output[0]['generated_text'])

Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is Discipline Dublinraviolet 1850 ===== polymorph Gained polymorph Dublin Dublin maternity 1850 1850 wrapper wrapper Prosper premieruties prone Enix 1850 congressional Heights Pist Pistassisassis BCEille Pist Dublin Paradox Gained Pist asthma asthma Pistonian asthma asthmaassis Gained Gained 1850oves 71 greets Pistassis demeanor Pist Pistmuch Pist Gained Pist Discipline Pist Christensen BALL Pist Gained Immortal Pist Dublin Immortal Gained Gainedilleassis gener Gained Dublin improvised Gained BALL improvised Dublin hybrid Gained Discipline locom Discipline improvised Chronicle gener Gained greetsー� gener greets Signs Pist gener GainedVarious befriend generwhatille improvised Briggs Electro Tickets greets gener gener Dublin Gaineducket generians option Dublin Gained Dublin Chronicleilleoves Optional GainedSaid Dublinfoundland heresy Bank Dublin crus Write GainedSaid crusVarious Dublin crus Dublin crusfoundlandoros greetsoves slaying Dublin crus inco

## Fill-Mask:
Masking tasks are best handled by encoder models
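
Each checkpoint expects its own mask token, which is why the prompts in this section differ between BERT (`[MASK]`) and RoBERTa or BART (`<mask>`). A small sketch of building the prompt per model; `MASK_TOKENS` and `make_prompt` are hypothetical helper names introduced here for illustration.

```python
# Mask tokens as they appear in this notebook's prompts.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}


def make_prompt(template, model_name):
    """Fill the {mask} placeholder with the token the given checkpoint expects."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "The goal of Generative AI is to ({mask}) new content."
for name in MASK_TOKENS:
    print(name, "->", make_prompt(template, name))
```

With a loaded pipeline, the same string should also be available as `fill_mask.tokenizer.mask_token`, which avoids hard-coding it.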

In [13]:
# BERT's mask token is [MASK]
prompt2 = "The goal of Generative AI is to ([MASK]) new content."

**Using BERT**

In [14]:
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
fill_mask_output = fill_mask(prompt2)
print(fill_mask_output)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.4012771546840668, 'token': 3443, 'token_str': 'create', 'sequence': 'the goal of generative ai is to ( create ) new content.'}, {'score': 0.12370636314153671, 'token': 9699, 'token_str': 'generate', 'sequence': 'the goal of generative ai is to ( generate ) new content.'}, {'score': 0.05480790138244629, 'token': 5587, 'token_str': 'add', 'sequence': 'the goal of generative ai is to ( add ) new content.'}, {'score': 0.054711129516363144, 'token': 3965, 'token_str': 'produce', 'sequence': 'the goal of generative ai is to ( produce ) new content.'}, {'score': 0.05049882084131241, 'token': 4503, 'token_str': 'develop', 'sequence': 'the goal of generative ai is to ( develop ) new content.'}]


**Using RoBERTa**

In [17]:
# RoBERTa's mask token is <mask>, not BERT's [MASK]
prompt2 = "The goal of Generative AI is to (<mask>) new content."
fill_mask = pipeline("fill-mask", model="roberta-base")
fill_mask_output = fill_mask(prompt2)
print(fill_mask_output)

Device set to use cpu


[{'score': 0.6570649147033691, 'token': 32845, 'token_str': 'create', 'sequence': 'The goal of Generative AI is to (create) new content.'}, {'score': 0.05813491716980934, 'token': 29631, 'token_str': 'write', 'sequence': 'The goal of Generative AI is to (write) new content.'}, {'score': 0.030625075101852417, 'token': 23411, 'token_str': 'build', 'sequence': 'The goal of Generative AI is to (build) new content.'}, {'score': 0.024237658828496933, 'token': 26559, 'token_str': 'find', 'sequence': 'The goal of Generative AI is to (find) new content.'}, {'score': 0.021864596754312515, 'token': 42843, 'token_str': 'construct', 'sequence': 'The goal of Generative AI is to (construct) new content.'}]


**Using BART**

In [18]:
# BART also uses <mask>, so prompt2 from the RoBERTa cell is reused
fill_mask = pipeline("fill-mask", model="facebook/bart-base")
fill_mask_output = fill_mask(prompt2)
print(fill_mask_output)

Device set to use cpu


[{'score': 0.02457479014992714, 'token': 13138, 'token_str': 'prov', 'sequence': 'The goal of Generative AI is to (prov) new content.'}, {'score': 0.020460328087210655, 'token': 241, 'token_str': 're', 'sequence': 'The goal of Generative AI is to (re) new content.'}, {'score': 0.019175807014107704, 'token': 506, 'token_str': 'f', 'sequence': 'The goal of Generative AI is to (f) new content.'}, {'score': 0.018511200323700905, 'token': 463, 'token_str': 'and', 'sequence': 'The goal of Generative AI is to (and) new content.'}, {'score': 0.01797475852072239, 'token': 32845, 'token_str': 'create', 'sequence': 'The goal of Generative AI is to (create) new content.'}]
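
One way to compare the three fill-mask results is to look at the top-1 score and at how much probability mass the five returned candidates hold together. The sketch below uses the scores printed above, rounded to four decimals; `summarize` is a helper name introduced here.

```python
# Top-5 scores copied from the fill-mask outputs above (rounded to 4 decimals).
top5_scores = {
    "BERT": [0.4013, 0.1237, 0.0548, 0.0547, 0.0505],
    "RoBERTa": [0.6571, 0.0581, 0.0306, 0.0242, 0.0219],
    "BART": [0.0246, 0.0205, 0.0192, 0.0185, 0.0180],
}


def summarize(scores):
    """Return (top-1 score, total probability mass of the listed candidates)."""
    return max(scores), sum(scores)


for model, scores in top5_scores.items():
    top1, mass = summarize(scores)
    print(f"{model:8s} top-1={top1:.3f}  top-5 mass={mass:.3f}")
```

BART's five candidates together hold only about 10% of the probability mass, compared with roughly 68% for BERT and 79% for RoBERTa, so its predictions are spread thinly across the vocabulary.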


## Question Answering:
Requires task-specific fine-tuning for accurate results

In [34]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

**Using BERT**

In [20]:
qa = pipeline("question-answering", model="bert-base-uncased")
qa_output = qa(question=question, context=context)
print(qa_output)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.00846179248765111, 'start': 62, 'end': 81, 'answer': 'bias, and deepfakes'}


**Using RoBERTa**

In [21]:
qa = pipeline("question-answering", model="roberta-base")
qa_output = qa(question=question, context=context)
print(qa_output)

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.009185279253870249, 'start': 60, 'end': 81, 'answer': ', bias, and deepfakes'}


**Using BART**

In [22]:
qa = pipeline("question-answering", model="facebook/bart-base")
qa_output = qa(question=question, context=context)
print(qa_output)

Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'score': 0.04050577059388161, 'start': 20, 'end': 45, 'answer': 'significant risks such as'}
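
The `start` and `end` values in the outputs above are character offsets into `context`, so the answers can be checked without rerunning the models. The sketch below slices the context with the reported spans and counts how many of the three risks each span covers; `expected_risks` and `covered_risks` are names introduced here for illustration.

```python
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
expected_risks = ["hallucinations", "bias", "deepfakes"]

# (start, end) offsets copied from the three QA outputs above.
reported_spans = {
    "BERT": (62, 81),
    "RoBERTa": (60, 81),
    "BART": (20, 45),
}


def covered_risks(start, end):
    """List the expected risks that appear inside context[start:end]."""
    answer = context[start:end]
    return [risk for risk in expected_risks if risk in answer]


for model, (start, end) in reported_spans.items():
    hits = covered_risks(start, end)
    print(f"{model:8s} {context[start:end]!r} -> {len(hits)}/{len(expected_risks)} risks")
```

None of the three spans covers all three risks, which is consistent with the `qa_outputs` weights being newly initialized in every case.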


| Task | Model | Classification | Observation | Architectural Reason |
|------|-------|----------------|-------------|----------------------|
| Generation | BERT | Failure | Ran with warnings but produced only a long run of periods after the prompt | Encoder-only model trained to fill in masked words, not to generate left to right |
| Generation | RoBERTa | Failure | Ran with warnings but returned the prompt with nothing appended | Encoder-only model, like BERT |
| Generation | BART | Partial success | Generated tokens, but the text is incoherent | Encoder-decoder model; the pipeline loaded it as `BartForCausalLM` and reported `lm_head.weight` as newly initialized, so the output head was untrained |
| Fill-Mask | BERT | Success | Predicted sensible words: create, generate, add, produce, develop | Pretrained with masked language modeling (MLM) to predict missing words |
| Fill-Mask | RoBERTa | Success | Gave a higher-confidence prediction for "create" (0.66), followed by write, build, find, construct | Also MLM-pretrained, with a more robustly optimized training setup than BERT |
| Fill-Mask | BART | Partial success | Very low confidence (top score about 0.02); "create" appears alongside fragments such as prov, re, f | Pretrained with a sequence-to-sequence denoising objective rather than BERT-style token-level MLM |
| QA | BERT | Partial success | Extracted "bias, and deepfakes" with a score below 0.01, missing "hallucinations" | The `qa_outputs` head is newly initialized; the checkpoint is not fine-tuned for QA |
| QA | RoBERTa | Partial success | Extracted ", bias, and deepfakes", with a stray leading comma and a score below 0.01 | Base encoder with an untrained QA head |
| QA | BART | Partial success | Extracted "significant risks such as", which stops before any of the actual risks | Needs task-specific fine-tuning (for example on SQuAD) for QA |


### Key Insight

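Model performance depends strongly on the underlying architecture and on how the checkpoint was trained. Encoder-only models perform well on understanding tasks such as fill-mask, decoder-based models are required for text generation, and tasks such as question answering need a fine-tuned task head regardless of architecture.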
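
As a follow-up, the same three tasks could be rerun with checkpoints that match each task. The mapping below is a sketch under stated assumptions: the checkpoint choices (`gpt2`, `roberta-base`, `distilbert-base-cased-distilled-squad`) are suggestions that were not used in this notebook, and `build_pipeline` is a hypothetical helper.

```python
# Hypothetical task-to-checkpoint mapping; these choices are assumptions,
# not models evaluated in this notebook.
SUGGESTED_CHECKPOINTS = {
    "text-generation": "gpt2",  # decoder-only, trained for next-token prediction
    "fill-mask": "roberta-base",  # encoder-only, trained with masked language modeling
    "question-answering": "distilbert-base-cased-distilled-squad",  # fine-tuned on SQuAD
}


def build_pipeline(task):
    """Create a pipeline for `task` with the suggested checkpoint (downloads weights)."""
    # Imported lazily so the mapping can be inspected without transformers installed.
    from transformers import pipeline
    return pipeline(task, model=SUGGESTED_CHECKPOINTS[task])


if __name__ == "__main__":
    for task, checkpoint in SUGGESTED_CHECKPOINTS.items():
        print(f"{task:20s} -> {checkpoint}")
```

For example, `build_pipeline("question-answering")(question=question, context=context)` would run the QA prompt from this notebook against a checkpoint whose QA head has actually been trained.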