In [35]:
import transformers
from transformers import pipeline 

## Experiment 1: Text Generation

In [36]:
prompt = "The future of Artificial Intelligence is "

### Model 1: BERT (bert-base-uncased)

In [37]:
generate = pipeline('text-generation',model="bert-base-uncased")
op = generate(prompt)
print(op)

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 954.42it/s, Materializing param=cls.predictions.transform.dense.weight]                 
BertLMHeadModel LOAD REPORT from: bert-base-uncased
Key                         | Status     |  | 
----------------------------+------------+--+-
bert.pooler.dense.bias      | UNEXPECTED |  | 
cls.seq_relationship.bias   | UNEXPECTED |  | 
bert.pooler.dense.weight    | UNEXPECTED |  | 
cls.seq_relationship.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is .. the stories in on jay and and - this deep character and and for and and and and.. the and and or smoke i and -, in and and in and and in and and ".. it well. ". me work women ". ( ". i to is / or and and and and and and - - - -,. it it it it it that that " ( " ( " ) an ( " " ". it out all ty because more actually her a. that but\'- - - - - - - ( ). it it it it is\'- ( ( ( ( ) ) ( -, - ( ( ". that me so van a ) ) ( ( ) down yeth shit to rush how it it it it it that many into me is her " of, -, and. was belle, and and are and and ".. it it - " ". are and and and and and and. him ". him ". it that and " ( ) ) ) ( ". it, around the ring ( ) group ( ". the thele [ to now " " " (. go that that as the and around bus they an an is as or and the some some some some those'}]


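**Why did this happen?**  
BERT is an **encoder-only** model trained with Masked Language Modeling (MLM), not autoregressive generation. Its weights were learned with bidirectional attention, where every token sees both its left and right context, and it was never trained to predict the next token from the left context alone. The pipeline loads it as `BertLMHeadModel` and decodes one token at a time anyway (note the `is_decoder=True` hint in the log), so the model runs under conditions it never saw during training and the output is incoherent.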
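
The difference between bidirectional and left-to-right attention can be sketched with a toy mask. This is a minimal illustration, not the code Transformers uses internally; the function name `attention_mask` is hypothetical.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix.

    Entry [i][j] is 1 if position i may attend to position j, else 0.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Bidirectional: every token sees the whole sequence (how BERT was trained).
bidirectional_mask = attention_mask(4, causal=False)
# Causal: each token sees only itself and earlier tokens (next-token training).
causal_mask = attention_mask(4, causal=True)

for name, mask in (("bidirectional", bidirectional_mask), ("causal", causal_mask)):
    print(name)
    for row in mask:
        print("  ", row)
```

A model trained under the first mask has no learned notion of "predict what comes next", which is the skill text generation relies on.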

### Model 2: RoBERTa (roberta-base)

In [38]:
generate = pipeline('text-generation',model="roberta-base")
op = generate(prompt)
print(op)

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 822.01it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForCausalLM LOAD REPORT from: roberta-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of Artificial Intelligence is '}]


**Why did this happen?**  
RoBERTa is also an **encoder-only** model, similar to BERT but with an optimized training recipe. Like BERT, it was trained with MLM for understanding tasks, not for next-token prediction. Here the pipeline returned only the input prompt. The log does not say why, but a plausible explanation is that the model predicted an end-of-sequence or other special token right away, which is dropped when the output is decoded to text.
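
One plausible reading of the empty continuation is that decoding stopped at the first step. The toy loop below shows how that would leave the prompt unchanged. It is a sketch under that assumption: `greedy_decode`, the `EOS` marker, and the stand-in "models" are all hypothetical and do not reproduce what the pipeline does internally.

```python
EOS = "</s>"  # hypothetical end-of-sequence marker for this toy example


def greedy_decode(prompt_tokens, next_token_fn, max_new_tokens=5):
    """Append one token at a time until EOS or the token budget is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)
        if token == EOS:
            break  # generation ends; nothing more is appended
        tokens.append(token)
    return tokens


prompt_tokens = ["The", "future", "of", "AI", "is"]


def always_eos(tokens):
    """Stand-in for a model that predicts EOS immediately."""
    return EOS


def one_word_then_eos(tokens):
    """Stand-in for a model that emits a single word, then EOS."""
    return "bright" if tokens[-1] == "is" else EOS


print(greedy_decode(prompt_tokens, always_eos))
print(greedy_decode(prompt_tokens, one_word_then_eos))
```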

### Model 3: BART (facebook/bart-base)

In [39]:
generate = pipeline('text-generation',model="facebook/bart-base")
op = generate(prompt)
print(op)

Loading weights: 100%|██████████| 159/159 [00:00<00:00, 946.99it/s, Materializing param=model.decoder.layers.5.self_attn_layer_norm.weight]   
This checkpoint seem corrupted. The tied weights mapping for this model specifies to tie model.decoder.embed_tokens.weight to lm_head.weight, but both are absent from the checkpoint, and we could not find another related tied weight for those keys
BartForCausalLM LOAD REPORT from: facebook/bart-base
Key                                                           | Status     | 
--------------------------------------------------------------+------------+-
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn.q_proj.bias       | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn.k_proj.weight     | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn_layer_norm.weight | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.fc2.bias                    | UNEXPECTED | 
encoder.layers.{0, 1, 2, 3, 4, 5}.self_attn_layer_norm.bias   | UNEXPECTED | 
encoder.la

[{'generated_text': 'The future of Artificial Intelligence is  tragedy tragedy tragedy addon tragedy tragedyaston tragedy tragedyyour Rape unmarked unmarked unmarkedULTS updating updating extrad� Rape updating vertically extradReporting tragedy tragedy narcissstocks Stir addon inconsist Rapeominationampionomination classroom� mansionomination updatingDaniel plummetedDaniel unmarkedomination tragedy updating trace introduce Tow Tow Tow island theoristsAMPDaniel traceitta Tow helicopter trace Tow dopingDanielitusDanielDaniel unmarked mansion CFLULTS Firearmsinquinqu doping conflicted helicopter helicopterZip classroomitusitusDaniel helicopterpersframes inflammatoryitus casHoursNaz Tow inflammatory unmarkedHoursinqu Cent Centitus helicopter helicopter Firearms Firearms unmarkeditusitus Firearms unmarkedASED Tow unmarked Tow helicopter cod codAMPEY helicopterNaz helicopter Firearms Tow helicopter helicopter helicopter forge Tow TowitusAMP condesc Tow cas Tow helicopterJacob Towitus helicop

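**Why did this happen?**  
BART is an **encoder-decoder** model designed for seq2seq tasks, so it does have an autoregressive decoder. The `text-generation` pipeline, however, loads it as `BartForCausalLM`, which uses the decoder alone. The load report shows the encoder weights being discarded as UNEXPECTED, and the warning above it says that `model.decoder.embed_tokens.weight` and `lm_head.weight` were both absent from the checkpoint, which suggests they were not restored from the pretrained weights. The decoder therefore runs without the encoder input it was pretrained to condition on, and likely with mismatched embeddings, so it emits repetitive, nonsensical tokens. The failure comes from this loading mismatch more than from a lack of fine-tuning.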

## Experiment 2: Masked Language Modeling (Missing Word)

### Model 1: BERT (bert-base-uncased)

In [49]:
prompt = "The goal of Generative AI is to [MASK] new content."

In [50]:
generate = pipeline('fill-mask',model="bert-base-uncased")
op = generate(prompt)
print(op)

Loading weights: 100%|██████████| 202/202 [00:00<00:00, 920.96it/s, Materializing param=cls.predictions.transform.dense.weight]                 
BertForMaskedLM LOAD REPORT from: bert-base-uncased
Key                         | Status     |  | 
----------------------------+------------+--+-
bert.pooler.dense.bias      | UNEXPECTED |  | 
cls.seq_relationship.bias   | UNEXPECTED |  | 
bert.pooler.dense.weight    | UNEXPECTED |  | 
cls.seq_relationship.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[{'score': 0.5396925806999207, 'token': 3443, 'token_str': 'create', 'sequence': 'the goal of generative ai is to create new content.'}, {'score': 0.15575747191905975, 'token': 9699, 'token_str': 'generate', 'sequence': 'the goal of generative ai is to generate new content.'}, {'score': 0.05405499413609505, 'token': 3965, 'token_str': 'produce', 'sequence': 'the goal of generative ai is to produce new content.'}, {'score': 0.04451555386185646, 'token': 4503, 'token_str': 'develop', 'sequence': 'the goal of generative ai is to develop new content.'}, {'score': 0.017577461898326874, 'token': 5587, 'token_str': 'add', 'sequence': 'the goal of generative ai is to add new content.'}]


**Why did this happen?**  
BERT's **encoder-only** architecture is trained directly on the MLM objective. The bidirectional encoder processes the entire input at once and uses both left and right context to predict the masked token. This is the task BERT was designed for, hence the confident and sensible predictions such as 'create' (54.0%) and 'generate' (15.6%).
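
The percentages quoted here are just the `score` values from the pipeline output rendered with one decimal place. A small sketch, using the scores printed above (the variable names are illustrative):

```python
# Scores copied from the BERT fill-mask output above.
results = [
    {"token_str": "create", "score": 0.5396925806999207},
    {"token_str": "generate", "score": 0.15575747191905975},
    {"token_str": "produce", "score": 0.05405499413609505},
    {"token_str": "develop", "score": 0.04451555386185646},
    {"token_str": "add", "score": 0.017577461898326874},
]

# Format each score as a percentage with one decimal place.
percentages = {r["token_str"]: f"{r['score']:.1%}" for r in results}
for token, pct in percentages.items():
    print(f"{token:10s} {pct}")

# Probability mass covered by the top five candidates.
top5_mass = sum(r["score"] for r in results)
print(f"top-5 total: {top5_mass:.1%}")
```

The top five candidates together account for roughly 81% of the probability mass, so the remaining vocabulary shares the other 19% or so.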

### Model 2: RoBERTa (roberta-base)

In [51]:
prompt = "The goal of Generative AI is to <mask> new content."

In [52]:
generate = pipeline('fill-mask',model="roberta-base")
op = generate(prompt)
print(op)

Loading weights: 100%|██████████| 202/202 [00:00<00:00, 935.52it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForMaskedLM LOAD REPORT from: roberta-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[{'score': 0.37113264203071594, 'token': 5368, 'token_str': ' generate', 'sequence': 'The goal of Generative AI is to generate new content.'}, {'score': 0.36771315336227417, 'token': 1045, 'token_str': ' create', 'sequence': 'The goal of Generative AI is to create new content.'}, {'score': 0.08351451903581619, 'token': 8286, 'token_str': ' discover', 'sequence': 'The goal of Generative AI is to discover new content.'}, {'score': 0.021335037425160408, 'token': 465, 'token_str': ' find', 'sequence': 'The goal of Generative AI is to find new content.'}, {'score': 0.016521571204066277, 'token': 694, 'token_str': ' provide', 'sequence': 'The goal of Generative AI is to provide new content.'}]


**Why did this happen?**  
RoBERTa is an **encoder-only** model that refines BERT's training recipe (more data, dynamic masking, no NSP task). Like BERT, it is trained on MLM, so fill-mask is its native task. Its bidirectional encoder assigns nearly equal scores to 'generate' (37.1%) and 'create' (36.8%), meaning it treats both verbs as about equally plausible in this sentence. A single prompt is not enough to conclude that it outperforms BERT here.

### Model 3: BART (facebook/bart-base)

In [53]:
prompt = "The goal of Generative AI is to <mask> new content."

In [54]:
generate = pipeline('fill-mask',model="facebook/bart-base")
op = generate(prompt)
print(op)

Loading weights: 100%|██████████| 259/259 [00:00<00:00, 819.85it/s, Materializing param=model.shared.weight]                                  


[{'score': 0.0746152326464653, 'token': 1045, 'token_str': ' create', 'sequence': 'The goal of Generative AI is to create new content.'}, {'score': 0.06571860611438751, 'token': 244, 'token_str': ' help', 'sequence': 'The goal of Generative AI is to help new content.'}, {'score': 0.060880132019519806, 'token': 694, 'token_str': ' provide', 'sequence': 'The goal of Generative AI is to provide new content.'}, {'score': 0.035935692489147186, 'token': 3155, 'token_str': ' enable', 'sequence': 'The goal of Generative AI is to enable new content.'}, {'score': 0.03319466486573219, 'token': 1477, 'token_str': ' improve', 'sequence': 'The goal of Generative AI is to improve new content.'}]


**Why did this happen?**  
BART is an **encoder-decoder** model pretrained as a denoising autoencoder, and text infilling (reconstructing masked spans) is part of that pretraining. Given its own `<mask>` token (rather than BERT's `[MASK]`), it predicts mostly plausible fillers such as 'create' (7.5%), 'help' (6.6%), and 'provide' (6.1%). The scores are lower and flatter than those of BERT and RoBERTa, likely because BART is trained to regenerate whole sequences rather than to classify a single masked position.
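
Because the mask token differs between models, it helps to build the prompt from a template instead of hard-coding it. The sketch below uses a hand-written mapping taken from the prompts in this notebook; `MASK_TOKENS` and `build_prompt` are hypothetical names. With a loaded pipeline, the tokenizer's `mask_token` attribute should give the same string.

```python
# Mask tokens as used in the three fill-mask prompts in this notebook.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}

TEMPLATE = "The goal of Generative AI is to {mask} new content."


def build_prompt(template, model_name):
    """Insert the model-specific mask token into the template."""
    return template.format(mask=MASK_TOKENS[model_name])


for model_name in MASK_TOKENS:
    print(f"{model_name:20s} {build_prompt(TEMPLATE, model_name)}")
```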

## Experiment 3: Question Answering

In [45]:
question = "What are the risks?"
context= "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

### Model 1: BERT (bert-base-uncased)

In [46]:
generate = pipeline('question-answering', model="bert-base-uncased")
op = generate(question=question,context=context)
print(op)

Loading weights: 100%|██████████| 197/197 [00:00<00:00, 865.22it/s, Materializing param=bert.encoder.layer.11.output.dense.weight]              
BertForQuestionAnswering LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
bert.pooler.dense.weight                   | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
bert.pooler.dense.bias                     | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
qa_outputs.weight                          | MISSING    | 
qa_outputs.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can b

{'score': 0.010191475041210651, 'start': 72, 'end': 81, 'answer': 'deepfakes'}


**Why did this happen?**  
`bert-base-uncased` was never fine-tuned for QA. The load report lists `qa_outputs` (the layer that predicts the answer's start and end positions) as MISSING, so it was randomly initialized. The encoder still produces good contextual representations, but an untrained output layer cannot map them to the right span. The result is a very low confidence (1.0%) and an answer, 'deepfakes', that covers only one of the three risks and is best read as a lucky pick rather than a learned one.
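
The UNEXPECTED and MISSING labels in the load report amount to a set difference between the parameter names in the checkpoint and the names the task-specific model expects. A minimal sketch of that idea, using a handful of key names from the report above (the `load_report` function is hypothetical and not the library's implementation):

```python
def load_report(checkpoint_keys, model_keys):
    """Classify parameter names by comparing checkpoint and model."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    return {
        "LOADED": sorted(checkpoint_keys & model_keys),
        "UNEXPECTED": sorted(checkpoint_keys - model_keys),  # in file, unused
        "MISSING": sorted(model_keys - checkpoint_keys),     # newly initialized
    }


# A few names from the pretraining checkpoint (MLM and NSP heads included).
checkpoint = [
    "bert.encoder.layer.11.output.dense.weight",
    "bert.pooler.dense.weight",
    "cls.predictions.bias",
    "cls.seq_relationship.weight",
]
# A few names the question-answering model expects.
qa_model = [
    "bert.encoder.layer.11.output.dense.weight",
    "qa_outputs.weight",
    "qa_outputs.bias",
]

report = load_report(checkpoint, qa_model)
for status, keys in report.items():
    print(status, keys)
```

Anything under MISSING starts from random values, which is why the QA head needs fine-tuning before its answers mean anything.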

### Model 2: RoBERTa (roberta-base)

In [47]:
generate = pipeline('question-answering', model="roberta-base")
op = generate(question=question,context=context)
print(op)

Loading weights: 100%|██████████| 197/197 [00:00<00:00, 888.99it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]              
RobertaForQuestionAnswering LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.dense.bias              | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
qa_outputs.weight               | MISSING    | 
qa_outputs.bias                 | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


{'score': 0.005021810065954924, 'start': 43, 'end': 71, 'answer': 'as hallucinations, bias, and'}


**Why did this happen?**  
`roberta-base` is also an **encoder-only** model without QA-specific fine-tuning. Its `qa_outputs` layer is listed as MISSING and was newly initialized, so span selection is essentially untrained. However good the encoder's representations are, the model extracts an awkward fragment ('as hallucinations, bias, and') with very low confidence (0.5%).
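
A rough way to see why a randomly initialized `qa_outputs` layer yields scores this low: assume the span score is the product of a softmax probability for the start position and one for the end position (this is an assumption about the scoring, stated here for illustration). If the logits are nearly flat over `n` tokens, the best span scores about `1 / n**2`. The helper names and the value of `n` below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits):
    """Return (score, start, end) maximizing p_start[i] * p_end[j] with i <= j."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0.0, 0, 0)
    for i, ps in enumerate(p_start):
        for j in range(i, len(p_end)):
            score = ps * p_end[j]
            if score > best[0]:
                best = (score, i, j)
    return best


n = 20  # hypothetical number of tokens in question + context

# Untrained head: flat logits, so every span looks equally (un)likely.
flat_score, _, _ = best_span([0.0] * n, [0.0] * n)

# Trained head (idealized): one clear start and one clear end position.
start_logits = [0.0] * n
end_logits = [0.0] * n
start_logits[3] = 10.0
end_logits[5] = 10.0
peaked_score, peaked_start, peaked_end = best_span(start_logits, end_logits)

print(f"flat logits:   {flat_score:.4f}")
print(f"peaked logits: {peaked_score:.4f} (tokens {peaked_start}..{peaked_end})")
```

Under these assumptions the flat case lands in the same order of magnitude as the 0.5% to 1.0% scores seen in this experiment, while a confident head scores close to 1.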

### Model 3: BART (facebook/bart-base)

In [48]:
generate = pipeline('question-answering', model="facebook/bart-base")
op = generate(question=question,context=context)
print(op)

Loading weights: 100%|██████████| 259/259 [00:00<00:00, 817.40it/s, Materializing param=model.shared.weight]                                  
BartForQuestionAnswering LOAD REPORT from: facebook/bart-base
Key               | Status  | 
------------------+---------+-
qa_outputs.weight | MISSING | 
qa_outputs.bias   | MISSING | 

Notes:
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


{'score': 0.005432651843875647, 'start': 32, 'end': 66, 'answer': 'risks such as hallucinations, bias'}


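**Why did this happen?**  
BART is an **encoder-decoder** model, but the `question-answering` pipeline still uses it extractively: `BartForQuestionAnswering` adds a `qa_outputs` span head on top of the model, and the load report lists that head as MISSING, so it was newly initialized. The low confidence (0.5%) reflects this untrained head. The extracted span ('risks such as hallucinations, bias') happens to read more naturally than the BERT and RoBERTa spans, but with a random head this is not evidence that the architecture is better suited to QA. All three models need fine-tuning on a QA dataset before they can be compared meaningfully.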
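
The `start` and `end` values in the three QA outputs are character offsets into `context`, with `end` exclusive. Slicing the context with the offsets printed above recovers each answer string:

```python
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

# (start, end) offsets copied from the three question-answering outputs.
spans = {
    "bert-base-uncased": (72, 81),
    "roberta-base": (43, 71),
    "facebook/bart-base": (32, 66),
}

# Slice the context with each model's offsets to recover its answer.
answers = {model: context[start:end] for model, (start, end) in spans.items()}
for model, answer in answers.items():
    print(f"{model:20s} -> {answer!r}")
```

None of the spans covers the full list 'hallucinations, bias, and deepfakes', which would correspond to the slice `context[46:81]`.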

## Deliverable: Observation Table

| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | **Failure** | Generated incoherent text with unrelated words such as "jay", "smoke", "ty", "belle", plus long runs of "and" and punctuation. No logical sequence. | BERT is an encoder-only model trained for MLM with bidirectional attention. It was never trained to predict the next token from left context alone, so decoding token by token produces noise. |
| | RoBERTa | **Failure** | Returned only the input prompt with no generation: "The future of Artificial Intelligence is " | RoBERTa is also Encoder-only, trained for MLM like BERT. It's not designed for autoregressive generation and couldn't produce new tokens. |
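| | BART | **Partial Failure** | Generated repetitive nonsensical words like "tragedy tragedy tragedy addon", "Tow", "helicopter", "Firearms", "inflammatory", etc. | BART has an encoder and a decoder, but the `text-generation` pipeline loaded it as `BartForCausalLM`, which uses the decoder alone. The encoder weights were discarded as UNEXPECTED and the log warns that `lm_head.weight` and `model.decoder.embed_tokens.weight` were absent from the checkpoint, so the decoder ran outside the seq2seq setup it was pretrained in. |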
| **Fill-Mask** | BERT | **Success** | Correctly predicted 'create' (54.0%), 'generate' (15.6%), 'produce' (5.4%), 'develop' (4.5%), 'add' (1.8%). | BERT is specifically trained on Masked Language Modeling (MLM) task, making it excel at predicting masked tokens based on context. |
| | RoBERTa | **Success** | Correctly predicted 'generate' (37.1%), 'create' (36.8%), 'discover' (8.4%), 'find' (2.1%), 'provide' (1.7%). | RoBERTa is an optimized version of BERT, also trained on MLM. It splits its confidence almost evenly between 'generate' and 'create', both of which fit the sentence. |
| | BART | **Success** | Predicted 'create' (7.5%), 'help' (6.6%), 'provide' (6.1%), 'enable' (3.6%), 'improve' (3.3%). With the proper `<mask>` token, the predictions are reasonable. | BART's **Encoder-Decoder** architecture is pretrained with denoising objectives that include text infilling, so it can handle fill-mask when given its own mask token. Confidence is lower than for BERT/RoBERTa, but most predictions are plausible. |
| **QA** | BERT | **Partial Success** | Extracted 'deepfakes' with low confidence (1.0%). Only partial answer to "What are the risks?" | BERT base model wasn't fine-tuned for QA. The qa_outputs layer was randomly initialized (MISSING in checkpoint), leading to poor performance. |
| | RoBERTa | **Partial Success** | Extracted 'as hallucinations, bias, and' with very low confidence (0.5%). Awkward phrase extraction. | RoBERTa base also lacks QA fine-tuning. The qa_outputs layer was newly initialized, resulting in suboptimal span selection. |
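| | BART | **Partial Success** | Extracted 'risks such as hallucinations, bias' with low confidence (0.5%). Reads more naturally than the BERT/RoBERTa spans but still omits 'deepfakes'. | The pipeline uses BART extractively through `BartForQuestionAnswering`, whose qa_outputs layer was newly initialized (MISSING in checkpoint). The span comes from an untrained head, so it does not show an architectural advantage; fine-tuning is needed. |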
