<a href="https://colab.research.google.com/github/CrisMcode111/DI_Bootcamp/blob/main/w6_d2Exercises_XP_Day2_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises XP: Day 2
Follow the instructions. Where you see TODO, add your answer before running/continuing.


## What You'll Learn
- Deepen your understanding of core LLM concepts.
- Apply theory to practical scenarios.
- Develop critical thinking on LLM applications and ethics.
- Compare/contrast transformer architectures and techniques.


## What You Will Create
- Comparative tables (NLP paradigms and BERT variants)
- Architecture/application write-ups
- Pretraining benefits and ethical considerations
- Analyses on attention and positional encoding
- Model selection justifications across tasks
- Notes on softmax temperature and its effects
- Scenario-based answers applying learned concepts


## 🌟 Exercise 1: Traditional vs. Modern NLP: Comparative Analysis
1) Complete the table (replace each TODO):

| Aspect | Traditional NLP | Modern NLP |
|---|---|---|
| Feature Engineering | Manual: human-designed rules and statistical features (bag-of-words, TF-IDF, n-grams) | Automatic: neural networks learn representations directly from data (embeddings, contextual vectors) |
| Word Representations | Sparse and context-independent (one-hot, TF-IDF) | Dense, either static (Word2Vec, GloVe) or contextual (BERT, GPT) |
| Model Architectures | Statistical or classic ML models (Naive Bayes, SVM, Logistic Regression) | Deep learning architectures: RNNs, LSTMs, Transformers |
| Training Methodology | Fully supervised: requires labeled data for each task | Pretraining on large unlabeled corpora + fine-tuning on specific tasks |
| Key Examples | spaCy, NLTK, bag-of-words sentiment models | BERT, GPT-2/3, T5, LLaMA, ChatGPT |
| Advantages | Simple, fast, easy to interpret, low computational cost | High performance, captures context, strong generalization |
| Disadvantages | Limited accuracy, no deep contextual understanding, requires manual feature design | Complex, resource-intensive, less interpretable, harder to train |

2) Discuss: How did the shift to modern NLP impact scalability and efficiency?

The shift to modern NLP transformed scalability and efficiency through representation learning and transfer learning.
Instead of relying on manually crafted features, neural networks now learn language representations automatically, which scales easily across tasks and domains.
With the rise of Transformer architectures, computation became highly parallelizable: models can process entire sequences simultaneously, unlike RNNs, which handle tokens step by step.
This parallelism maps well onto GPUs and TPUs, dramatically reducing training time on massive datasets.
Pretrained models like BERT and GPT popularized transfer learning, enabling one model to be fine-tuned efficiently for many tasks with minimal labeled data.
As a result, NLP systems became both more scalable and more efficient, reusing shared knowledge instead of starting from scratch for every problem.
However, this progress also came with increased hardware demands and energy consumption, creating a new trade-off between performance and efficiency.
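
To make the "manual feature engineering" row of the table concrete, here is a minimal sketch of a bag-of-words representation in plain Python. The two example sentences and the helper name `bag_of_words` are assumptions made for illustration; they are not part of the exercise. The point is only that such vectors are produced by counting, with no learning and no notion of context.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the movie was great", "the movie was not great"]

# Hand-built vocabulary: every distinct word gets its own dimension.
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

The two vectors differ in a single count (the dimension for "not"), although the sentiment flips. Contextual representations are intended to capture exactly this kind of interaction between words.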


## 🌟 Exercise 2: LLM Architecture and Application Scenarios
For each, describe (a) core architectural differences, (b) a real application, (c) why it fits.

### BERT
- Bidirectional encoder-only; pre-trained using Masked Language Modeling (MLM) and historically Next Sentence Prediction (NSP).
- Application: Question Answering (e.g., searching for answers within a document).
- Why: BERT's bidirectional nature allows it to understand the context of a word based on both the words that come before and after it in a sentence, which is crucial for accurately identifying the answer span in a question answering task. The encoder-only architecture is well-suited for tasks that require understanding the entire input sequence, rather than generating new text.

### GPT
- Decoder-only; autoregressive causal language model.
- Application: Text generation (e.g., writing stories, articles, or code).
- Why: GPT's decoder-only architecture and autoregressive nature make it excellent at generating sequential text, predicting the next token based on the preceding ones. This is ideal for creative writing, chatbots, and any task where the output is a continuation of the input.

### T5
- Encoder-decoder; pre-trained using a text-to-text framework, often involving denoising objectives.
- Application: Machine Translation (e.g., translating text from one language to another).
- Why: T5's encoder-decoder architecture is inherently suited for sequence-to-sequence tasks like translation, where the input is a sequence in one language and the output is a sequence in another. The text-to-text framework allows it to handle various translation tasks under a unified approach.
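
One way to see the difference between the encoder-only and decoder-only designs is to look at which positions each token is allowed to attend to. The sketch below builds the two attention masks for a three-token input; the function names and the toy sentence are assumptions for illustration, not code from any particular library.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "sat"]
enc = bidirectional_mask(len(tokens))
dec = causal_mask(len(tokens))

for name, mask in [("bidirectional (BERT-style)", enc), ("causal (GPT-style)", dec)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>4} attends to {row}")
```

An encoder-decoder model such as T5 combines both: bidirectional attention inside the encoder, causal attention inside the decoder, plus cross-attention from the decoder to the encoder output.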

## 🌟 Exercise 3: Benefits and Ethics of Pre-training
- Benefits (explain each in your words):
  1) Improved generalization: Pre-training helps a model learn general language patterns before it learns a specific task.
     Because it has already seen millions of examples, it can recognize grammar, meanings, and relationships between words.
     This makes it easier for the model to perform well on new or unseen data, even when it wasn't trained directly on it.
     In short, it starts from broad language knowledge and can adapt faster to different problems.
  2) Less labeled data: Pre-training allows models to learn a lot from unlabeled text, like articles or conversations, before fine-tuning.
     When we later train them on a specific task (like spam detection or translation), they already "know" language patterns.
     This means we need far fewer labeled examples, saving time and effort because labeling data is expensive and slow.
     The model reuses its previous knowledge instead of starting from zero.
  3) Faster fine-tuning: Because the model is already trained on a huge amount of text, it starts with a good understanding of language.
     When we fine-tune it for a specific task, it only needs to adjust slightly, not learn everything from the beginning.
     This makes the training process much faster and requires less computing power.
     It's like teaching a new skill to someone who already speaks the language.
  4) Transfer learning: Transfer learning means taking a model that was already trained on one big, general task and reusing it for another, smaller task.
     The model "transfers" its language knowledge (grammar, meaning, and context) to new problems like sentiment analysis or question answering.
     This saves a lot of time and data because we don't need to train a new model from scratch every time.
     It helps achieve good results even when the new dataset is small.
  5) Robustness: Pre-trained models tend to be more stable and reliable because they've already seen many types of text, topics, and writing styles.
     This helps them handle spelling mistakes, slang, or unusual phrasing better than models trained on small datasets.
     They don't get confused as easily by noise or rare words.
     In short, pre-training makes models more resistant to errors and better at understanding real-world language.

- Ethical concerns (bias, misinformation, misuse, privacy): Pre-trained models learn from huge amounts of text collected from the internet, and that text often includes biases, stereotypes, and misinformation.
  As a result, the model can accidentally repeat or amplify unfair or false ideas.
  There is also a risk of misuse, like generating fake news or harmful content.
  Another problem is privacy, since training data may contain personal information that should not be exposed.
  That's why developers must carefully check data sources and apply filters or ethical rules before using these models.
- Mitigations (data curation, differential privacy, safety filters, RLHF, audits): To reduce ethical risks, developers use several safety methods.
  Data curation helps by cleaning and selecting high-quality, diverse, and less biased text sources.
  Differential privacy limits how much any single training record can influence the model, which makes individual information much harder to trace.
  Safety filters and RLHF (Reinforcement Learning from Human Feedback) teach the model to avoid toxic, biased, or unsafe outputs.
  Regular audits and evaluations help detect problems early and keep the system transparent and responsible.
  Together, these steps make modern NLP models safer and more trustworthy.

## 🌟 Exercise 4: Transformer Architecture Deep Dive
### Self-Attention & Multi-Head Attention
- Q/K/V flow, softmax weighting, and value mixing: In a Transformer, each token's representation is projected into three vectors, Query (Q), Key (K), and Value (V), through learned linear layers.
  The model compares every Query with all Keys (using dot products) to measure how strongly each token should attend to the others.
  These similarity scores are scaled by the square root of the key dimension and passed through a softmax function to turn them into attention weights (for each token, they sum to 1).
  The weights are then used to mix the Value vectors, which gives a new representation for each token that includes information from its context.
  In Multi-Head Attention, several attention heads run in parallel within the same layer, each focusing on different types of relationships (syntax, meaning, position, etc.).
  Their outputs are then concatenated and passed through a final linear projection, giving the model a richer and more complete understanding of the sequence.
- Why multiple heads help (subspace projections, diverse relations): Using multiple attention heads allows the model to look at the same sentence from different perspectives.
  Each head learns a different subspace projection of the input. For example, one head may focus on grammar, another on meaning, and another on long-distance dependencies.
  By combining these diverse attention patterns, the model captures richer and more complex relationships between words.
  This makes the overall understanding more accurate and robust than using a single head.
- Example sentence (different from lesson): "The cat sat on the mat because it was warm."
  - One attention head might focus on the grammatical structure, linking "cat" (subject) with "sat" (verb).
  - Another head could capture the coreference relationship, connecting "it" with its candidate antecedents ("the cat" or "the mat") to resolve what "it" refers to.
  - This shows how different heads can specialize, one in syntax and another in meaning or reference, to build a complete understanding of the sentence.
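
The Q/K/V flow described in this section can be sketched for a single head in plain Python. This is a minimal illustration under stated assumptions: the vectors are hand-picked toy values, whereas in a real model Q, K, and V come from learned linear projections, and the function names are chosen here for readability.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted mix of the value vectors, one output dimension at a time.
        mixed = [sum(w * v[dim] for w, v in zip(weights, values))
                 for dim in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors for three tokens (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Multi-head attention would run several such heads on different projections of the same input and concatenate their outputs before a final linear layer.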

### Pre-training Objectives
- MLM vs CLM:
  - MLM (Masked Language Modeling) is used in models like BERT.
    It randomly hides some tokens in a sentence and trains the model to predict the missing tokens from the context around them.
    This helps the model use both left and right context, making it a good fit for tasks like classification, QA, or sentence understanding.
  - CLM (Causal Language Modeling) is used in models like GPT.
    The model reads text from left to right and learns to predict the next token in a sequence.
    This gives it strong generative abilities, making it ideal for text completion, dialogue, and creative generation tasks.
- When to prefer MLM vs CLM:
  - Choose MLM when the goal is to understand text, not generate it, for example in classification, sentiment analysis, or extractive question answering.
    MLM lets the model see the full context (both sides) of each word, which improves comprehension and representation quality.
  - Choose CLM when the goal is to generate text, as in chatbots, translation, story writing, or code generation.
    CLM trains the model to think "forward," predicting the next token naturally, which makes it fluent and coherent in generation tasks.
- NSP (Next Sentence Prediction): Early BERT models used NSP to help the model understand relationships between sentences, for example whether one sentence actually follows another in the source text.
  This was intended to help with tasks like question answering or natural language inference, where sentence connections matter.

  However, later research (for example RoBERTa) showed that NSP did not significantly improve downstream performance and that removing it could even help slightly. One common explanation is that the task is too easy: negative pairs are drawn from different documents, so the model can rely on topic cues instead of learning real coherence.
  Modern models often remove NSP and rely on other pre-training choices (like Sentence Order Prediction, SOP, or longer contiguous input sequences) that teach sentence relationships more effectively.
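
The difference between the two objectives is easiest to see in how training examples are built from the same sentence. The sketch below is a simplified illustration, not BERT's or GPT's actual preprocessing: the helper names, the `mask_prob` value, and the seed are assumptions, and BERT's real scheme additionally replaces some selected tokens with random tokens or leaves them unchanged.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Mask random tokens; labels keep the original token only at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels

def make_clm_example(tokens):
    """Shift by one: each position is trained to predict the next token."""
    return tokens[:-1], tokens[1:]

tokens = "the cat sat on the mat".split()

mlm_inputs, mlm_labels = make_mlm_example(tokens, mask_prob=0.3, seed=1)
clm_inputs, clm_targets = make_clm_example(tokens)

print("MLM inputs :", mlm_inputs)
print("MLM labels :", mlm_labels)
print("CLM inputs :", clm_inputs)
print("CLM targets:", clm_targets)
```

In the MLM case the model sees tokens on both sides of each `[MASK]`; in the CLM case the target at each position is simply the token that comes next.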

### Transformer Model Selection
- Sentiment on reviews:
  - Model type: Encoder-only (like BERT)
  - Justification: Sentiment analysis focuses on understanding the text, not generating new sentences.
    Encoder-only models read the entire input and build a deep contextual representation of its meaning, using both left and right context.
    This makes them ideal for classification tasks such as detecting positive or negative opinions in reviews.
    Decoder-only or encoder-decoder models are better suited for generation tasks, like translation or summarization, than for simple classification.
- Conversational chatbot (creative responses):
  - Model type: Decoder-only (like GPT-2, GPT-3, or ChatGPT)
  - Justification: A chatbot needs to generate new text, not just understand it.
    Decoder-only models are trained with a causal language modeling objective (predicting the next token), which makes them excellent at producing fluent, coherent, and creative responses.
    They naturally handle open-ended dialogue and can continue a conversation in context.
    Encoder-only models are built to analyze text rather than generate it, and encoder-decoder models are mainly used for structured generation (like translation or summarization), not free-form dialogue.
- Technical document translation (EN→ES):
  - Model type: Encoder-Decoder (like T5, BART, or MarianMT)
  - Justification: Translation requires both understanding the source text and generating the target text.
    The encoder reads and builds a full representation of the English sentence, while the decoder uses that information to produce the Spanish version token by token.
    This separation helps the model preserve meaning and grammar across languages.
    Encoder-only models can understand but not generate, and decoder-only models can generate but have no separate encoder whose output the decoder attends to for source-target alignment.

### Positional Encoding
- Why it is needed: Transformers process all tokens in parallel, so they don't naturally know the order of words in a sentence.
  Positional encoding adds information about each word's position, either absolute (its place in the sequence) or relative (its distance from other words).
  This helps the model understand word order, syntax, and meaning changes caused by position.
- Example failure without positions: If the model had no positional information, it would treat the sentences
  "The cat chased the dog" and "The dog chased the cat"
  as exactly the same, because they contain the same words.
  Without knowing the order, it can't tell who is doing the action and who receives it.
  This would completely change the meaning and lead to wrong predictions or translations.
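
The failure case above can be reproduced with a small sketch. It uses the sinusoidal absolute encoding from the original Transformer paper as one possible choice; the 4-dimensional word embeddings are hand-picked, hypothetical values, and comparing sorted vector lists is a simplification standing in for the fact that attention without positions sees the input as an unordered collection.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal absolute positional encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        if i + 1 < d_model:
            encoding.append(math.cos(angle))  # odd dimension
    return encoding

# Hypothetical word embeddings, chosen by hand for illustration.
embeddings = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "cat": [1.0, 0.0, 0.0, 0.0],
    "chased": [0.0, 1.0, 0.0, 0.0],
    "dog": [0.0, 0.0, 1.0, 0.0],
}

def embed(sentence, use_positions):
    """Return one input vector per token, optionally adding position information."""
    vectors = []
    for pos, word in enumerate(sentence.split()):
        vec = embeddings[word]
        if use_positions:
            pe = sinusoidal_encoding(pos, len(vec))
            vec = [v + p for v, p in zip(vec, pe)]
        vectors.append(vec)
    return vectors

a, b = "the cat chased the dog", "the dog chased the cat"

# Without positions, both sentences are the same collection of vectors.
same_without = sorted(embed(a, False)) == sorted(embed(b, False))
# With positions, "cat" at position 1 differs from "cat" at position 4.
same_with = sorted(embed(a, True)) == sorted(embed(b, True))

print("identical without positions:", same_without)
print("identical with positions:   ", same_with)
```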


## 🌟 Exercise 5: BERT Variations: Choose Your Detective
Assign the best fit and justify:
- Scenario 1 (mobile, limited resources): DistilBERT - Smaller and faster than BERT, making it suitable for deployment on mobile devices or environments with limited computational resources.
- Scenario 2 (legal docs, high accuracy): RoBERTa - Trained on a larger dataset with a modified training objective, leading to improved performance and accuracy, crucial for tasks requiring high precision like analyzing legal documents.
- Scenario 3 (multilingual support): XLM-R - Pre-trained on a massive corpus of multilingual data, making it highly effective for cross-lingual transfer and tasks requiring understanding or processing text in multiple languages.
- Scenario 4 (efficient pretraining with token replacement detection): ELECTRA - Uses a more efficient pre-training task (replaced token detection) that allows it to achieve better performance than BERT with less computational cost during pre-training.
- Scenario 5 (efficient NLP, constrained environments): ALBERT - Uses parameter-reduction techniques (cross-layer parameter sharing and factorized embeddings) to significantly reduce the number of parameters, which makes it much more memory-efficient and a good fit for memory-constrained environments. Note that parameter sharing alone does not reduce the computation per forward pass.

Completed table:

| Model | Training differences | Size/Efficiency | Innovations | Ideal use cases |
|---|---|---|---|---|
| RoBERTa | Trained longer on more data, dynamic masking, no NSP | Same architecture as BERT (slightly more parameters from a larger vocabulary); costlier pretraining, improved performance | Dynamic masking, larger batch sizes | High-accuracy tasks, general NLP tasks |
| ALBERT | Parameter sharing across layers, factorized embedding parameterization | Far fewer parameters than BERT (smaller memory footprint); per-layer compute is not reduced by sharing | Parameter reduction techniques | Memory-constrained environments, fine-tuning tasks |
| DistilBERT | Knowledge distillation from BERT | Smaller and faster than BERT (reduced layers) | Knowledge distillation | Mobile/edge devices, latency-sensitive applications |
| ELECTRA | Replaced Token Detection (RTD) as pre-training task | More efficient pre-training than BERT | Discriminator-based pre-training | Efficient pre-training, good performance with smaller models |
| XLM-R | Trained on 100 languages using Masked Language Modeling | Large, multilingual | Cross-lingual pre-training | Multilingual tasks, cross-lingual transfer |
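
ALBERT's factorized embedding parameterization can be illustrated with a short parameter-count calculation. The sizes below (vocabulary 30,000, hidden size 768, embedding size 128) are illustrative assumptions chosen for this example, and `embedding_params` is a hypothetical helper, not a library function.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token-embedding block.

    Without factorization: one V x H matrix.
    With factorization: a V x E matrix followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30_000, 768, 128  # assumed sizes, for illustration only

full = embedding_params(V, H)
factorized = embedding_params(V, H, E)

print(f"unfactorized: {full:,} parameters")
print(f"factorized:   {factorized:,} parameters")
print(f"reduction:    {full / factorized:.1f}x")
```

The saving comes from decoupling the vocabulary-sized matrix from the hidden size; cross-layer parameter sharing reduces the parameter count further but is a separate technique.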

## 🌟 Exercise 6: Softmax Temperature: The Randomness Regulator
### 1) Temperature Scenarios
- T=0.2: The temperature is very low, so the model becomes confident and conservative.
  It mostly chooses the most probable token, producing predictable and often repetitive text.
  Good for factual or precise answers, but creativity drops.
- T=1.0: This is the default, balanced temperature: the model samples from its unmodified probability distribution.
  The model mixes accuracy and diversity, producing coherent yet natural responses.
  It's a reasonable starting point for most conversational and creative tasks.
- T=1.5: A high temperature makes the model more random and creative.
  It explores unusual or unexpected words and ideas, which can make text more interesting but sometimes less logical or coherent.
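
The effect of the three temperatures in this exercise can be checked numerically. The sketch below divides the logits by T before applying softmax, which is the usual definition of temperature scaling; the candidate words and their logits are made-up values for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for four candidate words.
candidates = ["the", "a", "this", "purple"]
logits = [2.0, 1.0, 0.5, -1.0]

top_probs = {}
for T in [0.2, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, T)
    top_probs[T] = probs[0]  # probability of the most likely word
    print(f"T={T}:", {w: round(p, 3) for w, p in zip(candidates, probs)})
```

As T decreases, the probability mass concentrates on the most likely word; as T increases, the distribution flattens and unlikely words are sampled more often.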

### 2) Application Design
- Bedtime stories (creativity vs coherence): For bedtime stories, we want the text to be imaginative but still easy to follow.
  A good strategy is to use a medium-to-high temperature, around 0.8-1.2.
  This gives the model enough creativity to invent magical scenes, new characters, and playful twists without becoming too random or confusing.
  If the temperature is too low, the story will sound repetitive and dull; if it's too high, it might lose structure and logic.
- Financial report summaries (accuracy/reliability): For financial texts, precision and factual consistency are more important than creativity.
  Use a low temperature, around 0.2-0.4, so the model stays focused on the most probable and reliable words.
  This reduces randomness and lowers the risk of generating false or exaggerated information, although it does not guarantee factual accuracy.
  Alternatively, use a deterministic decoding strategy such as greedy or beam search, which removes sampling randomness altogether and keeps the output stable and reproducible.

### 3) Temperature & Bias
- How temperature influences bias surfacing or dampening: Temperature affects not only creativity but also how strongly a model's biases appear in its responses.
  At a low temperature, the model mostly chooses its most confident (high-probability) answers, which can reinforce existing biases, since those are often learned as dominant patterns from training data.
  At a higher temperature, the model samples more diverse words, which can dilute those biases in any single output (the underlying learned distribution is unchanged) but might also introduce random or inconsistent phrasing.

  Example:
  If asked to describe a "leader," a model with low temperature might almost always say "a man in a suit," repeating a bias seen in data.
  At a higher temperature, it could produce varied answers like "a woman inspiring her team" or "a young activist organizing change."


### (Optional) Quick Generation Demo Across Temperatures
Note: Requires `transformers` and model download; skip if offline.


```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_name = "gpt2"  # TODO: choose a causal LM
prompt = "Artificial intelligence will"  # TODO: write your own prompt

# Build a text-generation pipeline; use the first GPU if available, otherwise the CPU.
pipe = pipeline(
    "text-generation",
    model=AutoModelForCausalLM.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name),
    device=0 if torch.cuda.is_available() else -1,
)

# Sample one continuation per temperature; sampling must be on for temperature to matter.
for T in [0.2, 1.0, 1.5]:
    out = pipe(prompt, max_new_tokens=40, temperature=T, do_sample=True)
    print("--- Temperature:", T)
    print(out[0]["generated_text"])  # Inspect how style changes with T
```
