<a href="https://colab.research.google.com/github/Naomie25/DI-Bootcamp/blob/main/Week8_Day1_ExerciceXP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 Exercise 1: Traditional vs. Modern NLP: A Comparative Analysis

| **Aspect**                 | **Traditional NLP**                                                                              | **Modern NLP**                                                                                                       |
| -------------------------- | ------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------- |
| **Feature Engineering**    | Manual feature engineering (e.g., TF-IDF, POS tags, dependency parsing)                          | Automatic feature extraction via deep learning models                                                                |
| **Word Representations**   | Sparse or static representations (e.g., one-hot vectors, Word2Vec, GloVe)                        | Contextual embeddings (e.g., BERT, RoBERTa)                                                                          |
| **Model Architectures**    | Shallow models (e.g., Naïve Bayes, SVM, Logistic Regression)                                     | Deep models (e.g., Transformers, LSTMs, GRUs)                                                                        |
| **Training Methodology**   | Task-specific training from scratch                                                              | Pre-training on large corpora + fine-tuning on downstream tasks                                                      |
| **Key Examples of Models** | Naïve Bayes, SVM, HMM, CRF                                                                       | BERT, GPT, RoBERTa, T5, LLaMA                                                                                        |
| **Advantages**             | - Simple, interpretable<br>- Low resource requirements<br>- Fast to train                        | - State-of-the-art performance<br>- Handles complex language tasks<br>- Adaptable across tasks via transfer learning |
| **Disadvantages**          | - Labor-intensive feature design<br>- Limited in capturing context<br>- Struggles with ambiguity | - High computational cost<br>- Requires large datasets<br>- Less interpretable                                       |
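To make "manual feature engineering" concrete, here is a minimal TF-IDF sketch in plain Python. It assumes whitespace tokenization and the unsmoothed formula `tf * log(N / df)`; libraries such as scikit-learn use smoothed variants, so their numbers will differ. The three documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (plain TF * IDF, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus.
docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
vectors = tf_idf(docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

A word that appears in every document (like "the") gets weight 0, while rarer words get higher weights. The same word always gets the same treatment regardless of its neighbors, which is the "limited in capturing context" drawback listed in the table.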


Impact of Evolution on Scalability and Efficiency


1. Scalability:

Traditional NLP: Scaling models to new languages or domains required rebuilding manual features and retraining from scratch, limiting scalability.

Modern NLP: Pre-trained models (like BERT or GPT) serve as general-purpose language engines, enabling rapid adaptation to new tasks via fine-tuning, vastly improving scalability across tasks, languages, and industries.

2. Efficiency:

Traditional NLP: More efficient in low-data environments due to simpler models, but less efficient overall in handling complex tasks (e.g., semantic understanding).

Modern NLP: Although computationally expensive during pre-training, modern models offer high efficiency in reuse across multiple tasks. Once pre-trained, these models avoid repeated ground-up training, streamlining development workflows.

3. Practical Implications:

APIs and cloud services now offer access to pre-trained models (e.g., OpenAI, Hugging Face), democratizing NLP.

Modern NLP makes real-time applications such as open-ended chatbots and high-quality translation practical at a scale that traditional methods struggled to reach.



Exercise 2: LLM Architecture and Application Scenarios

| **Model** | **Core Architecture**                                                                                                                                                                                                                         | **Best Real-World Application**                                                | **Why This Model Excels for It**                                                                                                                             |
| --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **BERT**  | - **Bidirectional Transformer Encoder**<br>- **Masked Language Modeling (MLM)**: predicts missing words in a sentence.<br>- Learns context from both left and right simultaneously.                                                           | **Text Classification (e.g., sentiment analysis, spam detection)**             | BERT’s bidirectional attention captures full sentence context, improving understanding of text nuances necessary for classification tasks.                   |
| **GPT**   | - **Unidirectional (left-to-right) Transformer Decoder**<br>- **Causal Language Modeling (CLM)**: predicts next word based only on previous tokens.<br>- Focused on sequential generation.                                                    | **Text Generation (e.g., chatbots, content creation, code generation)**        | GPT’s architecture is optimized for **generative tasks**, as it predicts text token by token, naturally producing coherent continuations of prompts.         |
| **T5**    | - **Encoder-Decoder Transformer**<br>- **Text-to-Text Framework**: all tasks (classification, summarization, translation) are treated as converting input text into output text.<br>- Pre-trained using **Span Corruption** (similar to MLM). | **Multi-task Learning (e.g., summarization, translation, question answering)** | T5’s text-to-text approach makes it flexible for many NLP tasks, enabling a **single unified model** for diverse applications without architectural changes. |
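The difference between "bidirectional" and "unidirectional" in the table comes down to which positions each token is allowed to attend to. The sketch below builds both masks for a 4-token input; the function names are illustrative and not taken from any library.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join(" ".join("x" if allowed else "." for allowed in row)
                     for row in mask)

encoder_mask = attention_mask(4, causal=False)  # BERT-style: every token sees every token
decoder_mask = attention_mask(4, causal=True)   # GPT-style: tokens see only earlier positions

print("Bidirectional (encoder):")
print(show(encoder_mask))
print("Causal (decoder):")
print(show(decoder_mask))
```

An encoder-decoder model like T5 uses both: the full mask over the input text in the encoder, and the causal mask over the output text in the decoder.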


Exercise 3: The Benefits and Ethical Considerations of Pre-training



### Five Main Benefits of Pre-Trained Models

1. **Better at Handling New Things (Improved Generalization)**
   Because pre-trained models learn from huge amounts of text, they become good at understanding many different situations, even ones they haven’t seen before.

2. **Less Need for Labeled Data**
   Normally, you need a lot of examples with labels (like “this is positive” or “this is spam”) to train a model. But pre-trained models already know a lot, so you only need **a small amount of labeled data** to teach them your task.

3. **Faster Training (Fine-Tuning)**
   Instead of starting from zero, you just adjust (fine-tune) a pre-trained model for your job. This is **quicker and cheaper** than training a model from scratch.

4. **Transfer Learning**
   Pre-trained models can **transfer what they know** to different topics. For example, a model trained on general English can quickly learn to understand medical or legal language.

5. **More Reliable (Robustness)**
   Since these models have seen so much text from so many places, they’re **better at handling weird sentences, spelling mistakes, or strange phrasing** and make fewer mistakes on them.

---

### Ethical Issues

1. **Bias**
   If the training data has unfair ideas (like stereotypes), the model can **repeat those unfair ideas** when it answers people.

2. **False Information**
   Because models learn from the internet, they might **repeat wrong or fake information**, without knowing it’s false.

3. **Bad Uses (Misuse)**
   People can use these models to create **fake news, scams, or harmful content** like fake messages pretending to be from real people.

---

### How to Reduce These Problems

1. **Fixing Bias**

* Choose more balanced and fair training data.
* Use special techniques to help the model avoid repeating stereotypes.
* Test the model regularly to check if it’s being unfair.

2. **Stopping False Information**

* Remove bad-quality data before training.
* Teach the model to recognize when it’s unsure, so it doesn’t give wrong answers confidently.
* Add fact-checking steps during training or when the model gives answers.

3. **Using Models Safely**

* Give access to the model through **controlled systems (like APIs)**, so people can’t misuse it easily.
* Add filters to block dangerous or harmful responses.
* Be clear about how the model was trained and used, and make sure people follow rules when using it.


Exercise 4: Transformer Architecture Deep Dive

1- Explain Self-Attention and Multi-Head Attention:

## **What is Self-Attention?**

In a sentence, **each word looks at the other words to understand their relationships**. This process is called **self-attention**. It helps the model understand which words are important to each other.

###  How does Self-Attention work?

1. **Each word scores every word in the sentence to decide how much attention to pay to it**.
   Example: In a sentence like “The dog chased the cat because it was tired”, the model scores “it” against every other word (like “dog”, “chased”, “cat”) to decide which one is most related.

2. These scores are turned into weights that sum to 1 (using softmax) and used to create a **weighted version of the sentence**, where important words have higher influence.

3. Each word then uses this new information to build a better understanding of its meaning in the sentence.

In short:
 **Each word learns by looking at all other words and deciding who matters most.**
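The three steps above can be sketched as scaled dot-product attention in plain Python. This is a simplification: a real Transformer first multiplies each word vector by learned query, key, and value matrices, while this sketch assumes queries = keys = values = the word vectors themselves. The 2-d vectors are made up for illustration.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Step 1: score the current word against every word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        # Step 2: normalize the scores into attention weights.
        weights = softmax(scores)
        # Step 3: build the word's new vector as a weighted average of all vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(d)])
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy vectors, one per word.
words = ["dog", "chased", "cat", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.3], [0.9, 0.1]]
outputs, weights = self_attention(vectors)
for word, w in zip(words, weights[3]):
    print(f"it -> {word}: {w:.2f}")
```

With these toy vectors, "it" puts the most weight on "dog" simply because their vectors point in a similar direction; in a trained model, the learned projections decide what counts as "related".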

---

## **Multi-Head Attention**

Instead of doing self-attention just **once**, **multi-head attention** lets the model do it **several times, in different ways, at the same time**.

Each "head" can focus on different things:

* One head might focus on **grammar relationships** (like subjects and verbs).
* Another head might focus on **semantic meaning** (who did what to whom).
* Another head could focus on **long-distance relationships** in the sentence.

**Why is this better?**
Because different heads help the model learn from **different perspectives**, which makes its understanding much deeper and more accurate.

---

##  Why Multi-Head Attention is Better than Single-Head

* **Single-head attention** produces only **one attention pattern** per word, so every kind of relationship has to share it.
* **Multi-head attention** lets the model **look at many aspects at once**, making it smarter and more flexible.
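A minimal sketch of the multi-head idea: split each word vector into equal slices, run attention on each slice independently, and join the results. It assumes no learned projections; real implementations apply separate learned query/key/value matrices per head plus a final output projection, which is what lets heads specialize. All names and numbers here are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors):
    """Single-head scaled dot-product self-attention (queries = keys = values)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                           for k in vectors])
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

def multi_head_attention(vectors, num_heads):
    d = len(vectors[0])
    assert d % num_heads == 0, "vector size must divide evenly across heads"
    head_dim = d // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Each head sees only its own slice of every word vector,
        # so it computes its own attention pattern.
        chunk = [v[h * head_dim:(h + 1) * head_dim] for v in vectors]
        head_outputs.append(attend(chunk))
    # Concatenate the heads' outputs back into one vector per word.
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(vectors))]

# Hypothetical 4-d vectors for a 3-word sentence.
vectors = [[1.0, 0.0, 0.2, 0.9], [0.0, 1.0, 0.8, 0.1], [0.5, 0.5, 0.4, 0.4]]
result = multi_head_attention(vectors, num_heads=2)
for row in result:
    print([round(x, 3) for x in row])
```

With `num_heads=1` this reduces to ordinary single-head attention; with two heads, the first and second halves of each vector are weighted by different attention patterns.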

---

##  Summary

* **Self-Attention** = Each word learns who matters in the sentence.
* **Multi-Head Attention** = The model looks at the sentence from **many angles at once**, like having many mini-experts working together.



2- Pre-training Objectives:

| **Aspect**         | **Masked Language Modeling (MLM)**                                                                                        | **Causal Language Modeling (CLM)**                                                   |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| **How it works**   | Some words in the sentence are **hidden (masked)**, and the model learns to guess them using both left and right context. | The model predicts the **next word**, using only the previous words (left-to-right). |
| **Direction**      | **Bidirectional** (looks at the whole sentence at once).                                                                  | **Unidirectional** (looks left-to-right).                                            |
| **Example Models** | BERT, RoBERTa                                                                                                             | GPT, GPT-2, GPT-3, GPT-4                                                             |
| **Main Goal**      | Understand sentence meaning deeply.                                                                                       | Generate text in a logical, step-by-step flow.                                       |
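The two objectives in the table differ mainly in how training examples are built from the same sentence. The sketch below is a simplification: it assumes whitespace tokens and always substitutes `[MASK]`, whereas BERT's actual recipe sometimes keeps the token or swaps in a random one. The function names are illustrative.

```python
import random

def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens; the model must recover them using context on both sides."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(token)   # the hidden token is the prediction target
        else:
            inputs.append(token)
            labels.append(None)    # unmasked positions are not scored
    return inputs, labels

def clm_examples(tokens):
    """Pair each prefix with the token that follows it (left-to-right prediction)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()

# A high mask_prob is used here only so the short demo sentence gets some masks.
inputs, labels = mlm_example(tokens, mask_prob=0.5)
print("MLM input: ", inputs)
print("MLM labels:", labels)

clm = clm_examples(tokens)
for prefix, target in clm:
    print("CLM:", prefix, "->", target)
```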


- MLM Scenario

Task: Sentiment Analysis (understanding if a review is positive or negative).

Why: MLM helps models understand full sentence meaning, which is important for analyzing tone or opinion.

- CLM Scenario

Task: Text Generation (like writing stories or chatbot replies).

Why: CLM is perfect for generating the next word step-by-step, creating natural flowing text.



Why Early BERT Used NSP (Next Sentence Prediction)

What NSP Did:

BERT was trained to guess if Sentence B follows Sentence A. This was meant to help BERT understand relationships between sentences.

Why NSP Was Dropped in Modern Models:

Later research showed that NSP added little and sometimes hurt downstream performance. RoBERTa removed NSP and trained with MLM alone, and it matched or beat BERT on the same tasks.
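As a rough illustration of how NSP training pairs are built: about half of the pairs use the real next sentence and the other half use a different one. This sketch assumes a single toy list of sentences and draws negatives from that same list; BERT drew them from other documents in its corpus. The function name and sentences are made up.

```python
import random

def nsp_pairs(sentences, seed=0):
    """Return (sentence_a, sentence_b, is_next) triples; needs at least 3 sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: B really follows A.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: B is some other sentence, never A or its real successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

sentences = [
    "The cat sat on the mat.",
    "It fell asleep in the sun.",
    "Stock prices rose on Monday.",
    "Analysts expect more gains.",
]
pairs = nsp_pairs(sentences)
for a, b, is_next in pairs:
    print(f"{is_next!s:5} | {a} -> {b}")
```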

 3- Transformer Model Selection

 | **Task**                                                   | **Best Model Type**                       | **Why**                                                                                                                                    |
| ---------------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| **Analyze customer reviews (positive, negative, neutral)** | **Encoder-only** (like BERT)              | Encoder models are great at **understanding text** and producing a single representation of the whole input, which is perfect for classification tasks. |
| **Chatbot that generates creative responses**              | **Decoder-only** (like GPT)               | Decoder models are designed to **generate text**, predicting one word at a time in a sequence, ideal for conversations.                    |
| **Translate documents (English to Spanish)**               | **Encoder-Decoder** (like T5 or MarianMT) | Encoder reads and understands the source sentence, Decoder generates the translated version. This combo is best for **translation tasks**. |


 4- Positional Encoding

- Why It’s Needed:
Transformers process all words at the same time (parallel processing), unlike RNNs which process step-by-step. This means transformers don’t naturally know the order of words in a sentence.

- Positional Encoding adds information about the position of each word (first, second, third...) so the model understands the sequence.

Example Problem Without Positional Encoding:

Sentence:

“The cat sat on the mat.”

Without positional encoding, the model might think:

“Mat sat the cat on the.”
...is the same sentence!

This is because, without word positions, the sentence is just a bag of words to the model.

So positional encoding tells the model:

“The” comes first

“cat” comes second

“sat” comes third...

This helps the model understand sentence structure correctly.
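One common way to add position information is the sinusoidal encoding from the original Transformer paper, sketched below. The word embeddings are made-up 4-d vectors, and models such as BERT use learned position embeddings instead, so treat this as one possible scheme rather than the only one.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_positions(embeddings):
    """Add the position vector to each word embedding, element by element."""
    d_model = len(embeddings[0])
    return [[e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
            for pos, vec in enumerate(embeddings)]

# Hypothetical embeddings; "the" appears twice with the same vector.
toy = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.5, 0.1, 0.0, 0.2],
    "sat": [0.3, 0.3, 0.1, 0.0],
    "on":  [0.0, 0.4, 0.2, 0.1],
    "mat": [0.2, 0.0, 0.5, 0.3],
}
sentence = ["the", "cat", "sat", "on", "the", "mat"]
encoded = add_positions([toy[w] for w in sentence])

print("first 'the': ", [round(x, 3) for x in encoded[0]])
print("second 'the':", [round(x, 3) for x in encoded[4]])
```

Both occurrences of "the" start from the same embedding but end up with different input vectors, so the model can tell the first word from the fifth.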

Exercise 5: BERT Variations - Choose Your Detective

| **Scenario**                                                 | **Best BERT Variant** | **Why?**                                                                                                                                                        |
| ------------------------------------------------------------ | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **1. Real-time sentiment analysis on mobile app**            | **DistilBERT**        | It's a **smaller, lighter version of BERT**, trained to be fast and efficient with **lower memory and power needs**—perfect for mobile or real-time use.        |
| **2. Legal document research needing high accuracy**         | **RoBERTa**           | RoBERTa is trained on **larger datasets** and uses optimized training, leading to **higher accuracy** for complex, detail-heavy tasks like legal text analysis. |
| **3. Global customer support in many languages**             | **XLM-RoBERTa**       | XLM-RoBERTa is trained on **100+ languages**, making it ideal for **multilingual applications** like global customer support.                                   |
| **4. Efficient pretraining and token replacement detection** | **ELECTRA**           | ELECTRA uses a unique training method based on **replaced token detection**, making pretraining **faster and more efficient** than masked language modeling.    |
| **5. Efficient NLP in resource-limited environments**        | **ALBERT**            | ALBERT reduces model size with **parameter sharing** and **factorized embeddings**, keeping accuracy while being more **memory-efficient**.                     |
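To show what "replaced token detection" means for ELECTRA, the sketch below builds the labels its discriminator is trained to predict: for every position, was the token replaced or not? The corrupted sentence is written by hand here; in ELECTRA it comes from a small generator model.

```python
def replaced_token_labels(original, corrupted):
    """True where the corrupted token differs from the original one."""
    return [o != c for o, c in zip(original, corrupted)]

original = "the chef cooked the meal".split()
corrupted = "the chef ate the meal".split()   # hand-written stand-in for generator output

labels = replaced_token_labels(original, corrupted)
for token, replaced in zip(corrupted, labels):
    print(f"{token:7} {'replaced' if replaced else 'original'}")
```

Every position gets a label, not just the masked ones, which is a large part of why this objective is more sample-efficient than MLM.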


| **Model**       | **Training Data & Method**                         | **Model Size & Efficiency**            | **Optimizations / Innovations**                | **Best Use Cases**                                      |
| --------------- | -------------------------------------------------- | -------------------------------------- | ---------------------------------------------- | ------------------------------------------------------- |
| **RoBERTa**     | More data than BERT, no NSP, dynamic masking       | Large; similar to or bigger than BERT  | Better masking, removes NSP, longer training   | High-accuracy tasks (legal, medical, research)          |
| **ALBERT**      | Same data as BERT; replaces NSP with sentence-order prediction (SOP) | Far fewer parameters than BERT, very memory-efficient | Cross-layer parameter sharing, factorized embeddings | Resource-constrained but accuracy-focused tasks         |
| **DistilBERT**  | Trained by distillation from BERT                  | 40% smaller, 60% faster                | Knowledge distillation (learns from full BERT) | Mobile apps, real-time tasks, fast predictions          |
| **ELECTRA**     | Replaced token detection (more efficient than MLM) | Smaller/faster pretraining than BERT   | Generator-discriminator setup for pretraining  | Fast pretraining, efficient tasks, classification tasks |
| **XLM-RoBERTa** | Huge multilingual datasets, no NSP                 | Large, similar to RoBERTa              | Supports 100+ languages, dynamic masking       | Multilingual tasks: cross-lingual classification, global customer support |
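The saving from ALBERT's factorized embeddings is simple arithmetic: instead of one `V x H` embedding table, it stores a `V x E` table plus an `E x H` projection. The sizes below (vocabulary 30,000, hidden size 768, embedding size 128) are illustrative values in the range used by base-sized models, not exact figures for any specific checkpoint.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters with or without factorization."""
    if embedding_size is None:
        # BERT-style: one big table, V x H.
        return vocab_size * hidden_size
    # ALBERT-style: small table (V x E) plus a projection up to hidden size (E x H).
    return vocab_size * embedding_size + embedding_size * hidden_size

full = embedding_params(30_000, 768)
factorized = embedding_params(30_000, 768, embedding_size=128)

print(f"full:       {full:,}")
print(f"factorized: {factorized:,}")
print(f"reduction:  {full / factorized:.1f}x")
```

Cross-layer parameter sharing saves even more, since all Transformer layers reuse one set of weights. Note that sharing reduces memory, not the amount of computation per forward pass.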


Exercise 6: Softmax Temperature - The Randomness Regulator

1. Temperature Scenarios

- Temperature 0.2

The model chooses the most likely words most of the time.

Output is predictable, repetitive, and conservative.

Example: "The cat sat on the mat." (Simple, factual).

- Temperature 1.0

Balanced randomness and predictability.

Output is coherent but not too rigid.

The model can choose less common words if they make sense.

Example: "The curious cat settled comfortably on the warm mat."

- Temperature 1.5

The model chooses words more randomly.

Output becomes more creative but riskier, possibly less coherent.

Example: "The whimsical feline sprawled atop the shimmering fabric near the fireplace."
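The three scenarios above follow directly from how temperature enters the softmax: each logit is divided by the temperature before normalizing. The sketch below uses made-up logits for four candidate next words.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the next word after "The cat sat on the ...".
logits = {"mat": 3.0, "sofa": 2.0, "fireplace": 1.0, "moon": 0.0}

distributions = {}
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(list(logits.values()), t)
    distributions[t] = probs
    summary = ", ".join(f"{w}={p:.3f}" for w, p in zip(logits, probs))
    print(f"T={t}: {summary}")
```

At a low temperature almost all probability lands on the top word; as the temperature rises, the distribution flattens and unlikely words get a real chance of being sampled.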

2. Application Design

Personalized Bedtime Stories System

Goal: Balance creativity with story coherence.

Strategy:

Use a higher temperature (e.g., 1.2 to 1.5) during storytelling parts to make the story more imaginative and varied.

Lower the temperature (e.g., 0.7 to 1.0) for critical sections (e.g., moral of the story or plot conclusions) to keep the story logical.

Result: Stories are fun and surprising, but don’t become confusing or chaotic.

Financial Reports Summarization System

Goal: Ensure accuracy, consistency, and reliability.

Strategy:

Set temperature low (e.g., 0.2 to 0.5).

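This keeps the model close to its most likely wording and avoids creative word choices that could mislead. Low temperature makes the output more consistent, but it cannot correct facts the model gets wrong.

Result: Summaries are clear, precise, and professional, with a lower risk of errors introduced by random word choices.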
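A small sketch of how both designs could share one sampler with a per-section temperature. The section names and the exact values are hypothetical picks from the ranges suggested above, and the logits are made up; a real system would pass the temperature to whatever generation API it uses.

```python
import math
import random
from collections import Counter

# Hypothetical settings chosen from the ranges discussed above.
SECTION_TEMPERATURE = {
    "storytelling": 1.3,
    "story_conclusion": 0.8,
    "financial_summary": 0.3,
}

def sample_token(logits, temperature, rng):
    """Sample one token from temperature-scaled softmax probabilities."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized; choices() normalizes
    return rng.choices(tokens, weights=weights, k=1)[0]

def top_token_share(section, logits, draws=1000, seed=0):
    """Fraction of draws that pick the single most likely token."""
    rng = random.Random(seed)
    temperature = SECTION_TEMPERATURE[section]
    counts = Counter(sample_token(logits, temperature, rng) for _ in range(draws))
    top = max(logits, key=logits.get)
    return counts[top] / draws

# Made-up logits for four candidate next words.
logits = {"mat": 3.0, "sofa": 2.0, "fireplace": 1.0, "moon": 0.0}
for section in SECTION_TEMPERATURE:
    print(f"{section}: top token chosen {top_token_share(section, logits):.0%} of the time")
```

The financial setting picks the most likely word almost every time, while the storytelling setting regularly wanders to the alternatives.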



3. Temperature and Bias

How Temperature Affects Bias:

Low Temperature:

The model tends to repeat dominant patterns from training data.

This can reinforce biases present in the data (since it favors "safe" or "common" outputs).

High Temperature:

The model explores less common word choices, which might reduce visible bias but also introduce randomness or even inappropriate content.