### Architecture Flow

## **What Does This Architecture Do?**

This architecture diagram illustrates the complete pipeline of a multi-label emotion classification system using BERT (Bidirectional Encoder Representations from Transformers). It processes raw text input and outputs predictions for multiple emotions simultaneously.

**Key Components:**
- **Input Processing**: Raw text → Tokenized sequences
- **Feature Extraction**: BERT model extracts contextual embeddings
- **Classification**: Custom head maps embeddings to emotion probabilities
- **Output**: Binary predictions for 5 emotions (anger, fear, joy, sadness, surprise)

---

## **Why Is This Architecture Necessary?**

### **1. Multi-Label Classification Challenge**
Unlike single-label classification (where text belongs to ONE category), emotions can co-exist. A sentence like *"I'm excited but nervous about the interview"* contains both **joy** and **fear**. This architecture handles such complexity.

### **2. Contextual Understanding**
Traditional bag-of-words models treat words independently and miss context. BERT's transformer layers capture:
- **Bidirectional context**: Each token's representation draws on the words to both its left and its right
- **Semantic relationships**: Distinguishes "I love this!" (positive) from "I'd love to leave" (negative), even though both contain "love"

### **3. Transfer Learning Efficiency**
BERT is pre-trained on massive text corpora, giving it:
- **Language understanding** out-of-the-box
- **Reduced training time** (we only fine-tune, not train from scratch)
- **Better performance** with less data

### **4. Regularization & Overfitting Prevention**
- **Dropout layers** prevent the model from memorizing training data
- **Sigmoid activation** is not a regularizer, but it sits in the same classification head: it scores each emotion independently, so predictions are not mutually exclusive

---

## **How Should This Architecture Be Implemented?**

### **Step-by-Step Implementation Guide**

#### **1. Input Processing (Tokenization)**
```python
from transformers import BertTokenizer

# Convert text to BERT-compatible format
text = "I'm excited but nervous about the interview"
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
# Returns: input_ids (token IDs), token_type_ids (segment IDs), attention_mask (1 = real token, 0 = padding)
```

**Why?** BERT consumes token IDs rather than raw text, expects the special tokens [CLS] and [SEP], and needs every sequence in a batch padded to the same length.

---

#### **2. Embedding Layer**
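```python
# BERT's first layer converts token IDs to 768-dimensional vectors
# (bert_model is assumed to be BertModel.from_pretrained('bert-base-uncased'))
input_ids = inputs['input_ids']
embeddings = bert_model.embeddings(input_ids)  # Shape: (batch_size, seq_len, 768)
```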

**Why?** Neural networks process numbers, not words. Embeddings capture semantic meaning.

---

#### **3. Transformer Layers (12 in BERT-base)**
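```python
# Each of the 12 layers applies self-attention + a feed-forward network.
# Calling the layers directly needs a broadcastable additive mask and unpacking of
# each layer's output, so let the model run the whole stack instead:
outputs = bert_model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
hidden_states = outputs.last_hidden_state  # Shape: (batch_size, seq_len, 768)
```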

**Why?** Multi-head attention learns which words relate to each other (e.g., "not happy" → focus on "not").

---

#### **4. Extract [CLS] Token**
```python
# [CLS] token (first position) aggregates sentence-level information
cls_output = hidden_states[:, 0, :]  # Shape: (batch_size, 768)
```

**Why?** This special token is designed to represent the entire sentence for classification tasks.

---

#### **5. Dropout for Regularization**
```python
# Randomly zero 30% of the features during training (dropout is inactive in eval mode)
dropout_output = nn.Dropout(0.3)(cls_output)
```

**Why?** Prevents overfitting by forcing the model to not rely on specific neurons.

---

#### **6. Classification Head**
```python
# Linear layer maps 768 features to 5 emotion scores
# (in a real model, create the layer once in __init__ so its weights are trained)
logits = nn.Linear(768, 5)(dropout_output)
```

**Why?** Projects the 768-dimensional sentence representation onto one raw score (logit) per target emotion.

---

#### **7. Sigmoid Activation**
```python
# Convert logits to probabilities (0-1 range)
probabilities = torch.sigmoid(logits)
```

**Why?** Unlike softmax (used for single-label), sigmoid treats each emotion independently.
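
To make the contrast concrete, the following pure-Python sketch applies both functions to the same logits. The logit values are made up for illustration and do not come from a trained model.

```python
import math
from typing import List

def sigmoid(x: float) -> float:
    """Squash one logit into (0, 1) independently of the others."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs: List[float]) -> List[float]:
    """Normalize all logits jointly so the outputs sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (anger, fear, joy, sadness, surprise)
logits = [-2.0, 1.5, 2.0, -1.0, -3.0]

sigmoid_probs = [sigmoid(x) for x in logits]
softmax_probs = softmax(logits)

# Sigmoid: fear and joy can both exceed 0.5 at the same time
multi_label = [p > 0.5 for p in sigmoid_probs]
# Softmax: probabilities compete, so they always sum to 1
print([round(p, 3) for p in sigmoid_probs])
print([round(p, 3) for p in softmax_probs], round(sum(softmax_probs), 3))
```

With sigmoid, raising the score of one emotion leaves the others unchanged. With softmax, it necessarily lowers them.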

---

#### **8. Thresholding for Predictions**
```python
# Predict emotion if probability > 0.5
predictions = (probabilities > 0.5).int()
```

**Why?** Converts probabilities to binary decisions. Threshold (0.5) can be tuned for precision/recall trade-off.
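
The trade-off can be seen by sweeping the threshold for a single emotion. This is a minimal sketch: the labels and probabilities are invented, and `precision_recall` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Tuple

def precision_recall(y_true: List[int], probs: List[float], threshold: float) -> Tuple[float, float]:
    """Precision and recall for one emotion at a given decision threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predicted probabilities for a single emotion
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, a lower threshold catches every positive at the cost of a false alarm, while a higher threshold removes the false alarm but misses positives.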

---

### **Complete Architecture Diagram**

```
Raw Text Input
    ↓
Tokenization (BERT Tokenizer)
    ↓
Token IDs + Attention Mask (Tensors)
    ↓
Embedding Layer (BERT)
    ↓
12 Transformer Layers (Self-Attention + FFN)
    ↓
[CLS] Token Representation (768-dim)
    ↓
Dropout Layer (Regularization)
    ↓
Linear Classification Head (768 → 5)
    ↓
Logits (Raw Scores)
    ↓
Sigmoid Activation
    ↓
Probabilities (0-1 per emotion)
    ↓
Thresholding (> 0.5)
    ↓
Binary Predictions
```
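
The last stages of the diagram (steps 4 to 8) can be collected into a single module. The sketch below is one possible arrangement, not necessarily the notebook's final model: `EmotionHead` is a hypothetical name, and a random tensor stands in for BERT's last hidden state so that no model download is needed.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Hypothetical head covering steps 4-8: [CLS] -> dropout -> linear -> sigmoid -> threshold."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 5, dropout: float = 0.3):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        cls_output = hidden_states[:, 0, :]               # (batch_size, hidden_size)
        return self.classifier(self.dropout(cls_output))  # logits: (batch_size, num_labels)

    @torch.no_grad()
    def predict(self, hidden_states: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        probabilities = torch.sigmoid(self.forward(hidden_states))
        return (probabilities > threshold).int()

# Random tensor standing in for BERT's last hidden state: (batch=2, seq_len=16, hidden=768)
fake_hidden_states = torch.randn(2, 16, 768)
head = EmotionHead().eval()  # eval() disables dropout for inference
logits = head(fake_hidden_states)
predictions = head.predict(fake_hidden_states)
print(logits.shape, predictions.shape)
```

Creating the dropout and linear layers once in `__init__` means their parameters are registered with the module and updated during training. `forward` returns raw logits so that a loss such as `nn.BCEWithLogitsLoss` can apply the sigmoid itself.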

---

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from typing import Dict, List, Optional
from dataclasses import dataclass
import warnings

warnings.filterwarnings('ignore')
```

### Line-by-Line Breakdown

**Line 1: `import torch`**

**What**: Imports the main PyTorch library.

**Why**: PyTorch is the foundation for all tensor operations, automatic differentiation, and neural network training. We need it for creating and manipulating tensors, which are the fundamental data structures in deep learning.

**How**: This makes the entire `torch` namespace available, allowing access to functions like `torch.tensor()`, `torch.device()`, and `torch.no_grad()`.

---

**Line 2: `import torch.nn as nn`**

**What**: Imports PyTorch's neural network module with the alias `nn`.

**Why**: The `torch.nn` module contains all building blocks for neural networks including layers, loss functions, and activation functions. The alias `nn` is a standard convention that makes code more concise and readable.

**How**: By importing as `nn`, we can write `nn.Linear()` instead of `torch.nn.Linear()`, which is cleaner and follows community standards.

---

**Line 3: `from torch.utils.data import Dataset, DataLoader`**

**What**: Imports two specific classes from PyTorch's data utilities module.

**Why**:
- `Dataset`: Abstract base class for creating custom datasets. We will inherit from this to build our emotion dataset.
- `DataLoader`: Utility for batching, shuffling, and loading data efficiently during training.

**How**: Using `from ... import` allows us to directly use `Dataset` and `DataLoader` without the `torch.utils.data.` prefix, making the code cleaner.

---

**Line 4: `from transformers import AutoTokenizer, AutoModel`**

**What**: Imports automatic model and tokenizer classes from Hugging Face transformers library.

**Why**:
- `AutoTokenizer`: Automatically selects the correct tokenizer for a given model name, handling the complexity of different tokenization schemes.
- `AutoModel`: Automatically loads the appropriate transformer architecture based on the model name.

**How**: These "Auto" classes use model configuration files to determine which specific tokenizer or model class to instantiate, providing a uniform interface across different transformer models.

---

**Line 5: `import pandas as pd`**

**What**: Imports the pandas library with the standard alias `pd`.

**Why**: Pandas provides DataFrame structures that are ideal for handling tabular data. We use it to organize our text and labels in a structured format that is easy to manipulate and pass to our dataset class.

**How**: The alias `pd` is the universally accepted convention, allowing us to write `pd.DataFrame()` instead of `pandas.DataFrame()`.

---

**Line 6: `import numpy as np`**

**What**: Imports the NumPy library with the standard alias `np`.

**Why**: NumPy provides efficient array operations and is the foundation for numerical computing in Python. We use it for stacking predictions and performing array manipulations during evaluation.

**How**: The alias `np` is the standard convention, making code more concise when using functions like `np.vstack()` or `np.array()`.

---

**Line 7: `from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score`**

**What**: Imports four specific evaluation metric functions from scikit-learn.

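**Why**:
- These metrics are essential for evaluating multi-label classification performance
- `accuracy_score`: Measures the proportion of correct predictions; for multi-label input it computes subset accuracy, so a sample counts as correct only when all of its labels match
- `f1_score`: Harmonic mean of precision and recall, balancing both metrics
- `precision_score`: Measures how many predicted positives are actually positive
- `recall_score`: Measures how many actual positives were correctly predicted

**How**: These functions accept true labels and predictions as arrays and compute the respective metrics. For multi-label targets, `f1_score`, `precision_score`, and `recall_score` need an explicit `average` argument (for example `'micro'`, `'macro'`, or `'samples'`), because the default `average='binary'` only applies to binary targets.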
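
To show what two of these numbers mean for multi-label data, here is a hand-rolled sketch of subset accuracy and micro-averaged F1. It is meant to mirror what `accuracy_score` and `f1_score(..., average='micro')` compute on multi-label indicator arrays. The helper names and the toy labels are assumptions made for illustration.

```python
from typing import List

def subset_accuracy(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of samples whose entire label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)

def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """F1 computed from true/false positives pooled over all samples and labels."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels in the order (anger, fear, joy, sadness, surprise)
y_true = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 1]]

print(subset_accuracy(y_true, y_pred))
print(round(micro_f1(y_true, y_pred), 3))
```

Subset accuracy is strict: one wrong label makes the whole sample count as wrong. Micro-F1 gives partial credit for the labels that were predicted correctly.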

---

**Line 8: `from typing import Dict, List, Optional`**

**What**: Imports type hint classes from Python's typing module.

**Why**: Type hints improve code readability and enable static type checking. They document what types of data functions expect and return, making the code self-documenting and catching potential bugs early.

**How**:
- `Dict`: Type hint for dictionaries, used as `Dict[str, int]` to specify key and value types
- `List`: Type hint for lists, used as `List[str]` to specify element types
- `Optional`: Indicates a value can be of a specified type or None

---

**Line 9: `from dataclasses import dataclass`**

**What**: Imports the dataclass decorator from Python's dataclasses module.

**Why**: Dataclasses automatically generate special methods like `__init__()` for classes that primarily store data. This reduces boilerplate code and makes our DataCollator class cleaner.

**How**: When we decorate a class with `@dataclass`, Python automatically creates an initializer and other methods based on the class attributes we define.
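
A small illustration of the generated methods follows. `TokenizationConfig` is a hypothetical class invented for this sketch. It is not defined in the notebook.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TokenizationConfig:
    """Hypothetical settings container; not a class from the notebook."""
    model_name: str = "bert-base-uncased"
    max_length: int = 128
    label_names: List[str] = field(
        default_factory=lambda: ["anger", "fear", "joy", "sadness", "surprise"]
    )

config = TokenizationConfig(max_length=64)           # generated __init__
print(config)                                        # generated __repr__
print(config == TokenizationConfig(max_length=64))   # generated __eq__
```

Mutable defaults such as lists go through `field(default_factory=...)` so that each instance gets its own copy.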

---

**Line 10: `import warnings`**

**What**: Imports Python's warnings module for managing warning messages.

**Why**: We import this to control warning outputs that might clutter our notebook, especially deprecation warnings from libraries.

**How**: The warnings module provides functions to filter, suppress, or customize how warnings are displayed.

---

**Line 11: (blank line)**

**What**: Empty line for code readability.

**Why**: Following PEP 8 style guidelines, blank lines separate import statements from subsequent code, improving visual organization.

---

**Line 12: `warnings.filterwarnings('ignore')`**

**What**: Configures the warnings module to suppress all warning messages.

**Why**: During training, various libraries may emit warnings about deprecations or optimizations. While these are sometimes useful, they can clutter notebook output and distract from learning objectives.

**How**: The `filterwarnings()` function accepts an action parameter. The `'ignore'` action tells Python to suppress all warnings. In production code, you would typically want to be more selective about which warnings to ignore.
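
A more selective approach might look like the sketch below, which silences a single warning category inside a limited scope. `legacy_helper` is a hypothetical function, and `record=True` is used only so the example can inspect which warnings got through.

```python
import warnings

def legacy_helper() -> int:
    """Hypothetical function that emits a deprecation warning."""
    warnings.warn("legacy_helper is deprecated", DeprecationWarning)
    return 42

# Filters set inside catch_warnings() are undone when the block exits
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                         # start from "show everything"
    warnings.simplefilter("ignore", DeprecationWarning)     # then silence one category
    result = legacy_helper()                                # its warning is suppressed
    warnings.warn("learning rate looks high", UserWarning)  # still recorded

print(result, [str(w.message) for w in caught])
```

Only the `DeprecationWarning` is dropped. The `UserWarning` still reaches the reader, which is usually what you want outside a teaching notebook.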

## Section 3: Data Synthesis

### Code Block Analysis

```python
synthetic_data = {
    'id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'text': [
        "I was extremely disappointed with the customer service at the restaurant.",
        "Walking alone at night in an unfamiliar neighborhood made me nervous.",
        "Receiving the promotion I worked hard for filled me with happiness.",
        "The news about my friend's illness left me feeling devastated and concerned.",
        "I never expected to see my childhood friend at the conference today.",
        "The project deadline is approaching and I am worried about finishing on time.",
        "Celebrating my graduation with family brought immense satisfaction.",
        "The unexpected cancellation of the event frustrated me greatly.",
        "Reading the final chapter of my favorite book series was bittersweet.",
        "The presentation went smoothly and I felt confident throughout."
    ],
    'anger': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    'fear': [0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
    'joy': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    'sadness': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    'surprise': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
}

df = pd.DataFrame(synthetic_data)
print("Dataset Overview:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nEmotion distribution:")
print(df[['anger', 'fear', 'joy', 'sadness', 'surprise']].sum())
```

## Section 4: Train-Test Split

### Code Block Analysis

```python
train_size = int(0.8 * len(df))
train_df = df[:train_size]
val_df = df[train_size:]

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
```

### Line-by-Line Breakdown

**Line 1: `train_size = int(0.8 * len(df))`**

**What**: Calculates the number of samples to use for training (80% of total).

**Why**: We need to split data into training and validation sets to evaluate model performance on unseen data. Using 80% for training and 20% for validation is a common convention that balances having enough training data while maintaining a meaningful validation set.

**How**:
- `len(df)` returns the number of rows in the DataFrame (10)
- `0.8 * len(df)` calculates 80% of rows (10 * 0.8 = 8.0)
- `int()` converts the float to an integer (8.0 → 8)
- The result is stored in `train_size`

**Why int() is necessary**: Array slicing requires integer indices. Without `int()`, we would have a float, which cannot be used for slicing.

---

**Line 2: `train_df = df[:train_size]`**

**What**: Creates the training DataFrame by selecting the first `train_size` rows.

**Why**: This separates our training data from validation data. The training set is what the model learns from during the training process.

**How**:
- `df[:train_size]` uses slice notation to select rows from index 0 up to (but not including) index `train_size`
- With `train_size = 8`, this selects rows 0-7 (8 total rows)
- The resulting DataFrame is assigned to `train_df`

**Important Note**: This is a simple split without shuffling. In production, you would typically shuffle before splitting to avoid any ordering bias.
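
One way to shuffle before splitting is to shuffle row positions with a fixed seed. This is a minimal sketch using only the standard library. `shuffled_split` is a hypothetical helper, and the seed value is arbitrary.

```python
import random
from typing import List, Tuple

def shuffled_split(n_samples: int, train_fraction: float = 0.8, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return shuffled, non-overlapping train and validation row positions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    train_size = int(train_fraction * n_samples)
    return indices[:train_size], indices[train_size:]

train_idx, val_idx = shuffled_split(10)
print(train_idx, val_idx)
# With a DataFrame, the positions could be applied as df.iloc[train_idx] and df.iloc[val_idx]
```

Because the seed is fixed, rerunning the notebook produces the same split, which keeps validation scores comparable between runs.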

---

**Line 3: `val_df = df[train_size:]`**

**What**: Creates the validation DataFrame by selecting all rows from `train_size` onwards.

**Why**: The validation set is used to evaluate model performance on data it has not seen during training, helping detect overfitting and assess generalization.

**How**:
- `df[train_size:]` selects from index `train_size` to the end of the DataFrame
- With `train_size = 8`, this selects rows 8-9 (2 total rows)
- The resulting DataFrame is assigned to `val_df`

**Split Visualization**:
```
Original df (10 rows): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
                        |__________________|  |______|
                           train_df (8)      val_df (2)
```

---

**Line 4: (blank line)**

**What**: Empty line for code organization.

**Why**: Separates the data splitting logic from the output display logic, improving readability.

---

**Line 5: `print(f"Training samples: {len(train_df)}")`**

**What**: Prints the number of samples in the training set.

**Why**: Verifying the split worked correctly is important. This also documents the dataset sizes for anyone reading the notebook.

**How**:
- `len(train_df)` counts the rows in the training DataFrame (8)
- The f-string formats and displays this count
- Expected output: "Training samples: 8"

---

**Line 6: `print(f"Validation samples: {len(val_df)}")`**

**What**: Prints the number of samples in the validation set.

**Why**: Completes the verification of the data split, confirming we have the expected 2 validation samples (20% of 10).

**How**:
- `len(val_df)` counts the rows in the validation DataFrame (2)
- Expected output: "Validation samples: 2"

**Educational Discussion Points**:
1. **Why 80-20 split?**: Common ratios are 80-20, 70-30, or 60-20-20 (train-val-test). Smaller datasets might use 70-30 to have more validation data.
2. **Why not use test set here?**: With only 10 samples, adding a test set would leave too little data for training and validation. In real projects, you would have train-val-test splits.
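3. **Stratification**: For imbalanced datasets, you would use stratified splitting to maintain class distribution across splits. With multi-label targets, stratification has to account for label combinations rather than a single class column.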

## Section 5: Tokenization

### Code Block Analysis

```python
MODEL_NAME = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

sample_text = df['text'].iloc[0]
sample_encoding = tokenizer(
    sample_text,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors='pt'
)

print("Sample tokenization:")
print(f"Original text: {sample_text}")
print(f"\nToken IDs shape: {sample_encoding['input_ids'].shape}")
print(f"Token IDs: {sample_encoding['input_ids'][0][:20]}...")
print(f"\nDecoded tokens: {tokenizer.convert_ids_to_tokens(sample_encoding['input_ids'][0][:20])}")
```

### Line-by-Line Breakdown

**Line 1: `MODEL_NAME = 'bert-base-uncased'`**

**What**: Defines a constant string specifying which pretrained model to use.

**Why**:
- BERT (Bidirectional Encoder Representations from Transformers) is a powerful language model trained on massive text corpora
- 'base' indicates the standard size (12 layers, 768 hidden units)
- 'uncased' means the model does not distinguish between uppercase and lowercase letters
- Using a constant makes it easy to switch models by changing one line

**How**: By convention, constants are written in UPPERCASE. This string will be used to download both the tokenizer and model from Hugging Face's model hub.

**Model Choice Considerations**:
- **bert-base-uncased**: Good for general English text, faster than large models
- **bert-base-cased**: Use when capitalization matters (e.g., names, acronyms)
- **bert-large-uncased**: Better performance but requires more memory and compute

---

**Line 2: `tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)`**

**What**: Downloads and initializes the tokenizer associated with BERT-base-uncased.

**Why**: Neural networks cannot process raw text directly. Tokenizers convert text into numerical tokens that models can understand. Using the pretrained tokenizer ensures compatibility with the pretrained model's vocabulary.

**How**:
- `AutoTokenizer.from_pretrained()` automatically determines the correct tokenizer class for the specified model
- It downloads the tokenizer configuration and vocabulary file from Hugging Face Hub
- These files are cached locally to avoid re-downloading
- The tokenizer instance is stored in the `tokenizer` variable

**What gets downloaded**:
1. `tokenizer_config.json`: Configuration parameters
2. `vocab.txt`: Vocabulary mapping words to IDs (30,522 tokens for BERT-base)
3. Special token definitions ([CLS], [SEP], [PAD], etc.)

---

**Line 4: `sample_text = df['text'].iloc[0]`**

**What**: Extracts the first text sample from the DataFrame for demonstration.

**Why**: We want to show how tokenization works on a concrete example before applying it to the entire dataset.

**How**:
- `df['text']` selects the 'text' column, returning a Series
- `.iloc[0]` uses integer-location indexing to get the first element
- The result is the string: "I was extremely disappointed with the customer service at the restaurant."

---

**Line 5: `sample_encoding = tokenizer(`**

**What**: Begins a function call to tokenize the sample text.

**Why**: The tokenizer is callable like a function, providing a convenient interface for text processing.

**How**: The parentheses indicate the start of a multi-line function call with multiple arguments.

---

**Line 6: `    sample_text,`**

**What**: First positional argument - the text to tokenize.

**Why**: This is the required input that will be converted to tokens.

**How**: The tokenizer will split this text into subword tokens using BERT's WordPiece algorithm.

---

**Line 7: `    padding='max_length',`**

**What**: Specifies padding strategy to pad sequences to maximum length.

**Why**: Neural networks require fixed-size inputs for batch processing. Padding ensures all sequences have the same length by adding special [PAD] tokens.

**How**:
- If the sequence is shorter than `max_length`, [PAD] tokens are appended
- If longer, it will be truncated (controlled by the truncation parameter)

**Alternative values**:
- `'longest'`: Pad to longest sequence in the batch
- `'max_length'`: Pad to specified max_length
- `False`: No padding (sequences remain different lengths)

---

**Line 8: `    truncation=True,`**

**What**: Enables truncation of sequences exceeding maximum length.

**Why**: BERT has a maximum position embedding of 512 tokens. Processing longer sequences would cause errors. Truncation prevents this by cutting off excess tokens.

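**How**: If a tokenized sequence exceeds `max_length`, it is trimmed to fit: tokens are dropped from the end of the text while the [CLS] and [SEP] special tokens are kept, so the result is exactly `max_length` tokens long.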

---

**Line 9: `    max_length=128,`**

**What**: Sets the maximum sequence length to 128 tokens.

**Why**:
- Our texts are relatively short, so 128 is sufficient
- Shorter sequences train faster and use less memory
- BERT's maximum is 512, but using smaller lengths when possible improves efficiency

**How**: This works in conjunction with `padding` and `truncation` to ensure all outputs are exactly 128 tokens long.
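
The combined effect of `padding`, `truncation`, and `max_length` can be imitated in a few lines of plain Python. This is a simplified sketch of the behavior, not the tokenizer's actual implementation. `pad_or_truncate` is a hypothetical helper, and a tiny `max_length` of 6 is used to keep the output readable.

```python
from typing import List, Tuple

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def pad_or_truncate(token_ids: List[int], max_length: int) -> Tuple[List[int], List[int]]:
    """Wrap content tokens in [CLS]/[SEP], then truncate or pad to max_length."""
    content = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + content + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad          # 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([1045, 2001], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(2000, 2010)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have exactly `max_length` entries: the short input is padded and masked, and the long input loses its trailing tokens but keeps [CLS] and [SEP].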

---

**Line 10: `    return_tensors='pt'`**

**What**: Specifies that outputs should be PyTorch tensors.

**Why**: PyTorch models require tensor inputs, not lists or NumPy arrays. This parameter ensures the output is in the correct format.

**How**:
- `'pt'`: Returns PyTorch tensors
- `'tf'`: Would return TensorFlow tensors
- `'np'`: Would return NumPy arrays
- `None` (default): Returns Python lists

---

**Line 11: `)`**

**What**: Closes the tokenizer function call.

**Why**: Completes the multi-line function invocation started on line 5.

**How**: The tokenizer processes the input and returns a dictionary containing:
- `input_ids`: Token IDs
- `attention_mask`: Mask indicating real tokens (1) vs padding (0)
- `token_type_ids`: Segment IDs (for BERT's dual-sequence input)

---

**Line 12: (blank line)**

**What**: Empty line for readability.

---

**Line 13: `print("Sample tokenization:")`**

**What**: Prints a header for the tokenization demonstration.

**Why**: Provides context for the output that follows.

---

**Line 14: `print(f"Original text: {sample_text}")`**

**What**: Displays the original text before tokenization.

**Why**: Allows comparison between input text and tokenized output, helping students understand the transformation.

**How**: F-string substitutes the `sample_text` variable into the output string.

---

**Line 15: `print(f"\nToken IDs shape: {sample_encoding['input_ids'].shape}")`**

**What**: Prints the shape of the token ID tensor.

**Why**: Understanding tensor shapes is crucial in deep learning. This shows the batch dimension and sequence length.

**How**:
- `sample_encoding['input_ids']` accesses the token IDs tensor
- `.shape` returns the dimensions as a `torch.Size` object
- Expected output: `torch.Size([1, 128])` meaning 1 sample with 128 tokens

---

**Line 16: `print(f"Token IDs: {sample_encoding['input_ids'][0][:20]}...")`**

**What**: Displays the first 20 token IDs.

**Why**: Showing all 128 tokens would clutter the output. The first 20 tokens demonstrate how text converts to numbers.

**How**:
- `sample_encoding['input_ids'][0]` gets the first (and only) sequence
- `[:20]` slices to get the first 20 token IDs
- `...` indicates truncation for display purposes

**Example output**: `tensor([  101,  1045,  2001, ...])`
- 101 is [CLS] (classification token)
- 1045 and 2001 are the IDs for "i" and "was"; subsequent numbers represent the remaining words/subwords

---

**Line 17: `print(f"\nDecoded tokens: {tokenizer.convert_ids_to_tokens(sample_encoding['input_ids'][0][:20])}")`**

**What**: Converts the first 20 token IDs back to their string representations.

**Why**: This bridges the gap between numbers (what the model sees) and text (what humans understand), making tokenization concrete and understandable.

**How**:
- `tokenizer.convert_ids_to_tokens()` reverses the tokenization process
- Takes a list/tensor of IDs and returns corresponding token strings
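- Expected output starts with [CLS]; because this sentence is short, the first 20 positions also contain [SEP] and [PAD] tokens. Subword pieces such as '##ed' appear only when a word is not in the vocabulary as a whole.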

**Example output**: `['[CLS]', 'i', 'was', 'extremely', 'disappointed', 'with', 'the', 'customer', 'service', ...]`

**Key Concepts to Highlight**:
1. **WordPiece tokenization**: Splits words that are not in the vocabulary into known subwords (e.g., "embeddings" → "em", "##bed", "##ding", "##s")
2. **Special tokens**: [CLS] at start, [SEP] at end, [PAD] for padding
3. **Attention mask**: Tells model which tokens are real (1) vs padding (0)
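
The subword splitting in point 1 can be sketched as a greedy longest-match-first search. The code below is a simplified illustration with a tiny made-up vocabulary. The real tokenizer also handles lowercasing, punctuation, and a much larger vocabulary.

```python
from typing import List, Set

def wordpiece(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny made-up vocabulary; the real BERT vocabulary has about 30,000 entries
toy_vocab = {"em", "##bed", "##ding", "##s", "happy"}
print(wordpiece("embeddings", toy_vocab))
print(wordpiece("happy", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Words found whole in the vocabulary stay as one token. Other words are broken into the longest pieces available, and a word that cannot be covered at all maps to `[UNK]`.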

## Section 6: Custom Dataset Class

### Code Block Analysis

```python
class EmotionDataset(Dataset):

    def __init__(self, dataframe: pd.DataFrame, tokenizer, max_length: int = 128):
        self.texts = dataframe['text'].values
        self.labels = dataframe[['anger', 'fear', 'joy', 'sadness', 'surprise']].values
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        text = str(self.texts[idx])
        labels = self.labels[idx]

        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(labels, dtype=torch.float)
        }

train_dataset = EmotionDataset(train_df, tokenizer)
val_dataset = EmotionDataset(val_df, tokenizer)

print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")
print(f"\nSample from dataset:")
sample = train_dataset[0]
for key, value in sample.items():
    print(f"{key}: {value.shape}")
```

### Line-by-Line Breakdown

**Line 1: `class EmotionDataset(Dataset):`**

**What**: Defines a new class that inherits from PyTorch's Dataset base class.

**Why**: PyTorch's DataLoader requires datasets to implement specific methods (`__len__` and `__getitem__`). By inheriting from Dataset, we get a standardized interface and access to PyTorch's data loading utilities.

**How**:
- `class EmotionDataset` names our custom class
- `(Dataset)` specifies inheritance from the imported Dataset class
- The colon `:` begins the class definition

**Inheritance Benefits**:
1. Standardized interface expected by DataLoader
2. Type checking and IDE support
3. Integration with PyTorch's ecosystem

---

**Line 2: (blank line)**

**What**: Empty line after class declaration.

**Why**: Optional visual spacing between the class declaration and its first method.

---

**Line 3: `    def __init__(self, dataframe: pd.DataFrame, tokenizer, max_length: int = 128):`**

**What**: Defines the constructor method with type hints.

**Why**: The constructor initializes the dataset with necessary components. It runs once when creating a dataset instance, setting up everything needed for data access.

**How**:
- `def __init__` is Python's special method for object initialization
- `self` is the instance reference (automatically passed)
- `dataframe: pd.DataFrame` expects a pandas DataFrame with type hint
- `tokenizer` is the pretrained tokenizer (no type hint as it could be various types)
- `max_length: int = 128` has a default value of 128

**Parameter Purposes**:
- `dataframe`: Contains text and labels
- `tokenizer`: Converts text to tokens
- `max_length`: Controls sequence padding/truncation

---

**Line 4: `        self.texts = dataframe['text'].values`**

**What**: Extracts text column as a NumPy array and stores as instance variable.

**Why**:
- Converting to `.values` (NumPy array) is faster than accessing DataFrame rows repeatedly
- Storing in `self.texts` makes it accessible in other methods
- NumPy arrays support efficient integer indexing

**How**:
- `dataframe['text']` selects the text column
- `.values` converts the Series to a NumPy array
- `self.texts =` stores it as an instance attribute

**Performance consideration**: Pre-extracting data prevents DataFrame overhead during training iterations.

---

**Line 5: `        self.labels = dataframe[['anger', 'fear', 'joy', 'sadness', 'surprise']].values`**

**What**: Extracts emotion columns as a 2D NumPy array.

**Why**: Labels must be in numerical form for loss calculation. Extracting as an array creates a matrix where each row is a sample and each column is an emotion.

**How**:
- Double brackets `[[...]]` select multiple columns, returning a DataFrame
- `.values` converts to a 2D NumPy array of shape (n_samples, 5)
- Each row contains binary values for the 5 emotions

**Shape**: For 8 training samples, this creates an (8, 5) array.

---

**Line 6: `        self.tokenizer = tokenizer`**

**What**: Stores the tokenizer as an instance variable.

**Why**: We need the tokenizer in `__getitem__` to process text on-the-fly. Storing it avoids passing it repeatedly.

**How**: Simple assignment makes the tokenizer accessible throughout the class.

---

**Line 7: `        self.max_length = max_length`**

**What**: Stores the maximum sequence length.

**Why**: This parameter is needed when tokenizing in `__getitem__`. Storing it as an attribute makes it reusable.

**How**: The value (default 128) is saved for use in tokenization calls.

---

**Line 8: (blank line)**

**Why**: Separates constructor from other methods.

---

**Line 9: `    def __len__(self) -> int:`**

**What**: Defines the special method that returns dataset length.

**Why**: DataLoader calls this to determine how many samples exist. Required method for PyTorch Dataset interface.

**How**:
- `__len__` is a Python magic method that enables `len(dataset)`
- `-> int` is a type hint indicating it returns an integer
- Must return the total number of samples

---

**Line 10: `        return len(self.texts)`**

**What**: Returns the number of text samples.

**Why**: This tells PyTorch how many items can be retrieved via `__getitem__`.

**How**: `len(self.texts)` counts elements in the texts array, which equals the number of samples.

---

**Line 11: (blank line)**

**Why**: Separates methods for readability.

---

**Line 12: `    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:`**

**What**: Defines the special method for retrieving a single sample by index.

**Why**:
- This is the core of the Dataset class
- DataLoader calls this repeatedly to fetch batches
- Enables indexing like `dataset[0]` to get the first sample

**How**:
- `__getitem__` is Python's indexing magic method
- `idx: int` is the sample index to retrieve
- `-> Dict[str, torch.Tensor]` indicates it returns a dictionary of tensors
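
The two-method contract can be demonstrated without PyTorch. In this sketch, `ToyDataset` and `iterate_batches` are hypothetical stand-ins that show roughly how a loader uses `__len__` and `__getitem__`. The real `DataLoader` additionally handles shuffling, collation into tensors, and parallel workers.

```python
from typing import Dict, List

class ToyDataset:
    """Minimal map-style dataset: only __len__ and __getitem__ are needed."""

    def __init__(self, texts: List[str], labels: List[List[int]]):
        self.texts = texts
        self.labels = labels

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, idx: int) -> Dict[str, object]:
        return {"text": self.texts[idx], "labels": self.labels[idx]}

def iterate_batches(dataset: ToyDataset, batch_size: int):
    """Simplified stand-in for a loader: fetch samples by index and group them."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

toy = ToyDataset(["so happy", "quite scared", "very angry"], [[0, 0, 1], [0, 1, 0], [1, 0, 0]])
batches = list(iterate_batches(toy, batch_size=2))
print(len(toy), [len(b) for b in batches])
```

The loader only needs to know how many samples exist and how to fetch one by index. Everything else about the data stays inside the dataset class.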

---

**Line 13: `        text = str(self.texts[idx])`**

**What**: Retrieves the text at position idx and converts to string.

**Why**:
- We need the text for tokenization
- `str()` ensures the value is a string (defensive programming)
- Some DataFrame operations might return non-string types

**How**:
- `self.texts[idx]` uses NumPy array indexing to get element at position idx
- `str()` wraps it for type safety

---

**Line 14: `        labels = self.labels[idx]`**

**What**: Retrieves the label array for this sample.

**Why**: We need the corresponding emotion labels for supervised learning.

**How**:
- `self.labels[idx]` indexes into the 2D array
- Returns a 1D array of 5 binary values
- Example: `[1, 0, 1, 0, 0]` for anger and joy

---

**Line 15: (blank line)**

**Why**: Separates data retrieval from tokenization logic.

---

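**Lines 16-23: Tokenization block**

**What**: Tokenizes the text with specific parameters.

**Why**: Tokenizing inside `__getitem__` (rather than in `__init__`) processes each sample lazily when it is requested, which keeps startup fast and avoids holding every encoded sample in memory. For a small dataset, pre-tokenizing once in `__init__` would work equally well.

**How**: Same as the earlier tokenization example, with `add_special_tokens=True` made explicit (it is the default), but now applied to one sample per call.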

---

**Line 25: `        return {`**

**What**: Begins returning a dictionary of processed data.

**Why**: Returning a dictionary makes it easy to access different components by name in training loops.

**How**: Dictionary keys will be used to access specific tensors.

---

**Line 25: `            'input_ids': encoding['input_ids'].flatten(),`**

**What**: Includes token IDs, flattened to 1D.

**Why**:
- `encoding['input_ids']` has shape (1, 128) from `return_tensors='pt'`
- `.flatten()` converts to shape (128,) which is cleaner for batching
- DataLoader will stack these into (batch_size, 128)

**How**: `.flatten()` removes the extra batch dimension added by the tokenizer.
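
The shape change can be checked without a tokenizer. The sketch below uses a zero tensor as a stand-in for `encoding['input_ids']` (an assumption for illustration, not real tokenizer output) and shows that `.flatten()` and `.squeeze(0)` give the same result for a (1, 128) input:

```python
import torch

# Stand-in for a tokenizer result with return_tensors='pt': shape (1, 128).
encoding = {"input_ids": torch.zeros(1, 128, dtype=torch.long)}

flat = encoding["input_ids"].flatten()       # shape (128,)
squeezed = encoding["input_ids"].squeeze(0)  # same result for a (1, 128) input

print(encoding["input_ids"].shape, flat.shape, squeezed.shape)

# Stacking four such samples yields the batched shape (4, 128).
batch = torch.stack([flat for _ in range(4)])
print(batch.shape)
```

`.squeeze(0)` states the intent (drop the leading batch dimension) more directly, while `.flatten()` would also collapse any further dimensions.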

---

**Line 26: `            'attention_mask': encoding['attention_mask'].flatten(),`**

**What**: Includes attention mask, also flattened.

**Why**: The attention mask tells the model which tokens are real (1) vs padding (0). Essential for transformer models.

**How**: Same flattening operation as input_ids.

---

**Line 27: `            'labels': torch.tensor(labels, dtype=torch.float)`**

**What**: Converts NumPy label array to PyTorch float tensor.

**Why**:
- PyTorch models require tensor inputs, not NumPy arrays
- `dtype=torch.float` is necessary for BCEWithLogitsLoss
- Binary Cross Entropy expects float targets, not integers

**How**: `torch.tensor()` creates a new tensor from the NumPy array with specified data type.
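
A short sketch of the dtype conversion, using the example label vector `[1, 0, 1, 0, 0]` and dummy all-zero logits (the logits are an assumption; a real model would produce them):

```python
import numpy as np
import torch

labels = np.array([1, 0, 1, 0, 0])            # integer NumPy labels
as_float = torch.tensor(labels, dtype=torch.float)

logits = torch.zeros(5)                        # dummy model outputs
loss = torch.nn.BCEWithLogitsLoss()(logits, as_float)

print(as_float.dtype)   # torch.float32
print(loss.item())      # ln(2), about 0.6931, since sigmoid(0) = 0.5 for every label
```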

---

**Line 28: `        }`**

**What**: Closes the return dictionary.

**Why**: Completes the data structure being returned.

---

**Line 30: `train_dataset = EmotionDataset(train_df, tokenizer)`**

**What**: Creates a dataset instance for training data.

**Why**: Wraps our training DataFrame in the custom dataset class, making it compatible with PyTorch DataLoader.

**How**:
- Calls `__init__` with `train_df` and `tokenizer`
- Uses default `max_length=128`
- Creates an instance stored in `train_dataset`

---

**Line 31: `val_dataset = EmotionDataset(val_df, tokenizer)`**

**What**: Creates a dataset instance for validation data.

**Why**: We need a separate dataset for validation to evaluate on unseen data.

**How**: Same as training dataset but uses `val_df` instead.

---

**Lines 33-38: Verification output**

**What**: Prints dataset information and a sample to verify correct setup.

**Why**: Sanity checking is crucial - we verify the dataset size and shape of returned tensors before training.

**How**:
- `len(train_dataset)` calls `__len__`, should return 8
- `train_dataset[0]` calls `__getitem__(0)`, returning a dictionary
- Iterating over dictionary items shows the shape of each component

**Expected output**:
```
Training dataset size: 8
Validation dataset size: 2

Sample from dataset:
input_ids: torch.Size([128])
attention_mask: torch.Size([128])
labels: torch.Size([5])
```

**Teaching Points**:
1. **Lazy loading**: Tokenization happens in `__getitem__`, not `__init__`, saving memory
2. **Indexing**: Dataset supports `dataset[i]` through `__getitem__`
3. **Iteration**: Can loop over dataset, though DataLoader is preferred
4. **Consistency**: Each sample returns the same structure (dictionary with same keys)
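
To make the `__len__` / `__getitem__` contract concrete, here is a minimal sketch of a dataset with the same structure. `ToyDataset` and its character-code "tokenization" are hypothetical stand-ins so the example runs without downloading a tokenizer; they are not the tutorial's `EmotionDataset`.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for EmotionDataset; no real tokenizer involved."""

    def __init__(self, texts, labels, max_length=8):
        self.texts = texts
        self.labels = labels
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Fake "tokenization": character codes, truncated and zero-padded to max_length.
        ids = [ord(c) for c in self.texts[idx]][: self.max_length]
        mask = [1] * len(ids) + [0] * (self.max_length - len(ids))
        ids = ids + [0] * (self.max_length - len(ids))
        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "attention_mask": torch.tensor(mask, dtype=torch.long),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float),
        }


toy = ToyDataset(["happy", "so scared"], [[0, 0, 1, 0, 0], [0, 1, 0, 0, 0]])
print(len(toy))                      # calls __len__
sample = toy[0]                      # calls __getitem__(0)
print({k: tuple(v.shape) for k, v in sample.items()})
```

Every sample comes back with the same keys and shapes, which is what allows a collator to stack them later.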

## Section 7: Data Collator

### Code Block Analysis

```python
@dataclass
class DataCollator:

    def __call__(self, batch: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        labels = torch.stack([item['labels'] for item in batch])

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

collator = DataCollator()
print("Data collator initialized successfully")
```

### Line-by-Line Breakdown

**Line 1: `@dataclass`**

**What**: Decorator that transforms the following class into a dataclass.

**Why**: Dataclasses automatically generate `__init__`, `__repr__`, and other methods, reducing boilerplate. Our collator declares no fields, so the decorator is optional here; it mainly keeps the class ready for configuration fields to be added later without writing an `__init__` by hand.

**How**: The `@dataclass` decorator modifies the class definition at creation time, adding special methods. Even with no attributes defined, it creates a valid `__init__` method.

---

**Line 2: `class DataCollator:`**

**What**: Defines a callable class for batching dataset samples.

**Why**: DataLoader needs a collation function to combine individual samples into batches. Making it a class (rather than a function) provides better organization and makes it callable like a function through `__call__`.

**How**: The class name follows convention (PascalCase). No inheritance is needed as we implement `__call__` directly.

**Collator Purpose**: Transforms a list of individual samples into a single batch by stacking tensors.

---

**Line 3: (blank line)**

**Why**: PEP 8 style guideline.

---

**Line 4: `    def __call__(self, batch: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:`**

**What**: Defines the special method that makes instances callable like functions.

**Why**:
- DataLoader calls `collator(batch)` where batch is a list of samples
- `__call__` allows `collator_instance(batch)` syntax
- Makes the class behave like a function

**How**:
- `__call__` is a Python magic method
- `batch: List[Dict[str, torch.Tensor]]` indicates batch is a list of dictionaries
  - Each dictionary has string keys and tensor values
  - Example: `[{'input_ids': tensor(...), ...}, {'input_ids': tensor(...), ...}]`
- `-> Dict[str, torch.Tensor]` shows it returns a single dictionary of tensors

**Type Hint Breakdown**:
- `List[...]`: Python list
- `Dict[str, torch.Tensor]`: Dictionary with string keys and tensor values
- Input: List of dictionaries (one per sample)
- Output: Single dictionary (batched data)

---

**Line 5: `        input_ids = torch.stack([item['input_ids'] for item in batch])`**

**What**: Stacks all input_ids tensors from individual samples into a single 2D tensor.

**Why**: Neural networks process batches, not individual samples. Stacking combines multiple 1D tensors (128,) into a 2D tensor (batch_size, 128).

**How**:
- List comprehension: `[item['input_ids'] for item in batch]` creates a list of tensors
  - If batch has 4 samples, creates list of 4 tensors each of shape (128,)
- `torch.stack()` combines them into a single tensor
  - Stacks along a new dimension (dimension 0)
  - Result shape: (4, 128) for batch size 4

**torch.stack vs torch.cat**:
- `stack`: Creates new dimension, requires same-shaped tensors
- `cat`: Concatenates along an existing dimension; shapes may differ only along that dimension
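
The difference is easiest to see from the resulting shapes. A small sketch with made-up tensors (the names `a`, `b`, and `wide` are illustrative only):

```python
import torch

a = torch.ones(128)
b = torch.zeros(128)

stacked = torch.stack([a, b])   # new leading dimension -> shape (2, 128)
joined = torch.cat([a, b])      # existing dimension grows -> shape (256,)

# cat tolerates a size mismatch only along the concatenation dimension.
wide = torch.cat([torch.ones(2, 3), torch.ones(5, 3)], dim=0)  # shape (7, 3)

# stack allocates a new tensor: editing the result leaves the inputs untouched.
stacked[0, 0] = 42.0
print(stacked.shape, joined.shape, wide.shape, a[0].item())
```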

---

**Line 6: `        attention_mask = torch.stack([item['attention_mask'] for item in batch])`**

**What**: Stacks attention masks into a batch.

**Why**: Each sample's attention mask must be batched alongside its input_ids to maintain alignment.

**How**: Identical process to input_ids stacking, resulting in shape (batch_size, 128).

**Attention Mask Content**:
- 1: Real token (attend to this)
- 0: Padding token (ignore this)

---

**Line 7: `        labels = torch.stack([item['labels'] for item in batch])`**

**What**: Stacks label tensors into a batch.

**Why**: Batching labels is necessary for batch loss calculation during training.

**How**:
- Each label tensor has shape (5,) representing 5 emotions
- Stacking creates shape (batch_size, 5)
- Example: For batch size 4, result is (4, 5)

---

**Line 8: (blank line)**

**Why**: Separates processing from return statement.

---

**Line 9: `        return {`**

**What**: Begins return dictionary construction.

**Why**: Maintaining dictionary structure makes it easy to access batch components by name in training loops.

---

**Lines 10-12: Dictionary entries**

**What**: Returns the batched tensors in a dictionary.

**Why**: This structure matches what training loops expect, providing named access to batch components.

**How**: Each key maps to a batched tensor:
- `'input_ids'`: Shape (batch_size, 128)
- `'attention_mask'`: Shape (batch_size, 128)
- `'labels'`: Shape (batch_size, 5)

---

**Line 13: `        }`**

**What**: Closes the return dictionary.

---

**Line 15: `collator = DataCollator()`**

**What**: Creates an instance of the DataCollator class.

**Why**: We need an instance to pass to DataLoader's `collate_fn` parameter.

**How**:
- Calls the auto-generated `__init__` (from @dataclass)
- Since there are no attributes, initialization is trivial
- The instance can be called: `collator(batch)`

---

**Line 16: `print("Data collator initialized successfully")`**

**What**: Confirmation message.

**Why**: Provides feedback that the collator was created without errors.

**Teaching Points**:

1. **Batching Process**:
   ```
   Input:  [{'input_ids': (128,)}, {'input_ids': (128,)}, ...]
   Output: {'input_ids': (batch_size, 128), ...}
   ```

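2. **Why custom collator**: PyTorch's default collator can already batch dictionaries of equal-shaped tensors, so this class mirrors its behavior. Writing it out makes the batching step explicit and easy to extend, for example to pad each batch to its longest sequence.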

3. **Alternative**: Could use a function instead of a class, but classes provide better organization.

4. **Memory cost**: `torch.stack` always allocates a new tensor and copies each sample into it; it never returns a view of the inputs.
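
As a sanity check, a hand-written collator can be compared against PyTorch's built-in `default_collate` on samples shaped like the ones our dataset returns. This is a sketch with random stand-in data; `stack_collate` is a hypothetical function-style equivalent of the `DataCollator` class.

```python
from typing import Dict, List

import torch
from torch.utils.data.dataloader import default_collate


def stack_collate(batch: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Function form of the collator: stack each field across samples."""
    return {key: torch.stack([item[key] for item in batch]) for key in batch[0]}


# Four fake samples with the same shapes the dataset returns.
samples = [
    {
        "input_ids": torch.randint(0, 1000, (128,)),
        "attention_mask": torch.ones(128, dtype=torch.long),
        "labels": torch.zeros(5),
    }
    for _ in range(4)
]

custom = stack_collate(samples)
builtin = default_collate(samples)

for key in custom:
    print(key, tuple(custom[key].shape), torch.equal(custom[key], builtin[key]))
```

For fixed-size tensors the two agree field by field; a custom collator earns its keep once samples need padding or other per-batch processing.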

## Section 8: DataLoader Setup

### Code Block Analysis

```python
BATCH_SIZE = 4

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collator
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collator
)

print(f"Number of training batches: {len(train_loader)}")
print(f"Number of validation batches: {len(val_loader)}")

print("\nSample batch from DataLoader:")
for batch in train_loader:
    for key, value in batch.items():
        print(f"{key}: {value.shape}")
    break
```

### Line-by-Line Breakdown

**Line 1: `BATCH_SIZE = 4`**

**What**: Defines a constant for the batch size.

**Why**:
- Batch size controls how many samples are processed together
- Using a constant makes it easy to experiment with different values
- Larger batches train faster but require more memory
- Smaller batches provide more frequent updates and noisier gradients

**How**: Uppercase naming convention indicates this is a constant that should not be changed during execution.

**Batch Size Considerations**:
- **Too small** (1-2): Slow training, noisy gradients, poor GPU utilization
- **Too large** (128+): High memory usage, may not fit in GPU, can reduce generalization
- **Sweet spot** (8-64): Good balance for most tasks
- **Our choice** (4): Appropriate for small dataset and demonstration purposes

---

**Line 2: (blank line)**

---

**Line 3: `train_loader = DataLoader(`**

**What**: Begins creation of a DataLoader for training data.

**Why**: DataLoader handles batching, shuffling, and parallel loading, abstracting away complex iteration logic. It converts our dataset into an efficient iterator.

**How**: `DataLoader` is PyTorch's built-in utility class that wraps datasets.

---

**Line 4: `    train_dataset,`**

**What**: First positional argument - the dataset to load from.

**Why**: DataLoader needs to know which dataset to iterate over.

**How**: Passes our `train_dataset` instance (EmotionDataset object with 8 samples).

---

**Line 5: `    batch_size=BATCH_SIZE,`**

**What**: Specifies how many samples per batch.

**Why**: Determines batch dimensions. With 8 samples and batch size 4, we get 2 batches per epoch.

**How**: DataLoader will call `dataset[i]` for indices in each batch and pass the list to the collator.

**Batch Calculation**:
- Total samples: 8
- Batch size: 4
- Number of batches: 8 / 4 = 2 full batches

---

**Line 6: `    shuffle=True,`**

**What**: Enables random shuffling of data each epoch.

**Why**:
- Shuffling prevents the model from learning spurious patterns based on data order
- Creates different batches each epoch, improving generalization
- Reduces risk of overfitting to specific batch compositions

**How**: Before each epoch, DataLoader randomly permutes the dataset indices, then creates batches from this shuffled order.

**Important**: Always shuffle training data, never validation/test data.
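
The per-epoch reshuffle can be observed on a toy dataset. In this sketch the list of integers stands in for the real dataset, and the seeded `torch.Generator` is an addition that only makes the run repeatable:

```python
import torch
from torch.utils.data import DataLoader

data = list(range(8))  # stand-in dataset: eight integer "samples"

# The seeded generator is only here to make the shuffle repeatable.
gen = torch.Generator().manual_seed(0)
loader = DataLoader(data, batch_size=4, shuffle=True, generator=gen)

for epoch in range(2):
    order = [int(x) for batch in loader for x in batch]
    print(f"epoch {epoch}: {order}")
```

Each epoch visits all eight samples exactly once, but the order (and therefore the batch composition) changes between epochs.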

---

**Line 7: `    collate_fn=collator`**

**What**: Specifies our custom collation function.

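**Why**: Passing the collator explicitly documents how samples are combined and gives us a hook for custom batching logic later. PyTorch's default collation would also stack this dictionary of fixed-size tensors; a custom `collate_fn` becomes necessary once samples differ in shape, for example with per-batch padding.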

**How**: DataLoader will call `collator(batch_samples)` where `batch_samples` is a list of dictionaries from our dataset.

**Collation Flow**:
1. DataLoader samples indices: `[2, 5, 1, 7]` (example for batch size 4)
2. Calls `dataset[i]` for each index
3. Collects results into list of dictionaries
4. Passes list to `collator()`
5. Receives batched dictionary back

---

**Line 8: `)`**

**What**: Closes the DataLoader constructor.

**How**: Creates an iterable `train_loader` object.

---

**Line 10: `val_loader = DataLoader(`**

**What**: Begins creation of validation DataLoader.

**Why**: We need a separate loader for validation data with different settings.

---

**Lines 11-15: Validation DataLoader parameters**

**What**: Similar to training loader but with `shuffle=False`.

**Why shuffle=False**:
- Validation is for evaluation, not training
- Consistent order makes debugging easier
- Shuffling does not improve validation performance
- Deterministic order ensures reproducible results

**Other parameters identical**: Same batch size and collator for consistency.

---

**Line 17: `print(f"Number of training batches: {len(train_loader)}")`**

**What**: Prints how many batches are in the training loader.

**Why**: Understanding batch count helps predict training time and verify setup.

**How**:
- `len(train_loader)` calculates: ceil(dataset_size / batch_size)
- For 8 samples with batch size 4: 8 / 4 = 2 batches

**Formula**:
```
num_batches = ceil(num_samples / batch_size)
```
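
The formula, including the `drop_last` variant, can be written as a small helper. `num_batches` is a hypothetical name used for illustration; `DataLoader` computes this internally.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Batches per epoch: floor division if the partial batch is dropped, else ceil."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


print(num_batches(8, 4))                  # 2 (training set)
print(num_batches(2, 4))                  # 1 (validation set, partial batch)
print(num_batches(2, 4, drop_last=True))  # 0 (the only batch would be discarded)
```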

---

**Line 18: `print(f"Number of validation batches: {len(val_loader)}")`**

**What**: Prints validation batch count.

**How**: For 2 validation samples with batch size 4: ceil(2 / 4) = 1 batch (partial batch).

---

**Line 20: `print("\nSample batch from DataLoader:")`**

**What**: Header for batch inspection.

---

**Line 21: `for batch in train_loader:`**

**What**: Begins iteration over the training DataLoader.

**Why**: We want to inspect one batch to verify shapes and structure.

**How**:
- `for batch in train_loader` triggers DataLoader's iteration protocol
- DataLoader calls dataset `__getitem__` for batch_size samples
- Collates them using our collator
- Returns batched dictionary as `batch`

---

**Line 22: `    for key, value in batch.items():`**

**What**: Iterates over the batch dictionary entries.

**Why**: We want to see each component and its shape.

**How**: `batch.items()` yields (key, value) pairs from the dictionary.

---

**Line 23: `        print(f"{key}: {value.shape}")`**

**What**: Prints each tensor's name and shape.

**Why**: Verifying tensor shapes is crucial before training to catch dimensionality errors.

**Expected Output**:
```
input_ids: torch.Size([4, 128])
attention_mask: torch.Size([4, 128])
labels: torch.Size([4, 5])
```

**Shape Interpretation**:
- `[4, 128]`: 4 samples, each with 128 tokens
- `[4, 5]`: 4 samples, each with 5 emotion labels

---

**Line 24: `    break`**

**What**: Exits the loop after processing one batch.

**Why**: We only need to inspect one batch, not iterate through all of them.

**How**: `break` immediately exits the `for` loop.

**Teaching Points**:

1. **DataLoader Benefits**:
   - Automatic batching
   - Background data loading (with `num_workers`)
   - Memory-efficient iteration
   - Faster host-to-GPU transfer (with `pin_memory=True`)

2. **Batch vs Epoch**:
   - **Batch**: One group of samples (e.g., 4 samples)
   - **Epoch**: Full pass through entire dataset (e.g., 2 batches = 1 epoch)

3. **Memory Considerations**:
   - Larger batches use more GPU memory
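   - Rough rule: activation memory grows with batch_size × sequence_length × hidden_size × num_layers, on top of a fixed cost for weights, gradients, and optimizer state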
   - If out of memory, reduce batch_size

4. **Drop Last**:
   - `drop_last=True` would discard incomplete final batches
   - We don't use it here: the training set divides evenly (8 / 4), and dropping the partial validation batch (2 samples) would leave nothing to evaluate
   - Useful when batch normalization requires consistent batch sizes

## Section 9: Model Definition

### Code Block Analysis

```python
class EmotionClassifier(nn.Module):

    def __init__(self, model_name: str, num_labels: int = 5, dropout: float = 0.3):
        super(EmotionClassifier, self).__init__()

        self.transformer = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.transformer.config.hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        pooled_output = outputs.last_hidden_state[:, 0, :]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        return logits

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = EmotionClassifier(MODEL_NAME, num_labels=5)
model = model.to(device)

print(f"Device: {device}")
print(f"\nModel architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
```

### Line-by-Line Breakdown

**Line 1: `class EmotionClassifier(nn.Module):`**

**What**: Defines a neural network class inheriting from PyTorch's Module base class.

**Why**:
- All PyTorch models must inherit from `nn.Module`
- This provides access to training methods, parameter management, and device handling
- Enables automatic gradient computation and model serialization

**How**:
- `class EmotionClassifier` names our custom model
- `(nn.Module)` specifies inheritance
- `nn.Module` is PyTorch's base class for all neural network modules

**nn.Module Benefits**:
1. Automatic parameter tracking
2. `.to(device)` for GPU/CPU transfer
3. `.train()` and `.eval()` mode switching
4. `.parameters()` for optimization
5. State dictionary for saving/loading

---

**Line 2: (blank line)**

---

**Line 3: `    def __init__(self, model_name: str, num_labels: int = 5, dropout: float = 0.3):`**

**What**: Constructor that initializes the model architecture.

**Why**: Defines what layers the model contains. This runs once when creating a model instance.

**How**:
- `model_name: str`: Name of pretrained transformer (e.g., 'bert-base-uncased')
- `num_labels: int = 5`: Number of output classes (our 5 emotions)
- `dropout: float = 0.3`: Dropout probability for regularization

**Parameter Purposes**:
- `model_name`: Determines which transformer to load
- `num_labels`: Size of output layer
- `dropout`: Controls regularization strength (0.3 = 30% of neurons dropped)

---

**Line 4: `        super(EmotionClassifier, self).__init__()`**

**What**: Calls the parent class (nn.Module) constructor.

**Why**: **CRITICAL** - This initializes the nn.Module infrastructure. Without this, parameter registration, device handling, and other essential features will not work.

**How**:
- `super()` references the parent class
- `EmotionClassifier, self` specifies which class we are in
- `.__init__()` calls the parent constructor

**Common Python 3 Alternative**: `super().__init__()` (shorter, equivalent)

**What happens inside nn.Module.__init__()**:
1. Creates the parameter and buffer registries (`_parameters`, `_buffers`)
2. Sets up hook dictionaries for forward/backward passes
3. Sets the `training` flag to `True`
4. Creates the submodule registry (`_modules`)
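
The consequence of skipping the parent constructor can be reproduced in a few lines. `MissingSuper` and `WithSuper` are hypothetical classes written for this sketch; the first one is deliberately broken.

```python
import torch.nn as nn


class MissingSuper(nn.Module):
    """Deliberately broken: assigns a layer before nn.Module.__init__() has run."""

    def __init__(self):
        self.classifier = nn.Linear(768, 5)


class WithSuper(nn.Module):
    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(768, 5)


try:
    MissingSuper()
    failed = False
except AttributeError as err:
    failed = True
    print(f"AttributeError: {err}")

print(failed, len(list(WithSuper().parameters())))
```

PyTorch refuses to register the layer because the submodule registry does not exist yet, whereas the corrected class exposes the layer's weight and bias through `.parameters()`.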

---

**Line 5: (blank line)**

---

**Line 6: `        self.transformer = AutoModel.from_pretrained(model_name)`**

**What**: Loads a pretrained transformer model and stores it as a submodule.

**Why**:
- Transfer learning: We use knowledge from pretraining on massive datasets
- Pretrained models understand language structure, grammar, and semantics
- Fine-tuning is much faster than training from scratch
- Better performance, especially with limited data

**How**:
- `AutoModel.from_pretrained()` downloads and initializes the model
- Downloads weights (hundreds of MB) from Hugging Face Hub
- Cached locally to avoid re-downloading
- Returns a transformer model (BERT in our case) with pretrained weights

**Model Components Loaded**:
- 12 transformer layers (for bert-base)
- 768-dimensional hidden states
- Multi-head attention mechanisms
- Position embeddings
- Token embeddings
- ~110 million parameters

---

**Line 7: `        self.dropout = nn.Dropout(dropout)`**

**What**: Creates a dropout layer for regularization.

**Why**:
- **Regularization**: Prevents overfitting by randomly zeroing neurons during training
- **Ensemble effect**: Creates different network paths each forward pass
- **Robustness**: Forces network to learn redundant representations

**How**:
- `nn.Dropout(dropout)` creates a dropout layer
- `dropout=0.3` means 30% of inputs are randomly set to zero during training
- Automatically disabled during evaluation (`.eval()` mode)
- Remaining values are scaled by 1/(1-dropout) to maintain expected value

**Dropout Behavior**:
- **Training mode**: Randomly drops 30% of neurons
- **Eval mode**: All neurons active (no dropout)
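
Both modes, and the 1/(1-dropout) rescaling, can be observed on a vector of ones. A minimal sketch; the seed and the input size of 1000 are arbitrary choices for the demonstration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(0.3)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # entries are either 0 or 1 / (1 - 0.3)

dropout.eval()
eval_out = dropout(x)    # identity in eval mode

kept = train_out[train_out != 0]
print(kept.unique())                       # single value, about 1.4286
print((train_out == 0).float().mean())     # roughly 0.3, varies with the seed
print(torch.equal(eval_out, x))            # True
```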

---

**Line 8: `        self.classifier = nn.Linear(self.transformer.config.hidden_size, num_labels)`**

**What**: Creates a linear (fully connected) classification layer.

**Why**:
- Transforms transformer outputs (768-dim) to label space (5-dim)
- This is the "classification head" specific to our task
- Only this layer is task-specific; transformer is general-purpose

**How**:
- `nn.Linear(in_features, out_features)` creates weight matrix W and bias b
- `self.transformer.config.hidden_size` gets transformer output dimension (768)
- `num_labels` is our output dimension (5)
- Computes: output = input @ W.T + b

**Layer Dimensions**:
- Input: (batch_size, 768)
- Weight matrix: (5, 768)
- Bias: (5,)
- Output: (batch_size, 5)

**Parameters**: 768 × 5 + 5 = 3,845 parameters
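
The dimensions and the parameter count can be confirmed directly on a standalone layer. This sketch hard-codes 768, assuming a bert-base hidden size, rather than reading it from a loaded model's config:

```python
import torch.nn as nn

classifier = nn.Linear(768, 5)   # 768 assumes a bert-base hidden size

print(tuple(classifier.weight.shape))   # (5, 768)
print(tuple(classifier.bias.shape))     # (5,)
print(sum(p.numel() for p in classifier.parameters()))   # 3845
```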

---

**Line 9: (blank line)**

---

**Line 10: `    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:`**

**What**: Defines the forward pass - how data flows through the network.

**Why**:
- PyTorch calls this method for you when the model is invoked as `model(...)`, via `nn.Module.__call__`
- Defines the computational graph for backpropagation
- Separates model architecture (init) from computation (forward)

**How**:
- `forward` is a special method name recognized by nn.Module
- Takes input tensors (input_ids, attention_mask)
- Returns output tensor (logits)
- Type hints document tensor types

---

**Line 11: `        outputs = self.transformer(`**

**What**: Begins passing inputs through the transformer model.

**Why**: The transformer extracts contextual representations from token IDs.

---

**Lines 12-13: Transformer inputs**

**What**: Passes token IDs and attention mask to transformer.

**Why**:
- `input_ids`: The actual tokens to process
- `attention_mask`: Tells model which tokens to attend to (1) vs ignore (0)

**How**: Transformer processes these through 12 layers of self-attention and feedforward networks.

---

**Line 14: `        )`**

**What**: Closes the transformer call.

**How**: `outputs` is a special object containing:
- `last_hidden_state`: Shape (batch_size, 128, 768) - all token representations
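- `pooler_output`: Shape (batch_size, 768) - the [CLS] hidden state passed through an extra linear layer and tanh (not used here; the model takes the raw [CLS] vector from `last_hidden_state`)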
- Additional attributes depending on model configuration

---

**Line 15: (blank line)**

---

**Line 16: `        pooled_output = outputs.last_hidden_state[:, 0, :]`**

**What**: Extracts the [CLS] token representation from the transformer output.

**Why**:
- In BERT, the first token ([CLS]) is trained to represent the entire sequence
- This single 768-dimensional vector summarizes the input text
- Commonly used for classification tasks

**How**:
- `outputs.last_hidden_state` has shape (batch_size, 128, 768)
  - 128 positions (tokens)
  - 768 dimensions per token
- `[:, 0, :]` slices to get first token ([CLS]) for all samples
  - `:` - all samples in batch
  - `0` - first token position
  - `:` - all 768 dimensions
- Result shape: (batch_size, 768)

**Indexing Breakdown**:
```
[batch_dimension, sequence_dimension, feature_dimension]
[      :        ,         0         ,         :        ]
   all samples     first token       all features
```
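
The slice can be tried on a random tensor with the same layout. The tensor below is a stand-in for `outputs.last_hidden_state`, not real BERT output:

```python
import torch

# Random stand-in for outputs.last_hidden_state: (batch_size, seq_len, hidden_size).
last_hidden_state = torch.randn(4, 128, 768)

cls_vectors = last_hidden_state[:, 0, :]
print(cls_vectors.shape)   # torch.Size([4, 768])

# Row i of the result is exactly the position-0 vector of sample i.
print(torch.equal(cls_vectors[2], last_hidden_state[2, 0]))   # True
```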

---

**Line 17: `        pooled_output = self.dropout(pooled_output)`**

**What**: Applies dropout to the pooled representation.

**Why**:
- Regularization between transformer and classifier
- Prevents overfitting to specific neuron patterns
- Improves generalization

**How**:
- During training: Randomly zeros 30% of the 768 values
- During evaluation: No-op (passes through unchanged)
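- Not an in-place operation: `nn.Dropout` returns a new tensor by default (`inplace=False`), and the name `pooled_output` is simply rebound to it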

---

**Line 18: `        logits = self.classifier(pooled_output)`**

**What**: Passes the pooled representation through the classification layer.

**Why**: Transforms from feature space (768-dim) to label space (5-dim), producing raw scores for each emotion.

**How**:
- Matrix multiplication: (batch_size, 768) @ (768, 5)
- Add bias: + (5,)
- Result: (batch_size, 5) - one score per emotion per sample

**Logits vs Probabilities**:
- **Logits**: Raw scores (can be any value)
- **Probabilities**: After sigmoid/softmax (0 to 1)
- We return logits because BCEWithLogitsLoss applies sigmoid internally
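
A small sketch of the relationship, using made-up logits for a single sample. It also checks that `BCEWithLogitsLoss` on logits matches `BCELoss` on the sigmoid probabilities:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5, -3.0, 0.0]])   # made-up scores for one sample
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0]])

probs = torch.sigmoid(logits)   # each value in (0, 1), independent per label
preds = probs > 0.5             # sigmoid(z) > 0.5 exactly when z > 0

loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_from_probs = nn.BCELoss()(probs, targets)

print(probs)
print(preds)
print(torch.allclose(loss_from_logits, loss_from_probs))   # True
```

The fused loss is preferred in practice because it is more numerically stable for large-magnitude logits.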

---

**Line 19: (blank line)**

---

**Line 20: `        return logits`**

**What**: Returns the model's predictions (logits).

**Why**: These logits are passed to the loss function during training or converted to probabilities during inference.

**How**: Returns tensor of shape (batch_size, 5).

---

**Line 22: `device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')`**

**What**: Determines whether to use GPU or CPU for computation.

**Why**:
- GPUs dramatically accelerate training (often by an order of magnitude or more for transformer models, depending on hardware)
- Code should run on both GPU and CPU for compatibility
- Automatic device selection makes code portable

**How**:
- `torch.cuda.is_available()` checks if CUDA-capable GPU is present
- Returns `torch.device('cuda')` if GPU available, else `torch.device('cpu')`
- `torch.device` object is used for tensor and model placement

---

**Line 23: `model = EmotionClassifier(MODEL_NAME, num_labels=5)`**

**What**: Instantiates our model class.

**Why**: Creates the actual model object we will train.

**How**:
- Calls `__init__` with `model_name='bert-base-uncased'` and `num_labels=5`
- Downloads BERT weights (if not cached)
- Initializes all layers
- Creates ~110M parameters

---

**Line 24: `model = model.to(device)`**

**What**: Moves model parameters to the selected device (GPU or CPU).

**Why**:
- Model parameters must be on the same device as input data
- `.to(device)` is PyTorch's method for device transfer
- Returns the model for chaining

**How**:
- If device is 'cuda', moves all parameters to GPU memory
- If device is 'cpu', keeps them in RAM
- Also moves buffers (non-trainable tensors registered on the module; batch norm statistics are the classic example)

**Performance Impact**:
- CPU: Slow training, lower memory usage
- GPU: Fast training, higher memory usage
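
The same-device requirement applies to inputs as well as parameters. A minimal sketch with a single linear layer standing in for the full classifier, so nothing needs to be downloaded:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(768, 5).to(device)      # tiny stand-in for the full classifier
features = torch.randn(4, 768)            # created on the CPU by default

# Inputs must be moved to the model's device before the forward pass.
logits = model(features.to(device))
print(next(model.parameters()).device, logits.device, tuple(logits.shape))
```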

---

**Lines 26-30: Model inspection**

**What**: Prints model information for verification.

**Why**: Understanding model size and structure before training is crucial.

**How**:

**Line 26**: Prints device name
- Expected: "cuda" or "cpu"

**Line 28**: Prints model architecture
- Shows all layers and their parameters
- Displays transformer layers and classification head

**Line 29**: Counts total parameters
- `model.parameters()` yields all parameter tensors
- `p.numel()` counts elements in each tensor
- `sum()` totals across all parameters
- `:,` formats with thousands separators

**Line 30**: Counts trainable parameters
- `p.requires_grad` filters to only trainable parameters
- Should equal total parameters (all are trainable by default)
- Some advanced techniques freeze layers (requires_grad=False)

**Expected Output**:
```
Device: cuda
Model architecture: [Shows full model structure]
Total parameters: ~110,000,000
Trainable parameters: ~110,000,000
```

**Teaching Points**:

1. **Model Architecture**:
   ```
   Input (batch_size, 128)
      ↓
   Transformer (12 layers)
      ↓
   [CLS] token (batch_size, 768)
      ↓
   Dropout
      ↓
   Linear (batch_size, 5)
      ↓
   Logits
   ```

2. **Transfer Learning**: Only the classifier is randomly initialized; the transformer starts from pretrained weights.

3. **Fine-tuning vs Feature Extraction**:
   - **Fine-tuning** (our approach): Update all parameters
   - **Feature extraction**: Freeze transformer, train only classifier

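4. **Memory Usage**: ~440MB for bert-base weights alone (110M parameters × 4 bytes in float32), ~220MB in float16. Training needs several times more for gradients, optimizer state, and activations.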
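
The difference between fine-tuning and feature extraction comes down to the `requires_grad` flag. The sketch below uses `TinyClassifier`, a hypothetical miniature in which a linear layer plays the role of the transformer, so the parameter counts are small enough to verify by hand:

```python
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical miniature of EmotionClassifier; a Linear layer plays the transformer."""

    def __init__(self, hidden_size: int = 16, num_labels: int = 5):
        super().__init__()
        self.transformer = nn.Linear(32, hidden_size)   # stand-in encoder
        self.classifier = nn.Linear(hidden_size, num_labels)


def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


tiny = TinyClassifier()
before = count_trainable(tiny)          # fine-tuning: everything trainable

# Feature extraction: freeze the encoder, leave only the head trainable.
for param in tiny.transformer.parameters():
    param.requires_grad = False

after = count_trainable(tiny)
print(before, after)   # 613 85
```

Applied to the real model, the same loop over `model.transformer.parameters()` would leave only the 3,845 classifier parameters trainable.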

## Section 10: Trainer Class

### Code Block Analysis

```python
class Trainer:

    def __init__(self, model: nn.Module, train_loader: DataLoader, val_loader: DataLoader,
                 criterion, optimizer, device: torch.device):
        self.model = model
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.criterion = criterion
        self.optimizer = optimizer
        self.device = device
        self.history = {'train_loss': [], 'val_loss': [], 'val_acc': []}

    def train_epoch(self) -> float:
        self.model.train()
        total_loss = 0

        for batch in tqdm(self.train_loader, desc="Training"):
            input_ids = batch['input_ids'].to(self.device)
            attention_mask = batch['attention_mask'].to(self.device)
            labels = batch['labels'].to(self.device)

            self.optimizer.zero_grad()

            logits = self.model(input_ids=input_ids, attention_mask=attention_mask)
            loss = self.criterion(logits, labels)

            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(self.train_loader)
        return avg_loss

    def validate_epoch(self) -> tuple:
        self.model.eval()
        total_loss = 0
        all_preds = []
        all_labels = []

        with torch.no_grad():
            for batch in tqdm(self.val_loader, desc="Validation"):
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)

                logits = self.model(input_ids=input_ids, attention_mask=attention_mask)
                loss = self.criterion(logits, labels)

                total_loss += loss.item()

                preds = torch.sigmoid(logits) > 0.5
                all_preds.append(preds.cpu().numpy())
                all_labels.append(labels.cpu().numpy())

        avg_loss = total_loss / len(self.val_loader)

        all_preds = np.vstack(all_preds)
        all_labels = np.vstack(all_labels)
        accuracy = accuracy_score(all_labels, all_preds)

        return avg_loss, accuracy

    def train(self, num_epochs: int):
        print("Starting training...")

        for epoch in range(num_epochs):
            print(f"\nEpoch {epoch + 1}/{num_epochs}")

            train_loss = self.train_epoch()
            val_loss, val_acc = self.validate_epoch()

            self.history['train_loss'].append(train_loss)
            self.history['val_loss'].append(val_loss)
            self.history['val_acc'].append(val_acc)

            print(f"Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")

        print("\nTraining complete!")
```

### Line-by-Line Breakdown

**Line 1: `class Trainer:`**

**What**: Defines a custom training orchestration class.

**Why**:
- Encapsulates training logic for reusability and organization
- Manages training loop, validation, and history tracking
- Separates model definition from training procedure
- Makes code cleaner and more maintainable

**How**: Creates a regular Python class (not inheriting from nn.Module) for training utilities.

---

**Line 2: (blank line)**

---

**Line 3: `    def __init__(self, model: nn.Module, train_loader: DataLoader, val_loader: DataLoader,`**

**What**: Constructor that stores all training components.

**Why**: Centralizes all objects needed for training in one place.

---

**Line 4: `                 criterion, optimizer, device: torch.device):`**

**What**: Continuation of constructor parameters.

**How**:
- `model`: The neural network to train
- `train_loader`: DataLoader for training data
- `val_loader`: DataLoader for validation data
- `criterion`: Loss function
- `optimizer`: Optimization algorithm
- `device`: CPU or GPU

---

**Line 5: `        self.model = model`**

**What**: Stores model reference as instance variable.

**Why**: Allows all methods to access the model via `self.model`.

**How**: Simple attribute assignment.

---

**Lines 6-10: Store remaining components**

**What**: Stores train_loader, val_loader, criterion, optimizer, and device as instance variables.

**Why**: Makes these accessible throughout the class lifetime.

**How**: Each becomes an attribute: `self.train_loader`, `self.val_loader`, etc.

---

**Line 11: `        self.history = {'train_loss': [], 'val_loss': [], 'val_acc': []}`**

**What**: Initializes a dictionary to track metrics across epochs.

**Why**:
- Enables plotting learning curves after training
- Helps diagnose overfitting, underfitting, convergence
- Provides historical record for analysis

**How**: Creates dict with three empty lists:
- `train_loss`: Training loss per epoch
- `val_loss`: Validation loss per epoch
- `val_acc`: Validation accuracy per epoch

---

**Line 12: (blank line)**

---

**Line 13: `    def train_epoch(self) -> float:`**

**What**: Defines method to train for one complete epoch.

**Why**: Separates single-epoch logic from multi-epoch orchestration.

**How**:
- Iterates through all training batches once
- Returns average loss for the epoch
- `-> float` indicates return type

---

**Line 14: `        self.model.train()`**

**What**: Sets model to training mode.

**Why**: **CRITICAL** - Enables training-specific behaviors:
- Activates dropout (randomly drops neurons)
- Enables batch normalization in training mode
- Affects any layer with different train/eval behavior

**How**:
- `model.train()` is inherited from nn.Module
- Sets `model.training = True`
- Recursively applies to all submodules

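**Common Mistake**: `validate_epoch()` switches the model to eval mode, so omitting this call leaves dropout disabled for every training epoch after the first validation pass. Models loaded with Hugging Face `from_pretrained()` also start in eval mode.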
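The effect of the two modes can be seen on a standalone dropout layer. This is a minimal sketch using a bare `nn.Dropout` rather than the notebook's model; the tensor size and dropout rate are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)          # illustrative rate, not taken from the notebook
x = torch.ones(1000)

dropout.train()                      # training mode: roughly half the values are zeroed
train_out = dropout(x)
zeroed_fraction = (train_out == 0).float().mean().item()

dropout.eval()                       # evaluation mode: dropout passes input through unchanged
eval_out = dropout(x)

print(f"zeroed in train mode: {zeroed_fraction:.2f}")
print("eval output unchanged:", torch.equal(eval_out, x))
```

In training mode the surviving values are scaled by `1 / (1 - p)` so the expected activation stays the same, which is why no rescaling is needed at evaluation time.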

---

**Line 15: `        total_loss = 0`**

**What**: Initializes accumulator for batch losses.

**Why**: We sum losses across all batches to compute average epoch loss.

**How**: Starts at zero, incremented each batch.

---

**Line 16: (blank line)**

---

**Line 17: `        for batch in tqdm(self.train_loader, desc="Training"):`**

**What**: Iterates through all training batches with progress bar.

**Why**:
- Processes entire dataset one batch at a time
- `tqdm` provides visual feedback and ETA
- Each iteration is one gradient update step

**How**:
- `self.train_loader` yields batches
- `tqdm()` wraps iterator with progress bar
- `desc="Training"` labels the progress bar
- `batch` is a dictionary: `{'input_ids': tensor, 'attention_mask': tensor, 'labels': tensor}`

**Example Progress**: `Training: 50%|██████ | 1/2 [00:03<00:03, 3.5s/it]`

---

**Line 18: `            input_ids = batch['input_ids'].to(self.device)`**

**What**: Extracts input_ids from batch and moves to GPU/CPU.

**Why**:
- Model and data must be on same device
- `.to(device)` transfers tensor to target device
- This is the encoded token IDs from tokenizer

**How**:
- `batch['input_ids']` accesses the key from DataCollator output
- `.to(self.device)` creates a copy on target device (if different)
- Shape: (batch_size, 128)

**Performance Note**: GPU transfer has overhead; batching amortizes this cost.

---

**Line 19: `            attention_mask = batch['attention_mask'].to(self.device)`**

**What**: Moves attention mask to device.

**Why**: Indicates which tokens are real (1) vs padding (0).

**How**: Same process as input_ids.

---

**Line 20: `            labels = batch['labels'].to(self.device)`**

**What**: Moves ground truth labels to device.

**Why**: Needed to compute loss against model predictions.

**How**: Shape (batch_size, 5) - binary labels for 5 emotions.

---

**Line 21: (blank line)**

---

**Line 22: `            self.optimizer.zero_grad()`**

**What**: Resets gradients to zero.

**Why**: **CRITICAL** - PyTorch accumulates gradients by default. Without this, gradients from previous batches would interfere with current batch.

**How**:
- Sets `grad` attribute of all model parameters to zero
- Must be called before each backward pass
- Prevents gradient accumulation across batches

**When Gradient Accumulation is Desired**: Call `zero_grad()` and `optimizer.step()` only once every N batches, so the gradients of N small batches add up and simulate one larger batch.
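Gradient accumulation can be sketched as follows. This is an illustration on a toy linear model, not part of the notebook's `Trainer`; `accumulation_steps`, the SGD optimizer, and the random batches are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# Eight toy mini-batches of 2 samples each (stand-ins for DataLoader batches)
batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]
accumulation_steps = 4               # hypothetical: 4 batches of 2 act like one batch of 8
num_updates = 0

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches, start=1):
    loss = criterion(model(inputs), targets)
    (loss / accumulation_steps).backward()   # scale so the summed gradients form an average
    if step % accumulation_steps == 0:
        optimizer.step()             # one parameter update per accumulation window
        optimizer.zero_grad()        # reset only after the update
        num_updates += 1

print(num_updates)
```

Dividing the loss by `accumulation_steps` keeps the gradient magnitude comparable to a single large batch, so the learning rate does not need to change.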

---

**Line 23: (blank line)**

---

**Line 24: `            logits = self.model(input_ids=input_ids, attention_mask=attention_mask)`**

**What**: Forward pass - gets model predictions.

**Why**: Computes output for this batch to compare against labels.

**How**:
- Calls `model.forward()` implicitly (Python's `__call__` mechanism)
- Passes tensors through transformer and classification head
- Returns logits of shape (batch_size, 5)
- **Gradient tracking is ON** - computation graph is built for backprop

---

**Line 25: `            loss = self.criterion(logits, labels)`**

**What**: Computes loss between predictions and ground truth.

**Why**:
- Quantifies how wrong the model is
- Provides gradient signal for learning
- Single scalar value for optimization

**How**:
- `self.criterion` is BCEWithLogitsLoss
- Applies sigmoid to logits internally
- Computes binary cross-entropy for each of 5 labels
- Averages across labels and batch
- Result: scalar tensor with `requires_grad=True`

---

**Line 26: (blank line)**

---

**Line 27: `            loss.backward()`**

**What**: Backpropagation - computes gradients.

**Why**: **CORE OF LEARNING** - Calculates how each parameter should change to reduce loss.

**How**:
- Traverses computation graph in reverse
- Applies chain rule to compute derivatives
- Stores gradients in `.grad` attribute of each parameter
- Uses automatic differentiation (autograd)

**What Happens Internally**:
1. Starts at loss (scalar)
2. Computes dLoss/dLogits
3. Propagates through linear layer: dLoss/dWeights
4. Continues through dropout, transformer layers
5. Updates ALL parameters' `.grad` attributes
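The chain rule that `backward()` applies can be checked by hand on a two-parameter example. This is a minimal sketch with made-up numbers, unrelated to the emotion model.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
y = torch.tensor(1.0)

pred = (w * x).sum()          # 1*3 + (-2)*4 = -5
loss = (pred - y) ** 2        # (-5 - 1)^2 = 36

loss.backward()               # autograd fills w.grad with dLoss/dw

# Chain rule by hand: dLoss/dw = 2 * (pred - y) * x
manual_grad = 2 * (pred.detach() - y) * x
print(w.grad, manual_grad)
```

Both tensors hold the same values, which confirms that `.grad` stores exactly the derivative the chain rule predicts.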

---

**Line 28: `            self.optimizer.step()`**

**What**: Updates model parameters using computed gradients.

**Why**: This is where learning actually happens - parameters are adjusted to reduce loss.

**How**:
- Optimizer (AdamW) applies update rule to each parameter
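- Simplified AdamW update: `param = param - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * param)`, where `m_hat` and `v_hat` are bias-corrected moving averages of the gradient and the squared gradient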
- Uses learning rate, momentum, adaptive learning rates
- Modifies parameters in-place

**The Training Step Sequence**:
1. `zero_grad()` - Clear old gradients
2. Forward pass - Compute predictions
3. Compute loss
4. `backward()` - Compute gradients
5. `step()` - Update parameters

---

**Line 29: (blank line)**

---

**Line 30: `            total_loss += loss.item()`**

**What**: Accumulates batch loss for epoch average.

**Why**: Track training progress across all batches.

**How**:
- `loss.item()` returns the loss as a plain Python float with no link to the computation graph
- Accumulating the float instead of the tensor lets each batch's graph be freed after the step
- Adds to running total

**Why .item() is Important**: Accumulating the tensor itself (`total_loss += loss`) would keep every batch's computation graph reachable, so memory would grow throughout the epoch.
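A short sketch of the difference between the loss tensor and the value `.item()` returns, using a toy scalar expression in place of the real loss:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w * 3.0 - 1.0) ** 2           # scalar tensor still attached to its graph

value = loss.item()                    # plain Python float, no graph attached

print(type(value).__name__, value)
print("tensor still tracks a graph:", loss.grad_fn is not None)
```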

---

**Line 31: (blank line)**

---

**Line 32: `        avg_loss = total_loss / len(self.train_loader)`**

**What**: Computes average loss across all batches.

**Why**: Single metric summarizing epoch performance.

**How**:
- `len(self.train_loader)` gives number of batches
- Divides total by batch count
- Result is mean batch loss

**With 8 samples, batch_size=4**: len(train_loader) = 2 batches
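The batch count can be confirmed with placeholder datasets of the same sizes. The tensors below are dummy stand-ins (assumed shapes), not the notebook's tokenized data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy datasets with the same sample counts as the walkthrough
train_ds = TensorDataset(torch.zeros(8, 3))
val_ds = TensorDataset(torch.zeros(2, 3))

train_loader = DataLoader(train_ds, batch_size=4)
val_loader = DataLoader(val_ds, batch_size=4)

# len() of a DataLoader is the number of batches: ceil(num_samples / batch_size)
print(len(train_loader), len(val_loader))
```

Because the last batch may be smaller than `batch_size`, the mean of batch losses is only an approximation of the per-sample mean when the dataset size is not a multiple of the batch size.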

---

**Line 33: `        return avg_loss`**

**What**: Returns epoch's average training loss.

**Why**: Allows caller to track and log training progress.

---

**Line 34: (blank line)**

---

**Line 35: `    def validate_epoch(self) -> tuple:`**

**What**: Defines method to evaluate on validation set.

**Why**:
- Measures generalization to unseen data
- Detects overfitting
- Does NOT update parameters

**How**: Returns tuple (avg_loss, accuracy).

---

**Line 36: `        self.model.eval()`**

**What**: Sets model to evaluation mode.

**Why**: **CRITICAL** - Disables training-specific behaviors:
- Deactivates dropout (all neurons active)
- Batch norm uses running statistics instead of batch statistics
- Ensures consistent predictions

**How**:
- `model.eval()` is inherited from nn.Module
- Sets `model.training = False`
- Recursively applies to all submodules

**Impact**: Without this, dropout would randomly affect predictions, making validation inconsistent.

---

**Lines 37-39: Initialize tracking variables**

**What**: Prepares containers for validation metrics.

**How**:
- `total_loss = 0`: Accumulates batch losses
- `all_preds = []`: Stores predictions from all batches
- `all_labels = []`: Stores ground truth from all batches

**Why**: Aggregate across batches to compute overall metrics.

---

**Line 40: (blank line)**

---

**Line 41: `        with torch.no_grad():`**

**What**: Context manager that disables gradient computation.

**Why**:
- **Performance**: Saves memory (no computation graph is stored)
- **Speed**: Faster forward passes
- **Safety**: No gradients are produced, so validation cannot feed an accidental parameter update

**How**:
- Disables gradient mode inside the block (`torch.is_grad_enabled()` returns `False`)
- Automatically restores previous state after block
- All operations inside do not track gradients

**Memory Savings**: Activations no longer need to be kept for a backward pass, which often reduces validation memory substantially; the exact saving depends on the model and batch size.
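The behaviour of the context manager can be observed directly. This sketch uses a small `nn.Linear` layer as a stand-in for the model.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
x = torch.randn(4, 3)

tracked = layer(x)                         # normal forward pass: graph is recorded

with torch.no_grad():
    inside_flag = torch.is_grad_enabled()  # gradient mode is off inside the block
    untracked = layer(x)                   # same numbers, but no graph

outside_flag = torch.is_grad_enabled()     # previous mode restored on exit

print(tracked.requires_grad, untracked.requires_grad)
print(inside_flag, outside_flag)
```

The outputs are numerically identical in both cases; only the bookkeeping needed for backpropagation is skipped.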

---

**Line 42: `            for batch in tqdm(self.val_loader, desc="Validation"):`**

**What**: Iterates through validation batches with progress bar.

**Why**: Process entire validation set, show progress.

---

**Lines 43-45: Extract and move batch tensors**

**What**: Gets input_ids, attention_mask, labels and moves to device.

**Why**: Same as training loop - prepare data for model.

---

**Line 46: (blank line)**

---

**Line 47: `                logits = self.model(input_ids=input_ids, attention_mask=attention_mask)`**

**What**: Forward pass for validation batch.

**Why**: Get predictions to evaluate.

**How**: Same as training, but NO gradient tracking (due to `torch.no_grad()`).

---

**Line 48: `                loss = self.criterion(logits, labels)`**

**What**: Computes validation loss.

**Why**: Measure how well model performs on unseen data.

**How**: Same loss function as training, but not used for backprop.

---

**Line 49: (blank line)**

---

**Line 50: `                total_loss += loss.item()`**

**What**: Accumulates validation loss.

**Why**: Track validation performance.

---

**Line 51: (blank line)**

---

**Line 52: `                preds = torch.sigmoid(logits) > 0.5`**

**What**: Converts logits to binary predictions.

**Why**:
- Logits are raw scores
- Sigmoid converts to probabilities (0 to 1)
- Threshold at 0.5 for binary decision

**How**:
- `torch.sigmoid(logits)` → probabilities
- `> 0.5` → boolean tensor (True/False)
- Multi-label: Each emotion independently thresholded
- Shape: (batch_size, 5) of booleans

**Example**:
```
Logits: [2.3, -1.1, 0.8, -0.3, 1.5]
Sigmoid: [0.91, 0.25, 0.69, 0.43, 0.82]
Preds: [True, False, True, False, True]
```
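The same example can be reproduced in a few lines. The logits are the ones from the example above; nothing here depends on the trained model.

```python
import torch

logits = torch.tensor([[2.3, -1.1, 0.8, -0.3, 1.5]])

probs = torch.sigmoid(logits)     # independent probability per emotion
preds = probs > 0.5               # boolean decision per emotion

rounded = [round(p, 2) for p in probs[0].tolist()]
print(rounded)
print(preds.tolist())

# sigmoid(x) > 0.5 exactly when x > 0, so thresholding the logits at 0 is equivalent
same_as_logit_sign = torch.equal(preds, logits > 0)
```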

---

**Line 53: `                all_preds.append(preds.cpu().numpy())`**

**What**: Moves predictions to CPU and converts to NumPy.

**Why**:
- Sklearn metrics require NumPy arrays
- `.cpu()` moves from GPU to CPU
- `.numpy()` converts torch tensor to NumPy array
- Append to list for later concatenation

**How**: Stores (batch_size, 5) NumPy array.

---

**Line 54: `                all_labels.append(labels.cpu().numpy())`**

**What**: Stores ground truth labels as NumPy.

**Why**: Match format of predictions for metric computation.

---

**Line 55: (blank line)**

---

**Line 56: `        avg_loss = total_loss / len(self.val_loader)`**

**What**: Computes average validation loss.

**Why**: Single metric for validation performance.

---

**Line 57: (blank line)**

---

**Line 58: `        all_preds = np.vstack(all_preds)`**

**What**: Stacks list of arrays into single 2D array.

**Why**: Sklearn metrics need single array, not list.

**How**:
- `np.vstack()` vertically stacks arrays
- Converts list of (batch_size, 5) arrays to (total_samples, 5)

**Example**:
```
Input: [array([[1,0,1,0,1], [0,1,0,1,0]]), array([[1,1,0,0,1]])]
Output: array([[1,0,1,0,1], [0,1,0,1,0], [1,1,0,0,1]])
Shape: (3, 5)
```
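A runnable version of the stacking example, including a final batch that is smaller than the others:

```python
import numpy as np

batch_1 = np.array([[1, 0, 1, 0, 1], [0, 1, 0, 1, 0]])   # batch of 2 samples
batch_2 = np.array([[1, 1, 0, 0, 1]])                    # final batch of 1 sample

stacked = np.vstack([batch_1, batch_2])   # rows are concatenated, columns must match
print(stacked.shape)
```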

---

**Line 59: `        all_labels = np.vstack(all_labels)`**

**What**: Stacks ground truth labels.

**Why**: Match predictions format.

---

**Line 60: `        accuracy = accuracy_score(all_labels, all_preds)`**

**What**: Computes multi-label accuracy.

**Why**: Measures exact match accuracy (all 5 labels must be correct).

**How**:
- Sklearn's `accuracy_score` compares row-by-row
- Returns fraction of samples where ALL labels match
- **Strict metric**: [1,0,1,0,1] vs [1,0,1,0,0] = incorrect

**Multi-Label Accuracy Calculation**:
```
Sample 1: [1,0,1,0,1] == [1,0,1,0,1] → Correct
Sample 2: [0,1,0,1,0] == [0,1,0,0,0] → Incorrect
Accuracy: 1/2 = 0.5
```
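The calculation above can be reproduced with scikit-learn. The per-label accuracy on the last line is not computed by the notebook; it is added here only to show how strict exact match is by comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 0, 1, 0, 1],
                   [0, 1, 0, 0, 0]])
y_pred = np.array([[1, 0, 1, 0, 1],
                   [0, 1, 0, 1, 0]])

exact_match = accuracy_score(y_true, y_pred)   # a row counts only if all 5 labels match
per_label = (y_true == y_pred).mean()          # fraction of individual labels that match

print(exact_match, per_label)
```

One wrong label out of ten drops exact-match accuracy to 0.5 while per-label accuracy stays at 0.9.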

---

**Line 61: (blank line)**

---

**Line 62: `        return avg_loss, accuracy`**

**What**: Returns validation metrics.

**Why**: Allows caller to track validation performance.

**How**: Returns tuple of (float, float).

---

**Line 63: (blank line)**

---

**Line 64: `    def train(self, num_epochs: int):`**

**What**: Main training loop orchestrating multiple epochs.

**Why**:
- Coordinates training and validation
- Tracks history
- Provides user feedback

**How**: Runs specified number of epochs, calling train_epoch and validate_epoch.

---

**Line 65: `        print("Starting training...")`**

**What**: User feedback.

---

**Line 66: (blank line)**

---

**Line 67: `        for epoch in range(num_epochs):`**

**What**: Loops through epochs.

**Why**: Train for multiple passes through dataset.

**How**: `range(num_epochs)` generates 0, 1, 2, ... num_epochs-1.

---

**Line 68: `            print(f"\nEpoch {epoch + 1}/{num_epochs}")`**

**What**: Displays current epoch (1-indexed for humans).

**Why**: Track progress.

**How**: `epoch + 1` converts 0-indexed to 1-indexed.

---

**Line 69: (blank line)**

---

**Line 70: `            train_loss = self.train_epoch()`**

**What**: Runs one training epoch.

**Why**: Update parameters on training data.

**How**: Calls train_epoch(), returns average loss.

---

**Line 71: `            val_loss, val_acc = self.validate_epoch()`**

**What**: Runs validation after training epoch.

**Why**: Check generalization performance.

**How**: Unpacks tuple return (avg_loss, accuracy).

---

**Line 72: (blank line)**

---

**Lines 73-75: Store metrics in history**

**What**: Appends epoch metrics to history dictionary.

**Why**: Enables plotting and analysis after training.

**How**: Each list grows by one element per epoch.

---

**Line 76: (blank line)**

---

**Line 77: `            print(f"Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")`**

**What**: Displays epoch metrics.

**Why**: Monitor training progress.

**How**:
- `:.4f` formats to 4 decimal places
- `|` separates metrics visually

**Example Output**: `Train Loss: 0.4523 | Val Loss: 0.5102 | Val Acc: 0.6667`

---

**Line 78: (blank line)**

---

**Line 79: `        print("\nTraining complete!")`**

**What**: Final status message.

---

### Teaching Points

**1. Training Loop Structure**:
```
for epoch in range(num_epochs):
    # Training phase
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        output = model(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    
    # Validation phase
    model.eval()
    with torch.no_grad():
        for batch in val_loader:
            output = model(input)
            loss = criterion(output, target)
```

**2. Critical Method Calls**:
- `model.train()` before training
- `model.eval()` before validation
- `optimizer.zero_grad()` before each backward
- `torch.no_grad()` during validation

**3. Why Separate train_epoch and validate_epoch**:
- Modularity and reusability
- Different behaviors (gradients on/off)
- Clearer logic

**4. Multi-label vs Multi-class**:
- **Multi-class**: Softmax, one label (cat OR dog)
- **Multi-label**: Sigmoid, multiple labels (angry AND sad)
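The difference between the two activations shows up when the same scores are passed through both. The logits below are made up for illustration.

```python
import torch

logits = torch.tensor([2.0, 1.5, -1.0])

softmax_probs = torch.softmax(logits, dim=0)   # classes compete: values sum to 1
sigmoid_probs = torch.sigmoid(logits)          # each label scored independently

print(softmax_probs.sum().item())
print((sigmoid_probs > 0.5).tolist())
```

With softmax only one class can dominate, whereas sigmoid lets the first two labels both pass the 0.5 threshold.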

**5. Overfitting Diagnosis**:
- Train loss decreasing, val loss increasing → Overfitting
- Both decreasing → Good
- Both high → Underfitting

**6. Memory Management**:
- `loss.item()` prevents graph retention
- `torch.no_grad()` disables gradient tracking
- `.cpu()` before accumulating prevents GPU memory leak

## Section 11: Training Configuration

### Code Block Analysis

In [None]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

NUM_EPOCHS = 3
trainer = Trainer(model, train_loader, val_loader, criterion, optimizer, device)

### Line-by-Line Breakdown

**Line 1: `criterion = nn.BCEWithLogitsLoss()`**

**What**: Creates the loss function for multi-label classification.

**Why**:
- **Binary Cross-Entropy** is the standard loss for binary/multi-label tasks
- **WithLogits** means it accepts raw logits (no sigmoid needed)
- Combines sigmoid and BCE for numerical stability
- Avoids evaluating `log(0)` when the sigmoid saturates to exactly 0 or 1

**How**:
- Instantiates PyTorch's BCEWithLogitsLoss
- Internally applies sigmoid then computes binary cross-entropy
- Averages loss across all labels and samples

**Mathematical Formula**:
For each label:
$$\text{BCE} = -[y \log(\sigma(x)) + (1-y) \log(1-\sigma(x))]$$

Where:
- $y$ is ground truth (0 or 1)
- $x$ is logit (raw score)
- $\sigma$ is sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

**Why Not Separate Sigmoid + BCE?**
- **Numerical stability**: Combining operations prevents overflow/underflow
- **log-sum-exp trick** for better precision
- **Faster computation**: Fused operation

**Multi-Label Behavior**:
- Computes BCE for each of 5 labels independently
- Averages across all 5 labels
- Averages across batch

**Example**:
```
Logits:     [2.3, -1.1, 0.8, -0.3, 1.5]
Labels:     [1,   0,    1,   0,    1]
Probs:      [0.91, 0.25, 0.69, 0.43, 0.82]
BCE each:   [0.10, 0.29, 0.37, 0.55, 0.20]
Mean:       0.30
```
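A per-label breakdown like the one above can be checked numerically. This sketch uses `reduction='none'` to expose the individual terms and compares them with the formula written out by hand.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.3, -1.1, 0.8, -0.3, 1.5]])
labels = torch.tensor([[1.0, 0.0, 1.0, 0.0, 1.0]])

# Per-label losses (no averaging) and the default mean-reduced loss
per_label = nn.BCEWithLogitsLoss(reduction='none')(logits, labels)
mean_loss = nn.BCEWithLogitsLoss()(logits, labels)

# The same quantity written out from the BCE formula
probs = torch.sigmoid(logits)
manual = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs))

print([round(v, 2) for v in per_label[0].tolist()])
print(round(mean_loss.item(), 2))
```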

---

**Line 2: `optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)`**

**What**: Creates AdamW optimizer with specific hyperparameters.

**Why**:
- **AdamW** is state-of-the-art for transformer fine-tuning
- Combines adaptive learning rates with proper weight decay
- Handles sparse gradients well
- More stable than vanilla Adam

**How**:
- `model.parameters()` provides all trainable parameters
- `lr=2e-5` sets learning rate to 0.00002
- `weight_decay=0.01` applies decoupled weight decay (related to, but not the same as, adding an L2 penalty to the loss)

**Parameter Breakdown**:

1. **model.parameters()**:
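   - Iterator over every parameter registered on the model, including any with `requires_grad=False` (the optimizer skips parameters that never receive a gradient)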
   - Includes weights and biases from all layers
   - ~110 million parameters for BERT-base

2. **lr=2e-5 (Learning Rate)**:
   - Controls step size during optimization
   - `2e-5 = 0.00002` is small because we are fine-tuning
   - Pretrained weights need gentle updates
   - Too high → unstable training, loss spikes
   - Too low → slow convergence

**Why 2e-5 for Transformers?**:
- Empirically found to work well for BERT fine-tuning
- One of the learning rates suggested in the original BERT paper (5e-5, 3e-5, 2e-5)
- Pretrained weights already encode useful language features
- Large updates would destroy learned representations

3. **weight_decay=0.01**:
   - Shrinks every weight slightly toward zero at each step (similar in spirit to an L2 penalty)
   - Prevents weights from growing too large
   - Helps generalization
   - AdamW decouples weight decay from gradient updates (better than Adam)

**AdamW vs Adam**:
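- **Adam with L2 penalty**: `grad = grad + wd * param` is applied before the moment estimates, so the decay term is rescaled by the adaptive denominator
- **AdamW**: the adaptive step uses the raw gradient, and `param -= lr * wd * param` is applied separately (decoupled)
- AdamW keeps the decay strength independent of gradient scaling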

**What AdamW Does Each Step**:
1. Computes gradient moving average (momentum)
2. Computes squared gradient moving average (adaptive LR)
3. Updates parameters using adaptive learning rate
4. Applies weight decay separately
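The four steps can be replayed by hand for a single update and compared with `torch.optim.AdamW`. This is a sketch for the first step only (t = 1) on a toy two-element parameter; the default betas and eps are passed explicitly so the manual formula uses the same values.

```python
import torch

lr, wd, beta1, beta2, eps = 2e-5, 0.01, 0.9, 0.999, 1e-8

param = torch.tensor([0.5, -1.0], requires_grad=True)
optimizer = torch.optim.AdamW([param], lr=lr, weight_decay=wd,
                              betas=(beta1, beta2), eps=eps)

loss = (param ** 2).sum()
loss.backward()
grad = param.grad.clone()              # gradient of sum(p^2) is 2 * p
start = param.detach().clone()

optimizer.step()                       # library update

# Manual replay of the same step
m = (1 - beta1) * grad                 # 1. moving average of the gradient
v = (1 - beta2) * grad ** 2            # 2. moving average of the squared gradient
m_hat = m / (1 - beta1 ** 1)           #    bias correction for step t = 1
v_hat = v / (1 - beta2 ** 1)
expected = start * (1 - lr * wd)       # 4. decoupled weight decay
expected = expected - lr * m_hat / (v_hat.sqrt() + eps)   # 3. adaptive update

print(param.detach(), expected)
```

On the first step the adaptive term reduces to roughly `sign(grad)`, so each parameter moves by about `lr` regardless of the gradient's magnitude.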

---

**Line 4: `NUM_EPOCHS = 3`**

**What**: Sets number of complete passes through training data.

**Why**:
- Fine-tuning transformers requires few epochs (3-5 typical)
- More epochs → risk of overfitting
- Pretrained models learn quickly on new tasks

**How**: Constant defining training duration.

**Why Only 3 Epochs?**:
- The pretrained encoder already provides most of the language understanding the task needs
- Task-specific head learns quickly
- More epochs often hurt validation performance
- Small datasets especially prone to overfitting

**Typical Epoch Counts**:
- **Training from scratch**: 50-200 epochs
- **Fine-tuning transformers**: 2-5 epochs
- **Small datasets**: Fewer epochs
- **Large datasets**: Can handle more epochs

---

**Line 5: `trainer = Trainer(model, train_loader, val_loader, criterion, optimizer, device)`**

**What**: Instantiates the Trainer class with all training components.

**Why**: Packages everything needed for training into single object.

**How**:
- Calls `Trainer.__init__()`
- Stores all arguments as instance variables
- Initializes empty history dictionary

**What Gets Stored**:
- `self.model`: EmotionClassifier instance
- `self.train_loader`: Training DataLoader (8 samples, batch_size=4 → 2 batches)
- `self.val_loader`: Validation DataLoader (2 samples, batch_size=4 → 1 batch)
- `self.criterion`: BCEWithLogitsLoss instance
- `self.optimizer`: AdamW instance
- `self.device`: 'cuda' or 'cpu'
- `self.history`: {'train_loss': [], 'val_loss': [], 'val_acc': []}

**Ready to Train**: Can now call `trainer.train(NUM_EPOCHS)`.

---

### Teaching Points

**1. Loss Function Selection Guide**:
- **Binary classification**: BCEWithLogitsLoss
- **Multi-class classification**: CrossEntropyLoss
- **Multi-label classification**: BCEWithLogitsLoss (our case)
- **Regression**: MSELoss, L1Loss

**2. Optimizer Selection Guide**:
- **Transformers**: AdamW (industry standard)
- **CNNs**: SGD with momentum, Adam
- **RNNs**: Adam, RMSprop
- **General**: Adam is good default

**3. Learning Rate Guidelines**:
- **Fine-tuning transformers**: 1e-5 to 5e-5
- **Training from scratch**: 1e-3 to 1e-4
- **Too high symptoms**: Loss explodes, NaN values
- **Too low symptoms**: Slow convergence, plateaus early

**4. Weight Decay Purpose**:
- Prevents overfitting
- Keeps weights small
- Improves generalization
- Typical values: 0.01 to 0.1

**5. Why BCEWithLogitsLoss vs BCE + Sigmoid?**:
```python
# Numerically unstable
probs = torch.sigmoid(logits)
loss = F.binary_cross_entropy(probs, labels)

# Stable (preferred)
loss = F.binary_cross_entropy_with_logits(logits, labels)
```
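The difference becomes visible with an extreme logit. This is a sketch with an artificially large score; in float32 the sigmoid saturates to exactly 1.0, so the two-step version loses the information about how wrong the prediction is, while the fused version does not.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([200.0])    # extreme but finite score (artificial example)
labels = torch.tensor([0.0])      # the true label is negative

probs = torch.sigmoid(logits)                              # saturates to 1.0 in float32
two_step = F.binary_cross_entropy(probs, labels)           # works from the saturated probability
fused = F.binary_cross_entropy_with_logits(logits, labels) # works from the raw logit

print(probs.item(), two_step.item(), fused.item())
```

The fused loss grows linearly with the logit, as the formula predicts, whereas the two-step value is capped because PyTorch bounds the logarithm inside `binary_cross_entropy` to keep it finite.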

**6. Epoch vs Iteration**:
- **Epoch**: One complete pass through entire dataset
- **Iteration/Step**: One batch processed
- **With 8 samples, batch_size=4**: 1 epoch = 2 iterations

## Section 12: Execute Training

### Code Block Analysis

In [None]:
trainer.train(NUM_EPOCHS)

### Line-by-Line Breakdown

**Line 1: `trainer.train(NUM_EPOCHS)`**

**What**: Executes the complete training process for 3 epochs.

**Why**: This single line triggers all training and validation logic.

**How**:
- Calls `Trainer.train()` method with `num_epochs=3`
- Runs 3 complete passes through training data
- Validates after each epoch
- Updates model parameters via backpropagation
- Prints progress and metrics

**What Happens When This Executes**:

1. **Prints**: "Starting training..."

2. **For Epoch 1**:
   - Prints: "Epoch 1/3"
   - **Training Phase** (train_epoch):
     - Sets model to training mode (`model.train()`)
     - Progress bar: "Training: 100%|██████| 2/2 [00:XX<00:00]"
     - Processes 2 batches (8 samples ÷ 4 batch_size = 2)
     - Each batch: forward pass → loss → backward → optimizer step
     - Returns average training loss
   - **Validation Phase** (validate_epoch):
     - Sets model to eval mode (`model.eval()`)
     - Progress bar: "Validation: 100%|██████| 1/1 [00:XX<00:00]"
     - Processes 1 batch (2 samples with batch_size=4)
     - No gradient computation (`torch.no_grad()`)
     - Computes predictions and metrics
     - Returns validation loss and accuracy
   - **Stores metrics** in history
   - Prints: "Train Loss: X.XXXX | Val Loss: X.XXXX | Val Acc: X.XXXX"

3. **Repeats for Epochs 2 and 3**

4. **Prints**: "Training complete!"

**Expected Output Example**:
```
Starting training...

Epoch 1/3
Training: 100%|██████████| 2/2 [00:03<00:00,  1.50s/it]
Validation: 100%|██████████| 1/1 [00:01<00:00,  1.20s/it]
Train Loss: 0.6234 | Val Loss: 0.5891 | Val Acc: 0.5000

Epoch 2/3
Training: 100%|██████████| 2/2 [00:02<00:00,  1.20s/it]
Validation: 100%|██████████| 1/1 [00:01<00:00,  1.10s/it]
Train Loss: 0.4512 | Val Loss: 0.4823 | Val Acc: 0.5000

Epoch 3/3
Training: 100%|██████████| 2/2 [00:02<00:00,  1.15s/it]
Validation: 100%|██████████| 1/1 [00:01<00:00,  1.05s/it]
Train Loss: 0.3245 | Val Loss: 0.4156 | Val Acc: 1.0000

Training complete!
```

**Interpreting Results**:

**Train Loss Decreasing**: Model is learning from training data.

**Val Loss Behavior**:
- Decreasing with train loss → Good generalization
- Increasing while train loss decreases → Overfitting
- Staying flat → Model not learning meaningful patterns

**Val Accuracy**:
- Remember this is exact-match accuracy (all 5 labels must match)
- 0.5000 = 1 out of 2 validation samples correct
- 1.0000 = 2 out of 2 validation samples correct
- More lenient metrics (such as per-label F1) are typically higher

**Training Time**:
- GPU: ~5-10 seconds per epoch
- CPU: ~30-60 seconds per epoch
- Small dataset so very fast

**After This Cell Completes**:
- Model parameters have been updated
- `trainer.history` contains metrics for plotting
- Model is ready for evaluation or inference

---

### Teaching Points

**1. What Actually Changes During Training**:
- **Before**: The pretrained encoder is paired with a randomly initialized classification head, so predictions are essentially random
- **After**: Model learns patterns correlating text with emotions
- **Mechanism**: ~110M parameters adjusted via gradient descent

**2. Progress Bar Information**:
```
Training: 100%|██████████| 2/2 [00:03<00:00, 1.50s/it]
          ^^^^           ^^^   ^^^^^^^^^^^^^^^^^^^^^^
          %done        cur/tot  [elapsed<remaining, speed]
```

**3. Training Dynamics**:
- **Early epochs**: Large loss reductions, fast learning
- **Later epochs**: Smaller improvements, fine-tuning
- **Plateau**: Loss stops improving → may need to stop

**4. Why Validation After Each Epoch**:
- Monitor generalization continuously
- Detect overfitting early
- Choose best checkpoint (not necessarily last epoch)
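Keeping the best checkpoint is not implemented in the notebook's `Trainer`; a minimal sketch of the idea, assuming a list of made-up validation losses and a small `nn.Linear` as a stand-in for the classifier, could look like this:

```python
import copy
import torch.nn as nn

model = nn.Linear(4, 2)                  # stand-in for the fine-tuned classifier
val_losses = [0.59, 0.42, 0.48]          # made-up per-epoch validation losses

best_val_loss = float('inf')
best_epoch = None
best_state = None

for epoch, val_loss in enumerate(val_losses, start=1):
    # ... train_epoch() and validate_epoch() would run here ...
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
        # deepcopy takes a snapshot; state_dict() alone returns live references
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)        # restore the weights from the best epoch
print(best_epoch, best_val_loss)
```

Here epoch 2 is kept even though training continued to epoch 3, because its validation loss was lower.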

**5. Memory Usage During Training**:
- **Model**: ~450MB (BERT parameters)
- **Optimizer state**: ~900MB (AdamW keeps two moment estimates per parameter)
- **Activations**: ~200MB (forward pass intermediate values)
- **Gradients**: ~450MB (same size as model)
- **Total**: ~2GB GPU memory for this small example
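The weight, gradient, and optimizer figures follow from simple arithmetic. This sketch assumes roughly 110 million parameters stored as 32-bit floats; activation memory depends on batch size and sequence length and is not estimated here.

```python
num_params = 110_000_000      # approximate BERT-base parameter count
bytes_per_value = 4           # float32

weights_mb = num_params * bytes_per_value / 1e6
gradients_mb = weights_mb             # one gradient value per parameter
adamw_state_mb = 2 * weights_mb       # two moment estimates per parameter

print(f"weights:   {weights_mb:.0f} MB")
print(f"gradients: {gradients_mb:.0f} MB")
print(f"AdamW:     {adamw_state_mb:.0f} MB")
```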

**6. What If Training Fails**:
- **Loss → NaN**: Learning rate too high, reduce by 10x
- **Loss not decreasing**: Learning rate may be too low; try increasing it
- **CUDA out of memory**: Reduce batch_size
- **Very slow**: Check if GPU is being used (`device`)

## Section 13: Evaluator Class

### Code Block Analysis

In [None]:
class Evaluator:

    def __init__(self, model: nn.Module, val_loader: DataLoader, device: torch.device):
        self.model = model
        self.val_loader = val_loader
        self.device = device

    def evaluate(self) -> dict:
        self.model.eval()
        all_preds = []
        all_labels = []

        with torch.no_grad():
            for batch in tqdm(self.val_loader, desc="Evaluating"):
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)

                logits = self.model(input_ids=input_ids, attention_mask=attention_mask)
                preds = torch.sigmoid(logits) > 0.5

                all_preds.append(preds.cpu().numpy())
                all_labels.append(labels.cpu().numpy())

        all_preds = np.vstack(all_preds)
        all_labels = np.vstack(all_labels)

        accuracy = accuracy_score(all_labels, all_preds)
        precision_macro = precision_score(all_labels, all_preds, average='macro', zero_division=0)
        recall_macro = recall_score(all_labels, all_preds, average='macro', zero_division=0)
        f1_macro = f1_score(all_labels, all_preds, average='macro', zero_division=0)

        precision_micro = precision_score(all_labels, all_preds, average='micro', zero_division=0)
        recall_micro = recall_score(all_labels, all_preds, average='micro', zero_division=0)
        f1_micro = f1_score(all_labels, all_preds, average='micro', zero_division=0)

        return {
            'accuracy': accuracy,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'f1_macro': f1_macro,
            'precision_micro': precision_micro,
            'recall_micro': recall_micro,
            'f1_micro': f1_micro,
            'predictions': all_preds,
            'labels': all_labels
        }

evaluator = Evaluator(model, val_loader, device)
results = evaluator.evaluate()

print("=" * 50)
print("EVALUATION RESULTS")
print("=" * 50)
print(f"Exact Match Accuracy: {results['accuracy']:.4f}")
print(f"\nMacro Metrics (average across labels):")
print(f"  Precision: {results['precision_macro']:.4f}")
print(f"  Recall:    {results['recall_macro']:.4f}")
print(f"  F1 Score:  {results['f1_macro']:.4f}")
print(f"\nMicro Metrics (aggregate all labels):")
print(f"  Precision: {results['precision_micro']:.4f}")
print(f"  Recall:    {results['recall_micro']:.4f}")
print(f"  F1 Score:  {results['f1_micro']:.4f}")
print("=" * 50)

### Line-by-Line Breakdown

This section evaluates the trained model comprehensively. The Evaluator class is simpler than Trainer since it only runs inference without updating parameters.

**Lines 1-6: Class initialization**

Similar to Trainer but simpler - only needs model, val_loader, and device. No optimizer or criterion needed since we are not training.

**Lines 8-23: evaluate method - Prediction gathering**

Nearly identical to `Trainer.validate_epoch()` but without loss computation. The key is gathering all predictions and labels for comprehensive metric calculation.

**Line 13**: `torch.no_grad()` disables gradient tracking for inference efficiency.

**Line 20**: `torch.sigmoid(logits) > 0.5` converts raw scores to binary predictions.

**Lines 25-26: Stack arrays**

`np.vstack()` combines batch-wise predictions into single arrays for metric computation.

**Line 28: `accuracy = accuracy_score(all_labels, all_preds)`**

**What**: Exact match accuracy - fraction of samples where ALL labels are correct.

**Why**: Strictest metric for multi-label classification.

**How**: Only counts as correct if all 5 predictions match all 5 ground truths.

**Example**:
```
Sample 1: Pred [1,0,1,0,1] vs True [1,0,1,0,1] → Correct
Sample 2: Pred [0,1,0,1,0] vs True [0,1,0,0,0] → Incorrect (4th label differs)
Accuracy: 1/2 = 0.5
```

**Line 29: `precision_score(all_labels, all_preds, average='macro', zero_division=0)`**

**What**: Macro-averaged precision across all labels.

**Why**:
- **Precision** = True Positives / (True Positives + False Positives)
- Answers: "Of all positive predictions, how many were correct?"
- **Macro** averages metrics per label, then averages across labels
- Treats all labels equally regardless of frequency

**How**:
1. Compute precision for each of 5 labels separately
2. Average the 5 precision scores

**Example**:
```
Label 0 (anger):    Precision = 0.8
Label 1 (fear):     Precision = 0.6
Label 2 (joy):      Precision = 1.0
Label 3 (sadness):  Precision = 0.7
Label 4 (surprise): Precision = 0.9
Macro Precision = (0.8 + 0.6 + 1.0 + 0.7 + 0.9) / 5 = 0.8
```

**zero_division=0**: If a label has zero predictions, precision defaults to 0 instead of undefined.
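To make the macro procedure and the `zero_division` fallback concrete, here is a small sketch in plain Python. The helper names and the toy data are made up for illustration; it mirrors what macro-averaged precision computes but is not scikit-learn's implementation.

```python
def precision_per_label(y_true, y_pred, zero_division=0.0):
    """Precision for each label column; undefined cases fall back to zero_division."""
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 0)
        scores.append(tp / (tp + fp) if (tp + fp) > 0 else zero_division)
    return scores

def macro_average(scores):
    """Unweighted mean: every label counts equally."""
    return sum(scores) / len(scores)

# Toy data (3 samples, 3 labels); label 2 is never predicted positive.
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0]]
y_pred = [[1, 1, 0], [1, 1, 0], [1, 0, 0]]

per_label = precision_per_label(y_true, y_pred)
macro_precision = macro_average(per_label)
print([round(s, 4) for s in per_label], round(macro_precision, 4))
```

Label 2 has no positive predictions, so its precision is undefined and falls back to 0.0, which pulls the macro average down.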

**Line 29: `recall_score(..., average='macro')`**

**What**: Macro-averaged recall across all labels.

**Why**:
- **Recall** = True Positives / (True Positives + False Negatives)
- Answers: "Of all actual positives, how many did we find?"
- **Macro** treats each label equally

**Example**:
```
Label 0: 3 actual positives, predicted 2 correctly → Recall = 2/3 = 0.67
Label 1: 2 actual positives, predicted 2 correctly → Recall = 2/2 = 1.0
Macro Recall = (0.67 + 1.0 + ...) / 5
```

**Line 30: `f1_score(..., average='macro')`**

**What**: Macro-averaged F1 score.

**Why**:
- **F1** = Harmonic mean of precision and recall
- Balances precision and recall
- Better than accuracy for imbalanced datasets

**How**:
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

**Why Harmonic Mean?**: Punishes extreme values. If precision=1.0 but recall=0.1, F1=0.18 (not 0.55 like arithmetic mean).
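The numbers in that comparison can be reproduced with a few lines of Python. This is a sketch of the formula only; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 1.0, 0.1
f1 = f1_from(precision, recall)
arithmetic = (precision + recall) / 2

print(f"F1 = {f1:.2f}, arithmetic mean = {arithmetic:.2f}")
# F1 = 0.18, arithmetic mean = 0.55
```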

**Lines 32-34: Micro-averaged metrics**

**What**: Micro-averaging computes metrics globally by counting total TP, FP, FN.

**Why**: Gives more weight to labels with more samples.

**How**:
1. Sum all true positives across all labels
2. Sum all false positives across all labels
3. Compute single precision/recall/F1 from totals

**Macro vs Micro**:

**Macro (Lines 28-30)**:
- Compute metric for each label
- Average the metrics
- Each label weighted equally

**Micro (Lines 32-34)**:
- Pool all predictions together
- Compute metric on the pool
- Labels with more instances have more influence

**Example**:

```
Label 0: 100 positive predictions, 90 correct → Precision=0.9
Label 1: 10 positive predictions,   5 correct → Precision=0.5

Macro Precision = (0.9 + 0.5) / 2 = 0.7
Micro Precision = (90 + 5) / (100 + 10) = 0.86
```

Micro gives more weight to Label 0 because it accounts for most of the pooled predictions.
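The same comparison can be sketched in Python, reading each label's count as its number of positive predictions (so Label 0 has 90 true positives and 10 false positives). The dictionary layout is an assumption made for illustration.

```python
# (true positives, false positives) per label
counts = {
    "label_0": (90, 10),  # 100 positive predictions, precision 0.9
    "label_1": (5, 5),    # 10 positive predictions, precision 0.5
}

# Macro: compute precision per label, then average the scores.
per_label = [tp / (tp + fp) for tp, fp in counts.values()]
macro_precision = sum(per_label) / len(per_label)

# Micro: pool the counts first, then compute one precision.
total_tp = sum(tp for tp, _ in counts.values())
total_fp = sum(fp for _, fp in counts.values())
micro_precision = total_tp / (total_tp + total_fp)

print(f"macro={macro_precision:.2f} micro={micro_precision:.2f}")
# macro=0.70 micro=0.86
```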

**Lines 36-44: Return dictionary**

**What**: Returns comprehensive results dictionary.

**Why**: Provides multiple evaluation perspectives and raw data.

**How**: Dictionary with 9 keys covering all metrics plus raw predictions/labels.

**Line 46: `evaluator = Evaluator(model, val_loader, device)`**

**What**: Creates evaluator instance.

**Why**: Encapsulates evaluation logic.

**Line 47: `results = evaluator.evaluate()`**

**What**: Runs complete evaluation and stores results.

**Why**: Generates all metrics in one call.

**Lines 49-62: Print results**

**What**: Displays formatted evaluation metrics.

**Why**: Human-readable summary of model performance.

**How**:
- `"=" * 50` creates separator line
- `.4f` formats to 4 decimal places
- Groups metrics logically (exact match, macro, micro)

**Example Output** (illustrative values; actual numbers depend on the trained model):
```
==================================================
EVALUATION RESULTS
==================================================
Exact Match Accuracy: 0.5000

Macro Metrics (average across labels):
  Precision: 0.7000
  Recall:    0.6500
  F1 Score:  0.6700

Micro Metrics (aggregate all labels):
  Precision: 0.7500
  Recall:    0.6800
  F1 Score:  0.7100
==================================================
```

---

### Teaching Points

**1. Precision vs Recall Trade-off**:
- **High Precision, Low Recall**: Conservative predictions (few false positives, many misses)
- **High Recall, Low Precision**: Aggressive predictions (few misses, many false alarms)
- **F1 Score**: Balances both

**2. When to Use Each Metric**:
- **Accuracy**: Balanced classes, all labels equally important
- **Precision**: Cost of false positives is high (spam detection)
- **Recall**: Cost of false negatives is high (disease diagnosis)
- **F1**: Balance precision and recall

**3. Macro vs Micro Guide**:
- **Use Macro** when:
  - All labels should be treated equally
  - Care about rare classes
  - Imbalanced dataset
- **Use Micro** when:
  - Overall performance matters most
  - Larger classes are more important
  - Want single aggregate metric

**4. Multi-label Specifics**:
- Each sample can have 0, 1, or multiple labels
- Metrics computed per-label then aggregated
- More complex than multi-class (only one label)

**5. zero_division Parameter**:
- Handles edge case when no predictions for a label
- Options: `0`, `1`, or `"warn"` (the default, which returns 0 and emits a warning)
- Common in early training or rare classes

**6. Why Separate Evaluator Class**:
- Reusability for different datasets
- Cleaner than inline evaluation
- Easy to extend with more metrics

## Section 14: Per-Label Performance Analysis

### Code Block Analysis

```python
print("\nPer-Label Performance:")
print("=" * 70)
emotion_names = ['anger', 'fear', 'joy', 'sadness', 'surprise']

for idx, emotion in enumerate(emotion_names):
    precision = precision_score(results['labels'][:, idx], results['predictions'][:, idx], zero_division=0)
    recall = recall_score(results['labels'][:, idx], results['predictions'][:, idx], zero_division=0)
    f1 = f1_score(results['labels'][:, idx], results['predictions'][:, idx], zero_division=0)

    print(f"{emotion.capitalize():10s} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

print("=" * 70)
```

### Line-by-Line Breakdown

**Line 1: `print("\nPer-Label Performance:")`**

**What**: Prints section header.

**Why**: Organizes output for readability.

**How**: `\n` adds newline for spacing.

---

**Line 2: `print("=" * 70)`**

**What**: Prints separator line of 70 equal signs.

**Why**: Visual separation and formatting.

**How**: String multiplication creates "===...===" with 70 characters.

---

**Line 3: `emotion_names = ['anger', 'fear', 'joy', 'sadness', 'surprise']`**

**What**: Defines list of emotion labels matching column order.

**Why**:
- Maps column indices (0-4) to human-readable names
- Must match order used during dataset creation
- Enables meaningful output instead of "Label 0, Label 1, etc."

**How**: Simple Python list with 5 strings.

**CRITICAL**: Order must match the original label columns in the dataset.

---

**Line 5: `for idx, emotion in enumerate(emotion_names):`**

**What**: Iterates through emotions with their indices.

**Why**: Need index to extract correct column from predictions/labels array.

**How**:
- `enumerate()` returns tuples (index, value)
- `idx`: 0, 1, 2, 3, 4
- `emotion`: 'anger', 'fear', 'joy', 'sadness', 'surprise'

**Example iterations**:
```
idx=0, emotion='anger'
idx=1, emotion='fear'
...
```

---

**Line 6: `precision = precision_score(results['labels'][:, idx], results['predictions'][:, idx], zero_division=0)`**

**What**: Computes precision for a single emotion label.

**Why**: Per-label metrics reveal which emotions the model handles well vs poorly.

**How**:
- `results['labels'][:, idx]` extracts column `idx` from labels array
- `:` selects all rows (all samples)
- `, idx` selects specific column (specific emotion)
- Shape: (num_samples,) - 1D array for one emotion

**Array Slicing Example**:
```
results['labels'] shape: (2, 5)
results['labels'][:, 0] extracts column 0 (anger):
  [1, 0] (2 samples' anger labels)
```

**Precision Interpretation**:
- 1.0: All predicted positives were correct
- 0.5: Half of predicted positives were wrong
- 0.0: No correct positive predictions

---

**Line 7: `recall = recall_score(results['labels'][:, idx], results['predictions'][:, idx], zero_division=0)`**

**What**: Computes recall for single emotion.

**Why**: Measures what fraction of actual positives were found.

**How**: Same slicing as precision.

**Recall Interpretation**:
- 1.0: Found all actual positives
- 0.5: Missed half of actual positives
- 0.0: Missed all actual positives

---

**Line 8: `f1 = f1_score(results['labels'][:, idx], results['predictions'][:, idx], zero_division=0)`**

**What**: Computes F1 score for single emotion.

**Why**: Harmonic mean balancing precision and recall.

---

**Line 10: Print formatted results**

**What**: Displays metrics for current emotion.

**Why**: Human-readable per-label performance.

**How**:
- `{emotion.capitalize():10s}`: Capitalizes emotion name, pads to 10 characters
- `{precision:.4f}`: Formats precision to 4 decimals
- `|`: Visual separator

**String Formatting Breakdown**:
- `:10s` means "string padded to 10 characters"
- `:.4f` means "float with 4 decimal places"
- Example result: `Anger      | Precision: 0.8500 | Recall: 0.7500 | F1: 0.7969`

**Example Output**:
```
Per-Label Performance:
======================================================================
Anger      | Precision: 1.0000 | Recall: 1.0000 | F1: 1.0000
Fear       | Precision: 0.5000 | Recall: 1.0000 | F1: 0.6667
Joy        | Precision: 1.0000 | Recall: 1.0000 | F1: 1.0000
Sadness    | Precision: 0.0000 | Recall: 0.0000 | F1: 0.0000
Surprise   | Precision: 1.0000 | Recall: 1.0000 | F1: 1.0000
======================================================================
```
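The format specifiers can be tried in isolation. A quick sketch with made-up metric values:

```python
emotion, precision, recall, f1 = "anger", 0.85, 0.75, 0.7969

# :10s left-aligns the name and pads it to 10 characters; :.4f keeps 4 decimals.
line = f"{emotion.capitalize():10s} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}"
print(line)
# Anger      | Precision: 0.8500 | Recall: 0.7500 | F1: 0.7969
```

Because every name is padded to the same width, the `|` separators line up across rows regardless of the emotion's length.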

---

### Teaching Points

**1. Why Per-Label Analysis Matters**:
- **Aggregated metrics** hide label-specific issues
- One label might be easy (Joy: F1=1.0)
- Another might be hard (Sadness: F1=0.0)
- Reveals which emotions need more data or features

**2. Interpreting Per-Label Results**:

**High Precision, Low Recall**:
- Model is conservative for this label
- Rarely predicts it, but accurate when it does
- Fix: Lower threshold or add more positive training examples

**Low Precision, High Recall**:
- Model is aggressive for this label
- Predicts it often, but many false alarms
- Fix: Raise threshold or improve feature learning

**Both Low**:
- Model doesn't understand this label
- Needs more training data or better features
- May be ambiguous label

**3. Array Slicing Reminder**:
```python
array = [[1,0,1,0,1],   # Sample 0
         [0,1,0,1,0]]   # Sample 1

array[:, 0]  # All rows, column 0: [1, 0] (anger for both samples)
array[0, :]  # Row 0, all columns: [1,0,1,0,1] (all labels for sample 0)
array[1, 2]  # Row 1, column 2: 0 (joy for sample 1)
```

**4. Column Order Importance**:
If emotion_names doesn't match dataset creation order, results will be mislabeled:
```python
# Dataset creation: ['anger', 'fear', 'joy', 'sadness', 'surprise']
# Wrong order: ['joy', 'anger', 'fear', 'surprise', 'sadness']
# Result: Metrics attributed to wrong emotions!
```
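One way to guard against this is to keep a single list of label columns and check the reporting order against it. The sketch below assumes a hypothetical constant `LABEL_COLUMNS` and helper `check_label_order`; neither exists in the notebook.

```python
# Hypothetical single source of truth for the label order used when building the dataset.
LABEL_COLUMNS = ['anger', 'fear', 'joy', 'sadness', 'surprise']

def check_label_order(emotion_names, label_columns=LABEL_COLUMNS):
    """Raise if the reporting order differs from the dataset's column order."""
    if list(emotion_names) != list(label_columns):
        raise ValueError(
            f"emotion_names {list(emotion_names)} does not match "
            f"label columns {list(label_columns)}"
        )

check_label_order(['anger', 'fear', 'joy', 'sadness', 'surprise'])  # passes silently

try:
    check_label_order(['joy', 'anger', 'fear', 'surprise', 'sadness'])
    order_error = None
except ValueError as exc:
    order_error = str(exc)  # the mismatch is caught before any metric is mislabeled

print(order_error)
```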

**5. zero_division Parameter**:
- If no samples have ground truth=1 for a label, recall is undefined
- If no samples have prediction=1 for a label, precision is undefined
- `zero_division=0` treats undefined as 0.0
- Alternative: `zero_division=1` treats as 1.0

## Section 15: Inference on New Text

### Code Block Analysis

```python
def predict_emotions(text: str, model: nn.Module, tokenizer, device: torch.device,
                     threshold: float = 0.5) -> dict:
    model.eval()

    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=MAX_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
        probs = torch.sigmoid(logits).cpu().numpy()[0]

    predictions = {
        'anger': probs[0],
        'fear': probs[1],
        'joy': probs[2],
        'sadness': probs[3],
        'surprise': probs[4]
    }

    detected_emotions = [emotion for emotion, prob in predictions.items() if prob > threshold]

    return {
        'text': text,
        'probabilities': predictions,
        'detected_emotions': detected_emotions
    }

test_texts = [
    "I am so happy today!",
    "This is terrifying and awful.",
    "I can't believe this happened!"
]

print("\n" + "=" * 70)
print("INFERENCE EXAMPLES")
print("=" * 70)

for text in test_texts:
    result = predict_emotions(text, model, tokenizer, device)

    print(f"\nText: {result['text']}")
    print("Probabilities:")
    for emotion, prob in result['probabilities'].items():
        print(f"  {emotion.capitalize():10s}: {prob:.4f}")
    print(f"Detected emotions: {', '.join(result['detected_emotions']) if result['detected_emotions'] else 'None'}")
    print("-" * 70)
```

### Line-by-Line Breakdown

**Lines 1-2: Function definition**

**What**: Defines function for predicting emotions on arbitrary text.

**Why**:
- Enables real-world usage of trained model
- Demonstrates complete inference pipeline
- Encapsulates prediction logic for reusability

**How**:
- `text`: Input string to classify
- `model`: Trained neural network
- `tokenizer`: BERT tokenizer for encoding
- `device`: GPU or CPU
- `threshold`: Probability cutoff for positive prediction (default 0.5)
- Returns dictionary with text, probabilities, and detected emotions

---

**Line 3: `model.eval()`**

**What**: Sets model to evaluation mode.

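**Why**: **CRITICAL** - Disables dropout so predictions are deterministic.

**How**: Switches dropout layers (both BERT's internal ones and the one in the classification head) to inference behavior. Without this call, dropout stays active and the same input can produce different outputs on each call.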

---

**Lines 5-12: Tokenization**

**What**: Encodes text into token IDs.

**Why**: Neural network cannot process raw text, needs numeric representation.

**How**: Same tokenization as training data:
- Special tokens ([CLS], [SEP])
- Padding to MAX_LENGTH (128)
- Truncation if too long
- Returns PyTorch tensors

**Example**:
```
Input: "I am so happy today!"
Tokens: [CLS] i am so happy today ! [SEP] [PAD] [PAD] ...
IDs: [101, 1045, 2572, 2061, 3407, 2651, 999, 102, 0, 0, ...]
```

---

**Lines 14-15: Move to device**

**What**: Transfers input tensors to GPU/CPU.

**Why**: Model and inputs must be on same device.

**How**: `.to(device)` handles transfer.

---

**Line 17: `with torch.no_grad():`**

**What**: Disables gradient computation.

**Why**: Inference doesn't need gradients, saves memory and time.

---

**Line 18: `logits = model(input_ids=input_ids, attention_mask=attention_mask)`**

**What**: Forward pass through model.

**Why**: Get raw predictions.

**How**:
- Processes text through BERT
- Returns logits of shape (1, 5) - one sample, 5 emotions
- No gradient tracking

---

**Line 19: `probs = torch.sigmoid(logits).cpu().numpy()[0]`**

**What**: Converts logits to probabilities and extracts values.

**Why**: Probabilities (0-1) are interpretable, logits are not.

**How**:
- `torch.sigmoid(logits)`: Applies sigmoid activation
  - Maps logits to (0, 1) range
  - Each value is independent probability
- `.cpu()`: Moves to CPU (required for NumPy)
- `.numpy()`: Converts to NumPy array
- `[0]`: Extracts first (only) sample
- Result: 1D array of 5 probabilities

**Example**:
```
Logits:  [2.3, -1.1, 0.8, -0.3, 1.5]
Sigmoid: [0.91, 0.25, 0.69, 0.43, 0.82]
```
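The sigmoid values in this example can be reproduced with the standard library alone. A minimal sketch, assuming nothing beyond the logits shown:

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.3, -1.1, 0.8, -0.3, 1.5]
probs = [round(sigmoid(z), 2) for z in logits]
print(probs)  # [0.91, 0.25, 0.69, 0.43, 0.82]
```

Each probability depends only on its own logit, which is what lets several emotions exceed the threshold at once.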

---

**Lines 21-27: Create probabilities dictionary**

**What**: Maps emotion names to their probabilities.

**Why**: Human-readable output instead of unnamed array.

**How**:
- `probs[0]` is probability for anger
- `probs[1]` is probability for fear
- etc.

**Result**:
```python
{
    'anger': 0.9089,
    'fear': 0.2497,
    'joy': 0.6900,
    'sadness': 0.4256,
    'surprise': 0.8176
}
```

---

**Line 29: `detected_emotions = [emotion for emotion, prob in predictions.items() if prob > threshold]`**

**What**: Filters emotions above threshold into list.

**Why**: Binary decision - which emotions are present?

**How**:
- List comprehension iterates through predictions
- Keeps only emotions with probability > `threshold` (0.5 by default)
- Returns list of emotion names

**Example**:
```
Probabilities: {'anger': 0.91, 'fear': 0.25, 'joy': 0.69, 'sadness': 0.43, 'surprise': 0.82}
Threshold: 0.5
Detected: ['anger', 'joy', 'surprise']
```

**Threshold Impact**:
- **0.5**: Balanced (default)
- **0.7**: Conservative (fewer false positives)
- **0.3**: Aggressive (fewer false negatives)
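The effect of the threshold is easy to see by reusing the probabilities from the example with three cutoffs. The helper name `detect` is made up for this sketch:

```python
probabilities = {'anger': 0.91, 'fear': 0.25, 'joy': 0.69, 'sadness': 0.43, 'surprise': 0.82}

def detect(probabilities, threshold):
    """Return the emotions whose probability exceeds the threshold."""
    return [emotion for emotion, prob in probabilities.items() if prob > threshold]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, detect(probabilities, threshold))
# 0.3 ['anger', 'joy', 'sadness', 'surprise']
# 0.5 ['anger', 'joy', 'surprise']
# 0.7 ['anger', 'surprise']
```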

---

**Lines 31-35: Return dictionary**

**What**: Returns comprehensive prediction results.

**Why**: Provides both probabilities (soft) and binary decisions (hard).

**How**: Dictionary with three keys:
- `text`: Original input
- `probabilities`: All 5 probabilities
- `detected_emotions`: List of predicted emotions

---

**Lines 37-41: Test texts**

**What**: Defines sample inputs for demonstration.

**Why**: Show model predictions on diverse examples.

**How**: List of three strings covering different emotions.

---

**Lines 43-45: Print header**

**What**: Formats output section.

---

**Line 47: `for text in test_texts:`**

**What**: Iterates through test examples.

---

**Line 48: `result = predict_emotions(text, model, tokenizer, device)`**

**What**: Gets predictions for current text.

**Why**: Demonstrates function usage.

**How**: Calls predict_emotions, stores result dictionary.

---

**Lines 50-55: Print results**

**What**: Displays formatted predictions.

**Why**: Human-readable output showing probabilities and decisions.

**How**:
- Line 50: Shows input text
- Lines 51-53: Prints a "Probabilities:" header, then loops through probabilities and formats each to 4 decimals
- Line 54: Shows detected emotions or "None"
- Line 55: Separator line

**Example Output**:
```
======================================================================
INFERENCE EXAMPLES
======================================================================

Text: I am so happy today!
Probabilities:
  Anger     : 0.0234
  Fear      : 0.0189
  Joy       : 0.9876
  Sadness   : 0.0123
  Surprise  : 0.3456
Detected emotions: joy
----------------------------------------------------------------------

Text: This is terrifying and awful.
Probabilities:
  Anger     : 0.4523
  Fear      : 0.8901
  Joy       : 0.0234
  Sadness   : 0.6789
  Surprise  : 0.1234
Detected emotions: fear, sadness
----------------------------------------------------------------------

Text: I can't believe this happened!
Probabilities:
  Anger     : 0.2345
  Fear      : 0.3456
  Joy       : 0.1234
  Sadness   : 0.2345
  Surprise  : 0.8901
Detected emotions: surprise
----------------------------------------------------------------------
```

---

### Teaching Points

**1. Inference Pipeline**:
```
Raw Text
  ↓ (Tokenization)
Token IDs + Attention Mask
  ↓ (Move to device)
Tensors on GPU/CPU
  ↓ (Model forward pass)
Logits
  ↓ (Sigmoid)
Probabilities
  ↓ (Thresholding)
Binary Predictions
```

**2. Threshold Selection**:
- **Domain-dependent**: Medical diagnosis (low threshold), spam filter (high threshold)
- **Trade-off**: Precision vs Recall
- **Tuning**: Use validation set to find optimal threshold per label
- **Multiple thresholds**: Can use different thresholds for each emotion
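Per-label tuning can be sketched as a small grid search that keeps, for each label, the threshold with the best validation F1. The helper names, candidate grid, and validation data below are made up for illustration; the loop would be repeated once per emotion column.

```python
def f1_at_threshold(y_true, probs, threshold):
    """Binary F1 for one label when probabilities above `threshold` count as positive."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(y_true, probs, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))

# Made-up validation data for a single label.
val_labels = [1, 1, 1, 0, 0, 0]
val_probs = [0.45, 0.62, 0.80, 0.35, 0.20, 0.55]

chosen = best_threshold(val_labels, val_probs)
print(chosen, round(f1_at_threshold(val_labels, val_probs, chosen), 3))
```

With very small validation sets, such as the 2 samples in this notebook, a tuned threshold will not generalize; the sketch is only meaningful with a reasonably sized held-out set.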

**3. Batch vs Single Inference**:
This function processes one text at a time. For many texts:
```python
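# Inefficient (one forward pass per text)
for text in texts:
    predict_emotions(text, model, tokenizer, device)

# Efficient (one forward pass per batch)
encodings = tokenizer(texts, padding=True, truncation=True,
                      max_length=MAX_LENGTH, return_tensors='pt')
with torch.no_grad():
    logits = model(input_ids=encodings['input_ids'].to(device),
                   attention_mask=encodings['attention_mask'].to(device))
probs = torch.sigmoid(logits).cpu().numpy()  # shape: (len(texts), 5)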
```

**4. model.eval() Importance**:
```python
# With dropout active (WRONG for inference)
model.train()
pred1 = model(input)  # Random due to dropout
pred2 = model(input)  # Different result!

# With dropout disabled (CORRECT)
model.eval()
pred1 = model(input)  # Deterministic
pred2 = model(input)  # Same result
```

**5. Multi-label Interpretation**:
- **Single label**: "This text is joy" (mutually exclusive)
- **Multi-label**: "This text has joy AND surprise" (not exclusive)
- Our model supports multiple emotions per text

**6. Probability Calibration**:
- Probabilities may not be well-calibrated
- 0.8 doesn't necessarily mean "80% confident"
- For calibrated probabilities, use temperature scaling or Platt scaling
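As a rough illustration of temperature scaling in the sigmoid setting, the sketch below divides logits by a temperature `T` and picks `T` from a small grid by validation loss. The data, the candidate grid, and the helper names are all assumptions; in practice `T` is usually fit with a gradient-based optimizer on a real validation set.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(labels, logits, temperature):
    """Mean binary cross-entropy of sigmoid(logit / temperature)."""
    total = 0.0
    for y, z in zip(labels, logits):
        p = sigmoid(z / temperature)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def fit_temperature(labels, logits, candidates=(0.5, 1.0, 1.5, 2.0, 3.0)):
    """Pick the candidate temperature with the lowest validation loss."""
    return min(candidates, key=lambda t: bce(labels, logits, t))

# Made-up validation logits from an overconfident model:
# every logit has a large magnitude, but 2 of the 6 predictions are wrong.
val_labels = [1, 0, 1, 0, 1, 0]
val_logits = [4.0, -4.0, 4.0, 4.0, -4.0, -4.0]

temperature = fit_temperature(val_labels, val_logits)
calibrated = sigmoid(4.0 / temperature)
print(temperature, round(sigmoid(4.0), 3), round(calibrated, 3))
```

A temperature above 1 pulls probabilities toward 0.5 without changing which side of 0.5 they fall on, so detections at the default threshold stay the same while confidence becomes less extreme.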

## Section 16: Summary and Key Takeaways

### Complete Pipeline Overview

This notebook demonstrated a complete PyTorch pipeline for multi-label emotion classification using transformer models. Here is the comprehensive workflow:

---

### Architecture Flow

```
Raw Text Input
    ↓
Tokenization (BERT Tokenizer)
    ↓
Token IDs + Attention Mask (Tensors)
    ↓
Embedding Layer (BERT)
    ↓
12 Transformer Layers (Self-Attention + FFN)
    ↓
[CLS] Token Representation (768-dim)
    ↓
Dropout Layer (Regularization)
    ↓
Linear Classification Head (768 → 5)
    ↓
Logits (Raw Scores)
    ↓
Sigmoid Activation
    ↓
Probabilities (0-1 per emotion)
    ↓
Thresholding (> 0.5)
    ↓
Binary Predictions
```

---

### Core Components Covered

**1. Data Preparation**
- Synthesized 10-row emotion dataset with 5 binary labels
- Train-validation split (80-20) → 8 training, 2 validation samples
- Custom Dataset class inheriting from torch.utils.data.Dataset
- DataCollator for batch assembly using @dataclass

**2. Tokenization**
- BERT tokenizer (bert-base-uncased)
- Max length 128 tokens
- Padding and truncation
- Special tokens: [CLS], [SEP], [PAD]
- Attention masks to distinguish real tokens from padding

**3. Model Architecture**
- Base: Pretrained BERT-base-uncased (~110M parameters)
- Custom classification head: Linear(768, 5)
- Dropout layer (p=0.3) for regularization
- Transfer learning: Fine-tune entire model

**4. Training Configuration**
- Loss: BCEWithLogitsLoss (multi-label binary cross-entropy)
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Epochs: 3 (typical for transformer fine-tuning)
- Batch size: 4
- Device: CUDA if available, else CPU

**5. Training Process**
- Trainer class orchestrating training and validation
- Forward pass → Loss computation → Backpropagation → Parameter update
- Per-epoch validation to monitor generalization
- History tracking for loss and accuracy

**6. Evaluation Metrics**
- Exact match accuracy (strict - all labels must match)
- Precision, Recall, F1 (both macro and micro averaging)
- Per-label analysis to identify strong/weak emotions

**7. Inference**
- predict_emotions function for new text
- Returns probabilities and binary decisions
- Threshold-based classification (default 0.5)

---

### Critical Concepts Explained

**Multi-Label vs Multi-Class**
- Multi-class: One label per sample (softmax, argmax)
- Multi-label: Multiple labels per sample (sigmoid, threshold)
- Our task: Multi-label (text can have anger AND sadness)
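The difference shows up when the same logits pass through both decision rules. A small sketch with made-up scores in which anger and sadness are both high:

```python
import math

def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

emotions = ['anger', 'fear', 'joy', 'sadness', 'surprise']
logits = [2.0, -1.0, -2.0, 1.5, -0.5]  # made-up scores

# Multi-class: softmax + argmax selects exactly one label.
class_probs = softmax(logits)
single_label = emotions[class_probs.index(max(class_probs))]

# Multi-label: sigmoid + threshold can select several labels (or none).
label_probs = [sigmoid(z) for z in logits]
multi_labels = [e for e, p in zip(emotions, label_probs) if p > 0.5]

print(single_label)   # anger
print(multi_labels)   # ['anger', 'sadness']
```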

**Transfer Learning**
- Start with pretrained BERT (trained on massive text corpora)
- Fine-tune on specific task (emotion classification)
- Much faster and better than training from scratch
- Requires smaller learning rate (2e-5 vs 1e-3)

**Key PyTorch Patterns**
- `model.train()` before training (enables dropout)
- `model.eval()` before evaluation (disables dropout)
- `optimizer.zero_grad()` before each backward pass
- `torch.no_grad()` during inference (no gradient computation)
- `.to(device)` to move tensors/models to GPU/CPU

**Gradient Descent Mechanics**
1. Forward pass: Compute predictions
2. Loss computation: Quantify error
3. Backward pass: Compute gradients via chain rule
4. Optimizer step: Update parameters to reduce loss
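These four steps can be traced by hand on a tiny problem. The sketch below fits a one-feature logistic model with manually derived gradients and plain gradient descent. It is a stand-in for what autograd and AdamW do in the notebook, not a copy of the Trainer, and the data is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up 1-D data: larger x means the label is more likely 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
losses = []
for step in range(20):
    # 1. Forward pass: compute predictions
    probs = [sigmoid(w * x + b) for x in xs]
    # 2. Loss computation: mean binary cross-entropy
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, probs)) / len(xs)
    losses.append(loss)
    # 3. Backward pass: for sigmoid + BCE, dL/dz = p - y; the chain rule gives dL/dw, dL/db
    grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(probs, ys)) / len(xs)
    # 4. Optimizer step: move parameters against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}, w={w:.3f}")
```

In PyTorch, step 3 is `loss.backward()` and step 4 is `optimizer.step()`; the gradients written out here are what autograd computes automatically.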

**Macro vs Micro Metrics**
- Macro: Average per-label metrics (treats all labels equally)
- Micro: Aggregate all predictions (larger classes have more weight)
- Use macro for balanced importance, micro for overall performance

---

### Common Pitfalls and Solutions

**1. Forgetting model.eval()**
- **Problem**: Dropout active during inference → non-deterministic predictions
- **Solution**: Always call `model.eval()` before inference

**2. Not using torch.no_grad() during inference**
- **Problem**: Autograd builds a computation graph for every forward pass → wasted memory and slower inference
- **Solution**: Wrap inference in `with torch.no_grad():`

**3. Not calling optimizer.zero_grad()**
- **Problem**: Gradients accumulate across batches → wrong updates
- **Solution**: Call `optimizer.zero_grad()` before each backward pass

**4. Mismatch between device placement**
- **Problem**: Model on GPU but data on CPU → runtime error
- **Solution**: Use `.to(device)` consistently for model and data

**5. Accumulating loss tensors instead of scalars**
- **Problem**: Storing loss tensors (in a list or running total) keeps their computation graphs alive → memory grows every batch
- **Solution**: Use `loss.item()` to extract scalar value

**6. Wrong activation for multi-label**
- **Problem**: Using softmax (probabilities compete and sum to 1) instead of sigmoid
- **Solution**: Use BCEWithLogitsLoss and sigmoid for multi-label

**7. Not detaching predictions before accumulation**
- **Problem**: Keeping computation graphs in memory
- **Solution**: Use `.detach().cpu().numpy()`, or accumulate inside `torch.no_grad()`

---

### Extending This Pipeline

**1. Handling Larger Datasets**
- Use DataLoader with num_workers > 0 for parallel loading
- Implement data augmentation (synonym replacement, back-translation)
- Use gradient accumulation for larger effective batch size

**2. Improving Performance**
- Try different pretrained models (RoBERTa, DistilBERT, ELECTRA)
- Tune hyperparameters (learning rate, dropout, epochs)
- Use learning rate scheduling (linear warmup, cosine decay)
- Implement early stopping to prevent overfitting
- Use class weights for imbalanced labels
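For the class-weight idea, one common choice is a per-label ratio of negatives to positives, which can then be passed as the `pos_weight` argument of `BCEWithLogitsLoss` (wrapped in a tensor). The sketch below only computes the ratios; the helper name and the training labels are made up.

```python
def compute_pos_weight(labels):
    """Per-label ratio of negative to positive samples (1.0 if a label has no positives)."""
    num_samples = len(labels)
    num_labels = len(labels[0])
    weights = []
    for j in range(num_labels):
        positives = sum(row[j] for row in labels)
        negatives = num_samples - positives
        weights.append(negatives / positives if positives > 0 else 1.0)
    return weights

# Made-up training labels: 4 samples, 3 labels.
train_labels = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]

pos_weight = compute_pos_weight(train_labels)
print(pos_weight)  # [0.3333333333333333, 3.0, 3.0]
```

Rare labels get a weight above 1, so their positive examples contribute more to the loss; frequent labels get a weight below 1.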

**3. Production Deployment**
- Save model: `torch.save(model.state_dict(), 'model.pt')`
- Load model: `model.load_state_dict(torch.load('model.pt'))`
- Use ONNX for cross-framework deployment
- Quantize model for faster inference (FP16 or INT8)
- Batch inference for throughput optimization

**4. Advanced Techniques**
- Label smoothing for better calibration
- Focal loss for hard examples
- Multi-task learning (predict sentiment + emotions)
- Active learning to select informative samples
- Model distillation to create smaller models

---

### Objectives Achieved

**Conceptual Understanding**
- How transformers process text
- Transfer learning and fine-tuning
- Multi-label classification mechanics
- Gradient descent and backpropagation
- Evaluation metrics interpretation

**Practical Skills**
- Building custom Dataset classes
- Creating data collators
- Defining neural network architectures
- Implementing training loops
- Computing and interpreting metrics
- Making predictions on new data

**Best Practices**
- Proper train/eval mode switching
- Memory-efficient inference
- Gradient management
- Device handling
- Code organization with classes

**PyTorch Proficiency**
- nn.Module inheritance
- Autograd system
- DataLoader usage
- Optimizer configuration
- Tensor operations

---

### Final Notes

**Dataset Size**: This notebook uses only 10 samples for demonstration. Real-world applications require thousands to millions of samples for robust performance.

**Computation Time**: With small data, training is very fast (seconds). Real projects may take hours or days.

**Overfitting Risk**: With 110M parameters and 8 training samples, severe overfitting is expected. This is purely educational; production models need much more data.

**Next Steps for Learners**:
1. Experiment with different hyperparameters
2. Try other pretrained models
3. Implement additional metrics (ROC-AUC, PR curves)
4. Add visualization (confusion matrix, learning curves)
5. Save and load trained models
6. Deploy model as REST API or web service

**Key Resources**:
- PyTorch documentation: https://pytorch.org/docs/
- Hugging Face Transformers: https://huggingface.co/docs/transformers/
- Papers: "Attention Is All You Need" (the Transformer architecture) and "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

---

### Congratulations!

You have completed a comprehensive deep learning pipeline covering:
- Data preparation and loading
- Tokenization and encoding
- Model architecture and transfer learning
- Training with backpropagation
- Evaluation with multiple metrics
- Inference on new examples

This foundation enables you to tackle diverse NLP tasks including sentiment analysis, named entity recognition, question answering, and text generation.

**Remember**: Deep learning is iterative. Experiment, analyze results, and refine. The best models come from understanding both theory and practice through hands-on experience.