# 🧠 Model Baseline - Building Your First NLP Neural Network

## 📚 Overview

In this notebook, you'll build your **first PyTorch neural network for NLP** from scratch! You'll implement a simple but effective baseline model using:
- **Embedding Layer**: Convert word indices to dense vectors
- **Neural Network Layers**: Learn patterns from text
- **Output Layer**: Binary classification (disaster vs. non-disaster)

## 🎯 Learning Objectives

By completing this notebook, you will:
1. **Understand word embeddings** and how they represent text
2. **Build a PyTorch model** with proper architecture
3. **Implement forward pass** for text classification
4. **Learn about model components**: Embedding, Linear layers, activations
5. **Initialize model parameters** properly
6. **Understand model architecture decisions** and their trade-offs

## 📋 Prerequisites

Before starting, ensure you've completed:
- ✅ `00_exploration.ipynb` - Data exploration
- ✅ `01_preprocessing.ipynb` - Text preprocessing
- ✅ `02_vocab_and_dataloader.ipynb` - Vocabulary and DataLoader

You should have:
- `vocab_dict`: Vocabulary mapping words to indices
- `train_loader` and `val_loader`: DataLoaders ready
- Understanding of your data shapes: `[batch_size, seq_length]`

---

## 🏗️ Model Architecture Overview

We'll build a **simple baseline model** with this structure:

```
Input: [batch_size, seq_length] (word indices)
    ↓
Embedding Layer: [batch_size, seq_length, embedding_dim]
    ↓
Pooling/Aggregation: [batch_size, embedding_dim]
    ↓
Hidden Layer(s): [batch_size, hidden_dim]
    ↓
Output Layer: [batch_size, 1]
    ↓
Sigmoid: [batch_size, 1] (probability; skip this step in forward() when training with BCEWithLogitsLoss)
```

### Key Design Decisions:
- **Embedding Dimension**: How many features per word? (e.g., 100, 200, 300)
- **Pooling Strategy**: How to combine word embeddings? (mean, max, sum)
- **Hidden Layers**: How many? How large?
- **Activation Functions**: ReLU, tanh, or others?
- **Dropout**: Regularization to prevent overfitting

---

## 🚀 Let's Build!

Work through each TODO step-by-step. **Write the code yourself** - no copy-paste!

---


## TODO 1: Setup and Load Dependencies 📦

**Goal**: Import necessary libraries and load your vocabulary and data.

**What you need**:
- PyTorch modules: `torch`, `torch.nn`, `torch.nn.functional`
- Data handling: `pandas`, `numpy`
- Your previous work: vocabulary dictionary, DataLoaders

**Hint**: You'll need to re-run or import the code from `02_vocab_and_dataloader.ipynb` to get your `vocab_dict`, `train_loader`, and create a validation loader.

**Questions to consider**:
- What is the vocabulary size? (You'll need this for the embedding layer)
- What device will you use? (CPU vs. GPU)
- Do you need to set a random seed for reproducibility?

**Expected outcome**: 
- All imports successful
- `vocab_dict` loaded
- `train_loader` ready
- Know the `vocab_size` and `max_seq_length`
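
As a rough starting point, the seed and device parts of this step could look like the sketch below. The `vocab_dict` here is a hypothetical five-word stand-in so the snippet runs on its own; in your notebook it comes from `02_vocab_and_dataloader.ipynb`, and the seed value is arbitrary.

```python
import random

import torch

# Hypothetical stand-in for the vocabulary built in 02_vocab_and_dataloader.ipynb.
vocab_dict = {"<PAD>": 0, "<UNK>": 1, "fire": 2, "flood": 3, "storm": 4}

SEED = 42  # arbitrary; any fixed integer gives repeatable runs
random.seed(SEED)
torch.manual_seed(SEED)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The embedding layer needs one row per vocabulary entry.
vocab_size = len(vocab_dict)
print(f"vocab_size={vocab_size}, device={device}")
```

If you also use NumPy for shuffling or sampling, seed it in the same place so every source of randomness is fixed.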


In [None]:
# TODO 1: Your code here
# Import libraries


# Load vocabulary and create DataLoaders


# Set device and random seed


# Print key information


## TODO 2: Understand Word Embeddings 🔤➡️🔢

**Goal**: Learn what embeddings are and why we use them.

**Key Concepts**:
1. **One-Hot Encoding Problem**: If vocab_size = 15,000, each word would be a 15,000-dimensional sparse vector ❌
2. **Dense Embeddings Solution**: Each word becomes a dense vector of size `embedding_dim` (e.g., 100) ✅
3. **Learned Representations**: The model learns meaningful word vectors during training

**Example**:
```python
# One-hot encoding (sparse)
word "fire" → [0, 0, 0, ..., 1, ..., 0]  # 15,000 dimensions, one 1

# Embedding (dense)
word "fire" → [0.23, -0.45, 0.67, ...]  # 100 dimensions, all meaningful
```

**Why embeddings?**:
- **Efficiency**: 100 dimensions vs. 15,000 dimensions
- **Generalization**: Similar words get similar vectors
- **Learnable**: Vectors are updated during training
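
One way to see the efficiency point: an embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix, just without building the sparse vector. The sketch below checks this on toy sizes (10 words, 5 dimensions); the index 3 is an arbitrary choice standing in for a word such as "fire".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab_size, toy_embedding_dim = 10, 5  # toy sizes chosen for readability
embedding = nn.Embedding(toy_vocab_size, toy_embedding_dim)

word_index = torch.tensor([3])  # pretend index 3 is "fire"

# Route 1: direct lookup in the embedding table.
looked_up = embedding(word_index)  # shape [1, 5]

# Route 2: build the one-hot vector and multiply by the same weight matrix.
one_hot = F.one_hot(word_index, num_classes=toy_vocab_size).float()  # shape [1, 10]
multiplied = one_hot @ embedding.weight  # shape [1, 5]

print(torch.allclose(looked_up, multiplied))  # True
```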

**Task**: Create a simple example to understand `nn.Embedding`:
- Create a small embedding layer (vocab_size=10, embedding_dim=5)
- Pass in some word indices
- Observe the output shape and values

**Hint**: `nn.Embedding(num_embeddings, embedding_dim)` creates a lookup table
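
A minimal sketch of that lookup table in action, assuming index 0 is your padding token. The two "sentences" are made-up index sequences, not real data.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=5, padding_idx=0)

# Two made-up "sentences" of length 4; index 0 is padding.
batch = torch.tensor([[4, 7, 2, 0],
                      [1, 9, 0, 0]])

vectors = embedding(batch)
print(batch.shape)    # torch.Size([2, 4])
print(vectors.shape)  # torch.Size([2, 4, 5])
print(vectors[0, 3])  # the padding position: a vector of zeros
```

Each index is replaced by its row in the table, so the output gains one trailing dimension of size `embedding_dim`. The row for `padding_idx` starts as zeros and receives no gradient updates.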


In [None]:
# TODO 2: Your code here
# Create a simple embedding example to understand how it works

# Create embedding layer


# Create sample input (word indices)


# Pass through embedding


# Print shapes and observe


## TODO 3: Design Your Model Architecture 🏗️

**Goal**: Create a PyTorch model class for text classification.

**Model Components**:
1. **Embedding Layer**: `nn.Embedding(vocab_size, embedding_dim, padding_idx=0)`
2. **Pooling/Aggregation**: How to convert [batch, seq_len, embed_dim] → [batch, embed_dim]
   - Mean pooling: Take average of all word vectors (ideally over non-padding positions only)
   - Max pooling: Take maximum values
   - Sum pooling: Sum all word vectors
3. **Hidden Layers**: `nn.Linear(input_dim, hidden_dim)` with activation (ReLU)
4. **Output Layer**: `nn.Linear(hidden_dim, 1)` for binary classification
5. **Dropout**: `nn.Dropout(p=0.5)` for regularization
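
For the pooling step, here is a small sketch of mean pooling that skips padding positions. The helper name `masked_mean_pool` and the tiny hand-written tensors are assumptions for illustration; they are not part of the notebook's earlier code.

```python
import torch

def masked_mean_pool(embedded, token_ids, padding_idx=0):
    """Average word vectors over real tokens only, ignoring padding positions."""
    mask = (token_ids != padding_idx).unsqueeze(-1).float()  # [batch, seq_len, 1]
    summed = (embedded * mask).sum(dim=1)                    # [batch, embed_dim]
    lengths = mask.sum(dim=1).clamp(min=1.0)                 # [batch, 1], avoids division by zero
    return summed / lengths

token_ids = torch.tensor([[5, 3, 0, 0]])   # two real tokens, two pads
embedded = torch.tensor([[[2.0, 4.0],
                          [4.0, 8.0],
                          [0.0, 0.0],
                          [0.0, 0.0]]])    # [1, 4, 2]

naive = embedded.mean(dim=1)                    # divides by 4, pads included
masked = masked_mean_pool(embedded, token_ids)  # divides by 2, real tokens only
print(naive)   # tensor([[1.5000, 3.0000]])
print(masked)  # tensor([[3., 6.]])
```

The naive mean shrinks toward zero as more padding is added, so short tweets end up with smaller vectors than long ones. The masked version avoids that.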

**Architecture Template**:
```python
class DisasterTweetClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, padding_idx=0):
        super().__init__()
        # Define layers here
        
    def forward(self, x):
        # x shape: [batch_size, seq_length]
        # Define forward pass
        # Return: [batch_size, 1]
```

**Design Decisions**:
- `embedding_dim`: Start with 100 or 200
- `hidden_dim`: Try 128 or 256
- Pooling: Mean pooling is a good start; mask out `<PAD>` positions so tweets of different lengths are averaged over their real tokens only
- Dropout: 0.3-0.5 to prevent overfitting

**Hint**: The forward pass should:
1. Apply embedding
2. Pool across sequence length dimension
3. Pass through hidden layer(s) with activation
4. Apply dropout
5. Output layer for final prediction
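
If you get stuck, the five steps above could be wired together roughly as follows. This is one possible sketch, not the reference solution: the class name `MeanPoolClassifier`, the `dropout` argument, and the sizes in the usage lines are all assumptions. Try your own version first.

```python
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):
    """Illustrative baseline: embedding -> masked mean pool -> hidden -> logit."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, padding_idx=0, dropout=0.3):
        super().__init__()
        self.padding_idx = padding_idx
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.hidden = nn.Linear(embedding_dim, hidden_dim)
        self.dropout = nn.Dropout(p=dropout)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)                          # [batch, seq_len, embed_dim]
        mask = (x != self.padding_idx).unsqueeze(-1).float()  # [batch, seq_len, 1]
        pooled = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        hidden = torch.relu(self.hidden(pooled))              # [batch, hidden_dim]
        hidden = self.dropout(hidden)
        return self.output(hidden)                            # raw logits, [batch, 1]

model = MeanPoolClassifier(vocab_size=1000, embedding_dim=100, hidden_dim=128)
dummy = torch.randint(0, 1000, (32, 50))
print(model(dummy).shape)  # torch.Size([32, 1])
```

Note that `forward` returns raw logits with no sigmoid, which is the form `nn.BCEWithLogitsLoss` expects.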


In [None]:
# TODO 3: Your code here
# Create your model class

class DisasterTweetClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, padding_idx=0):
        super().__init__()
        # TODO: Define your layers
        
    def forward(self, x):
        # TODO: Implement forward pass
        # x shape: [batch_size, seq_length]
        
        # Step 1: Embedding
        
        # Step 2: Pooling (mean, max, or sum)
        
        # Step 3: Hidden layer(s) with activation
        
        # Step 4: Dropout
        
        # Step 5: Output layer
        
        return output  # shape: [batch_size, 1]


## TODO 4: Instantiate and Inspect Your Model 🔍

**Goal**: Create an instance of your model and understand its structure.

**Tasks**:
1. **Instantiate the model** with your chosen hyperparameters
2. **Move model to device** (CPU or GPU)
3. **Print model architecture** using `print(model)`
4. **Count parameters** to understand model size
5. **Test forward pass** with a dummy batch

**Counting Parameters**:
```python
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

**Expected hyperparameters**:
- `vocab_size`: Your actual vocabulary size (from TODO 1)
- `embedding_dim`: 100-300 (start with 100 or 200)
- `hidden_dim`: 128-512 (start with 128 or 256)
- `padding_idx`: 0 (your `<PAD>` token index)

**Test with dummy data**:
- Create a dummy batch on the same device as the model: `torch.randint(0, vocab_size, (32, 50), device=device)`
- Pass through model
- Check output shape: should be `[32, 1]`

**Questions to answer**:
- How many trainable parameters does your model have?
- What's the shape of the embedding layer weight matrix?
- Does your model work with different batch sizes?
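
To sanity-check your parameter count, you can work it out by hand and compare. The sketch below does this for a stand-in set of layers with example sizes (15,000 words, 100-dimensional embeddings, 128 hidden units); your own numbers will differ.

```python
import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

vocab_size, embedding_dim, hidden_dim = 15_000, 100, 128  # example values only

# Stand-in for a model with one embedding, one hidden layer, and one output layer.
layers = nn.ModuleDict({
    "embedding": nn.Embedding(vocab_size, embedding_dim, padding_idx=0),
    "hidden": nn.Linear(embedding_dim, hidden_dim),
    "output": nn.Linear(hidden_dim, 1),
})

expected = (
    vocab_size * embedding_dim                 # embedding table: 1,500,000
    + embedding_dim * hidden_dim + hidden_dim  # hidden weights + biases: 12,928
    + hidden_dim * 1 + 1                       # output weights + bias: 129
)
print(count_parameters(layers), expected)  # 1513057 1513057
```

In this example the embedding table accounts for about 99% of the parameters, which is typical for small text models.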


In [None]:
# TODO 4: Your code here
# Instantiate and inspect your model

# Define helper function


# Set hyperparameters


# Instantiate model


# Move to device


# Print model architecture


# Count and print parameters


# Test forward pass with dummy data


## TODO 5: Understand Loss Function and Optimizer ⚙️

**Goal**: Choose and configure loss function and optimizer for binary classification.

### Loss Function: Binary Cross-Entropy

For binary classification, we use **Binary Cross-Entropy (BCE) Loss**:

```python
# Two options:
# 1. BCELoss - requires sigmoid in model output
loss_fn = nn.BCELoss()

# 2. BCEWithLogitsLoss - includes sigmoid (more numerically stable) ✅ RECOMMENDED
loss_fn = nn.BCEWithLogitsLoss()
```

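**Why BCEWithLogitsLoss?**
- More numerically stable: it fuses the sigmoid and the BCE computation, which avoids taking `log(0)` when the sigmoid saturates
- Expects raw logits, so the model's `forward` should return the output layer's result **without** a sigmoid
- Standard choice for binary classification in PyTorch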
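
The sketch below compares the two options on made-up logits. For moderate values they agree; for an extreme logit the separate sigmoid rounds to exactly 1.0 in float32, so the two-step version can no longer recover the true loss.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])  # raw model outputs, made up
targets = torch.tensor([[1.0], [0.0], [1.0]])

# For ordinary logits the two formulations give the same loss.
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(loss_fused, loss_two_step))  # True

# A confidently wrong prediction with an extreme logit.
extreme = torch.tensor([[200.0]])
wrong_label = torch.tensor([[0.0]])

stable_loss = nn.BCEWithLogitsLoss()(extreme, wrong_label)
naive_loss = nn.BCELoss()(torch.sigmoid(extreme), wrong_label)
print(stable_loss.item())  # close to 200, the true loss
print(naive_loss.item())   # differs, because sigmoid(200) rounded to 1.0
```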

### Optimizer: Adam

**Adam** (Adaptive Moment Estimation) is a great default choice:
```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```

**Learning Rate Guidelines**:
- Start with `lr=0.001` (standard default)
- Too high → unstable training, loss explodes
- Too low → very slow training
- You can adjust later based on training behavior

### Optional: Learning Rate Scheduler

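`ReduceLROnPlateau` lowers the learning rate when a monitored metric (here, validation loss) stops improving. With the settings below, the rate is halved once the loss has gone more than 2 epochs without improving. Call `scheduler.step(val_loss)` once per epoch: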
```python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=2
)
```
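
To see how this scheduler reacts, the sketch below feeds it a made-up sequence of validation losses that improves once and then stalls. The tiny `nn.Linear` model and the loss values are placeholders, not results from this project.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # tiny placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

# Made-up validation losses that improve once and then stall.
fake_val_losses = [0.70, 0.65, 0.65, 0.65, 0.65, 0.65]

for epoch, val_loss in enumerate(fake_val_losses, start=1):
    # ... a real loop would train for one epoch and compute val_loss here ...
    scheduler.step(val_loss)  # this scheduler needs the monitored metric
    current_lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {epoch}: val_loss={val_loss:.2f}, lr={current_lr}")
```

The learning rate stays at its initial value while the loss improves and drops only after the plateau outlasts `patience`.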

**Task**: Set up your loss function and optimizer.


In [None]:
# TODO 5: Your code here
# Set up loss function and optimizer

# Define loss function


# Define optimizer


# Optional: Learning rate scheduler


# Print configuration


## TODO 6: Test Model with Real Data 🧪

**Goal**: Verify your model works with actual data from your DataLoader.

**Tasks**:
1. Get one batch from your `train_loader`
2. Pass batch through model
3. Calculate loss
4. Check all shapes are correct
5. Verify gradients can be computed

**Expected Flow**:
```python
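# Get batch and move it to the same device as the model
texts, labels = next(iter(train_loader))
texts, labels = texts.to(device), labels.to(device)

# Forward pass
outputs = model(texts)  # [batch_size, 1]

# Calculate loss (match shapes and dtype for BCEWithLogitsLoss)
loss = loss_fn(outputs.squeeze(1), labels.float())

# Backward pass (just to test)
optimizer.zero_grad()
loss.backward()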
```

**Shapes to verify**:
- `texts`: `[batch_size, seq_length]` (e.g., `[32, 50]`)
- `labels`: `[batch_size]` (e.g., `[32]`)
- `outputs`: `[batch_size, 1]` (e.g., `[32, 1]`)
- `loss`: scalar value

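**Important**: `BCEWithLogitsLoss` requires the output and the labels to have the same shape, and the labels to be float, so you need to:
- Convert labels to float: `labels.float()`
- Match shapes: `outputs.squeeze(1)` or `labels.unsqueeze(1)` (prefer `squeeze(1)` over a bare `squeeze()`, which also drops the batch dimension when `batch_size` is 1)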

**What to check**:
- ✅ No shape errors
- ✅ Loss is a reasonable number (not NaN, not infinity)
- ✅ Gradients are computed
- ✅ Model parameters require gradients
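
The checks above can be scripted roughly as follows. To keep the snippet self-contained it uses a synthetic batch and a tiny stand-in model (`TinyClassifier`, with illustrative sizes); in your notebook, replace them with a batch from `train_loader` and your own model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TinyClassifier(nn.Module):
    """Tiny stand-in model; layer names and sizes are illustrative only."""

    def __init__(self, vocab_size=100, embedding_dim=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.output = nn.Linear(embedding_dim, 1)

    def forward(self, x):
        return self.output(self.embedding(x).mean(dim=1))  # logits, [batch, 1]

model = TinyClassifier().to(device)
loss_fn = nn.BCEWithLogitsLoss()

# Synthetic batch standing in for next(iter(train_loader)).
texts = torch.randint(1, 100, (32, 50)).to(device)
labels = torch.randint(0, 2, (32,)).to(device)  # integer labels, shape [32]

outputs = model(texts)                              # [32, 1]
loss = loss_fn(outputs.squeeze(1), labels.float())  # shapes match: [32] vs [32]
loss.backward()

print(outputs.shape, labels.shape, loss.item())
print("loss is finite:", torch.isfinite(loss).item())
for name, param in model.named_parameters():
    has_grad = param.grad is not None and param.grad.abs().sum().item() > 0
    print(f"{name}: requires_grad={param.requires_grad}, grad computed={has_grad}")
```

A freshly initialized model should give a loss near 0.69 (that is, ln 2), since it is effectively guessing between two classes.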


In [None]:
# TODO 6: Your code here
# Test model with real data

# Get one batch


# Move to device if needed


# Forward pass


# Calculate loss


# Print shapes and values


# Test backward pass


# Check gradients


## TODO 7: Save Your Model Architecture 💾

**Goal**: Save your model architecture and configuration for the next notebook.

You'll need these for training:
- Model class definition
- Hyperparameters (vocab_size, embedding_dim, hidden_dim)
- Model instance
- Loss function
- Optimizer

**Options for saving**:

### Option 1: Save just the class definition
Create a `src/models/baseline_model.py` file with your model class

### Option 2: Save model state dict
```python
torch.save(model.state_dict(), 'models/baseline_model.pth')
```

### Option 3: Save entire model (pickles the whole object, so reloading needs the same class definition importable)
```python
torch.save(model, 'models/baseline_model_full.pth')
```

**Recommended approach** for this learning project:
- Copy your model class to `src/models/baseline_model.py`
- Save hyperparameters in a config dictionary
- This way you can import it in the next notebook for training

**Task**: 
1. Create the model file in `src/models/`
2. Save your hyperparameters
3. Test that you can reload everything
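
A minimal round-trip sketch for steps 2 and 3, assuming you store hyperparameters as JSON and weights as a state dict. It writes to a temporary directory so it runs anywhere; `TinyClassifier`, the file names, and the config values are hypothetical stand-ins for your own class and paths.

```python
import json
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in for your own model class."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, padding_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.hidden = nn.Linear(embedding_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        pooled = self.embedding(x).mean(dim=1)
        return self.output(torch.relu(self.hidden(pooled)))

# Keys match the constructor arguments so the config can be unpacked directly.
config = {"vocab_size": 1000, "embedding_dim": 16, "hidden_dim": 8, "padding_idx": 0}
model = TinyClassifier(**config).eval()

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "baseline_config.json"
    weights_path = Path(tmp_dir) / "baseline_model.pth"

    # Save hyperparameters and weights separately.
    config_path.write_text(json.dumps(config, indent=2))
    torch.save(model.state_dict(), weights_path)

    # Rebuild the model from the config, then load the saved weights into it.
    loaded_config = json.loads(config_path.read_text())
    reloaded = TinyClassifier(**loaded_config).eval()
    reloaded.load_state_dict(torch.load(weights_path, map_location="cpu"))

# Both models should produce identical outputs on the same input.
sample = torch.randint(0, config["vocab_size"], (4, 10))
with torch.no_grad():
    same = torch.equal(model(sample), reloaded(sample))
print(same)  # True
```

Keeping the config keys identical to the constructor arguments is what makes `DisasterTweetClassifier(**config)` work in the next notebook.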


In [None]:
# TODO 7: Your code here
# Save model architecture and configuration

# Create config dictionary


# Option 1: Write model class to file (recommended for learning)
# You can manually create src/models/baseline_model.py and copy your class there

# Option 2: Save model state
# Uncomment if you want to save the initialized model


# Test reload
# from src.models.baseline_model import DisasterTweetClassifier
# loaded_model = DisasterTweetClassifier(**config)


---

## 🎉 Congratulations!

You've successfully:
- ✅ Built your first PyTorch NLP model from scratch
- ✅ Understood word embeddings and why they're used
- ✅ Designed a neural network architecture for text classification
- ✅ Set up loss function and optimizer
- ✅ Tested your model with real data
- ✅ Prepared everything for training

## 📊 Model Summary

Review what you've created:
- **Architecture**: Embedding → Pooling → Hidden Layer(s) → Output
- **Parameters**: ~XXX,XXX trainable parameters
- **Input**: Word indices `[batch_size, seq_length]`
- **Output**: One logit per example `[batch_size, 1]` (or a probability, if you kept the sigmoid inside the model)
- **Loss**: Binary Cross-Entropy
- **Optimizer**: Adam

## 🚀 Next Steps

Your model is ready! Move on to:
- **`04_training_and_eval.ipynb`**: Train your model and evaluate performance

---

## 💡 Key Learnings

**What you learned**:
1. **Embeddings convert discrete words to continuous vectors** - more efficient than one-hot
2. **PyTorch models are classes** inheriting from `nn.Module`
3. **Pooling aggregates variable-length sequences** into fixed-size representations
4. **BCEWithLogitsLoss is the standard** for binary classification
5. **Adam optimizer is a great default** for most deep learning tasks

**Questions to reflect on**:
- Why do we need pooling? (Variable length → Fixed length for FC layers)
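- What's the difference between BCE and BCEWithLogitsLoss? (`BCELoss` expects probabilities after a sigmoid; `BCEWithLogitsLoss` takes raw logits and applies the sigmoid internally, which is more numerically stable)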
- How many parameters does your embedding layer have? (vocab_size × embedding_dim)
- Could you add more hidden layers? (Yes! Deeper networks can learn more complex patterns, at the cost of more parameters and a higher risk of overfitting)

---
