# 🧠 Notebook 05: Simple Neural Network with Embedding

## Building Your First NLP Model

This notebook teaches you how to build a neural network specifically designed for text classification. You'll create an embedding layer that learns dense representations of words, then use mean pooling to aggregate sequences into fixed-size vectors for classification.


## 🧠 Concept Primer: Neural Network Architecture

### What We're Building
A neural network that transforms sequences of word IDs into aspect predictions through learned embeddings and mean pooling.

### Why This Architecture Works
**Embeddings learn semantic relationships.** Instead of treating words as discrete symbols, embeddings represent them as dense vectors where similar words have similar representations.
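To make the "lookup table" idea concrete, here is a minimal sketch showing that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix. The toy sizes (`vocab_size = 6`, `embed_dim = 4`) and the example IDs are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes, assumed for illustration only
vocab_size, embed_dim = 6, 4
embedding = nn.Embedding(vocab_size, embed_dim)

ids = torch.tensor([2, 5, 2])

# Direct lookup: each ID selects one row of the weight matrix
looked_up = embedding(ids)  # [3, 4]

# Equivalent (but wasteful) route: one-hot vectors times the weight matrix
one_hot = F.one_hot(ids, num_classes=vocab_size).float()  # [3, 6]
via_matmul = one_hot @ embedding.weight                    # [3, 4]

print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids building the sparse one-hot matrix at all, and the rows of `embedding.weight` are ordinary parameters updated during training.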

### Architecture Flow
1. **Input**: Word IDs with shape `[batch, seq_len]`, e.g. `[16, 128]`
2. **Embedding**: Lookup table producing `[batch, seq_len, embed_dim]`, e.g. `[16, 128, 50]`
3. **Masking**: Ignore padding tokens during pooling
4. **Mean Pool**: Aggregate over the sequence to `[batch, embed_dim]`, e.g. `[16, 50]`
5. **Linear Layers**: Classification to `[batch, n_classes]`, e.g. `[16, n_aspects]`
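The flow above can be traced on a random batch to confirm the shapes at each step. This is a sketch written with loose layers rather than the model class, and several values are assumptions not given in the notebook: `vocab_size = 1000`, `n_aspects = 5`, and the choice to treat the last 28 positions of every row as padding.

```python
import torch
import torch.nn as nn

batch, seq_len, embed_dim, hidden = 16, 128, 50, 100
vocab_size, n_aspects, pad_id = 1000, 5, 1  # vocab_size and n_aspects are assumed values

# Random word IDs (starting at 2 so none collide with pad_id), then fake padding
x = torch.randint(2, vocab_size, (batch, seq_len))
x[:, 100:] = pad_id

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
classifier = nn.Sequential(
    nn.Linear(embed_dim, hidden),
    nn.ReLU(),
    nn.Linear(hidden, n_aspects),
)

embedded = embedding(x)                                   # [16, 128, 50]
mask = (x != pad_id).unsqueeze(-1).float()                # [16, 128, 1]
pooled = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # [16, 50]
logits = classifier(pooled)                               # [16, 5]

for name, t in [("x", x), ("embedded", embedded), ("mask", mask),
                ("pooled", pooled), ("logits", logits)]:
    print(f"{name:9s} {tuple(t.shape)}")
```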

### Math Mapping
- **Embedding**: `nn.Embedding(vocab_size, embed_dim)` → lookup table
- **Masking**: `(x != pad_id).unsqueeze(-1).float()` → `[batch, seq, 1]`
- **Mean Pool**: `(embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)`
- **Classification**: `Linear(embed_dim, hidden) → ReLU → Linear(hidden, n_classes)`
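A small worked example may help show what the masked mean does numerically. The embedded values below are hand-written stand-ins (an assumption, not the output of a trained layer) so the arithmetic can be checked by eye: one sequence with two real tokens followed by two padding positions.

```python
import torch

pad_id = 1

# One sequence: two real tokens (IDs 4 and 7) followed by two pads
x = torch.tensor([[4, 7, 1, 1]])

# Hand-written "embeddings" of size 2; pad positions are zero vectors
embedded = torch.tensor([[[1.0, 2.0],
                          [3.0, 4.0],
                          [0.0, 0.0],
                          [0.0, 0.0]]])  # [1, 4, 2]

mask = (x != pad_id).unsqueeze(-1).float()  # [1, 4, 1]

# Masked mean divides by the number of real tokens (2)
masked_mean = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

# Naive mean divides by the full sequence length (4)
naive_mean = embedded.mean(dim=1)

print(masked_mean.tolist())  # [[2.0, 3.0]]
print(naive_mean.tolist())   # [[1.0, 1.5]]
```

The naive mean shrinks toward zero as more padding is added, so two identical sentences padded to different lengths would get different representations.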

### Common Pitfalls
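- **Forgetting `.clamp(min=1)`** divides 0 by 0 for sequences that contain only padding, which produces `NaN` values
- **Wrong `padding_idx`** in the embedding layer freezes a real token's vector (it receives no gradient updates) while the actual padding vector is trained like any other word
- **Incorrect tensor shapes** cause runtime errors, for example omitting `.unsqueeze(-1)` leaves the mask as `[batch, seq]`, which cannot broadcast against `[batch, seq, embed]`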
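The first two pitfalls can be reproduced in a few lines. This is a sketch under assumed toy sizes (a vocabulary of 10 and embeddings of size 3); it shows the `NaN` that appears without `.clamp(min=1)` and checks that the row at `padding_idx` receives a zero gradient.

```python
import torch
import torch.nn as nn

pad_id = 1
embedding = nn.Embedding(10, 3, padding_idx=pad_id)  # toy sizes, assumed

# Second row contains nothing but padding
x = torch.tensor([[4, 7, 1],
                  [1, 1, 1]])

embedded = embedding(x)
mask = (x != pad_id).unsqueeze(-1).float()
summed = (embedded * mask).sum(dim=1)
count = mask.sum(dim=1)  # [[2.], [0.]]

unsafe = summed / count              # 0 / 0 for the all-padding row
safe = summed / count.clamp(min=1)   # 0 / 1 for the all-padding row

print(unsafe[1])  # NaN values
print(safe[1])    # zeros

# padding_idx: the pad row gets no gradient, real tokens do
embedding(x).sum().backward()
print(embedding.weight.grad[pad_id])  # zero gradient for the pad row
print(embedding.weight.grad[4])       # non-zero gradient for token 4
```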


## 🔧 TODO #1: Define Model Class

**Task:** Create the neural network class with embedding layer and linear layers.

**Hint:** Use `class SimpleNNWithEmbedding(nn.Module):` with `__init__(self, vocab_size, output_size, embed_size=50, hidden_size=100, pad_id=1)`. Python requires parameters without defaults to come before those with defaults, so `output_size` precedes `embed_size`.

**Expected Class Structure:**
```python
class SimpleNNWithEmbedding(nn.Module):
    def __init__(self, vocab_size, output_size, embed_size=50, hidden_size=100, pad_id=1):
        super().__init__()
        # TODO: Define layers here

    def forward(self, x):
        # TODO: Implement forward pass
        return logits
```


```python
# TODO #1: Define model class
import torch
import torch.nn as nn

# Your code here
```


## 🔧 TODO #2: Implement Forward Pass

**Task:** Complete the forward method with embedding lookup, masking, mean pooling, and classification.

**Shape Hints:**
- `embedded = self.embedding(x)` → `[batch, seq, embed]`
- `mask = (x != self.pad_id).unsqueeze(-1).float()` → `[batch, seq, 1]`
- `masked_sum = (embedded * mask).sum(dim=1)` → `[batch, embed]`
- `count = mask.sum(dim=1).clamp(min=1)` → `[batch, 1]`
- `pooled = masked_sum / count` → `[batch, embed]`
- Pass through Linear1+ReLU, then Linear2 → `[batch, n_aspects]`

**Critical:** Don't forget `.clamp(min=1)` to prevent division by zero!


```python
# TODO #2: Implement forward pass
# Add your forward method implementation to the class above
```


## 🔧 TODO #3: Instantiate Model and Training Components

**Task:** Create model instance, loss function, and optimizer.

**Hint:** 
- `model = SimpleNNWithEmbedding(vocab_size=len(vocab), output_size=n_aspects)`
- `criterion = nn.CrossEntropyLoss()`
- `optimizer = torch.optim.Adam(model.parameters(), lr=0.005)`

**Expected Variables:**
- `model` → Instantiated neural network
- `criterion` → Cross-entropy loss function
- `optimizer` → Adam optimizer with learning rate 0.005

**Test:** Try `model(X_batch)` to verify forward pass returns shape `[16, n_aspects]`
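To see how `criterion` and `optimizer` fit together once the model exists, here is a sketch of a single training step. It deliberately uses a hypothetical stand-in (`nn.Linear` on random features) instead of `SimpleNNWithEmbedding`, and `n_aspects = 5` plus the random inputs and targets are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, embed_dim, n_aspects = 16, 50, 5  # n_aspects = 5 is an assumed value

# Hypothetical placeholder for the real model; swap in your own instance
stand_in = nn.Linear(embed_dim, n_aspects)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(stand_in.parameters(), lr=0.005)

# Random features and integer class labels, shapes [16, 50] and [16]
features = torch.randn(batch, embed_dim)
targets = torch.randint(0, n_aspects, (batch,))

before = stand_in.weight.detach().clone()

optimizer.zero_grad()               # clear gradients from any previous step
logits = stand_in(features)         # raw scores, shape [16, 5]
loss = criterion(logits, targets)   # scalar loss
loss.backward()                     # compute gradients
optimizer.step()                    # update parameters

print(tuple(logits.shape), loss.item())
```

`nn.CrossEntropyLoss` expects raw logits of shape `[batch, n_classes]` and integer targets of shape `[batch]`, so the model should not apply softmax itself.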


```python
# TODO #3: Instantiate model and training components
# Your code here
```


## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Why mask padding tokens in mean pooling?** What would happen if you included padding in the average?

2. **What does the embedding layer actually learn?** How does it differ from one-hot encoding?

3. **Why use mean pooling instead of just taking the last token?** What information would you lose?

4. **How does the embedding dimension (50) affect model capacity?** What's the tradeoff?

### 🎯 Architecture Design
- Why is this architecture well-suited for text classification?
- How does mean pooling handle variable-length sequences?
- What would happen if you used max pooling instead of mean pooling?

---

**Write your reflections here:**
