# 🧠 Notebook 05: Simple Neural Network with Embedding

## Building Your First NLP Model

This notebook teaches you how to build a neural network specifically designed for text classification. You'll create an embedding layer that learns dense representations of words, then use mean pooling to aggregate sequences into fixed-size vectors for classification.


## 🧠 Concept Primer: Neural Network Architecture

### What We're Building
A neural network that transforms sequences of word IDs into aspect predictions through learned embeddings and mean pooling.

### Why This Architecture Works
**Embeddings learn semantic relationships.** Instead of treating words as discrete symbols, embeddings represent them as dense vectors where similar words have similar representations.

### Architecture Flow
1. **Input**: Word IDs `[batch, seq_len]` → `[16, 128]`
2. **Embedding**: Lookup table → `[batch, seq_len, embed_dim]` → `[16, 128, 50]`
3. **Masking**: Ignore padding tokens during pooling
4. **Mean Pool**: Aggregate sequence → `[batch, embed_dim]` → `[16, 50]`
5. **Linear Layers**: Classification → `[batch, n_classes]` → `[16, n_aspects]`
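
The five steps above can be traced with a short shape check. This is a minimal sketch, not the notebook's final model: `vocab_size = 1002` and `n_aspects = 3` are assumed values, the input IDs are random, and the hidden size of 100 is chosen for illustration.

```python
import torch
import torch.nn as nn

# Sizes from the flow above; vocab_size, hidden and n_aspects are assumptions.
batch, seq_len, embed_dim = 16, 128, 50
vocab_size, hidden, n_aspects, pad_id = 1002, 100, 3, 0

torch.manual_seed(0)
x = torch.randint(1, vocab_size, (batch, seq_len))   # 1. word IDs [16, 128]
x[:, 100:] = pad_id                                   # pretend each review has 100 real tokens

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
embedded = embedding(x)                               # 2. [16, 128, 50]
mask = (x != pad_id).unsqueeze(-1).float()            # 3. [16, 128, 1]
pooled = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # 4. [16, 50]

classifier = nn.Sequential(
    nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_aspects)
)
logits = classifier(pooled)                           # 5. [16, 3]

for name, t in [("x", x), ("embedded", embedded), ("mask", mask),
                ("pooled", pooled), ("logits", logits)]:
    print(f"{name:9s} {tuple(t.shape)}")
```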

### Math Mapping
- **Embedding**: `nn.Embedding(vocab_size, embed_dim)` → lookup table
- **Masking**: `(x != pad_id).unsqueeze(-1).float()` → `[batch, seq, 1]`
- **Mean Pool**: `(embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)`
- **Classification**: `Linear(embed_dim, hidden) → ReLU → Linear(hidden, n_classes)`
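
To make the mean-pool formula concrete, here is a small worked example with hand-written 2-dimensional "embeddings" (made-up numbers, not learned values). It compares the masked mean from the formula above with a naive mean that also counts padding positions.

```python
import torch

pad_id = 0
# Two sequences of length 4; the second one ends with two padding positions.
x = torch.tensor([[5, 7, 2, 9],
                  [4, 3, 0, 0]])
# Made-up embeddings, chosen so the arithmetic is easy to follow.
embedded = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [0.0, 0.0]],
])

mask = (x != pad_id).unsqueeze(-1).float()    # [2, 4, 1]
masked_mean = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
naive_mean = embedded.mean(dim=1)             # divides by seq_len, padding included

print(masked_mean)  # rows: [4., 5.] and [3., 6.]
print(naive_mean)   # rows: [4., 5.] and [1.5, 3.]
```

The first sequence has no padding, so both means agree. In the second, the naive mean divides by 4 instead of 2 and halves the result.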

### Common Pitfalls
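- **Forgetting `.clamp(min=1)`** causes division by zero (NaN outputs) when a sequence consists entirely of padding
- **Wrong `padding_idx`** in the embedding layer: PyTorch never updates the `padding_idx` row, so pointing it at a real token's ID freezes that token's embedding while the actual padding row gets trained
- **Incorrect tensor shapes** cause runtime errors, for example multiplying `[batch, seq, embed]` by a mask that is missing its `unsqueeze(-1)`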
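
The first two pitfalls can be reproduced on a toy batch. This sketch assumes a tiny vocabulary of 10 tokens and 4-dimensional embeddings purely for illustration; it shows the NaN produced without `.clamp(min=1)` and the zero gradient that `padding_idx` gives the padding row.

```python
import torch
import torch.nn as nn

pad_id = 0
torch.manual_seed(0)
embedding = nn.Embedding(10, 4, padding_idx=pad_id)  # toy sizes, assumed

x = torch.tensor([[3, 5, 0],
                  [0, 0, 0]])              # second row is all padding
embedded = embedding(x)                    # [2, 3, 4]
mask = (x != pad_id).unsqueeze(-1).float()
summed = (embedded * mask).sum(dim=1)
count = mask.sum(dim=1)                    # [[2.], [0.]]

unsafe = summed / count                    # 0 / 0 gives NaN in the all-padding row
safe = summed / count.clamp(min=1)         # the all-padding row pools to zeros

# Backpropagate through the raw embeddings (no mask) to isolate padding_idx:
embedded.sum().backward()
grad = embedding.weight.grad
print(unsafe[1], safe[1])
print(grad[pad_id], grad[3])               # padding row: zeros; token 3: ones
```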


## 🔧 TODO #1: Define Model Class

**Task:** Create the neural network class with embedding layer and linear layers.

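**Hint:** Use `class SimpleNNWithEmbedding(nn.Module):` with `__init__(self, vocab_size, output_size, embed_size=50, hidden_size=100, pad_id=0)`. Parameters without defaults must come before those with defaults, and `pad_id` must match the ID of `<PAD>` in your vocabulary.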

**Expected Class Structure:**
```python
class SimpleNNWithEmbedding(nn.Module):
    def __init__(self, vocab_size, output_size, embed_size=50, hidden_size=100, pad_id=0):
        super().__init__()
        # TODO: Define layers here
        
    def forward(self, x):
        # TODO: Implement forward pass
        return logits
```


In [4]:
# TODO #1: Define model class
import torch
import torch.nn as nn

# Your code here
class SimpleNN(nn.Module):
    def __init__(self, vocab_size, output_size, embed_size=50, hidden_size=100, pad_id=0):
        super(SimpleNN, self).__init__()
        self.pad_id = pad_id  # stored so forward() can build the padding mask
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_id)
        self.fc1 = nn.Linear(embed_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x holds token IDs: [16, 128]. Keep it intact, the mask is built from it.
        embedded = self.embedding(x)  # [16, 128, 50]
        mask = (x != self.pad_id).unsqueeze(-1).float()  # [16, 128, 1]
        masked_embedded = embedded * mask  # [16, 128, 50]
        pooled = masked_embedded.sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # [16, 50]

        hidden = self.fc1(pooled)  # [16, 100]
        hidden = self.relu(hidden)
        logits = self.fc2(hidden)  # [16, output_size]
        return logits


## 🔧 TODO #2: Implement Forward Pass

**Task:** Complete the forward method with embedding lookup, masking, mean pooling, and classification.

**Shape Hints:**
- `embedded = self.embedding(x)` → `[batch, seq, embed]`
- `mask = (x != self.pad_id).unsqueeze(-1).float()` → `[batch, seq, 1]`
- `masked_sum = (embedded * mask).sum(dim=1)` → `[batch, embed]`
- `count = mask.sum(dim=1).clamp(min=1)`
- `pooled = masked_sum / count` → `[batch, embed]`
- Pass through Linear1+ReLU, then Linear2 → `[batch, n_aspects]`

**Critical:** Don't forget `.clamp(min=1)` to prevent division by zero!


In [None]:
# TODO #2: Implement forward pass
# Add your forward method implementation to the class above
import torch
import torch.nn as nn

# Your code here
class SimpleNN(nn.Module):
    def __init__(self, vocab_size, output_size, embed_size=50, hidden_size=100, pad_id=0):
        super(SimpleNN, self).__init__()
        self.pad_id = pad_id  # stored so forward() can build the padding mask
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_id)
        self.fc1 = nn.Linear(embed_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x holds token IDs: [16, 128]. Keep it intact, the mask is built from it.
        embedded = self.embedding(x)  # [16, 128, 50]
        mask = (x != self.pad_id).unsqueeze(-1).float()  # [16, 128, 1]
        masked_embedded = embedded * mask  # [16, 128, 50]
        pooled = masked_embedded.sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # [16, 50]

        hidden = self.fc1(pooled)  # [16, 100]
        hidden = self.relu(hidden)
        logits = self.fc2(hidden)  # [16, output_size]
        return logits


## 🔧 TODO #3: Instantiate Model and Training Components

**Task:** Create model instance, loss function, and optimizer.

**Hint:** 
- `model = SimpleNNWithEmbedding(vocab_size=len(vocab), output_size=n_aspects)`
- `criterion = nn.CrossEntropyLoss()`
- `optimizer = torch.optim.Adam(model.parameters(), lr=0.005)`

**Expected Variables:**
- `model` → Instantiated neural network
- `criterion` → Cross-entropy loss function
- `optimizer` → Adam optimizer with learning rate 0.005

**Test:** Try `model(X_batch)` to verify forward pass returns shape `[16, n_aspects]`
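
A sanity check along these lines might look as follows. This is a sketch under stated assumptions: `MeanPoolClassifier` is a hypothetical stand-in for the model class defined above, the batch is random token IDs rather than real reviews, and `vocab_size = 1002` and `n_aspects = 3` are assumed sizes.

```python
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):  # hypothetical stand-in for the notebook's model
    def __init__(self, vocab_size, output_size, embed_size=50, hidden_size=100, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_id)
        self.net = nn.Sequential(
            nn.Linear(embed_size, hidden_size), nn.ReLU(), nn.Linear(hidden_size, output_size)
        )

    def forward(self, x):
        mask = (x != self.pad_id).unsqueeze(-1).float()
        pooled = (self.embedding(x) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.net(pooled)

torch.manual_seed(0)
vocab_size, n_aspects = 1002, 3                      # assumed sizes
X_batch = torch.randint(0, vocab_size, (16, 128))    # random IDs standing in for a real batch
y_batch = torch.randint(0, n_aspects, (16,))

model = MeanPoolClassifier(vocab_size=vocab_size, output_size=n_aspects)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

logits = model(X_batch)                # expected shape: [16, n_aspects]
loss = criterion(logits, y_batch)      # CrossEntropyLoss takes raw logits and class indices

# One optimisation step, to confirm gradients flow end to end.
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(tuple(logits.shape), loss.item())
```

Note that `nn.CrossEntropyLoss` applies log-softmax internally, so the model should return raw logits with no softmax layer at the end.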


In [5]:
# Comes from the previous notebooks
import pandas as pd
import re
from collections import Counter
from torch.utils.data import DataLoader, TensorDataset

def pad_or_truncate(sequence, max_len, padding_value=0):
    """
    Pads or truncates a sequence to a specified target length.

    Parameters:
    sequence (list): The input sequence to be padded or truncated.
    max_len (int): The desired length of the output sequence.
    padding_value (any): The value to use for padding if the sequence is shorter than the target length.

    Returns:
    list: The padded or truncated sequence.
    """
    if len(sequence) < max_len:
        # Pad the sequence
        return sequence + [padding_value] * (max_len - len(sequence))
    else:
        # Truncate the sequence
        return sequence[:max_len]

# Test the function
sample_sequence = [1, 2, 3]
padded_sequence = pad_or_truncate(sample_sequence, 5, padding_value=0)
truncated_sequence = pad_or_truncate(sample_sequence, 2)

def tokenize(text):
    # Use regex to find words, ignoring punctuation
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

train_reviews_df = pd.read_csv('../data/imdb_movie_reviews_train.csv')
test_reviews_df = pd.read_csv('../data/imdb_movie_reviews_test.csv')

tokenized_corpus_train = train_reviews_df['review'].apply(tokenize).tolist() 
tokenized_corpus_test = test_reviews_df['review'].apply(tokenize).tolist()

combined_corpus = [token for sublist in tokenized_corpus_train + tokenized_corpus_test for token in sublist]

word_freqs = Counter(combined_corpus)
print(word_freqs.most_common(10))

max_vocab_size = 1002
most_common_words = word_freqs.most_common(max_vocab_size - 2)  # Reserve 2 for <PAD> and <UNK>
vocab = {'<PAD>': 0, '<UNK>': 1, **{word: idx + 2 for idx, (word, _) in enumerate(most_common_words)}}
print(list(vocab.items())[:10])  # Print first 10 items in vocabulary dictionary
vocab_size = len(vocab)

def encode_text(text, vocab):
    tokens = tokenize(text)
    encoded = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    return encoded

encoded_reviews_train = train_reviews_df['review'].apply(lambda x: encode_text(x, vocab)).tolist()
encoded_reviews_test = test_reviews_df['review'].apply(lambda x: encode_text(x, vocab)).tolist()

max_sequence_length = 128
padded_encoded_reviews_train = [pad_or_truncate(seq, max_sequence_length, padding_value=vocab['<PAD>']) for seq in encoded_reviews_train]
padded_encoded_reviews_test = [pad_or_truncate(seq, max_sequence_length, padding_value=vocab['<PAD>']) for seq in encoded_reviews_test]
X_tensor_train = torch.tensor(padded_encoded_reviews_train, dtype=torch.long)
X_tensor_test = torch.tensor(padded_encoded_reviews_test, dtype=torch.long)
y_tensor_train = torch.tensor(train_reviews_df['aspect_encoded'].values, dtype=torch.long)
y_tensor_test = torch.tensor(test_reviews_df['aspect_encoded'].values, dtype=torch.long)

batch_size = 16
train_dataset = TensorDataset(X_tensor_train, y_tensor_train)
test_dataset = TensorDataset(X_tensor_test, y_tensor_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

[('the', 1009), ('a', 423), ('of', 405), ('and', 399), ('to', 295), ('is', 295), ('in', 235), ('it', 191), ('s', 147), ('that', 140)]
[('<PAD>', 0), ('<UNK>', 1), ('the', 2), ('a', 3), ('of', 4), ('and', 5), ('to', 6), ('is', 7), ('in', 8), ('it', 9)]


In [7]:
# TODO #3: Instantiate model and training components
# Your code here
n_aspects = 3

model = SimpleNN(vocab_size=len(vocab), output_size=n_aspects)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Why mask padding tokens in mean pooling?** What would happen if you included padding in the average?

2. **What does the embedding layer actually learn?** How does it differ from one-hot encoding?

3. **Why use mean pooling instead of just taking the last token?** What information would you lose?

4. **How does the embedding dimension (50) affect model capacity?** What's the tradeoff?

### 🎯 Architecture Design
- Why is this architecture well-suited for text classification?
- How does mean pooling handle variable-length sequences?
- What would happen if you used max pooling instead of mean pooling?

---

## 📝 My Reflections

### 🤔 Understanding Check Answers

1. **Why mask padding tokens in mean pooling?**
   - Masking prevents padding tokens from contributing to the average
   - Without masking, the sum would be divided by the full sequence length (128) rather than the number of real tokens, so short reviews would be pulled toward the zero vector
   - Padding tokens (ID 0) carry no meaning and are excluded from both the sum and the count, so the pooled vector reflects only the actual words

2. **What does the embedding layer actually learn?**
   - The embedding layer learns a dense vector for each word in the vocabulary, adjusted during training so that the vectors are useful for the classification task
   - Words that play similar roles for the task tend to end up close together in the 50-dimensional space
   - Relationships between words become geometric: distances and directions between vectors carry meaning
   - Each token is a single point in that space, described by 50 learned coordinates; a one-hot vector, by contrast, is sparse, has `vocab_size` dimensions, and places every pair of words equally far apart

3. **Why use mean pooling instead of just taking the last token?**
   - **Mean pooling**: Preserves information from all words, creating a summary of the entire review
   - **Last token only**: Would lose information from all other words, because an embedding lookup carries no context from the rest of the sequence
   - With right-padding, the last position of a short review is usually `<PAD>`, so it may carry no information at all
   - Mean pooling handles variable-length sequences by averaging over the real (non-padding) tokens

4. **How does the embedding dimension (50) affect model capacity?**
   - A larger embedding dimension does not necessarily mean more accuracy; it is a trade-off between capacity and cost
   - The embedding table has `vocab_size × embed_dim` parameters, so higher dimensions mean more parameters to train and more computation per gradient step
   - With a small dataset, a large embedding dimension also makes overfitting more likely
   - 50 dimensions is a reasonable starting point for this small vocabulary; the right value depends on the task and the amount of data
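
To put rough numbers on this trade-off, the following sketch counts trainable parameters for a few embedding dimensions. The sizes (`vocab_size = 1002`, `hidden_size = 100`, `n_aspects = 3`) are assumptions matching this notebook's setup, and `count_params` is a hypothetical helper.

```python
import torch.nn as nn

vocab_size, hidden_size, n_aspects = 1002, 100, 3   # assumed sizes

def count_params(embed_dim):
    """Total trainable parameters of embedding + two linear layers."""
    layers = (
        nn.Embedding(vocab_size, embed_dim, padding_idx=0),
        nn.Linear(embed_dim, hidden_size),
        nn.Linear(hidden_size, n_aspects),
    )
    return sum(p.numel() for layer in layers for p in layer.parameters())

param_counts = {d: count_params(d) for d in (10, 50, 300)}
print(param_counts)  # {10: 11423, 50: 55503, 300: 331003}
```

Under these assumptions the embedding table dominates: at 50 dimensions it accounts for 50,100 of the 55,503 parameters.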

### 🎯 Architecture Design Analysis

**Why this architecture is well-suited for text classification:**
- Learns word embeddings, pools each review into a fixed-size vector, and outputs one logit per aspect class
- The flow: Word IDs → Embeddings → Masking → Pooling → Classification
- Each step serves a specific purpose in converting text to predictions

**How mean pooling handles variable-length sequences:**
- Mean pooling aggregates the entire sequence into a fixed-size vector `[batch, embed_dim]`, whatever the number of real tokens
- The mask ensures the average is taken over real tokens only, so reviews of different lengths are comparable
- Every word contributes equally to the result, rather than only the largest values
- Creates a "summary" representation of the entire review

**What would happen with max pooling instead:**
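- Max pooling takes the largest value in each embedding dimension across the sequence, so each of the 50 output values may come from a different token
- It highlights the strongest feature activations but discards how often or how consistently a feature appears
- Padding would have to be masked with a large negative value rather than zero, so that a padding position can never win the max
- The result is a less complete representation of the review as a whole than the mean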
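
A small numeric sketch of how max pooling behaves next to masked mean pooling. The embeddings are made-up 2-dimensional values; the point is that the maximum is taken separately for each dimension, and that padding has to be filled with `-inf` rather than zero.

```python
import torch

pad_id = 0
x = torch.tensor([[4, 6, 8, 0]])                  # one review; last position is padding
embedded = torch.tensor([[[ 0.9, -0.5],
                          [-0.2, -0.1],
                          [ 0.1, -0.8],
                          [ 0.0,  0.0]]])          # [1, 4, 2], made-up values
mask = (x != pad_id).unsqueeze(-1)                 # [1, 4, 1], bool
mask_f = mask.float()

mean_pooled = (embedded * mask_f).sum(dim=1) / mask_f.sum(dim=1).clamp(min=1)

# Fill padding with -inf so a zero padding vector can never win the max.
max_pooled, argmax = embedded.masked_fill(~mask, float("-inf")).max(dim=1)
naive_max = embedded.max(dim=1).values             # padding included

print(mean_pooled)  # approx [[0.2667, -0.4667]]
print(max_pooled)   # [[0.9, -0.1]]
print(argmax)       # [[0, 1]]: each dimension picks a different token
print(naive_max)    # [[0.9, 0.0]]: the 0.0 comes from the padding position
```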

**Key insight:** This architecture elegantly handles the fundamental challenge of converting variable-length text sequences into fixed-size representations that neural networks can process, while preserving semantic meaning through learned embeddings and intelligent pooling strategies.
