# Topic 3: Building Neural Networks with nn.Module

## Learning Objectives

By the end of this notebook, you will:
- Understand WHY `nn.Module` is the foundation of PyTorch models
- Build neural networks using `nn.Module` and pre-built layers
- Understand the difference between layers, models, and modules
- Use activation functions and understand their purpose
- Inspect model architecture and parameters
- Build custom layers and models
- Understand forward pass vs backward pass

---

## 1. The Big Picture: Why nn.Module?

### From Manual to Automatic

In Topic 2, we implemented linear regression manually:
```python
w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
y_pred = w * x + b
```

**Problems with this approach**:
1. Tedious: Manually create every parameter
2. Error-prone: Easy to forget `requires_grad=True`
3. Not modular: Hard to reuse and compose
4. No structure: Can't easily inspect or save models

### The nn.Module Solution

`nn.Module` is PyTorch's **base class for all neural network components**. It provides:

1. **Automatic parameter management**: Tracks all parameters automatically
2. **Modular design**: Build complex models from simple building blocks
3. **State management**: Easy to save/load models
4. **GPU support**: Move entire model to GPU with one line
5. **Training/eval modes**: Switch behavior (dropout, batchnorm, etc.)
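
To make points 1 and 3 concrete, here is a minimal sketch. The class name `TwoLayer` and the layer sizes are made up for illustration, and the weights are saved to an in-memory buffer rather than a file so the example leaves nothing on disk.

```python
import io

import torch
import torch.nn as nn


class TwoLayer(nn.Module):  # hypothetical example model
    def __init__(self):
        super().__init__()
        # Assigning layers as attributes registers their parameters
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TwoLayer()

# Automatic parameter management: nothing was registered by hand
names = [name for name, _ in model.named_parameters()]
print(names)

# State management: save the weights, then load them into a fresh model
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

restored = TwoLayer()
restored.load_state_dict(torch.load(buffer))

x = torch.randn(3, 4)
same_output = torch.allclose(model(x), restored(x))
print(f"Restored model gives the same output: {same_output}")
```

In practice you would usually pass a file path to `torch.save` and `torch.load` instead of a buffer.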

### The Hierarchy

```
nn.Module (base class)
    ├── Layers (nn.Linear, nn.Conv2d, etc.)
    ├── Activation Functions (nn.ReLU, nn.Sigmoid, etc.)
    ├── Loss Functions (nn.MSELoss, nn.CrossEntropyLoss, etc.)
    └── Your Custom Models (inherit from nn.Module)
```

**Key insight**: Everything is a module! This makes PyTorch incredibly composable.
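
A quick way to check this claim yourself is with `isinstance`. The particular components below are just a sample picked for illustration:

```python
import torch.nn as nn

# One example from each branch of the hierarchy above
components = [
    nn.Linear(4, 2),                     # layer
    nn.ReLU(),                           # activation function
    nn.MSELoss(),                        # loss function
    nn.Sequential(nn.Linear(4, 2)),      # container holding other modules
]

for component in components:
    print(f"{type(component).__name__:12s} is an nn.Module: {isinstance(component, nn.Module)}")
```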

In [None]:
# Setup
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")

# Set random seed
torch.manual_seed(42)
np.random.seed(42)

---

## 2. Your First Module: nn.Linear

### What Is nn.Linear?

`nn.Linear` implements: $y = xW^T + b$

Where:
- $x$: input `(batch_size, in_features)`
- $W$: weight matrix `(out_features, in_features)` 
- $b$: bias vector `(out_features,)`
- $y$: output `(batch_size, out_features)`

**Why "Linear"?** It applies a linear map plus a bias (strictly speaking, an affine transformation) with no activation function.

**Why this is used**: In PyTorch, a fully connected (dense) layer is an `nn.Linear`.

In [None]:
# Create a linear layer
in_features = 10
out_features = 5

linear = nn.Linear(in_features, out_features)

print(f"Linear layer: {linear}")
print()

# Inspect parameters
print("Parameters:")
print(f"Weight shape: {linear.weight.shape}")  # (out_features, in_features)
print(f"Bias shape: {linear.bias.shape}")      # (out_features,)
print()

# All parameters have requires_grad=True automatically!
print(f"Weight requires_grad: {linear.weight.requires_grad}")
print(f"Bias requires_grad: {linear.bias.requires_grad}")
print()

# Use the layer (forward pass)
batch_size = 3
x = torch.randn(batch_size, in_features)
y = linear(x)  # Calls linear.forward(x) internally

print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")
print(f"Output:\n{y}")

### Key Observations

1. **Automatic initialization**: By default, weights and biases are drawn uniformly from $[-1/\sqrt{n}, 1/\sqrt{n}]$, where $n$ is `in_features`
2. **requires_grad=True**: Set automatically for all parameters
3. **Callable**: Use `layer(x)` instead of manually computing `x @ W.T + b`
4. **Shape handling**: The layer acts on the last dimension, so any leading batch dimensions pass through unchanged
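
Observations 1 and 4 are easy to verify. The sketch below assumes the default initialization of current PyTorch releases (bounds of $\pm 1/\sqrt{\text{in\_features}}$); the tensor shapes are arbitrary choices for illustration.

```python
import math

import torch
import torch.nn as nn

layer = nn.Linear(10, 5)

# Default init: values should lie within +/- 1/sqrt(in_features)
bound = 1 / math.sqrt(layer.in_features)
tol = 1e-6  # small slack for float32 rounding
weights_in_range = layer.weight.abs().max().item() <= bound + tol
bias_in_range = layer.bias.abs().max().item() <= bound + tol
print(f"Bound: {bound:.4f}")
print(f"Weights within bound: {weights_in_range}")
print(f"Bias within bound: {bias_in_range}")

# Shape handling: only the last dimension has to match in_features
x = torch.randn(2, 7, 10)   # two leading dimensions instead of one
y = layer(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")
```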

In [None]:
# Compare with manual implementation
x = torch.randn(3, 10)
linear = nn.Linear(10, 5)

# Using nn.Linear
y_auto = linear(x)

# Manual implementation
y_manual = x @ linear.weight.T + linear.bias

print("Using nn.Linear:")
print(y_auto)
print()

print("Manual implementation:")
print(y_manual)
print()

print(f"Are they equal? {torch.allclose(y_auto, y_manual)}")
print("\nConclusion: nn.Linear is just a convenient wrapper!")

---

## 3. Activation Functions: Adding Non-Linearity

### Why Activation Functions?

**Problem**: Stacking linear layers without activations is still linear!

$$\text{Linear}(\text{Linear}(x)) = W_2(W_1x + b_1) + b_2 = (W_2W_1)x + (W_2b_1 + b_2) = Wx + b$$

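With $W = W_2W_1$ and $b = W_2b_1 + b_2$, this is just another linear transformation (written here for a column-vector input). Extra layers add no expressive power, so the network can't learn non-linear patterns.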
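
You can confirm the algebra numerically. This is a minimal sketch with arbitrary layer sizes: it builds the merged weight and bias from two `nn.Linear` layers and checks that one layer reproduces the stacked pair.

```python
import torch
import torch.nn as nn

lin1 = nn.Linear(4, 8)
lin2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

with torch.no_grad():
    # Two linear layers stacked with no activation in between
    stacked = lin2(lin1(x))

    # Merged parameters: W = W2 W1, b = W2 b1 + b2
    W = lin2.weight @ lin1.weight
    b = lin2.weight @ lin1.bias + lin2.bias
    collapsed = x @ W.T + b

matches = torch.allclose(stacked, collapsed, atol=1e-5)
print(f"Merged weight shape: {W.shape}")
print(f"Stacked layers equal one merged layer: {matches}")
```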

**Solution**: Add **non-linear** activation functions between layers.

### Common Activation Functions

1. **ReLU** (Rectified Linear Unit): $f(x) = \max(0, x)$
   - Most popular in deep learning
   - Fast to compute
   - Helps mitigate the vanishing gradient problem (its gradient is 1 for positive inputs)

2. **Sigmoid**: $f(x) = \frac{1}{1 + e^{-x}}$
   - Outputs between 0 and 1
   - Used for binary classification output
   - Can cause vanishing gradients

3. **Tanh**: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
   - Outputs between -1 and 1
   - Zero-centered (better than sigmoid)
   - Still has vanishing gradient issues

4. **LeakyReLU**: $f(x) = \max(\alpha x, x)$ with a small slope $\alpha$ (PyTorch's default is 0.01)
   - Mitigates the "dying ReLU" problem
   - Keeps a small, non-zero gradient for negative inputs
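
As a sanity check, the formulas above can be written out by hand and compared against PyTorch's built-ins. The sample inputs are arbitrary, and the LeakyReLU comparison assumes the default slope of 0.01.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Hand-written versions of the four formulas
relu_manual = torch.clamp(x, min=0)
sigmoid_manual = 1 / (1 + torch.exp(-x))
tanh_manual = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))
leaky_manual = torch.maximum(0.01 * x, x)

# Compare each against the built-in implementation
checks = {
    "ReLU": torch.allclose(relu_manual, F.relu(x)),
    "Sigmoid": torch.allclose(sigmoid_manual, torch.sigmoid(x), atol=1e-6),
    "Tanh": torch.allclose(tanh_manual, torch.tanh(x), atol=1e-6),
    "LeakyReLU": torch.allclose(leaky_manual, F.leaky_relu(x), atol=1e-6),
}
for name, ok in checks.items():
    print(f"{name:10s} matches built-in: {ok}")
```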

In [None]:
# Visualize activation functions
x = torch.linspace(-5, 5, 200)

# Compute activations
relu = F.relu(x)
sigmoid = torch.sigmoid(x)
tanh = torch.tanh(x)
leaky_relu = F.leaky_relu(x, negative_slope=0.1)  # 0.1 rather than the 0.01 default, so the slope is visible in the plot

# Plot
plt.figure(figsize=(14, 10))

# ReLU
plt.subplot(2, 2, 1)
plt.plot(x.numpy(), relu.numpy(), 'b-', linewidth=2)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.title('ReLU: max(0, x)', fontsize=14)
plt.xlabel('x', fontsize=12)
plt.ylabel('ReLU(x)', fontsize=12)
plt.grid(True, alpha=0.3)

# Sigmoid
plt.subplot(2, 2, 2)
plt.plot(x.numpy(), sigmoid.numpy(), 'r-', linewidth=2)
plt.axhline(y=0.5, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.title('Sigmoid: 1/(1 + e^(-x))', fontsize=14)
plt.xlabel('x', fontsize=12)
plt.ylabel('Sigmoid(x)', fontsize=12)
plt.grid(True, alpha=0.3)

# Tanh
plt.subplot(2, 2, 3)
plt.plot(x.numpy(), tanh.numpy(), 'g-', linewidth=2)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.title('Tanh: (e^x - e^(-x))/(e^x + e^(-x))', fontsize=14)
plt.xlabel('x', fontsize=12)
plt.ylabel('Tanh(x)', fontsize=12)
plt.grid(True, alpha=0.3)

# LeakyReLU
plt.subplot(2, 2, 4)
plt.plot(x.numpy(), leaky_relu.numpy(), 'm-', linewidth=2)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.title('LeakyReLU: max(0.1x, x)', fontsize=14)
plt.xlabel('x', fontsize=12)
plt.ylabel('LeakyReLU(x)', fontsize=12)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key properties:")
print("ReLU: Simple, effective, most popular")
print("Sigmoid: Outputs in (0, 1), good for probabilities")
print("Tanh: Outputs in (-1, 1), zero-centered")
print("LeakyReLU: Prevents dying ReLU problem")

### Two Ways to Use Activations

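1. **Functional API** (`torch.nn.functional`): Plain functions that you call directly on tensors
2. **Module API** (`torch.nn`): Objects that you create once and then call like a layer

**When to use which?**
- Functional: When you just need to apply the function, e.g. inside `forward`
- Module: When building models, since the activation then appears in `print(model)` and can be placed inside `nn.Sequential`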
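
The two styles can also be mixed inside one model. The sketch below is one possible arrangement, not a rule: the class name `TinyNet` and its sizes are made up, the activation uses the functional API, and dropout stays a module so that it follows the model's train/eval mode.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNet(nn.Module):  # hypothetical example model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)
        self.drop = nn.Dropout(p=0.5)  # module: respects train()/eval()

    def forward(self, x):
        x = F.relu(self.fc(x))         # functional: no object to store
        return self.drop(x)


net = TinyNet()

# Only module-style components are registered as children
child_names = [name for name, _ in net.named_children()]
print(f"Registered children: {child_names}")

# In eval mode dropout is a no-op, so repeated calls agree
net.eval()
x = torch.randn(2, 4)
out_a = net(x)
out_b = net(x)
print(f"Eval outputs identical: {torch.equal(out_a, out_b)}")
```

Note that the functional ReLU does not show up in `child_names`, which is the trade-off mentioned above.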

In [None]:
# Functional API
x = torch.tensor([-1.0, 0.0, 1.0])
y_func = F.relu(x)
print(f"Functional ReLU: {y_func}")

# Module API
relu = nn.ReLU()
y_module = relu(x)
print(f"Module ReLU: {y_module}")

print(f"\nAre they equal? {torch.equal(y_func, y_module)}")
print("\nUse modules when building models for consistency.")

---

## 4. Building Your First Neural Network

### The Recipe

1. **Inherit from `nn.Module`**
2. **Define layers in `__init__`**: Create all your layers
3. **Implement `forward`**: Define how data flows through the network
4. **That's it!** PyTorch handles everything else

### Example: 3-Layer Fully Connected Network

```
Input (10) → Linear(10→64) → ReLU → Linear(64→32) → ReLU → Linear(32→5) → Output (5)
```

In [None]:
class SimpleNet(nn.Module):
    """Simple 3-layer neural network"""
    
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        # MUST call parent constructor
        super().__init__()
        
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size1)
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.fc3 = nn.Linear(hidden_size2, output_size)
        
        # Define activations
        self.relu = nn.ReLU()
    
    def forward(self, x):
        """Forward pass: define computation"""
        # Layer 1
        x = self.fc1(x)      # Linear transformation
        x = self.relu(x)     # Non-linearity
        
        # Layer 2
        x = self.fc2(x)
        x = self.relu(x)
        
        # Layer 3 (output layer - no activation)
        x = self.fc3(x)
        
        return x

# Create model
model = SimpleNet(input_size=10, hidden_size1=64, hidden_size2=32, output_size=5)
print(model)
print()

# Test forward pass
x = torch.randn(3, 10)  # Batch of 3 samples
output = model(x)       # Calls model.forward(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output:\n{output}")

### What Just Happened?

1. **Defined architecture in `__init__`**: Specified all layers
2. **Defined computation in `forward`**: How data flows
3. **Automatic parameter tracking**: All parameters are tracked automatically
4. **Forward pass**: Calling `model(x)` executes `forward(x)`
5. **Backward pass**: Will happen automatically when we call `loss.backward()`
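
To preview point 5, here is a minimal sketch of one forward and backward pass. A single `nn.Linear` stands in for the model, and the random targets and the choice of `nn.MSELoss` are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 5)          # stand-in for any nn.Module
loss_fn = nn.MSELoss()

x = torch.randn(3, 10)
target = torch.randn(3, 5)      # made-up targets

# Before any backward pass, no gradients have been stored
grad_before = net.weight.grad
print(f"Gradient before backward: {grad_before}")

# Forward pass: compute predictions and a scalar loss
output = net(x)
loss = loss_fn(output, target)

# Backward pass: autograd fills .grad for every parameter
loss.backward()

for name, param in net.named_parameters():
    print(f"{name:8s} | grad shape: {tuple(param.grad.shape)}")
```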

In [None]:
# Inspect model parameters
print("Model parameters:")
for name, param in model.named_parameters():
    print(f"{name:15s} | shape: {str(param.shape):20s} | requires_grad: {param.requires_grad}")

print()

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

---

## 5. Sequential API: The Quick Way

For simple sequential models, `nn.Sequential` is more concise.

In [None]:
# Same network using nn.Sequential
model_seq = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 5)
)

print(model_seq)
print()

# Test it
x = torch.randn(3, 10)
output = model_seq(x)
print(f"Output shape: {output.shape}")

### When to Use Sequential vs Custom Module?

**Use `nn.Sequential` when**:
- Simple linear flow (output of layer N → input of layer N+1)
- No branching or skip connections
- Quick prototyping

**Use custom `nn.Module` when**:
- Complex architectures (ResNet, Transformers)
- Need control flow (if/else, loops)
- Skip connections or multiple inputs/outputs
- Want readable, documented code

In [None]:
# Named Sequential for better readability
from collections import OrderedDict

model_named = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(10, 64)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(64, 32)),
    ('relu2', nn.ReLU()),
    ('fc3', nn.Linear(32, 5))
]))

print(model_named)
print()

# Access layers by name
print(f"First layer: {model_named.fc1}")
print(f"First layer weight shape: {model_named.fc1.weight.shape}")

---

## 6. Model Inspection and Utilities

PyTorch provides many utilities to inspect and manipulate models.
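
One manipulation that comes up often is freezing part of a model so that only some parameters train. As a minimal sketch (the small `nn.Sequential` network is a stand-in chosen for illustration), turning off `requires_grad` on the first layer changes the trainable count but not the total:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(10, 64),   # 10*64 + 64 = 704 parameters
    nn.ReLU(),
    nn.Linear(64, 5),    # 64*5 + 5 = 325 parameters
)

# Freeze the first layer
for param in net[0].parameters():
    param.requires_grad_(False)

total = sum(p.numel() for p in net.parameters())
trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)

print(f"Total parameters: {total:,}")
print(f"Trainable parameters: {trainable:,}")
```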

In [None]:
# Model summary
model = SimpleNet(10, 64, 32, 5)

print("Model architecture:")
print(model)
print()

# Iterate over modules
print("All modules:")
for i, module in enumerate(model.modules()):
    print(f"{i}: {module.__class__.__name__}")
print()

# Iterate over children (direct children only)
print("Direct children:")
for i, module in enumerate(model.children()):
    print(f"{i}: {module}")

In [None]:
# Get specific parameters
print("First layer weights:")
print(f"Shape: {model.fc1.weight.shape}")
print(f"First 3 rows:\n{model.fc1.weight[:3]}")
print()

# Get all parameters as list
params = list(model.parameters())
print(f"Number of parameter tensors: {len(params)}")
for i, p in enumerate(params):
    print(f"Parameter {i}: shape {p.shape}")

### Training vs Evaluation Mode

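Some layers behave differently during training vs evaluation. `Dropout` zeros activations only in training mode, and `BatchNorm` normalizes with batch statistics in training mode but with stored running statistics in evaluation mode.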
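
Here is a small sketch of the BatchNorm case. The feature count and the shift and scale applied to the input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 3 + 5   # batch whose mean is far from 0

with torch.no_grad():
    # Training mode: normalize with this batch's own mean and variance
    bn.train()
    y_train = bn(x)

    # Evaluation mode: normalize with the stored running statistics
    bn.eval()
    y_eval = bn(x)

train_mean = y_train.mean(dim=0)
eval_mean = y_eval.mean(dim=0)
print(f"Per-feature mean in train mode: {train_mean}")
print(f"Per-feature mean in eval mode:  {eval_mean}")
print(f"Same output in both modes? {torch.allclose(y_train, y_eval)}")
```

In training mode the per-feature means come out close to zero. In evaluation mode they do not, because the running statistics have only seen one batch so far.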

In [None]:
# Check mode
print(f"Model training mode: {model.training}")

# Switch to evaluation mode
model.eval()
print(f"After eval(): {model.training}")

# Switch back to training mode
model.train()
print(f"After train(): {model.training}")

print("\nAlways remember:")
print("- model.train() before training")
print("- model.eval() before evaluation/inference")

### Moving Models to GPU

Move the entire model (all parameters and buffers) to the GPU with one line.

In [None]:
# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move model to device
model = model.to(device)

# Verify parameters moved
print(f"\nFirst parameter device: {next(model.parameters()).device}")

# Now inputs must also be on same device
x = torch.randn(3, 10).to(device)
output = model(x)
print(f"Output device: {output.device}")

print("\nKey point: Model and inputs must be on same device!")

---

## 7. Building Complex Architectures

Let's build more sophisticated models!

### Example 1: Model with Skip Connections (Residual)

Skip connections add a block's input to its output, which gives gradients a direct path around the block in deep networks.
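
One way to see why, as a sketch that ignores any activation applied after the addition: if a block computes $y = F(x) + x$ and $\mathcal{L}$ is the loss, the chain rule gives

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)$$

The identity term $I$ passes the upstream gradient through unchanged, so the gradient reaching earlier layers does not shrink to zero even when $\partial F / \partial x$ is small.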

In [None]:
class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x"""
    
    def __init__(self, size):
        super().__init__()
        self.fc1 = nn.Linear(size, size)
        self.fc2 = nn.Linear(size, size)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        # Save input for skip connection
        identity = x
        
        # Compute F(x)
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        
        # Add skip connection: F(x) + x
        out = out + identity
        out = self.relu(out)
        
        return out

# Test it
block = ResidualBlock(64)
x = torch.randn(2, 64)
output = block(x)

print("ResidualBlock:")
print(block)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("\nWhy residuals? Help gradients flow in very deep networks!")

### Example 2: Model with Multiple Outputs

In [None]:
class MultiOutputNet(nn.Module):
    """Network with two output heads"""
    
    def __init__(self, input_size, hidden_size, output_size1, output_size2):
        super().__init__()
        
        # Shared layers
        self.shared = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU()
        )
        
        # Output head 1 (e.g., classification)
        self.head1 = nn.Linear(hidden_size, output_size1)
        
        # Output head 2 (e.g., regression)
        self.head2 = nn.Linear(hidden_size, output_size2)
    
    def forward(self, x):
        # Shared feature extraction
        features = self.shared(x)
        
        # Two separate outputs
        output1 = self.head1(features)
        output2 = self.head2(features)
        
        return output1, output2

# Test it
model = MultiOutputNet(input_size=20, hidden_size=64, output_size1=10, output_size2=1)
x = torch.randn(3, 20)
out1, out2 = model(x)

print(f"Input shape: {x.shape}")
print(f"Output 1 shape: {out1.shape}")  # (3, 10) - classification
print(f"Output 2 shape: {out2.shape}")  # (3, 1) - regression
print("\nUse case: Multi-task learning!")

### Example 3: ModuleList and ModuleDict

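For dynamic architectures with a variable number of layers. Layers stored in a plain Python list or dict are not registered, so they are missing from `parameters()`; `nn.ModuleList` and `nn.ModuleDict` register them properly.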
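
The sketch below shows why these containers matter and what `nn.ModuleDict` looks like in use. All three class names, the head names `"classify"` and `"regress"`, and the layer sizes are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn as nn


class PlainListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: these layers are NOT registered
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]


class ModuleListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: every layer is registered
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])


class MultiHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(4, 8)
        # nn.ModuleDict: registered layers looked up by name
        self.heads = nn.ModuleDict({
            "classify": nn.Linear(8, 3),
            "regress": nn.Linear(8, 1),
        })

    def forward(self, x, head):
        features = torch.relu(self.body(x))
        return self.heads[head](features)


plain_count = len(list(PlainListNet().parameters()))
listed_count = len(list(ModuleListNet().parameters()))
print(f"Parameter tensors with a plain list: {plain_count}")
print(f"Parameter tensors with nn.ModuleList: {listed_count}")

net = MultiHeadNet()
x = torch.randn(5, 4)
out_classify = net(x, "classify")
out_regress = net(x, "regress")
print(f"Classify head output shape: {out_classify.shape}")
print(f"Regress head output shape: {out_regress.shape}")
```

With the plain list, an optimizer built from `parameters()` would have nothing to update, which is an easy bug to miss.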

In [None]:
class DynamicNet(nn.Module):
    """Network with variable number of layers"""
    
    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()
        
        # Create layers dynamically
        self.layers = nn.ModuleList()
        
        # Input layer
        self.layers.append(nn.Linear(input_size, hidden_sizes[0]))
        
        # Hidden layers
        for i in range(len(hidden_sizes) - 1):
            self.layers.append(nn.Linear(hidden_sizes[i], hidden_sizes[i+1]))
        
        # Output layer
        self.layers.append(nn.Linear(hidden_sizes[-1], output_size))
        
        self.relu = nn.ReLU()
    
    def forward(self, x):
        # Pass through all layers except last
        for layer in self.layers[:-1]:
            x = self.relu(layer(x))
        
        # Output layer (no activation)
        x = self.layers[-1](x)
        return x

# Create network with 5 hidden layers
model = DynamicNet(input_size=10, hidden_sizes=[64, 128, 128, 64, 32], output_size=5)
print(model)
print(f"\nNumber of layers: {len(model.layers)}")

# Test
x = torch.randn(2, 10)
output = model(x)
print(f"Output shape: {output.shape}")

---

## Mini Exercises

### Exercise 1: Build a Simple Classifier

Create a 2-layer neural network:
- Input: 784 features (28x28 flattened image)
- Hidden: 128 neurons with ReLU
- Output: 10 classes

Use `nn.Module` (not Sequential).

In [None]:
# Your code here


In [None]:
# Solution
class SimpleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleClassifier()
print(model)
print()

# Test with batch of 5 images
x = torch.randn(5, 784)
output = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")

### Exercise 2: Convert to Sequential

Rewrite the above network using `nn.Sequential`.

In [None]:
# Your code here


In [None]:
# Solution
model_seq = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

print(model_seq)
print()

# Test
x = torch.randn(5, 784)
output = model_seq(x)
print(f"Output shape: {output.shape}")

# Both versions should produce same shape
print("\nMuch more concise!")

### Exercise 3: Add Dropout

Modify the network to include dropout (p=0.5) after the first layer.
Test that dropout behaves differently in train vs eval mode.

In [None]:
# Your code here


In [None]:
# Solution
class ClassifierWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)  # Train mode: zeros each element with probability 0.5 and scales the rest by 2
        x = self.fc2(x)
        return x

model = ClassifierWithDropout()
x = torch.randn(3, 784)

# Training mode
model.train()
out1 = model(x)
out2 = model(x)  # Different output!
print("Training mode (dropout active):")
print(f"Output 1:\n{out1[0, :5]}")
print(f"Output 2:\n{out2[0, :5]}")
print(f"Are they equal? {torch.equal(out1, out2)}")
print()

# Eval mode
model.eval()
out1 = model(x)
out2 = model(x)  # Same output!
print("Eval mode (dropout disabled):")
print(f"Output 1:\n{out1[0, :5]}")
print(f"Output 2:\n{out2[0, :5]}")
print(f"Are they equal? {torch.equal(out1, out2)}")

---

## Comprehensive Exercise: Build a Configurable Network

Create a flexible neural network class that:
1. Takes a list of layer sizes as input
2. Supports different activation functions
3. Optionally adds dropout after each hidden layer
4. Optionally adds batch normalization

**Requirements**:
- Use `nn.ModuleList` for dynamic layers
- Support ReLU, Tanh, or LeakyReLU activations
- Clean, well-documented code

**Test case**: Create a network with layers `[100, 256, 128, 64, 10]`, ReLU activation, dropout=0.3

In [None]:
# Your code here


In [None]:
# Solution
class ConfigurableNet(nn.Module):
    """
    Flexible fully connected neural network.
    
    Args:
        layer_sizes: List of layer sizes [input, hidden1, hidden2, ..., output]
        activation: 'relu', 'tanh', or 'leaky_relu'
        dropout: Dropout probability (0 = no dropout)
        batch_norm: Whether to use batch normalization
    """
    
    def __init__(self, layer_sizes, activation='relu', dropout=0.0, batch_norm=False):
        super().__init__()
        
        self.layer_sizes = layer_sizes
        self.dropout_p = dropout
        self.use_batch_norm = batch_norm
        
        # Choose activation
        if activation == 'relu':
            self.activation = nn.ReLU()
        elif activation == 'tanh':
            self.activation = nn.Tanh()
        elif activation == 'leaky_relu':
            self.activation = nn.LeakyReLU()
        else:
            raise ValueError(f"Unknown activation: {activation}")
        
        # Build layers
        self.layers = nn.ModuleList()
        self.batch_norms = nn.ModuleList()
        self.dropouts = nn.ModuleList()
        
        for i in range(len(layer_sizes) - 1):
            is_hidden = i < len(layer_sizes) - 2
            
            # Linear layer
            self.layers.append(nn.Linear(layer_sizes[i], layer_sizes[i+1]))
            
            # Batch norm on hidden layers only; nn.Identity() is a no-op placeholder
            if batch_norm and is_hidden:
                self.batch_norms.append(nn.BatchNorm1d(layer_sizes[i+1]))
            else:
                self.batch_norms.append(nn.Identity())
            
            # Dropout on hidden layers only
            if dropout > 0 and is_hidden:
                self.dropouts.append(nn.Dropout(dropout))
            else:
                self.dropouts.append(nn.Identity())
    
    def forward(self, x):
        # Hidden layers: linear -> (batch norm) -> activation -> (dropout)
        for i in range(len(self.layers) - 1):
            x = self.layers[i](x)
            x = self.batch_norms[i](x)
            x = self.activation(x)
            x = self.dropouts[i](x)
        
        # Output layer (no activation/dropout/batchnorm)
        x = self.layers[-1](x)
        return x
    
    def __repr__(self):
        return (f"ConfigurableNet(layers={self.layer_sizes}, "
                f"dropout={self.dropout_p}, batch_norm={self.use_batch_norm})")


# Test case 1: Basic network
model1 = ConfigurableNet([100, 256, 128, 64, 10], activation='relu', dropout=0.3)
print("Model 1:")
print(model1)
print()

x = torch.randn(5, 100)
output = model1(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print()

# Test case 2: With batch norm
model2 = ConfigurableNet([50, 128, 64, 10], activation='leaky_relu', 
                         dropout=0.5, batch_norm=True)
print("Model 2:")
print(model2)
print()

# Count parameters
total = sum(p.numel() for p in model1.parameters())
print(f"Model 1 parameters: {total:,}")

total = sum(p.numel() for p in model2.parameters())
print(f"Model 2 parameters: {total:,}")

---

## Key Takeaways

1. **nn.Module is the foundation**: Base class for all neural network components
2. **Two parts**: `__init__` (define layers) and `forward` (define computation)
3. **Automatic parameter tracking**: Layers and parameters assigned as attributes are registered automatically
4. **Activation functions**: Add non-linearity, essential for learning complex patterns
5. **Sequential vs Module**: Sequential for simple chains, Module for flexibility
6. **train() vs eval()**: Call `model.train()` before training and `model.eval()` before evaluation or inference
7. **Composability**: Build complex models from simple modules
8. **ModuleList/ModuleDict**: For dynamic architectures

### Building Blocks Learned

- `nn.Linear`: Fully connected layer
- `nn.ReLU`, `nn.Sigmoid`, `nn.Tanh`: Activation functions
- `nn.Dropout`: Regularization
- `nn.BatchNorm1d`: Normalization
- `nn.Sequential`: Container for linear stacks
- `nn.ModuleList`: Container for dynamic layers

---

## Next Steps

You can now build neural network architectures! Next, we'll learn about **loss functions** - how to measure and optimize model performance.

Continue to: [Topic 4: Loss Functions - A Comprehensive Guide](04_loss_functions.ipynb)

---

## Further Reading

- [PyTorch nn.Module Documentation](https://pytorch.org/docs/stable/generated/torch.nn.Module.html)
- [PyTorch Neural Network Tutorial](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html)
- [CS231n: Neural Networks](http://cs231n.github.io/neural-networks-1/)
- [Activation Functions Explained](https://mlfromscratch.com/activation-functions-explained/)