# Training a Neural Network

## Overview
This notebook covers the essential components of training a neural network in PyTorch, including data loading, loss functions, and the complete training loop.

---

## 1. Data Preparation

### Why Data Preparation Matters
Before training a neural network, you need to organize your data into a format that PyTorch can efficiently process. This involves:
- Converting data to tensors
- Creating datasets
- Batching data for efficient training
- Shuffling data to prevent learning order-dependent patterns

### Key Components:
- **TensorDataset**: Wraps tensors to create a dataset
- **DataLoader**: Handles batching, shuffling, and iteration over data

---

## 2. Loss Functions

### What is a Loss Function?
A loss function (also called cost function or objective function) measures how far the model's predictions are from the actual target values. The goal of training is to **minimize this loss**.

### Common Loss Functions:
- **Mean Squared Error (MSE)**: Used for regression tasks
  - Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
  - Measures average squared difference between predictions and targets
  
- **Cross Entropy Loss**: Used for classification tasks
  - Measures difference between predicted and true probability distributions
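
For the classification case, a minimal sketch of `nn.CrossEntropyLoss` is shown here. The logits and labels are made-up values chosen only for illustration. The loss expects raw scores (logits) and integer class indices, and it applies log-softmax internally, so the manual version should give the same number.

```python
import torch
import torch.nn as nn

# Made-up logits for 2 samples and 3 classes (illustrative values only)
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0]])
# True class index for each sample
targets = torch.tensor([0, 2])

criterion = nn.CrossEntropyLoss()
ce_builtin = criterion(logits, targets)

# Same quantity by hand: mean negative log-probability of the true class
log_probs = torch.log_softmax(logits, dim=1)
ce_manual = -log_probs[torch.arange(2), targets].mean()

print("Cross entropy (built-in):", ce_builtin.item())
print("Cross entropy (manual):  ", ce_manual.item())
```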

### Why MSE for Regression?
- Penalizes larger errors more heavily (due to squaring)
- Smooth and differentiable (good for gradient descent)
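- Straightforward to interpret, though it is expressed in squared units of the target (taking the square root gives RMSE, which is in the target's own units)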
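
The effect of squaring can be checked with a small sketch. The two error values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Two hypothetical prediction errors: one small, one three times larger
errors = np.array([1.0, 3.0])
squared = errors ** 2                 # 1.0 and 9.0

# An error 3x larger contributes 9x more to the loss
ratio = squared[1] / squared[0]
print("Squared errors:", squared)
print("Ratio of contributions:", ratio)

# The square root of MSE (RMSE) is back in the target's own units
mse = np.mean(squared)
rmse = np.sqrt(mse)
print("MSE:", mse, "RMSE:", rmse)
```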

---

## 3. The Training Loop

### Core Training Steps:
1. **Zero Gradients**: Clear previous gradients with `optimizer.zero_grad()`
2. **Forward Pass**: Pass input through model to get predictions
3. **Compute Loss**: Calculate how wrong the predictions are
4. **Backward Pass**: Compute gradients with `loss.backward()`
5. **Update Weights**: Apply gradients to update parameters with `optimizer.step()`

### Why Zero Gradients?
PyTorch **accumulates** gradients by default. If you don't zero them, gradients from previous iterations will be added to current ones, causing incorrect updates.
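
The accumulation behavior can be seen on a single scalar parameter. This is a minimal sketch; the parameter `w` and the function `w ** 2` are arbitrary choices for illustration.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
grad_first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added on top
(w ** 2).backward()
grad_accumulated = w.grad.item()

# Zeroing resets the stored gradient before the next backward pass
w.grad.zero_()
(w ** 2).backward()
grad_after_zero = w.grad.item()

print(grad_first, grad_accumulated, grad_after_zero)  # 4.0 8.0 4.0
```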

### Training Over Epochs:
- **Epoch**: One complete pass through the entire training dataset
- **Batch**: A subset of the training data processed together
- Multiple epochs let the model see each sample several times, which typically continues to lower the training loss
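
The relationship between epochs, batches, and parameter updates can be sketched as follows. The dataset size, batch size, and epoch count are hypothetical.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical dataset: 10 samples with 3 features each
X = torch.randn(10, 3)
y = torch.randn(10)
loader = DataLoader(TensorDataset(X, y), batch_size=4)

num_epochs = 3
batches_per_epoch = len(loader)              # ceil(10 / 4) = 3
total_updates = num_epochs * batches_per_epoch

print("Batches per epoch:", batches_per_epoch)
print("Parameter updates over all epochs:", total_updates)
```

With one `optimizer.step()` per batch, the number of updates is the number of epochs times the number of batches per epoch.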

---

## 4. Optimizers

### What is an Optimizer?
An optimizer implements the algorithm for updating model parameters based on computed gradients.

### Common Optimizers:
- **SGD (Stochastic Gradient Descent)**: Basic optimizer
- **Adam**: Adaptive learning rate, works well in most cases
- **RMSprop**: Good for recurrent neural networks
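
The optimizers listed above share the same constructor pattern, which the following sketch illustrates. The placeholder model and the learning rates are assumptions made for the example, not recommendations.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)  # small placeholder model

# Each optimizer receives the parameters to update and a learning rate
optimizers = {
    "SGD": torch.optim.SGD(model.parameters(), lr=0.01),
    "Adam": torch.optim.Adam(model.parameters(), lr=0.001),
    "RMSprop": torch.optim.RMSprop(model.parameters(), lr=0.001),
}

for name, opt in optimizers.items():
    # Every optimizer exposes the same zero_grad() and step() methods
    print(name, "lr =", opt.param_groups[0]["lr"])
```

Because the interface is shared, swapping optimizers usually means changing only the constructor line.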

### Optimizer Workflow:
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# In training loop:
optimizer.zero_grad()  # Clear gradients
loss.backward()        # Compute gradients
optimizer.step()       # Update parameters
```

---

## 5. Training Workflow Summary

```
For each epoch:
    For each batch in dataloader:
        1. Zero gradients
        2. Forward pass (get predictions)
        3. Calculate loss
        4. Backward pass (compute gradients)
        5. Update parameters
```

---

## 6. Key PyTorch Classes & Methods

### Data Handling:
- `TensorDataset(X, y)`: Create dataset from tensors
- `DataLoader(dataset, batch_size, shuffle)`: Create data loader

### Training:
- `criterion = nn.MSELoss()`: Create loss function
- `optimizer = optim.SGD(model.parameters(), lr)`: Create optimizer
- `optimizer.zero_grad()`: Clear gradients
- `loss.backward()`: Compute gradients
- `optimizer.step()`: Update parameters

---

## 7. Important Concepts

### Batch Size:
- **Small batches** (e.g., 32): 
  - More frequent updates
  - More noise in gradient estimates
  - Often better generalization
  
- **Large batches** (e.g., 256):
  - Fewer updates per epoch
  - More stable gradients
  - Faster computation on GPU

### Shuffling:
- Randomizes order of samples each epoch
- Prevents model from learning order of data
- Improves generalization
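
A small sketch of per-epoch shuffling is shown here. The six-element dataset is hypothetical, and the seed is set only so the printed order is reproducible.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # only to make the shuffle order reproducible

data = torch.arange(6)
loader = DataLoader(TensorDataset(data), batch_size=2, shuffle=True)

epoch_orders = []
for epoch in range(2):
    # Each pass over the loader draws a fresh permutation of the samples
    order = torch.cat([batch[0] for batch in loader]).tolist()
    epoch_orders.append(order)
    print(f"Epoch {epoch} order:", order)
```

Every sample still appears exactly once per epoch; only the order changes.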

### Learning Rate:
- Controls how much to adjust weights based on gradients
- Too high: Model may not converge (overshoots minimum)
- Too low: Training is very slow
- Typical values: 0.001 to 0.1
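
The effect of the learning rate can be illustrated on a toy problem. This sketch minimizes `f(w) = w**2` with plain gradient descent; the function, the starting point, and the three learning rates are assumptions chosen to make each behavior visible.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) from w0 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

w_high = gradient_descent(lr=1.1)    # overshoots: |w| grows every step
w_good = gradient_descent(lr=0.1)    # converges toward the minimum at 0
w_low = gradient_descent(lr=0.001)   # moves toward 0, but very slowly

print(f"lr=1.1   -> w = {w_high:.3e}")
print(f"lr=0.1   -> w = {w_good:.3e}")
print(f"lr=0.001 -> w = {w_low:.3e}")
```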

---

## 8. Quick Reference

| Component | Purpose | Example |
|-----------|---------|---------|
| TensorDataset | Package data and labels | `TensorDataset(X_tensor, y_tensor)` |
| DataLoader | Batch and iterate data | `DataLoader(dataset, batch_size=32)` |
| Loss Function | Measure prediction error | `nn.MSELoss()` |
| Optimizer | Update parameters | `optim.SGD(model.parameters(), lr=0.01)` |
| Training Loop | Learn from data | See examples below |

---

## Example 1: Creating a Dataset with TensorDataset

This example demonstrates:
- **Extracting features and labels** from a pandas DataFrame (`animals`)
- **Converting to tensors** using `torch.tensor()`
- **Creating a TensorDataset** to package inputs and labels together
- **Accessing samples** from the dataset using indexing

**Key components:**
- `X`: Feature matrix (all columns except first and last)
- `y`: Target labels (last column)
- `TensorDataset`: Combines X and y into a single dataset object
- Indexing with `dataset[0]` retrieves the first sample and its label

**Why use TensorDataset?**
- Automatically pairs inputs with their corresponding labels
- Makes it easy to iterate through data
- Works seamlessly with DataLoader for batching

**Note**: The code builds a small sample `animals` DataFrame so the example runs on its own; substitute your own DataFrame as needed.

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Create a sample animals DataFrame
animals = pd.DataFrame({
	'Name': ['Dog', 'Cat', 'Bird', 'Fish'],
	'Height': [60, 25, 15, 5],
	'Weight': [30, 5, 0.5, 0.1],
	'Age': [5, 3, 2, 1]
})

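# Features: every column except the first (Name) and the last (Age)
X = animals.iloc[:, 1:-1].to_numpy()
# Target: the last column (Age)
y = animals.iloc[:, -1].to_numpy()

# Create a dataset; float32 matches the default dtype of nn.Linear weights
dataset = TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32))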

# Print the first sample
input_sample, label_sample = dataset[0]
print('Input sample:', input_sample)
print('Label sample:', label_sample)

## Example 2: Creating a DataLoader for Batch Processing

This example demonstrates:
- **Creating a DataLoader** from the dataset
- **Setting batch size**: `batch_size=2` means 2 samples per batch
- **Shuffling**: `shuffle=True` reshuffles the sample order every time the DataLoader is iterated (that is, every epoch)
- **Iterating through batches**: The loop automatically provides batched data

**How DataLoader works:**
- Splits the dataset into batches of specified size
- Each iteration returns a batch of inputs and a batch of labels
- Automatically handles the last batch (which may be smaller)
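
The handling of the last batch can be sketched as follows. The five-sample dataset is hypothetical; `drop_last` is the `DataLoader` argument that discards an incomplete final batch.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical dataset with 5 samples, which does not divide evenly by 2
values = torch.arange(5).float()
dataset_5 = TensorDataset(values, values)

loader_keep = DataLoader(dataset_5, batch_size=2)
loader_drop = DataLoader(dataset_5, batch_size=2, drop_last=True)

sizes_keep = [len(inputs) for inputs, _ in loader_keep]
sizes_drop = [len(inputs) for inputs, _ in loader_drop]

print("Default:       ", sizes_keep)   # [2, 2, 1]
print("drop_last=True:", sizes_drop)   # [2, 2]
```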

**Parameters explained:**
- `batch_size=2`: Process 2 samples at a time
- `shuffle=True`: Randomize order to prevent learning patterns from data order

**Benefits of batching:**
- More efficient computation (especially on GPU)
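- Adds noise to the gradient estimates, which can have a mild regularizing effect (it does not by itself prevent overfitting)
- Allows training on datasets larger than memory when the `Dataset` loads samples lazily, one batch at a time (a `TensorDataset` keeps all of its tensors in memory)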
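
Loading one batch at a time depends on a dataset that produces samples on demand. The class below is a hypothetical sketch of that idea: `LazySquares` is not part of PyTorch, and it computes each sample inside `__getitem__` where a real project would read from disk.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazySquares(Dataset):
    """Hypothetical dataset that builds each sample only when it is requested."""

    def __init__(self, n_samples):
        self.n_samples = n_samples

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # In a real project this is where a file or database read would go
        x = torch.tensor([float(idx)])
        y = x ** 2
        return x, y

loader = DataLoader(LazySquares(1000), batch_size=100)
first_inputs, first_labels = next(iter(loader))
print(first_inputs.shape, first_labels.shape)
```

Only the samples of the current batch are materialized, so memory use depends on the batch size rather than the dataset size.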

**Output:** Each iteration prints one batch of inputs and their corresponding labels.

In [None]:
# Create a DataLoader
dataloader = DataLoader(dataset, batch_size = 2, shuffle=True)

# Iterate over the dataloader
for batch_inputs, batch_labels in dataloader:
    print('batch_inputs:', batch_inputs)
    print('batch_labels:', batch_labels)

## Example 3: Computing Mean Squared Error (MSE) Loss

This example demonstrates:
- **Calculating MSE manually** using NumPy
- **Calculating MSE using PyTorch** with `nn.MSELoss()`
- **Comparing both methods** to verify they produce the same result

**Mean Squared Error Formula:**
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_{\text{pred},i} - y_{\text{true},i})^2$$

**How it works:**
1. Compute the difference between predictions and true values
2. Square each difference (to make all values non-negative and penalize large errors)
3. Take the mean of all squared differences
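
Applying these three steps by hand to the arrays used in this example's code, predictions $(3, 5, 2.5, 7)$ and true values $(3, 4.5, 2, 8)$, gives:

$$\text{MSE} = \frac{(3-3)^2 + (5-4.5)^2 + (2.5-2)^2 + (7-8)^2}{4} = \frac{0 + 0.25 + 0.25 + 1}{4} = 0.375$$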

**Why use MSE?**
- Standard loss function for regression problems
- Differentiable (needed for backpropagation)
- Penalizes larger errors more heavily due to squaring
- Easy to interpret (lower is better)

**PyTorch vs NumPy:**
- NumPy version: Manual calculation using `np.mean((y_pred - y)**2)`
- PyTorch version: Built-in `nn.MSELoss()` function
- Both produce the same value; PyTorch returns it as a 0-dimensional tensor rather than a NumPy float

**Use case:** This loss function is used during training to measure how well the model's predictions match the true values.

In [None]:
y_pred = np.array([3, 5.0, 2.5, 7.0])  
y = np.array([3.0, 4.5, 2.0, 8.0])     

# Calculate MSE using NumPy
mse_numpy = np.mean((y_pred - y)**2)

# Create the MSELoss function in PyTorch
criterion = nn.MSELoss()

# Calculate MSE using PyTorch
mse_pytorch = criterion(torch.tensor(y_pred), torch.tensor(y))

print("MSE (NumPy):", mse_numpy)
print("MSE (PyTorch):", mse_pytorch)

## Example 4: Complete Training Loop

This example demonstrates the **full training workflow** for a neural network:

**Training Loop Structure:**
```
For each epoch (complete pass through dataset):
    For each batch in dataloader:
        1. optimizer.zero_grad()    ← Clear previous gradients
        2. prediction = model(feature)  ← Forward pass
        3. loss = criterion(prediction, target)  ← Compute loss
        4. loss.backward()         ← Compute gradients (backpropagation)
        5. optimizer.step()        ← Update weights
```

**Step-by-step breakdown:**

1. **Outer loop (`range(num_epochs)`)**: 
   - Iterates through multiple complete passes of the training data
   - More epochs mean more parameter updates, although too many can lead to overfitting

2. **Inner loop (`for data in dataloader`)**: 
   - Processes one batch at a time
   - `feature, target = data` unpacks batch inputs and labels

3. **`optimizer.zero_grad()`**: 
   - **CRITICAL**: Clears old gradients
   - Without this, gradients accumulate and cause wrong updates

4. **Forward pass (`prediction = model(feature)`)**: 
   - Passes input through the network
   - Gets predicted output

5. **Loss computation (`loss = criterion(prediction, target)`)**: 
   - Measures how wrong predictions are
   - Uses MSE or other loss function defined as `criterion`

6. **Backward pass (`loss.backward()`)**: 
   - Computes gradients for all parameters
   - Uses automatic differentiation (autograd)
   - Gradients stored in `.grad` attribute of each parameter

7. **Parameter update (`optimizer.step()`)**: 
   - Updates all model parameters using computed gradients
   - Applies learning rule (SGD, Adam, etc.)
   - For plain SGD: `weight = weight - learning_rate * gradient` (Adam and other optimizers use modified update rules)
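
The plain SGD rule can be checked against `optimizer.step()` directly. This is a minimal sketch; the parameter values, the loss `sum(w ** 2)`, and the learning rate are arbitrary choices for illustration.

```python
import torch

lr = 0.1
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()      # gradient is 2 * w
loss.backward()

# Expected result of the plain SGD rule, computed by hand: [0.8, -1.6]
expected = w.detach() - lr * w.grad

optimizer.step()           # in-place update performed by the optimizer
print("After step():", w.detach())
print("By hand:     ", expected)
```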

**After training:**
- The final `print` statement reports the loss of the last batch processed
- Model has learned from the data and can make predictions
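
Making predictions after training can be sketched as follows. `demo_model` is an untrained placeholder and `new_sample` holds hypothetical values; in practice the trained model from the training loop would be used instead.

```python
import torch
import torch.nn as nn

# Untrained placeholder with a two-feature input (Height, Weight)
demo_model = nn.Sequential(nn.Linear(2, 4), nn.Linear(4, 1))

# Hypothetical new sample: height 40, weight 12
new_sample = torch.tensor([[40.0, 12.0]])

demo_model.eval()            # switch layers such as dropout to inference mode
with torch.no_grad():        # skip gradient tracking during prediction
    predicted = demo_model(new_sample)

print("Prediction shape:", predicted.shape)
```

Wrapping inference in `torch.no_grad()` avoids building the autograd graph, which saves memory when no backward pass is needed.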

**Prerequisites for this code:**
- `model`: A defined neural network
- `criterion`: Loss function (e.g., `nn.MSELoss()`)
- `optimizer`: Optimization algorithm (e.g., `torch.optim.SGD()`)
- `num_epochs`: Number of times to iterate through dataset
- `dataloader`: DataLoader with training data

**Key concept:** This loop is the heart of deep learning - it's how neural networks learn from data!

In [None]:
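# Define the model; in_features=2 matches the two feature columns
# (Height, Weight) of the dataset built in Example 1
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.Linear(8, 4),
    nn.Linear(4, 1)
)

# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Set number of epochs
num_epochs = 5

# Loop over the number of epochs and the dataloader
for epoch in range(num_epochs):
    for data in dataloader:
        # Set the gradients to zero
        optimizer.zero_grad()
        # Unpack the batch; cast to float32 to match the model's weights and
        # reshape the target to (batch_size, 1) to match the model's output
        feature, target = data
        feature = feature.float()
        target = target.float().unsqueeze(1)
        # Run a forward pass
        prediction = model(feature)
        # Compute the loss
        loss = criterion(prediction, target)
        # Compute the gradients
        loss.backward()
        # Update the model's parameters
        optimizer.step()

# Note: the features are not normalized, so the loss can be unstable at this
# learning rate; scaling the inputs or lowering lr is a common remedy
print(f"Training complete! Final loss: {loss.item():.4f}")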