# PyTorch Deep Dive: Training Your First Model

We have the Data (Tensor). We have the Machine (Model). We have the Math (Autograd).

Now we need to teach the machine. This is **Training**.

## Learning Objectives
- **The Vocabulary**: What are an "Epoch", a "Batch", a "Loss", and an "Optimizer"?
- **The Intuition**: Training as "Learning to Ride a Bike".
- **The Loop**: The 5-step process that repeats for every batch, up to millions of times in large training runs.
- **The Visual**: Watching the loss go down.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

torch.manual_seed(42)

## Part 1: The Vocabulary (Definitions First)

Training a model is like training an athlete. Here are the terms:

### 1. Epoch
- One full pass through the entire dataset.
- Example: If you have 1000 images and you look at all 1000, that's 1 Epoch.
- Analogy: Reading the textbook cover-to-cover once.

### 2. Batch
- A small chunk of data processed at once.
- We usually don't learn from one example at a time (slow, noisy updates), nor from the whole dataset at once (often too big for memory).
- Analogy: Studying one chapter at a time.
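
To make epochs and batches concrete, here is a minimal sketch using `torch.utils.data.DataLoader`. The 100-sample toy dataset, the batch size of 20, and the variable names are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (assumed for illustration): 100 inputs with matching targets
features = torch.arange(100, dtype=torch.float32).view(-1, 1)
targets = 2 * features + 1
dataset = TensorDataset(features, targets)

# Each iteration of the loader yields one batch of up to 20 samples
loader = DataLoader(dataset, batch_size=20, shuffle=True)

num_epochs = 2
batches_seen = 0
for epoch in range(num_epochs):
    for batch_features, batch_targets in loader:
        batches_seen += 1  # a real loop would run one training step here

print(f"Batches per epoch: {len(loader)}")
print(f"Total batches over {num_epochs} epochs: {batches_seen}")
```

One pass over `loader` is one epoch; each item it yields is one batch.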

### 3. Loss Function (The Scorecard)
- Measures how bad the model's prediction is.
- Example: MSE (Mean Squared Error) for regression (predicting numbers), Cross-Entropy for classification (predicting categories).
- Analogy: The grade on a practice test.
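
As a quick check of what MSE measures, the sketch below computes it by hand and compares it with `nn.MSELoss`. The prediction and target values are made up for illustration.

```python
import torch
import torch.nn as nn

# Made-up predictions and targets for illustration
pred = torch.tensor([2.0, 4.0, 6.0])
target = torch.tensor([1.0, 4.0, 8.0])

# MSE by hand: average of the squared differences
manual_mse = ((pred - target) ** 2).mean()

# The same quantity from PyTorch's built-in loss
builtin_mse = nn.MSELoss()(pred, target)

print(manual_mse.item(), builtin_mse.item())
```

The errors here are 1, 0, and -2, so the squared errors are 1, 0, and 4, and their mean is 5/3.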

### 4. Optimizer (The Coach)
- The algorithm that updates the weights to reduce the loss.
- Example: SGD (Stochastic Gradient Descent), Adam.
- Analogy: The coach telling you "Lean left!" or "Pedal harder!".
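
For plain SGD (no momentum or weight decay), the coach's advice is a single formula: `w = w - lr * grad`. The sketch below checks that `optim.SGD` matches that formula on a made-up two-element parameter; the loss `sum(w ** 2)` is an assumption chosen only because its gradient is easy to verify.

```python
import torch
import torch.optim as optim

lr = 0.1
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=lr)

# Toy loss (assumed for illustration): sum of squares, so the gradient is 2 * w
loss = (w ** 2).sum()
loss.backward()

# What the update rule predicts, computed before the optimizer changes w
expected = (w - lr * w.grad).detach().clone()

# Let the optimizer apply the update in place
optimizer.step()

print("After step:", w.detach())
print("Expected:  ", expected)
```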

In [None]:
# Visualize Epochs, Batches, and Dataset relationship
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Left: Dataset division into batches
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)

# Simulate a dataset of 100 samples
total_samples = 100
batch_size = 20
num_batches = total_samples // batch_size

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', '#98D8C8']
y_start = 8

for batch_idx in range(num_batches):
    start_sample = batch_idx * batch_size
    end_sample = (batch_idx + 1) * batch_size
    
    # Draw batch rectangle
    rect = plt.Rectangle((1, y_start - batch_idx * 1.5), 8, 1, 
                         facecolor=colors[batch_idx], edgecolor='black', linewidth=2)
    ax1.add_patch(rect)
    
    # Add label
    ax1.text(5, y_start - batch_idx * 1.5 + 0.5, 
            f'Batch {batch_idx + 1}\nSamples {start_sample}-{end_sample - 1}',
            ha='center', va='center', fontsize=11, fontweight='bold')

ax1.text(5, 9.5, 'Dataset (100 samples)', ha='center', fontsize=14, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))
ax1.text(5, 0.5, f'1 Epoch = Processing all {num_batches} batches', ha='center', 
        fontsize=12, fontweight='bold', color='red',
        bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))
ax1.axis('off')
ax1.set_title('Dataset Division into Batches', fontsize=14, fontweight='bold')

# Right: Multiple epochs
ax2.set_xlim(0, 6)
ax2.set_ylim(0, 10)

epochs_to_show = 3
for epoch in range(epochs_to_show):
    y_base = 8 - epoch * 3
    
    # Epoch label
    ax2.text(0.5, y_base + 0.5, f'Epoch {epoch + 1}', fontsize=12, fontweight='bold',
            rotation=0, va='center', bbox=dict(boxstyle='round', facecolor='lightgray'))
    
    # Mini batches
    for batch in range(num_batches):
        x_pos = 1.5 + batch * 0.8
        rect = plt.Rectangle((x_pos, y_base), 0.6, 0.8, 
                           facecolor=colors[batch], edgecolor='black', linewidth=1.5)
        ax2.add_patch(rect)
        ax2.text(x_pos + 0.3, y_base + 0.4, f'B{batch+1}', 
                ha='center', va='center', fontsize=8, fontweight='bold')
    
    # Arrow to next epoch
    if epoch < epochs_to_show - 1:
        ax2.annotate('', xy=(0.5, y_base - 1), xytext=(0.5, y_base - 0.2),
                   arrowprops=dict(arrowstyle='->', lw=2, color='red'))

ax2.text(3, 0.5, 'Training continues...', ha='center', fontsize=11, 
        style='italic', color='gray')
ax2.axis('off')
ax2.set_title('Multiple Epochs of Training', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("Key Concepts:")
print(f"• Dataset: {total_samples} total samples")
print(f"• Batch Size: {batch_size} samples per batch")
print(f"• Batches per Epoch: {num_batches}")
print(f"• 1 Epoch = Model sees each sample exactly once")
print(f"• Multiple Epochs = Model learns from the same data repeatedly")
print(f"• Total Updates (for 10 epochs) = {num_batches * 10} weight updates!")

### Visualization: Understanding Epochs and Batches

The figure above shows how a dataset is divided into batches and processed over multiple epochs.

## Part 2: The Intuition (Learning to Ride a Bike)

How do you learn to ride a bike?

1. **Try**: You get on and pedal. (Forward Pass).
2. **Fail**: You fall over. (Compute Loss).
3. **Blame**: You realize you leaned too far left. (Compute Gradients).
4. **Adjust**: You lean a bit to the right next time. (Update Parameters).
5. **Repeat**: You do it again.

This is exactly how Neural Networks learn.
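
The bike steps map onto code almost one-to-one. Here is a minimal sketch that learns a single weight by hand, without `nn` or `optim`; the data (`y = 3x`), the learning rate, and the step count are assumptions chosen for illustration.

```python
import torch

# Made-up data: the "right answer" is w = 3
x = torch.tensor([1.0, 2.0, 3.0])
y = 3.0 * x

w = torch.tensor(0.0, requires_grad=True)  # start with a bad guess
lr = 0.05

for step in range(20):
    pred = w * x                      # Try: forward pass
    loss = ((pred - y) ** 2).mean()   # Fail: measure how wrong we are
    loss.backward()                   # Blame: compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad              # Adjust: move against the gradient
    w.grad.zero_()                    # clear the gradient before repeating

print(f"Learned w: {w.item():.4f}")
```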

In [None]:
import numpy as np

# Create flowchart visualization of training loop
fig, ax = plt.subplots(figsize=(12, 10))

steps = [
    {"text": "START\n(Random Weights)", "y": 0.95, "color": "lightgray"},
    {"text": "1. FORWARD PASS\npred = model(X)", "y": 0.80, "color": "lightblue"},
    {"text": "2. CALCULATE LOSS\nloss = criterion(pred, y)", "y": 0.65, "color": "lightcoral"},
    {"text": "3. ZERO GRADIENTS\noptimizer.zero_grad()", "y": 0.50, "color": "lightyellow"},
    {"text": "4. BACKPROPAGATION\nloss.backward()", "y": 0.35, "color": "lightgreen"},
    {"text": "5. UPDATE WEIGHTS\noptimizer.step()", "y": 0.20, "color": "plum"},
    {"text": "REPEAT\n(Next Epoch)", "y": 0.05, "color": "lightgray"}
]

for i, step in enumerate(steps):
    # Draw box
    bbox = dict(boxstyle='round,pad=0.8', facecolor=step["color"], 
                edgecolor='black', linewidth=2.5)
    ax.text(0.5, step["y"], step["text"], ha='center', va='center',
           fontsize=13, fontweight='bold', bbox=bbox)
    
    # Draw arrow to next step
    if i < len(steps) - 1:
        ax.annotate('', xy=(0.5, steps[i+1]["y"] + 0.04), 
                   xytext=(0.5, step["y"] - 0.04),
                   arrowprops=dict(arrowstyle='->', lw=3, color='black'))

# Draw loop-back arrow
ax.annotate('', xy=(0.72, 0.92), xytext=(0.72, 0.08),
           arrowprops=dict(arrowstyle='->', lw=2.5, color='red', 
                         connectionstyle="arc3,rad=.5"))
ax.text(0.85, 0.5, 'Training Loop\n(Many Epochs)', fontsize=11, 
       color='red', fontweight='bold', rotation=90, va='center')

# Add side annotations
annotations = [
    (0.05, 0.80, "Compute predictions\nfrom current weights"),
    (0.05, 0.65, "How wrong are\nwe? (Error)"),
    (0.05, 0.50, "Clear old gradients\n(Critical!)"),
    (0.05, 0.35, "Calculate how to\nimprove (∂Loss/∂W)"),
    (0.05, 0.20, "Adjust weights:\nW = W - lr × ∂Loss/∂W")
]

for x, y, text in annotations:
    ax.text(x, y, text, fontsize=9, style='italic', 
           bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
ax.set_title('The 5-Step Training Loop\n(Heart of Deep Learning)', 
            fontsize=16, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("In large training runs, this loop can execute MILLIONS of times!")
print("Each iteration makes the model slightly better.")

### Visualization: The Training Process

The flowchart above lays out the 5-step training loop.
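
Zeroing gradients is marked as critical because PyTorch adds each new gradient to whatever is already stored in `.grad`. A minimal sketch of that behavior, using a made-up one-parameter example:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
first_grad = w.grad.item()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one
(3 * w).backward()
accumulated_grad = w.grad.item()

# Zero the gradient, then run backward again: back to the correct value
w.grad.zero_()
(3 * w).backward()
fresh_grad = w.grad.item()

print(first_grad, accumulated_grad, fresh_grad)
```

Without the zeroing step, the optimizer would act on the accumulated value rather than the gradient of the current loss.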

## Part 3: The Setup (Data, Model, Loss, Optimizer)

Before the loop, we need 4 things:

1. **Data**: $X$ (Inputs) and $y$ (Targets).
2. **Model**: The network.
3. **Loss Function**: The Scorecard.
4. **Optimizer**: The Coach.

In [None]:
# 1. Data (Linear Regression: y = 2x + 1)
X = torch.linspace(0, 10, 100).view(-1, 1) # 100 inputs
y = 2 * X + 1 + torch.randn(X.shape) * 0.5 # 100 targets (with noise)

# 2. Model (Linear Layer)
model = nn.Linear(1, 1)

# 3. Loss Function (MSE: Mean Squared Error)
criterion = nn.MSELoss()

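# 4. Optimizer (SGD: Stochastic Gradient Descent)
# Note: this notebook feeds all 100 samples at every step, so it is effectively full-batch gradient descent.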
optimizer = optim.SGD(model.parameters(), lr=0.01)

## Part 4: The Training Loop (The 5 Steps)

This loop is the heartbeat of Deep Learning. Memorize these 5 steps.

1. **Forward Pass**: `pred = model(X)`
2. **Calculate Loss**: `loss = criterion(pred, y)`
3. **Zero Gradients**: `optimizer.zero_grad()` (Don't forget! Otherwise gradients from earlier `backward()` calls pile up.)
4. **Backpropagation**: `loss.backward()` (Compute gradients)
5. **Step**: `optimizer.step()` (Update weights)
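
The later cells use `losses` and `epochs`. Here is a minimal sketch of a loop that defines them by following the 5 steps above. The value `epochs = 100` is an assumption, and the Part 3 setup is repeated so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Setup repeated from Part 3 so this block is self-contained
X = torch.linspace(0, 10, 100).view(-1, 1)
y = 2 * X + 1 + torch.randn(X.shape) * 0.5
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

epochs = 100   # assumed value for illustration
losses = []    # one loss value per epoch, used by the plots

for epoch in range(epochs):
    pred = model(X)               # 1. Forward pass
    loss = criterion(pred, y)     # 2. Calculate loss
    optimizer.zero_grad()         # 3. Zero gradients
    loss.backward()               # 4. Backpropagation
    optimizer.step()              # 5. Update weights

    losses.append(loss.item())    # record a plain float, not a tensor

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch + 1:3d} | Loss: {loss.item():.4f}")
```

Because the whole dataset goes through the model in one call, each epoch here is exactly one weight update.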

In [None]:
# Enhanced visualization of training results
# (Assumes `losses` and `epochs` were defined by the Part 4 training loop.)
fig = plt.figure(figsize=(16, 10))

# 1. Loss Curve with annotations
ax1 = plt.subplot(2, 3, 1)
ax1.plot(losses, linewidth=2.5, color='red', label='Training Loss')
ax1.fill_between(range(len(losses)), losses, alpha=0.3, color='red')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('MSE Loss', fontsize=12)
ax1.set_title('Loss Curve\n(Should Decrease!)', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Mark key points
min_loss_idx = np.argmin(losses)
ax1.scatter([0, min_loss_idx, len(losses)-1], 
           [losses[0], losses[min_loss_idx], losses[-1]], 
           s=100, c=['red', 'green', 'blue'], zorder=5)
ax1.annotate(f'Start\nLoss={losses[0]:.2f}', xy=(0, losses[0]), 
            xytext=(10, losses[0]+0.5), fontsize=9,
            arrowprops=dict(arrowstyle='->', color='red'))
ax1.annotate(f'Best\nLoss={losses[min_loss_idx]:.2f}', 
            xy=(min_loss_idx, losses[min_loss_idx]), 
            xytext=(min_loss_idx+10, losses[min_loss_idx]+0.5), fontsize=9,
            arrowprops=dict(arrowstyle='->', color='green'))

# 2. Model Fit
ax2 = plt.subplot(2, 3, 2)
ax2.scatter(X.numpy(), y.numpy(), alpha=0.6, s=30, label='True Data', color='blue')
with torch.no_grad():
    predictions_final = model(X)
    ax2.plot(X.numpy(), predictions_final.numpy(), color='red', 
            linewidth=3, label='Learned Line', linestyle='--')
ax2.set_xlabel('X (Input)', fontsize=12)
ax2.set_ylabel('y (Output)', fontsize=12)
ax2.set_title('Model Predictions vs True Data', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. Residuals (Prediction Errors)
ax3 = plt.subplot(2, 3, 3)
with torch.no_grad():
    residuals = (y - predictions_final).numpy()
ax3.scatter(X.numpy(), residuals, alpha=0.6, s=30, color='purple')
ax3.axhline(y=0, color='red', linestyle='--', linewidth=2, label='Perfect Fit')
ax3.set_xlabel('X (Input)', fontsize=12)
ax3.set_ylabel('Residual (Error)', fontsize=12)
ax3.set_title('Prediction Errors\n(Should be random noise)', fontsize=13, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

# 4. Weight evolution: retrain a fresh model and record weight/bias after every epoch
model_track = nn.Linear(1, 1)
optimizer_track = optim.SGD(model_track.parameters(), lr=0.01)
weight_history = []
bias_history = []

for epoch in range(epochs):
    predictions = model_track(X)
    loss = criterion(predictions, y)
    optimizer_track.zero_grad()
    loss.backward()
    optimizer_track.step()
    
    weight_history.append(model_track.weight.item())
    bias_history.append(model_track.bias.item())

ax4 = plt.subplot(2, 3, 4)
ax4.plot(weight_history, label='Weight', linewidth=2, color='blue')
ax4.axhline(y=2.0, color='blue', linestyle='--', linewidth=1, alpha=0.5, label='True Weight (2.0)')
ax4.set_xlabel('Epoch', fontsize=12)
ax4.set_ylabel('Weight Value', fontsize=12)
ax4.set_title('Weight Convergence', fontsize=13, fontweight='bold')
ax4.legend()
ax4.grid(True, alpha=0.3)

# 5. Bias evolution (from the same tracked run)
ax5 = plt.subplot(2, 3, 5)
ax5.plot(bias_history, label='Bias', linewidth=2, color='green')
ax5.axhline(y=1.0, color='green', linestyle='--', linewidth=1, alpha=0.5, label='True Bias (1.0)')
ax5.set_xlabel('Epoch', fontsize=12)
ax5.set_ylabel('Bias Value', fontsize=12)
ax5.set_title('Bias Convergence', fontsize=13, fontweight='bold')
ax5.legend()
ax5.grid(True, alpha=0.3)

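# 6. Learning Rate Comparison
# Note: with inputs in [0, 10], lr=0.1 is too large for plain SGD on this problem
# and its loss blows up, while lr=0.001 converges slowly.
ax6 = plt.subplot(2, 3, 6)
learning_rates = [0.001, 0.01, 0.1]
colors = ['blue', 'green', 'red']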

for lr, color in zip(learning_rates, colors):
    model_lr = nn.Linear(1, 1)
    optimizer_lr = optim.SGD(model_lr.parameters(), lr=lr)
    losses_lr = []
    
    for epoch in range(50):
        predictions = model_lr(X)
        loss = criterion(predictions, y)
        losses_lr.append(loss.item())
        optimizer_lr.zero_grad()
        loss.backward()
        optimizer_lr.step()
    
    ax6.plot(losses_lr, label=f'LR={lr}', linewidth=2, color=color)

ax6.set_xlabel('Epoch', fontsize=12)
ax6.set_ylabel('Loss', fontsize=12)
ax6.set_title('Effect of Learning Rate', fontsize=13, fontweight='bold')
ax6.legend()
ax6.grid(True, alpha=0.3)
ax6.set_yscale('log')

plt.tight_layout()
plt.show()

print("Results:")
print(f"• Learned Weight: {model.weight.item():.3f} (True: 2.000)")
print(f"• Learned Bias:   {model.bias.item():.3f} (True: 1.000)")
print(f"• Final Loss:     {losses[-1]:.4f}")
print(f"• Loss Reduction: {((losses[0] - losses[-1]) / losses[0] * 100):.1f}%")

## Part 5: Visualization (Did it learn?)

Let's see if the model learned the line $y = 2x + 1$.
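
One way to judge the learned parameters is to compare them with the least-squares solution for the same noisy data. The sketch below is an optional cross-check, not part of the training loop; it regenerates data with the Part 3 recipe and uses `torch.linalg.lstsq`. The names `A`, `best_weight`, and `best_bias` are introduced here for illustration.

```python
import torch

torch.manual_seed(42)

# Same synthetic data recipe as Part 3: y = 2x + 1 plus noise
X = torch.linspace(0, 10, 100).view(-1, 1)
y = 2 * X + 1 + torch.randn(X.shape) * 0.5

# Append a column of ones so the bias is solved for as a second coefficient
A = torch.cat([X, torch.ones_like(X)], dim=1)   # shape (100, 2)

# Least-squares solution of A @ [w, b] = y
solution = torch.linalg.lstsq(A, y).solution     # shape (2, 1)
best_weight = solution[0].item()
best_bias = solution[1].item()

print(f"Least-squares weight: {best_weight:.3f}")
print(f"Least-squares bias:   {best_bias:.3f}")
```

If the trained bias is still some distance from this value, more epochs are likely needed: with inputs spread over [0, 10], plain SGD at `lr=0.01` tends to adjust the bias slowly.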

In [None]:
# Plot Loss Curve
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(losses)
plt.title("Loss Curve (Lower is Better)")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")

# Plot Predictions
plt.subplot(1, 2, 2)
plt.scatter(X.numpy(), y.numpy(), label="Data")
with torch.no_grad(): # Don't track gradients for plotting
    plt.plot(X.numpy(), model(X).numpy(), color='red', label="Prediction")
plt.title("Model Fit")
plt.legend()
plt.show()

# Check learned parameters
print(f"Learned Weight: {model.weight.item():.2f} (True: 2.0)")
print(f"Learned Bias: {model.bias.item():.2f} (True: 1.0)")

## Summary Checklist

1. **Epoch** = One full pass through the dataset.
2. **Loss** = The error metric we want to minimize.
3. **Optimizer** = The algorithm (SGD, Adam) that updates weights.
4. **The 5 Steps**: Forward -> Loss -> Zero -> Backward -> Step.

You have now trained your first AI model from scratch. Congratulations!