# Session 6: Training Models with PyTorch - Loss and Optimizers

**Objective:** To understand the fundamental training loop in machine learning by using PyTorch's `autograd` to minimize a loss function, and then to learn how `torch.optim` simplifies this process.

## Part 1: Concepts

### 1. The Goal: Minimizing Loss

Machine learning is essentially about adjusting a model's parameters (like a weight `w`) so that its predictions get closer and closer to the true values. We measure how 'wrong' the model is using a **loss function**. Our goal is to find the parameters that make the loss as small as possible.

The process looks like this:
1.  **Forward Pass:** Make a prediction.
2.  **Compute Loss:** Compare the prediction to the true value.
3.  **Backward Pass:** Calculate the gradient of the loss with respect to the model's parameters. The gradient points in the direction in which the loss increases fastest, so it tells us which way to move the parameters.
4.  **Update Parameters:** Adjust the parameters in the opposite direction of the gradient.

We repeat this process iteratively. This is called **Gradient Descent**.
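
As a rough illustration of the four steps, here is a minimal sketch in plain Python with no PyTorch. The toy loss `(w - 3)^2`, the hand-written `grad` function, and the step size of `0.1` are assumptions chosen only for this example.

```python
# Toy problem (an assumption for illustration): minimize loss(w) = (w - 3)^2.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the loss with respect to w, worked out by hand.
    return 2.0 * (w - 3.0)

w = 0.0              # starting guess
learning_rate = 0.1  # step size, picked arbitrarily for this sketch

for step in range(50):
    # Move against the gradient to reduce the loss.
    w = w - learning_rate * grad(w)

print(f"w = {w:.5f}, loss = {loss(w):.10f}")
```

Each step shrinks the distance to the minimum by a constant factor, so `w` approaches 3 without ever being told the answer.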


### 2. Manual Gradient Descent (Single Data Point)

Let's solve a simple problem: `y = w * x`. We have a single data point `x=2` and a target `y=4`. We know the answer should be `w=2`, but let's pretend we don't and make the model learn it.

Our loss function will be the **Mean Squared Error (MSE)**. With a single data point the mean is over one term, so it reduces to the squared error: `loss = (prediction - y)^2`.

**Discussion:** Why is MSE a "good" loss function?

What loop do we follow to *learn* `w` given `x` and `y`?

- Calculate the prediction from the current parameter `w`.
- Calculate how wrong the prediction is (the loss).
- Calculate the gradient of the loss with respect to `w`. Its negative is the direction that reduces the loss.
- Update `w` in the direction opposite to the gradient.
- Recalculate the prediction and loop again!
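
For this one-parameter model the gradient can be worked out by hand with the chain rule, which is a useful check on what autograd reports. The step size $\alpha = 0.1$ below matches the manual update printed in the next cells; the symbol $\alpha$ itself is just notation for the learning rate.

$$
\text{loss}(w) = (w x - y)^2
\qquad\Longrightarrow\qquad
\frac{\partial\,\text{loss}}{\partial w} = 2\,(w x - y)\,x
$$

$$
w = 0,\; x = 2,\; y = 4:
\qquad
\frac{\partial\,\text{loss}}{\partial w} = 2\,(0 - 4)\cdot 2 = -16,
\qquad
w \leftarrow 0 - 0.1 \cdot (-16) = 1.6
$$

The gradient is negative, so stepping against it increases `w`, and the new prediction `1.6 * 2 = 3.2` is closer to the target of 4.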

In [None]:
import torch

# --- Setup ---
# Known data point
x = torch.tensor(2.0)
y = torch.tensor(4.0)

# Initialize the weight (zero here; any starting value works).
# We must set requires_grad=True so that autograd computes gradients for it.
w = torch.tensor(0.0, requires_grad=True)

In [None]:
y_pred = w * x
loss = (y_pred - y)**2

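# Before backward(), every .grad is None.
# Note: y_pred and loss are non-leaf tensors, so reading their .grad also emits a UserWarning.
print(f"x is {x} with grad {x.grad}")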
print(f"y is {y} with grad {y.grad}")
print(f"y_pred is {y_pred} with grad {y_pred.grad}")
print(f"loss is {loss} with grad {loss.grad}")
print(f"w is {w} with grad {w.grad}")

# Gradient derived by hand: d loss / d w = 2 * (y_pred - y) * x
print("\nmanual gradient is:")
print("d loss / d w = 2 * (y_pred - y) * x: ", 2 * (y_pred - y) * x)

print("\nautograd: doing backward pass!\n\n")
loss.backward()  # populates w.grad

print(f"x is {x} with grad {x.grad}")
print(f"y is {y} with grad {y.grad}")
print(f"y_pred is {y_pred} with grad {y_pred.grad}")
print(f"loss is {loss} with grad {loss.grad}")
print(f"w is {w} with grad {w.grad}")

print("\nmanual update is:")
print("w -> w - alpha * d loss/ d w")
# w -> w - 0.1 * w.grad

print("new prediction is: ", (w-0.1*w.grad)*x)

print(f"\nresetting grad to zero.\n\n")
w.grad.zero_()
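
In the output above, only `w` ends up with a gradient. `x` and `y` do not require gradients, and `y_pred` and `loss` are intermediate (non-leaf) tensors whose gradients autograd discards by default. The sketch below, which rebuilds the same tensors so it can run on its own, uses `retain_grad()` to keep the gradient of `y_pred` for inspection.

```python
import torch

x = torch.tensor(2.0)
y = torch.tensor(4.0)
w = torch.tensor(0.0, requires_grad=True)

y_pred = w * x
y_pred.retain_grad()  # ask autograd to keep the gradient of this non-leaf tensor
loss = (y_pred - y) ** 2
loss.backward()

print(f"w.grad      = {w.grad}")       # d loss / d w      = 2 * (y_pred - y) * x
print(f"y_pred.grad = {y_pred.grad}")  # d loss / d y_pred = 2 * (y_pred - y)
```

This is mainly a debugging aid; a normal training loop only needs the gradients of the parameters.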




In [None]:
# Hyperparameter: how big of a step we take during each update
learning_rate = 0.01
print(f"Initial prediction: f(2) = {w * x:.3f}")

In [None]:
# --- Training Loop ---
for epoch in range(40):
    # 1. Forward pass: make a prediction
    y_pred = w * x
    
    # 2. Compute loss
    loss = (y_pred - y)**2
    
    # 3. Backward pass: compute gradient of loss w.r.t. w
    loss.backward() # Populates w.grad
    
    # 4. Manually update weight using the gradient
    # We wrap this in `with torch.no_grad()` because we don't want this update to be tracked.
    with torch.no_grad():
        w -= learning_rate * w.grad
    
    # **VERY IMPORTANT**: Zero the gradients after updating
    # If we don't, gradients will accumulate from all previous steps.
    w.grad.zero_()
    
    if (epoch + 1) % 2 == 0:
        print(f'Epoch {epoch+1}: w = {w:.3f}, loss = {loss:.8f}')

print(f"\nFinal prediction: f(2) = {w * x:.3f}")
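
To see why zeroing matters, the following sketch calls `backward()` twice on the same loss without resetting `w.grad` in between. The variable names `first`, `accumulated`, and `after_reset` are made up for this illustration.

```python
import torch

x = torch.tensor(2.0)
y = torch.tensor(4.0)
w = torch.tensor(0.0, requires_grad=True)

# First backward pass: w.grad holds the gradient of this loss.
loss = (w * x - y) ** 2
loss.backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one.
loss = (w * x - y) ** 2
loss.backward()
accumulated = w.grad.clone()

# Zero the gradient, then run backward again: we are back to the single-pass value.
w.grad.zero_()
loss = (w * x - y) ** 2
loss.backward()
after_reset = w.grad.clone()

print(f"first: {first}, accumulated: {accumulated}, after reset: {after_reset}")
```

Since `w` never changes here, the true gradient is the same every time; the doubled value in `accumulated` comes purely from accumulation.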

### 3. Gradient Descent with a Batch of Data

Usually, we have more than one data point. The process is the same, but now our forward and backward passes operate on a whole **batch** of data at once. The loss is typically the average loss over the entire batch.

In [None]:
# --- Setup for Batch ---
# True relationship: y = 2*x
X = torch.tensor([1.0, 2.0, 3.0, 4.0])
Y = torch.tensor([2.0, 4.0, 6.0, 8.0])

# Initialize the weight (zero here; any starting value works)
w = torch.tensor(0.0, requires_grad=True)

learning_rate = 0.01

# --- Training Loop for Batch ---
for epoch in range(40):
    # 1. Forward pass (works on the whole batch)
    Y_pred = w * X
    
    # 2. Compute loss (average over the batch)
    loss = torch.mean((Y_pred - Y)**2)
    
    # 3. Backward pass
    loss.backward()
    
    # 4. Manual update
    with torch.no_grad():
        w -= learning_rate * w.grad
        
    # Zero the gradients
    w.grad.zero_()
    
    if (epoch + 1) % 2 == 0:
        print(f'Epoch {epoch+1}: w = {w:.3f}, loss = {loss:.8f}')
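
For the batch case, the gradient of the mean loss is the mean of the per-sample gradients: `mean(2 * (w * X - Y) * X)`. The sketch below checks that hand-derived formula against autograd at the starting point `w = 0`; `manual_grad` is a name introduced only for this comparison.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0, 4.0])
Y = torch.tensor([2.0, 4.0, 6.0, 8.0])
w = torch.tensor(0.0, requires_grad=True)

# Autograd gradient of the batch-averaged loss.
loss = torch.mean((w * X - Y) ** 2)
loss.backward()

# Hand-derived gradient: average of the per-sample gradients 2 * (w*x - y) * x.
with torch.no_grad():
    manual_grad = torch.mean(2 * (w * X - Y) * X)

print(f"autograd: {w.grad}, manual: {manual_grad}")
```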

### 4. Using a PyTorch Optimizer

Manually updating weights and zeroing gradients is repetitive. PyTorch provides **optimizers** in `torch.optim` that handle this for us.

The two key methods are:
- `optimizer.step()`: Updates all the parameters that were passed to the optimizer during its creation.
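- `optimizer.zero_grad()`: Resets the gradients of all managed parameters. In recent PyTorch versions the gradients are set to `None` by default rather than to zero tensors; either way, the next `backward()` starts fresh.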
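
For plain `SGD` without momentum or weight decay, one `optimizer.step()` should match the manual update `w -= learning_rate * w.grad`. The sketch below runs a single step both ways on the batch data; `w_manual` and `w_optim` are names introduced just for this comparison.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0, 4.0])
Y = torch.tensor([2.0, 4.0, 6.0, 8.0])
learning_rate = 0.01

# Path A: manual update, as in the previous sections.
w_manual = torch.tensor(0.0, requires_grad=True)
loss = torch.mean((w_manual * X - Y) ** 2)
loss.backward()
with torch.no_grad():
    w_manual -= learning_rate * w_manual.grad
w_manual.grad.zero_()

# Path B: the same step performed by torch.optim.SGD.
w_optim = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=learning_rate)
loss = torch.nn.MSELoss()(w_optim * X, Y)
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"manual: {w_manual.item():.6f}, optimizer: {w_optim.item():.6f}")
```

Other optimizers in `torch.optim` apply different update rules, but they are used through the same `step()` / `zero_grad()` calls.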

In [None]:
# --- Setup with Optimizer ---
X = torch.tensor([1.0, 2.0, 3.0, 4.0])
Y = torch.tensor([2.0, 4.0, 6.0, 8.0])
w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.01

# Define the model (for this simple case, it's just a multiplication)
def model(x):
    return w * x

# Define the loss function
loss_fn = torch.nn.MSELoss()

# Define the optimizer
# We pass it the parameters it should manage ([w]) and the learning rate.
optimizer = torch.optim.SGD([w], lr=learning_rate)

# --- Training Loop with Optimizer ---
for epoch in range(20):
    # 1. Forward pass
    Y_pred = model(X)
    
    # 2. Compute loss
    loss = loss_fn(Y_pred, Y)
    
    # 3. Backward pass
    loss.backward()
    
    # 4. Update weights with optimizer
    optimizer.step()
    
    # Zero the gradients with optimizer
    optimizer.zero_grad()
    
    if (epoch + 1) % 2 == 0:
        print(f'Epoch {epoch+1}: w = {w:.3f}, loss = {loss:.8f}')

## Part 2: Exercises & Debugging (90 mins)

### Lab 6.1: Learning a Linear Model (`y = w*x + b`)

* **Task:** The true relationship between our data is now `y = 3*x + 2`. Your goal is to find the correct values for both `w` and `b`.
  1.  Set up the data `X` and `Y` based on the true relationship.
  2.  Initialize two parameters, `w` and `b`, to random values (or zero) and make sure they are created with `requires_grad=True`.
  3.  Set up an `SGD` optimizer to manage **both** `w` and `b`.
  4.  Complete the training loop to learn the values of `w` and `b`.
  5.  Print the final learned `w` and `b` after training.

In [None]:
# --- Your Code Here ---

# 1. Setup Data
X_true = torch.arange(10, dtype=torch.float32)
Y_true = 3 * X_true + 2 # True relationship

# 2. Initialize parameters
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Hyperparameters
learning_rate = 0.01
n_epochs = 50

# 3. Define Optimizer (managing both w and b)
optimizer = torch.optim.SGD([w, b], lr=learning_rate)
loss_fn = torch.nn.MSELoss()

# 4. Training Loop
for epoch in range(n_epochs):
    # Forward pass
    Y_pred = w * X_true + b
    
    # Calculate loss
    loss = loss_fn(Y_pred, Y_true)
    
    # Backward pass
    loss.backward()
    
    # Update parameters and zero gradients
    optimizer.step()
    optimizer.zero_grad()
    
    if (epoch + 1) % 5 == 0:
        print(f'Epoch {epoch+1}: w = {w:.3f}, b = {b:.3f}, loss = {loss:.4f}')

# 5. Print final results
print("\n--- Final Learned Parameters ---")
print(f"Learned w: {w:.4f} (True value is 3)")
print(f"Learned b: {b:.4f} (True value is 2)")
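
If the learned values are still some way from 3 and 2 after 50 epochs, one thing to try is simply training for longer: with this data and learning rate, `b` tends to converge much more slowly than `w`. The sketch below reruns the same setup with `n_epochs = 5000`, a value assumed here for illustration rather than a tuned setting.

```python
import torch

X_true = torch.arange(10, dtype=torch.float32)
Y_true = 3 * X_true + 2  # True relationship

w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

optimizer = torch.optim.SGD([w, b], lr=0.01)
loss_fn = torch.nn.MSELoss()

n_epochs = 5000  # assumption: far more than the 50 used above
for epoch in range(n_epochs):
    loss = loss_fn(w * X_true + b, Y_true)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(f"Learned w: {w.item():.4f} (True value is 3)")
print(f"Learned b: {b.item():.4f} (True value is 2)")
```

Comparing the printed values at 50 and 5000 epochs is a quick way to tell slow convergence apart from a bug in the loop.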