### Day 3 — Backpropagation (Core)

**Goals**
- Understand why gradients are needed.
- Derive gradients for a single neuron (sigmoid) step-by-step.
- Implement manual backprop for 1 neuron and autograd for a small MLP in PyTorch.
- Be ready to explain `loss.backward()`, `optimizer.zero_grad()` and `optimizer.step()`.


### YouTube video
https://youtu.be/Ilg3gGewQ5U?si=EYpJXJrUgL7s8ATf



In [1]:

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt


### Quick theory (formulas to memorize + explanation)

We consider a **single neuron** with input `x`, weights `w`, bias `b`, and target `y`.

---

### Step 1: Linear combination (before activation)

$$
z = w \cdot x + b
$$

**Meaning:**
- The neuron first computes a weighted sum of inputs.
- Each input is multiplied by its corresponding weight.
- Bias `b` shifts the result.

Example:
If  
x = [2, 3]  
w = [0.5, 0.8]  
b = 1  

Then:
z = (0.5×2) + (0.8×3) + 1 = 1.0 + 2.4 + 1 = 4.4

This value `z` is called the **pre-activation** or **logit**.
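
The arithmetic above can be checked in a few lines of plain Python. This is a minimal sketch; the helper name `linear_combination` is made up for illustration and is not part of the notebook.

```python
def linear_combination(x, w, b):
    """Weighted sum of inputs plus bias: z = sum_i(w_i * x_i) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x = [2.0, 3.0]   # inputs from the example above
w = [0.5, 0.8]   # weights from the example above
b = 1.0          # bias from the example above

z = linear_combination(x, w, b)
print(f"z = {z:.1f}")  # 0.5*2 + 0.8*3 + 1 = 4.4
```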

---

### Step 2: Activation function

$$
a = \sigma(z)
$$

Where:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

**Meaning:**
- The activation function introduces **non-linearity**.
- Sigmoid squashes the output between **0 and 1**.
- This is useful for **binary classification**.

So:
- If z is large and positive → a ≈ 1
- If z is large and negative → a ≈ 0
- If z = 0 → a = 0.5

`a` is the **predicted output** of the neuron.
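
A quick numeric look at the squashing behaviour may help: large positive `z` should give values near 1, large negative `z` values near 0, and `z = 0` exactly 0.5. The sketch below uses only the standard library; the sample `z` values are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Sample points chosen for illustration only.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f} -> a = {sigmoid(z):.5f}")
```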

---

### Step 3: Loss function (error measurement)

$$
L = \frac{1}{2}(a - y)^2
$$

**Meaning:**
- This measures how far the prediction `a` is from the true label `y`.
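- This is the **squared error** for one sample, the per-sample term of **Mean Squared Error (MSE)**.
- The factor 1/2 is a convenience: it cancels the 2 that appears when differentiating the square.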

If:
- prediction = 0.8
- target = 1.0

Then:
Loss = 0.5 × (0.8 − 1)² = 0.5 × 0.04 = 0.02

The goal of training:
→ **Minimize this loss**
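
The loss calculation above, as a small sketch in plain Python. The function name `half_squared_error` is an illustrative choice, not something defined elsewhere in this notebook.

```python
def half_squared_error(a, y):
    """Per-sample loss L = 1/2 * (a - y)^2."""
    return 0.5 * (a - y) ** 2

loss = half_squared_error(0.8, 1.0)
print(f"loss = {loss:.4f}")  # 0.5 * (-0.2)^2 = 0.02

# A prediction closer to the target gives a smaller loss.
closer = half_squared_error(0.9, 1.0)
print(f"closer prediction: loss = {closer:.4f}")
```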

---

## Backpropagation: Chain rule pieces

We want:

$$
\frac{\partial L}{\partial w}
$$

Meaning:
**How does the loss change if we change the weight?**

We compute this using the **chain rule**.

---

### Step 4: Loss with respect to output

$$
\frac{\partial L}{\partial a} = (a - y)
$$

**Meaning:**
- This tells us how sensitive the loss is to the output.
- If prediction is too big → gradient positive
- If prediction is too small → gradient negative

This is the **error signal** at the output.

---

### Step 5: Output with respect to z

$$
\frac{\partial a}{\partial z} = a(1-a)
$$

**Meaning:**
- This is the derivative of the sigmoid function.
- It tells us how the output changes when `z` changes.
- This term controls how strongly the error flows backward.
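
One way to convince yourself that `a(1 - a)` really is the sigmoid derivative is to compare it with a finite-difference estimate. The sketch below assumes a central difference with step `eps = 1e-6`; both the step size and the sample points are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: a * (1 - a), where a = sigmoid(z)."""
    a = sigmoid(z)
    return a * (1.0 - a)

eps = 1e-6  # assumed step size for the numeric estimate
for z in (-4.0, 0.0, 4.0):
    # Central difference approximates d(sigmoid)/dz at z.
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(f"z = {z:5.1f}  analytic = {sigmoid_grad(z):.6f}  numeric = {numeric:.6f}")
```

The derivative peaks at 0.25 when `z = 0` and shrinks toward 0 as `|z|` grows, so a saturated neuron passes very little error backward.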

---

### Step 6: z with respect to weights and bias

$$
\frac{\partial z}{\partial w_i} = x_i
\quad , \quad
\frac{\partial z}{\partial b} = 1
$$

**Meaning:**
From:
z = w·x + b

If we slightly change a weight:
- z changes in proportion to the corresponding input value.

So:
- If input is large → weight has big effect
- If input is zero → weight has no effect

Bias always affects z equally, so:
dz/db = 1
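
These partial derivatives can be seen directly by nudging one parameter at a time and watching how `z` moves. The sketch below is illustrative: the input `[2.0, 0.0]` is chosen here so that one feature is zero, and `eps` is an arbitrary small nudge.

```python
def z_of(w, x, b):
    """z = sum_i(w_i * x_i) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x = [2.0, 0.0]   # assumed input: second feature is zero on purpose
w = [0.5, 0.8]
b = 1.0
eps = 1e-3       # size of the nudge

base = z_of(w, x, b)
slopes = []
for i in range(len(w)):
    nudged = list(w)
    nudged[i] += eps                      # change only weight i
    slope = (z_of(nudged, x, b) - base) / eps
    slopes.append(slope)
    print(f"dz/dw{i} ~ {slope:.3f}  (x{i} = {x[i]})")

slope_b = (z_of(w, x, b + eps) - base) / eps  # change only the bias
print(f"dz/db  ~ {slope_b:.3f}")
```

The slope for each weight matches its input value, and the weight attached to the zero input has no effect on `z`.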

---

## Final combined gradients (chain rule)

### Gradient with respect to weight

$$
\frac{\partial L}{\partial w_i}
=
\frac{\partial L}{\partial a}
\cdot
\frac{\partial a}{\partial z}
\cdot
x_i
$$

**Meaning:**
Weight update depends on:
1. Output error
2. Activation sensitivity
3. Input value

This shows:
- If input is zero → weight won’t change
- If error is large → weight changes more

---

### Gradient with respect to bias

$$
\frac{\partial L}{\partial b}
=
\frac{\partial L}{\partial a}
\cdot
\frac{\partial a}{\partial z}
$$

**Meaning:**
- Bias update depends only on:
  - error
  - activation derivative
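
The combined formulas can be compared against PyTorch autograd. This is a sketch under assumed values (`x = [1, 0]`, `y = 1`, parameters starting at zero), picked so the arithmetic is easy to follow; `torch.dot`, `torch.sigmoid`, and `.backward()` are standard PyTorch calls.

```python
import torch

x = torch.tensor([1.0, 0.0])
y = torch.tensor(1.0)
w = torch.tensor([0.0, 0.0], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Forward pass: z -> a -> L
z = torch.dot(w, x) + b
a = torch.sigmoid(z)
loss = 0.5 * (a - y) ** 2

# Autograd applies the chain rule and stores results in w.grad and b.grad.
loss.backward()

# The same gradients from the formulas above.
with torch.no_grad():
    dL_dz = (a - y) * a * (1 - a)   # dL/da * da/dz
    manual_dw = dL_dz * x           # dL/dw_i = dL/dz * x_i
    manual_db = dL_dz               # dL/db   = dL/dz * 1

print("autograd dL/dw:", w.grad.tolist(), " manual:", manual_dw.tolist())
print("autograd dL/db:", b.grad.item(), " manual:", manual_db.item())
```

Both routes should agree: the gradient for the first weight and for the bias is -0.125, and the weight attached to the zero input gets a zero gradient.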

---

## Weight update rule (Gradient Descent)

After computing gradients:

$$
w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w}
$$

$$
b_{\text{new}} = b_{\text{old}} - \eta \frac{\partial L}{\partial b}
$$

Where:
- $\eta$ = learning rate
- It controls how big each update step is
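
To get a feel for the learning rate, the sketch below applies one update with several values of $\eta$ to a single sigmoid neuron with a scalar input. The setup (`x = 1`, `y = 1`, parameters starting at zero) and the helper `loss_and_grads` are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, x=1.0, y=1.0):
    """Half squared error and its gradients (dL/dw, dL/db) for one sigmoid neuron."""
    a = sigmoid(w * x + b)
    dL_dz = (a - y) * a * (1.0 - a)
    return 0.5 * (a - y) ** 2, dL_dz * x, dL_dz

w0, b0 = 0.0, 0.0
loss_before, dw, db = loss_and_grads(w0, b0)

for eta in (0.01, 0.1, 1.0):
    # Gradient descent: step against the gradient, scaled by eta.
    w_new = w0 - eta * dw
    b_new = b0 - eta * db
    loss_after, _, _ = loss_and_grads(w_new, b_new)
    print(f"eta = {eta:<4}  w_new = {w_new:.4f}  loss: {loss_before:.6f} -> {loss_after:.6f}")
```

A larger $\eta$ moves the parameters further in one step. In this tiny problem that lowers the loss faster, though on harder problems a step that is too large can overshoot.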

---

## Big Picture (Training Steps)

1. Forward pass:
   - Compute prediction

2. Compute loss:
   - Compare prediction with target

3. Backward pass:
   - Compute gradients using chain rule

4. Update weights:
   - Move each parameter a small step against its gradient, which reduces the loss

Repeat many times.





### Code (manual numeric example — single neuron)

In [2]:
# Manual numeric example (single neuron)
# Small, reproducible numbers so we can check digit-by-digit.

# Data
x = torch.tensor([1.0, 0.0])   # input features
y = torch.tensor(1.0)          # target output

# Initialize parameters (simple values)
w = torch.tensor([0.0, 0.0])   # start at zero for clarity
b = torch.tensor(0.0)

# Sigmoid function = 1 / (1 + exp(-z))
def sigmoid(t):
    return 1.0 / (1.0 + torch.exp(-t))

# Forward pass
z = w[0]*x[0] + w[1]*x[1] + b  # z = 0*1 + 0*0 + 0 = 0
a = sigmoid(z)                  # sigmoid(0) = 0.5

# Loss (1/2 (a - y)^2)
loss = 0.5 * (a - y)**2

print("Forward results:")
print(" z =", z.item())
print(" a =", a.item())
print(" loss =", loss.item())

# Backprop by hand (chain rule)
dL_da = (a - y)                  # = 0.5 - 1 = -0.5
da_dz = a * (1 - a)              # 0.5 * 0.5 = 0.25
dL_dz = dL_da * da_dz            # -0.5 * 0.25 = -0.125

dL_dw0 = dL_dz * x[0]            # -0.125 * 1 = -0.125 (dl/dw0 = dl/da * da/dz * x[0])
dL_dw1 = dL_dz * x[1]            # -0.125 * 0 = -0.0
dL_db  = dL_dz * 1.0             # -0.125

print("\nGradients (by-hand):")   # the gradients tell us how to change the weights to reduce the loss
print(" dL/da =", dL_da.item())
print(" da/dz =", da_dz.item())
print(" dL/dz =", dL_dz.item())
print(" dL/dw0 =", dL_dw0.item())
print(" dL/dw1 =", dL_dw1.item())
print(" dL/db  =", dL_db.item())

# Weight update example: learning_rate (eta) = 0.1
lr = 0.1
grad_step = lr * dL_dw0         # -0.0125 
new_w0 = w[0] - grad_step       # 0 - (-0.0125) = 0.0125
new_b  = b - lr * dL_db         # 0 - 0.1 * (-0.125) = 0.0125

print("\nWeight update (lr=0.1):")
print(" lr * dL_dw0 =", grad_step.item())
print(" new w0 =", new_w0.item())
print(" new b  =", new_b.item())


Forward results:
 z = 0.0
 a = 0.5
 loss = 0.125

Gradients (by-hand):
 dL/da = -0.5
 da/dz = 0.25
 dL/dz = -0.125
 dL/dw0 = -0.125
 dL/dw1 = -0.0
 dL/db  = -0.125

Weight update (lr=0.1):
 lr * dL_dw0 = -0.012500000186264515
 new w0 = 0.012500000186264515
 new b  = 0.012500000186264515



### Code (manual backprop training loop — single neuron)

In [7]:
# Manual training loop for the same single-neuron (one-sample) case
x = torch.tensor([1.0, 0.0])
y = torch.tensor(1.0)

# Initialize parameters (float tensors we will update manually)
w = torch.tensor([0.0, 0.0])
b = torch.tensor(0.0)

def sigmoid(t):
    return 1.0 / (1.0 + torch.exp(-t))

lr = 0.1
print("Manual training (5 epochs):")  
for epoch in range(1, 6):        # 5 manual updates
    # Forward
    z = w[0]*x[0] + w[1]*x[1] + b
    a = sigmoid(z)
    loss = 0.5 * (a - y)**2

    # Backprop (same derivatives as above)
    dL_da = (a - y)
    da_dz = a * (1 - a)
    dL_dz = dL_da * da_dz
    dL_dw0 = dL_dz * x[0]
    dL_dw1 = dL_dz * x[1]
    dL_db  = dL_dz * 1.0

    # Update: gradient descent step, new = old - lr * gradient
    w = w - lr * torch.stack([dL_dw0, dL_dw1])
    b = b - lr * dL_db

    print(f"Epoch {epoch}: loss={loss.item():.6f}, w={w.numpy()}, b={b.item():.6f}")   


Manual training (5 epochs):
Epoch 1: loss=0.125000, w=[0.0125 0.    ], b=0.012500
Epoch 2: loss=0.121895, w=[0.02484183 0.        ], b=0.024842
Epoch 3: loss=0.118868, w=[0.03702385 0.        ], b=0.037024
Epoch 4: loss=0.115919, w=[0.04904478 0.        ], b=0.049045
Epoch 5: loss=0.113049, w=[0.06090366 0.        ], b=0.060904


#### **What to observe in the manual loop**
- The loss should decrease or move toward lower values.
- Signs of gradients determine whether weights increase or decrease.
- Every update uses: new_weight = old_weight - lr * gradient.
- This manual loop is educational — in practice we use autograd.
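
For comparison, here is a sketch of the same five updates with autograd and `torch.optim.SGD` doing the work. It assumes the same data, zero initialization, learning rate, and half-squared-error loss as the manual loop, so the printed losses should match the manual ones up to floating-point rounding.

```python
import torch

x = torch.tensor([1.0, 0.0])
y = torch.tensor(1.0)

# Same zero initialization as the manual loop, but tracked by autograd.
w = torch.zeros(2, requires_grad=True)
b = torch.zeros((), requires_grad=True)

optimizer = torch.optim.SGD([w, b], lr=0.1)

losses = []
for epoch in range(1, 6):
    # 1. Forward pass
    a = torch.sigmoid(torch.dot(w, x) + b)
    # 2. Loss: half squared error, matching the manual derivation
    loss = 0.5 * (a - y) ** 2
    # 3. Backward pass: clear stale gradients, then apply the chain rule
    optimizer.zero_grad()
    loss.backward()
    # 4. Update: param <- param - lr * grad
    optimizer.step()

    losses.append(loss.item())
    print(f"Epoch {epoch}: loss={loss.item():.6f}, w={w.detach().tolist()}, b={b.item():.6f}")
```

The three calls map onto the manual code: `loss.backward()` replaces the hand-written derivatives, `optimizer.step()` replaces the `w - lr * grad` lines, and `optimizer.zero_grad()` resets the stored gradients before each new backward pass.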


### Code (PyTorch autograd example: small MLP single-sample)

In [8]:
# PyTorch autograd example: small MLP on a single sample to inspect gradients
# Weights are randomly initialized and no seed is set, so exact numbers vary between runs.
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid()
)

# Single data point
x = torch.tensor([1.0, 0.0])         # shape (2,)
y = torch.tensor([1.0])              # shape (1,)

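# Setup loss and optimizer
# Note: nn.MSELoss computes mean((pred - y)^2) with no 1/2 factor,
# so its gradients are twice those of the L = 1/2 (a - y)^2 formula above.
criterion = nn.MSELoss()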
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Before training: forward and inspect loss
pred = model(x)
loss = criterion(pred, y)
print("Before training: loss =", loss.item())

# Backprop with autograd
optimizer.zero_grad()
loss.backward()   # compute gradients for all parameters

print("\nGradients (autograd) — mean value per parameter tensor:")
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad mean={param.grad.mean().item():.6f}, shape={param.grad.shape}")

# Step (update parameters)
optimizer.step()

# After one step
pred2 = model(x)
loss2 = criterion(pred2, y)
print("\nAfter 1 optimizer.step(): loss =", loss2.item())


Before training: loss = 0.11452315747737885

Gradients (autograd) — mean value per parameter tensor:
0.weight: grad mean=-0.006758, shape=torch.Size([3, 2])
0.bias: grad mean=-0.013515, shape=torch.Size([3])
2.weight: grad mean=-0.083264, shape=torch.Size([1, 3])
2.bias: grad mean=-0.151534, shape=torch.Size([1])

After 1 optimizer.step(): loss = 0.10778116434812546
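
Why call `optimizer.zero_grad()` before `loss.backward()`? PyTorch adds each new gradient to whatever is already stored in `.grad`. The sketch below shows this on a single scalar weight; the setup (`x = 1`, `y = 1`, `w = 0`) and the helper `half_squared_error` are assumptions for illustration, and `w.grad.zero_()` stands in for the reset that `optimizer.zero_grad()` performs in a training loop.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
x, y = torch.tensor(1.0), torch.tensor(1.0)

def half_squared_error(w):
    a = torch.sigmoid(w * x)
    return 0.5 * (a - y) ** 2

half_squared_error(w).backward()
first = w.grad.clone()         # gradient after one backward pass

half_squared_error(w).backward()
accumulated = w.grad.clone()   # second backward pass adds to the stored gradient

w.grad.zero_()                 # manual reset of the stored gradient
half_squared_error(w).backward()
fresh = w.grad.clone()

print("first backward:       ", first.item())
print("second, no zeroing:   ", accumulated.item())
print("after zeroing + pass: ", fresh.item())
```

Without the reset, the stored gradient doubles to -0.25 even though the true gradient is still -0.125, so an update based on it would take a step twice as large as intended.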
