# PyTorch Deep Dive: Autograd and Gradients

Welcome to the "Engine Room" of PyTorch. 

In this notebook, we will learn how machines actually learn. But before we talk about mountains or hikers, we need to define the **Language of Learning**.

## Learning Objectives
- **The Vocabulary**: What is a "Loss Function" and "Learning Rate"?
- **The Intuition**: The "Blindfolded Hiker" analogy (easier to follow once we know the terms).
- **The Mechanism**: The "Recording Tape" analogy (`requires_grad`).
- **The Math**: What exactly is a gradient? (Sensitivity).


In [2]:
import torch
import matplotlib.pyplot as plt
import numpy as np

torch.manual_seed(42)
print("PyTorch version:", torch.__version__)

PyTorch version: 2.3.0


## Part 1: The Vocabulary (Definitions First)

Before we can understand *how* AI learns, we need to agree on what "Learning" means.

### 1. The Loss Function (The Scorecard)
Imagine you are taking a test. You get a score of 80/100. Your "Error" is 20.
In AI, we call this "Error" the **Loss**.

- **High Loss** = The model is doing a bad job (e.g., calling a Cat a Dog).
- **Low Loss** = The model is doing a good job.
- **Goal**: Make the Loss as close to **0** as possible.
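
To make the scorecard concrete, here is a minimal sketch that computes one common loss, the mean squared error. The prediction and target values are made up for illustration, and real models often use other loss functions.

```python
import torch

# Hypothetical predictions and targets, chosen only for illustration
predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

# Mean squared error: the average of the squared differences
loss = ((predictions - targets) ** 2).mean()
print(f"Loss: {loss.item():.4f}")

# Predictions that match the targets exactly give a loss of 0
perfect_loss = ((targets - targets) ** 2).mean()
print(f"Perfect loss: {perfect_loss.item():.4f}")
```

The further the predictions drift from the targets, the larger this number becomes.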

### 2. The Learning Rate (The Step Size)
When the model realizes it made a mistake, it needs to change its numbers (weights) to fix it.
But *how much* should it change them?

- **Learning Rate**: A small positive number (0.01 and 0.001 are common starting values) that controls how big a change we make.
- **Too Big**: You might overshoot the correct answer.
- **Too Small**: It will take forever to learn.
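
The sketch below illustrates both failure modes on a toy problem. It assumes a hypothetical loss `f(w) = w**2`, whose slope is `2 * w` and whose best value is `w = 0`; the helper `descend` and the three learning rates are illustrative choices, not recommended settings.

```python
def descend(learning_rate, steps=20, start=5.0):
    """Repeatedly step downhill on f(w) = w**2, whose slope is 2*w."""
    w = start
    for _ in range(steps):
        slope = 2 * w
        w = w - learning_rate * slope
    return w

# Too big, reasonable, and too small (all values are illustrative)
for lr in (1.1, 0.1, 0.0001):
    print(f"lr={lr}: w after 20 steps = {descend(lr):.4f}")
```

With the largest rate, every step overshoots and `w` ends up further from 0 than where it started. With the smallest rate, `w` has barely moved after 20 steps.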

## Part 2: The Intuition (The Blindfolded Hiker)

Now that we know the terms, let's visualize them.

Imagine you are standing on a mountain at night. You are blindfolded. Your goal is to get to the very bottom of the valley.

- **The Mountain**: This is the **Loss Function**. High elevation means High Error. Low elevation means Low Error.
- **Your Location**: This represents the current **Parameters** (Weights) of your model.
- **The Slope**: This is the **Gradient**. It tells you which way is "Up".
- **The Step**: This is the **Learning Rate**. You take a step *opposite* to the slope to go down.

**The Algorithm (Gradient Descent):**
1. Feel the slope with your feet (Calculate Gradient).
2. Turn around to face downhill (Negative Gradient).
3. Take a small step (Learning Rate).
4. Repeat until you reach the bottom (the lowest Loss you can find, which is not always exactly 0).
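
The four steps above can be sketched in a few lines of PyTorch. This is a toy example, not a full training loop: the bowl-shaped `loss_fn` with its lowest point at `w = 3`, the starting point, and the learning rate are all assumptions made for illustration.

```python
import torch

# Hypothetical "mountain": a bowl whose lowest point is at w = 3
def loss_fn(w):
    return (w - 3.0) ** 2

w = torch.tensor(0.0, requires_grad=True)  # the hiker's starting location
learning_rate = 0.1

for step in range(50):
    loss = loss_fn(w)
    loss.backward()                    # 1. feel the slope (w.grad)
    with torch.no_grad():
        w -= learning_rate * w.grad    # 2-3. take a small step downhill
    w.grad.zero_()                     # clear the old slope before repeating

print(f"w after 50 steps: {w.item():.4f}")
```

The update runs inside `torch.no_grad()` so that the step itself is not recorded as part of the computation.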

## Part 3: The Mechanism (The Recording Tape)

How does PyTorch know the slope of a complex mountain?

Imagine a cassette tape recorder. 

When you create a tensor with `requires_grad=True`, you press **RECORD**.
PyTorch silently watches every operation you do (addition, multiplication, power) and writes it down on a "tape" (the Computation Graph).

When you call `.backward()`, PyTorch **rewinds the tape** and calculates the derivatives step-by-step using the Chain Rule.

In [3]:
# Let's see the "Recording Tape" in action

# 1. Create a tensor and press RECORD
x = torch.tensor(3.0, requires_grad=True)
print(f"x: {x}")

# 2. Do some operations (PyTorch is recording!)
y = x + 2
z = y * y * 3
out = z / 4

print(f"Output: {out}")

# 3. Look at the tape (The 'grad_fn')
# This tells us what operation created this variable
print(f"Function that created 'out': {out.grad_fn}")
print(f"Function that created 'z':   {z.grad_fn}")
print(f"Function that created 'y':   {y.grad_fn}")

x: 3.0
Output: 18.75
Function that created 'out': <DivBackward0 object at 0x13ccb3ca0>
Function that created 'z':   <MulBackward0 object at 0x13ccb3ca0>
Function that created 'y':   <AddBackward0 object at 0x13ccb3ca0>
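
As a short follow-up sketch, calling `.backward()` on the same chain of operations rewinds the tape and fills in `x.grad`. The hand-derived value in `expected` is included only as a cross-check.

```python
import torch

# Same operations as in the cell above
x = torch.tensor(3.0, requires_grad=True)
y = x + 2
z = y * y * 3
out = z / 4

# Rewind the tape: compute d(out)/dx with the Chain Rule
out.backward()
print(f"x.grad: {x.grad.item()}")

# By hand: out = 3 * (x + 2)**2 / 4, so d(out)/dx = 1.5 * (x + 2)
expected = 1.5 * (3.0 + 2)
print(f"By hand: {expected}")
```

Only `x` receives a `.grad` here, because it is the tensor that was created with `requires_grad=True`; `y`, `z`, and `out` are intermediate results.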


## Part 4: The Math (Sensitivity)

What does a gradient such as `x.grad = 4.5` actually mean?

It means **Sensitivity**.

If `x.grad` is 4.5, it means:
"If I increase `x` by a tiny amount (say 0.001), the output `y` will increase by approximately **4.5 times** that amount (about 0.0045)."

$$ \frac{dy}{dx} \approx \frac{\Delta y}{\Delta x} $$

Let's check this with code.

In [4]:
# Define a simple function y = x^2
def f(x):
    return x ** 2

x = torch.tensor(4.0, requires_grad=True)
y = f(x)
y.backward()
gradient = x.grad.item()

print(f"At x=4, the gradient is {gradient}")
print("This means if we nudge x by +0.001, y should change by approx 0.008 (8 * 0.001)")

# Let's verify manually
delta = 0.001
x_nudged = 4.0 + delta
y_nudged = x_nudged ** 2

y_original = 4.0 ** 2
actual_change = y_nudged - y_original
predicted_change = gradient * delta

print(f"Actual change in y:    {actual_change:.5f}")
print(f"Predicted change (8*d): {predicted_change:.5f}")
print("See? The gradient predicts how the output responds to input changes!")

At x=4, the gradient is 8.0
This means if we nudge x by +0.001, y should change by approx 0.008 (8 * 0.001)
Actual change in y:    0.00800
Predicted change (8*d): 0.00800
See? The gradient predicts how the output responds to input changes!


## Part 5: The "Gotcha" - Accumulating Gradients

This is the #1 bug for beginners.

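PyTorch **accumulates** (adds) gradients by default: each call to `.backward()` adds to whatever is already stored in `.grad` instead of replacing it. This is useful when you want to sum gradients over several backward passes (for example, with RNNs or when accumulating over small batches), but bad for standard training, where each step should see only its own gradient.

**ALWAYS zero your gradients before the next step** (when you train with an optimizer, call `optimizer.zero_grad()`).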

In [5]:
weights = torch.ones(2, requires_grad=True)

for i in range(3):
    loss = (weights * 3).sum()
    loss.backward()
    print(f"Step {i}, Gradients: {weights.grad}")
    
    # If we don't zero them, they just keep growing!
    # Uncomment the line below to fix it:
    # weights.grad.zero_()

Step 0, Gradients: tensor([3., 3.])
Step 1, Gradients: tensor([6., 6.])
Step 2, Gradients: tensor([9., 9.])
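
For comparison, here is a sketch of the same loop with the zeroing line enabled. The `history` list is an addition used only to record each step's gradient.

```python
import torch

weights = torch.ones(2, requires_grad=True)
history = []  # illustrative: keeps a copy of the gradient from each step

for i in range(3):
    loss = (weights * 3).sum()
    loss.backward()
    history.append(weights.grad.clone())
    print(f"Step {i}, Gradients: {weights.grad}")

    # Reset the stored gradient so the next step starts from zero
    weights.grad.zero_()
```

With the reset in place, every step reports the same gradient instead of a growing one.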


## Summary Checklist

1. **Loss Function** = The Scorecard (Lower is better).
2. **Learning Rate** = The Step Size (Don't be too greedy).
3. **Gradient** = The Slope (Direction of steepest ascent).
4. **Gradient Descent** = Go opposite the gradient to find the bottom.
5. **`requires_grad=True`** = Turn on the "Recording Tape".
6. **Accumulating Gradients** = Zero your gradients before the next step.

You are now ready to build Neural Networks.