# PyTorch Tutorial: Autograd and Gradients

This notebook covers one of PyTorch's most powerful features: **automatic differentiation**. This is what makes training neural networks possible!

## Learning Objectives

By the end of this notebook, you will:
- Understand what gradients are and why they're important
- Learn how PyTorch automatically computes gradients
- Use `requires_grad` to enable gradient tracking
- Compute gradients manually and automatically
- Understand the relationship between gradients and backpropagation

---

## What is a Gradient?

In simple terms, a **gradient** tells us:
- **Which direction** to change our parameters to improve our model
- **How much** to change them

Think of it like hiking: the gradient points uphill. In machine learning, we want to go "downhill" (minimize error), so we move in the **opposite direction** of the gradient.

### Mathematical Definition

For a function f(x), the gradient (derivative) tells us the rate of change:
- If gradient is positive → function increases as x increases
- If gradient is negative → function decreases as x increases
- If gradient is zero → we're at a stationary point (a minimum, a maximum, or a flat inflection point)

**In neural networks**: We use gradients to update weights and biases to minimize the loss (error).
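The sign rules above can be checked numerically before bringing in PyTorch. The sketch below is only an illustration: the helper name `numerical_derivative` and the step size `h` are assumptions, not part of the tutorial. It estimates the slope with a central difference, and it also shows that a zero derivative does not by itself guarantee a minimum or maximum (x³ is flat at x = 0 but has neither there).

```python
def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of df/dx at x (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x):
    return x ** 2


def cube(x):
    return x ** 3


for x0 in (-2.0, 0.0, 2.0):
    slope = numerical_derivative(square, x0)
    if slope > 1e-6:
        trend = "increasing"
    elif slope < -1e-6:
        trend = "decreasing"
    else:
        trend = "flat (stationary point)"
    print(f"x = {x0:+.1f}: slope of x^2 is about {slope:+.4f} -> {trend}")

# x^3 also has (almost) zero slope at x = 0, yet x = 0 is neither a
# minimum nor a maximum of x^3.
print("slope of x^3 at 0:", numerical_derivative(cube, 0.0))
```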



## Setting Up

Let's import what we need and set up for gradient computation:


In [None]:
import torch
import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)

print("PyTorch version:", torch.__version__)


PyTorch version: 2.3.0



## Understanding requires_grad

To compute gradients, PyTorch needs to track operations on tensors. We enable this with `requires_grad=True`:


In [None]:
# Create a regular tensor (no gradient tracking)
x_normal = torch.tensor(2.0)
print("Normal tensor:", x_normal)
print("requires_grad:", x_normal.requires_grad)
print()

# Create a tensor with gradient tracking enabled
# This tells PyTorch: "I want to compute gradients with respect to this tensor"
x = torch.tensor(2.0, requires_grad=True)
print("Tensor with gradient tracking:")
print("Value:", x)
print("requires_grad:", x.requires_grad)
print()

# Note: When requires_grad=True, PyTorch builds a computation graph
# This graph tracks all operations so gradients can be computed later
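A few related behaviours are worth seeing alongside the cell above. This is a small sketch with illustrative variable names (`w`, `out`, `int_error`): tracking can be switched on after creation with `requires_grad_()`, results of tracked operations are tracked automatically, and integer tensors cannot require gradients.

```python
import torch

# Gradient tracking can be enabled in place on an existing tensor.
w = torch.tensor(2.0)
w.requires_grad_()
print("w.requires_grad:", w.requires_grad, "| w.is_leaf:", w.is_leaf)

# Tensors computed from tracked tensors are tracked too, but they are
# not leaves: they remember the operation that produced them.
out = w * 3
print("out.requires_grad:", out.requires_grad, "| out.is_leaf:", out.is_leaf)

# Only floating point (and complex) tensors can require gradients.
try:
    torch.tensor(2, requires_grad=True)
    int_error = None
except RuntimeError as err:
    int_error = str(err)
print("Integer tensor error:", int_error)
```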


## Simple Example: Computing a Gradient

Let's start with a simple function: y = x²

We'll compute the gradient of y with respect to x:


In [None]:
# Create a tensor that we want to compute gradients for
x = torch.tensor(3.0, requires_grad=True)
print("x =", x)
print()

# Define a function: y = x^2
y = x ** 2
print("y = x^2 =", y)
print()

# Now compute the gradient!
# This computes dy/dx (derivative of y with respect to x)
y.backward()  # This triggers gradient computation

# Access the gradient
print("Gradient dy/dx =", x.grad)
print()

# Let's verify: For y = x^2, dy/dx = 2x
# At x = 3, dy/dx = 2 * 3 = 6
print("Manual calculation: 2 * x = 2 * 3 =", 2 * 3)
print("PyTorch gradient:", x.grad.item())
print("Match!", abs(x.grad.item() - 6) < 0.0001)
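As a complement to `.backward()`, the same derivative can be requested directly with `torch.autograd.grad`, which returns the gradient instead of storing it in `.grad`. The snippet below is a minimal sketch of that alternative, not a replacement for the approach used in this notebook.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# torch.autograd.grad returns a tuple with one gradient per requested input.
(dy_dx,) = torch.autograd.grad(y, x)

print("dy/dx from autograd.grad:", dy_dx.item())
print("x.grad is untouched:", x.grad)  # None, nothing was accumulated
```

Because nothing is written to `x.grad`, there is nothing to zero afterwards, which can be convenient for one-off checks.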


### Important: Zeroing Gradients

**Critical point**: Gradients accumulate! If you call `.backward()` multiple times, gradients add up. Always zero gradients before computing new ones (we'll see why this matters in training):


In [None]:
# Example showing gradient accumulation
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

# First backward pass
# retain_graph=True keeps the graph alive so backward() can be called on y again
y.backward(retain_graph=True)
print("After first backward():", x.grad)

# Second backward pass (gradients accumulate!)
y.backward()  # This will add to existing gradients
print("After second backward():", x.grad)  # Notice it doubled!

# Zero the gradients
x.grad.zero_()  # The underscore means "in-place operation"
print("After zeroing:", x.grad)

# Now compute again
y = x ** 2
y.backward()
print("After new backward():", x.grad)  # Back to correct value
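Accumulation is not a quirk: adding up gradients from separate backward passes gives the gradient of the sum of the functions. The sketch below (variable names are illustrative) checks this for x² and 3x, and also shows that assigning `None` to `.grad` is another way to reset it.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Two separate backward passes on two different functions of x.
(x ** 2).backward()   # contributes 2x = 4
(3 * x).backward()    # contributes 3
accumulated = x.grad.item()

# Reset by dropping the gradient tensor entirely.
x.grad = None

# One backward pass on the sum of the two functions.
(x ** 2 + 3 * x).backward()
combined = x.grad.item()

print("accumulated:", accumulated, "| combined:", combined)
```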


## More Complex Example: Multiple Variables

Let's compute gradients for a function with multiple variables: z = x² + y²


In [None]:
# Create two tensors with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

print("x =", x)
print("y =", y)
print()

# Define function: z = x^2 + y^2
z = x ** 2 + y ** 2
print("z = x^2 + y^2 =", z)
print()

# Compute gradients
z.backward()

# Access gradients for both variables
print("Gradient with respect to x (dz/dx):", x.grad)
print("Gradient with respect to y (dz/dy):", y.grad)
print()

# Verify manually:
# dz/dx = 2x = 2 * 2 = 4
# dz/dy = 2y = 2 * 3 = 6
print("Manual check:")
print("dz/dx = 2x =", 2 * 2)
print("dz/dy = 2y =", 2 * 3)
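The same function can also be written with a single tensor holding both variables, which is closer to how parameters are stored in practice. This is a sketch under that assumption; the name `p` is illustrative.

```python
import torch

# p[0] plays the role of x and p[1] the role of y.
p = torch.tensor([2.0, 3.0], requires_grad=True)

# z = x^2 + y^2, reduced to a scalar so backward() needs no arguments.
z = (p ** 2).sum()
z.backward()

# The gradient has the same shape as p: [dz/dx, dz/dy] = [2x, 2y].
print("z =", z.item())
print("gradient =", p.grad)
```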


## Visualizing Gradients

Let's visualize what gradients mean by plotting a function and its gradient:


In [None]:
# Let's visualize y = x^2 and its gradient at different points
x_values = np.linspace(-3, 3, 100)
y_values = x_values ** 2

# Compute gradients at a few points
gradient_points = [-2, -1, 0, 1, 2]
gradients = []

for point in gradient_points:
    x = torch.tensor(float(point), requires_grad=True)
    y = x ** 2
    y.backward()
    gradients.append(x.grad.item())
    x.grad.zero_()

# Plot
plt.figure(figsize=(10, 6))
plt.plot(x_values, y_values, 'b-', linewidth=2, label='y = x²')
plt.scatter(gradient_points, [p**2 for p in gradient_points], 
           color='red', s=100, zorder=5, label='Points where we computed gradients')

# Draw gradient arrows (tangent lines)
for i, point in enumerate(gradient_points):
    # Gradient is the slope: dy/dx = 2x
    slope = gradients[i]
    # Draw a small line showing the gradient direction
    x_line = np.array([point - 0.5, point + 0.5])
    y_line = point**2 + slope * (x_line - point)
    plt.plot(x_line, y_line, 'r--', linewidth=2, alpha=0.7)

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Function y = x² and its Gradients', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

print("Notice: At x=0, the gradient is 0 (minimum point)")
print("For x>0, gradient is positive (function increases)")
print("For x<0, gradient is negative (function decreases)")


## Gradient Descent: The Core of Training

**Gradient Descent** is the algorithm used to train neural networks. The idea is simple:

1. Compute the gradient (which direction to go)
2. Move in the **opposite** direction (to minimize loss)
3. Repeat until we reach a minimum

Let's implement a simple gradient descent to find the minimum of y = x²:


In [None]:
# Gradient descent to find minimum of y = x^2
# We know the minimum is at x = 0, but let's find it using gradient descent!

# Starting point (we'll start far from the minimum)
x = torch.tensor(5.0, requires_grad=True)

# Learning rate: how big steps we take
# Too large: might overshoot the minimum
# Too small: takes too long to converge
learning_rate = 0.1

# Store history for visualization
history = []

print("Gradient Descent to minimize y = x²")
print("=" * 50)
print(f"Starting point: x = {x.item():.2f}")

# Perform gradient descent for several steps
for step in range(20):
    # Compute function value
    y = x ** 2

    # Compute gradient
    y.backward()
    grad = x.grad.item()  # Save the value for logging before it is zeroed

    # Store the current (x, y) pair before updating, so each point lies on the curve
    history.append((x.item(), y.item()))

    if step % 5 == 0:
        print(f"Step {step}: x = {x.item():.4f}, y = {y.item():.4f}, gradient = {grad:.4f}")

    # Update x: move in opposite direction of gradient
    # x_new = x_old - learning_rate * gradient
    with torch.no_grad():  # We don't want to track gradients for this update
        x -= learning_rate * x.grad

    # Zero gradients for next iteration
    x.grad.zero_()

# Evaluate y at the final x (the y from the loop belongs to the previous x)
with torch.no_grad():
    y = x ** 2
history.append((x.item(), y.item()))

print(f"\nFinal: x = {x.item():.4f}, y = {y.item():.4f}")
print("Expected minimum: x = 0.0, y = 0.0")
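The comments about the learning rate being too large or too small can be made concrete without autograd, since the gradient of y = x² is simply 2x. The sketch below uses a hypothetical helper `descend` and three arbitrarily chosen learning rates to compare the outcomes after 20 steps.

```python
def descend(x0, learning_rate, steps=20):
    """Plain gradient descent on y = x^2 using the known gradient 2x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x
        x = x - learning_rate * grad
    return x


for lr in (0.001, 0.1, 1.1):
    final_x = descend(5.0, lr)
    print(f"learning_rate = {lr}: x after 20 steps = {final_x:.4f}")
```

Each step multiplies x by (1 - 2 * learning_rate). With 0.001 that factor is close to 1, so progress is slow; with 0.1 it is 0.8, so x shrinks steadily; with 1.1 it is -1.2, so x flips sign and grows every step.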


### Visualizing Gradient Descent

Let's see how gradient descent converges to the minimum:


In [None]:
# Extract x and y values from history
x_history = [h[0] for h in history]
y_history = [h[1] for h in history]

# Plot the function and the descent path
x_plot = np.linspace(-1, 5.5, 100)
y_plot = x_plot ** 2

plt.figure(figsize=(12, 5))

# Plot 1: Function and path
plt.subplot(1, 2, 1)
plt.plot(x_plot, y_plot, 'b-', linewidth=2, label='y = x²')
plt.plot(x_history, y_history, 'ro-', linewidth=2, markersize=8, label='Gradient Descent Path')
plt.scatter([x_history[0]], [y_history[0]], color='green', s=200, marker='*', 
           label='Start', zorder=5)
plt.scatter([x_history[-1]], [y_history[-1]], color='red', s=200, marker='*', 
           label='End', zorder=5)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Gradient Descent Path', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()

# Plot 2: Convergence
plt.subplot(1, 2, 2)
plt.plot(range(len(y_history)), y_history, 'ro-', linewidth=2, markersize=8)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('y value', fontsize=12)
plt.title('Convergence to Minimum', fontsize=14)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice how the algorithm converges to x=0, y=0 (the minimum)!")


## Understanding the Computation Graph

PyTorch builds a **computation graph** to track operations. This graph is used to compute gradients via backpropagation.

Let's understand this with a simple example:


In [None]:
# Example: z = (x + y) * (x - y)
# Let's trace through the computation graph

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

print("Inputs:")
print(f"x = {x.item()}, requires_grad = {x.requires_grad}")
print(f"y = {y.item()}, requires_grad = {y.requires_grad}")
print()

# Intermediate computations
a = x + y  # a = 3 + 2 = 5
b = x - y  # b = 3 - 2 = 1

print("Intermediate values:")
print(f"a = x + y = {a.item()}")
print(f"b = x - y = {b.item()}")
print(f"a.requires_grad = {a.requires_grad}")  # Automatically True because it depends on x, y
print()

# Final computation
z = a * b  # z = 5 * 1 = 5

print("Final value:")
print(f"z = a * b = {z.item()}")
print()

# Compute gradients
z.backward()

print("Gradients:")
print(f"dz/dx = {x.grad.item()}")
print(f"dz/dy = {y.grad.item()}")
print()

# Manual verification:
# z = (x+y)(x-y) = x² - y²
# dz/dx = 2x = 2*3 = 6
# dz/dy = -2y = -2*2 = -4
print("Manual check:")
print(f"dz/dx = 2x = {2*3}")
print(f"dz/dy = -2y = {-2*2}")
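To connect the graph to backpropagation, the sketch below rebuilds the same example and looks inside it. It assumes we are happy to call `retain_grad()` on the intermediate tensors `a` and `b` purely for inspection; this is not needed in normal training code. The chain rule is then applied by hand and compared with what autograd stored.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

a = x + y
b = x - y
# Intermediate tensors normally discard their gradients; keep them for inspection.
a.retain_grad()
b.retain_grad()
z = a * b

# Each non-leaf tensor records the operation that created it.
print("z was created by:", type(z.grad_fn).__name__)
print("a was created by:", type(a.grad_fn).__name__)
print("b was created by:", type(b.grad_fn).__name__)

z.backward()

# Local derivatives of z = a * b: dz/da = b and dz/db = a.
# Chain rule: dz/dx = dz/da * 1 + dz/db * 1
#             dz/dy = dz/da * 1 + dz/db * (-1)
dz_dx = a.grad + b.grad
dz_dy = a.grad - b.grad

print("chain rule:", dz_dx.item(), dz_dy.item())
print("autograd:  ", x.grad.item(), y.grad.item())
```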


## Detaching from the Computation Graph

Sometimes you want to stop tracking gradients. Use `.detach()`:


In [None]:
# Create tensor with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

print("Before detaching:")
print(f"y.requires_grad = {y.requires_grad}")
print(f"y.grad_fn = {y.grad_fn}")  # grad_fn shows the operation that created this tensor
print()

# Detach: creates a new tensor without gradient tracking
y_detached = y.detach()

print("After detaching:")
print(f"y_detached.requires_grad = {y_detached.requires_grad}")
print(f"y_detached.grad_fn = {y_detached.grad_fn}")  # None because no gradient tracking
print()

# Common use case: when you want to use tensor values without tracking gradients
# For example, when evaluating a model (we'll see this in training notebooks)
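A detached tensor still contributes its value to later computations, but no gradient flows back through it. The following sketch (names `tracked` and `frozen` are illustrative) adds a tracked term and a detached term, then checks that only the tracked term shows up in the gradient.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

tracked = x ** 2            # gradient flows through this term
frozen = (x ** 3).detach()  # treated as the constant 8.0 from here on

z = tracked + frozen
z.backward()

print("z =", z.item())            # value uses both terms
print("dz/dx =", x.grad.item())   # only 2x from the tracked term
```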


## Using torch.no_grad()

The `torch.no_grad()` context manager disables gradient tracking. This is useful when:
- Evaluating models (we don't need gradients)
- Updating parameters (we don't want to track the update itself)
- Saving memory and computation


In [None]:
# Example: Computing values without tracking gradients
x = torch.tensor(2.0, requires_grad=True)

# Normal computation (tracks gradients)
y1 = x ** 2
print("With gradient tracking:")
print(f"y1.requires_grad = {y1.requires_grad}")
print()

# Using no_grad context (doesn't track gradients)
with torch.no_grad():
    y2 = x ** 2
    print("Inside no_grad context:")
    print(f"y2.requires_grad = {y2.requires_grad}")
    print(f"y2 value = {y2.item()}")

print()
print("Outside no_grad context:")
print(f"y2.requires_grad = {y2.requires_grad}")  # Still False
print()

# Common use: when evaluating models or updating parameters
# We'll use this extensively in training loops!
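The second bullet above, about updating parameters, is the reason the gradient descent loop wraps its update in `torch.no_grad()`. As a small sketch: an in-place update on a leaf tensor that requires gradients is rejected by autograd outside the context, and works inside it.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)

# Outside no_grad, autograd refuses in-place changes to a tracked leaf tensor.
try:
    x -= 0.1
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print("Error without no_grad:", error_message)

# Inside no_grad, the update is applied and not recorded in any graph.
with torch.no_grad():
    x -= 0.1

print("x =", x.item())
print("still a tracked leaf:", x.requires_grad, x.is_leaf)
```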


## Advanced: Multiple Outputs

Sometimes you compute gradients for multiple outputs from the same input:


In [None]:
# Example: Multiple outputs from one input
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Multiple outputs
y1 = x ** 2
y2 = x ** 3

print("Input:", x)
print("Output 1 (x²):", y1)
print("Output 2 (x³):", y2)
print()

# If we want gradients for both, we can compute them separately
# Or use a combined output

# Compute gradients for y1
# (y1 and y2 have separate graphs, so retain_graph=True is not needed here)
y1.sum().backward()
print("Gradients after y1.backward():")
print("x.grad =", x.grad)

# Zero and compute for y2
x.grad.zero_()
y2.sum().backward()
print("\nGradients after y2.backward():")
print("x.grad =", x.grad)
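The cell above calls `.sum()` before `.backward()` because `backward()` without arguments only works on scalars. The sketch below shows the error for a vector output, the equivalent form that passes a `gradient` argument of ones, and the combined-output approach mentioned in the comments.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# backward() with no arguments is only defined for scalar outputs.
try:
    (x ** 2).backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print("Error for vector output:", error_message)

# Passing a tensor of ones is equivalent to calling .sum().backward().
y1 = x ** 2
y1.backward(gradient=torch.ones_like(y1))
grad_y1 = x.grad.clone()
print("Gradient of x^2:", grad_y1)

# Combined output: one backward pass gives 2x + 3x^2.
x.grad.zero_()
(x ** 2 + x ** 3).sum().backward()
grad_combined = x.grad.clone()
print("Gradient of x^2 + x^3:", grad_combined)
```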


## Practice Exercises

### Exercise 1: Simple Gradient
1. Create a tensor `x = 4.0` with `requires_grad=True`
2. Compute `y = 3x² + 2x + 1`
3. Compute the gradient dy/dx
4. Verify manually: dy/dx = 6x + 2 = 6(4) + 2 = 26

### Exercise 2: Multiple Variables
1. Create `x = 2.0` and `y = 3.0`, both with `requires_grad=True`
2. Compute `z = x²y + xy²`
3. Compute gradients with respect to both x and y
4. Verify: dz/dx = 2xy + y², dz/dy = x² + 2xy

### Exercise 3: Gradient Descent
1. Implement gradient descent to minimize `f(x) = (x - 3)²`
2. Start at x = 0
3. Use learning rate = 0.1
4. Run for 20 iterations
5. Verify you converge to x = 3


## Solutions to Exercises

### Exercise 1 Solution


In [None]:
# Exercise 1 Solution
x = torch.tensor(4.0, requires_grad=True)
y = 3 * x ** 2 + 2 * x + 1

print(f"x = {x.item()}")
print(f"y = 3x² + 2x + 1 = {y.item()}")

y.backward()
print(f"\nGradient dy/dx = {x.grad.item()}")
print(f"Manual check: 6x + 2 = {6*4 + 2}")
print(f"Match: {abs(x.grad.item() - 26) < 0.0001}")


### Exercise 2 Solution


In [None]:
# Exercise 2 Solution
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

z = x ** 2 * y + x * y ** 2

print(f"x = {x.item()}, y = {y.item()}")
print(f"z = x²y + xy² = {z.item()}")

z.backward()

print(f"\nGradient dz/dx = {x.grad.item()}")
print(f"Gradient dz/dy = {y.grad.item()}")

print(f"\nManual check:")
print(f"dz/dx = 2xy + y² = 2(2)(3) + 3² = {2*2*3 + 3*3}")
print(f"dz/dy = x² + 2xy = 2² + 2(2)(3) = {2*2 + 2*2*3}")


### Exercise 3 Solution


In [None]:
# Exercise 3 Solution: Gradient descent for f(x) = (x - 3)²
x = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

print("Gradient Descent to minimize f(x) = (x - 3)²")
print("=" * 50)

for step in range(20):
    f = (x - 3) ** 2
    f.backward()

    # Log before the update so x and f(x) refer to the same point
    if step % 5 == 0:
        print(f"Step {step}: x = {x.item():.4f}, f(x) = {f.item():.4f}")

    with torch.no_grad():
        x -= learning_rate * x.grad

    x.grad.zero_()

print(f"\nFinal: x = {x.item():.4f} (expected: 3.0)")
print(f"Converged: {abs(x.item() - 3.0) < 0.1}")


## Key Takeaways

1. **Gradients** tell us how to update parameters to minimize loss
2. **requires_grad=True** enables gradient tracking for a tensor
3. **.backward()** computes gradients using backpropagation
4. **Gradient Descent**: Update parameters by moving opposite to the gradient
5. **Always zero gradients** (`.grad.zero_()`) before computing new ones
6. **torch.no_grad()** disables gradient tracking (faster, less memory)
7. **Computation Graph**: PyTorch automatically builds a graph to track operations
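The takeaways above fit together in one small loop. The sketch below is an illustration only: the toy data, the parameter names `w` and `b`, the learning rate, and the step count are all assumptions chosen for the example. It fits a line to points generated from y = 2x + 1 using nothing but the tools from this notebook.

```python
import torch

# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = torch.tensor([0.0, 1.0, 2.0, 3.0])
ys = 2 * xs + 1

# Parameters to learn, tracked by autograd.
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.05

for step in range(500):
    # Mean squared error between predictions and targets.
    loss = ((w * xs + b - ys) ** 2).mean()

    # Backpropagation fills w.grad and b.grad.
    loss.backward()

    # Gradient descent update, not tracked by autograd.
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

    # Zero gradients so the next step starts fresh.
    w.grad.zero_()
    b.grad.zero_()

print(f"w = {w.item():.3f}, b = {b.item():.3f}")
```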

## What's Next?

In the next notebook, we'll learn about:
- **Building Neural Networks**: Using `nn.Module` to create network architectures
- **Layers**: Linear, convolutional, activation functions, and more
- **Putting it together**: Building your first neural network

The gradient computation we learned here is exactly what happens when training neural networks - PyTorch automatically computes gradients for all parameters!

---

**Great job! You now understand automatic differentiation! 🎉**
