# 🔥 PyTorch Fundamentals & MNIST Model Training

## A Comprehensive Tutorial for Deep Learning

---

This notebook walks through PyTorch step by step, from basic tensor operations to building and training neural networks on the MNIST dataset.

### 📚 Table of Contents

1. **Import Required Libraries**
2. **PyTorch Tensor Basics**
3. **Mathematical Operations on Tensors**
4. **Statistical Functions in PyTorch**
5. **Automatic Differentiation with Autograd**
6. **Gradient Calculation Examples**
7. **Computation Graph Visualization**
8. **Loading MNIST Dataset**
9. **Implementing Custom DataLoader Class**
10. **Building Neural Network Model Class**
11. **Loss Functions in PyTorch**
12. **Optimizers in PyTorch**
13. **Training Loop Implementation**
14. **Model Evaluation and Testing**
15. **Deep Learning Modules Overview**

---

## 1. Import Required Libraries

First, let's import all the necessary libraries for our PyTorch journey. We'll use:
- **torch**: The core PyTorch library
- **torch.nn**: Neural network modules
- **torch.optim**: Optimization algorithms
- **torchvision**: Computer vision utilities and datasets
- **matplotlib**: For visualizations

In [None]:
# Core PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Torchvision for datasets and transforms
import torchvision
from torchvision import datasets, transforms

# Visualization libraries
import matplotlib.pyplot as plt
import numpy as np

# For computation graph visualization
try:
    from torchviz import make_dot
    TORCHVIZ_AVAILABLE = True
except ImportError:
    TORCHVIZ_AVAILABLE = False
    print("Note: Install torchviz for computation graph visualization: pip install torchviz")

# Check PyTorch version and CUDA availability
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

---

## 2. PyTorch Tensor Basics

### What are Tensors?

**Tensors** are the fundamental data structure in PyTorch - they are multi-dimensional arrays similar to NumPy's ndarrays, but with additional capabilities:
- Can run on GPUs for accelerated computing
- Support automatic differentiation for deep learning
- Optimized for neural network operations

### Tensor Dimensions:
- **0-D Tensor (Scalar)**: Single value
- **1-D Tensor (Vector)**: Array of values
- **2-D Tensor (Matrix)**: Table of values
- **N-D Tensor**: Higher dimensional arrays (e.g., images, videos)
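
As a quick check of the terminology above, the sketch below builds one tensor of each rank and reads back `ndim` and `shape`. The variable names and the batch of 8 blank images are illustrative assumptions, not part of the MNIST pipeline used later.

```python
import torch

scalar = torch.tensor(3.14)               # 0-D: a single value
vector = torch.tensor([1.0, 2.0, 3.0])    # 1-D: array of values
matrix = torch.tensor([[1, 2], [3, 4]])   # 2-D: table of values
images = torch.zeros(8, 1, 28, 28)        # 4-D: batch of 8 grayscale 28x28 images

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")

# Move a tensor to the GPU when one is present, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
images = images.to(device)
print(f"images now lives on: {images.device}")
```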

In [None]:
# ============================================
# TENSOR CREATION METHODS
# ============================================

# 1. Creating tensors from Python lists
tensor_from_list = torch.tensor([1, 2, 3, 4, 5])
print("Tensor from list:", tensor_from_list)

# 2. Creating 2D tensor (matrix)
matrix = torch.tensor([[1, 2, 3], 
                       [4, 5, 6]])
print(f"\n2D Tensor (Matrix):\n{matrix}")

# 3. Zeros tensor - useful for initialization
zeros = torch.zeros(3, 4)  # 3 rows, 4 columns
print(f"\nZeros tensor (3x4):\n{zeros}")

# 4. Ones tensor
ones = torch.ones(2, 3)
print(f"\nOnes tensor (2x3):\n{ones}")

# 5. Random tensor (uniform distribution between 0 and 1)
rand_tensor = torch.rand(3, 3)
print(f"\nRandom tensor (uniform):\n{rand_tensor}")

# 6. Random tensor (normal distribution, mean=0, std=1)
randn_tensor = torch.randn(3, 3)
print(f"\nRandom tensor (normal):\n{randn_tensor}")

# 7. Arange - similar to Python's range()
arange_tensor = torch.arange(0, 10, 2)  # start, end, step
print(f"\nArange tensor (0 to 10, step 2): {arange_tensor}")

# 8. Linspace - evenly spaced values
linspace_tensor = torch.linspace(0, 1, 5)  # 5 values from 0 to 1
print(f"\nLinspace tensor (0 to 1, 5 values): {linspace_tensor}")

# 9. Eye - Identity matrix
identity = torch.eye(4)
print(f"\nIdentity matrix (4x4):\n{identity}")

# 10. Full - tensor filled with a specific value
full_tensor = torch.full((2, 3), 7.0)
print(f"\nFull tensor (filled with 7):\n{full_tensor}")

In [None]:
# ============================================
# TENSOR ATTRIBUTES
# ============================================

sample_tensor = torch.rand(3, 4, 5)  # 3D tensor

print("=" * 50)
print("TENSOR ATTRIBUTES")
print("=" * 50)

# Shape - dimensions of the tensor
print(f"Shape: {sample_tensor.shape}")
print(f"Size (same as shape): {sample_tensor.size()}")

# Number of dimensions
print(f"Number of dimensions (ndim): {sample_tensor.ndim}")

# Data type
print(f"Data type (dtype): {sample_tensor.dtype}")

# Device (CPU or GPU)
print(f"Device: {sample_tensor.device}")

# Total number of elements
print(f"Total elements (numel): {sample_tensor.numel()}")

# Is it on CUDA?
print(f"Is CUDA tensor: {sample_tensor.is_cuda}")

# Requires gradient?
print(f"Requires gradient: {sample_tensor.requires_grad}")

In [None]:
# ============================================
# TENSOR DATA TYPES
# ============================================

print("=" * 50)
print("TENSOR DATA TYPES")
print("=" * 50)

# Float tensors (default)
float32_tensor = torch.tensor([1.0, 2.0, 3.0])
print(f"Float32 (default): {float32_tensor.dtype}")

# Specifying data types
float64_tensor = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
print(f"Float64: {float64_tensor.dtype}")

int32_tensor = torch.tensor([1, 2, 3], dtype=torch.int32)
print(f"Int32: {int32_tensor.dtype}")

int64_tensor = torch.tensor([1, 2, 3], dtype=torch.int64)
print(f"Int64: {int64_tensor.dtype}")

bool_tensor = torch.tensor([True, False, True])
print(f"Boolean: {bool_tensor.dtype}")

# Type conversion using .to() or specific methods
converted = float32_tensor.to(torch.int64)
print(f"\nConverted float32 to int64: {converted}")

# Alternative conversion methods
print(f"Using .int(): {float32_tensor.int()}")
print(f"Using .long(): {float32_tensor.long()}")
print(f"Using .float(): {int32_tensor.float()}")
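
Two details of type conversion are easy to miss: casting a float tensor to an integer type truncates toward zero rather than rounding, and mixing integer and float tensors in one operation promotes the result to the float type. A minimal sketch with made-up values:

```python
import torch

f = torch.tensor([1.7, -1.7, 2.5])
as_int = f.to(torch.int64)        # truncates toward zero, does not round
print(f"float -> int64: {as_int}")

# Mixing dtypes promotes the result to the floating-point type
i = torch.tensor([1, 2, 3])                 # int64 by default
mixed = i + torch.tensor([0.5, 0.5, 0.5])   # float32 operand
print(f"int64 + float32 gives: {mixed.dtype}")
```

If rounding is what you want, call `torch.round()` before converting.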

In [None]:
# ============================================
# TENSOR RESHAPING OPERATIONS
# ============================================

print("=" * 50)
print("TENSOR RESHAPING")
print("=" * 50)

original = torch.arange(12)
print(f"Original tensor: {original}")
print(f"Original shape: {original.shape}")

# Reshape to 3x4
reshaped = original.reshape(3, 4)
print(f"\nReshaped to (3, 4):\n{reshaped}")

# View - like reshape, but never copies: it always shares memory and needs a compatible (e.g. contiguous) layout
viewed = original.view(4, 3)
print(f"\nView as (4, 3):\n{viewed}")

# Flatten - convert to 1D
flattened = reshaped.flatten()
print(f"\nFlattened: {flattened}")

# Squeeze - remove dimensions of size 1
tensor_with_ones = torch.rand(1, 3, 1, 4, 1)
print(f"\nOriginal shape with 1s: {tensor_with_ones.shape}")
squeezed = tensor_with_ones.squeeze()
print(f"After squeeze: {squeezed.shape}")

# Unsqueeze - add a dimension of size 1
tensor_2d = torch.rand(3, 4)
print(f"\n2D tensor shape: {tensor_2d.shape}")
unsqueezed = tensor_2d.unsqueeze(0)  # Add dimension at position 0
print(f"After unsqueeze(0): {unsqueezed.shape}")

# Transpose
matrix = torch.rand(2, 3)
print(f"\nMatrix shape: {matrix.shape}")
transposed = matrix.T  # or matrix.transpose(0, 1)
print(f"Transposed shape: {transposed.shape}")

# Permute - reorder dimensions
tensor_3d = torch.rand(2, 3, 4)
print(f"\n3D tensor shape: {tensor_3d.shape}")
permuted = tensor_3d.permute(2, 0, 1)
print(f"Permuted (2,0,1) shape: {permuted.shape}")
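
The difference between `view` and `reshape` only shows up on non-contiguous tensors. The sketch below (variable names are illustrative) shows that a view writes through to the original storage, that `view` fails after a transpose, and that `reshape` silently falls back to a copy in that case.

```python
import torch

base = torch.arange(6)
v = base.view(2, 3)
v[0, 0] = 100                     # writes through: v and base share storage
print(f"base[0] after writing via view: {base[0].item()}")

t = v.T                           # transpose gives a non-contiguous view
print(f"t.is_contiguous(): {t.is_contiguous()}")

try:
    t.view(6)                     # view cannot re-stride this layout
    view_failed = False
except RuntimeError:
    view_failed = True
print(f"view on transposed tensor failed: {view_failed}")

flat = t.reshape(6)               # reshape copies when a view is impossible
flat[0] = -1                      # so this does not touch base
print(f"base[0] after writing via reshape copy: {base[0].item()}")
```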

In [None]:
# ============================================
# TENSOR INDEXING AND SLICING
# ============================================

print("=" * 50)
print("TENSOR INDEXING AND SLICING")
print("=" * 50)

tensor = torch.arange(20).reshape(4, 5)
print(f"Original tensor:\n{tensor}")

# Single element access
print(f"\nElement at [1, 2]: {tensor[1, 2]}")

# Row access
print(f"Row 0: {tensor[0]}")
print(f"Row 1: {tensor[1, :]}")

# Column access
print(f"Column 2: {tensor[:, 2]}")

# Slicing
print(f"\nRows 1-2, Columns 2-4:\n{tensor[1:3, 2:5]}")

# Boolean indexing
print(f"\nElements > 10: {tensor[tensor > 10]}")

# Fancy indexing
indices = torch.tensor([0, 2, 3])
print(f"\nRows at indices [0, 2, 3]:\n{tensor[indices]}")
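
One practical consequence of the indexing styles above: basic indexing and slicing return views onto the original tensor, while indexing with an index tensor ("fancy" indexing) returns a copy. A small sketch on a fresh tensor so the cell above is not affected:

```python
import torch

t = torch.arange(20).reshape(4, 5)

row_view = t[1]                      # basic indexing: a view onto t
row_view[0] = -1                     # also changes t[1, 0]
print(f"t[1, 0] after writing through the view: {t[1, 0].item()}")

picked = t[torch.tensor([0, 2])]     # index-tensor indexing: a copy
picked[0, 0] = 999                   # t is left untouched
print(f"t[0, 0] after writing to the copy: {t[0, 0].item()}")
```

Call `.clone()` on a slice when you need an independent copy.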

---

## 3. Mathematical Operations on Tensors

PyTorch provides a comprehensive set of mathematical operations that can be performed on tensors. These operations are optimized for both CPU and GPU computation.

### Categories of Mathematical Operations:
- **Element-wise operations**: Applied to each element independently
- **Matrix operations**: Linear algebra operations
- **Reduction operations**: Aggregate values across dimensions

In [None]:
# ============================================
# ELEMENT-WISE OPERATIONS
# ============================================

print("=" * 50)
print("ELEMENT-WISE OPERATIONS")
print("=" * 50)

a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([5.0, 6.0, 7.0, 8.0])

print(f"Tensor a: {a}")
print(f"Tensor b: {b}")

# Addition
print(f"\nAddition (a + b): {a + b}")
print(f"torch.add(a, b): {torch.add(a, b)}")

# Subtraction
print(f"\nSubtraction (a - b): {a - b}")
print(f"torch.sub(a, b): {torch.sub(a, b)}")

# Multiplication (element-wise)
print(f"\nMultiplication (a * b): {a * b}")
print(f"torch.mul(a, b): {torch.mul(a, b)}")

# Division
print(f"\nDivision (a / b): {a / b}")
print(f"torch.div(a, b): {torch.div(a, b)}")

# Floor division
print(f"\nFloor division (b // a): {b // a}")

# Modulo
print(f"\nModulo (b % a): {b % a}")

# Power
print(f"\nPower (a ** 2): {a ** 2}")
print(f"torch.pow(a, 2): {torch.pow(a, 2)}")

In [None]:
# ============================================
# MATHEMATICAL FUNCTIONS
# ============================================

print("=" * 50)
print("MATHEMATICAL FUNCTIONS")
print("=" * 50)

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(f"Tensor x: {x}")

# Exponential and logarithm
print(f"\nexp(x): {torch.exp(x)}")
print(f"log(x): {torch.log(x)}")
print(f"log10(x): {torch.log10(x)}")
print(f"log2(x): {torch.log2(x)}")

# Square root
print(f"\nsqrt(x): {torch.sqrt(x)}")

# Absolute value
y = torch.tensor([-1.0, -2.0, 3.0, 4.0])
print(f"\nTensor y: {y}")
print(f"abs(y): {torch.abs(y)}")

# Trigonometric functions
angles = torch.tensor([0.0, np.pi/4, np.pi/2, np.pi])
print(f"\nAngles (radians): {angles}")
print(f"sin(angles): {torch.sin(angles)}")
print(f"cos(angles): {torch.cos(angles)}")
print(f"tan(angles[:3]): {torch.tan(angles[:3])}")

# Clamp - limit values to a range
values = torch.tensor([-3, -1, 0, 2, 5, 10])
print(f"\nOriginal values: {values}")
print(f"Clamped (0, 5): {torch.clamp(values, min=0, max=5)}")

# Round, floor, ceil
floats = torch.tensor([1.2, 2.5, 3.7, 4.1])
print(f"\nFloat tensor: {floats}")
print(f"Round: {torch.round(floats)}")
print(f"Floor: {torch.floor(floats)}")
print(f"Ceil: {torch.ceil(floats)}")

In [None]:
# ============================================
# MATRIX OPERATIONS
# ============================================

print("=" * 50)
print("MATRIX OPERATIONS")
print("=" * 50)

A = torch.tensor([[1., 2.], 
                  [3., 4.]])
B = torch.tensor([[5., 6.], 
                  [7., 8.]])

print(f"Matrix A:\n{A}")
print(f"\nMatrix B:\n{B}")

# Matrix multiplication
print(f"\nMatrix multiplication (A @ B):\n{A @ B}")
print(f"\ntorch.matmul(A, B):\n{torch.matmul(A, B)}")
print(f"\ntorch.mm(A, B):\n{torch.mm(A, B)}")

# Dot product (for 1D tensors)
v1 = torch.tensor([1., 2., 3.])
v2 = torch.tensor([4., 5., 6.])
print(f"\nVector v1: {v1}")
print(f"Vector v2: {v2}")
print(f"Dot product: {torch.dot(v1, v2)}")

# Matrix-vector multiplication
v = torch.tensor([1., 2.])
print(f"\nMatrix-vector (A @ v): {A @ v}")

# Batch matrix multiplication
batch_A = torch.rand(3, 2, 4)  # 3 matrices of shape 2x4
batch_B = torch.rand(3, 4, 3)  # 3 matrices of shape 4x3
batch_result = torch.bmm(batch_A, batch_B)
print(f"\nBatch multiplication: {batch_A.shape} @ {batch_B.shape} = {batch_result.shape}")

# Transpose
print(f"\nTranspose of A:\n{A.T}")

# Determinant
print(f"\nDeterminant of A: {torch.det(A)}")

# Inverse
print(f"\nInverse of A:\n{torch.inverse(A)}")
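
When the goal is to solve a linear system `A x = b` rather than to inspect the inverse itself, `torch.linalg.solve` is usually the better tool than forming `torch.inverse(A)` explicitly. The right-hand side `b` below is an arbitrary value chosen for illustration.

```python
import torch

A = torch.tensor([[1., 2.],
                  [3., 4.]])
b = torch.tensor([5., 6.])            # illustrative right-hand side

x_inv = torch.inverse(A) @ b          # explicit inverse, then multiply
x_solve = torch.linalg.solve(A, b)    # solves A x = b directly

print(f"Via inverse: {x_inv}")
print(f"Via solve:   {x_solve}")
print(f"A @ x reproduces b: {torch.allclose(A @ x_solve, b, atol=1e-5)}")
```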

In [None]:
# ============================================
# BROADCASTING
# ============================================

print("=" * 50)
print("BROADCASTING")
print("=" * 50)

# Broadcasting allows operations on tensors of different shapes
matrix = torch.ones(3, 3)
vector = torch.tensor([1., 2., 3.])

print(f"Matrix (3x3):\n{matrix}")
print(f"\nVector (3,): {vector}")

# Vector is broadcast across rows
result = matrix + vector
print(f"\nMatrix + Vector (broadcast):\n{result}")

# Scalar broadcasting
print(f"\nMatrix * 5:\n{matrix * 5}")

# Column broadcasting (need to reshape)
column = torch.tensor([[1.], [2.], [3.]])
print(f"\nColumn vector:\n{column}")
print(f"\nMatrix + Column (broadcast):\n{matrix + column}")
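
The rule behind these examples: shapes are aligned from the trailing dimension, and each pair of dimensions must either match or contain a 1. `torch.broadcast_shapes` lets you check the resulting shape without allocating any data. A short sketch, including one incompatible pair:

```python
import torch

# (3, 3) with (3,): the vector is treated as (1, 3) and repeated over rows
print(torch.broadcast_shapes((3, 3), (3,)))

# (3, 1) with (1, 4): both size-1 dimensions are stretched
print(torch.broadcast_shapes((3, 1), (1, 4)))

# (3, 4) with (3,): trailing dimensions 4 and 3 differ, so this fails
try:
    torch.ones(3, 4) + torch.ones(3)
    compatible = True
except RuntimeError:
    compatible = False
print(f"(3, 4) + (3,) is broadcastable: {compatible}")

# Reshaping the vector to a column (3, 1) makes it compatible
fixed = torch.ones(3, 4) + torch.ones(3).unsqueeze(1)
print(f"Shape after unsqueeze fix: {tuple(fixed.shape)}")
```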

In [None]:
# ============================================
# IN-PLACE OPERATIONS
# ============================================

print("=" * 50)
print("IN-PLACE OPERATIONS")
print("=" * 50)

# In-place operations modify the tensor directly and end with underscore (_)
x = torch.tensor([1., 2., 3., 4.])
print(f"Original x: {x}")

# In-place addition
x.add_(10)
print(f"After x.add_(10): {x}")

# In-place multiplication
x.mul_(2)
print(f"After x.mul_(2): {x}")

# In-place zero
x.zero_()
print(f"After x.zero_(): {x}")

# In-place fill
x.fill_(7)
print(f"After x.fill_(7): {x}")

# ⚠️ Warning: In-place operations can cause issues with autograd!
print("\n⚠️ Note: Be careful with in-place operations when using autograd!")
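
To make the warning concrete, here is a minimal sketch of two ways in-place operations clash with autograd: modifying a leaf tensor that requires gradients, and overwriting a value that the backward pass still needs. The tensors are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Case 1: in-place update of a leaf tensor that requires grad
try:
    x.add_(1)
    leaf_failed = False
except RuntimeError as err:
    leaf_failed = True
    print(f"Leaf in-place error: {err}")

# Case 2: exp() saves its output for the backward pass,
# so overwriting that output in place invalidates the graph
y = x.exp()
y.add_(1)
try:
    y.sum().backward()
    backward_failed = False
except RuntimeError as err:
    backward_failed = True
    print(f"Backward error: {err}")

# The out-of-place version works as expected
z = x.exp() + 1
z.sum().backward()
print(f"x.grad with out-of-place add: {x.grad}")
```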

---

## 4. Statistical Functions in PyTorch

Statistical functions are essential for data analysis and are frequently used in machine learning for:
- Normalization
- Batch statistics
- Loss computation
- Model evaluation

In [None]:
# ============================================
# STATISTICAL FUNCTIONS
# ============================================

print("=" * 50)
print("STATISTICAL FUNCTIONS")
print("=" * 50)

# Create a sample tensor
data = torch.tensor([[1., 2., 3., 4.],
                     [5., 6., 7., 8.],
                     [9., 10., 11., 12.]])

print(f"Data tensor:\n{data}")
print(f"Shape: {data.shape}")

# Basic statistics
print(f"\n--- Basic Statistics ---")
print(f"Mean: {torch.mean(data)}")
print(f"Sum: {torch.sum(data)}")
print(f"Standard Deviation: {torch.std(data)}")
print(f"Variance: {torch.var(data)}")

# Min and Max
print(f"\n--- Min/Max ---")
print(f"Min: {torch.min(data)}")
print(f"Max: {torch.max(data)}")

# Median
print(f"Median: {torch.median(data)}")

# Argmin and Argmax (index of min/max)
print(f"\n--- Argmin/Argmax ---")
print(f"Argmin (flattened): {torch.argmin(data)}")
print(f"Argmax (flattened): {torch.argmax(data)}")
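
Two defaults in the functions above are worth knowing. `torch.std` and `torch.var` apply Bessel's correction (dividing by `n - 1`) unless told otherwise, and `torch.median` returns the lower of the two middle values when the element count is even instead of averaging them. A sketch on a small made-up tensor:

```python
import torch

data = torch.tensor([1., 2., 3., 4.])
n = data.numel()
squared_dev = ((data - data.mean()) ** 2).sum()

sample_std = torch.sqrt(squared_dev / (n - 1))     # divides by n - 1
population_std = torch.sqrt(squared_dev / n)       # divides by n

print(f"torch.std (default):        {torch.std(data):.4f}")
print(f"manual, n - 1:              {sample_std:.4f}")
print(f"torch.std(unbiased=False):  {torch.std(data, unbiased=False):.4f}")
print(f"manual, n:                  {population_std:.4f}")

# Even number of elements: the lower middle value is returned
print(f"torch.median: {torch.median(data)}")
```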

In [None]:
# ============================================
# DIMENSION-WISE OPERATIONS (using dim parameter)
# ============================================

print("=" * 50)
print("DIMENSION-WISE OPERATIONS")
print("=" * 50)

data = torch.tensor([[1., 2., 3., 4.],
                     [5., 6., 7., 8.],
                     [9., 10., 11., 12.]])

print(f"Data tensor:\n{data}")
print(f"Shape: {data.shape} (3 rows, 4 columns)")

# Sum along dimensions
print(f"\n--- Sum along dimensions ---")
print(f"Sum along dim=0 (collapses rows, one value per column): {torch.sum(data, dim=0)}")  # Shape: (4,)
print(f"Sum along dim=1 (collapses columns, one value per row): {torch.sum(data, dim=1)}")  # Shape: (3,)

# Mean along dimensions
print(f"\n--- Mean along dimensions ---")
print(f"Mean along dim=0: {torch.mean(data, dim=0)}")
print(f"Mean along dim=1: {torch.mean(data, dim=1)}")

# Keep dimensions (useful for broadcasting)
print(f"\n--- Keeping dimensions ---")
mean_keepdim = torch.mean(data, dim=1, keepdim=True)
print(f"Mean with keepdim=True:\n{mean_keepdim}")
print(f"Shape: {mean_keepdim.shape}")

# Normalization example (subtract mean, divide by std)
print(f"\n--- Normalization Example ---")
mean = torch.mean(data, dim=1, keepdim=True)
std = torch.std(data, dim=1, keepdim=True)
normalized = (data - mean) / std
print(f"Normalized data:\n{normalized}")

# Max with indices
print(f"\n--- Max with indices ---")
max_values, max_indices = torch.max(data, dim=1)
print(f"Max values per row: {max_values}")
print(f"Max indices per row: {max_indices}")

---

## 5. Automatic Differentiation with Autograd

### What is Autograd?

**Autograd** is PyTorch's automatic differentiation engine that powers neural network training. It automatically computes gradients (derivatives) of tensor operations.

### Key Concepts:
- **requires_grad=True**: Tells PyTorch to track operations on this tensor
- **Computational Graph**: Records all operations for backpropagation
- **backward()**: Computes gradients by traversing the graph backwards
- **.grad**: Stores the computed gradient

### Why is this important?
In neural networks, we need to compute gradients of the loss function with respect to all parameters to update them during training.
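
To illustrate how gradients drive parameter updates, here is a minimal hand-written gradient descent sketch on a single weight. The data point, learning rate, and step count are assumptions chosen for illustration; real training uses an optimizer from `torch.optim`.

```python
import torch

# Fit w so that w * x matches target (the exact answer is w = 3)
w = torch.tensor(0.0, requires_grad=True)
x, target = torch.tensor(2.0), torch.tensor(6.0)
lr = 0.1   # illustrative learning rate

for step in range(20):
    loss = (w * x - target) ** 2   # squared error
    loss.backward()                # fills w.grad with d(loss)/dw
    with torch.no_grad():          # the update itself must not be tracked
        w -= lr * w.grad
    w.grad.zero_()                 # reset, since gradients accumulate

print(f"Learned w: {w.item():.4f}")
print(f"Final loss: {loss.item():.6f}")
```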

In [None]:
# ============================================
# AUTOGRAD BASICS
# ============================================

print("=" * 50)
print("AUTOGRAD BASICS")
print("=" * 50)

# Create tensors with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(f"x = {x}")
print(f"requires_grad: {x.requires_grad}")

# Perform operations - PyTorch builds a computation graph
y = x ** 2
print(f"\ny = x² = {y}")
print(f"y.grad_fn: {y.grad_fn}")  # Shows the operation that created y

z = y.sum()
print(f"\nz = sum(y) = {z}")
print(f"z.grad_fn: {z.grad_fn}")

# Compute gradients using backward()
z.backward()

# Access gradients
print(f"\n--- After backward() ---")
print(f"x.grad (dz/dx): {x.grad}")
# dz/dx = d(x₁² + x₂²)/dx = 2x
# For x = [2, 3]: gradient = [4, 6] ✓
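
Note that `backward()` without arguments only works on a scalar, which is why the example sums `y` first. For a non-scalar output you have to pass a `gradient` tensor of the same shape; passing ones is equivalent to summing first. A minimal sketch:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2                      # non-scalar output

try:
    y.backward()                # no implicit gradient for a vector
    scalar_only = False
except RuntimeError as err:
    scalar_only = True
    print(f"Error: {err}")

# Supplying ones gives the same result as y.sum().backward()
y = x ** 2
y.backward(gradient=torch.ones_like(y))
print(f"x.grad: {x.grad}")
```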

---

## 6. Gradient Calculation Examples

Let's explore more gradient computation examples and learn about important gradient-related operations.

In [None]:
# ============================================
# GRADIENT CALCULATION EXAMPLES
# ============================================

print("=" * 50)
print("GRADIENT CALCULATION EXAMPLES")
print("=" * 50)

# Example 1: Simple linear function
# f(x) = 3x + 2, df/dx = 3
print("--- Example 1: f(x) = 3x + 2 ---")
x = torch.tensor(5.0, requires_grad=True)
f = 3 * x + 2
f.backward()
print(f"x = {x.item()}")
print(f"f(x) = {f.item()}")
print(f"df/dx = {x.grad.item()}")  # Should be 3

# Example 2: Polynomial function
# f(x) = x³ + 2x² + x, df/dx = 3x² + 4x + 1
print("\n--- Example 2: f(x) = x³ + 2x² + x ---")
x = torch.tensor(2.0, requires_grad=True)
f = x**3 + 2*x**2 + x
f.backward()
print(f"x = {x.item()}")
print(f"f(x) = {f.item()}")
print(f"df/dx = {x.grad.item()}")  # At x=2: 3(4) + 4(2) + 1 = 21

# Example 3: Multiple variables
print("\n--- Example 3: f(x,y) = x²y + y³ ---")
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
f = x**2 * y + y**3
f.backward()
print(f"x = {x.item()}, y = {y.item()}")
print(f"f(x,y) = {f.item()}")
print(f"∂f/∂x = {x.grad.item()}")  # 2xy = 2*2*3 = 12
print(f"∂f/∂y = {y.grad.item()}")  # x² + 3y² = 4 + 27 = 31
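
A useful habit when working with gradients is to cross-check autograd against a numerical estimate. The sketch below repeats Example 3 in float64 and compares the result with central finite differences; the step size `eps` is an illustrative choice.

```python
import torch

def f(x, y):
    return x**2 * y + y**3

x = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
y = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
f(x, y).backward()

eps = 1e-6   # illustrative step size
with torch.no_grad():
    # Central differences: (f(a + eps) - f(a - eps)) / (2 * eps)
    num_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    num_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

print(f"autograd:  df/dx = {x.grad.item():.6f}, df/dy = {y.grad.item():.6f}")
print(f"numerical: df/dx = {num_dx.item():.6f}, df/dy = {num_dy.item():.6f}")
```

For larger functions, `torch.autograd.gradcheck` automates the same comparison.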

In [None]:
# ============================================
# GRADIENT ACCUMULATION AND ZEROING
# ============================================

print("=" * 50)
print("GRADIENT ACCUMULATION")
print("=" * 50)

# Gradients accumulate by default - very important to understand!
x = torch.tensor(3.0, requires_grad=True)

# First backward
y1 = x ** 2
y1.backward()
print(f"After first backward: x.grad = {x.grad}")

# Second backward - gradients accumulate!
y2 = x ** 2
y2.backward()
print(f"After second backward: x.grad = {x.grad}")  # Will be 12, not 6!

# To prevent accumulation, zero the gradients
x.grad.zero_()
print(f"After zeroing: x.grad = {x.grad}")

y3 = x ** 2
y3.backward()
print(f"After third backward (with zeroing): x.grad = {x.grad}")  # Now correct: 6

print("\n⚠️ Always zero gradients before backward() in training loops!")

In [None]:
# ============================================
# TORCH.NO_GRAD() CONTEXT MANAGER
# ============================================

print("=" * 50)
print("TORCH.NO_GRAD() CONTEXT MANAGER")
print("=" * 50)

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# With gradient tracking (default)
y = x * 2
print(f"With gradients: y.requires_grad = {y.requires_grad}")

# Without gradient tracking - useful for inference
with torch.no_grad():
    z = x * 2
    print(f"With no_grad(): z.requires_grad = {z.requires_grad}")

# Operations inside no_grad() are not tracked
print(f"z has no grad_fn: {z.grad_fn}")

# Alternative: torch.inference_mode() (slightly faster)
with torch.inference_mode():
    w = x * 2
    print(f"With inference_mode(): w.requires_grad = {w.requires_grad}")

print("\n✅ Use torch.no_grad() during evaluation/inference for:")
print("   - Memory efficiency (no computation graph stored)")
print("   - Faster computation")
print("   - Preventing accidental gradient updates")

In [None]:
# ============================================
# DETACH AND REQUIRES_GRAD CONTROL
# ============================================

print("=" * 50)
print("DETACH AND REQUIRES_GRAD CONTROL")
print("=" * 50)

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2

# Detach - creates a new tensor that doesn't require gradients
z = y.detach()
print(f"y.requires_grad: {y.requires_grad}")
print(f"z (detached).requires_grad: {z.requires_grad}")

# Detach shares memory with original!
print(f"\nz shares memory with y: {z.data_ptr() == y.data_ptr()}")

# Using .data (less recommended, can cause issues)
w = y.data
print(f"w (using .data).requires_grad: {w.requires_grad}")

# Changing requires_grad
a = torch.tensor([1.0, 2.0, 3.0])
print(f"\nOriginal requires_grad: {a.requires_grad}")
a.requires_grad_(True)  # In-place modification
print(f"After requires_grad_(True): {a.requires_grad}")

---

## 7. Computation Graph Visualization

The **computation graph** is a directed acyclic graph (DAG) that PyTorch builds as you perform operations. Each node represents an operation, and edges represent the data (tensors) flowing between operations.

Understanding the computation graph is crucial for:
- Debugging gradient issues
- Understanding how backpropagation works
- Optimizing model architecture

In [None]:
# ============================================
# COMPUTATION GRAPH VISUALIZATION
# ============================================

print("=" * 50)
print("COMPUTATION GRAPH VISUALIZATION")
print("=" * 50)

# Create a simple computation graph
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad=True)

# Perform operations
z = x * y           # Element-wise multiplication
w = z.sum()         # Sum all elements
v = w * 2           # Multiply by 2

print(f"x:\n{x}")
print(f"y:\n{y}")
print(f"z = x * y:\n{z}")
print(f"w = sum(z): {w}")
print(f"v = w * 2: {v}")

# Visualize the computation graph
if TORCHVIZ_AVAILABLE:
    # Create visualization
    dot = make_dot(v, params={'x': x, 'y': y})
    dot.render('computation_graph', format='png', cleanup=True)
    print("\n✅ Computation graph saved as 'computation_graph.png'")
    
    # Display inline (if in Jupyter)
    from IPython.display import Image, display
    display(Image(filename='computation_graph.png'))
else:
    print("\n⚠️ Install torchviz to visualize computation graphs:")
    print("   pip install torchviz graphviz")

In [None]:
# ============================================
# UNDERSTANDING GRAD_FN
# ============================================

print("=" * 50)
print("UNDERSTANDING GRAD_FN")
print("=" * 50)

# grad_fn shows the operation that created a tensor
x = torch.tensor([1., 2., 3.], requires_grad=True)
print(f"x.grad_fn: {x.grad_fn}")  # None - it's a leaf tensor

y = x + 2
print(f"y = x + 2, y.grad_fn: {y.grad_fn}")

z = y * 3
print(f"z = y * 3, z.grad_fn: {z.grad_fn}")

w = z.mean()
print(f"w = mean(z), w.grad_fn: {w.grad_fn}")

# Tracing back through the graph
print("\n--- Tracing the graph ---")
current = w.grad_fn
step = 0
while current is not None:
    print(f"Step {step}: {current}")
    if hasattr(current, 'next_functions'):
        parents = current.next_functions
        if parents:
            current = parents[0][0]  # Follow first parent
        else:
            break
    else:
        break
    step += 1

---

## 8. Loading MNIST Dataset

### About MNIST
**MNIST** (Modified National Institute of Standards and Technology) is a classic dataset containing:
- **60,000 training images** of handwritten digits (0-9)
- **10,000 test images**
- Each image is **28×28 grayscale**

This is the "Hello World" of deep learning and perfect for learning neural networks!

In [None]:
# ============================================
# LOADING MNIST DATASET
# ============================================

print("=" * 50)
print("LOADING MNIST DATASET")
print("=" * 50)

# Define transforms
# 1. ToTensor(): Converts PIL Image to tensor and scales to [0, 1]
# 2. Normalize(): Normalizes with mean and std (MNIST: 0.1307, 0.3081)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Download and load training dataset
train_dataset = datasets.MNIST(
    root='./data',           # Directory to store data
    train=True,              # Training set
    download=True,           # Download if not present
    transform=transform      # Apply transforms
)

# Download and load test dataset
test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

print(f"Training dataset size: {len(train_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")

# Examine a single sample
sample_image, sample_label = train_dataset[0]
print(f"\nSample image shape: {sample_image.shape}")
print(f"Sample label: {sample_label}")
print(f"Image dtype: {sample_image.dtype}")
print(f"Image min: {sample_image.min():.4f}, max: {sample_image.max():.4f}")
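
The min and max printed above fall outside [0, 1] because `Normalize` computes `(pixel - mean) / std` after `ToTensor()` has scaled the pixels. The sketch below reproduces that arithmetic with plain tensor operations on three illustrative pixel values, so it runs without downloading the dataset.

```python
import torch

mean, std = 0.1307, 0.3081               # MNIST statistics used above
pixels = torch.tensor([0.0, 0.5, 1.0])   # black, mid-gray, white after ToTensor()

normalized = (pixels - mean) / std       # what Normalize computes per pixel
restored = normalized * std + mean       # inverse mapping, used for plotting

print(f"Normalized: {normalized}")
print(f"Restored:   {restored}")
```

A black pixel maps to roughly -0.42 and a white pixel to roughly 2.82.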

In [None]:
# ============================================
# VISUALIZING MNIST SAMPLES
# ============================================

# Create a figure to display sample images
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
fig.suptitle('MNIST Sample Images', fontsize=14)

# Get 10 random samples
indices = np.random.choice(len(train_dataset), 10, replace=False)

for i, idx in enumerate(indices):
    image, label = train_dataset[idx]
    ax = axes[i // 5, i % 5]
    
    # Convert tensor to numpy and remove channel dimension
    img_np = image.squeeze().numpy()
    
    # Denormalize for better visualization
    img_np = img_np * 0.3081 + 0.1307
    
    ax.imshow(img_np, cmap='gray')
    ax.set_title(f'Label: {label}')
    ax.axis('off')

plt.tight_layout()
plt.show()

# Display class distribution
print("\n--- Class Distribution ---")
# Read labels directly from .targets; indexing every sample would run the transforms on all 60,000 images
labels = train_dataset.targets.numpy()
unique, counts = np.unique(labels, return_counts=True)
for digit, count in zip(unique, counts):
    print(f"Digit {digit}: {count} samples")

---

## 9. Implementing Custom DataLoader Class

### DataLoader Overview

The **DataLoader** is a powerful utility that provides:
- **Batching**: Groups samples into batches
- **Shuffling**: Randomizes the order of data
- **Parallel loading**: Uses multiple workers for efficiency
- **Prefetching**: Loads next batches while training

### Custom Dataset

Sometimes you need to create your own Dataset class. This requires implementing:
- `__init__()`: Initialize the dataset
- `__len__()`: Return the size of the dataset  
- `__getitem__()`: Return a sample given an index

In [None]:
# ============================================
# DATALOADER CLASS USAGE
# ============================================

print("=" * 50)
print("DATALOADER CLASS")
print("=" * 50)

# Create DataLoaders
BATCH_SIZE = 64

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,      # Number of samples per batch
    shuffle=True,               # Shuffle data each epoch
    num_workers=2,              # Parallel data loading processes
    pin_memory=True             # Speed up data transfer to GPU
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,              # Don't shuffle test data
    num_workers=2,
    pin_memory=True
)

print(f"Batch size: {BATCH_SIZE}")
print(f"Number of training batches: {len(train_loader)}")
print(f"Number of test batches: {len(test_loader)}")

# Examine a batch
batch_images, batch_labels = next(iter(train_loader))
print(f"\nBatch images shape: {batch_images.shape}")
print(f"Batch labels shape: {batch_labels.shape}")
print(f"First 10 labels: {batch_labels[:10].tolist()}")
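
The number of batches is the dataset size divided by the batch size, rounded up, so the last batch is usually smaller than the rest. The sketch below checks this on a small synthetic dataset (100 random samples, an assumption for illustration) and shows the effect of `drop_last=True`.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for MNIST: 100 random images and labels
ds = TensorDataset(torch.rand(100, 1, 28, 28), torch.randint(0, 10, (100,)))

loader = DataLoader(ds, batch_size=16)
loader_drop = DataLoader(ds, batch_size=16, drop_last=True)

print(f"Batches (keep last): {len(loader)}")       # ceil(100 / 16)
print(f"Batches (drop last): {len(loader_drop)}")  # floor(100 / 16)

batch_sizes = [images.shape[0] for images, _ in loader]
print(f"Last batch size: {batch_sizes[-1]}")

# Same arithmetic for the real training set with BATCH_SIZE = 64
mnist_batches = math.ceil(60000 / 64)
print(f"Expected MNIST training batches: {mnist_batches}")
```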

In [None]:
# ============================================
# CUSTOM DATASET CLASS IMPLEMENTATION
# ============================================

print("=" * 50)
print("CUSTOM DATASET CLASS")
print("=" * 50)

class CustomMNISTDataset(Dataset):
    """
    Custom Dataset class for MNIST-like data.
    
    This demonstrates how to create your own Dataset class
    for custom data sources.
    """
    
    def __init__(self, images, labels, transform=None):
        """
        Initialize the dataset.
        
        Args:
            images: Tensor or array of images
            labels: Tensor or array of labels
            transform: Optional transforms to apply
        """
        self.images = images
        self.labels = labels
        self.transform = transform
        
    def __len__(self):
        """Return the total number of samples."""
        return len(self.labels)
    
    def __getitem__(self, idx):
        """
        Return a single sample.
        
        Args:
            idx: Index of the sample
            
        Returns:
            tuple: (image, label)
        """
        image = self.images[idx]
        label = self.labels[idx]
        
        if self.transform:
            image = self.transform(image)
            
        return image, label

# Create a small custom dataset for demonstration
custom_images = torch.rand(100, 1, 28, 28)  # 100 random images
custom_labels = torch.randint(0, 10, (100,))  # 100 random labels

custom_dataset = CustomMNISTDataset(custom_images, custom_labels)

print(f"Custom dataset size: {len(custom_dataset)}")
sample_img, sample_lbl = custom_dataset[0]
print(f"Sample image shape: {sample_img.shape}")
print(f"Sample label: {sample_lbl}")

# Create DataLoader from custom dataset
custom_loader = DataLoader(custom_dataset, batch_size=16, shuffle=True)
print(f"Number of batches: {len(custom_loader)}")

---

## 10. Building Neural Network Model Class

### nn.Module Overview

In PyTorch, neural networks are built by subclassing `nn.Module`. Key components:

- **`__init__()`**: Define all layers and learnable parameters
- **`forward()`**: Define how data flows through the network

### Key Layers We'll Use:
- `nn.Linear`: Fully connected layer
- `nn.Conv2d`: 2D convolutional layer
- `nn.ReLU`: Activation function
- `nn.Flatten`: Convert 2D to 1D
- `nn.Dropout`: Regularization

In [None]:
# ============================================
# SIMPLE FEED-FORWARD NEURAL NETWORK
# ============================================

print("=" * 50)
print("SIMPLE FEED-FORWARD NEURAL NETWORK")
print("=" * 50)

class SimpleNN(nn.Module):
    """
    A simple feed-forward neural network for MNIST.
    
    Architecture:
    - Input: 784 (28x28 flattened)
    - Hidden Layer 1: 512 neurons + ReLU + Dropout
    - Hidden Layer 2: 256 neurons + ReLU + Dropout
    - Output: 10 classes (digits 0-9)
    """
    
    def __init__(self):
        super(SimpleNN, self).__init__()
        
        # Flatten layer: (batch, 1, 28, 28) -> (batch, 784)
        self.flatten = nn.Flatten()
        
        # Fully connected layers
        self.fc1 = nn.Linear(784, 512)   # Input to hidden 1
        self.fc2 = nn.Linear(512, 256)   # Hidden 1 to hidden 2
        self.fc3 = nn.Linear(256, 10)    # Hidden 2 to output
        
        # Activation and regularization
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        """Forward pass through the network."""
        # Flatten the input
        x = self.flatten(x)              # (batch, 784)
        
        # First hidden layer
        x = self.fc1(x)                  # (batch, 512)
        x = self.relu(x)
        x = self.dropout(x)
        
        # Second hidden layer
        x = self.fc2(x)                  # (batch, 256)
        x = self.relu(x)
        x = self.dropout(x)
        
        # Output layer (no activation - CrossEntropyLoss handles it)
        x = self.fc3(x)                  # (batch, 10)
        
        return x

# Create model instance
simple_model = SimpleNN()
print(simple_model)

# Count parameters
total_params = sum(p.numel() for p in simple_model.parameters())
trainable_params = sum(p.numel() for p in simple_model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
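
The parameter count printed above can be checked by hand. Each `nn.Linear(in, out)` stores an `in x out` weight matrix plus `out` biases; the helper name `linear_params` below is made up for this sketch.

```python
# Each nn.Linear(in_features, out_features) holds a weight matrix plus a bias vector
def linear_params(in_features, out_features):
    return in_features * out_features + out_features

layer_sizes = [(784, 512), (512, 256), (256, 10)]  # fc1, fc2, fc3 from SimpleNN
per_layer = [linear_params(i, o) for i, o in layer_sizes]

print(per_layer)       # [401920, 131328, 2570]
print(sum(per_layer))  # 535818
```

`nn.ReLU`, `nn.Dropout`, and `nn.Flatten` contribute no learnable parameters, so the total comes from the three linear layers alone.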

In [None]:
# ============================================
# CONVOLUTIONAL NEURAL NETWORK (CNN)
# ============================================

print("=" * 50)
print("CONVOLUTIONAL NEURAL NETWORK (CNN)")
print("=" * 50)

class CNN(nn.Module):
    """
    A Convolutional Neural Network for MNIST.
    
    Architecture:
    - Conv Layer 1: 1 -> 32 channels, 3x3 kernel, padding 1, then ReLU + 2x2 MaxPool
    - Conv Layer 2: 32 -> 64 channels, 3x3 kernel, padding 1, then ReLU + 2x2 MaxPool
    - Fully Connected: 3136 (64 x 7 x 7) -> 128 -> 10
    """
    
    def __init__(self):
        super(CNN, self).__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(
            in_channels=1,      # Grayscale input
            out_channels=32,    # 32 feature maps
            kernel_size=3,      # 3x3 filter
            padding=1           # Same padding
        )
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        
        # Pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Fully connected layers
        # After 2 convs and 2 pools: 28 -> 14 -> 7, so 7x7x64 = 3136
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Dropout
        self.dropout = nn.Dropout(0.25)
        
    def forward(self, x):
        # Conv block 1: Conv -> ReLU -> Pool
        x = self.conv1(x)           # (batch, 32, 28, 28)
        x = F.relu(x)
        x = self.pool(x)            # (batch, 32, 14, 14)
        
        # Conv block 2: Conv -> ReLU -> Pool
        x = self.conv2(x)           # (batch, 64, 14, 14)
        x = F.relu(x)
        x = self.pool(x)            # (batch, 64, 7, 7)
        
        # Flatten
        x = x.view(x.size(0), -1)   # (batch, 3136)
        
        # Fully connected layers
        x = F.relu(self.fc1(x))     # (batch, 128)
        x = self.dropout(x)
        x = self.fc2(x)             # (batch, 10)
        
        return x

# Create CNN model
cnn_model = CNN()
print(cnn_model)

# Count parameters
total_params = sum(p.numel() for p in cnn_model.parameters())
print(f"\nTotal parameters: {total_params:,}")
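
The `64 * 7 * 7` input size of `fc1` follows from the standard output-size formula for convolution and pooling, `out = floor((in + 2 * padding - kernel) / stride) + 1`. The sketch below traces one spatial dimension through the CNN; `conv_out` is a hypothetical helper and assumes a dilation of 1.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension (dilation of 1 assumed)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28
size = conv_out(size, kernel=3, padding=1)   # conv1 -> 28
size = conv_out(size, kernel=2, stride=2)    # pool  -> 14
size = conv_out(size, kernel=3, padding=1)   # conv2 -> 14
size = conv_out(size, kernel=2, stride=2)    # pool  -> 7

flat_features = 64 * size * size             # 64 channels after conv2
print(size, flat_features)  # 7 3136
```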

In [None]:
# ============================================
# MODEL INSPECTION
# ============================================

print("=" * 50)
print("MODEL INSPECTION")
print("=" * 50)

# Access model parameters
print("--- Named Parameters ---")
for name, param in cnn_model.named_parameters():
    print(f"{name}: shape={param.shape}, requires_grad={param.requires_grad}")

# Access specific layer
print("\n--- Accessing Layers ---")
print(f"First conv layer: {cnn_model.conv1}")
print(f"First conv layer weights shape: {cnn_model.conv1.weight.shape}")
print(f"First conv layer bias shape: {cnn_model.conv1.bias.shape}")

# Select the compute device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("\n--- Device ---")
print(f"Using device: {device}")

# Move model to device
cnn_model = cnn_model.to(device)
print(f"Model moved to: {next(cnn_model.parameters()).device}")

---

## 11. Loss Functions in PyTorch

### What is a Loss Function?

A **loss function** (or cost function) measures how well our model's predictions match the true labels. The goal of training is to minimize this loss.

### Common Loss Functions:

| Loss Function | Use Case |
|--------------|----------|
| `nn.CrossEntropyLoss` | Multi-class classification |
| `nn.NLLLoss` | Multi-class classification on log-probabilities |
| `nn.BCELoss` | Binary classification (expects probabilities in [0, 1]) |
| `nn.BCEWithLogitsLoss` | Binary classification on raw logits (more numerically stable) |
| `nn.MSELoss` | Regression |
| `nn.L1Loss` | Regression (more robust to outliers than MSE) |
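
To make the first row concrete: for a single sample, cross-entropy on raw logits is the log-sum-exp of the logits minus the logit of the true class. The sketch below checks that formula against `nn.CrossEntropyLoss` on made-up values; it is an illustration, not how PyTorch implements the loss internally.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])   # 2 samples, 3 classes (made-up values)
targets = torch.tensor([0, 2])

# Per-sample loss: log-sum-exp of the logits minus the logit of the true class
true_class_logits = logits[torch.arange(len(targets)), targets]
manual = (torch.logsumexp(logits, dim=1) - true_class_logits).mean()

builtin = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(manual, builtin))  # True
```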

In [None]:
# ============================================
# LOSS FUNCTIONS DEMONSTRATION
# ============================================

print("=" * 50)
print("LOSS FUNCTIONS")
print("=" * 50)

# Create sample predictions and targets
# For multi-class classification (like MNIST)
predictions = torch.randn(5, 10)  # 5 samples, 10 classes (raw logits)
targets = torch.tensor([0, 3, 5, 7, 9])  # True class labels

print("Predictions (logits) shape:", predictions.shape)
print("Targets:", targets)

# 1. CrossEntropyLoss - Most common for multi-class
print("\n--- CrossEntropyLoss ---")
ce_loss = nn.CrossEntropyLoss()
loss_value = ce_loss(predictions, targets)
print(f"CrossEntropyLoss: {loss_value.item():.4f}")

# Note: CrossEntropyLoss combines LogSoftmax + NLLLoss
# So you should NOT apply softmax before CrossEntropyLoss!

# 2. NLLLoss - Negative Log Likelihood (needs log probabilities)
print("\n--- NLLLoss ---")
nll_loss = nn.NLLLoss()
log_probs = F.log_softmax(predictions, dim=1)  # Apply log_softmax first
loss_value = nll_loss(log_probs, targets)
print(f"NLLLoss (with log_softmax): {loss_value.item():.4f}")

# 3. MSELoss - Mean Squared Error (for regression)
print("\n--- MSELoss ---")
mse_loss = nn.MSELoss()
pred_reg = torch.randn(5)
target_reg = torch.randn(5)
loss_value = mse_loss(pred_reg, target_reg)
print(f"MSELoss: {loss_value.item():.4f}")

# 4. BCELoss - Binary Cross Entropy
print("\n--- BCELoss ---")
bce_loss = nn.BCELoss()
pred_binary = torch.sigmoid(torch.randn(5))  # Must be probabilities (0-1)
target_binary = torch.tensor([0., 1., 0., 1., 1.])
loss_value = bce_loss(pred_binary, target_binary)
print(f"BCELoss: {loss_value.item():.4f}")

# 5. L1Loss - Mean Absolute Error
print("\n--- L1Loss ---")
l1_loss = nn.L1Loss()
loss_value = l1_loss(pred_reg, target_reg)
print(f"L1Loss: {loss_value.item():.4f}")
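
The table also lists `nn.BCEWithLogitsLoss`, which the cell above does not exercise. As a hedged sketch on made-up values, it should give the same result as applying `torch.sigmoid` followed by `nn.BCELoss`, while being safer for large-magnitude logits because the sigmoid and the log are fused.

```python
import torch
import torch.nn as nn

raw_scores = torch.tensor([-2.0, 0.0, 1.5, 3.0])   # made-up logits
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])

loss_from_probs = nn.BCELoss()(torch.sigmoid(raw_scores), labels)
loss_from_logits = nn.BCEWithLogitsLoss()(raw_scores, labels)

print(torch.allclose(loss_from_probs, loss_from_logits))  # True
```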

---

## 12. Optimizers in PyTorch

### What is an Optimizer?

An **optimizer** updates the model's parameters based on the computed gradients to minimize the loss function.

### Training Loop Pattern:
```python
optimizer.zero_grad()             # Clear previous gradients
output = model(data)              # Forward pass
loss = criterion(output, target)  # Compute loss
loss.backward()                   # Compute gradients
optimizer.step()                  # Update parameters
```
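
Clearing gradients matters because `backward()` adds to whatever is already stored in `.grad` rather than overwriting it. A minimal sketch with a made-up two-element tensor:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # d/dw sum(w^2) = 2w -> [2., 4.]

(w ** 2).sum().backward()     # no zeroing in between
accumulated = w.grad.clone()  # gradients were added: [4., 8.]

w.grad.zero_()                # manual reset, similar in effect to optimizer.zero_grad()
print(first.tolist(), accumulated.tolist())  # [2.0, 4.0] [4.0, 8.0]
```

Skipping the reset would silently double the effective gradient on the second step.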

### Popular Optimizers:

| Optimizer | Description |
|-----------|-------------|
| `SGD` | Stochastic Gradient Descent |
| `Adam` | Adaptive Moment Estimation |
| `AdamW` | Adam with decoupled weight decay |
| `RMSprop` | Root Mean Square Propagation |
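
For intuition, plain SGD (no momentum, no weight decay) applies `p_new = p - lr * grad`. The sketch below compares that rule with one `optim.SGD` step on a made-up parameter; the other optimizers in the table use more elaborate update rules.

```python
import torch
import torch.optim as optim

lr = 0.1
p = torch.tensor([1.0, -2.0], requires_grad=True)
opt = optim.SGD([p], lr=lr)

loss = (p ** 2).sum()      # gradient is 2 * p
loss.backward()

expected = p.detach() - lr * p.grad   # p_new = p - lr * grad
opt.step()

print(torch.allclose(p.detach(), expected))  # True
```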

In [None]:
# ============================================
# OPTIMIZERS DEMONSTRATION
# ============================================

print("=" * 50)
print("OPTIMIZERS")
print("=" * 50)

# Create a simple model for demonstration
demo_model = nn.Linear(10, 2)

# 1. SGD - Stochastic Gradient Descent
print("--- SGD (Stochastic Gradient Descent) ---")
sgd_optimizer = optim.SGD(
    demo_model.parameters(),
    lr=0.01,           # Learning rate
    momentum=0.9,      # Momentum factor
    weight_decay=1e-4  # L2 regularization
)
print(f"SGD: lr={sgd_optimizer.param_groups[0]['lr']}, momentum={sgd_optimizer.param_groups[0]['momentum']}")

# 2. Adam - Adaptive Moment Estimation
print("\n--- Adam ---")
adam_optimizer = optim.Adam(
    demo_model.parameters(),
    lr=0.001,          # Learning rate (default)
    betas=(0.9, 0.999), # Coefficients for running averages
    weight_decay=0     # L2 regularization
)
print(f"Adam: lr={adam_optimizer.param_groups[0]['lr']}")

# 3. AdamW - Adam with decoupled weight decay
print("\n--- AdamW ---")
adamw_optimizer = optim.AdamW(
    demo_model.parameters(),
    lr=0.001,
    weight_decay=0.01  # Decoupled weight decay, applied directly to the weights
)
print(f"AdamW: lr={adamw_optimizer.param_groups[0]['lr']}")

# 4. RMSprop
print("\n--- RMSprop ---")
rmsprop_optimizer = optim.RMSprop(
    demo_model.parameters(),
    lr=0.01,
    alpha=0.99
)
print(f"RMSprop: lr={rmsprop_optimizer.param_groups[0]['lr']}")

In [None]:
# ============================================
# LEARNING RATE SCHEDULERS
# ============================================

print("=" * 50)
print("LEARNING RATE SCHEDULERS")
print("=" * 50)

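# Learning rate schedulers adjust LR during training.
# Note: the four schedulers below share one optimizer only to show their
# constructors; in real training, attach a single scheduler (or chain them deliberately).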
demo_model = nn.Linear(10, 2)
optimizer = optim.Adam(demo_model.parameters(), lr=0.1)

# 1. StepLR - Decay by gamma every step_size epochs
scheduler1 = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
print(f"StepLR: Decay by 0.1 every 10 epochs")

# 2. ExponentialLR - Decay by gamma every epoch
scheduler2 = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
print(f"ExponentialLR: Decay by 0.95 every epoch")

# 3. ReduceLROnPlateau - Reduce when metric plateaus
scheduler3 = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5
)
print(f"ReduceLROnPlateau: Reduce by 0.1 after 5 epochs of no improvement")

# 4. CosineAnnealingLR - Cosine annealing
scheduler4 = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
print(f"CosineAnnealingLR: Cosine schedule over 50 epochs")

# Demonstrate LR change
print("\n--- LR Schedule Demo (StepLR) ---")
for epoch in range(25):
    current_lr = optimizer.param_groups[0]['lr']
    if epoch % 5 == 0:
        print(f"Epoch {epoch}: lr = {current_lr:.6f}")
    optimizer.step()   # No-op here (no gradients); avoids PyTorch's step-order warning
    scheduler1.step()
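
When `StepLR` is the only scheduler driving an optimizer, its schedule has a simple closed form: `lr = base_lr * gamma ** (epoch // step_size)`. The sketch below reproduces the values from the demo in plain Python; `step_lr` is a hypothetical helper, not part of PyTorch.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate StepLR would give at `epoch` when it is the only scheduler."""
    return base_lr * gamma ** (epoch // step_size)

for epoch in range(0, 25, 5):
    print(f"Epoch {epoch}: lr = {step_lr(0.1, 0.1, 10, epoch):.6f}")
# Epoch 0: lr = 0.100000
# Epoch 5: lr = 0.100000
# Epoch 10: lr = 0.010000
# Epoch 15: lr = 0.010000
# Epoch 20: lr = 0.001000
```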

---

## 13. Training Loop Implementation

Now let's put everything together and train our CNN on the MNIST dataset!

### The Training Loop:
1. Set model to training mode (`model.train()`)
2. For each batch:
   - Move data to device
   - Zero gradients
   - Forward pass
   - Compute loss
   - Backward pass
   - Update parameters

In [None]:
# ============================================
# SETUP FOR TRAINING
# ============================================

print("=" * 50)
print("TRAINING SETUP")
print("=" * 50)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create model
model = CNN().to(device)

# Loss function
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

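# Learning rate scheduler: halves the LR every 5 epochs. With NUM_EPOCHS = 5 the
# first decay only takes effect after the final epoch, so training runs at lr=0.001.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)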

# Training hyperparameters
NUM_EPOCHS = 5
BATCH_SIZE = 64

# Create data loaders (recreate with appropriate settings)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)

print(f"\nHyperparameters:")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {optimizer.param_groups[0]['lr']}")
print(f"  Optimizer: Adam")
print(f"  Loss function: CrossEntropyLoss")

In [None]:
# ============================================
# TRAINING FUNCTION
# ============================================

def train_epoch(model, train_loader, criterion, optimizer, device):
    """
    Train the model for one epoch.
    
    Args:
        model: Neural network model
        train_loader: DataLoader for training data
        criterion: Loss function
        optimizer: Optimizer
        device: Device to use (CPU/GPU)
        
    Returns:
        Tuple of (average training loss, training accuracy in %) for the epoch
    """
    model.train()  # Set model to training mode
    running_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (images, labels) in enumerate(train_loader):
        # Move data to device
        images = images.to(device)
        labels = labels.to(device)
        
        # Zero the gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(images)
        
        # Compute loss
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        
        # Update parameters
        optimizer.step()
        
        # Statistics
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
        
        # Print progress every 200 batches
        if (batch_idx + 1) % 200 == 0:
            print(f'  Batch [{batch_idx + 1}/{len(train_loader)}] '
                  f'Loss: {loss.item():.4f} '
                  f'Acc: {100. * correct / total:.2f}%')
    
    epoch_loss = running_loss / len(train_loader)
    epoch_acc = 100. * correct / total
    
    return epoch_loss, epoch_acc
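
One subtlety in `train_epoch`: dividing `running_loss` by the number of batches averages the per-batch means, which weights a smaller final batch as heavily as a full one. Assuming the standard 60,000-image MNIST training split, a batch size of 64 leaves a last batch of 32, so the effect here is tiny. The sketch below uses made-up loss values to show the difference.

```python
# Hypothetical per-batch mean losses and batch sizes (last batch is smaller)
batch_losses = [0.50, 0.40, 0.90]
batch_sizes = [64, 64, 32]

mean_of_batch_means = sum(batch_losses) / len(batch_losses)
per_sample_mean = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(f"{mean_of_batch_means:.4f}")  # 0.6000
print(f"{per_sample_mean:.4f}")      # 0.5400
```

If an exact per-sample average is wanted, one option is to accumulate `loss.item() * labels.size(0)` and divide by `total` at the end of the epoch.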

In [None]:
# ============================================
# EVALUATION FUNCTION
# ============================================

def evaluate(model, test_loader, criterion, device):
    """
    Evaluate the model on test data.
    
    Args:
        model: Neural network model
        test_loader: DataLoader for test data
        criterion: Loss function
        device: Device to use (CPU/GPU)
        
    Returns:
        Average test loss and accuracy
    """
    model.eval()  # Set model to evaluation mode
    running_loss = 0.0
    correct = 0
    total = 0
    
    # No gradient computation during evaluation
    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            # Statistics
            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    test_loss = running_loss / len(test_loader)
    test_acc = 100. * correct / total
    
    return test_loss, test_acc
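
`model.eval()` and `torch.no_grad()` do different jobs, which is why `evaluate` uses both. The sketch below shows this on a toy model (a stand-in for the demo, not the CNN above): `eval()` changes layer behavior such as dropout, while `no_grad()` stops autograd from recording operations.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))  # toy model for the demo
x = torch.ones(2, 4)

net.eval()                        # dropout becomes the identity
out_a = net(x)
out_b = net(x)
print(torch.equal(out_a, out_b))  # True: no randomness in eval mode
print(out_a.requires_grad)        # True: eval() alone still builds a graph

with torch.no_grad():             # autograd tracking is switched off
    out_c = net(x)
print(out_c.requires_grad)        # False
```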

In [None]:
# ============================================
# COMPLETE TRAINING LOOP
# ============================================

print("=" * 50)
print("TRAINING THE MODEL")
print("=" * 50)

# Track metrics
train_losses = []
train_accs = []
test_losses = []
test_accs = []

# Training loop
for epoch in range(NUM_EPOCHS):
    print(f"\nEpoch [{epoch + 1}/{NUM_EPOCHS}]")
    print("-" * 30)
    
    # Train
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    
    # Evaluate
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    test_losses.append(test_loss)
    test_accs.append(test_acc)
    
    # Record the learning rate used for this epoch, then advance the scheduler
    current_lr = optimizer.param_groups[0]['lr']
    scheduler.step()
    
    # Print epoch summary
    print(f"\n  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%")
    print(f"  Test Loss: {test_loss:.4f}  | Test Acc: {test_acc:.2f}%")
    print(f"  Learning Rate: {current_lr:.6f}")

print("\n" + "=" * 50)
print("TRAINING COMPLETE!")
print("=" * 50)
print(f"Final Test Accuracy: {test_accs[-1]:.2f}%")

In [None]:
# ============================================
# PLOT TRAINING CURVES
# ============================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot Loss
axes[0].plot(range(1, NUM_EPOCHS + 1), train_losses, 'b-o', label='Training Loss')
axes[0].plot(range(1, NUM_EPOCHS + 1), test_losses, 'r-o', label='Test Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Test Loss')
axes[0].legend()
axes[0].grid(True)

# Plot Accuracy
axes[1].plot(range(1, NUM_EPOCHS + 1), train_accs, 'b-o', label='Training Accuracy')
axes[1].plot(range(1, NUM_EPOCHS + 1), test_accs, 'r-o', label='Test Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('Training and Test Accuracy')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

---

## 14. Model Evaluation and Testing

Now let's evaluate our trained model more thoroughly and visualize its predictions.

In [None]:
# ============================================
# DETAILED EVALUATION
# ============================================

print("=" * 50)
print("DETAILED EVALUATION")
print("=" * 50)

# Get predictions on entire test set
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = model(images)
        _, preds = torch.max(outputs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.numpy())

all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# Per-class accuracy
print("\n--- Per-Class Accuracy ---")
for digit in range(10):
    mask = all_labels == digit
    class_acc = (all_preds[mask] == all_labels[mask]).mean() * 100
    print(f"Digit {digit}: {class_acc:.2f}%")

# Overall accuracy
overall_acc = (all_preds == all_labels).mean() * 100
print(f"\nOverall Test Accuracy: {overall_acc:.2f}%")

In [None]:
# ============================================
# VISUALIZE PREDICTIONS
# ============================================

# Get a batch of test images
test_images, test_labels = next(iter(test_loader))

# Make predictions
model.eval()
with torch.no_grad():
    test_images_device = test_images.to(device)
    outputs = model(test_images_device)
    probabilities = F.softmax(outputs, dim=1)
    _, predictions = torch.max(outputs, 1)

# Plot predictions
fig, axes = plt.subplots(3, 5, figsize=(15, 9))
fig.suptitle('Model Predictions on Test Images', fontsize=14)

for i in range(15):
    ax = axes[i // 5, i % 5]
    
    # Get image
    img = test_images[i].squeeze().numpy()
    img = img * 0.3081 + 0.1307  # Denormalize: x * std + mean (MNIST std=0.3081, mean=0.1307)
    
    # Get prediction info
    pred = predictions[i].item()
    true = test_labels[i].item()
    prob = probabilities[i][pred].item()
    
    # Display
    ax.imshow(img, cmap='gray')
    color = 'green' if pred == true else 'red'
    ax.set_title(f'Pred: {pred} ({prob:.1%})\nTrue: {true}', color=color)
    ax.axis('off')

plt.tight_layout()
plt.show()

In [None]:
# ============================================
# CONFUSION MATRIX
# ============================================

from collections import Counter

# Create confusion matrix manually
confusion_matrix = np.zeros((10, 10), dtype=int)
for true, pred in zip(all_labels, all_preds):
    confusion_matrix[true, pred] += 1

# Plot confusion matrix
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(confusion_matrix, cmap='Blues')

# Add labels
ax.set_xticks(range(10))
ax.set_yticks(range(10))
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14)

# Add text annotations
for i in range(10):
    for j in range(10):
        ax.text(j, i, confusion_matrix[i, j],
                ha="center", va="center",
                color="white" if confusion_matrix[i, j] > confusion_matrix.max()/2 else "black")

plt.colorbar(im)
plt.tight_layout()
plt.show()

# Find most common misclassifications
print("\n--- Most Common Misclassifications ---")
misclassified = [(true, pred, count) for (true, pred), count in 
                 Counter(zip(all_labels, all_preds)).items() if true != pred]
misclassified.sort(key=lambda x: -x[2])

for true, pred, count in misclassified[:5]:
    print(f"True: {true}, Predicted: {pred}, Count: {count}")
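
The explicit Python loop used to fill the confusion matrix is easy to read; for larger label sets, a vectorized variant may be preferable. The sketch below is one possible approach using `np.add.at` on made-up 3-class data, and also derives per-class accuracy from the matrix rows.

```python
import numpy as np

# Made-up labels and predictions for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
num_classes = 3

cm = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add handles repeated index pairs

# Row i holds all samples whose true label is i, so diagonal / row sum = class accuracy
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(cm.tolist())    # [[1, 1, 0], [0, 2, 0], [1, 0, 2]]
print(per_class_acc)
```

Plain fancy-index assignment (`cm[y_true, y_pred] += 1`) would undercount repeated pairs, which is why `np.add.at` is used here.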

---

## 15. Deep Learning Modules Overview

PyTorch provides a comprehensive set of modules for building neural networks. Let's explore the most important ones.

In [None]:
# ============================================
# NN.SEQUENTIAL - BUILDING MODELS EASILY
# ============================================

print("=" * 50)
print("NN.SEQUENTIAL")
print("=" * 50)

# nn.Sequential allows building models as a sequence of layers
sequential_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10)
)

print("Sequential Model:")
print(sequential_model)

# Test forward pass
test_input = torch.randn(1, 1, 28, 28)
output = sequential_model(test_input)
print(f"\nInput shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
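
An `nn.Sequential` container can also be indexed and sliced, which is handy for inspecting a layer or reusing the first few layers as a feature extractor. A small sketch on a shortened stand-in model (layer sizes chosen for the demo):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

first_linear = net[1]     # integer index returns the layer itself
feature_part = net[:3]    # a slice returns a new nn.Sequential sharing the same layers

x = torch.randn(2, 1, 28, 28)
print(first_linear)            # Linear(in_features=784, out_features=256, bias=True)
print(feature_part(x).shape)   # torch.Size([2, 256])
```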

In [None]:
# ============================================
# ACTIVATION FUNCTIONS
# ============================================

print("=" * 50)
print("ACTIVATION FUNCTIONS")
print("=" * 50)

x = torch.linspace(-5, 5, 100)

# Common activation functions
activations = {
    'ReLU': F.relu(x),
    'LeakyReLU': F.leaky_relu(x, 0.1),
    'Sigmoid': torch.sigmoid(x),
    'Tanh': torch.tanh(x),
    'GELU': F.gelu(x),
    'SiLU (Swish)': F.silu(x)
}

# Plot activation functions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for idx, (name, y) in enumerate(activations.items()):
    ax = axes[idx]
    ax.plot(x.numpy(), y.numpy(), linewidth=2)
    ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)
    ax.set_title(name, fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Common Activation Functions', y=1.02, fontsize=14)
plt.show()
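
Several of the plotted activations have one-line definitions. As a hedged check, the sketch below rebuilds ReLU, LeakyReLU (slope 0.1, as used above), and SiLU from basic tensor operations and compares them with the functional versions.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, 11)

relu_manual = torch.clamp(x, min=0)             # max(0, x)
leaky_manual = torch.where(x > 0, x, 0.1 * x)   # slope 0.1 for negative inputs
silu_manual = x * torch.sigmoid(x)              # x * sigmoid(x)

print(torch.allclose(relu_manual, F.relu(x)))              # True
print(torch.allclose(leaky_manual, F.leaky_relu(x, 0.1)))  # True
print(torch.allclose(silu_manual, F.silu(x)))              # True
```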

In [None]:
# ============================================
# NORMALIZATION LAYERS
# ============================================

print("=" * 50)
print("NORMALIZATION LAYERS")
print("=" * 50)

# BatchNorm - normalizes each feature/channel across the batch dimension
print("--- BatchNorm ---")
batch_norm_1d = nn.BatchNorm1d(100)  # For 1D data (N, C)
batch_norm_2d = nn.BatchNorm2d(32)   # For 2D data (N, C, H, W)

sample_1d = torch.randn(16, 100)  # Batch of 16, 100 features
sample_2d = torch.randn(16, 32, 28, 28)  # Batch of 16, 32 channels

output_1d = batch_norm_1d(sample_1d)
output_2d = batch_norm_2d(sample_2d)

print(f"BatchNorm1d: input {sample_1d.shape} -> output {output_1d.shape}")
print(f"BatchNorm2d: input {sample_2d.shape} -> output {output_2d.shape}")

# LayerNorm - normalizes each sample across its features
print("\n--- LayerNorm ---")
layer_norm = nn.LayerNorm([100])
output_ln = layer_norm(sample_1d)
print(f"LayerNorm: input {sample_1d.shape} -> output {output_ln.shape}")

# Compare before and after normalization
print(f"\nBefore BatchNorm - mean: {sample_1d.mean():.4f}, std: {sample_1d.std():.4f}")
print(f"After BatchNorm - mean: {output_1d.mean():.4f}, std: {output_1d.std():.4f}")
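
The overall mean and standard deviation printed above do not show which axis each layer normalizes over. The sketch below makes that visible on made-up data: after `BatchNorm1d` (in training mode) each feature column has roughly zero mean across the batch, while after `LayerNorm` each sample row has roughly zero mean across its features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 100) * 3 + 5     # batch of 16, 100 features, shifted and scaled

bn_out = nn.BatchNorm1d(100)(x)      # new modules start in training mode
ln_out = nn.LayerNorm(100)(x)

# BatchNorm: each feature column has ~zero mean over the batch (dim=0)
print(bn_out.mean(dim=0).abs().max().item() < 1e-4)   # True
# LayerNorm: each sample row has ~zero mean over its features (dim=1)
print(ln_out.mean(dim=1).abs().max().item() < 1e-4)   # True
```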

In [None]:
# ============================================
# POOLING LAYERS
# ============================================

print("=" * 50)
print("POOLING LAYERS")
print("=" * 50)

# Create sample feature map
feature_map = torch.randn(1, 1, 8, 8)
print(f"Input feature map shape: {feature_map.shape}")

# MaxPool2d - takes maximum value in each window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
max_pooled = max_pool(feature_map)
print(f"\nMaxPool2d (2x2): {feature_map.shape} -> {max_pooled.shape}")

# AvgPool2d - takes average value in each window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
avg_pooled = avg_pool(feature_map)
print(f"AvgPool2d (2x2): {feature_map.shape} -> {avg_pooled.shape}")

# AdaptiveAvgPool2d - outputs fixed size
adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))  # Global average pooling
adaptive_pooled = adaptive_pool(feature_map)
print(f"AdaptiveAvgPool2d (1x1): {feature_map.shape} -> {adaptive_pooled.shape}")
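
Shapes alone do not show what pooling computes. The sketch below applies the same three layers to a small 4x4 grid with known values, so each result can be verified by hand.

```python
import torch
import torch.nn as nn

grid = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11],
#  [12, 13, 14, 15]]

max_out = nn.MaxPool2d(kernel_size=2, stride=2)(grid)
avg_out = nn.AvgPool2d(kernel_size=2, stride=2)(grid)
global_avg = nn.AdaptiveAvgPool2d((1, 1))(grid)

print(max_out.squeeze().tolist())   # [[5.0, 7.0], [13.0, 15.0]]
print(avg_out.squeeze().tolist())   # [[2.5, 4.5], [10.5, 12.5]]
print(global_avg.item())            # 7.5
```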

In [None]:
# ============================================
# OTHER IMPORTANT MODULES
# ============================================

print("=" * 50)
print("OTHER IMPORTANT MODULES")
print("=" * 50)

# Dropout - regularization
print("--- Dropout ---")
dropout = nn.Dropout(p=0.5)  # 50% dropout probability
x = torch.ones(1, 10)
print(f"Input: {x}")
# Note: Dropout behaves differently in train vs eval mode
dropout.train()
print(f"Dropout (training): {dropout(x)}")
dropout.eval()
print(f"Dropout (eval): {dropout(x)}")

# Embedding - for categorical/text data
print("\n--- Embedding ---")
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=128)  # 1000 words, 128-dim vectors
word_indices = torch.tensor([1, 5, 10, 100])
embedded = embedding(word_indices)
print(f"Word indices: {word_indices.shape} -> Embeddings: {embedded.shape}")

# Softmax
print("\n--- Softmax ---")
logits = torch.tensor([2.0, 1.0, 0.1])
softmax = nn.Softmax(dim=0)
probabilities = softmax(logits)
print(f"Logits: {logits}")
print(f"Softmax: {probabilities}")
print(f"Sum: {probabilities.sum():.4f}")  # Should be 1.0

# ModuleList - for dynamic architectures
print("\n--- ModuleList ---")
layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])
print(f"ModuleList with {len(layers)} layers")
for i, layer in enumerate(layers):
    print(f"  Layer {i}: {layer}")
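
In the Dropout demo above, the surviving entries in training mode come out as 2.0 rather than 1.0. That is because `nn.Dropout` scales kept values by `1 / (1 - p)` so the expected activation stays the same between training and evaluation. A small sketch with a larger input (size and seed chosen arbitrarily):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
drop.train()

out = drop(torch.ones(10_000))
kept = out[out != 0]

print(kept.unique().tolist())   # [2.0]: survivors are scaled by 1 / (1 - p)
print(out.mean().item())        # close to 1.0, so the expected activation is preserved
```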

In [None]:
# ============================================
# SAVING AND LOADING MODELS
# ============================================

print("=" * 50)
print("SAVING AND LOADING MODELS")
print("=" * 50)

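# Method 1: Save entire model (not recommended: it pickles the whole object,
# so loading requires the original class definition to be importable)
print("--- Method 1: Save entire model ---")
torch.save(model, 'model_complete.pth')
print("Saved complete model to 'model_complete.pth'")

# Load entire model (weights_only=False unpickles arbitrary objects,
# so only load files you trust)
loaded_model = torch.load('model_complete.pth', weights_only=False)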
print("Loaded complete model")

# Method 2: Save state_dict only (recommended)
print("\n--- Method 2: Save state_dict (recommended) ---")
torch.save(model.state_dict(), 'model_state_dict.pth')
print("Saved model state_dict to 'model_state_dict.pth'")

# Load state_dict
new_model = CNN()  # Create new model instance
new_model.load_state_dict(torch.load('model_state_dict.pth', weights_only=True))
new_model.eval()  # Set to evaluation mode
print("Loaded model state_dict")

# Method 3: Save checkpoint (for resuming training)
print("\n--- Method 3: Save checkpoint ---")
checkpoint = {
    'epoch': NUM_EPOCHS,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'train_losses': train_losses,
    'test_losses': test_losses,
}
torch.save(checkpoint, 'checkpoint.pth')
print("Saved training checkpoint to 'checkpoint.pth'")

# Verify saved files
import os
for file in ['model_complete.pth', 'model_state_dict.pth', 'checkpoint.pth']:
    size = os.path.getsize(file) / 1024
    print(f"{file}: {size:.1f} KB")
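
The cell above saves a checkpoint but does not load it back. The sketch below shows one way resuming could look, using a toy `nn.Linear` model, an SGD optimizer, and a temporary file as stand-ins (all hypothetical; the checkpoint keys mirror the ones used above).

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Toy model and optimizer standing in for the trained CNN (hypothetical stand-ins)
net = nn.Linear(4, 2)
opt = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pth")
torch.save({"epoch": 3,
            "model_state_dict": net.state_dict(),
            "optimizer_state_dict": opt.state_dict()}, path)

# Later: rebuild the same architecture and optimizer, then restore their state
restored_net = nn.Linear(4, 2)
restored_opt = optim.SGD(restored_net.parameters(), lr=0.01, momentum=0.9)

ckpt = torch.load(path, weights_only=True)
restored_net.load_state_dict(ckpt["model_state_dict"])
restored_opt.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"]   # continue the epoch loop from here

print(start_epoch)                                    # 3
print(torch.equal(net.weight, restored_net.weight))   # True
```

If a learning rate scheduler is in use, saving and restoring its `state_dict()` alongside the optimizer keeps the schedule in sync after resuming.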

---

## 📝 Summary

### What We Covered:

1. **Tensor Basics**: Creation, attributes, reshaping, indexing
2. **Mathematical Operations**: Element-wise, matrix operations, broadcasting
3. **Statistical Functions**: Mean, std, var, min, max with dimension control
4. **Autograd**: Automatic differentiation, gradient computation
5. **Computation Graphs**: How PyTorch tracks operations for backpropagation
6. **MNIST Dataset**: Loading, transforming, and visualizing data
7. **DataLoader**: Batching, shuffling, and custom datasets
8. **Neural Network Models**: Building with nn.Module
9. **Loss Functions**: CrossEntropyLoss, MSELoss, and others
10. **Optimizers**: SGD, Adam, and learning rate schedulers
11. **Training Loop**: Complete implementation with evaluation
12. **Deep Learning Modules**: Normalization, pooling, activation functions
13. **Saving/Loading**: Models and checkpoints

### Key Takeaways:

- ✅ Use `requires_grad=True` for tensors that need gradients
- ✅ Always call `optimizer.zero_grad()` before `loss.backward()`
- ✅ Use `torch.no_grad()` during evaluation
- ✅ Set `model.train()` for training, `model.eval()` for evaluation
- ✅ Save `state_dict()` rather than the whole model
- ✅ Use appropriate loss functions for your task

In [None]:
# ============================================
# QUICK REFERENCE - COMMON PYTORCH PATTERNS
# ============================================

print("=" * 60)
print("QUICK REFERENCE - COMMON PYTORCH PATTERNS")
print("=" * 60)

print("""
🔷 TENSOR CREATION:
   torch.tensor([1, 2, 3])     # From list
   torch.zeros(3, 4)           # All zeros
   torch.ones(3, 4)            # All ones
   torch.rand(3, 4)            # Uniform [0, 1)
   torch.randn(3, 4)           # Normal (0, 1)
   torch.arange(0, 10, 2)      # Range with step
   torch.linspace(0, 1, 10)    # Evenly spaced

🔷 TENSOR OPERATIONS:
   x + y, torch.add(x, y)      # Addition
   x @ y, torch.matmul(x, y)   # Matrix multiplication
   x.view(2, 3)                # Reshape (shares memory)
   x.reshape(2, 3)             # Reshape (may copy)
   x.T, x.transpose(0, 1)      # Transpose

🔷 AUTOGRAD:
   x = torch.tensor([1.], requires_grad=True)
   y = x ** 2
   y.backward()                # Compute gradients
   x.grad                      # Access gradient
   
🔷 TRAINING LOOP:
   model.train()               # Training mode
   optimizer.zero_grad()       # Clear gradients
   outputs = model(inputs)     # Forward pass
   loss = criterion(outputs, targets)
   loss.backward()             # Backward pass
   optimizer.step()            # Update weights

🔷 EVALUATION:
   model.eval()                # Evaluation mode
   with torch.no_grad():       # Disable gradient tracking
       outputs = model(inputs)

🔷 DEVICE MANAGEMENT:
   device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
   model = model.to(device)
   data = data.to(device)

🔷 SAVING/LOADING:
   torch.save(model.state_dict(), 'model.pth')
   model.load_state_dict(torch.load('model.pth', weights_only=True))
""")

print("\n🎉 Congratulations! You've completed the PyTorch Fundamentals tutorial!")