In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and set random seeds (no packages are installed here)

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Convolutions from Scratch -- Vizuara

**How do neural networks learn to see? We build the convolution operation from first principles, then stack them into a CNN that classifies images.**

In this notebook, we will:
1. Understand what a convolution is by implementing it from scratch
2. Visualize learned filters and feature maps
3. Build a complete CNN and train it on CIFAR-10
4. Understand max pooling, ReLU, and feature hierarchies

**Runtime:** Google Colab (GPU recommended, T4 is sufficient)
**Estimated time:** 45-60 minutes

## 1. Why Does This Matter?

Many modern vision systems -- from self-driving cars to medical imaging AI -- rely on a simple idea: **slide a small filter across an image and compute a weighted sum at each position.** This operation, the convolution, is a core building block of computer vision.

But why does sliding a small filter work for understanding images? The key insight is that images have **local structure**. Edges, corners, and textures are all local patterns -- they do not depend on what is happening on the other side of the image. A vertical edge looks the same whether it appears in the top-left or bottom-right corner.

By the end of this notebook, you will have implemented convolutions from scratch, visualized what CNNs actually learn, and trained a working image classifier. Let us begin.

In [None]:
# Setup
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## 2. Building Intuition

Let us start with a simple thought experiment. Imagine you have a small magnifying glass that you slide across a photograph. At each position, you look at the small patch of pixels under the magnifying glass, perform a calculation, and write down a single number.

When you have done this for every position, you have a new grid of numbers -- a **feature map** -- that captures some local pattern from the original image.

The calculation we perform at each position is a **weighted sum**: multiply each pixel by a corresponding weight, then add them all up. The set of weights is called a **filter** or **kernel**.

Let us see this concretely with a simple 5x5 image and a 3x3 filter.

In [None]:
# Create a simple 5x5 image with a vertical edge
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=np.float32)

# A vertical edge detector filter
vertical_edge_filter = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=np.float32)

# Visualize both
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].imshow(image, cmap='gray', vmin=0, vmax=1)
axes[0].set_title('Input Image (5x5)', fontsize=12)
axes[0].set_xticks(range(5))
axes[0].set_yticks(range(5))
for i in range(5):
    for j in range(5):
        axes[0].text(j, i, f'{image[i,j]:.0f}', ha='center', va='center',
                    color='red' if image[i,j] < 0.5 else 'blue', fontsize=11)

axes[1].imshow(vertical_edge_filter, cmap='RdBu', vmin=-1, vmax=1)
axes[1].set_title('Vertical Edge Filter (3x3)', fontsize=12)
axes[1].set_xticks(range(3))
axes[1].set_yticks(range(3))
for i in range(3):
    for j in range(3):
        axes[1].text(j, i, f'{vertical_edge_filter[i,j]:.0f}', ha='center',
                    va='center', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## 3. The Mathematics

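The 2D convolution operation, for no padding and a stride of 1, can be written as follows. Strictly speaking this is a cross-correlation, because the kernel is not flipped, but deep learning libraries and this notebook call it convolution: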

$$O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m, n)$$

where:
- $I$ is the input image
- $K$ is the filter (kernel) of size $k \times k$
- $O$ is the output feature map

Let us plug in some simple numbers. For our 5x5 image and 3x3 filter above, the output at position (0,0) is:

$$O(0,0) = (0 \times -1) + (0 \times 0) + (1 \times 1) + (0 \times -1) + (0 \times 0) + (1 \times 1) + (0 \times -1) + (0 \times 0) + (1 \times 1) = 3$$

This value of 3 tells us there is a strong vertical edge at this position. The filter assigns negative weights to the left side and positive weights to the right side -- so when pixels transition from dark (0) to bright (1) going left to right, the output is large.
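
As a quick sanity check before the full implementation, the single-position sum above can be reproduced in a few lines of NumPy. This is a minimal sketch; the helper name `response_at` is chosen here for illustration and is not part of the notebook's later code.

```python
import numpy as np

# Same 5x5 image and 3x3 vertical edge filter as in the cell above
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float32)
kernel = np.array([[-1, 0, 1]] * 3, dtype=np.float32)

def response_at(image, kernel, i, j):
    """Weighted sum of the kernel-sized patch whose top-left corner is (i, j)."""
    kH, kW = kernel.shape
    patch = image[i:i + kH, j:j + kW]
    return float(np.sum(patch * kernel))

print(response_at(image, kernel, 0, 0))  # 3.0: each patch row is [0, 0, 1]
print(response_at(image, kernel, 0, 2))  # 0.0: the patch is uniformly bright
```

The second call shows the other half of the story: where there is no transition, the negative and positive weights cancel.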

Now let us implement this from scratch.

In [None]:
def conv2d_manual(image, kernel):
    """
    Perform 2D convolution manually (no padding, stride=1).

    Args:
        image: 2D numpy array (H, W)
        kernel: 2D numpy array (kH, kW)

    Returns:
        output: 2D numpy array
    """
    H, W = image.shape
    kH, kW = kernel.shape
    out_H = H - kH + 1
    out_W = W - kW + 1
    output = np.zeros((out_H, out_W))

    for i in range(out_H):
        for j in range(out_W):
            # Extract the local patch
            patch = image[i:i+kH, j:j+kW]
            # Element-wise multiply and sum
            output[i, j] = np.sum(patch * kernel)

    return output

# Apply our vertical edge detector
output = conv2d_manual(image, vertical_edge_filter)

print("Input image (5x5):")
print(image)
print(f"\nFilter (3x3):")
print(vertical_edge_filter)
print(f"\nOutput feature map ({output.shape[0]}x{output.shape[1]}):")
print(output)

Notice how the output is a 3x3 feature map: the image shrinks because the filter cannot extend beyond the edges. The first two columns have values of 3, because those are the window positions that straddle the dark-to-bright transition, while the last column is 0, because the window there covers only bright pixels. The filter responds where the vertical edge is and nowhere else.
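
The shrinkage follows a simple rule: with no padding and stride 1, an input of size $n$ and a kernel of size $k$ give an output of size $n - k + 1$. The sketch below uses the more general form $\lfloor (n + 2p - k) / s \rfloor + 1$ for padding $p$ and stride $s$; the function name `conv_output_size` is an illustrative helper, not something defined elsewhere in this notebook.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one spatial dimension of a convolution or pooling window."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))              # 3: the 5x5 image and 3x3 filter above
print(conv_output_size(8, 3, padding=1))   # 8: padding of 1 keeps a 3x3 conv size-preserving
print(conv_output_size(32, 2, stride=2))   # 16: a 2x2 window with stride 2 halves the size
```

The same formula applies independently to height and width.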

Let us visualize the convolution operation step by step and try different filters.

In [None]:
# Visualize the convolution result
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].imshow(image, cmap='gray')
axes[0].set_title('Input Image')
for i in range(5):
    for j in range(5):
        axes[0].text(j, i, f'{image[i,j]:.0f}', ha='center', va='center', fontsize=10)

axes[1].imshow(vertical_edge_filter, cmap='RdBu', vmin=-1, vmax=1)
axes[1].set_title('Vertical Edge Filter')
for i in range(3):
    for j in range(3):
        axes[1].text(j, i, f'{vertical_edge_filter[i,j]:.0f}', ha='center', va='center', fontsize=12)

axes[2].imshow(output, cmap='hot')
axes[2].set_title('Feature Map (Edge Detected!)')
for i in range(3):
    for j in range(3):
        axes[2].text(j, i, f'{output[i,j]:.0f}', ha='center', va='center', fontsize=12)

plt.tight_layout()
plt.show()

# Now try different filters
horizontal_edge = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=np.float32)
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
blur = np.ones((3, 3), dtype=np.float32) / 9.0

# Create a more interesting test image
test_img = np.zeros((8, 8), dtype=np.float32)
test_img[2:6, 2:6] = 1  # White square in center

filters = {'Vertical Edge': vertical_edge_filter, 'Horizontal Edge': horizontal_edge,
           'Sharpen': sharpen, 'Blur': blur}

fig, axes = plt.subplots(1, 5, figsize=(18, 3))
axes[0].imshow(test_img, cmap='gray')
axes[0].set_title('Input (8x8)')

for idx, (name, filt) in enumerate(filters.items()):
    result = conv2d_manual(test_img, filt)
    axes[idx+1].imshow(result, cmap='hot')
    axes[idx+1].set_title(name)

plt.tight_layout()
plt.show()

## 4. Let's Build It -- Component by Component

Now that we understand the basic convolution, let us build the three key components of a CNN:

### Component 1: Convolution Layer (with learnable filters)

In a real CNN, the filter values are **not hand-designed** -- they are **learned** during training. The network discovers which patterns are most useful for the task. Let us implement a convolution layer using PyTorch.
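
Before turning to PyTorch, it may help to see how the single-filter loop generalizes to several input and output channels. The sketch below is a plain NumPy forward pass only (nothing is learned here), and `conv2d_multichannel` is a hypothetical helper written for this illustration. It assumes the weight layout `(out_channels, in_channels, kH, kW)`, which is the layout `nn.Conv2d` uses for its `weight` tensor.

```python
import numpy as np

def conv2d_multichannel(x, weight, padding=0):
    """x: (C_in, H, W); weight: (C_out, C_in, kH, kW) -> (C_out, H_out, W_out)."""
    c_out, c_in, kH, kW = weight.shape
    if padding > 0:
        # Zero-pad the two spatial dimensions only
        x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    _, H, W = x.shape
    out = np.zeros((c_out, H - kH + 1, W - kW + 1))
    for o in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Each output value sums over ALL input channels of the patch
                out[o, i, j] = np.sum(x[:, i:i + kH, j:j + kW] * weight[o])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 8))      # one grayscale 8x8 image
w = rng.standard_normal((4, 1, 3, 3))   # 4 filters, each 1x3x3
y = conv2d_multichannel(x, w, padding=1)
print(y.shape)  # (4, 8, 8)
```

Each of the 4 filters produces its own feature map, and padding of 1 keeps the 8x8 spatial size.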

In [None]:
# Component 1: Understanding PyTorch's Conv2d

# Create a single convolution layer
conv_layer = nn.Conv2d(
    in_channels=1,     # Grayscale input (1 channel)
    out_channels=4,    # Learn 4 different filters
    kernel_size=3,     # 3x3 filters
    padding=1,         # Add padding to maintain spatial size
    bias=False         # No bias for clarity
)

# Inspect the learnable parameters
print(f"Filter shape: {conv_layer.weight.shape}")
print(f"  -> {conv_layer.weight.shape[0]} filters")
print(f"  -> each filter: {conv_layer.weight.shape[2]}x{conv_layer.weight.shape[3]}")
print(f"  -> Total parameters: {conv_layer.weight.numel()}")

# Apply to a random 8x8 image
x = torch.randn(1, 1, 8, 8)  # (batch=1, channels=1, H=8, W=8)
y = conv_layer(x)
print(f"\nInput shape:  {x.shape}")
print(f"Output shape: {y.shape}  (4 feature maps, same spatial size due to padding)")

### Component 2: ReLU Activation

After convolution, we apply the **ReLU** (Rectified Linear Unit) activation function, which simply sets all negative values to zero:

$$\text{ReLU}(x) = \max(0, x)$$

Let us plug in some numbers: $\text{ReLU}(-3) = 0$, $\text{ReLU}(0) = 0$, $\text{ReLU}(5) = 5$.

This introduces non-linearity -- without it, stacking multiple convolutions would be equivalent to a single convolution with a larger kernel (linear operations compose into linear operations).
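
The collapse of stacked linear layers can be checked numerically. The sketch below uses 1D signals and `np.convolve` for brevity (an assumption made for this illustration; the notebook's own layers are 2D): two back-to-back convolutions match one convolution with a merged kernel, and inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
a = np.array([1.0, -1.0])   # first kernel
b = np.array([1.0, 1.0])    # second kernel

# Two linear convolutions in a row...
two_linear = np.convolve(np.convolve(x, a), b)
# ...equal one convolution with the merged kernel a * b
merged_kernel = np.convolve(a, b)
one_linear = np.convolve(x, merged_kernel)
print(np.allclose(two_linear, one_linear))   # True

def relu(v):
    return np.maximum(v, 0.0)

# With a ReLU in between, no single kernel reproduces the result
with_relu = np.convolve(relu(np.convolve(x, a)), b)
print(np.allclose(with_relu, one_linear))    # False
```

Here the merged kernel is `[1, 0, -1]`, one element longer than either original kernel.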

In [None]:
# Component 2: ReLU activation
x_sample = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0, 5.0])
x_relu = F.relu(x_sample)

print("Input:", x_sample.numpy())
print("ReLU: ", x_relu.numpy())

# Visualize ReLU
x_range = torch.linspace(-5, 5, 100)
plt.figure(figsize=(6, 3))
plt.plot(x_range, F.relu(x_range), 'b-', linewidth=2)
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('ReLU Activation: max(0, x)')
plt.grid(alpha=0.3)
plt.show()

### Component 3: Max Pooling

Max pooling reduces the spatial dimensions by keeping only the **maximum value** in each local region. For a 2x2 window with stride 2, applied to a feature map $O$:

$$\text{MaxPool}(i, j) = \max(O(2i, 2j), O(2i{+}1, 2j), O(2i, 2j{+}1), O(2i{+}1, 2j{+}1))$$

Let us plug in some numbers. Given a 4x4 feature map with a 2x2 pooling window:

```
6  2 | 3  1
1  7 | 0  4
-----------
8  0 | 5  2
3  1 | 2  9
```

Max pooling gives: `[[7, 4], [8, 9]]` -- we keep the strongest activation from each 2x2 block.
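
The same result can be reproduced without PyTorch. This is a minimal NumPy sketch for the 2x2, stride-2 case only; `max_pool_2x2` is an illustrative helper and assumes the height and width are both even.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    H, W = fmap.shape
    assert H % 2 == 0 and W % 2 == 0, "height and width must be even"
    # Reshape so that axes 1 and 3 index positions inside each 2x2 block
    blocks = fmap.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([
    [6, 2, 3, 1],
    [1, 7, 0, 4],
    [8, 0, 5, 2],
    [3, 1, 2, 9],
], dtype=np.float32)

print(max_pool_2x2(fmap))
# [[7. 4.]
#  [8. 9.]]
```

The reshape trick avoids explicit loops: element `[a, b, c, d]` of `blocks` is `fmap[2a + b, 2c + d]`, so taking the maximum over `b` and `d` yields one value per block.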

In [None]:
# Component 3: Max Pooling
feature_map = torch.tensor([[
    [6., 2., 3., 1.],
    [1., 7., 0., 4.],
    [8., 0., 5., 2.],
    [3., 1., 2., 9.]
]]).unsqueeze(0)  # Shape: (1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(feature_map)

print("Before max pooling (4x4):")
print(feature_map.squeeze().numpy())
print(f"\nAfter max pooling (2x2):")
print(pooled.squeeze().numpy())
print(f"\nShape: {feature_map.shape} -> {pooled.shape}")
print("Spatial dimensions halved, strongest features preserved!")

## 5. Your Turn

### TODO 1: Implement a Horizontal Edge Detector

Complete the function below to create a horizontal edge detection filter and apply it to a test image. A horizontal edge detector should detect transitions from dark to bright going top-to-bottom.

In [None]:
def detect_horizontal_edges(image):
    """
    Apply a horizontal edge detection filter to the image.

    Args:
        image: 2D numpy array

    Returns:
        output: 2D numpy array (feature map with horizontal edges highlighted)

    Hint: The horizontal edge filter is the transpose of the vertical edge filter.
          It should have negative values in the top row and positive values in the bottom row.
    """
    # TODO: Define a 3x3 horizontal edge detection filter
    # horizontal_filter = np.array([...])

    # TODO: Apply convolution using conv2d_manual
    # output = ...

    # return output
    pass

# Test image with horizontal edge
test_h = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
], dtype=np.float32)

# Uncomment when ready:
# result = detect_horizontal_edges(test_h)
# plt.figure(figsize=(8, 3))
# plt.subplot(1, 2, 1); plt.imshow(test_h, cmap='gray'); plt.title('Input')
# plt.subplot(1, 2, 2); plt.imshow(result, cmap='hot'); plt.title('Horizontal Edges')
# plt.show()

In [None]:
# Verification cell for TODO 1
# Expected: the top two rows of the output are 3.0 (their windows straddle the edge)
# and the bottom row is 0.0 (its window covers only bright pixels).
# Run this after completing TODO 1 to verify:
# assert result is not None, "Function returned None"
# assert result.shape == (3, 3), f"Expected (3,3), got {result.shape}"
# print("Center row values:", result[1, :])
# print("Expected: all values should be 3.0 (strong horizontal edge)")

### TODO 2: Build a Custom CNN

Complete the CNN architecture below. Fill in the missing layers to create a 3-layer CNN for CIFAR-10 classification.

In [None]:
class MyCNN(nn.Module):
    """
    A 3-layer CNN for CIFAR-10 classification.

    Architecture:
        Conv2d(3, 32, 3, padding=1) -> ReLU -> MaxPool2d(2)
        Conv2d(32, 64, 3, padding=1) -> ReLU -> MaxPool2d(2)
        Conv2d(64, 128, 3, padding=1) -> ReLU -> MaxPool2d(2)
        Flatten -> Linear(128*4*4, 256) -> ReLU -> Linear(256, 10)

    Hint: CIFAR-10 images are 32x32x3. After 3 max-pool layers with stride 2,
          the spatial size becomes 32/2/2/2 = 4.
    """
    def __init__(self):
        super().__init__()
        # TODO: Define the convolutional layers
        # self.conv1 = nn.Conv2d(...)
        # self.conv2 = nn.Conv2d(...)
        # self.conv3 = nn.Conv2d(...)
        # self.pool = nn.MaxPool2d(...)
        # self.fc1 = nn.Linear(...)
        # self.fc2 = nn.Linear(...)
        pass

    def forward(self, x):
        # TODO: Implement the forward pass
        # x = self.pool(F.relu(self.conv1(x)))
        # x = self.pool(F.relu(self.conv2(x)))
        # x = self.pool(F.relu(self.conv3(x)))
        # x = x.flatten(1)
        # x = F.relu(self.fc1(x))
        # x = self.fc2(x)
        # return x
        pass

# Verification:
# model = MyCNN()
# test_input = torch.randn(2, 3, 32, 32)
# test_output = model(test_input)
# print(f"Input shape: {test_input.shape}")
# print(f"Output shape: {test_output.shape}")
# assert test_output.shape == (2, 10), f"Expected (2, 10), got {test_output.shape}"
# print("Architecture looks correct!")

## 6. Putting It All Together

Now let us build a complete CNN and train it on CIFAR-10. We will use the pre-built `SimpleCNN` architecture from the article.
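
It can be useful to estimate the model size by hand before printing it. The sketch below counts parameters for the architecture listed in the TODO 2 docstring (three 3x3 conv layers followed by two linear layers), assuming every layer keeps its default bias term. `conv_params` and `linear_params` are hypothetical helpers written for this illustration.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Weights of shape (c_out, c_in, k, k) plus one bias per output channel."""
    return c_out * c_in * k * k + (c_out if bias else 0)

def linear_params(n_in, n_out, bias=True):
    """Weight matrix of shape (n_out, n_in) plus one bias per output unit."""
    return n_out * n_in + (n_out if bias else 0)

layer_counts = {
    "conv1": conv_params(3, 32, 3),
    "conv2": conv_params(32, 64, 3),
    "conv3": conv_params(64, 128, 3),
    "fc1": linear_params(128 * 4 * 4, 256),
    "fc2": linear_params(256, 10),
}
for name, count in layer_counts.items():
    print(f"{name}: {count:,}")
print(f"total: {sum(layer_counts.values()):,}")  # total: 620,362
```

Most of the parameters sit in `fc1`, not in the convolutions: sharing one small filter across all positions keeps the conv layers cheap.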

In [None]:
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Count parameters
model = SimpleCNN().to(device)
total_params = sum(p.numel() for p in model.parameters())
print(f"SimpleCNN: {total_params:,} parameters")
print(f"\nArchitecture:")
print(model)

In [None]:
# Load CIFAR-10 dataset
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
print(f"Training samples: {len(trainset)}, Test samples: {len(testset)}")

## 7. Training and Results

In [None]:
# Training loop
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_losses, test_accs = [], []
NUM_EPOCHS = 15

for epoch in range(NUM_EPOCHS):
    model.train()
    running_loss = 0.0
    correct, total = 0, 0

    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    train_loss = running_loss / len(trainloader)
    train_acc = 100. * correct / total
    train_losses.append(train_loss)

    # Evaluate on test set
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    test_acc = 100. * correct / total
    test_accs.append(test_acc)

    print(f'Epoch {epoch+1}/{NUM_EPOCHS}: Loss={train_loss:.3f}, '
          f'Train Acc={train_acc:.1f}%, Test Acc={test_acc:.1f}%')

print(f'\nFinal test accuracy: {test_accs[-1]:.1f}%')

In [None]:
# Visualization checkpoint 1: Training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(train_losses, 'b-', linewidth=2)
ax1.set_xlabel('Epoch'); ax1.set_ylabel('Loss'); ax1.set_title('Training Loss')
ax1.grid(alpha=0.3)

ax2.plot(test_accs, 'r-', linewidth=2)
ax2.set_xlabel('Epoch'); ax2.set_ylabel('Accuracy (%)'); ax2.set_title('Test Accuracy')
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Final Output

Let us visualize what the CNN actually learned by looking at the first-layer filters and the feature maps they produce. This is one of the most fascinating aspects of CNNs -- the network discovers edge detectors, color detectors, and texture detectors on its own.

In [None]:
# Visualization checkpoint 2: Visualize learned filters
first_conv_weights = model.features[0].weight.detach().cpu()
print(f"First conv layer filters: {first_conv_weights.shape}")

fig, axes = plt.subplots(4, 8, figsize=(16, 8))
fig.suptitle('Learned First-Layer Filters (32 filters, 3x3 each)', fontsize=14)

for i in range(32):
    ax = axes[i // 8, i % 8]
    # Normalize filter for visualization
    w = first_conv_weights[i].permute(1, 2, 0).numpy()
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)
    ax.imshow(w)
    ax.axis('off')
    ax.set_title(f'F{i+1}', fontsize=8)

plt.tight_layout()
plt.show()

In [None]:
# Visualization checkpoint 3: Feature maps for a sample image
model.eval()
sample_img, sample_label = testset[0]
sample_img_gpu = sample_img.unsqueeze(0).to(device)

# Get feature maps after each conv layer
with torch.no_grad():
    fm1 = F.relu(model.features[0](sample_img_gpu))
    fm2 = F.relu(model.features[3](model.features[2](fm1)))  # after pool+conv
    fm3 = F.relu(model.features[6](model.features[5](fm2)))  # after pool+conv

# Show original image
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
img_show = sample_img.permute(1, 2, 0).numpy()
img_show = (img_show - img_show.min()) / (img_show.max() - img_show.min())
axes[0].imshow(img_show)
axes[0].set_title(f'Original: {classes[sample_label]}')

# Show feature maps from each layer (average across channels)
for idx, (fm, title) in enumerate([(fm1, 'Layer 1'), (fm2, 'Layer 2'), (fm3, 'Layer 3')]):
    axes[idx+1].imshow(fm.squeeze().mean(0).cpu().numpy(), cmap='viridis')
    axes[idx+1].set_title(f'{title} ({fm.shape[1]} channels)')
    axes[idx+1].axis('off')

plt.suptitle('Feature Maps: From Edges to Semantic Features', fontsize=14)
plt.tight_layout()
plt.show()

print("\nNotice how:")
print("- Layer 1: captures edges and simple color patterns (high spatial resolution)")
print("- Layer 2: captures textures and combinations of edges (lower resolution)")
print("- Layer 3: captures high-level object features (very abstract)")

## 9. Reflection and Next Steps

**What we built:** We started with the raw convolution operation, implemented it from scratch, understood its mathematics, and then built a complete CNN that classifies CIFAR-10 images with roughly 75-80% test accuracy.

**Key takeaways:**
1. A convolution is simply a weighted sum over a local neighborhood
2. The filter values are learned -- the network discovers useful patterns automatically
3. ReLU adds non-linearity; max pooling reduces spatial dimensions
4. Stacking convolutions creates a feature hierarchy: edges -> textures -> parts -> objects
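
One way to make the fourth takeaway concrete is to track how many input pixels a single unit can "see" after each layer, often called the receptive field. That term is not used elsewhere in this notebook, so treat the sketch below as an illustrative aside; `receptive_field` is a hypothetical helper that assumes a plain chain of layers described by `(kernel_size, stride)` pairs.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
    rf, jump = 1, 1          # jump = spacing, in input pixels, between neighbouring units
    sizes = []
    for k, s in layers:
        rf += (k - 1) * jump  # each extra kernel tap reaches `jump` input pixels further
        jump *= s             # striding spreads neighbouring units further apart
        sizes.append(rf)
    return sizes

# Three rounds of: 3x3 conv (stride 1) followed by 2x2 max pool (stride 2)
simple_cnn = [(3, 1), (2, 2)] * 3
print(receptive_field(simple_cnn))  # [3, 4, 8, 10, 18, 22]
```

Under these assumptions, a unit after the first convolution sees a 3x3 patch, while a unit after the last pooling layer sees a 22x22 region of the 32x32 input, which fits the progression from edges to larger structures.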

**Reflection questions:**
- Why do you think the first-layer filters look like edge detectors? What would happen if we initialized them differently?
- What happens to the test accuracy if we remove all max pooling layers? Why?
- How would the feature maps change if we trained on a different dataset (e.g., medical images)?

**Next:** In the next notebook, we will explore the limitations of CNNs and build a Vision Transformer from scratch -- an architecture that processes the *entire image at once* through self-attention.