# Computer Vision 

**Framework:** PyTorch 

**Environment:** Jupyter Notebook, Python


## 1. Introduction to Computer Vision with MNIST



### 1.1 Problem: Image Classification

In traditional programming, the developer defines the rules and conditions that the program follows to perform the desired tasks. This approach works well for many problems.

However, image classification, which requires the program to correctly identify the class of an image it has never encountered before, is extremely difficult to achieve with traditional programming methods. It is nearly impossible for a programmer to write rules by hand that work across the vast diversity of images.


### 1.2 Solution: Deep Learning

Deep learning excels at recognizing patterns through a process of trial and error. When a deep neural network is trained on a sufficient amount of data and given feedback on its performance during training, it learns the set of rules it needs to correctly interpret and act on new, unseen inputs.

### 1.3 The MNIST dataset 

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

This code checks whether a CUDA-enabled GPU is available and, if so, sets the device to use the GPU for computation. Otherwise, it defaults to the CPU.

What is CUDA?

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It allows developers to leverage the power of NVIDIA GPUs to accelerate computation-heavy tasks, such as deep learning model training and inference.

Difference between CUDA and GPU:

- **GPU (Graphics Processing Unit)**: Hardware component specialized for parallel processing, initially designed for rendering graphics but highly effective for general-purpose computations.
- **CUDA**: Software and programming framework that enables applications to utilize the computational power of NVIDIA GPUs. It provides APIs and tools for writing software that runs on the GPU.

In short, CUDA is the bridge that allows software (like PyTorch) to execute on the GPU hardware.

Further reading:

[What is CUDA?](https://developer.nvidia.com/blog/even-easier-introduction-cuda)


In [None]:
import torchvision

train_dataset = torchvision.datasets.MNIST("./data/", train=True, download=True)
test_dataset = torchvision.datasets.MNIST("./data/", train=False, download=True)
print(train_dataset[0])
print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

In [None]:
import matplotlib.pyplot as plt
def visualize_mnist_samples(dataset, num_samples=8):
    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    axes = axes.ravel()

    for i in range(num_samples):
        image, label = dataset[i]
        axes[i].imshow(image, cmap='gray')
        axes[i].set_title(f'Label: {label}')
        axes[i].axis('off')

    plt.tight_layout()
    plt.suptitle('MNIST Dataset Samples', y=1.02, fontsize=14)
    plt.show()

visualize_mnist_samples(train_dataset)

In [None]:
import numpy as np

def visualize_first_sample_with_array(dataset):
    image, label = dataset[0]
    image_array = np.array(image)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    axes[0].imshow(image_array, cmap='gray')
    axes[0].set_title(f'MNIST Sample Image - Label: {label}')
    axes[0].axis('off')

    axes[1].axis('off')
    table = axes[1].table(cellText=image_array, loc='center', cellLoc='center')
    table.scale(0.85, 0.85)
    axes[1].set_title('Pixel Values (28x28)')

    plt.show()

visualize_first_sample_with_array(train_dataset)

### 1.3 Tensors

We successfully loaded our dataset. Next, we convert it to tensors.

What is a tensor?

If a vector is a 1-dimensional array and a matrix is a 2-dimensional array, a tensor is the generalization to an n-dimensional array, where n can be any number of dimensions (including 0 for a single value).
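As a small illustration of the definition above, the sketch below builds tensors with 0, 1, 2, and 4 dimensions and prints their shapes. The variable names and the batch size of 64 are chosen only for this example.

```python
import torch

# Tensors with an increasing number of dimensions.
scalar = torch.tensor(7.0)                # 0-D: a single value
vector = torch.tensor([1.0, 2.0, 3.0])    # 1-D: shape (3,)
matrix = torch.zeros(28, 28)              # 2-D: e.g. one grayscale image
batch = torch.zeros(64, 1, 28, 28)        # 4-D: (batch, channels, height, width)

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

The 4-dimensional layout (batch, channels, height, width) is the one PyTorch image models typically expect as input.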

Why convert data to tensors in Deep Learning?

Because tensors are the fundamental data structure used by deep learning frameworks like PyTorch. They efficiently represent and manipulate multi-dimensional data such as images, enabling optimized mathematical operations on CPU and GPU. 

[More about tensors](https://docs.pytorch.org/docs/stable/tensors.html)


In [None]:
import torchvision.transforms.v2 as transforms
transform = transforms.Compose([transforms.ToTensor()])

x_0, y_0 = train_dataset[0]

Note: When working with deep learning you will often see the inputs (here, images) denoted `x` and their labels denoted `y`.


In [None]:
x_0

In [None]:
type(x_0)

In [None]:
y_0

In [None]:
type(y_0)

In [None]:
x_0_tensor = transform(x_0)

In [None]:
x_0_tensor.min()

In [None]:
x_0_tensor.max()

PyTorch's ToTensor() transform converts a PIL image or a NumPy array of 8-bit pixel values in the range [0, 255] into a floating-point tensor of shape (C x H x W) with values scaled to the range [0.0, 1.0].

For MNIST, which consists of grayscale 28x28 pixel images, ToTensor() will transform each image from a 28x28 pixel format with integer values 0-255 into a 1x28x28 tensor with floating point values between 0.0 and 1.0.
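To make the conversion concrete, here is a rough manual equivalent of what the text describes: add a channel dimension and divide by 255. This is only a sketch under the assumption of an 8-bit grayscale input; it uses a random stand-in image and is not torchvision's actual implementation.

```python
import torch

# Stand-in for one MNIST image: 28x28 unsigned 8-bit pixels in [0, 255].
torch.manual_seed(0)
raw = torch.randint(0, 256, (28, 28), dtype=torch.uint8)

# (H, W) -> (C, H, W), then rescale the integers to floats in [0.0, 1.0].
scaled = raw.unsqueeze(0).to(torch.float32) / 255.0

print(raw.dtype, tuple(raw.shape))
print(scaled.dtype, tuple(scaled.shape))
print(scaled.min().item(), scaled.max().item())
```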



In [None]:
x_0_tensor.size()

In [None]:
x_0_tensor.device

Moving data and models to a GPU is a crucial step in deep learning. While tensors and models default to being stored and computed on the CPU, GPUs offer more efficient parallel processing capabilities that significantly accelerate training and inference.

Consistently ensuring that both data and model reside on the same device also prevents runtime errors caused by device mismatches.
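A minimal sketch of the same-device rule, using a hypothetical single linear layer instead of a full model: the layer is moved to `device`, and the input batch is moved to the same device before the forward pass. On a machine with a GPU, passing the CPU batch directly would typically raise a `RuntimeError` about tensors being on different devices.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(784, 10).to(device)   # parameters now live on `device`
batch = torch.rand(4, 784)              # new tensors are created on the CPU by default

# Move the input to the model's device before calling the model.
logits = layer(batch.to(device))

print(next(layer.parameters()).device, logits.device)
```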



In [None]:
x_0_tensor_gpu = x_0_tensor.to(device)
x_0_tensor_gpu.device

### 1.4 Build MLP Model

A Multi-Layer Perceptron (MLP) is the most basic type of neural network. For MNIST digit classification, we need:

1. **Input Layer**: Flattens the 28x28 image into a 784-dimensional vector
2. **Hidden Layers**: Two fully connected layers (128 and 64 neurons) with ReLU activations that learn patterns
3. **Output Layer**: 10 neurons (one for each digit 0-9)
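The size of such a network can be worked out by hand: each fully connected layer has `in_features * out_features` weights plus `out_features` biases. The sketch below assumes hidden sizes of 128 and 64, the defaults of the `MLP` class in this section; `linear_params` is a helper introduced only for this example.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Assumed layer sizes: 784 inputs, hidden layers of 128 and 64, 10 outputs.
layer_sizes = [784, 128, 64, 10]

per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # [100480, 8256, 650]
print(total_params)   # 109386
```

Almost all of the parameters sit in the first layer, because every one of the 784 input pixels is connected to every hidden neuron.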


In [None]:
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_size=784, hidden1=128, hidden2=64, num_classes=10):
        super(MLP, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(input_size, hidden1)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden2, num_classes)

    def forward(self, x):
        x = self.flatten(x)
        x = self.relu1(self.fc1(x))  # 784 -> 128
        x = self.relu2(self.fc2(x))  # 128 -> 64
        x = self.fc3(x)              # 64 -> 10
        return x

model = MLP().to(device)
print("MLP Architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms.v2 as transforms

# Augment only the training images; evaluate on unmodified test images.
train_transform = transforms.Compose([transforms.RandomRotation(degrees=15),
                                      transforms.ToTensor()])
test_transform = transforms.Compose([transforms.ToTensor()])

train_dataset = torchvision.datasets.MNIST("./data/", train=True, download=False, transform=train_transform)
test_dataset = torchvision.datasets.MNIST("./data/", train=False, download=False, transform=test_transform)

criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.001)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Batch size: {batch_size}")
print(f"Number of batches (train): {len(train_loader)}")
print(f"Number of batches (test): {len(test_loader)}")

In [None]:
def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()
        total += target.size(0)

    avg_loss = total_loss / len(train_loader)
    accuracy = 100. * correct / total
    return avg_loss, accuracy

epochs = 5

print("Training MNIST classifier...")
for epoch in range(epochs):
    print(f"\nEpoch {epoch+1}/{epochs}")
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
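The loop above reports training accuracy only. One possible way to measure performance on held-out data is sketched below; `evaluate` is a hypothetical helper, not part of the original notebook. So that it runs on its own, it is demonstrated on a tiny synthetic dataset and a one-layer stand-in model, both of which are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion, device):
    """Return (average loss, accuracy in %) without updating the model."""
    model.eval()                      # switch off training-only behaviour
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():             # gradients are not needed here
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item()
            correct += (output.argmax(dim=1) == target).sum().item()
            total += target.size(0)
    return total_loss / len(loader), 100.0 * correct / total

# Synthetic stand-in data and model (not real MNIST) so the sketch is runnable.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fake_images = torch.rand(32, 1, 28, 28)
fake_labels = torch.randint(0, 10, (32,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10)).to(device)

eval_loss, eval_acc = evaluate(toy_model, fake_loader, nn.CrossEntropyLoss(), device)
print(f"Loss: {eval_loss:.4f}, Acc: {eval_acc:.2f}%")
```

In the notebook it could be called as `evaluate(model, test_loader, criterion, device)` after each epoch to compare training and test accuracy.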

In [None]:
def visualize_predictions(model, dataset, device, num_correct=3, num_wrong=2):
    model.eval()

    correct_samples = []
    wrong_samples = []

    with torch.no_grad():
        for i in range(len(dataset)):
            if len(correct_samples) >= num_correct and len(wrong_samples) >= num_wrong:
                break

            image, true_label = dataset[i]
            output = model(image.unsqueeze(0).to(device))
            pred_label = output.argmax(dim=1).item()

            if pred_label == true_label and len(correct_samples) < num_correct:
                correct_samples.append((image, true_label, pred_label))
            elif pred_label != true_label and len(wrong_samples) < num_wrong:
                wrong_samples.append((image, true_label, pred_label))

    all_samples = wrong_samples + correct_samples

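    # Size the grid to the number of samples actually collected.
    fig, axes = plt.subplots(1, len(all_samples), figsize=(3 * len(all_samples), 3), squeeze=False)

    for idx, (image, true_label, pred_label) in enumerate(all_samples):
        ax = axes[0, idx]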
        ax.imshow(image.squeeze().cpu().numpy(), cmap='gray')

        if true_label == pred_label:
            ax.set_title(f'True: {true_label}\nPred: {pred_label}', color='green')
        else:
            ax.set_title(f'True: {true_label}\nPred: {pred_label}', color='red')
        ax.axis('off')

    plt.tight_layout()
    plt.suptitle('Model Predictions (Red=Wrong, Green=Correct)', y=1.1, fontsize=14)
    plt.show()

visualize_predictions(model, test_dataset, device)

### 1.5 Convolutional Neural Networks

While the MLP achieved good results, it has a fundamental limitation: it treats images as flat vectors, losing spatial information. 

A **CNN** (Convolutional Neural Network) is designed specifically for image data.

**Why CNNs are better for images:**
- **Preserves spatial structure**: Understands that nearby pixels are related
- **Parameter sharing**: Uses the same filters across the whole image (fewer parameters)
- **Translation invariance**: Detects features regardless of position (a filter responds to a feature wherever it occurs, and pooling adds tolerance to small shifts)
- **Hierarchical learning**: Learns simple features (edges) -> complex features (shapes)
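The parameter-sharing and position properties listed above can be checked on a toy input. In this sketch, a single random 3x3 filter (9 weights, regardless of image size) is applied to an 8x8 image and to a copy shifted by one pixel. The filter, the image size, and the shift are all assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
kernel = torch.randn(1, 1, 3, 3)       # one shared 3x3 filter: only 9 weights

# An 8x8 image with a single bright pixel, and a copy shifted down and right by one.
image = torch.zeros(1, 1, 8, 8)
image[0, 0, 2, 2] = 1.0
shifted = torch.roll(image, shifts=(1, 1), dims=(2, 3))

out = F.conv2d(image, kernel, padding=1)
out_shifted = F.conv2d(shifted, kernel, padding=1)

# The response moves with the input: shifting the output of the original image
# gives the same result as convolving the shifted image.
same_response = torch.allclose(torch.roll(out, shifts=(1, 1), dims=(2, 3)), out_shifted)
print(same_response)
```

Strictly speaking, this shows that the filter's response follows the feature around the image. Pooling layers and the final classifier are what turn that into a prediction that is largely insensitive to small shifts.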

**Key components:**
- **Convolutional layers**: Apply filters to detect patterns
- **Pooling layers**: Reduce spatial dimensions, keep important features
- **Fully connected layers**: Final classification based on learned features
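How much the pooling layers shrink the image can be traced with the standard output-size formula, `(size + 2 * padding - kernel) // stride + 1`. The sketch below assumes 3x3 convolutions with padding 1, 2x2 max pooling with stride 2, and 16 output channels in the last convolution, matching the `CNN` class in this section; `conv_out` is a helper introduced only for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                     # MNIST images are 28x28
size = conv_out(size, kernel=3, padding=1)    # 3x3 conv, padding 1 -> 28
size = conv_out(size, kernel=2, stride=2)     # 2x2 max pool        -> 14
size = conv_out(size, kernel=3, padding=1)    # 3x3 conv, padding 1 -> 14
size = conv_out(size, kernel=2, stride=2)     # 2x2 max pool        -> 7

channels = 16                                 # assumed channels after the last conv layer
flat_features = channels * size * size
print(size, flat_features)                    # 7 784
```

This is where the `16 * 7 * 7` input size of the first fully connected layer comes from.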

In [None]:
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 7 * 7, 32)
        self.fc2 = nn.Linear(32, num_classes)

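    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 1x28x28 -> 8x14x14
        x = self.pool(F.relu(self.conv2(x)))  # 8x14x14 -> 16x7x7
        x = x.view(-1, 16 * 7 * 7)            # flatten to 784 features per image
        x = F.relu(self.fc1(x))               # 784 -> 32
        x = self.fc2(x)                       # 32 -> 10 class scores
        return x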

cnn_model = CNN().to(device)
print("CNN Architecture:")
print(cnn_model)

mlp_params = sum(p.numel() for p in model.parameters())
cnn_params = sum(p.numel() for p in cnn_model.parameters())

print(f"\nParameter Comparison:")
print(f"  MLP: {mlp_params:,} parameters")
print(f"  CNN: {cnn_params:,} parameters")


In [None]:
cnn_criterion = nn.CrossEntropyLoss()
cnn_optimizer = Adam(cnn_model.parameters(), lr=0.001)

epochs = 5

print("Training CNN on MNIST...")
for epoch in range(epochs):
    print(f"\nEpoch {epoch+1}/{epochs}")
    train_loss, train_acc = train_epoch(cnn_model, train_loader, cnn_criterion, cnn_optimizer, device)
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")

In [None]:
print("CNN Model Predictions:")
visualize_predictions(cnn_model, test_dataset, device)