- Loads the **CIFAR-100** image dataset (50k train, 10k test; images are 3 * 32 * 32).
- Builds a small neural network (an MLP with one hidden layer).
- Trains it to predict one of **100 classes**.
- **Evaluates** accuracy on the test set.
- **Saves** the trained weights to `model.ckpt`.

```
images -> numbers (tensors) -> model computes scores -> compare to correct label -> adjust model -> repeat -> test.
```

In [1]:
## Imports
import torch
import torch.nn as nn
import torchvision                      # datasets and models for vision
import torchvision.transforms as transforms
from sympy import false


- `torch`: core PyTorch (tensors, GPU, autograd).
- `torch.nn as nn`: neural-network layers & losses.
- `torchvision`: ready-made datasets/models for vision.
- `torchvision.transforms as transforms`: image preprocessing (convert to tensor, normalize, etc.).
- `from sympy import false`: not needed here - safe to delete.

In [None]:
# Picking a compute device (CPU / NVIDIA GPU / Apple GPU)

In [None]:
def get_device():
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

- On Apple silicon, MPS runs on the built-in GPU.
- On PCs with an NVIDIA GPU, CUDA runs on the GPU.
- Otherwise, everything runs on the CPU.
- We must move both the model and the data to the same device.

In [None]:
# 3) Defining the model (a tiny MLP)
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)   # weights + bias [3072 -> 500]
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)  # [500 -> 100]

    def forward(self, x):
        x = x.view(x.size(0), -1)   # flatten image [B, 3, 32, 32] -> [B, 3072]
        x = self.fc1(x)     # linear transform
        x = self.relu(x)    # ReLU activation
        x = self.fc2(x)     # final logits (raw scores) for 100 classes.
        return x



- **Why flatten?** The MLP expects a 1-D vector per image (3072 numbers = 3 * 32 * 32).
- **Why ReLU?** Adds non-linearity so the model can learn complex patterns.
- **Why return logits (no softmax)?** The loss function used with this model, `nn.CrossEntropyLoss`, expects **raw scores** and applies log-softmax internally, which is more numerically stable than applying softmax yourself.
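
The point about numerical stability can be illustrated without PyTorch. The sketch below is a plain-Python stand-in, not the library's implementation: `cross_entropy_from_logits` is a hypothetical helper that computes the loss for one example directly from raw scores using the log-sum-exp trick.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy loss for one example, computed directly from raw scores."""
    # Subtract the largest logit before exponentiating (log-sum-exp trick),
    # so math.exp never receives a large positive argument.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[target]) equals log_sum_exp - logits[target].
    return log_sum_exp - logits[target]

small = cross_entropy_from_logits([2.0, 1.0, 0.1], target=0)
uniform = cross_entropy_from_logits([0.0] * 100, target=7)  # all classes equally likely
huge = cross_entropy_from_logits([1000.0, 0.0], target=0)   # a naive math.exp(1000.0) would overflow

print(small, uniform, huge)
```

With 100 equally scored classes the loss equals `log(100)`, roughly 4.6, which is a useful reference value for the loss of an untrained 100-class model.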

In [None]:
# 4) main() -> set hyperparameters & data pipeline
device = get_device()
print(f"Using device: {device}")

# Hyper-parameters for CIFAR-100
input_size = 3 * 32 * 32    # 3072 numbers per image
hidden_size = 500
num_classes = 100
num_epochs = 5
batch_size = 128
learning_rate = 0.001



- **Hyperparameters** are knobs you choose (not learned): learning rate, batch size, etc.
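
To make these knobs concrete, the sketch below works out two quantities they determine: how many learnable parameters the MLP has, and how many weight updates the training run performs. It assumes the 50k training images stated at the top and that the last, partial batch of each epoch is kept; `linear_params` is a hypothetical helper, not a PyTorch function.

```python
import math

# Values copied from the hyperparameter cell above.
input_size = 3 * 32 * 32
hidden_size = 500
num_classes = 100
num_epochs = 5
batch_size = 128
train_images = 50_000  # CIFAR-100 training set size

def linear_params(n_in, n_out):
    """A fully connected layer has n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out

total_params = linear_params(input_size, hidden_size) + linear_params(hidden_size, num_classes)

# Assumption: the final, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_images / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"parameters: {total_params:,}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

Doubling `hidden_size` roughly doubles the parameter count, while doubling `batch_size` halves the number of updates per epoch.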

In [None]:
# 5) Transforms (preprocessing)
transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize(
            mean=(0.5071, 0.4867, 0.4408),
            std=(0.2675, 0.2565, 0.2761)
        )
    ]
)

- `transforms.Compose([...])` = build a pipeline of steps. Each image passes through the steps in order.
- `ToTensor()`:
    - Converts a PIL image (HxWxC, 0-255 integers) -> PyTorch tensor (CxHxW, float32) with values in [0,1].
- `Normalize(mean, std)` per channel (R, G, B):
    - For every pixel value `x` (already in [0,1]), `x_norm = (x - mean[channel]) / std[channel]`.
    - This is standardization: **center** (subtract mean) and **scale** (divide by std).
    - Result: each channel's distribution is roughly mean ≈ 0 and std ≈ 1 over the training set.
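
The arithmetic of the two steps can be traced by hand for a single pixel. The sketch below is plain Python rather than torchvision; `to_unit_range` and `normalize` are hypothetical helpers that mirror what `ToTensor()` and `Normalize` do to one channel value, and the pixel is made up for illustration.

```python
# Channel statistics copied from the Normalize call above (R, G, B).
MEAN = (0.5071, 0.4867, 0.4408)
STD = (0.2675, 0.2565, 0.2761)

def to_unit_range(pixel):
    """Scale one 0-255 channel value to [0, 1], as ToTensor() does for 8-bit images."""
    return pixel / 255.0

def normalize(value, channel):
    """Standardize one [0, 1] value with that channel's mean and std."""
    return (value - MEAN[channel]) / STD[channel]

rgb = (255, 128, 0)  # hypothetical pixel: full red, mid green, no blue
normalized = [normalize(to_unit_range(p), c) for c, p in enumerate(rgb)]

for name, raw, out in zip("RGB", rgb, normalized):
    print(f"{name}: {raw:3d} -> {out:+.4f}")
```

The brightest possible value maps to a positive number, the darkest to a negative one, and a mid-gray value lands close to zero, which is what "centered" means here.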
### Why do we normalize?
- **Faster, more stable training.** Centered, similarly-scaled features make gradients behave better.
- **Helps optimization** (Adam/SGD) and can improve final accuracy.
- It's a long-standing best practice for image models.

### Why these numbers (mean = (0.5071, 0.4867, 0.4408) and std = (0.2675, 0.2565, 0.2761))?
- These are empirical channel means & stds computed on the CIFAR-100 training set after `ToTensor()` (so they're on the 0-1 scale).
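
As a rough illustration of how such statistics can be obtained, the sketch below computes per-channel mean and std over a tiny made-up dataset in plain Python. The helper `channel_stats` and the toy images are hypothetical, and whether the constants above were produced in exactly this way (for example, population std over all pixels) is an assumption.

```python
import math

def channel_stats(images):
    """Per-channel mean and std over every pixel of every image.

    `images` is a list of images; each image is a list of (r, g, b) pixels
    whose values are already scaled to [0, 1].
    """
    n = 0
    sums = [0.0, 0.0, 0.0]
    sq_sums = [0.0, 0.0, 0.0]
    for image in images:
        for pixel in image:
            n += 1
            for c, v in enumerate(pixel):
                sums[c] += v
                sq_sums[c] += v * v
    means = [s / n for s in sums]
    # Population std: sqrt(E[x^2] - E[x]^2); clamp at 0 to guard against rounding error.
    stds = [math.sqrt(max(sq / n - m * m, 0.0)) for sq, m in zip(sq_sums, means)]
    return means, stds

# Toy data: two "images" of two pixels each (not real CIFAR-100 pixels).
toy_images = [
    [(0.0, 0.25, 0.0), (1.0, 0.75, 0.0)],
    [(0.5, 0.25, 1.0), (0.5, 0.75, 1.0)],
]
means, stds = channel_stats(toy_images)
print(means, stds)
```

On the real dataset the same accumulation would run over all 50k training images, which is why the statistics come from the training set only and are then reused unchanged for the test set.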
### tl;dr
- `Compose` builds a preprocessing pipeline.
- `ToTensor()` -> tensor in [0,1].
- `Normalize(mean, std)` -> per-channel `(x - mean) / std` using CIFAR-100's own mean/std, which speeds up and stabilizes learning.