<a href="https://colab.research.google.com/github/JordanDCunha/Hands-On-Machine-Learning-with-Scikit-Learn-and-PyTorch/blob/main/Chapter10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch Fundamentals

The core data structure of PyTorch is the **tensor**. It's a multidimensional array with a shape and a data type, used for numerical computations.

At first glance, tensors look a lot like NumPy arrays, and in many respects they are, but they have two major advantages:

1. They can live on a **GPU or other hardware accelerators**
2. They support **automatic differentiation (autograd)**

Every neural network we build in PyTorch will take tensors as input and output tensors, much like Scikit-Learn models work with NumPy arrays.

Let's start by learning how to create and manipulate tensors.


## PyTorch Tensors

First, let's import PyTorch.


In [None]:
import torch


You can create a PyTorch tensor much like a NumPy array. For example, here's a 2 × 3 tensor:


In [None]:
X = torch.tensor([[1.0, 4.0, 7.0],
                  [2.0, 3.0, 6.0]])
X


A tensor has a **shape** and a **data type**:


In [None]:
X.shape, X.dtype


Indexing works just like NumPy:


In [None]:
X[0, 1]


In [None]:
X[:, 1]


PyTorch supports a wide range of mathematical operations, with an API very similar to NumPy.


In [None]:
10 * (X + 1.0)


In [None]:
X.exp()


In [None]:
X.mean()


In [None]:
X.max(dim=0)


In [None]:
X @ X.T


> **Note**  
PyTorch prefers the argument name `dim`, but it also supports `axis` like NumPy.


## Converting Between NumPy and PyTorch


In [None]:
import numpy as np

X.numpy()


In [None]:
torch.tensor(np.array([[1., 4., 7.],
                       [2., 3., 6.]]))


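PyTorch defaults to **32-bit floats**, while NumPy defaults to **64-bit floats**, so the tensor created above from a NumPy array has dtype `torch.float64`.

For deep learning, 32-bit precision is usually preferred because it:
- Uses less memory
- Runs faster
- Is precise enough for neural networks

To get a 32-bit tensor from a NumPy array, you can pass `dtype=torch.float32` to `torch.tensor()` or use `torch.FloatTensor`: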


In [None]:
torch.FloatTensor(np.array([[1., 4., 7.],
                            [2., 3., 6.]]))


> **Tip**  
`torch.from_numpy()` avoids copying data, but changes to one will affect the other.
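
To make the difference between copying and sharing concrete, here is a minimal sketch. The variable names (`arr`, `copied`, `shared`) are illustrative and not part of the original notebook.

```python
import numpy as np
import torch

arr = np.array([[1., 4., 7.],
                [2., 3., 6.]])      # NumPy default dtype: float64

copied = torch.tensor(arr)          # copies the data (and keeps float64)
shared = torch.from_numpy(arr)      # shares memory with `arr`

shared[0, 0] = 100.0                # in-place change made through the tensor

print(arr[0, 0])                    # the NumPy array sees the change
print(copied[0, 0])                 # the independent copy does not
print(copied.dtype)                 # torch.float64
```

If you need both a 32-bit tensor and an independent copy, `torch.tensor(arr, dtype=torch.float32)` does both in one call.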


In [None]:
X[:, 1] = -99
X


In [None]:
X.relu_()
X


In-place operations end with an underscore (`_`), such as `relu_()` or `sqrt_()`.

They save memory, but must be used carefully when working with autograd.


## Hardware Acceleration


In [None]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

device


In [None]:
M = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
M = M.to(device)
M.device


In [None]:
M = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]], device=device)


In [None]:
R = M @ M.T
R


Let's compare CPU vs GPU performance for matrix multiplication.


In [None]:
M = torch.rand((1000, 1000))
%timeit M @ M.T


In [None]:
M = torch.rand((1000, 1000), device="cuda")
%timeit M @ M.T


GPUs shine when operations are large and parallelizable. For very small tensors, the CPU can actually be faster.


## Autograd (Automatic Differentiation)


In [None]:
x = torch.tensor(5.0, requires_grad=True)
f = x ** 2
f.backward()
x.grad


PyTorch dynamically builds a computation graph during the forward pass and uses it to compute gradients during the backward pass.


In [None]:
learning_rate = 0.1

with torch.no_grad():
    x -= learning_rate * x.grad


In [None]:
x.grad.zero_()


## Full Gradient Descent Loop


In [None]:
learning_rate = 0.1
x = torch.tensor(5.0, requires_grad=True)

for iteration in range(100):
    f = x ** 2
    f.backward()

    with torch.no_grad():
        x -= learning_rate * x.grad

    x.grad.zero_()

x
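
As a sanity check on the loop above, the result can be worked out by hand. Since $f(x) = x^2$ has gradient $f'(x) = 2x$, each update with learning rate $\eta = 0.1$ multiplies $x$ by a constant factor:

$$
x_{t+1} = x_t - \eta \cdot 2 x_t = (1 - 2\eta)\, x_t = 0.8\, x_t
$$

Starting from $x_0 = 5$, after 100 iterations we expect

$$
x_{100} = 5 \times 0.8^{100} \approx 1.0 \times 10^{-9},
$$

so the final `x` should be very close to 0, the minimum of $f$.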


Be careful with **in-place operations** when using autograd. Some operations store their outputs or inputs for the backward pass, and modifying them in place can break gradient computation.
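
Here is a small sketch of how this can go wrong. It assumes that `exp()` saves its output for the backward pass, which is why modifying that output in place triggers an error. The variable names are illustrative.

```python
import torch

t = torch.tensor(2.0, requires_grad=True)
z = t.exp()   # exp() keeps its output to compute the gradient later
z += 1        # in-place update overwrites that saved output

try:
    z.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)   # autograd detects the in-place modification

print(error_message)

# Out-of-place version: a new tensor is created, so the saved output is intact
t2 = torch.tensor(2.0, requires_grad=True)
z2 = t2.exp() + 1
z2.backward()
print(t2.grad)   # d/dt (exp(t) + 1) = exp(t)
```

When in doubt, prefer the out-of-place form (`z = z + 1`) for tensors that are part of a computation graph.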


### Summary

You now know how to:
- Create and manipulate PyTorch tensors
- Move tensors between CPU and GPU
- Use autograd to compute gradients
- Implement gradient descent manually

Next up: **building and training models with PyTorch** 🚀


## Implementing Linear Regression

We will start by implementing linear regression using tensors and autograd directly.
Then we will simplify the code using PyTorch's high-level API and add GPU support.


In [None]:
# Assumes the California housing dataset is already loaded and split:
# X_train, X_valid, X_test
# y_train, y_valid, y_test
import torch


### Converting the Data to Tensors and Normalizing

We convert the NumPy arrays to PyTorch tensors and normalize the input features
using tensor operations instead of a StandardScaler.


In [None]:
X_train = torch.FloatTensor(X_train)
X_valid = torch.FloatTensor(X_valid)
X_test = torch.FloatTensor(X_test)

means = X_train.mean(dim=0, keepdims=True)
stds = X_train.std(dim=0, keepdims=True)

X_train = (X_train - means) / stds
X_valid = (X_valid - means) / stds
X_test = (X_test - means) / stds


### Preparing the Target Vectors

Our predictions will be column vectors, so we reshape the targets
to have shape (n_samples, 1).


In [None]:
y_train = torch.FloatTensor(y_train).reshape(-1, 1)
y_valid = torch.FloatTensor(y_valid).reshape(-1, 1)
y_test = torch.FloatTensor(y_test).reshape(-1, 1)


### Initializing Model Parameters

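We initialize the weights randomly and the bias to zero.
Random initialization is not strictly needed for linear regression, but it becomes essential for neural networks, where it breaks the symmetry between neurons.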


In [None]:
torch.manual_seed(42)

n_features = X_train.shape[1]
w = torch.randn((n_features, 1), requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)


### Training with Batch Gradient Descent and Autograd

We use mean squared error as the loss function and perform batch
gradient descent using automatic differentiation.


In [None]:
learning_rate = 0.4
n_epochs = 20

for epoch in range(n_epochs):
    y_pred = X_train @ w + b
    loss = ((y_pred - y_train) ** 2).mean()

    loss.backward()

    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
        w.grad.zero_()
        b.grad.zero_()

    print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item()}")


### Making Predictions with the Trained Model

During inference, we disable gradient tracking using `torch.no_grad()`.


In [None]:
X_new = X_test[:3]

with torch.no_grad():
    y_pred = X_new @ w + b

y_pred


## Linear Regression Using PyTorch's High-Level API

PyTorch provides the `nn.Linear` module, which greatly simplifies
model definition and training.


In [None]:
import torch.nn as nn


### Defining the Model

We create a linear regression model with one output neuron.


In [None]:
torch.manual_seed(42)

model = nn.Linear(in_features=n_features, out_features=1)


### Inspecting Model Parameters

The model automatically creates and initializes weights and bias terms.


In [None]:
model.weight, model.bias


### Defining the Optimizer and Loss Function

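We use PyTorch's `SGD` optimizer and the mean squared error loss. Since each step uses gradients computed on the full training set, this is still batch gradient descent.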


In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()


### Training Loop Using the High-Level API

The optimizer's `step()` method updates the parameters, and its `zero_grad()` method resets the gradients, so we no longer do either by hand.


In [None]:
def train_bgd(model, optimizer, criterion, X_train, y_train, n_epochs):
    for epoch in range(n_epochs):
        y_pred = model(X_train)
        loss = criterion(y_pred, y_train)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item()}")


### Training the Model


In [None]:
train_bgd(model, optimizer, criterion, X_train, y_train, n_epochs)


### Making Predictions with the High-Level Model


In [None]:
X_new = X_test[:3]

with torch.no_grad():
    y_pred = model(X_new)

y_pred


## Implementing a Regression MLP

PyTorch provides the `nn.Sequential` module, which chains multiple modules together.
When called, the input is passed through each module in sequence.
This makes it ideal for building multilayer perceptrons (MLPs).


### Defining the MLP Architecture

We build an MLP with:
- Two hidden layers
- ReLU activation functions
- A single output neuron for regression


In [None]:
import torch
import torch.nn as nn

torch.manual_seed(42)

model = nn.Sequential(
    nn.Linear(n_features, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
)


### Understanding the Model Structure

- The first linear layer takes `n_features` inputs and outputs 50 features.
- `ReLU` applies a non-linear activation function element-wise.
- The second linear layer maps 50 inputs to 40 outputs.
- Another `ReLU` introduces non-linearity.
- The final linear layer outputs a single value, matching the regression target.
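
To get a feel for the size of this network, we can count its parameters by hand. The sketch below is plain Python and assumes `n_features = 8` (the number of input features in the California housing dataset); `linear_param_count` is a hypothetical helper, not a PyTorch function.

```python
def linear_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

n_features = 8  # assumption: California housing has 8 input features

layer_sizes = [(n_features, 50), (50, 40), (40, 1)]
counts = [linear_param_count(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(counts)

print(counts)  # [450, 2040, 41]
print(total)   # 2531
```

On the actual model, `sum(p.numel() for p in model.parameters())` should report the same total. The `ReLU` layers contribute no parameters.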


### Setting Up the Optimizer and Loss Function

We use:
- Stochastic Gradient Descent (SGD) as the optimizer
- Mean Squared Error (MSE) as the loss function


In [None]:
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mse = nn.MSELoss()


### Training the MLP Using Batch Gradient Descent

We reuse the `train_bgd` function defined earlier.


In [None]:
train_bgd(model, optimizer, mse, X_train, y_train, n_epochs)


### Observing Training Progress

You should see the loss decrease over epochs, for example:

Epoch 1/20, Loss: 5.045  
Epoch 2/20, Loss: 2.052  
...  
Epoch 20/20, Loss: 0.565  

This confirms that the neural network is learning.


### Summary

You have successfully trained a regression MLP using PyTorch.
The model can now capture nonlinear relationships in the data.
However, we are still using batch gradient descent, which does not scale well.


## Implementing Mini-Batch Gradient Descent Using DataLoaders

To efficiently implement mini-batch gradient descent, PyTorch provides the
`DataLoader` class in `torch.utils.data`. It loads data in batches, optionally
shuffles it at each epoch, and can parallelize data loading.


### Preparing the Dataset

The `DataLoader` expects a dataset object implementing:
- `__len__()` → number of samples
- `__getitem__(index)` → returns one sample and its target

PyTorch provides `TensorDataset` to easily wrap tensors.
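
To see why these two methods are enough, here is a simplified pure-Python sketch of what a data loader does with them. `ToyDataset` and `iterate_batches` are hypothetical stand-ins for illustration, not PyTorch's implementation (the real `DataLoader` also collates samples into tensors and can load in parallel).

```python
import random

class ToyDataset:
    """Hypothetical minimal dataset: anything with __len__ and __getitem__."""
    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, index):
        return self.features[index], self.targets[index]

def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Simplified stand-in for DataLoader: yields lists of (sample, target)."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

toy = ToyDataset(features=[[i, i + 1] for i in range(10)], targets=list(range(10)))
batches = list(iterate_batches(toy, batch_size=4, shuffle=True, seed=42))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Note that the last batch is smaller when the dataset size is not a multiple of the batch size. This matters later when averaging per-batch metrics.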


In [None]:
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)


### Using Hardware Acceleration (GPU)

To leverage a GPU, we must:
1. Move the model to the GPU
2. Move each mini-batch to the GPU during training


In [None]:
torch.manual_seed(42)

model = nn.Sequential(
    nn.Linear(n_features, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
)

model = model.to(device)


### Creating the Optimizer and Loss Function

⚠️ Important: create the optimizer **after** moving the model to the GPU,
since optimizers may store internal state on the same device.


In [None]:
learning_rate = 0.02

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mse = nn.MSELoss()


### Training Function with Mini-Batch Gradient Descent

This function:
- Iterates over epochs
- Processes one mini-batch at a time
- Accumulates the mean loss per epoch
- Uses `model.train()` to enable training mode


In [None]:
def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0

        for X_batch, y_batch in train_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)

            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        mean_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {mean_loss:.4f}")


### Training the Model

We now train the model using mini-batch gradient descent on the GPU.


In [None]:
train(model, optimizer, mse, train_loader, n_epochs)


### Observing Training Results

Typical output:

Epoch 1/20, Loss: 0.6958  
Epoch 2/20, Loss: 0.4480  
...  
Epoch 20/20, Loss: 0.3227  

The loss is significantly lower than with batch gradient descent.


### Performance Optimization Tips

1. **Pinned Memory (CUDA only)**  
   Use `pin_memory=True` in the DataLoader and `non_blocking=True` when calling
   `.to(device)` to speed up CPU → GPU transfers.

2. **Parallel Data Loading**  
   Use `num_workers > 0` to load batches in parallel.
   Tune `prefetch_factor` and consider `persistent_workers=True`.
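
As a rough sketch, these options could be combined as follows. The dataset is random stand-in data, and the specific values (`num_workers=2`, `prefetch_factor=2`) are placeholders to tune for your machine, not recommendations.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Toy stand-in data (hypothetical shapes: 256 samples, 8 features)
dataset = TensorDataset(torch.rand(256, 8), torch.rand(256, 1))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    pin_memory=use_cuda,       # pinned (page-locked) memory only helps with CUDA
    num_workers=2,             # load batches in 2 background processes
    prefetch_factor=2,         # batches each worker prepares in advance
    persistent_workers=True,   # keep workers alive between epochs
)

def to_device(batch):
    # non_blocking=True lets the copy overlap with computation when the
    # source tensor lives in pinned memory
    return [t.to(device, non_blocking=use_cuda) for t in batch]
```

Inside the training loop you would then call `X_batch, y_batch = to_device(batch)`. On platforms that start worker processes with "spawn", scripts using `num_workers > 0` typically need an `if __name__ == "__main__":` guard.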


### Summary

You can now:
- Train neural networks using mini-batch gradient descent
- Use GPUs efficiently with PyTorch
- Scale training to larger datasets and models

Next, we will focus on **model evaluation and validation**.


## Model Evaluation

Let's write a function to evaluate the model. It takes the model, a `DataLoader` for the dataset to evaluate on, a function that computes the metric for a given batch, and a function that aggregates the batch metrics (by default, it just computes the mean).


In [None]:
import torch

def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics = []
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))


Now let's build a `TensorDataset` and a `DataLoader` for our validation set, and pass it to our `evaluate()` function to compute the validation MSE.


In [None]:
from torch.utils.data import TensorDataset, DataLoader

valid_dataset = TensorDataset(X_valid, y_valid)
valid_loader = DataLoader(valid_dataset, batch_size=32)

valid_mse = evaluate(model, valid_loader, mse)
valid_mse


It works fine. But now suppose we want to use the RMSE instead of the MSE (as we saw in Chapter 2, it can be easier to interpret).

PyTorch does not have a built-in function for RMSE, but it's easy enough to write.


In [None]:
def rmse(y_pred, y_true):
    return ((y_pred - y_true) ** 2).mean().sqrt()

evaluate(model, valid_loader, rmse)


But wait a second! The RMSE should be equal to the square root of the MSE. However, when we compute the square root of the MSE that we found earlier, we get a different result.


In [None]:
valid_mse.sqrt()


The reason is that instead of calculating the RMSE over the whole validation set, we computed it over each batch and then averaged the batch RMSEs. This is not mathematically equivalent.
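
The gap is easy to reproduce on toy numbers. The squared errors below are made up for illustration; the point is only that the mean of square roots differs from the square root of the mean.

```python
import math

# Hypothetical squared errors for two equally sized batches
batch_sq_errors = [[1.0, 1.0], [9.0, 9.0]]

batch_mses = [sum(b) / len(b) for b in batch_sq_errors]   # [1.0, 9.0]

mean_of_batch_rmses = sum(math.sqrt(m) for m in batch_mses) / len(batch_mses)
rmse_of_mean_mse = math.sqrt(sum(batch_mses) / len(batch_mses))

print(mean_of_batch_rmses)  # 2.0
print(rmse_of_mean_mse)     # sqrt(5), about 2.236
```

Averaging the batch RMSEs underestimates the true RMSE here. Also keep in mind that even the mean of batch MSEs is only exact when all batches have the same size.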

To solve this, we can use the MSE as our `metric_fn`, and use the `aggregate_fn` to compute the square root of the mean MSE.


In [None]:
evaluate(
    model,
    valid_loader,
    mse,
    aggregate_fn=lambda metrics: torch.sqrt(torch.mean(metrics))
)


That's much better!

Rather than implementing metrics yourself, you may prefer to use the **TorchMetrics** library (made by the same team as PyTorch Lightning), which provides many well-tested streaming metrics.

A streaming metric is an object that keeps track of a given metric and can be updated one batch at a time.
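
To illustrate the idea, here is a simplified pure-Python streaming RMSE. `StreamingRMSE` is a hypothetical class written for this explanation; it mirrors the `update()` / `compute()` / `reset()` pattern but is not TorchMetrics' implementation.

```python
import math

class StreamingRMSE:
    """Simplified illustration of a streaming metric (not TorchMetrics' code)."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.sum_squared_error = 0.0
        self.n_samples = 0

    def update(self, y_pred, y_true):
        # Accumulate raw sums so that batches of any size are weighted correctly
        self.sum_squared_error += sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
        self.n_samples += len(y_true)

    def compute(self):
        return math.sqrt(self.sum_squared_error / self.n_samples)

metric = StreamingRMSE()
metric.update([1.0, 2.0], [2.0, 3.0])   # squared errors: 1, 1
metric.update([0.0], [3.0])             # squared error: 9 (smaller last batch)
print(metric.compute())                 # sqrt(11 / 3)
```

Because the metric stores running sums rather than per-batch averages, the final value matches what you would get from the whole dataset at once.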

TorchMetrics is not preinstalled on Colab, so we need to install it first.


In [None]:
%pip install torchmetrics


Now we can implement an evaluation function using TorchMetrics.


In [None]:
import torchmetrics

def evaluate_tm(model, data_loader, metric):
    model.eval()
    metric.reset()
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric.update(y_pred, y_batch)
    return metric.compute()


Next, we create an RMSE streaming metric, move it to the GPU, and use it to evaluate the validation set.


In [None]:
rmse_metric = torchmetrics.MeanSquaredError(squared=False).to(device)

evaluate_tm(model, valid_loader, rmse_metric)


Sure enough, we get the correct result!

Next steps:
- Update the `train()` function to evaluate performance during training
- Measure metrics on the training set during each epoch
- Measure metrics on the validation set at the end of each epoch
- Plot learning curves to detect overfitting (using Matplotlib or TensorBoard)

Now you know how to build, train, and evaluate a regression MLP using PyTorch, and how to make predictions with a trained model.

So far, we've only worked with simple sequential models composed of linear layers and ReLU activations. To build more complex, nonsequential models, we'll need to create **custom PyTorch modules**.


## Building Nonsequential Models Using Custom Modules

One example of a nonsequential neural network is a **Wide & Deep neural network**. This architecture was introduced in a 2016 paper by Heng-Tze Cheng et al.

It connects all or part of the inputs directly to the output layer. This allows the model to learn:
- **Deep patterns** via the deep path
- **Simple rules** via the wide (shortcut) path

The short path can also include manually engineered features. In contrast, a regular MLP forces all data through the full stack of layers, which may distort simple patterns.


### Wide & Deep Architecture

We will build a Wide & Deep neural network for the California housing dataset. Since this architecture is nonsequential, we must create a **custom PyTorch module**.


In [None]:
import torch
import torch.nn as nn


In [None]:
class WideAndDeep(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features, 50),
            nn.ReLU(),
            nn.Linear(50, 40),
            nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + n_features, 1)

    def forward(self, X):
        deep_output = self.deep_stack(X)
        wide_and_deep = torch.concat([X, deep_output], dim=1)
        return self.output_layer(wide_and_deep)


### Explanation

- We use `nn.Sequential` to build the deep part of the model.
- The output layer receives the concatenation of:
  - the original inputs (wide path)
  - the deep stack's output (deep path)
- Therefore, the output layer has `40 + n_features` inputs.


### Creating and Using the Model


In [None]:
torch.manual_seed(42)

model = WideAndDeep(n_features).to(device)
learning_rate = 0.002  # adjusted for the new architecture

# Train, evaluate, and use the model exactly like before


## Splitting Features Inside the Model

Sometimes we want only **part of the features** to go through the wide path and a (possibly overlapping) subset to go through the deep path.


In [None]:
class WideAndDeepV2(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50),
            nn.ReLU(),
            nn.Linear(50, 40),
            nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1)

    def forward(self, X):
        X_wide = X[:, :5]
        X_deep = X[:, 2:]
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)


## Building Models with Multiple Inputs

Some models require multiple inputs that cannot be combined into a single tensor (e.g., images + text).

To support this, we modify the `forward()` method to accept multiple tensors.


In [None]:
class WideAndDeepV3(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50),
            nn.ReLU(),
            nn.Linear(50, 40),
            nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1)

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)


### Preparing Data for Multiple Inputs


In [None]:
from torch.utils.data import TensorDataset, DataLoader

train_data_wd = TensorDataset(
    X_train[:, :5],
    X_train[:, 2:],
    y_train
)

train_loader_wd = DataLoader(train_data_wd, batch_size=32, shuffle=True)


### Updating the Training / Evaluation Loop


In [None]:
for X_batch_wide, X_batch_deep, y_batch in train_loader_wd:
    X_batch_wide = X_batch_wide.to(device)
    X_batch_deep = X_batch_deep.to(device)
    y_batch = y_batch.to(device)

    y_pred = model(X_batch_wide, X_batch_deep)


### Flexible Input Handling with `*` Unpacking


In [None]:
for *X_batch_inputs, y_batch in train_loader_wd:
    X_batch_inputs = [X.to(device) for X in X_batch_inputs]
    y_batch = y_batch.to(device)

    y_pred = model(*X_batch_inputs)


In [None]:
class WideAndDeepDataset(torch.utils.data.Dataset):
    def __init__(self, X_wide, X_deep, y):
        self.X_wide = X_wide
        self.X_deep = X_deep
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        inputs = {
            "X_wide": self.X_wide[idx],
            "X_deep": self.X_deep[idx]
        }
        return inputs, self.y[idx]


In [None]:
train_data_named = WideAndDeepDataset(
    X_wide=X_train[:, :5],
    X_deep=X_train[:, 2:],
    y=y_train
)

train_loader_named = DataLoader(
    train_data_named,
    batch_size=32,
    shuffle=True
)


### Training with Named Inputs


In [None]:
for inputs, y_batch in train_loader_named:
    inputs = {name: X.to(device) for name, X in inputs.items()}
    y_batch = y_batch.to(device)

    y_pred = model(**inputs)


## Building Models with Multiple Outputs

Multiple outputs are useful for:
- Multitask learning
- Combining regression and classification
- Regularization via auxiliary outputs


In [None]:
class WideAndDeepV4(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50),
            nn.ReLU(),
            nn.Linear(50, 40),
            nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1)
        self.aux_output_layer = nn.Linear(40, 1)

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        main_output = self.output_layer(wide_and_deep)
        aux_output = self.aux_output_layer(deep_output)
        return main_output, aux_output


### Training with an Auxiliary Loss


In [None]:
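# Assumes `model` is a WideAndDeepV4 instance, and that `optimizer` and
# `criterion` (e.g., nn.MSELoss()) have been created for it
for inputs, y_batch in train_loader_named:
    inputs = {name: X.to(device) for name, X in inputs.items()}
    y_batch = y_batch.to(device)

    y_pred, y_pred_aux = model(**inputs)

    main_loss = criterion(y_pred, y_batch)
    aux_loss = criterion(y_pred_aux, y_batch)

    # Weighted sum: the main output dominates, the auxiliary loss regularizes
    loss = 0.8 * main_loss + 0.2 * aux_loss

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()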


### Evaluation (Ignoring Auxiliary Output)


In [None]:
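# Assumes `model` is a WideAndDeepV4; for a real evaluation, iterate over a
# validation loader built the same way as train_loader_named
model.eval()
with torch.no_grad():
    for inputs, y_batch in train_loader_named:
        inputs = {name: X.to(device) for name, X in inputs.items()}
        y_batch = y_batch.to(device)

        y_pred, _ = model(**inputs)  # discard the auxiliary output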


## Summary

You now know how to:
- Build **sequential and nonsequential** models
- Handle **multiple inputs**
- Handle **multiple outputs**
- Use **auxiliary losses for regularization**

Next up: **classification models** 🚀


## Building an Image Classifier with PyTorch

As in Chapter 9, we will tackle the **Fashion MNIST** dataset. This time, instead of using `fetch_openml()`, we will load the dataset using the **TorchVision** library.


## Using TorchVision to Load the Dataset

TorchVision is an important part of the PyTorch ecosystem. It provides:
- Utility functions to download common datasets (MNIST, Fashion MNIST, etc.)
- Pretrained models
- Image transformation utilities (crop, resize, rotate, etc.)

TorchVision is preinstalled on Colab.


In [None]:
import torch
import torchvision
import torchvision.transforms.v2 as T
from torch.utils.data import DataLoader


### Defining the Image Transform

By default, FashionMNIST images are loaded as PIL images with pixel values from 0 to 255.
We need:
- PyTorch tensors
- `float32` values
- Pixel values scaled to `[0.0, 1.0]`


In [None]:
toTensor = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True)
])


### Loading the Dataset

The dataset is already split into:
- 60,000 training images
- 10,000 test images

We will further split the training set into:
- 55,000 training images
- 5,000 validation images


In [None]:
train_and_valid_data = torchvision.datasets.FashionMNIST(
    root="datasets",
    train=True,
    download=True,
    transform=toTensor
)

test_data = torchvision.datasets.FashionMNIST(
    root="datasets",
    train=False,
    download=True,
    transform=toTensor
)

torch.manual_seed(42)
train_data, valid_data = torch.utils.data.random_split(
    train_and_valid_data, [55_000, 5_000]
)


### Creating DataLoaders


In [None]:
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=32)
test_loader = DataLoader(test_data, batch_size=32)


## Inspecting the Data


In [None]:
X_sample, y_sample = train_data[0]

X_sample.shape, X_sample.dtype


Each image has shape:

- `[1, 28, 28]`
  - 1 channel (grayscale)
  - 28 × 28 pixels

PyTorch expects the **channel dimension first**, unlike many other libraries.


### Inspecting the Label


In [None]:
train_and_valid_data.classes[y_sample]


## Building the Classifier

We will build a **classification MLP** with:
- Two hidden layers
- ReLU activations
- A linear output layer with 10 outputs (one per class)


In [None]:
import torch.nn as nn


In [None]:
class ImageClassifier(nn.Module):
    def __init__(self, n_inputs, n_hidden1, n_hidden2, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_inputs, n_hidden1),
            nn.ReLU(),
            nn.Linear(n_hidden1, n_hidden2),
            nn.ReLU(),
            nn.Linear(n_hidden2, n_classes)
        )

    def forward(self, X):
        return self.mlp(X)


### Creating the Model and Loss Function


In [None]:
torch.manual_seed(42)

model = ImageClassifier(
    n_inputs=28 * 28,
    n_hidden1=300,
    n_hidden2=100,
    n_classes=10
).to(device)  # same device as the batches and the accuracy metric

xentropy = nn.CrossEntropyLoss()


### Key Notes

- `nn.Flatten()` reshapes images from `[batch, 1, 28, 28]` to `[batch, 784]`
- No activation function is used after the output layer
- `nn.CrossEntropyLoss` expects **raw logits**, not probabilities
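
To see what "expects raw logits" means, here is a pure-Python sketch of the computation for a single sample. The helper names and the example logits are made up for illustration; `nn.CrossEntropyLoss` performs an equivalent log-softmax step internally, which is why applying softmax yourself beforehand would normalize twice.

```python
import math

def softmax(logits):
    m = max(logits)                                # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """-log(softmax(logits)[target]), computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 1.0, 0.1]   # hypothetical raw scores for 3 classes
target = 0

loss = cross_entropy_from_logits(logits, target)
same_loss = -math.log(softmax(logits)[target])
print(loss, same_loss)     # the two formulations agree
```

Working from logits directly avoids computing tiny probabilities and then taking their logarithm, which is numerically fragile.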


## Training and Evaluation

We can train the model using the same `train()` function as before.
For evaluation, we use the **Accuracy** streaming metric from TorchMetrics.


In [None]:
import torchmetrics

accuracy = torchmetrics.Accuracy(
    task="multiclass",
    num_classes=10
).to(device)


⚠️ Training will take a few minutes on GPU (much longer on CPU).

Typical results:
- Training accuracy ≈ **92.8%**
- Validation accuracy ≈ **87.2%**

This indicates slight overfitting.


## Making Predictions


In [None]:
model.eval()

X_new, y_new = next(iter(valid_loader))
X_new = X_new[:3].to(device)

with torch.no_grad():
    y_pred_logits = model(X_new)

y_pred = y_pred_logits.argmax(dim=1)
y_pred


### Predicted Class Names


In [None]:
[train_and_valid_data.classes[index] for index in y_pred]


## Computing Class Probabilities with Softmax


In [None]:
import torch.nn.functional as F

y_proba = F.softmax(y_pred_logits, dim=1)
y_proba.round(decimals=3)


## Top-K Predictions


In [None]:
y_top4_logits, y_top4_indices = torch.topk(
    y_pred_logits, k=4, dim=1
)

y_top4_probas = F.softmax(y_top4_logits, dim=1)
y_top4_probas.round(decimals=3), y_top4_indices


### Interpretation

For each image:
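- The model's prediction is the class with the highest logit
- Top-K predictions show alternative plausible classes and confidence levels
- Because the softmax is applied to the top-4 logits only, these probabilities are renormalized over those four classes and sum to 1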
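
One subtlety in the top-K cell above: the softmax is computed over the four largest logits only, so the resulting values are relative to those four classes rather than to all ten. The pure-Python sketch below, using made-up logits for six classes, shows the effect.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.5, 0.2, -1.0, -2.0]   # hypothetical logits for 6 classes

full_probas = softmax(logits)
top4_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:4]
top4_renormalized = softmax([logits[i] for i in top4_indices])

print(full_probas[top4_indices[0]])   # probability over all classes
print(top4_renormalized[0])           # larger: renormalized over 4 classes only
```

If you want the true probabilities of the top-K classes, apply the softmax to all logits first and then take the top K of the result.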


## Handling Class Imbalance

Fashion MNIST is balanced, but for imbalanced datasets, you should weight classes using the `weight` argument of `nn.CrossEntropyLoss`.
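
A minimal sketch of this, using made-up imbalanced labels. The inverse-frequency formula is one common heuristic chosen here as an assumption; other weighting schemes are equally valid.

```python
import torch
import torch.nn as nn

# Hypothetical imbalanced labels: class 0 is 8x more frequent than class 2
labels = torch.tensor([0] * 80 + [1] * 10 + [2] * 10)
n_classes = 3

class_counts = torch.bincount(labels, minlength=n_classes).float()
# Inverse-frequency heuristic (an assumption, not the only option)
class_weights = len(labels) / (n_classes * class_counts)

weighted_xentropy = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.zeros(len(labels), n_classes)   # dummy logits for illustration
loss = weighted_xentropy(logits, labels)
print(class_weights, loss)
```

Rare classes get larger weights, so their errors count more in the loss. When training on a GPU, the weight tensor must be on the same device as the logits.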


## Summary

You can now:
- Load image datasets using TorchVision
- Build image classifiers in PyTorch
- Train and evaluate multiclass models
- Compute probabilities and top-K predictions

Next up: **hyperparameter tuning** 🚀


# Fine-Tuning Neural Network Hyperparameters with Optuna

So far, we have manually chosen reasonable values for our model's hyperparameters. However, manual tuning can be slow and suboptimal. A better approach is to use an automated hyperparameter optimization library.

Popular libraries for this include:
- **Optuna**
- Ray Tune
- Hyperopt

In this section, we will use **Optuna**, a powerful and flexible hyperparameter optimization framework.

Optuna is not preinstalled on Google Colab, so we must install it first.


In [None]:
%pip install optuna


## Defining the Objective Function

Optuna works by repeatedly calling an **objective function**. This function:
1. Receives a `Trial` object
2. Uses it to sample hyperparameter values
3. Builds and trains a model using those values
4. Evaluates the model on the validation set
5. Returns a metric (higher is better in our case)

We will tune:
- The **learning rate**
- The **number of neurons** in the hidden layers (same size for both layers)


In [None]:
import optuna
import torch
import torch.nn as nn

def objective(trial):
    # Sample hyperparameters
    learning_rate = trial.suggest_float(
        "learning_rate", 1e-5, 1e-1, log=True
    )
    n_hidden = trial.suggest_int("n_hidden", 20, 300)

    # Build model
    model = ImageClassifier(
        n_inputs=28 * 28,
        n_hidden1=n_hidden,
        n_hidden2=n_hidden,
        n_classes=10
    ).to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    loss_fn = nn.CrossEntropyLoss()

    # Training loop (simplified)
    n_epochs = 5
    for epoch in range(n_epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            logits = model(X_batch)
            loss = loss_fn(logits, y_batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Validation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in valid_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            logits = model(X_batch)
            preds = logits.argmax(dim=1)
            correct += (preds == y_batch).sum().item()
            total += y_batch.size(0)

    validation_accuracy = correct / total
    return validation_accuracy


## Running the Hyperparameter Search

To start optimization, we create a **Study** object. Since we want to maximize validation accuracy, we set `direction="maximize"`.

We also:
- Fix PyTorch's random seed
- Use Optuna's **TPE sampler** for smarter search


In [None]:
torch.manual_seed(42)

sampler = optuna.samplers.TPESampler(seed=42)

study = optuna.create_study(
    direction="maximize",
    sampler=sampler
)

study.optimize(objective, n_trials=5)


Optuna uses the **Tree-structured Parzen Estimator (TPE)** algorithm.

This is a sequential, model-based optimization strategy:
- Early trials are mostly random
- Later trials focus on promising regions of the search space

This usually finds better hyperparameters than random search in the same amount of time.


In [None]:
study.best_params, study.best_value


The output shows:
- The best learning rate
- The best number of hidden neurons
- The corresponding validation accuracy

Increasing `n_trials` (e.g., to 50 or more) will usually improve results, but at the cost of much longer runtimes.


## Passing Data Explicitly to the Objective Function

Rather than relying on global variables, it is cleaner to pass the data loaders explicitly.

One way is to use a `lambda` function.


In [None]:
# Assumes the signature objective(trial, train_loader, valid_loader)
objective_with_data = lambda trial: objective(trial, train_loader, valid_loader)

study.optimize(objective_with_data, n_trials=5)


Another cleaner option is to use `functools.partial`, which creates a wrapped function with fixed arguments.


In [None]:
from functools import partial

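# Assumes the signature objective(trial, train_loader, valid_loader)
objective_with_data = partial(objective, train_loader=train_loader,
                              valid_loader=valid_loader)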
study.optimize(objective_with_data, n_trials=5)
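
The two wrapping styles behave slightly differently, which the following self-contained sketch tries to show. The `toy_objective` function and the `train_data`/`valid_data` lists are hypothetical stand-ins (an integer plays the role of the trial), so nothing here depends on Optuna or on the real data loaders.

```python
from functools import partial


def toy_objective(trial, train_data, valid_data):
    # Hypothetical stand-in: a real objective would train and evaluate a model
    return trial * len(train_data) + len(valid_data)


train_data = [1, 2, 3]
valid_data = [4, 5]

via_lambda = lambda trial: toy_objective(trial, train_data, valid_data)
via_partial = partial(toy_objective, train_data=train_data, valid_data=valid_data)

# Both wrappers take a single argument, which is what study.optimize() expects
print(via_lambda(10), via_partial(10))  # 32 32

# Rebind the global name after the wrappers were created
train_data = []
print(via_lambda(10))   # 2: the lambda looks up train_data at call time
print(via_partial(10))  # 32: partial kept a reference to the original list
```

Because `partial` binds its arguments when the wrapper is created, it is less exposed to later changes of global names, which is one reason to prefer it in a notebook where cells can be rerun in any order.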


## Pruning Bad Trials Early

Some hyperparameter combinations are obviously bad:
- Loss explodes early
- Accuracy barely improves

To avoid wasting compute, Optuna supports **trial pruning**.

We will use the `MedianPruner`, which:
- Compares each trial’s intermediate result to the median of earlier trials at the same step
- Stops trials whose best intermediate result so far is worse than that median


In [None]:
pruner = optuna.pruners.MedianPruner(
    n_startup_trials=5,
    n_warmup_steps=0,
    interval_steps=1
)

study = optuna.create_study(
    direction="maximize",
    sampler=sampler,
    pruner=pruner
)


Inside the objective function, we must report progress after each epoch and allow pruning.


In [None]:
for epoch in range(n_epochs):
    # Train for one epoch
    ...

    # Evaluate on validation set
    validation_accuracy = ...

    trial.report(validation_accuracy, epoch)

    if trial.should_prune():
        raise optuna.TrialPruned()


## Final Notes

Once you find good hyperparameters:
1. Retrain the model on the **full training set**
2. Evaluate on the **test set**
3. Save the trained model
4. Load it later for inference or production use

At this point, you have full control over:
- Model architecture
- Training loop
- Automated hyperparameter optimization

🔥 Your PyTorch skills are officially getting serious.


# Saving and Loading PyTorch Models

The simplest way to save a PyTorch model is to use `torch.save()`, passing the model and a file path.

PyTorch uses Python’s `pickle` module internally to serialize the object, then stores the result in a zip-based archive. By convention, PyTorch model files use the `.pt` or `.pth` extension.


In [None]:
torch.save(model, "my_fashion_mnist.pt")


## Loading the Entire Model

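Loading the model back is just as simple. Setting `weights_only=False` tells PyTorch to unpickle the entire model object, not just its parameters. Since PyTorch 2.6, `weights_only=True` is the default, so this argument must be passed explicitly. The model’s class must also be importable when loading.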


In [None]:
loaded_model = torch.load("my_fashion_mnist.pt", weights_only=False)


Before using the loaded model for inference, remember to switch it to evaluation mode.


In [None]:
loaded_model.eval()
y_pred_logits = loaded_model(X_new)


## ⚠️ Important Warnings About Saving Full Models

Saving the entire model using `torch.save(model, ...)` has **serious drawbacks**:

1. **Security risk**  
   The `pickle` format can execute arbitrary code during loading. Never load a model file from an untrusted source.

2. **Fragility**  
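   Pickle stores a reference to the model’s class rather than its code, so loading depends on:
   - The Python and PyTorch versions
   - The module (import) path of the model class
   - The exact code structure of that class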

   Even small changes can break loading.

Because of these issues, **saving only the model weights is strongly recommended**.
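
The security risk is easy to demonstrate with the standard library alone. In the sketch below, the `NotReallyAModel` class is a made-up example whose `__reduce__` method tells `pickle` to call `eval()` at load time; the harmless expression `40 + 2` stands in for whatever code an attacker might choose.

```python
import pickle


class NotReallyAModel:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('40 + 2')".
        # Any importable callable and arguments could be placed here.
        return (eval, ("40 + 2",))


payload = pickle.dumps(NotReallyAModel())

# Unpickling runs the callable; no NotReallyAModel instance is created
result = pickle.loads(payload)
print(result)  # 42
```

Loading returned the result of the call instead of an object, which shows that unpickling is really code execution driven by the file’s contents.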


## Saving Model Weights Only (Recommended)

Instead of saving the full model, we save its **state dictionary** using `state_dict()`.

This dictionary contains:
- All model parameters
- Any registered buffers (non-trainable tensors)

This approach is safer and more robust.


In [None]:
torch.save(model.state_dict(), "my_fashion_mnist_weights.pt")


## Loading Model Weights

To load the saved weights:
1. Recreate the model with the **exact same architecture**
2. Load the weights using `load_state_dict()`
3. Switch the model to evaluation mode


In [None]:
new_model = ImageClassifier(
    n_inputs=28 * 28,
    n_hidden1=300,
    n_hidden2=100,
    n_classes=10
)

loaded_weights = torch.load(
    "my_fashion_mnist_weights.pt",
    weights_only=True
)

new_model.load_state_dict(loaded_weights)
new_model.eval()


This approach is:
- ✅ Secure (data only, no executable code)
- ✅ More stable across Python versions
- ✅ Preferred for deployment

However, it requires knowing the exact model architecture ahead of time.
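
To see both the round trip and the architecture requirement in isolation, here is a minimal self-contained sketch. The `TinyClassifier` class and the file name are hypothetical stand-ins for `ImageClassifier` and the Fashion MNIST weights, kept small so the example runs instantly.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(42)
original = TinyClassifier(n_inputs=8, n_hidden=4, n_classes=3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_weights.pt")
    torch.save(original.state_dict(), path)

    # Rebuild the same architecture, then load the saved tensors into it
    restored = TinyClassifier(n_inputs=8, n_hidden=4, n_classes=3)
    restored.load_state_dict(torch.load(path, weights_only=True))

original.eval()
restored.eval()
x = torch.randn(5, 8)
with torch.no_grad():
    same_outputs = torch.equal(original(x), restored(x))
print(same_outputs)

# A model with a different hidden size rejects the same weights
mismatched = TinyClassifier(n_inputs=8, n_hidden=16, n_classes=3)
try:
    mismatched.load_state_dict(original.state_dict())
    shape_error = None
except RuntimeError as error:
    shape_error = error
print(type(shape_error).__name__)
```

The size mismatch raises a `RuntimeError`, which is why the hyperparameters used to build the model need to be recorded somewhere.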


## Saving Weights + Hyperparameters Together

To make model reconstruction easier, it’s a good idea to save:
- The model’s weights
- The model’s hyperparameters

We can store everything in a single dictionary.


In [None]:
model_data = {
    "model_state_dict": model.state_dict(),
    "model_hyperparameters": {
        "n_inputs": 28 * 28,
        "n_hidden1": 300,
        "n_hidden2": 100,
        "n_classes": 10
    }
}

torch.save(model_data, "my_fashion_mnist_model.pt")


## Loading a Model from Saved Metadata

We can now:
1. Load the dictionary
2. Rebuild the model using the saved hyperparameters
3. Load the state dictionary


In [None]:
loaded_data = torch.load(
    "my_fashion_mnist_model.pt",
    weights_only=True
)

new_model = ImageClassifier(
    **loaded_data["model_hyperparameters"]
)

new_model.load_state_dict(loaded_data["model_state_dict"])
new_model.eval()


## Resuming Training

If you want to continue training later, you should also save:
- The optimizer’s `state_dict()`
- Optimizer hyperparameters
- Current epoch number
- Training/validation loss history

This allows you to resume training exactly where you left off.
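
As a rough sketch of what such a checkpoint might look like, the example below trains a toy linear model for a few steps, saves everything to an in-memory buffer, and restores it into fresh objects. The dictionary keys, the `train_one_epoch` helper, and the toy data are all illustrative assumptions; a real checkpoint would be written to a file path instead of `io.BytesIO`.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(42)
toy_model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()
X, y = torch.randn(16, 4), torch.randn(16, 2)


def train_one_epoch(model, optimizer):
    # One full-batch gradient step on the toy data
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


history = [train_one_epoch(toy_model, optimizer) for _ in range(3)]

checkpoint = {
    "epoch": 3,
    "model_state_dict": toy_model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),  # includes momentum buffers
    "loss_history": history,
}
buffer = io.BytesIO()  # in-memory stand-in for a checkpoint file
torch.save(checkpoint, buffer)

# Later: rebuild the objects, then restore their state
buffer.seek(0)
loaded = torch.load(buffer, weights_only=True)
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1, momentum=0.9)
resumed_model.load_state_dict(loaded["model_state_dict"])
resumed_optimizer.load_state_dict(loaded["optimizer_state_dict"])
start_epoch = loaded["epoch"]

# One more step on each copy should leave them with matching weights
train_one_epoch(toy_model, optimizer)
train_one_epoch(resumed_model, resumed_optimizer)
print(start_epoch, torch.allclose(toy_model.weight, resumed_model.weight))
```

Restoring the optimizer state matters here because SGD with momentum keeps a running buffer per parameter; without it, the first step after resuming would differ from uninterrupted training.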


## Alternative: SafeTensors

The **SafeTensors** library (by Hugging Face) is another popular and secure way to store model weights. It avoids pickle entirely and is designed specifically for ML models.


## TorchScript (Preview)

Another way to save a PyTorch model is by converting it to **TorchScript**.

Benefits:
- Faster inference
- Language-agnostic deployment
- No Python dependency at runtime

We’ll explore TorchScript next.


# Compiling and Optimizing a PyTorch Model

PyTorch provides powerful tools to **compile and optimize models** for faster inference and easier deployment.

One major option is **TorchScript**, which converts your model into a statically-typed subset of Python.


## Why Use TorchScript?

TorchScript offers two main benefits:

1. **Performance optimizations**
   - Operator fusion
   - Constant folding (e.g., `2 * 3 → 6`)
   - Dead code elimination

2. **Deployment flexibility**
   - Models can be saved and run without Python
   - Can be executed in C++ using LibTorch
   - Useful for embedded and production environments


## Converting a Model to TorchScript: Tracing

Tracing runs the model once using example inputs and records all operations that occur.


In [None]:
torchscript_model = torch.jit.trace(model, X_new)


### Limitations of Tracing

Tracing works well for **static models**, but it has important limitations:

- `if` or `match` statements:
  - Only the executed branch is recorded
- Loops:
  - Only the observed number of iterations is captured

This makes tracing unsuitable for dynamic control flow.
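
The branch limitation can be reproduced with a tiny module. In this sketch, `BranchOnSign` is a made-up example whose output depends on the sign of the input’s sum; it is traced with a positive input and then called with a negative one.

```python
import warnings

import torch
import torch.nn as nn


class BranchOnSign(nn.Module):
    def forward(self, x):
        # Data-dependent branch: which path runs depends on the values in x
        if x.sum() > 0:
            return x * 2
        return x + 10


module = BranchOnSign()
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing warns that the branch condition was turned into a constant
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(module, positive)

eager_output = module(negative)   # takes the `x + 10` path
traced_output = traced(negative)  # replays the recorded `x * 2` path
print(eager_output, traced_output)
```

The eager module returns a tensor of 9s while the traced one returns -2s, because only the `x * 2` branch was recorded during tracing. The warning that is silenced here is PyTorch’s way of flagging exactly this problem, so in real code it is worth reading rather than suppressing.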


## Converting a Model to TorchScript: Scripting

Scripting parses the Python source code directly and converts it into TorchScript.


In [None]:
torchscript_model = torch.jit.script(model)


### What Scripting Supports

- `if` and `while` statements (tensor-based conditions)
- `for` loops over tensors
- Proper handling of dynamic control flow
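
As an illustration of dynamic control flow surviving scripting, the sketch below scripts a made-up module, `HalveUntilSmall`, whose loop runs a different number of times depending on the input values.

```python
import torch
import torch.nn as nn


class HalveUntilSmall(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The number of iterations depends on the values in x
        while x.abs().max() >= 1:
            x = x / 2
        return x


scripted = torch.jit.script(HalveUntilSmall())

large = scripted(torch.tensor([8.0]))   # halved four times: 8 -> 4 -> 2 -> 1 -> 0.5
small = scripted(torch.tensor([0.25]))  # loop body never runs
print(large, small)

# The generated TorchScript keeps the loop instead of unrolling it
print(scripted.code)
```

Printing `scripted.code` is a convenient way to check what was actually captured, since the `while` loop should still be visible in the output.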

### TorchScript Restrictions

- No global variables
- No generators (`yield`)
- No `*args` or `**kwargs`
- No match statements
- Fixed return types
- Only TorchScript-compatible functions allowed

Despite these constraints, most real-world models can be scripted without much trouble.


## Optimizing a TorchScript Model

Once converted to TorchScript (via tracing or scripting), the model can be optimized for inference.


In [None]:
optimized_model = torch.jit.optimize_for_inference(torchscript_model)


### Important Note

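A model processed by `optimize_for_inference()` is frozen, meaning its parameters are inlined as constants. As a result, it:
- ✅ Can be used for inference
- ❌ Cannot be trained further
- ❌ No longer exposes trainable parameters to autograd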


## Saving and Loading a TorchScript Model

TorchScript models have their own `save()` and `load()` APIs.


In [None]:
torchscript_model.save("my_fashion_mnist_torchscript.pt")


In [None]:
loaded_torchscript_model = torch.jit.load(
    "my_fashion_mnist_torchscript.pt"
)
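
Here is a small self-contained round trip, using a toy `nn.Sequential` model and a temporary file as stand-ins for the Fashion MNIST classifier. Unlike a `state_dict`, the TorchScript file stores the computation as well as the weights, so no model class is needed when loading.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(42)
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
scripted = torch.jit.script(toy_model)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_torchscript.pt")
    scripted.save(path)

    # No Python class definition is required here, only the file
    reloaded = torch.jit.load(path)

x = torch.randn(3, 4)
with torch.no_grad():
    outputs_match = torch.allclose(toy_model(x), reloaded(x))
print(outputs_match)
```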


## TorchScript Status

TorchScript is no longer under active feature development.
- Bug fixes only
- Still widely used for C++ deployment
- Still fully supported


## PyTorch 2.x: torch.compile()

Since PyTorch 2.0, the recommended way to optimize models is `torch.compile()`.

This provides **Just-In-Time (JIT) compilation** with minimal code changes.


In [None]:
compiled_model = torch.compile(model)


### How `torch.compile()` Works

- Uses **TorchDynamo** to capture Python bytecode
- Handles conditionals and loops correctly
- Captures dynamic runtime information

By default, it uses **TorchInductor** to:
- Generate optimized GPU kernels (using Triton, primarily on NVIDIA GPUs)
- Generate optimized C++ code for CPUs (parallelized with OpenMP)

Other backends are available, including XLA for TPUs.
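
To check that conditionals are handled correctly, the sketch below compiles a made-up module with a data-dependent branch and calls it with inputs that take each path. It deliberately uses `backend="eager"`, a debugging backend that lets TorchDynamo capture the code but runs it without generating optimized kernels, so the example needs no C++ compiler or Triton. That choice is an assumption made for portability; with the default TorchInductor backend the call would look the same.

```python
import torch
import torch.nn as nn


class BranchOnSign(nn.Module):
    def forward(self, x):
        # Data-dependent branch: which path runs depends on the values in x
        if x.sum() > 0:
            return x * 2
        return x + 10


module = BranchOnSign()

# "eager" skips code generation; omit `backend` to use TorchInductor
compiled = torch.compile(module, backend="eager")

positive = torch.ones(3)
negative = -torch.ones(3)

out_positive = compiled(positive)
out_negative = compiled(negative)
print(out_positive, out_negative)
```

Both calls should match the uncompiled module, in contrast to tracing, where only one branch would have been recorded.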


### Using a Different Backend (Example: TPU)

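To use a different compilation backend, set the `backend` argument. For example, the `"openxla"` backend targets XLA devices such as TPUs and requires the separate `torch_xla` package to be installed:


In [None]:
compiled_model = torch.compile(model, backend="openxla")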


## Summary

You now know how to:

- Convert models to TorchScript (tracing & scripting)
- Optimize TorchScript models for inference
- Save and load TorchScript models
- Use `torch.compile()` for modern PyTorch optimization

These tools allow you to build **faster**, **portable**, and **production-ready** PyTorch models.
