***

*Course:* [Math 535](https://people.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)  
*Chapter:* 8-Deep neural networks, automatic differentiation, and stochastic gradient descent: building blocks of AI   
*Author:* [Sebastien Roch](https://people.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison  
*Updated:* May 26, 2024   
*Copyright:* &copy; 2024 Sebastien Roch

***

In [None]:
# You will need the files:
#     * mmids.py
#     * advertising.csv 
#     * SAHeart.csv 
# from https://github.com/MMiDS-textbook/MMiDS-textbook.github.io/tree/main/utils
#
# IF RUNNING ON GOOGLE COLAB (RECOMMENDED):
# "Upload to session storage" from the Files tab on the left
# Alternative instructions: https://colab.research.google.com/notebooks/io.ipynb

In [None]:
# PYTHON 3
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
import torch
import mmids

## Motivating example:  classifying natural images

In this chapter, we return to the classification problem. This time we consider more complex datasets involving natural images. We have seen an example previously, the MNIST dataset. We use a related dataset known as Fashion-MNIST, developed by [Zalando Research](https://engineering.zalando.com/tags/zalando-research.html). Quoting from their [GitHub repository](https://github.com/zalandoresearch/fashion-mnist):

> Fashion-MNIST is a dataset of Zalando's article images -- consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

**Figure:** Fashion-MNIST sample images ([Source](https://github.com/zalandoresearch/fashion-mnist))

![Fashion-MNIST sample images](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)

$\bowtie$

We first load the data and convert it to an appropriate matrix representation. The data can be accessed with [`torchvision.datasets.FashionMNIST`](https://pytorch.org/vision/stable/generated/torchvision.datasets.FashionMNIST.html).

In [None]:
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, TensorDataset

In [None]:
# Download and load the Fashion-MNIST dataset
fashion_mnist = datasets.FashionMNIST(root='./data', 
                                      train=True, 
                                      download=True, 
                                      transform=transforms.ToTensor())

For example, the first image and its label are the following. The [`squeeze()`](https://pytorch.org/docs/stable/generated/torch.Tensor.squeeze.html) below removes the color dimension in the image, which is grayscale.

In [None]:
img, label = fashion_mnist[0]
plt.figure()
plt.imshow(img.squeeze(), cmap='gray')
plt.show()

In [None]:
label
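To make the role of `squeeze()` concrete, here is a minimal sketch on a random stand-in tensor (an assumption made so that nothing needs to be downloaded). It has the `(channels, height, width)` layout that `transforms.ToTensor()` produces for a grayscale image; the name `img_demo` is hypothetical.

```python
import torch

# Stand-in for one grayscale image after ToTensor(): (channels, height, width)
img_demo = torch.rand(1, 28, 28)

# squeeze() drops every dimension of size 1; squeeze(0) drops only dimension 0
squeezed = img_demo.squeeze()
squeezed0 = img_demo.squeeze(0)

print(tuple(img_demo.shape), '->', tuple(squeezed.shape))
```

The 2-D result is the shape that `plt.imshow` expects for a grayscale image.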

In [None]:
# Define the mapping of label numbers to class names
class_names = [
    "T-shirt/top", 
    "Trouser", 
    "Pullover", 
    "Dress", 
    "Coat", 
    "Sandal", 
    "Shirt", 
    "Sneaker", 
    "Bag", 
    "Ankle boot"
]

# Function to get the class name from a label
def get_class_name(label):
    return class_names[label]

In [None]:
print(f"The label {label} corresponds to the class name '{get_class_name(label)}'.")

Here is a second example.

In [None]:
img, label = fashion_mnist[1]
plt.figure()
plt.imshow(img.squeeze(), cmap='gray')
plt.show()

In [None]:
get_class_name(label)

The purpose of this chapter is to develop some of the mathematical tools used to solve this kind of classification problem:

- deep neural networks
- automatic differentiation
- stochastic gradient descent.

## Background: Jacobian and Chain Rule; an introduction to automatic differentiation

### Brief introduction to automatic differentiation

We illustrate the use of [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to compute gradients. 

Quoting [Wikipedia](https://en.wikipedia.org/wiki/Automatic_differentiation):

> In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is distinct from symbolic differentiation and numerical differentiation (the method of finite differences). Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation.

**Automatic differentiation in PyTorch** We will use [PyTorch](https://pytorch.org/tutorials/). It uses [tensors](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html), which in many ways behave similarly to NumPy arrays. See [here](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html) for a quick introduction. Here is an example. We first initialize the tensors. Here each tensor corresponds to a single real variable. With the option [`requires_grad=True`](https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad.html#torch.Tensor.requires_grad), we indicate that these are variables with respect to which a gradient will be taken later. We initialize the tensors at the values where the derivatives will be computed. If derivatives need to be computed at different values, we need to repeat this process.

In [None]:
# Initialize variables
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

The function [`.backward()`](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html) computes the gradient using backpropagation, to which we will return later. The partial derivatives are accessed with [`.grad`](https://pytorch.org/docs/stable/generated/torch.Tensor.grad.html). We first define the function. Note that we use
[`torch.exp`](https://pytorch.org/docs/stable/generated/torch.exp.html), the PyTorch implementation of the (element-wise) exponential function. Moreover, as in NumPy, PyTorch allows the use of `**` for [taking a power](https://pytorch.org/docs/stable/generated/torch.pow.html). [Here](https://pytorch.org/docs/stable/name_inference.html) is a list of operations on tensors in PyTorch. 

In [None]:
# Define function
f = 3 * (x1 ** 2) + x2 + torch.exp(x1 * x2)

In [None]:
# Perform automatic differentiation
f.backward()  # Compute gradients

In [None]:
# Print gradients
print(x1.grad)  # df/dx1
print(x2.grad)  # df/dx2
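As a sanity check, the partial derivatives can also be worked out by hand: $\partial f/\partial x_1 = 6 x_1 + x_2 e^{x_1 x_2}$ and $\partial f/\partial x_2 = 1 + x_1 e^{x_1 x_2}$. The sketch below compares these formulas with the autograd output at $(x_1, x_2) = (1, 2)$. The names ending in `_chk` are hypothetical and chosen to avoid overwriting the variables above.

```python
import math
import torch

# Same function as above, under fresh names
x1_chk = torch.tensor(1.0, requires_grad=True)
x2_chk = torch.tensor(2.0, requires_grad=True)
f_chk = 3 * (x1_chk ** 2) + x2_chk + torch.exp(x1_chk * x2_chk)
f_chk.backward()

# Hand-derived partial derivatives evaluated at (1, 2)
df_dx1 = 6 * 1.0 + 2.0 * math.exp(1.0 * 2.0)
df_dx2 = 1 + 1.0 * math.exp(1.0 * 2.0)

print(x1_chk.grad.item(), df_dx1)
print(x2_chk.grad.item(), df_dx2)
```

The two columns should agree up to single-precision rounding.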

The input parameters can also be vectors, which allows us to consider functions of a large number of variables. Here we use [`torch.sum`](https://pytorch.org/docs/stable/generated/torch.sum.html#torch.sum) for taking a sum of the arguments.

In [None]:
# New variables for the second example
z = torch.tensor([1., 2., 3.], requires_grad=True)

In [None]:
# Perform automatic differentiation
g = torch.sum(z ** 2)
g.backward()  # Compute gradients

In [None]:
# Print gradient
print(z.grad)  # gradient is (2 z_1, 2 z_2, 2 z_3)

Here is another typical example in a data science context.

In [None]:
# Variables for the third example
X = torch.randn(3, 2)  # Random dataset (features)
y = torch.tensor([[1.], [0.], [1.]])  # Dataset (labels), a column matching X @ theta
theta = torch.ones(2, 1, requires_grad=True)  # Parameter assignment

In [None]:
# Perform automatic differentiation
predict = X @ theta  # Classifier with parameter vector theta
loss = torch.sum((predict - y)**2)  # Loss function
loss.backward()  # Compute gradients

In [None]:
# Print gradient
print(theta.grad)  # gradient of loss
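For a squared-error loss $\|X\boldsymbol{\theta} - \mathbf{y}\|^2$, the gradient has the closed form $2 X^T (X\boldsymbol{\theta} - \mathbf{y})$. The following sketch compares it with the autograd result on a small random example. The names ending in `_chk` are hypothetical, and the labels are stored as a column so that `X_chk @ theta_chk - y_chk` does not broadcast to a $3 \times 3$ matrix.

```python
import torch

torch.manual_seed(0)
X_chk = torch.randn(3, 2)                          # random features
y_chk = torch.tensor([[1.], [0.], [1.]])           # labels as a (3, 1) column
theta_chk = torch.ones(2, 1, requires_grad=True)   # parameter vector

# Gradient by automatic differentiation
loss_chk = torch.sum((X_chk @ theta_chk - y_chk) ** 2)
loss_chk.backward()

# Gradient by the closed-form expression 2 X^T (X theta - y)
with torch.no_grad():
    closed_form = 2 * X_chk.T @ (X_chk @ theta_chk - y_chk)

print(theta_chk.grad)
print(closed_form)
```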

**Implementing gradient descent in PyTorch** Rather than explicitly specifying the gradient function, we can use PyTorch to compute it automatically. This is done next. Note that the descent update is done within [`with torch.no_grad()`](https://pytorch.org/docs/stable/generated/torch.no_grad.html), which ensures that the update operation itself is not tracked for gradient computation. Here the input `x0` as well as the output `xk.numpy(force=True)` are NumPy arrays. The function [`torch.Tensor.numpy()`](https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html) converts a PyTorch tensor to a NumPy array (see the documentation for an explanation of the `force=True` option). Also, quoting ChatGPT:

> In the given code, `.item()` is used to extract the scalar value from a tensor. In PyTorch, when you perform operations on tensors, you get back tensors as results, even if the result is a single scalar value. `.item()` is used to extract this scalar value from the tensor.


In [None]:
def gd_with_ad(f, x0, alpha=1e-3, niters=int(1e6)):
    xk = torch.tensor(x0, 
                      requires_grad=True, 
                      dtype=torch.float)
    
    for _ in range(niters):
        # Compute the function value and its gradient
        value = f(xk)
        value.backward()

        # Perform a gradient descent step
        # Temporarily set all requires_grad flags to False
        with torch.no_grad():  
            xk -= alpha * xk.grad

        # Zero the gradients for the next iteration
        xk.grad.zero_()

    return xk.numpy(force=True), f(xk).item()
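The call to `xk.grad.zero_()` matters because PyTorch adds each newly computed gradient to whatever is already stored in `.grad`. Here is a minimal sketch of that accumulation on a scalar example; the variable `w` is hypothetical.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 6
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added, 6 + 6 = 12
(w ** 2).backward()
accumulated = w.grad.item()

# After zeroing, a backward pass gives 6 again
w.grad.zero_()
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```

Without the reset, each descent step in `gd_with_ad` would use the sum of all past gradients rather than the current one.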

We revisit a previous example.

In [None]:
def f(x):
    return x**3

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-10,10))
plt.legend()
plt.show()

In [None]:
gd_with_ad(f, 2, niters=int(1e4))

In [None]:
gd_with_ad(f, -2, niters=100)

$\unlhd$

## Backpropagation

In [None]:
x = torch.tensor([1.,0.,-1.], requires_grad=True)
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]])
W1 = torch.tensor([[-1.,0.],[2.,-1.]])

In [None]:
z0 = x
z1 = W0 @ z0
z2 = W1 @ z1
f = 0.5 * (torch.linalg.vector_norm(y-z2) ** 2)

In [None]:
print('z0 =',z0)
print('z1 =',z1)
print('z2 =',z2)
print('f =',f)

In [None]:
with torch.no_grad():
    F0 = W0
    F1 = W1 @ F0
    grad_f = (z2 - y).T @ F1

In [None]:
print('F0 =', F0)
print('F1 =', F1)
print('grad_f =', grad_f)

In [None]:
f.backward()

In [None]:
print('x.grad =', x.grad)

In [None]:
with torch.no_grad():
    G2 = (z2 - y).T
    G1 = G2 @ W1
    grad_f = G1 @ W0

In [None]:
print('G2 =', G2)
print('G1 =', G1)
print('grad_f =', grad_f)

In [None]:
x = torch.tensor([1.,0.,-1.])
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]], requires_grad=True)
W1 = torch.tensor([[-1.,0.],[2.,-1.]], requires_grad=True)

In [None]:
z0 = x
z1 = W0 @ z0
z2 = W1 @ z1
f = 0.5 * (torch.linalg.vector_norm(y-z2) ** 2)

In [None]:
print('z0 =',z0)
print('z1 =',z1)
print('z2 =',z2)
print('f =',f)

In [None]:
f.backward()

In [None]:
print('W0.grad =', W0.grad)

In [None]:
print('W1.grad =', W1.grad)

In [None]:
with torch.no_grad():
    grad_W0 = torch.kron((z2 - y).T @ W1, z0.T)
    grad_W1 = torch.kron((z2 - y).T, z1.T)

In [None]:
print('grad_W0 =', grad_W0)
print('grad_W1 =', grad_W1)

## Stochastic gradient descent

In [None]:
def sigmoid(z): 
    return 1/(1+np.exp(-z))

def pred_fn(x, A): 
    return sigmoid(A @ x)

def loss_fn(x, A, b): 
    return np.mean(-b*np.log(pred_fn(x, A)) - (1 - b)*np.log(1 - pred_fn(x, A)))

def grad_fn(x, A, b):
    return -A.T @ (b - pred_fn(x, A))/len(b)

def desc_update_for_logreg(grad_fn, A, b, curr_x, beta):
    gradient = grad_fn(curr_x, A, b)
    return curr_x - beta*gradient
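Before running SGD, one may want to check `grad_fn` against `loss_fn` numerically. The sketch below does this with central finite differences on synthetic data (an assumption; it is not the dataset analyzed below). The function definitions are repeated from the cell above so that the block runs on its own, and the names ending in `_chk` are hypothetical.

```python
import numpy as np

# Same definitions as in the cell above, repeated for self-containment
def sigmoid(z):
    return 1/(1+np.exp(-z))

def pred_fn(x, A):
    return sigmoid(A @ x)

def loss_fn(x, A, b):
    return np.mean(-b*np.log(pred_fn(x, A)) - (1 - b)*np.log(1 - pred_fn(x, A)))

def grad_fn(x, A, b):
    return -A.T @ (b - pred_fn(x, A))/len(b)

# Small synthetic problem
rng_chk = np.random.default_rng(0)
A_chk = rng_chk.normal(size=(20, 3))
b_chk = rng_chk.integers(2, size=20).astype(float)
x_chk = rng_chk.normal(size=3)

# Central finite differences, one coordinate at a time
eps = 1e-6
fd = np.zeros(3)
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    fd[j] = (loss_fn(x_chk + e, A_chk, b_chk)
             - loss_fn(x_chk - e, A_chk, b_chk)) / (2*eps)

analytic = grad_fn(x_chk, A_chk, b_chk)
print(fd)
print(analytic)
```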

In [None]:
seed = 535
rng = np.random.default_rng(seed)

def sgd_for_logreg(loss_fn, grad_fn, A, b, 
                   init_x, beta=1e-3, niters=int(1e5), batch=40):
    
    # initialization
    curr_x = init_x
    
    # until the maximum number of iterations
    nsamples = len(b)
    for _ in range(niters):
        I = rng.integers(nsamples, size=batch)
        curr_x = desc_update_for_logreg(
            grad_fn, A[I,:], b[I], curr_x, beta)
    
    return curr_x

We analyze a dataset from [[ESL](https://web.stanford.edu/~hastie/ElemStatLearn/)], which can be downloaded [here](https://web.stanford.edu/~hastie/ElemStatLearn/data.html). Quoting [[ESL](https://web.stanford.edu/~hastie/ElemStatLearn/), Section 4.4.2] 

> The data [...] are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987).

We load the data, which we slightly reformatted, and look at a summary. 

In [None]:
data = pd.read_csv('SAHeart.csv')
data.head()

Our goal is to predict `chd`, which stands for coronary heart disease, based on the other variables (which are briefly described [here](https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.info.txt)). We use logistic regression again. 

We first construct the data matrices. We only use three of the predictors.

In [None]:
feature = data[['tobacco', 'ldl', 'age']].to_numpy()
print(feature)

In [None]:
label = data['chd'].to_numpy()

In [None]:
A = np.concatenate((np.ones((len(label),1)),feature),axis=1)
print(A)

In [None]:
b = label

We try mini-batch SGD. 

In [None]:
init_x = np.zeros(A.shape[1])

In [None]:
best_x = sgd_for_logreg(loss_fn, grad_fn, A, b, 
                        init_x, beta=1e-3, niters=int(1e6))

In [None]:
print(best_x)

The outcome is harder to visualize. To get a sense of how accurate the result is, we compare our predictions to the true labels. Here, by prediction we mean that we predict label $1$ whenever $\sigma(\boldsymbol{\alpha}^T \mathbf{x}) > 1/2$. We try this on the training set. (A better approach would be to split the data into training and testing sets, but we will not do this here.)

In [None]:
def logis_acc(x, A, b):
    return np.sum((pred_fn(x, A) > 0.5) == b)/len(b)

In [None]:
logis_acc(best_x, A, b)
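For reference, here is a rough sketch of what a train/test split could look like. It uses synthetic data rather than the SAHeart data, an arbitrary 80/20 split, and plain gradient descent instead of SGD. All of these are assumptions made for illustration, and the names (`A_demo`, `b_demo`, `acc`, and so on) are hypothetical.

```python
import numpy as np

rng_split = np.random.default_rng(535)

# Synthetic stand-in data: intercept column plus 2 features
n_demo = 200
A_demo = np.column_stack((np.ones(n_demo), rng_split.normal(size=(n_demo, 2))))
x_true = np.array([0.5, 2.0, -1.0])
b_demo = (rng_split.random(n_demo) < 1/(1+np.exp(-A_demo @ x_true))).astype(float)

# Random 80/20 split of the row indices
perm = rng_split.permutation(n_demo)
n_train = int(0.8 * n_demo)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Fit on the training rows only (full-batch gradient descent)
x_fit = np.zeros(3)
for _ in range(2000):
    p = 1/(1+np.exp(-A_demo[train_idx] @ x_fit))
    x_fit -= 0.1 * (-A_demo[train_idx].T @ (b_demo[train_idx] - p) / n_train)

def acc(x, A, b):
    # Fraction of rows where the thresholded prediction equals the label
    return np.mean((1/(1+np.exp(-A @ x)) > 0.5) == b)

train_acc = acc(x_fit, A_demo[train_idx], b_demo[train_idx])
test_acc = acc(x_fit, A_demo[test_idx], b_demo[test_idx])
print('train accuracy:', train_acc)
print('test accuracy:', test_acc)
```

The accuracy on the held-out rows is the more honest estimate, since those rows played no role in fitting `x_fit`.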

**The `Advertising` dataset and the least-squares solution** We return to the `Advertising` dataset.

In [None]:
data = pd.read_csv('advertising.csv')
data.head()

In [None]:
n = len(data.index)
print(n)

We first compute the solution using the least-squares approach we detailed previously. We use [`numpy.column_stack`](https://numpy.org/doc/stable/reference/generated/numpy.column_stack.html#numpy.column_stack) to add a column of ones to the feature vectors.

In [None]:
TV = data['TV'].to_numpy()
radio = data['radio'].to_numpy()
newspaper = data['newspaper'].to_numpy()
sales = data['sales'].to_numpy()
features = np.stack((TV, radio, newspaper), axis=-1)
A = np.column_stack((np.ones(n), features))

In [None]:
coeff = mmids.ls_by_qr(A, sales)
print(coeff)

In [None]:
np.mean((A @ coeff - sales)**2)

**Solving the problem using PyTorch** We will be using PyTorch to implement the previous method. We first convert the data into PyTorch tensors. We then use [`torch.utils.data.TensorDataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) to create the dataset. Finally, [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) provides the utilities to load the data in batches for training. We take mini-batches of size `BATCH_SIZE = 64` and we apply a random permutation of the samples on every pass (with the option `shuffle=True`).

In [None]:
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim

In [None]:
# Convert data to PyTorch tensors
features_tensor = torch.tensor(features, 
                               dtype=torch.float32)
sales_tensor = torch.tensor(sales, 
                            dtype=torch.float32).view(-1, 1)

In [None]:
# Create a dataset and dataloader for training
BATCH_SIZE = 64
train_dataset = TensorDataset(features_tensor, sales_tensor)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
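To see what the loader produces, the sketch below iterates once over a stand-in dataset. The size of 200 samples is chosen for illustration only, and the names ending in `_demo` are hypothetical. With a batch size of 64, the last batch is smaller, and each sample appears exactly once per pass.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors: 200 samples with 3 features and 1 target each
feats_demo = torch.arange(600, dtype=torch.float32).view(200, 3)
targets_demo = torch.arange(200, dtype=torch.float32).view(-1, 1)

loader_demo = DataLoader(TensorDataset(feats_demo, targets_demo),
                         batch_size=64, shuffle=True)

# One pass (epoch): record the size of each mini-batch
batch_sizes = [inputs.shape[0] for inputs, _ in loader_demo]
print(batch_sizes)

# Another pass: collect the targets to confirm every sample is visited once
seen = torch.cat([t for _, t in loader_demo]).flatten().sort().values
print(torch.equal(seen, torch.arange(200, dtype=torch.float32)))
```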

Now we construct our model. It is simply an affine map from $\mathbb{R}^3$ to $\mathbb{R}$. Note that there is no need to pre-process the inputs by adding $1$s. A constant term (or "bias variable") is automatically added by PyTorch (unless one chooses the option [`bias=False`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)).

In [None]:
# Define the model using nn.Sequential
model = nn.Sequential(
    nn.Linear(3, 1)  # 3 input features, 1 output value
)
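As a small illustration of the affine map, the sketch below checks that `nn.Linear(3, 1)` computes $\mathbf{x} \mapsto W\mathbf{x} + b$ with a $1 \times 3$ weight matrix and a single bias. The names `lin` and `x_in` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(3, 1)        # weight of shape (1, 3), bias of shape (1,)
x_in = torch.randn(5, 3)     # a batch of 5 inputs in R^3

with torch.no_grad():
    out_lin = lin(x_in)                        # PyTorch's affine map
    by_hand = x_in @ lin.weight.T + lin.bias   # same map written explicitly

print(lin.weight.shape, lin.bias.shape)
print(torch.allclose(out_lin, by_hand, atol=1e-6))
```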

Finally, we are ready to run an optimization method of our choice on the loss function, both of which are specified next. There are many [optimizers](https://pytorch.org/docs/stable/optim.html#algorithms) available. (See this [post](https://hackernoon.com/demystifying-different-variants-of-gradient-descent-optimization-algorithm-19ae9ba2e9bc) for a brief explanation of many common optimizers.) Here we use SGD as the optimizer, and the loss function is the MSE. A quick tutorial is [here](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html).

Choosing the right number of passes (i.e. epochs) through the data requires some experimenting. Here $10^4$ suffices. But in the interest of time, we will run it only for $100$ epochs. As you will see from the results, this is not quite enough. On each pass, we compute the output of the current model, use `backward()` to obtain the gradient, and then perform a descent update with `step()`. We also have to reset the gradients first (otherwise they add up by default). 

In [None]:
# Compile the model: define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-5)

In [None]:
# Train the model
epochs = 100
for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

The final parameters and loss are:

In [None]:
# Get and print the model weights and bias
weights = model[0].weight.detach().numpy()
bias = model[0].bias.detach().numpy()
print("Weights:", weights)
print("Bias:", bias)

In [None]:
# Evaluate the model
model.eval()
with torch.no_grad():
    total_loss = 0
    for inputs, targets in train_loader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        # Weight each batch mean by its size (the last batch may be smaller)
        total_loss += loss.item() * inputs.size(0)
        
    print(f"Mean Squared Error on Training Set: {total_loss / len(train_dataset)}")

**MNIST dataset** We will use the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset introduced earlier in the chapter. This example is inspired by [these](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) [tutorials](https://www.tensorflow.org/tutorials/keras/classification).

**Figure:** MNIST sample images ([Source](https://commons.wikimedia.org/wiki/File:MnistExamples.png))

![MNIST sample images](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

$\bowtie$

We first load the data. As before, the training dataset is a tensor -- think matrix with $3$ indices. One index runs through the $60{,}000$ training images, while the other two indices run through the horizontal and vertical pixel axes of each image. Here each image is $28 \times 28$. The training labels are between $0$ and $9$.

In [None]:
from torchvision import datasets, transforms

# Load and normalize the MNIST dataset
train_dataset = datasets.MNIST(root='./data', 
                               train=True, 
                               download=True, 
                               transform=transforms.ToTensor())

test_dataset = datasets.MNIST(root='./data', 
                              train=False, 
                              download=True, 
                              transform=transforms.ToTensor())

In [None]:
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, 
                          batch_size=BATCH_SIZE, 
                          shuffle=True)

test_loader = DataLoader(test_dataset, 
                         batch_size=BATCH_SIZE, 
                         shuffle=False)

**Implementation** We implement multinomial logistic regression to learn a classifier for the MNIST data. We first check for the availability of GPUs.

In [None]:
# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() 
                      else ("mps" if torch.backends.mps.is_available() 
                            else "cpu"))
print("Using device:", device)

In PyTorch, composition of functions can be achieved with [`torch.nn.Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html). Our model is:

In [None]:
# Define the model using nn.Sequential and move it to the device (GPU if available)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 10)
).to(device)

The [`torch.nn.Flatten`](https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html) layer turns each input image into a vector of size $784$ (where $784 = 28^2$ is the number of pixels in each image). The final output is $10$-dimensional.
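The shapes can be traced on a random stand-in batch, as in this sketch. The batch of 32 single-channel images and the names ending in `_demo` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Stand-in batch: 32 grayscale images of size 28 x 28
batch_demo = torch.rand(32, 1, 28, 28)

flat_demo = nn.Flatten()(batch_demo)               # keeps the batch dimension
logits_demo = nn.Linear(28 * 28, 10)(flat_demo)    # one score per class

print(tuple(batch_demo.shape), '->', tuple(flat_demo.shape),
      '->', tuple(logits_demo.shape))
```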

Here we use the [`torch.optim.Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer (you can try SGD, but it is slow). The loss function is the [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy), as implemented by [`torch.nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), which first takes the softmax and expects the labels to be class indices (integers from $0$ to $9$) rather than their one-hot encoding.

In [None]:
# Compile the model: define loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
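To illustrate what `nn.CrossEntropyLoss` computes with its default settings, the sketch below compares it with an explicit log-softmax followed by picking out the entry of the true class and averaging over the batch. The random scores and the names ending in `_ce` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits_ce = torch.randn(4, 10)             # raw model outputs, no softmax applied
labels_ce = torch.tensor([3, 0, 9, 3])     # class indices, not one-hot vectors

# Built-in loss (averages over the batch by default)
builtin_ce = nn.CrossEntropyLoss()(logits_ce, labels_ce)

# Same quantity by hand: minus the log-probability of the true class
log_probs = F.log_softmax(logits_ce, dim=1)
by_hand_ce = -log_probs[torch.arange(4), labels_ce].mean()

print(builtin_ce.item(), by_hand_ce.item())
```

This is why the model itself ends with a linear layer and no softmax.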

In the interest of time, we train for 3 epochs only. An epoch is one training iteration where all samples are iterated once (in a randomly shuffled order).

In [None]:
# Training function
def train(dataloader, model, loss_fn, optimizer, device):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)
        
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Training loop
def training_loop(train_loader, model, loss_fn, optimizer, device, epochs=3):
    for epoch in range(epochs):
        train(train_loader, model, loss_fn, optimizer, device)
        
        if (epoch+1) % 1 == 0:
            print(f"Epoch {epoch+1}/{epochs}")

In [None]:
training_loop(train_loader, model, loss_fn, optimizer, device)

Because of the issue of [overfitting](https://en.wikipedia.org/wiki/Overfitting), we use the *test* images to assess the performance of the final classifier. 

In [None]:
# Evaluation function
def test(dataloader, model, loss_fn, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(dim=1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    accuracy = correct / size
    print(f"Test accuracy: {(100*accuracy):>0.1f}%")

In [None]:
# Evaluate the model
test(test_loader, model, loss_fn, device)

To make a prediction, we take a [`torch.nn.functional.softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html) of the output of our model. Recall that it is implicitly included in `torch.nn.CrossEntropyLoss`, but is not actually part of `model`. (Note that the softmax itself has no parameter.) 

As an illustration, we do this for each test image. We use [`torch.cat`](https://pytorch.org/docs/stable/generated/torch.cat.html) to concatenate a sequence of tensors into a single tensor.

In [None]:
import torch.nn.functional as F

def predict_softmax(dataloader, model, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()  # Set the model to evaluation mode
    predictions = []
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            probabilities = F.softmax(pred, dim=1)
            predictions.append(probabilities.cpu())  # Move predictions to CPU

    return torch.cat(predictions, dim=0)

predictions = predict_softmax(test_loader, model, device).numpy()
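The two ingredients used above, `F.softmax` along `dim=1` and `torch.cat` along `dim=0`, can be seen in isolation on tiny hand-made score tensors (hypothetical values standing in for two batches of model outputs):

```python
import torch
import torch.nn.functional as F

# Two small "batches" of scores over 3 classes
logits_a = torch.tensor([[2.0, 0.5, -1.0], [0.0, 3.0, 1.0]])
logits_b = torch.tensor([[1.0, 1.0, 4.0]])

# Softmax within each row, then stack the batches along the sample dimension
probs = torch.cat([F.softmax(logits_a, dim=1),
                   F.softmax(logits_b, dim=1)], dim=0)

print(tuple(probs.shape))      # 3 samples, 3 classes
print(probs.sum(dim=1))        # each row sums to 1
print(probs.argmax(dim=1))     # the largest score keeps the largest probability
```

Since the softmax is increasing in each score, taking the `argmax` of the probabilities gives the same label as taking the `argmax` of the raw outputs.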

The result for the first test image is shown below. To make a prediction, we choose the label with the highest probability.

In [None]:
print(predictions[0])

In [None]:
predictions[0].argmax(0)

The truth is:

In [None]:
images, labels = next(iter(test_loader))
images = images.squeeze().numpy()
labels = labels.numpy()
labels[0]

Above, `next(iter(test_loader))` loads the first batch of test images. (See [here](https://docs.python.org/3/tutorial/classes.html#iterators) for background on iterators in Python.)

The following code, adapted from [here](https://www.tensorflow.org/tutorials/keras/classification), provides a neat visualization of the results.

In [None]:
class_names = ['0', '1', '2', '3', '4',
               '5', '6', '7', '8', '9']

def plot_image(predictions_array, true_label, img):
    
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])

    plt.imshow(img, cmap=plt.cm.binary)

    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'

    plt.xlabel(f"{class_names[predicted_label]} {100*np.max(predictions_array):2.0f}% ({class_names[true_label]})", 
               color=color)

def plot_value_array(predictions_array, true_label):
    
    plt.grid(False)
    plt.xticks(range(10))
    plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color="#777777")
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)
 
    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')

Here's the first one.

In [None]:
# Visualization code for individual image
i = 0
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(predictions[i], labels[i], images[i])  
plt.subplot(1,2,2)
plot_value_array(predictions[i], labels[i]) 
plt.show()

This one is a little less clear. 

In [None]:
# Visualization code for individual and multiple images
i = 11
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(predictions[i], labels[i], images[i])  
plt.subplot(1,2,2)
plot_value_array(predictions[i], labels[i]) 
plt.show()

This one is wrong.

In [None]:
# Visualization code for individual and multiple images
i = 8
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(predictions[i], labels[i], images[i])  
plt.subplot(1,2,2)
plot_value_array(predictions[i], labels[i]) 
plt.show()

## Neural networks

**NUMERICAL CORNER:** We return to the concrete example from the previous section. We re-write the gradient as

\begin{align*}
\nabla f(\mathbf{w})^T
&= \begin{pmatrix}
[(\mathbf{z}_2 - \mathbf{y})^T
\mathcal{W}_{1} \mathrm{diag}(\mathbf{z}_1 \odot (\mathbf{1} - \mathbf{z}_1))] \otimes \mathbf{z}_0^T &
(\mathbf{z}_2 - \mathbf{y})^T
\otimes \mathbf{z}_1^T
\end{pmatrix}.
\end{align*}

We will use [`torch.nn.functional.sigmoid`](https://pytorch.org/docs/stable/generated/torch.nn.functional.sigmoid.html) and
[`torch.nn.functional.softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html) for the sigmoid and softmax functions respectively. We also use [`torch.dot`](https://pytorch.org/docs/stable/generated/torch.dot.html) for the inner product (i.e., dot product) of two vectors (as tensors) and [`torch.diag`](https://pytorch.org/docs/stable/generated/torch.diag.html) for the creation of a diagonal matrix with specified entries on its diagonal. 
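The factor $\mathrm{diag}(\mathbf{z}_1 \odot (\mathbf{1} - \mathbf{z}_1))$ in the formula is the Jacobian of the element-wise sigmoid, since $\sigma'(t) = \sigma(t)(1 - \sigma(t))$. As a quick sketch, this can be checked with `torch.autograd.functional.jacobian` on an arbitrary input vector (the values in `t` are hypothetical):

```python
import torch

t = torch.tensor([1.0, -0.5, 2.0])

# Jacobian of the element-wise sigmoid computed by automatic differentiation
jac = torch.autograd.functional.jacobian(torch.sigmoid, t)

# Expected form: diagonal matrix with entries sigma(t) * (1 - sigma(t))
s = torch.sigmoid(t)
expected_jac = torch.diag(s * (1 - s))

print(jac)
print(expected_jac)
```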

In [None]:
import torch.nn.functional as F

In [None]:
x = torch.tensor([1.,0.,-1.])
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]], requires_grad=True)
W1 = torch.tensor([[-1.,0.],[2.,-1.]], requires_grad=True)

In [None]:
z0 = x
z1 = F.sigmoid(W0 @ z0)
z2 = F.softmax(W1 @ z1, dim=0)
f = -torch.dot(torch.log(z2), y)

In [None]:
print('z0 =',z0)
print('z1 =',z1)
print('z2 =',z2)
print('f =',f)

We compute the gradient $\nabla f(\mathbf{w})$ using AD.

In [None]:
f.backward()

In [None]:
print('W0.grad =', W0.grad)

In [None]:
print('W1.grad =', W1.grad)

We use our formulas to confirm that they match these results.

In [None]:
with torch.no_grad():
    grad_W0 = torch.kron((z2 - y).T @ W1 @ torch.diag(z1 * (1-z1)), z0.T)
    grad_W1 = torch.kron((z2 - y).T, z1.T)

In [None]:
print('grad_W0 =', grad_W0)
print('grad_W1 =', grad_W1)

The results match with the AD output. $\unlhd$

**Implementation** We implement the training of a neural network in PyTorch. We use the MNIST dataset again. We first load it again. We also check for the availability of GPUs.

In [None]:
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

In [None]:
train_dataset = datasets.MNIST(root='./data', 
                               train=True, 
                               download=True, 
                               transform=transforms.ToTensor())

test_dataset = datasets.MNIST(root='./data', 
                              train=False, 
                              download=True, 
                              transform=transforms.ToTensor())

In [None]:
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, 
                          batch_size=BATCH_SIZE, 
                          shuffle=True)

test_loader = DataLoader(test_dataset, 
                         batch_size=BATCH_SIZE, 
                         shuffle=False)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() 
                      else ("mps" if torch.backends.mps.is_available() 
                            else "cpu"))
print("Using device:", device)

We construct a three-layer model.

In [None]:
model = nn.Sequential(
    nn.Flatten(),                      # Flatten the input
    nn.Linear(28 * 28, 32),            # First Linear layer with 32 nodes
    nn.Sigmoid(),                      # Sigmoid activation function
    nn.Linear(32, 10)                  # Second Linear layer with 10 nodes (output layer)
).to(device)
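To get a sense of the size of this model, one can count its trainable parameters. The sketch below rebuilds the same architecture on the CPU under the hypothetical name `mlp_demo` and compares it with the single linear layer used for multinomial logistic regression.

```python
import torch.nn as nn

mlp_demo = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 32),
    nn.Sigmoid(),
    nn.Linear(32, 10)
)
logreg_demo = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Weights and biases: 784*32 + 32 + 32*10 + 10 = 25450
n_params_mlp = sum(p.numel() for p in mlp_demo.parameters())
# Weights and biases: 784*10 + 10 = 7850
n_params_logreg = sum(p.numel() for p in logreg_demo.parameters())

print(n_params_mlp, n_params_logreg)
```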

As we did for multinomial logistic regression, we use the Adam optimizer and the cross-entropy loss (which in PyTorch includes the softmax function and expects labels to be class indices rather than one-hot encodings).

In [None]:
loss_fn = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters())

Again, we train for 3 epochs.

In [None]:
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)

On the test data, we get:

In [None]:
mmids.test(test_loader, model, loss_fn, device)

This is a significantly more accurate model than what we obtained using multinomial logistic regression. 

One can do even better using neural networks tailored for images, known as [convolutional neural networks](https://cs231n.github.io/convolutional-networks/). From [Wikipedia](https://en.wikipedia.org/wiki/Convolutional_neural_network):

> In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

More background can be found in this excellent [module](http://cs231n.github.io/convolutional-networks/) from Stanford's [CS231n](http://cs231n.github.io/). Our CNN will be a composition of [convolutional layers](http://cs231n.github.io/convolutional-networks/#conv) and [pooling layers](http://cs231n.github.io/convolutional-networks/#pool).

**LEARNING BY CHATTING:** Ask your favorite AI chatbot to explain what convolutional and pooling layers are. $\ddagger$

The new model is the following.

In [None]:
model = nn.Sequential(
    # First convolution, operating upon a 28x28 image
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    # Second convolution, operating upon a 14x14 image
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    # Third convolution, operating upon a 7x7 image
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    # Flatten the tensor
    nn.Flatten(),

    # Fully connected layer
    nn.Linear(32 * 3 * 3, 10),
).to(device)
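To see where the input size `32 * 3 * 3` of the last layer comes from, one can push a dummy batch through the layers one at a time and record the shapes. The sketch below does this on the CPU with a copy of the architecture under the hypothetical name `cnn_demo`. Each $3 \times 3$ convolution with `padding=1` keeps the spatial size, and each pooling layer halves it (rounding down), giving $28 \to 14 \to 7 \to 3$.

```python
import torch
import torch.nn as nn

cnn_demo = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 3 * 3, 10),
)

# Dummy batch of 2 grayscale 28 x 28 images
out_demo = torch.rand(2, 1, 28, 28)
shapes = []
with torch.no_grad():
    for layer in cnn_demo:
        out_demo = layer(out_demo)
        shapes.append((type(layer).__name__, tuple(out_demo.shape)))

for name, shape in shapes:
    print(f"{name:10s} {shape}")
```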

We train and test.

In [None]:
loss_fn = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters())

In [None]:
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)

In [None]:
mmids.test(test_loader, model, loss_fn, device)

The accuracy has indeed improved markedly.

Finally, we try the Fashion-MNIST dataset. We use the same CNN.
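One caveat: the cells below create a new optimizer but do not re-create `model`, so as written training appears to start from the weights already fit on MNIST. If a fresh start is preferred, one option is to re-run the cell defining the CNN. Another is to re-draw the initial weights in place, as in the following sketch. The helper `reset_all_parameters` is hypothetical (it is not part of `mmids`) and relies on the `reset_parameters()` method that layers such as `nn.Linear` and `nn.Conv2d` provide. It is shown on a tiny stand-in network.

```python
import torch
import torch.nn as nn

def reset_all_parameters(module):
    """Hypothetical helper: re-draw initial weights of every layer that supports it."""
    for layer in module.modules():
        if hasattr(layer, 'reset_parameters'):
            layer.reset_parameters()

# Tiny stand-in network (not the CNN above)
net_demo = nn.Sequential(nn.Flatten(), nn.Linear(4, 2))
before = net_demo[1].weight.detach().clone()

reset_all_parameters(net_demo)
after = net_demo[1].weight.detach().clone()

print(torch.equal(before, after))   # weights have been re-drawn
```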

In [None]:
train_dataset = datasets.FashionMNIST(root='./data', 
                                      train=True, 
                                      download=True, 
                                      transform=transforms.ToTensor())

test_dataset = datasets.FashionMNIST(root='./data', 
                                     train=False, 
                                     download=True, 
                                     transform=transforms.ToTensor())

In [None]:
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, 
                          batch_size=BATCH_SIZE, 
                          shuffle=True)

test_loader = DataLoader(test_dataset, 
                         batch_size=BATCH_SIZE, 
                         shuffle=False)

In [None]:
loss_fn = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters())

In [None]:
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)

In [None]:
mmids.test(test_loader, model, loss_fn, device)

The accuracy is not as high, as this is a more difficult dataset.