***

*Course:* [Math 535](https://people.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)  
*Chapter:* 6-Optimization theory and algorithms   
*Author:* [Sebastien Roch](https://people.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison  
*Updated:* Jan 6, 2024   
*Copyright:* &copy; 2024 Sebastien Roch

***

In [None]:
# IF RUNNING ON GOOGLE COLAB, UNCOMMENT THE FOLLOWING CODE CELL
# When prompted, upload: 
#     * mmids.py
#     * advertising.csv 
# from your local file system
# Files at: https://github.com/MMiDS-textbook/MMiDS-textbook.github.io/tree/main/utils
# Alternative instructions: https://colab.research.google.com/notebooks/io.ipynb

In [None]:
# from google.colab import files
#
# uploaded = files.upload()
#
# for fn in uploaded.keys():
#     print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

In [None]:
# PYTHON 3
import numpy as np
from numpy import linalg as LA
from numpy.random import default_rng
rng = default_rng(535)
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
import tensorflow as tf
from tensorflow import keras
import mmids

## Motivating example:  deciphering handwritten digits

We now turn to classification.

Quoting [Wikipedia](https://en.wikipedia.org/wiki/Statistical_classification):

> In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available.

We will illustrate this problem on the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. Quoting [Wikipedia](https://en.wikipedia.org/wiki/MNIST_database) again:

> The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.

Here is a sample of the images:

![MNIST sample images](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

**Figure:** MNIST sample images ([Source](https://commons.wikimedia.org/wiki/File:MnistExamples.png))

We first load the data and convert it to an appropriate matrix representation. The data can be accessed with [`tensorflow.keras.datasets.mnist`](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/mnist).

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
mnist = keras.datasets.mnist
(imgs, labels), (test_imgs, test_labels) = mnist.load_data()
len(imgs)

For example, the first image and its label are:

In [None]:
plt.figure()
plt.imshow(imgs[0])
plt.show()

In [None]:
labels[0]

For now, we look at a subset of the samples: the 0's and 1's.

To find all such samples, we use a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).

In [None]:
i01 = [i for i in range(len(labels)) if (labels[i]==0) or (labels[i]==1)]
imgs01 = imgs[i01]
labels01 = labels[i01]

In this new dataset, the first sample is:

In [None]:
plt.figure()
plt.imshow(imgs01[0])
plt.show()

In [None]:
labels01[0]

Next, we transform the images into vectors. For this we use the [`flatten()`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html) function, which returns a copy of the array collapsed into one dimension.

In [None]:
X = np.vstack([imgs01[i].flatten() for i in range(len(labels01))])
y = labels01

The input data is now of the form $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$ where $\mathbf{x}_i \in \mathbb{R}^d$ are the features and $y_i \in \{0,1\}$ is the label. Above we use the matrix representation $X \in \mathbb{R}^{n \times d}$ with rows $\mathbf{x}_i^T$, $i = 1,\ldots, n$, and $\mathbf{y} = (y_1, \ldots, y_n)^T \in \{0,1\}^n$.
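
As a quick check of the shapes involved, the sketch below repeats the flattening step on a small synthetic stack of images. The random pixel values stand in for MNIST, an assumption made so that the block runs without downloading data, and the names ending in `_demo` are hypothetical. Stacking the flattened images with `np.vstack` gives one row per image, which is the same array as a single `reshape`.

```python
import numpy as np

rng_demo = np.random.default_rng(0)

# Hypothetical stand-in for a stack of five 28x28 grayscale images
imgs_demo = rng_demo.integers(0, 256, size=(5, 28, 28))

# Same construction as above: flatten each image, then stack the vectors as rows
X_demo = np.vstack([imgs_demo[i].flatten() for i in range(len(imgs_demo))])

# Equivalent one-liner: reshape keeps the first axis and merges the last two
X_alt = imgs_demo.reshape(len(imgs_demo), -1)

print(X_demo.shape)                    # (5, 784)
print(np.array_equal(X_demo, X_alt))   # True
```

On the full dataset, the `reshape` form avoids the Python-level loop.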

Our goal: 

> to learn a classifier from the examples $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$, that is, a function $\hat{f} : \mathbb{R}^d \to \mathbb{R}$ such that $\hat{f}(\mathbf{x}_i) \approx y_i$.

This problem is referred to as [binary classification](https://en.wikipedia.org/wiki/Binary_classification).

## Background: review of differentiable functions of several variables and introduction to automatic differentiation

**NUMERICAL CORNER:** We illustrate the use of [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to compute gradients. 

Quoting [Wikipedia](https://en.wikipedia.org/wiki/Automatic_differentiation):

> In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is distinct from symbolic differentiation and numerical differentiation (the method of finite differences). Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation.

We will use [TensorFlow](https://www.tensorflow.org/overview), specifically [`tensorflow.GradientTape()`](https://www.tensorflow.org/api_docs/python/tf/GradientTape). See [here](https://www.tensorflow.org/guide/autodiff) for a quick introduction. Here is an example.

In [None]:
import tensorflow as tf

In [None]:
x = tf.Variable(1.0)
y = tf.Variable(2.0)

with tf.GradientTape() as tape:
    f = 3 * x**2 + tf.exp(x) + y

In [None]:
[df_dx, df_dy] = tape.gradient(f, [x, y])
print(df_dx.numpy())
print(df_dy.numpy())
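
One way to sanity-check the output of automatic differentiation is to compare it with the partial derivatives computed by hand, here $\partial f/\partial x = 6x + e^x$ and $\partial f/\partial y = 1$, and with a central finite-difference approximation. The sketch below does this in plain Python, without TensorFlow. The helper `central_diff`, the name `f_demo` and the step `h` are illustrative choices, not part of any library.

```python
import math

def f_demo(x, y):
    return 3 * x**2 + math.exp(x) + y

def central_diff(g, t, h=1e-6):
    # Symmetric difference quotient approximating g'(t)
    return (g(t + h) - g(t - h)) / (2 * h)

x0, y0 = 1.0, 2.0

# Partial derivatives computed by hand
df_dx_exact = 6 * x0 + math.exp(x0)
df_dy_exact = 1.0

# Finite-difference approximations, varying one variable at a time
df_dx_fd = central_diff(lambda t: f_demo(t, y0), x0)
df_dy_fd = central_diff(lambda t: f_demo(x0, t), y0)

print(df_dx_exact, df_dx_fd)
print(df_dy_exact, df_dy_fd)
```

The two columns should agree to several digits. Unlike automatic differentiation, the finite-difference values are only approximate and depend on the choice of `h`.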

The input parameters can also be vectors, which allows us to consider functions of a large number of variables.

In [None]:
z = tf.Variable([1., 2., 3.])

with tf.GradientTape() as tape:
    g = tf.reduce_sum(z**2)

In [None]:
grad_g = tape.gradient(g, z) # gradient is (2 z_1, 2 z_2, 2 z_3)
print(grad_g.numpy())

Here is another typical example.

In [None]:
X = tf.Variable(tf.random.normal((3, 2))) # dataset (features)
y = tf.Variable([[1.], [0.], [1.]]) # dataset (labels), shape (3, 1) to match X @ theta
theta = tf.Variable(tf.ones((2,1))) # parameter assignment

with tf.GradientTape() as tape:
    predict = X @ theta # classifier with parameter vector θ
    loss = tf.reduce_sum((predict - y)**2) # loss function

In [None]:
grad_loss = tape.gradient(loss, theta)
print(grad_loss.numpy())
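
For a sum-of-squares loss $\|X \boldsymbol{\theta} - \mathbf{y}\|^2$, where $\mathbf{y}$ is a column vector with the same shape as $X \boldsymbol{\theta}$, the gradient with respect to $\boldsymbol{\theta}$ has the closed form $2 X^T (X \boldsymbol{\theta} - \mathbf{y})$. The NumPy sketch below evaluates this formula on small random data (illustrative values with hypothetical names, not those of the cell above) and compares it with a finite-difference approximation.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(3, 2))        # features, one row per sample
y_demo = np.array([[1.0], [0.0], [1.0]])     # labels as a column, same shape as X_demo @ theta
theta_demo = np.ones((2, 1))                 # parameter vector

def sq_loss(theta):
    # Sum of squared residuals
    return np.sum((X_demo @ theta - y_demo) ** 2)

# Closed-form gradient of the sum of squared residuals
grad_exact = 2 * X_demo.T @ (X_demo @ theta_demo - y_demo)

# Central finite differences, one coordinate of theta at a time
h = 1e-6
grad_fd = np.zeros_like(theta_demo)
for j in range(theta_demo.shape[0]):
    e = np.zeros_like(theta_demo)
    e[j, 0] = h
    grad_fd[j, 0] = (sq_loss(theta_demo + e) - sq_loss(theta_demo - e)) / (2 * h)

print(grad_exact.ravel())
print(grad_fd.ravel())
```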

$\unlhd$

**NUMERICAL CORNER:** We return to [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation). 

Each component of the output of `gradient(f, x)` is itself a function and can also be differentiated to obtain the second derivative.

In [None]:
x = tf.Variable(0.0)
y = tf.Variable(0.0)

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        f = x * y + x**2 + tf.exp(x) * tf.cos(y)
    df_dx = t1.gradient(f, x) # needs to be within t2

print(df_dx.numpy()) # answer is 1 (see example in the next notebook)

In [None]:
d2f_dx2 = t2.gradient(df_dx, x) # answer is 3 (see example in the next notebook)
print(d2f_dx2.numpy())
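
These two values can be checked by hand. Differentiating $f(x,y) = xy + x^2 + e^x \cos y$ with respect to $x$ gives

$$
\frac{\partial f}{\partial x}(x,y) = y + 2x + e^x \cos y,
\qquad
\frac{\partial^2 f}{\partial x^2}(x,y) = 2 + e^x \cos y,
$$

so that at $(x,y) = (0,0)$ the first derivative is $0 + 0 + 1 = 1$ and the second derivative is $2 + 1 = 3$.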

$\unlhd$

$\newcommand{\bSigma}{\boldsymbol{\Sigma}}$
$\newcommand{\bmu}{\boldsymbol{\mu}}$

## Optimality conditions and convexity

**EXAMPLE:** Consider $f(x) = e^x$. Then $f'(x) = f''(x) = e^x$. Suppose we are interested in approximating $f$ in the interval $[0,1]$. We take $a=0$ and $b=1$ in *Taylor's Theorem*. The linear term is 

$$
f(a) + (x-a) f'(a) = 1 + x e^0 = 1 + x.
$$

Then for any $x \in [0,1]$

$$
f(x) = 1 + x + \frac{1}{2}x^2 e^{\xi_x}
$$

where $\xi_x \in (0,1)$ depends on $x$. We get a uniform bound on the error over $[0,1]$ by replacing $\xi_x$ with its worst possible value over $[0,1]$ 

$$
|f(x) - (1+x)| \leq \frac{1}{2}x^2 e^{\xi_x} \leq \frac{e}{2} x^2.
$$

In [None]:
x = np.linspace(0,1,100)
y = np.exp(x)
taylor = 1 + x
err = (np.exp(1)/2) * x**2

In [None]:
plt.plot(x,y,label='f')
plt.plot(x,taylor,label='taylor')
plt.legend()
plt.show()

If we plot the upper and lower bounds, we see that $f$ indeed falls within them.

In [None]:
plt.plot(x,y,label='f')
plt.plot(x,taylor,label='taylor')
plt.plot(x,taylor-err,linestyle=':',color='green',label='lower')
plt.plot(x,taylor+err,linestyle='--',color='green',label='upper')
plt.legend()
plt.show()
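
The same check can be made numerically rather than visually. The sketch below, which uses hypothetical names and a finer grid than the plot, compares the actual error $|f(x) - (1+x)|$ with the bound $\frac{e}{2} x^2$ at every grid point.

```python
import numpy as np

x_demo = np.linspace(0, 1, 1001)
actual_err = np.abs(np.exp(x_demo) - (1 + x_demo))   # |f(x) - (1 + x)|
bound = (np.e / 2) * x_demo**2                       # uniform bound (e/2) x^2

# The bound should dominate the actual error at every grid point
print(np.all(actual_err <= bound))

# Largest actual error on the grid (attained at x = 1) versus the bound there
print(actual_err.max(), bound.max())
```

The largest error is $e - 2$, attained at $x = 1$, while the bound there is $e/2$. The bound is valid but not tight.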

$\lhd$

**EXAMPLE:** Let $f(x) = x^3$. Then $f'(x) = 3 x^2$ and $f''(x) = 6 x$ so that $f'(0) = 0$ and $f''(0) \geq 0$. Hence $x=0$ is a stationary point. But $x=0$ is not a local minimizer. Indeed $f(0) = 0$ but, for any $\delta > 0$, $f(-\delta) < 0$.

In [None]:
x = np.linspace(-2,2,100)
y = x**3

In [None]:
plt.plot(x,y)
plt.ylim(-5,5)
plt.show()

$\lhd$

## Gradient descent and its convergence analysis

**NUMERICAL CORNER:** We implement gradient descent in Python. We assume that a function `f` and its gradient `grad_f` are provided. We first code the basic steepest descent step with a step size $\alpha = $`alpha`.

In [None]:
def desc_update(grad_f, x, alpha):
    return x - alpha*grad_f(x)

In [None]:
def gd(f, grad_f, x0, alpha=1e-3, niters=int(1e6)):
    
    xk = x0
    for _ in range(niters):
        xk = desc_update(grad_f, xk, alpha)

    return xk, f(xk)
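
The function `gd` above always runs for `niters` iterations. A common variant, sketched here as an illustration, stops as soon as the gradient is nearly zero. The name `gd_tol` and the parameters `tol` and `max_iters` are assumptions introduced for this example.

```python
import numpy as np

def gd_tol(grad_f, x0, alpha=1e-3, tol=1e-8, max_iters=int(1e6)):
    # Gradient descent that stops early once the gradient is nearly zero.
    # `tol` and `max_iters` are illustrative parameters, not part of `gd` above.
    xk = x0
    for k in range(max_iters):
        g = grad_f(xk)
        if np.linalg.norm(g) < tol:     # approximate stationarity reached
            return xk, k
        xk = xk - alpha * g             # same update as in desc_update
    return xk, max_iters

# Usage on the quadratic (x - 1)^2 + 10, whose gradient is 2 (x - 1)
x_min, n_steps = gd_tol(lambda x: 2 * (x - 1), 0.0, alpha=0.1)
print(x_min, n_steps)
```

The second returned value is the number of updates performed, which makes it easy to compare step sizes.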

We illustrate on a simple example.

In [None]:
def f(x): 
    return (x-1)**2 + 10

In [None]:
xgrid = np.linspace(-5,5,100)
plt.plot(xgrid, f(xgrid))
plt.show()

In [None]:
def grad_f(x):
    return 2*(x-1)

In [None]:
gd(f, grad_f, 0)

We found a global minimizer in this case.

The next example shows that a different local minimizer may be reached depending on the starting point.

In [None]:
def f(x): 
    return 4 * (x-1)**2 * (x+1)**2 - 2*(x-1)

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-1,10))
plt.legend()
plt.show()

In [None]:
def grad_f(x): 
    return 8 * (x-1) * (x+1)**2 + 8 * (x-1)**2 * (x+1) - 2

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10))
plt.legend()
plt.show()

In [None]:
gd(f, grad_f, 0)

In [None]:
gd(f, grad_f, -2)
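
To confirm that the two runs end at different local minimizers, one can check the derivative and second derivative at the points reached. The sketch below is self-contained and uses hypothetical names (`f_dw`, `grad_dw`, `hess_dw`, `run_gd`). It relies on the simplified derivative $f'(x) = 16x(x^2 - 1) - 2$ and on $f''(x) = 48x^2 - 16$, and it runs fewer iterations than the default above to keep it fast.

```python
def f_dw(x):
    return 4 * (x - 1)**2 * (x + 1)**2 - 2 * (x - 1)

def grad_dw(x):
    return 16 * x * (x**2 - 1) - 2      # simplified form of grad_f above

def hess_dw(x):
    return 48 * x**2 - 16               # second derivative

def run_gd(x0, alpha=1e-3, niters=20000):
    # Plain gradient descent with a fixed step size
    xk = x0
    for _ in range(niters):
        xk = xk - alpha * grad_dw(xk)
    return xk

x_right = run_gd(0.0)    # start at 0
x_left = run_gd(-2.0)    # start at -2

for x_star in (x_right, x_left):
    # A near-zero derivative with a positive second derivative
    # indicates a strict local minimizer
    print(x_star, f_dw(x_star), grad_dw(x_star), hess_dw(x_star))
```

Both points are strict local minimizers, but the one reached from $0$ has the smaller objective value.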

In the final example, we end up at a stationary point that is not a local minimizer. Here both the first and second derivatives are zero. This is known as a [saddle point](https://en.wikipedia.org/wiki/Saddle_point).

In [None]:
def f(x):
    return x**3

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-10,10))
plt.legend()
plt.show()

In [None]:
def grad_f(x):
    return 3 * x**2

In [None]:
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10))
plt.legend()
plt.show()

In [None]:
gd(f, grad_f, 2)

In [None]:
gd(f, grad_f, -2, niters=100)

$\unlhd$

**NUMERICAL CORNER:** We give a numerical example using a special case of logistic regression. We illustrate it on a random dataset. The functions $\hat{f}$, $\mathcal{L}$ and $\frac{\partial}{\partial x}\mathcal{L}$ are defined next.

In [None]:
rng = np.random.default_rng(535)

In [None]:
def fhat(x,a):
    return 1 / ( 1 + np.exp(-np.outer(x,a)) )

In [None]:
def loss(x,a,b): 
    return np.mean(-b*np.log(fhat(x,a)) - (1 - b)*np.log(1 - fhat(x,a)), axis=1)

In [None]:
def grad(x,a,b):
    return -np.mean((b - fhat(x,a))*a, axis=1)

In [None]:
n = 10000
a = 2*rng.uniform(0,1,n) - 1
b = rng.integers(2, size=n)
x = np.linspace(-1,1,100)

In [None]:
plt.plot(x, loss(x,a,b), label='loss')
plt.legend()
plt.show()

We plot next the upper and lower bounds in the *Quadratic Bound for Smooth Functions* around $x = x_0$. Based on *Exercise 4.17*, we can take $L=1$. Observe that minimizing the upper quadratic bound leads to a decrease in $\mathcal{L}$.

In [None]:
x0 = -0.3
x = np.linspace(x0-0.05,x0+0.05,100)
upper = loss(x0,a,b) + (x - x0)*grad(x0,a,b) + (1/2)*(x - x0)**2 # upper approximation
lower = loss(x0,a,b) + (x - x0)*grad(x0,a,b) - (1/2)*(x - x0)**2 # lower approximation

In [None]:
plt.plot(x, loss(x,a,b), label='loss')
plt.plot(x, upper, label='upper')
plt.plot(x, lower, label='lower')
plt.legend()
plt.show()
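
The claim about the quadratic bounds can also be checked numerically. The sketch below regenerates a random dataset in the same way as above, under hypothetical names, and uses scalar versions of the loss and its derivative. It verifies the two bounds on a grid around $x_0$, then takes the minimizer of the upper bound, $x_1 = x_0 - \frac{1}{L} \mathcal{L}'(x_0)$, and compares the loss at $x_0$ and $x_1$.

```python
import numpy as np

rng_demo = np.random.default_rng(535)
n_demo = 10000
a_demo = 2 * rng_demo.uniform(0, 1, n_demo) - 1
b_demo = rng_demo.integers(2, size=n_demo)

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def loss_at(x):
    # Cross-entropy loss at a scalar parameter x
    p = sigmoid(x * a_demo)
    return np.mean(-b_demo * np.log(p) - (1 - b_demo) * np.log(1 - p))

def grad_at(x):
    # Derivative of the loss with respect to x
    return -np.mean((b_demo - sigmoid(x * a_demo)) * a_demo)

L_smooth = 1.0     # smoothness constant used in the text
x0 = -0.3

# Check the two quadratic bounds on a grid around x0 (small slack for rounding)
grid = np.linspace(x0 - 0.05, x0 + 0.05, 101)
vals = np.array([loss_at(x) for x in grid])
lin = loss_at(x0) + (grid - x0) * grad_at(x0)
quad = (L_smooth / 2) * (grid - x0) ** 2
print(np.all(vals <= lin + quad + 1e-12), np.all(vals >= lin - quad - 1e-12))

# Minimizing the upper bound over x gives x1 = x0 - grad / L
x1 = x0 - grad_at(x0) / L_smooth
print(loss_at(x0), loss_at(x1))
```

The step from $x_0$ to $x_1$ is exactly a gradient descent update with step size $1/L$.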

$\unlhd$

**NUMERICAL CORNER:** We revisit our first simple single-variable example.

In [None]:
def f(x): 
    return (x-1)**2 + 10

In [None]:
xgrid = np.linspace(-5,5,100)
plt.plot(xgrid, f(xgrid))
plt.show()

Recall that the first derivative is:

In [None]:
def grad_f(x):
    return 2*(x-1)

So the second derivative is $f''(x) = 2$. Hence, this $f$ is $L$-smooth and $m$-strongly convex with $L = m = 2$. The theory we developed suggests taking step size $\alpha_t = \alpha = 1/L = 1/2$. It also implies that

$$
f(x^1) - f(x^*)
\leq \left(1 - \frac{m}{L}\right) [f(x^0) - f(x^*)]
= 0.
$$

We converge in one step! And that holds for any starting point $x^0$.
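
This can also be seen directly from the update rule. With $f'(x) = 2(x-1)$ and $\alpha = 1/2$,

$$
x^1 = x^0 - \alpha f'(x^0) = x^0 - \frac{1}{2} \cdot 2 (x^0 - 1) = 1 = x^*,
$$

whatever the value of $x^0$.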

Let's try this!

In [None]:
gd(f, grad_f, 0, alpha=0.5, niters=1)

Let's try a different starting point.

In [None]:
gd(f, grad_f, 100, alpha=0.5, niters=1)

$\unlhd$

## Backpropagation and application to neural networks

**The `Advertising` dataset and the least-squares solution** We return to the `Advertising` dataset.

In [None]:
df = pd.read_csv('advertising.csv')
df.head()

In [None]:
n = len(df.index)
print(n)

We first compute the solution using the least-squares approach we detailed previously.

In [None]:
TV = df['TV'].to_numpy()
radio = df['radio'].to_numpy()
newspaper = df['newspaper'].to_numpy()
sales = df['sales'].to_numpy()
features = np.stack((TV, radio, newspaper), axis=-1)
A = np.c_[np.ones(n), features]

In [None]:
coeff = mmids.ls_by_qr(A, sales)
print(coeff)

In [None]:
np.mean((A @ coeff - sales)**2)
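
Here `mmids.ls_by_qr` is the course's least-squares helper. As a cross-check that does not depend on it, the sketch below solves a least-squares problem of the same shape with NumPy's `numpy.linalg.lstsq` and with an explicit QR factorization. The data is synthetic, with made-up coefficients and noise level, an assumption made so that the block runs without `advertising.csv`.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_demo = 200
features_demo = rng_demo.uniform(0, 100, size=(n_demo, 3))   # three synthetic predictors
A_demo = np.c_[np.ones(n_demo), features_demo]               # prepend a column of ones
coeff_true = np.array([3.0, 0.05, 0.2, -0.01])               # made-up coefficients
target_demo = A_demo @ coeff_true + rng_demo.normal(0, 0.1, n_demo)

# Least squares via NumPy's built-in solver
coeff_lstsq, *_ = np.linalg.lstsq(A_demo, target_demo, rcond=None)

# Least squares via QR: factor A = QR, then solve R c = Q^T y
Q, R = np.linalg.qr(A_demo)
coeff_qr = np.linalg.solve(R, Q.T @ target_demo)

print(coeff_lstsq)
print(coeff_qr)
print(np.mean((A_demo @ coeff_lstsq - target_demo) ** 2))    # mean squared error
```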

**Solving the problem using TensorFlow** We will be using [TensorFlow](https://www.tensorflow.org/overview) to implement the previous method. A quick tutorial can be found [here](https://www.tensorflow.org/tutorials/quickstart/beginner).

We use [`tensorflow.data.Dataset.from_tensor_slices()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) to set up the data. It takes as input the feature matrix and the vector of responses, and slices them along the first axis into one (features, response) pair per sample. Here we take mini-batches of size `BATCH_SIZE = 64` (using [`batch()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch)) and we reshuffle the samples on every pass through the data with `SHUFFLE_BUFFER_SIZE = 100` (using [`shuffle`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle)). Note that the shuffle draws from a buffer of that size, so it is a uniformly random permutation only when the buffer is at least as large as the dataset.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((features, sales))

In [None]:
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
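
To make the batching concrete, here is a NumPy sketch of one pass through the data, with a full random permutation followed by a split into mini-batches of size $64$. It illustrates the idea only: `tf.data` shuffles through a buffer rather than permuting all indices at once, and the sample count `n_demo = 200` is an assumption chosen for illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_demo = 200
batch_size = 64

# Random order of the sample indices for this pass
perm = rng_demo.permutation(n_demo)

# Consecutive slices of the permuted indices form the mini-batches
batches = [perm[start:start + batch_size] for start in range(0, n_demo, batch_size)]

print([len(batch) for batch in batches])   # [64, 64, 64, 8]
```

The last mini-batch is smaller whenever the batch size does not divide the number of samples.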

Now we construct our model. It is simply an affine map from $\mathbb{R}^3$ to $\mathbb{R}$. Note that there is no need to pre-process the inputs by adding $1$s. A constant term (or "bias variable") is automatically added by TensorFlow (unless one chooses the option [`use_bias=False`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)).

In [None]:
model = tf.keras.Sequential([
    layers.Dense(input_dim=3, units=1)
])
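
The following NumPy sketch illustrates why no column of $1$s is needed. A dense layer with one unit computes $\mathbf{x}^T \mathbf{w} + b$, which is the same as a linear map applied to the input augmented with a leading $1$. The arrays are random and the names hypothetical; the shapes follow the Keras convention of a kernel of shape (inputs, units) and a bias of shape (units,).

```python
import numpy as np

rng_demo = np.random.default_rng(0)
x_demo = rng_demo.normal(size=(5, 3))     # five samples with three features
w_demo = rng_demo.normal(size=(3, 1))     # weights, shape (inputs, units)
b_demo = rng_demo.normal(size=(1,))       # bias, one entry per unit

# Affine map with a separate bias term
out_affine = x_demo @ w_demo + b_demo

# Same map written as a linear one on inputs augmented with a column of ones
A_demo = np.c_[np.ones(len(x_demo)), x_demo]
coeff_demo = np.concatenate([b_demo, w_demo.ravel()])
out_augmented = A_demo @ coeff_demo

print(np.allclose(out_affine.ravel(), out_augmented))   # True
```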

Finally, the function [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) runs an optimization method of our choice on the loss function, both of which are specified by [`compile()`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile). There are many [optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) available. (See this [post](https://hackernoon.com/demystifying-different-variants-of-gradient-descent-optimization-algorithm-19ae9ba2e9bc) for a brief explanation of many common optimizers.) Here we use SGD as the optimizer, and the loss function is the MSE.

Choosing the right number of passes (i.e. epochs) through the data requires some experimenting. Here $10^4$ suffices. But in the interest of time, we will run it only for $10$ epochs. As you will see from the results, this is far from enough. 

In [None]:
model.compile(
    optimizer=tf.optimizers.SGD(learning_rate=1e-5),
    loss='mean_squared_error'
    )

In [None]:
model.fit(train_dataset, epochs=10, verbose=2) # train_dataset is already batched

The final parameters and loss are:

In [None]:
print(model.layers[0].get_weights())

In [None]:
model.evaluate(train_dataset, verbose=2)

An alternative way to compute the loss is:

In [None]:
sales_pred = model(features).numpy().reshape((n,))

In [None]:
mse = tf.keras.losses.MeanSquaredError()
mse(sales, sales_pred).numpy()

**MNIST dataset** We will use the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset introduced earlier in the chapter. This section is partly inspired by this [tutorial](https://www.tensorflow.org/tutorials/keras/classification).

Here is a sample of the images:

![MNIST sample images](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

**Figure:** MNIST sample images ([Source](https://commons.wikimedia.org/wiki/File:MnistExamples.png))

We first load the data.

In [None]:
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

The training dataset is a [tensor](https://en.wikipedia.org/wiki/Tensor) - think matrix with $3$ indices. One index runs through the $60{,}000$ training images, while the other two indices run through the horizontal and vertical pixel axes of each image. Here each image is $28 \times 28$.

In [None]:
train_images.shape

For example, the first training image follows. Note that the pixels take values between $0$ and $255$.

In [None]:
train_images[0]

The training labels are between $0$ and $9$.

In [None]:
train_labels

We will also use a test dataset provided in MNIST to assess the accuracy of our classifiers.

In [None]:
test_images.shape

In [None]:
len(test_labels)

As is recommended by TensorFlow, before proceeding we first pre-process the images to take values between $0$ and $1$.

In [None]:
train_images = train_images / 255.0
test_images = test_images / 255.0

**Implementation** We implement multinomial logistic regression to learn a classifier for the MNIST data.

In Keras, composition of functions can be achieved with [`Sequential()`](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential). Our model is:

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10)
])

The [`Flatten`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten) layer turns each input image into a vector of size $784$ (where $784 = 28^2$ is the number of pixels in each image). The output is $10$-dimensional.

Here we use the [`adam`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) optimizer (you can try SGD, but it is slow). The loss function is the cross-entropy, as implemented by [`tensorflow.keras.losses.SparseCategoricalCrossentropy()`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy). To monitor progress, we will keep track of the `accuracy` metric, which calculates how often predictions equal labels.

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
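
To make the loss and the metric concrete, here is a NumPy sketch of the cross-entropy computed from raw scores (logits) and integer labels, averaged over the batch, together with the accuracy. It illustrates the formulas and is not TensorFlow's implementation; the function names and the small arrays are made up for the example.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    # Mean cross-entropy between integer labels and softmax(logits)
    shifted = logits - logits.max(axis=1, keepdims=True)     # stabilize the exponentials
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def accuracy_from_logits(logits, labels):
    # Fraction of samples whose largest logit sits at the true label
    return np.mean(np.argmax(logits, axis=1) == labels)

logits_demo = np.array([[2.0, 0.5, -1.0],
                        [0.0, 0.0, 0.0],
                        [-1.0, 3.0, 0.2]])
labels_demo = np.array([0, 2, 2])

print(sparse_ce_from_logits(logits_demo, labels_demo))
print(accuracy_from_logits(logits_demo, labels_demo))
```

A useful reference point: when all logits are equal, the predicted distribution is uniform and the loss equals $\log K$, where $K$ is the number of classes.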

We train for $10$ epochs. An epoch is one full pass through the training samples.

In [None]:
model.fit(train_images, train_labels, epochs=10, verbose=2)

The accuracy achieved (here `0.9306`) is measured on the training set, which is misleading because of [overfitting](https://en.wikipedia.org/wiki/Overfitting). We use the test images to assess the performance of the final classifier. 

In [None]:
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
print('\nTest accuracy:', test_acc)

To make a prediction, we add a [softmax](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Softmax) layer to our model. It transforms the output into a probability for each label. We compute it for each test image.

In [None]:
probability_model = tf.keras.Sequential([model, 
                                         tf.keras.layers.Softmax()])

In [None]:
predictions = probability_model.predict(test_images, verbose=2)

The result for the first test image is shown below. To make a prediction, we choose the label with the highest probability.

In [None]:
predictions[0]

In [None]:
np.argmax(predictions[0])
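
The softmax map itself is short to write down. The NumPy sketch below, with a hypothetical helper and made-up logits, shows that each row of the output sums to $1$ and that taking the `argmax` of the probabilities selects the same label as taking the `argmax` of the raw outputs.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum leaves the result unchanged and avoids overflow
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits_demo = np.array([[1.0, 2.0, 3.0],
                        [10.0, -5.0, 0.0]])
probs_demo = softmax(logits_demo)

print(probs_demo.sum(axis=1))             # each row sums to 1
print(np.argmax(probs_demo, axis=1))      # [2 0]
print(np.argmax(logits_demo, axis=1))     # [2 0], same labels as above
```

The softmax layer is therefore needed to read off probabilities, but not to pick the predicted label.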

The truth is:

In [None]:
test_labels[0]

The following code from this [excellent tutorial](https://www.tensorflow.org/tutorials/keras/classification) provides a neat visualization of the results.

In [None]:
class_names = ['0', '1', '2', '3', '4',
               '5', '6', '7', '8', '9']

In [None]:
def plot_image(i, predictions_array, true_label, img):
    true_label, img = true_label[i], img[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])

    plt.imshow(img, cmap=plt.cm.binary)

    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'

    plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                100*np.max(predictions_array),
                                class_names[true_label]),
                                color=color)

def plot_value_array(i, predictions_array, true_label):
    true_label = true_label[i]
    plt.grid(False)
    plt.xticks(range(10))
    plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color="#777777")
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)

    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')

In [None]:
i = 0
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(i, predictions[i], test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(i, predictions[i],  test_labels)
plt.show()

In [None]:
num_rows = 5
num_cols = 3
num_images = num_rows*num_cols
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for i in range(num_images):
    plt.subplot(num_rows, 2*num_cols, 2*i+1)
    plot_image(i, predictions[i], test_labels, test_images)
    plt.subplot(num_rows, 2*num_cols, 2*i+2)
    plot_value_array(i, predictions[i], test_labels)
plt.tight_layout()
plt.show()

**Implementation** We implement a neural network in TensorFlow. We use the MNIST dataset again. We first load the data and preprocess it.

In [None]:
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [None]:
train_images = train_images / 255.0
test_images = test_images / 255.0

We construct a three-layer model.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(32,activation='sigmoid'),
    tf.keras.layers.Dense(10)
])
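
To see what this model computes, here is a NumPy sketch of its forward pass with randomly initialized parameters. The weights, the input images and all names are stand-ins chosen for illustration; only the shapes mirror the layers above.

```python
import numpy as np

rng_demo = np.random.default_rng(0)

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Randomly initialized parameters with the same shapes as the layers above
W1 = rng_demo.normal(scale=0.1, size=(784, 32))
b1 = np.zeros(32)
W2 = rng_demo.normal(scale=0.1, size=(32, 10))
b2 = np.zeros(10)

def forward(images):
    x = images.reshape(len(images), -1)   # Flatten: (batch, 28, 28) -> (batch, 784)
    h = sigmoid(x @ W1 + b1)              # hidden layer with 32 sigmoid units
    return h @ W2 + b2                    # output layer: 10 logits, no activation

images_demo = rng_demo.uniform(0, 1, size=(4, 28, 28))   # stand-in for scaled images
logits_demo = forward(images_demo)
n_params = W1.size + b1.size + W2.size + b2.size

print(logits_demo.shape)   # (4, 10)
print(n_params)            # 25450
```

The parameter count is $784 \cdot 32 + 32 + 32 \cdot 10 + 10 = 25450$, compared with $784 \cdot 10 + 10 = 7850$ for multinomial logistic regression.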

As we did for multinomial logistic regression, we use the Adam optimizer and the cross-entropy loss. We also monitor progress by keeping track of the accuracy on the training data.

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

We train for $10$ epochs.

In [None]:
model.fit(train_images, train_labels, epochs=10, verbose=2)

On the test data, we get:

In [None]:
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
print('\nTest accuracy:', test_acc)

If you run it for $20$ epochs instead of $10$, you will see the accuracy improve *on the training set*, but the accuracy on the test set will not improve much. Try it! 

Still, this is a significantly more accurate model than what we obtained using multinomial logistic regression. One can do even better using neural networks tailored for images, known as [convolutional neural networks](https://cs231n.github.io/convolutional-networks/).