Book at [deeplearningwithpython.io](https://deeplearningwithpython.io).

In [None]:
%pip install keras keras-hub matplotlib --upgrade -q

In [None]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"

In [None]:
# @title
import os
from IPython.core.magic import register_cell_magic

@register_cell_magic
def backend(line, cell):
    current, required = os.environ.get("KERAS_BACKEND", ""), line.split()[-1]
    if current == required:
        get_ipython().run_cell(cell)
    else:
        print(
            f"This cell requires the {required} backend. To run it, change KERAS_BACKEND to "
            f"\"{required}\" at the top of the notebook, restart the runtime, and rerun the notebook."
        )

## Chapter 2 - The mathematical building blocks of neural networks
### A first look at a neural network
We start by loading the MNIST dataset, which contains grayscale images of handwritten digits (0 to 9). Each image is 28 x 28 pixels, and each pixel is an integer value between 0 and 255. The goal of the neural network is to classify each image into its digit class.

In [76]:
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Let's now look at the dataset we imported. It consists of 60,000 training samples and 10,000 test samples. Each sample is a 28x28 grayscale image, represented as a 2D array of pixel values ranging from 0 to 255. With the ``shape`` attribute, we can see the dimensions of the training and test sets:

In [77]:
print("Training set shape:", train_images.shape)
print("Test set shape:", test_images.shape)

Training set shape: (60000, 28, 28)
Test set shape: (10000, 28, 28)


By convention, the first axis (or dimension) represents the samples, while all the other dimensions describe the data of each sample that will be provided to the neural network.
Let's now look at the labels for both the training and the test set. The labels are an array of digits between 0 and 9, with the i-th label being the digit shown in the i-th image.

In [78]:
print("Number of training labels:", len(train_labels))
print("Training labels:", train_labels)

print("Number of test labels:", len(test_labels))
print("Test labels:", test_labels)

Number of training labels: 60000
Training labels: [5 0 4 ... 5 6 8]
Number of test labels: 10000
Test labels: [7 2 1 ... 4 5 6]


Here is our first neural network example! At this point, not all the details have to be clear. The important thing is to get a general idea of what a neural network does. We will go through all the details in the next sections.

In [79]:
import keras
from keras import layers

model = keras.Sequential(
    [
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ]
)

The topology of the network has a first dense layer with 512 units and ReLU activation function, followed by a second dense layer with 10 units and softmax activation function. The first layer takes as input the 784-dimensional vector (28x28 pixels flattened) and outputs a 512-dimensional vector. The second layer takes this 512-dimensional vector and outputs a 10-dimensional vector, which represents the probabilities of each digit class (0-9).
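As a quick sanity check on these layer sizes, the number of trainable parameters can be worked out by hand. This is only a sketch: the helper name `dense_param_count` is made up for illustration, and it assumes each dense layer holds one weight per input-unit pair plus one bias per unit.

```python
def dense_param_count(input_size, units):
    """Weights (input_size x units) plus one bias per unit."""
    return input_size * units + units

first = dense_param_count(28 * 28, 512)  # 784 inputs -> 512 units
second = dense_param_count(512, 10)      # 512 inputs -> 10 units
print("First layer:", first)
print("Second layer:", second)
print("Total:", first + second)
```

Almost all of the parameters sit in the first layer, because it connects every one of the 784 pixels to every one of the 512 units.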

In [80]:
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

The optimizer used is "adam", which is a popular optimization algorithm for training neural networks. The loss function used is "sparse_categorical_crossentropy", which is suitable for multi-class classification problems where the labels are provided as integers. The metric used to evaluate the model's performance during training and testing is "accuracy", which measures the proportion of correctly classified samples.
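To make the loss a bit more concrete, here is a minimal NumPy sketch of what a sparse categorical crossentropy computes: for each sample, minus the logarithm of the probability assigned to the true class. The function name and the probability values are invented for illustration; the Keras implementation may differ in details such as numerical safeguards against `log(0)`.

```python
import numpy as np

def naive_sparse_categorical_crossentropy(labels, probabilities):
    """Per-sample loss: minus the log of the probability of the true class."""
    rows = np.arange(len(labels))
    return -np.log(probabilities[rows, labels])

# Two made-up samples with three classes each (illustrative values only).
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.1, 0.8]])
labels = np.array([0, 1])  # integer labels, not one-hot vectors

losses = naive_sparse_categorical_crossentropy(labels, probabilities)
print("Per-sample loss:", losses)
print("Mean loss:", losses.mean())
```

The second sample gets a much larger loss, because the model gave its true class a probability of only 0.1.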

In [81]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

With the ``fit`` method, we train the model for 5 epochs (iterations over the entire training dataset) with a batch size of 128 samples. The training process will output the loss and accuracy for each epoch.

In [82]:
model.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 7s 9ms/step - accuracy: 0.9277 - loss: 0.2580
Epoch 2/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 4s 8ms/step - accuracy: 0.9694 - loss: 0.1052
Epoch 3/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 4s 8ms/step - accuracy: 0.9796 - loss: 0.0687
Epoch 4/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 4s 7ms/step - accuracy: 0.9864 - loss: 0.0476
Epoch 5/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9897 - loss: 0.0356


<keras.src.callbacks.history.History at 0x7e48bd6d7290>
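The `469/469` counter in the log is the number of batches per epoch. A small sketch of the arithmetic, assuming the last, incomplete batch is still processed (which is consistent with the count shown in the log):

```python
import math

num_samples = 60000
batch_size = 128

# Number of batches needed to cover the whole training set once.
steps_per_epoch = math.ceil(num_samples / batch_size)
# The final batch holds whatever is left after the full batches.
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size

print("Steps per epoch:", steps_per_epoch)
print("Samples in the last batch:", last_batch_size)
```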

We can use the trained model to make predictions on the test set with the ``predict`` method. For each test sample, it outputs an array of predicted probabilities, one per class (0-9). We can then use the ``argmax`` method to get the class with the highest probability, which gives us the predicted digit label.

In [83]:
test_digits = test_images[0:10]
some_labels = test_labels[0:10]
predictions = model.predict(test_digits)
print("Prediction vector:", predictions[0])
print("Prediction:", predictions[0].argmax())
print("Label:", some_labels[0])

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step
Prediction vector: [3.1338152e-06 5.9011097e-07 3.3277516e-05 1.0968582e-03 1.7676082e-09
 8.5597021e-06 2.7874744e-10 9.9867558e-01 1.5543135e-05 1.6653792e-04]
Prediction: 7
Label: 7
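The cell above inspects a single prediction. To turn a whole batch of probability vectors into labels at once, `argmax` can be applied along the class axis. The sketch below uses made-up probabilities rather than real model output, so the names and values are purely illustrative.

```python
import numpy as np

# Made-up probabilities for 3 samples and 4 classes (not real model output).
fake_predictions = np.array([[0.10, 0.70, 0.10, 0.10],
                             [0.05, 0.05, 0.10, 0.80],
                             [0.60, 0.20, 0.10, 0.10]])

predicted_classes = fake_predictions.argmax(axis=1)  # one class index per row
confidence = fake_predictions.max(axis=1)            # probability of that class
print("Predicted classes:", predicted_classes)
print("Confidence:", confidence)
```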


Finally, to evaluate if our model learned well, we use the ``evaluate`` method on the test set. This will output the loss and accuracy on the test data, which gives us an idea of how well the model generalizes to unseen data.

In [84]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"test_acc: {test_acc}")

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9790 - loss: 0.0649
test_acc: 0.9789999723434448


### Data representations for neural networks
The most common way to represent data for neural networks is using multi-dimensional arrays, also known as **tensors**. Tensors are generalizations of matrices to higher dimensions. For example, a 2D tensor is a matrix, a 3D tensor can be thought of as a stack of matrices (a cube of numbers) and so on. Tensors can have any number of dimensions, and each dimension is called an **axis**. The number of axes is called the **rank** of the tensor. For example, a scalar (single number) is a rank-0 tensor, a vector (1D array) is a rank-1 tensor, a matrix (2D array) is a rank-2 tensor, and so on.

To get the rank of a tensor in NumPy, we can use the ``ndim`` attribute of a NumPy array. The ``shape`` attribute, in turn, gives the size of the tensor along each axis.

#### Scalars (rank-0 tensors)

In [None]:
import numpy as np
x = np.array(12)
print("x:", x)
print("Rank of x:", x.ndim)
print("Shape of x:", x.shape)

As you can see from the output, the rank of the scalar tensor is 0, and its shape is an empty tuple, indicating that it has no dimensions.

#### Vectors (rank-1 tensors)

In [None]:
x = np.array([12, 3, 6, 14, 7])
print("x:", x)
print("Rank of x:", x.ndim)
print("Shape of x:", x.shape)

#### Matrices (rank-2 tensors)

In [None]:
x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])
print("x:", x)
print("Rank of x:", x.ndim)
print("Shape of x:", x.shape)

#### Rank-3 tensors and higher-rank tensors

In [None]:
x = np.array([[[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]]])
print("x:", x)
print("Rank of x:", x.ndim)
print("Shape of x:", x.shape)

#### Key attributes
Three key attributes of tensors are important to understand when working with neural networks:
- **Rank**: The number of axes (dimensions) of the tensor.
- **Shape**: A tuple of integers representing the size of the tensor along each axis.
- **Data type**: The type of data stored in the tensor (e.g., float32, int32, etc.).

In [None]:
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print("Rank of dataset:", train_images.ndim)
print("Shape of dataset:", train_images.shape)
print("Datatype of dataset:", train_images.dtype)

#### Off-topic: show images from the dataset using matplotlib
``matplotlib`` is a popular library for data visualization in Python. We can use it to display images from the MNIST dataset.

In [None]:
import matplotlib.pyplot as plt

digit = train_images[4]
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()
print("Label:", train_labels[4])

#### Slicing tensors in NumPy
Slicing means selecting a specific portion of a tensor. In NumPy, we slice tensors using the colon (`:`) operator along with indices. The syntax is `start:stop:step`, where `start` is the index at which the slice begins (inclusive), `stop` is the index at which it ends (exclusive), and `step` is the step size. For a positive step, the default values of `start`, `stop`, and `step` are 0, the size of the axis (number of elements along the axis), and 1, respectively. Negative indices count from the end of the axis. To slice along multiple axes, separate the slice specifications with commas inside a single pair of square brackets. Chaining brackets, as in `x[a:b][c:d]`, is not equivalent: it slices the first axis twice.

In [None]:
my_slice = train_images[10:100]
print("Slice shape:", my_slice.shape)

In [None]:
my_slice = train_images[10:100, :, :]
print("Equivalent slice shape:", my_slice.shape)

In [None]:
my_slice = train_images[10:100, 0:28, 0:28]
print("Another equivalent slice shape:", my_slice.shape)

In [None]:
print("Image sliced along the pixels dimensions. It slices the bottom-right part of all images.")
my_slice = train_images[:, 14:, 14:]
print("Slice shape:", my_slice.shape)

In [None]:
print("Image sliced along the pixels dimensions. It slices the central part of all images.")
my_slice = train_images[:, 7:-7, 7:-7]
print("Slice shape:", my_slice.shape)
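The cells above all use the default step of 1. As a small complement, the sketch below shows the `step` part of the syntax and a negative start index. It uses a small stand-in array instead of the real MNIST images so that it runs without downloading anything; the variable names are illustrative.

```python
import numpy as np

# Stand-in for a few MNIST images: 5 samples of 28x28 pixels.
images = np.arange(5 * 28 * 28).reshape((5, 28, 28))

every_other_pixel = images[:, ::2, ::2]  # step of 2 along both pixel axes
flipped = images[:, ::-1, :]             # negative step reverses the rows
last_two = images[-2:]                   # negative start counts from the end

print("Downsampled shape:", every_other_pixel.shape)
print("Flipped shape:", flipped.shape)
print("Last two samples shape:", last_two.shape)
```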

#### The notion of data batches
Usually, along the first axis of a tensor, we have the samples (or data points). When training neural networks, it is common to process the data in smaller groups called **batches**. Batching helps to reduce memory consumption and can lead to faster convergence during training.

When we slice a tensor along the first axis, we are effectively selecting a batch of samples. For example, if we have a tensor of shape (60000, 28, 28) representing 60,000 images of size 28x28 pixels, slicing it as `tensor[0:128]` will give us a batch of the first 128 images, resulting in a new tensor of shape (128, 28, 28).

In [None]:
# First batch
batch = train_images[:128]

# Second batch
batch = train_images[128:256]

# Generic implementation for the batch with zero-based index n (n = 2 is the third batch)
n = 2
batch = train_images[128 * n : 128 * (n + 1)]


#### Real-world examples of data tensors
Real-world data can come in various forms, and neural networks can handle different types of data tensors. Here are some examples:
##### Vector data
When we have real-world data represented as vectors, each sample is a 1D array of features. For example, in a dataset of house prices, each sample could represent a house with features such as size, number of bedrooms, location, etc. If we have 1000 houses and each house has 10 features, the data tensor would have a shape of (1000, 10).
##### Timeseries data or sequence data
Timeseries data consists of sequences of data points collected over time. Each sample is a sequence of feature vectors, so the dataset is a rank-3 tensor with axes (samples, timesteps, features). For example, in a dataset of stock prices, we record the bid and ask prices every minute. An entire trading day (6.5 hours) gives 390 minutes of data. With data for 1000 days, the data tensor has a shape of (1000, 390, 2), where 2 stands for the bid and ask prices.
##### Image data
Each image is usually a rank-3 tensor with height, width, and channel axes (e.g., RGB channels for color images), so a dataset of images is a rank-4 tensor. For example, each MNIST image is a grayscale image of 28x28 pixels. With an explicit channel axis, 60,000 such images form a tensor of shape (60000, 28, 28, 1), where 1 is the single grayscale channel. As loaded by Keras, MNIST omits this axis and has shape (60000, 28, 28). Color images of the same size would have 3 channels (red, green, blue), resulting in a shape of (60000, 28, 28, 3).
##### Video data
Each video can be represented as a rank-4 tensor, a sequence of frames (images, see above) over time, so a dataset of videos is a rank-5 tensor. For example, if we have a dataset of 1000 videos, each consisting of 30 frames of 64x64 pixels with 3 color channels (RGB), the data tensor would have a shape of (1000, 30, 64, 64, 3).
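The shapes quoted in these examples can be checked with a few lines of NumPy. This is only an illustrative sketch: the dictionary and variable names are made up, and `np.broadcast_to` is used so that the large tensors are read-only views rather than real allocations.

```python
import numpy as np

# Shapes taken from the examples above; np.broadcast_to builds read-only
# views, so no large arrays are actually allocated.
example_shapes = {
    "vector data": (1000, 10),
    "timeseries data": (1000, 390, 2),
    "grayscale images": (60000, 28, 28, 1),
    "video data": (1000, 30, 64, 64, 3),
}

for name, shape in example_shapes.items():
    tensor = np.broadcast_to(np.float32(0), shape)
    print(f"{name}: rank {tensor.ndim}, shape {tensor.shape}")

# Adding a trailing channel axis to images stored as (samples, height, width).
images = np.zeros((4, 28, 28), dtype="uint8")
images_with_channel = np.expand_dims(images, axis=-1)
print("With channel axis:", images_with_channel.shape)
```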

### Tensor operations
At their core, neural networks are just a chain of parametrized (differentiable) functions applied to the input tensors. These functions are built by composing a small set of basic tensor operations. In this section, we will explore some of the most common tensor operations used in neural networks.
#### Element-wise operations
The simplest tensor operations are element-wise operations, which are applied independently to each element of the tensor. They can be unary (involving a single tensor) or binary (involving two tensors). For example, the Rectified Linear Unit (ReLU) activation function is an element-wise unary operation that sets all negative values in the tensor to zero.

In [None]:
def naive_relu(x):
    x = x.copy()
    if len(x.shape) == 1:
        for i in range(x.shape[0]):
            x[i] = max(x[i], 0)
    else:
        for i in range(x.shape[0]):
            x[i] = naive_relu(x[i])
    return x

array = np.array([[1, 2, -3], [2, -1, 3], [0, -32, -1]])
print("Array:\n", array)
# naive_relu works on a copy, so we must keep the returned array
result = naive_relu(array)
print("\nArray w/ ReLU:\n", result)

Another simple example is the addition of two tensors of the same shape, which is an element-wise binary operation that adds corresponding elements from both tensors.

In [None]:
def naive_add(x, y):
    assert len(x.shape) == len(y.shape)
    assert x.shape == y.shape
    x = x.copy()
    if len(x.shape) == 1:
        for i in range(x.shape[0]):
            x[i] += y[i]
    else:
        for i in range(x.shape[0]):
            x[i] = naive_add(x[i], y[i])
    return x

x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n", x)

y = np.array([[10, 20, 30], [40, 50, 60]])
print("\ny:\n", y)

result = naive_add(x, y)
print("\nx + y:\n", result)


Of course the real implementation of these operations is optimized and done in low-level languages like C or C++ for performance reasons. You can see the speed difference when working with large tensors.

In [None]:
import time

x = np.random.random((20, 100))
y = np.random.random((20, 100))

t0 = time.time()
for _ in range(1000):
    z = x + y
    z = np.maximum(z, 0.0)
print("NumPy took: {0:.2f} s".format(time.time() - t0))

In [None]:
t0 = time.time()
for _ in range(1000):
    z = naive_add(x, y)
    z = naive_relu(z)
print("Naive took: {0:.2f} s".format(time.time() - t0))

#### Broadcasting
Element-wise operations can also be performed on tensors of different shapes, thanks to a mechanism called **broadcasting**. When there is no ambiguity, the smaller tensor is expanded to match the shape of the larger one: axes are added to it as needed, and it is repeated along those new axes. For example, when adding a tensor of shape (32, 10) and a tensor of shape (10,), the smaller tensor is broadcast to shape (32, 10), and the result is a new tensor of shape (32, 10).

In [None]:
import numpy as np

X = np.random.random((32, 10))
y = np.random.random((10,))

y = np.expand_dims(y, axis=0)
Y = np.tile(y, (32, 1))
print("Shape of Y:", Y.shape)

In practice, when performing operations on broadcasting-compatible tensors, the smaller tensor is not actually copied in memory to match the shape of the larger tensor, as that would be terribly inefficient. Instead, the operation is performed as if the smaller tensor had been expanded, without materializing the expanded tensor in memory. See the following example with the ``np.maximum`` function.

In [None]:
import numpy as np

x = np.random.random((64, 3, 32, 10))
y = np.random.random((32, 10))
z = np.maximum(x, y)
print("Shape max:", z.shape)
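To see what broadcasting amounts to, here is a naive sketch of adding a vector to every row of a matrix with explicit loops, in the same spirit as `naive_add` above. The function name is illustrative, and the loops are only meant to clarify the semantics; NumPy does not work this way internally.

```python
import numpy as np

def naive_add_matrix_and_vector(x, y):
    """Add a rank-1 tensor y to every row of a rank-2 tensor x."""
    assert x.ndim == 2
    assert y.ndim == 1
    assert x.shape[1] == y.shape[0]
    x = x.copy()  # avoid overwriting the input
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]
    return x

X = np.random.random((32, 10))
y = np.random.random((10,))

result = naive_add_matrix_and_vector(X, y)
print("Result shape:", result.shape)
print("Matches NumPy broadcasting:", np.allclose(result, X + y))
```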

#### Tensor dot products
Another important tensor operation is the **dot product** (or matrix multiplication) between two tensors. It generalizes the dot product between vectors: a 2D matrix multiplication is simply the set of dot products between the rows of the first matrix and the columns of the second. In NumPy, we can perform dot products using the ``np.dot`` function or the ``@`` operator.

Remember, the dot product in general is not commutative, meaning that the order of the operands matters. For example, ``A @ B`` is not the same as ``B @ A``.

The dot product requires compatible shapes: the inner dimensions of the two tensors must match.
For example, when multiplying a tensor of shape (32, 10) by a tensor of shape (10, 64), the inner dimensions (10) match, and the resulting tensor has a shape of (32, 64).
This generalizes to higher-rank tensors as well. For example, when multiplying a tensor of shape (4, 3, 2) by a matrix of shape (2, 5), the shared axis of size 2 is contracted, and the resulting tensor has a shape of (4, 3, 5).
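A naive implementation helps to show what the dot product computes. The sketch below is written with explicit loops and illustrative function names; it is meant for understanding only and is checked against NumPy's `@` operator.

```python
import numpy as np

def naive_vector_dot(x, y):
    """Sum of the element-wise products of two vectors of equal length."""
    assert x.ndim == 1 and y.ndim == 1
    assert x.shape[0] == y.shape[0]
    z = 0.0
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

def naive_matrix_dot(x, y):
    """Entry (i, j) is the dot product of row i of x and column j of y."""
    assert x.ndim == 2 and y.ndim == 2
    assert x.shape[1] == y.shape[0]  # inner dimensions must match
    z = np.zeros((x.shape[0], y.shape[1]))
    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            z[i, j] = naive_vector_dot(x[i, :], y[:, j])
    return z

A = np.random.random((32, 10))
B = np.random.random((10, 64))
C = naive_matrix_dot(A, B)
print("Shape of A @ B:", C.shape)
print("Matches NumPy:", np.allclose(C, A @ B))

# Higher-rank case: the last axis of T is contracted with the first axis of M.
T = np.random.random((4, 3, 2))
M = np.random.random((2, 5))
print("Shape of T @ M:", (T @ M).shape)
```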

#### Tensor reshaping
Sometimes, we need to change the shape of a tensor without changing its data. This is called **reshaping**. In NumPy, we can reshape tensors using the ``reshape`` method or the ``np.reshape`` function.

In [None]:
x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])
print("x:\n", x)
print("Shape of x:", x.shape)

x = x.reshape((6, 1))
print("\nReshaped x:\n", x)
print("Shape of x:", x.shape)

x = x.reshape((2, 3))
print("\nReshaped x:\n", x)
print("Shape of x:", x.shape)


Transposing a tensor means permuting its axes. For a matrix, this swaps rows and columns. In NumPy, we can transpose tensors using the ``T`` attribute or the ``np.transpose`` function. By default, both reverse the order of all axes. Transposing is particularly useful when we need to change the order of dimensions for operations like dot products.

In [None]:
x = np.zeros((300, 10, 50, 20))
x = np.transpose(x)
print("Shape of x transpose:", x.shape)
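When only some axes should change place, `np.transpose` accepts an explicit axis order, and `np.swapaxes` exchanges two axes. A short sketch on the same shape as above:

```python
import numpy as np

x = np.zeros((300, 10, 50, 20))

reversed_axes = np.transpose(x)                   # default: reverse all axes
swapped_last_two = np.transpose(x, (0, 1, 3, 2))  # swap only the last two axes
also_swapped = np.swapaxes(x, 2, 3)               # same result via swapaxes

print("Default transpose:", reversed_axes.shape)
print("Explicit axes:", swapped_last_two.shape)
print("swapaxes:", also_swapped.shape)
```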

#### Reimplementation of the first neural network example
Let's reimplement the first neural network example replacing some utilities from Keras with more low-level operations. This will help us understand better what is happening under the hood when training a neural network.

##### Dense layer implementation
We start with the Dense layer implementation. A Dense layer is a fully connected layer where each neuron is connected to every neuron in the previous layer. The output of a Dense layer is computed as the dot product between the input tensor and the weight matrix, followed by the addition of a bias vector.

In [None]:
import keras
from keras import ops

class NaiveDense:
    def __init__(self, input_size, output_size, activation=None):
        self.activation = activation
        self.W = keras.Variable(
            shape=(input_size, output_size), initializer="uniform"
        )
        self.b = keras.Variable(shape=(output_size,), initializer="zeros")

    def __call__(self, inputs):
        x = ops.matmul(inputs, self.W)
        x = x + self.b
        if self.activation is not None:
            x = self.activation(x)
        return x

    @property
    def weights(self):
        return [self.W, self.b]

##### Sequential model implementation
Now, we can implement a simple Sequential model that stacks multiple Dense layers together. The Sequential model will store the layers and provide a method to perform the forward pass through the network.

In [None]:
class NaiveSequential:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, inputs):
        x = inputs
        for layer in self.layers:
            x = layer(x)
        return x

    @property
    def weights(self):
        weights = []
        for layer in self.layers:
            weights += layer.weights
        return weights

##### Building the model
Now, we can build the model by creating an instance of the Sequential class and adding Dense layers to it.

In [None]:
model = NaiveSequential(
    [
        NaiveDense(input_size=28 * 28, output_size=512, activation=ops.relu),
        NaiveDense(input_size=512, output_size=10, activation=ops.softmax),
    ]
)
assert len(model.weights) == 4

##### Drawing batches from the dataset
To train the model, we need to draw batches of data from the training dataset. We can implement a class that handles the batching process.

In [None]:
import math

class BatchGenerator:
    def __init__(self, images, labels, batch_size=128):
        assert len(images) == len(labels)
        self.index = 0
        self.images = images
        self.labels = labels
        self.batch_size = batch_size
        self.num_batches = math.ceil(len(images) / batch_size)

    def next(self):
        # No bounds check needed: NumPy slicing stops at the end of the array,
        # so the last batch is simply smaller
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index : self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels

##### Weight update
Finally, we can implement the weight update process. A naive implementation would simply go in the direction of the negative gradient scaled by the learning rate. This is called Stochastic Gradient Descent (SGD). We can implement it or we can use the optimizer provided by Keras.

In [None]:
from keras import optimizers
optimizer = optimizers.SGD(learning_rate=1e-3)

def update_weights(gradients, weights):
    optimizer.apply_gradients(zip(gradients, weights))
    
learning_rate = 1e-3
def update_weights_manual(gradients, weights):
    for g, w in zip(gradients, weights):
        w.assign(w - g * learning_rate)
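The update rule in `update_weights_manual` can be tried on a toy problem without any framework. The sketch below minimizes a one-dimensional function with plain gradient descent; the function, starting point, step size, and step count are arbitrary illustrative choices. It is not stochastic, since there are no mini-batches involved.

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent.
# The function, starting point, and step count are arbitrary illustrative choices.
def gradient(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w = 0.0
toy_learning_rate = 0.1
for step in range(100):
    # Same rule as above: move against the gradient, scaled by the learning rate.
    w = w - toy_learning_rate * gradient(w)

print("w after 100 steps:", w)
```

Each step shrinks the distance to the minimum at `w = 3` by a constant factor, so `w` ends up very close to 3.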


##### Gradient computation and training loop
The last step is to compute the gradients and implement the training loop. We can use TensorFlow's automatic differentiation capabilities to compute the gradients of the loss with respect to the model's weights. Then, we can update the weights using the optimizer.

In TensorFlow, we can use the `tf.GradientTape` context manager to record the operations for automatic differentiation. Inside the `GradientTape` context, we perform the forward pass and compute the loss. After exiting the context, we can use the `tape.gradient` method to compute the gradients of the loss with respect to the model's weights.

In [None]:
%%backend tensorflow
import tensorflow as tf

x = tf.zeros(shape=())
with tf.GradientTape() as tape:
    tape.watch(x) # GradientTape watches only Variable objects. This watch forces GT to track operations involving x
    y = 2 * x + 3
grad_of_y_wrt_x = tape.gradient(y, x)

print("Gradient:", grad_of_y_wrt_x)
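The gradient of `y = 2 * x + 3` with respect to `x` is 2 everywhere, which is what the tape should report. As an independent check that does not need TensorFlow, the slope can be estimated with a finite difference. The helper name `numerical_gradient` and the value of `eps` are illustrative choices.

```python
def f(x):
    return 2 * x + 3

def numerical_gradient(func, x, eps=1e-6):
    """Central finite-difference estimate of the derivative of func at x."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

print("Estimated gradient at x = 0:", numerical_gradient(f, 0.0))
```

Automatic differentiation computes the derivative exactly from the recorded operations, whereas this estimate is only approximate.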

The training step follows:

In [None]:
%%backend tensorflow
def one_training_step(model, images_batch, labels_batch):
    with tf.GradientTape() as tape:
        predictions = model(images_batch)
        loss = ops.sparse_categorical_crossentropy(labels_batch, predictions)
        average_loss = ops.mean(loss)
    gradients = tape.gradient(average_loss, model.weights)
    update_weights_manual(gradients, model.weights)
    return average_loss

##### Training loop implementation
Finally, we can implement a simple training loop to train the model on the MNIST dataset. In each epoch, the loop iterates over the batches of the training set and performs the following steps for each batch:
1. Perform a forward pass through the model to compute the predictions.
2. Compute the loss between the predictions and the true labels.
3. Compute the gradients of the loss with respect to the model's weights.
4. Update the model's weights by moving them against the gradients (here with the manual SGD update; the Keras optimizer would work as well).

In [None]:
%%backend tensorflow
def fit(model, images, labels, epochs, batch_size=128):
    for epoch_counter in range(epochs):
        print(f"Epoch {epoch_counter}")
        batch_generator = BatchGenerator(images, labels, batch_size)
        for batch_counter in range(batch_generator.num_batches):
            images_batch, labels_batch = batch_generator.next()
            loss = one_training_step(model, images_batch, labels_batch)
            if batch_counter % 100 == 0:
                print(f"loss at batch {batch_counter}: {loss:.2f}")

In [None]:
%%backend tensorflow
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

fit(model, train_images, train_labels, epochs=10, batch_size=128)

Finally, let's evaluate the trained model on the test dataset to see how well it performs on unseen data.

In [None]:
%%backend tensorflow
predictions = model(test_images)
predicted_labels = ops.argmax(predictions, axis=1)
matches = predicted_labels == test_labels
f"accuracy: {ops.mean(matches):.2f}"