**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART II** 

**Neural Networks and Deep Learning** 

---

**CHAPTER 10 - Introduction to Artificial Neural Networks with Keras** 

---

Artificial neural networks (ANNs) are Machine Learning models inspired by biological neurons found in our brains. ANNs are at the very core of Deep Learning—versatile, powerful, and scalable, making them ideal to tackle large and highly complex ML tasks such as classifying billions of images (Google Images), powering speech recognition (Apple's Siri), recommending videos to hundreds of millions of users daily (YouTube), or learning to beat the world champion at Go (DeepMind's AlphaGo). 

The first part of this chapter introduces artificial neural networks, starting with a quick tour of the very first ANN architectures and leading up to Multilayer Perceptrons (MLPs), which are heavily used today. The second part covers how to implement neural networks using the popular Keras API—a beautifully designed and simple high-level API for building, training, evaluating, and running neural networks. 

---

## **From Biological to Artificial Neurons** 

ANNs were first introduced in 1943 by neurophysiologist Warren McCulloch and mathematician Walter Pitts. In their landmark paper "A Logical Calculus of the Ideas Immanent in Nervous Activity," they presented a simplified computational model of how biological neurons might work together in animal brains to perform complex computations using propositional logic. This was the first artificial neural network architecture. 

### **Biological Neurons** 

A biological neuron is an unusual-looking cell mostly found in animal brains. It's composed of a cell body containing the nucleus, many branching extensions called dendrites, plus one very long extension called the axon. The axon splits off into many branches called telodendria, and at the tip of these branches are minuscule structures called synaptic terminals (or simply synapses), which are connected to the dendrites or cell bodies of other neurons. 

**Figure 10-1. Biological neuron**   
![Figure10-1.jpg](./10.Chapter-10/Figure10-1.jpg) 

Biological neurons produce short electrical impulses called action potentials (or signals) which travel along axons and make synapses release chemical signals called neurotransmitters. When a neuron receives a sufficient amount of these neurotransmitters within a few milliseconds, it fires its own electrical impulses. 

Individual biological neurons behave in a rather simple way, but they are organized in a vast network of billions, with each neuron typically connected to thousands of other neurons. Highly complex computations can be performed by a network of fairly simple neurons. The architecture of biological neural networks (BNNs) is still the subject of active research, but some parts of the brain have been mapped, and it seems neurons are often organized in consecutive layers, especially in the cerebral cortex. 

**Figure 10-2. Multiple layers in a biological neural network (human cortex)**   
![Figure10-2.jpg](./10.Chapter-10/Figure10-2.jpg) 

### **Logical Computations with Neurons** 

McCulloch and Pitts proposed a very simple model of the biological neuron: it has one or more binary (on/off) inputs and one binary output. The artificial neuron activates its output when more than a certain number of its inputs are active. They showed that even with such a simplified model it is possible to build a network of artificial neurons that computes any logical proposition you want. 

**Figure 10-3. ANNs performing simple logical computations**   
![Figure10-3.jpg](./10.Chapter-10/Figure10-3.jpg) 

- The first network (identity function): if neuron A is activated, then neuron C gets activated as well. 
- The second network performs logical AND: neuron C is activated only when both neurons A and B are activated. 
- The third network performs logical OR: neuron C gets activated if either neuron A or B is activated (or both). 
- The fourth network: neuron C is activated only if neuron A is active and neuron B is off (using inhibitory connections). 
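The four networks above can be sketched in plain Python. This is only an illustration, not code from the original paper: the helper `mp_neuron` is hypothetical, and it assumes that a neuron fires when at least two of its excitatory connections are active (so a "strong" connection is modeled by listing the same input twice) and that any active inhibitory input blocks it.

```python
def mp_neuron(excitatory, inhibitory=(), threshold=2):
    """McCulloch-Pitts style neuron with binary inputs and a binary output.

    Returns 1 when at least `threshold` excitatory inputs are active
    and no inhibitory input is active; returns 0 otherwise.
    """
    if any(inhibitory):
        return 0
    return int(sum(excitatory) >= threshold)


def identity(a):
    return mp_neuron([a, a])              # two connections from A


def logical_and(a, b):
    return mp_neuron([a, b])              # one connection from each input


def logical_or(a, b):
    return mp_neuron([a, a, b, b])        # two connections from each input


def a_and_not_b(a, b):
    return mp_neuron([a, a], inhibitory=[b])


# Print the truth table: A, B, AND, OR, A-and-not-B
for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```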

---

## **The Perceptron** 

The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU). The inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs (z = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ = x⊺w), then applies a step function to that sum and outputs the result: h_w(x) = step(z). 

**Figure 10-4. Threshold logic unit: an artificial neuron which computes a weighted sum of its inputs then applies a step function**   
![Figure10-4.jpg](./10.Chapter-10/Figure10-4.jpg) 

The most common step function used in Perceptrons is the Heaviside step function. Sometimes the sign function is used instead. 

**Equation 10-1. Common step functions used in Perceptrons (assuming threshold = 0)**   
![Eq10-1.jpg](./10.Chapter-10/Eq10-1.jpg) 

A single TLU can be used for simple linear binary classification. It computes a linear combination of the inputs, and if the result exceeds a threshold, it outputs the positive class. Otherwise it outputs the negative class (just like a Logistic Regression or linear SVM classifier). 
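A single TLU can be sketched in a few lines of NumPy. The weights and bias below are made-up values for illustration, and the optional bias argument `b` is an assumption added here so that the threshold can be moved away from 0.

```python
import numpy as np


def heaviside(z):
    """Heaviside step function: 0 if z < 0, 1 if z >= 0."""
    return np.where(z >= 0, 1, 0)


def sign(z):
    """Sign function: -1 if z < 0, 0 if z == 0, +1 if z > 0."""
    return np.sign(z)


def tlu(x, w, b=0.0, step=heaviside):
    """Threshold logic unit: weighted sum of the inputs, then a step function."""
    z = np.dot(x, w) + b
    return step(z)


w = np.array([1.0, -2.0])   # hypothetical connection weights
b = 0.5                     # hypothetical bias term

print(tlu(np.array([2.0, 1.0]), w, b))   # z = 2 - 2 + 0.5 = 0.5, positive class
print(tlu(np.array([0.0, 1.0]), w, b))   # z = -2 + 0.5 = -1.5, negative class
```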

A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs. When all neurons in a layer are connected to every neuron in the previous layer, the layer is called a fully connected layer, or a dense layer. 

**Figure 10-5. Architecture of a Perceptron with two input neurons, one bias neuron, and three output neurons**   
![Figure10-5.jpg](./10.Chapter-10/Figure10-5.jpg) 

Thanks to linear algebra, it's possible to efficiently compute the outputs of a layer of artificial neurons for several instances at once. 

**Equation 10-2. Computing the outputs of a fully connected layer**   
![Eq10-2.jpg](./10.Chapter-10/Eq10-2.jpg) 

Where:
- X represents the matrix of input features (one row per instance, one column per feature) 
- W contains all connection weights (one row per input neuron, one column per artificial neuron) 
- b is the bias vector (one bias term per artificial neuron) 
- φ is the activation function (step function for TLUs) 
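As a rough sketch of the computation described above, the outputs of a fully connected layer for a whole batch are φ(XW + b). The helper name `dense_layer` and the numbers are arbitrary choices for illustration.

```python
import numpy as np


def heaviside(z):
    return (z >= 0).astype(int)


def dense_layer(X, W, b, phi):
    """Compute phi(X W + b) for all instances in X at once."""
    return phi(X @ W + b)


X = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 0.5]])            # 3 instances, 2 features
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 1.0, -2.0]])      # 2 input neurons, 3 artificial neurons
b = np.array([-1.0, 0.0, 0.5])        # one bias term per artificial neuron

outputs = dense_layer(X, W, b, heaviside)
print(outputs.shape)  # (3, 3): one row per instance, one column per neuron
print(outputs)
```

The bias vector is broadcast across the rows, so each instance gets the same bias added to its weighted sums.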

The Perceptron training algorithm was proposed by Rosenblatt and was largely inspired by Hebb's rule. The Perceptron learning rule reinforces connections that help reduce the error. 

**Equation 10-3. Perceptron learning rule (weight update)**   
![Eq10-3.jpg](./10.Chapter-10/Eq10-3.jpg)
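The learning rule can be tried out on a toy problem. The sketch below assumes the usual form of the rule, w ← w + η (y − ŷ) x, applied one instance at a time, with the bias updated as if it were a weight on a constant input of 1. The function names and the logical AND dataset are illustrative choices.

```python
import numpy as np


def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Train a single TLU with the Perceptron learning rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if x_i @ w + b >= 0 else 0   # Heaviside step
            error = y_i - y_hat                    # 0 when the prediction is right
            w += eta * error * x_i                 # reinforce helpful connections
            b += eta * error
    return w, b


def predict(X, w, b):
    return (X @ w + b >= 0).astype(int)


# Logical AND is linearly separable, so the rule converges
X_and = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([0, 0, 0, 1])

w, b = train_perceptron(X_and, y_and)
print(predict(X_and, w, b))  # [0 0 0 1]
```

Weights only change when a prediction is wrong, which is why training stops making updates once every instance is classified correctly.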

Scikit-Learn provides a Perceptron class that implements a single-TLU network. It can be used on the iris dataset:

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # Iris setosa? (np.int was removed in NumPy 1.24)

per_clf = Perceptron()
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])

The Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with loss="perceptron", learning_rate="constant", eta0=1, and penalty=None. 

Contrary to Logistic Regression classifiers, Perceptrons do not output a class probability; rather, they make predictions based on a hard threshold. This is one reason to prefer Logistic Regression over Perceptrons. 

In their 1969 monograph, Marvin Minsky and Seymour Papert highlighted serious weaknesses of Perceptrons—in particular, the fact that they are incapable of solving some trivial problems (e.g., the Exclusive OR (XOR) classification problem). This is true of any linear classification model, but researchers had expected much more from Perceptrons. 

**Figure 10-6. XOR classification problem and an MLP that solves it**   
![Figure10-6.jpg](./10.Chapter-10/Figure10-6.jpg) 

It turns out that some limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a Multilayer Perceptron (MLP). An MLP can solve the XOR problem. 
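To make this concrete, here is a small MLP of step-function neurons that computes XOR. The weights are hand-picked for this sketch and are not necessarily the ones shown in Figure 10-6: one hidden neuron acts as AND, the other as OR, and the output fires when OR is active but AND is not.

```python
import numpy as np


def step(z):
    return (z >= 0).astype(int)


# Hand-picked parameters (an assumption made for illustration)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-1.5, -0.5])   # hidden neuron 1 = AND, hidden neuron 2 = OR
W_out = np.array([[-1.0],
                  [1.0]])
b_out = np.array([-0.5])            # output = OR and not AND


def xor_mlp(X):
    hidden = step(X @ W_hidden + b_hidden)
    return step(hidden @ W_out + b_out).ravel()


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_mlp(X))  # [0 1 1 0]
```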

---

## **The Multilayer Perceptron and Backpropagation** 

An MLP is composed of one (passthrough) input layer, one or more layers of TLUs called hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. 

**Figure 10-7. Architecture of a Multilayer Perceptron with two inputs, one hidden layer of four neurons, and three output neurons**   
![Figure10-7.jpg](./10.Chapter-10/Figure10-7.jpg) 

The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN). When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations. 

For many years researchers struggled to find a way to train MLPs. But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the **backpropagation training algorithm**, which is still used today. In short, it is Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution. 

Automatically computing gradients is called **automatic differentiation**, or autodiff. The one used by backpropagation is called reverse-mode autodiff. It is fast and precise, and is well suited when the function to differentiate has many variables and few outputs. 

**How Backpropagation Works:** 

- It handles one mini-batch at a time (e.g., containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an **epoch**. 

- Each mini-batch is passed to the network's input layer, which sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the **forward pass**: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass. 

- Next, the algorithm measures the network's output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error). 

- Then it computes how much each output connection contributed to the error. This is done analytically by applying the **chain rule** (perhaps the most fundamental rule in calculus), which makes this step fast and precise. 

- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm). 

- Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed. 
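The steps above can be sketched in NumPy for a tiny MLP with one sigmoid hidden layer and a linear output. This is a minimal illustration under stated assumptions (mean squared error loss, random toy data, made-up layer sizes and learning rate), not how Keras implements it internally.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, params):
    """Forward pass; the hidden output is kept for the backward pass."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)        # hidden layer output
    y_hat = a1 @ W2 + b2             # linear output layer
    return a1, y_hat


def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)


def backward(X, y, params, a1, y_hat):
    """Backward pass: apply the chain rule from the output to the input."""
    W1, b1, W2, b2 = params
    d_out = 2.0 * (y_hat - y) / y.size            # dLoss / dy_hat
    dW2 = a1.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * a1 * (1.0 - a1)   # through the sigmoid
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0)
    return [dW1, db1, dW2, db2]


rng = np.random.default_rng(42)
X = rng.normal(size=(32, 3))                      # one mini-batch of 32 instances
y = rng.normal(size=(32, 1))
params = [rng.normal(size=(3, 4)), np.zeros(4),   # random weights, zero biases
          rng.normal(size=(4, 1)), np.zeros(1)]

a1, y_hat = forward(X, params)
loss_before = mse(y_hat, y)
grads = backward(X, y, params, a1, y_hat)

eta = 0.01                                        # assumed learning rate
params = [p - eta * g for p, g in zip(params, grads)]   # Gradient Descent step
loss_after = mse(forward(X, params)[1], y)
print(loss_before, loss_after)
```

A real training loop would repeat these three stages (forward pass, backward pass, Gradient Descent step) for every mini-batch and every epoch.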

**Important:** All the hidden layers' connection weights must be initialized randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical. If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons. 
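The symmetry problem can be observed numerically. The sketch below is an illustration with assumptions: it uses a bias-free two-layer MLP with a sigmoid hidden layer, a mean squared error loss, and a constant initialization of 0.5 rather than zero (with all-zero weights the hidden-layer gradients are all zero, which is another way of staying identical). The helper `hidden_weight_gradient` is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def hidden_weight_gradient(X, y, W1, W2):
    """Gradient of the MSE with respect to W1 for a bias-free 2-layer MLP."""
    a1 = sigmoid(X @ W1)
    y_hat = a1 @ W2
    d_out = 2.0 * (y_hat - y) / y.size
    d_hidden = (d_out @ W2.T) * a1 * (1.0 - a1)
    return X.T @ d_hidden            # one column per hidden neuron


rng = np.random.default_rng(1)
X = rng.normal(size=(16, 3))
y = rng.normal(size=(16, 1))

# Constant initialization: every hidden neuron starts out identical
grad_const = hidden_weight_gradient(
    X, y, np.full((3, 4), 0.5), np.full((4, 1), 0.5))

# Random initialization breaks the symmetry
grad_rand = hidden_weight_gradient(
    X, y, rng.normal(size=(3, 4)), rng.normal(size=(4, 1)))

print(np.allclose(grad_const, grad_const[:, :1]))  # True: all columns are equal
print(np.allclose(grad_rand, grad_rand[:, :1]))    # False
```

Since identical neurons receive identical gradients, a Gradient Descent step leaves them identical, and the layer behaves as if it had a single neuron.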

For backpropagation to work properly, its authors made a key change to the MLP's architecture: they replaced the step function with the **logistic (sigmoid) function**, σ(z) = 1 / (1 + exp(–z)). This was essential because the step function contains only flat segments, so there is no gradient to work with, while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make progress at every step. 

The backpropagation algorithm works well with many other activation functions. Here are two other popular choices: 

**The hyperbolic tangent function:** tanh(z) = 2σ(2z) – 1 
Just like the logistic function, this activation function is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1). That range tends to make each layer's output more or less centered around 0 at the beginning of training, which often helps speed up convergence. 

**The Rectified Linear Unit function:** ReLU(z) = max(0, z) 
The ReLU function is continuous but unfortunately not differentiable at z = 0, and its derivative is 0 for z < 0. In practice, however, it works very well and has the advantage of being fast to compute, so it has become the default. Most importantly, the fact that it does not have a maximum output value helps reduce some issues during Gradient Descent. 

**Figure 10-8. Activation functions and their derivatives**   
![Figure10-8.jpg](./10.Chapter-10/Figure10-8.jpg) 
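The three activation functions and their derivatives are easy to write down in NumPy. This is a small sketch: the value returned for the ReLU derivative at z = 0 is an arbitrary convention chosen here, since the derivative is not defined at that point.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(0.0, z)


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # nonzero everywhere


def tanh_derivative(z):
    return 1.0 - np.tanh(z) ** 2


def relu_derivative(z):
    # Undefined at z = 0; this sketch arbitrarily returns 0 there
    return (z > 0).astype(float)


z = np.linspace(-3.0, 3.0, 7)

# Check the identity tanh(z) = 2 * sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
print(relu_derivative(z))
```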

**Why do we need activation functions?** If you chain several linear transformations, all you get is a linear transformation. For example, if f(x) = 2x + 3 and g(x) = 5x – 1, then chaining these two linear functions gives you another linear function: f(g(x)) = 2(5x – 1) + 3 = 10x + 1. So if you don't have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can't solve very complex problems with that. Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function. 
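The same collapse happens with whole layers. As a quick check (with arbitrary random weights and made-up layer sizes), two stacked layers without activation functions give exactly the same outputs as a single layer with weights W₁W₂ and bias b₁W₂ + b₂:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
X = rng.normal(size=(5, 3))

two_layers = (X @ W1 + b1) @ W2 + b2    # two stacked linear layers
W, b = W1 @ W2, b1 @ W2 + b2            # equivalent single linear layer
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))  # True
```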

---

## **Regression MLPs** 

MLPs can be used for regression tasks. If you want to predict a single value (e.g., the price of a house), then you just need a single output neuron: its output is the predicted value. For multivariate regression (to predict multiple values at once), you need one output neuron per output dimension. 

In general, when building an MLP for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values. If you want to guarantee that the output will always be positive, then you can use the ReLU activation function in the output layer. Alternatively, you can use the softplus activation function, which is a smooth variant of ReLU: softplus(z) = log(1 + exp(z)). Finally, if you want to guarantee that the predictions will fall within a given range of values, then you can use the logistic function or the hyperbolic tangent, and then scale the labels to the appropriate range: 0 to 1 for the logistic function and –1 to 1 for the hyperbolic tangent. 
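For reference, softplus can be written directly from its formula, although a naive `np.log(1 + np.exp(z))` overflows for large z. The sketch below uses NumPy's `logaddexp`, which computes log(exp(a) + exp(b)) in a numerically stable way; choosing it here is an implementation detail of this example, not something the chapter prescribes.

```python
import numpy as np


def softplus(z):
    # log(1 + exp(z)) written as log(exp(0) + exp(z)) to avoid overflow
    return np.logaddexp(0.0, z)


def relu(z):
    return np.maximum(0.0, z)


z = np.array([-5.0, 0.0, 5.0, 1000.0])
print(softplus(z))   # always positive, close to ReLU away from 0
print(relu(z))
```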

The loss function to use during training is typically the mean squared error, but if you have a lot of outliers in the training set, you may prefer to use the mean absolute error instead. Alternatively, you can use the **Huber loss**, which is a combination of both. The Huber loss is quadratic when the error is smaller than a threshold δ (typically 1) but linear when the error is larger than δ. The linear part makes it less sensitive to outliers than the mean squared error, and the quadratic part allows it to converge faster and be more precise than the mean absolute error. 
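A minimal NumPy sketch of the Huber loss follows. It assumes the common convention with a factor of ½ in the quadratic part, so that the two pieces join smoothly at |error| = δ; the sample values are made up to include one outlier.

```python
import numpy as np


def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                  # used when |error| <= delta
    linear = delta * (abs_error - 0.5 * delta)    # used when |error| > delta
    return np.mean(np.where(abs_error <= delta, quadratic, linear))


y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 10.0])               # the last prediction is an outlier

print(huber_loss(y_true, y_pred))                 # 3.375
print(np.mean((y_true - y_pred) ** 2))            # MSE: 33.75, dominated by the outlier
print(np.mean(np.abs(y_true - y_pred)))           # MAE
```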

**Table 10-1. Typical regression MLP architecture**   
![Table10-1.jpg](./10.Chapter-10/Table10-1.jpg) 

---

## **Classification MLPs** 

MLPs can also be used for classification tasks. For a **binary classification** problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. The estimated probability of the negative class is equal to one minus that number. 

MLPs can also easily handle **multilabel binary classification** tasks. For example, you could have an email classification system that predicts whether each incoming email is ham or spam, and simultaneously predicts whether it is an urgent or nonurgent email. In this case, you would need two output neurons, both using the logistic activation function: the first would output the probability that the email is spam, and the second would output the probability that it is urgent. Note that the output probabilities do not necessarily add up to 1. This lets the model output any combination of labels. 

If each instance can belong only to a single class, out of three or more possible classes (e.g., classes 0 through 9 for digit image classification), then you need to have one output neuron per class, and you should use the **softmax activation function** for the whole output layer. The softmax function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1 (which is required if the classes are exclusive). This is called **multiclass classification**. 
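The softmax function itself is short. The sketch below subtracts each row's maximum before exponentiating, a standard trick for numerical stability that does not change the result; the logits are arbitrary example values.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)


logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
proba = softmax(logits)

print(proba.round(3))
print(proba.sum(axis=1))   # each row adds up to 1
```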

**Figure 10-9. A modern MLP (including ReLU and softmax) for classification**   
![Figure10-9.jpg](./10.Chapter-10/Figure10-9.jpg) 

Regarding the loss function, since we are predicting probability distributions, the **cross-entropy loss** (also called the log loss) is generally a good choice. 

**Table 10-2. Typical classification MLP architecture**   
![Table10-2.jpg](./10.Chapter-10/Table10-2.jpg) 

---

## **Implementing MLPs with Keras** 

Keras is a high-level Deep Learning API that allows you to easily build, train, evaluate, and execute all sorts of neural networks. Its documentation is available at https://keras.io/. The reference implementation, also called Keras, was developed by François Chollet and released as an open source project in March 2015. To perform the heavy computations required by neural networks, this reference implementation relies on a computation backend (TensorFlow, Microsoft Cognitive Toolkit (CNTK), or Theano). 

TensorFlow now comes bundled with its own Keras implementation, **tf.keras**. It only supports TensorFlow as the backend, but it has the advantage of offering some very useful extra features. For this reason, we will use tf.keras in this book. 

### **Installing TensorFlow 2** 

To install TensorFlow 2, use pip:

In [None]:
# Install TensorFlow 2
# $ python3 -m pip install -U tensorflow

import tensorflow as tf
from tensorflow import keras

print(tf.__version__)
print(keras.__version__)

---

### **Building an Image Classifier Using the Sequential API** 

First, we need to load a dataset. We will tackle Fashion MNIST, which is a drop-in replacement for MNIST. It has the exact same format as MNIST (70,000 grayscale images of 28 × 28 pixels each, with 10 classes), but the images represent fashion items rather than handwritten digits, so each class is more diverse, and the problem turns out to be significantly more challenging than MNIST. 

#### **Using Keras to load the dataset**

In [None]:
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

When loading Fashion MNIST using Keras, every image is represented as a 28 × 28 array rather than a 1D array of size 784. Moreover, the pixel intensities are represented as integers (from 0 to 255) rather than floats. Let's take a look at the shape and data type:

In [None]:
>>> X_train_full.shape
(60000, 28, 28)
>>> X_train_full.dtype
dtype('uint8')

The dataset is already split into a training set and a test set, but there is no validation set, so we'll create one now. Additionally, since we are going to train the neural network using Gradient Descent, we must scale the input features. For simplicity, we'll scale the pixel intensities down to the 0–1 range by dividing them by 255.0:

In [None]:
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

For Fashion MNIST, we need the list of class names to know what we are dealing with:

In [None]:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

>>> class_names[y_train[0]]
'Coat'

**Figure 10-11. Samples from Fashion MNIST**   
![Figure10-11.jpg](./10.Chapter-10/Figure10-11.jpg) 

#### **Creating the model using the Sequential API** 

Now let's build the neural network! Here is a classification MLP with two hidden layers:

In [None]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

Let's go through this code line by line: 

- The first line creates a Sequential model. This is the simplest kind of Keras model for neural networks that are just composed of a single stack of layers connected sequentially. This is called the **Sequential API**. 

- Next, we build the first layer and add it to the model. It is a **Flatten layer** whose role is to convert each input image into a 1D array: if it receives input data X, it computes X.reshape(-1, 28*28). This layer does not have any parameters; it is just there to do some simple preprocessing. Since it is the first layer in the model, you should specify the input_shape, which doesn't include the batch size, only the shape of the instances. 

- Next we add a **Dense hidden layer** with 300 neurons. It will use the ReLU activation function. Each Dense layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs. It also manages a vector of bias terms (one per neuron). 

- Then we add a second Dense hidden layer with 100 neurons, also using the ReLU activation function. 

- Finally, we add a Dense **output layer** with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive). 
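What the Flatten layer does to a batch can be sketched in NumPy. The batch of three fake 28 × 28 "images" below is made up for illustration; the point is that the batch axis is kept and the remaining axes are merged into one.

```python
import numpy as np

# 3 fake images of 28 x 28 pixels (values are just a running counter)
batch = np.arange(3 * 28 * 28, dtype=np.float32).reshape(3, 28, 28)

flat = batch.reshape(len(batch), -1)   # keep the batch axis, flatten the rest
print(batch.shape)  # (3, 28, 28)
print(flat.shape)   # (3, 784)
```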

Instead of adding the layers one by one, you can pass a list of layers when creating the Sequential model:

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

The model's **summary()** method displays all the model's layers, including each layer's name, its output shape, and its number of parameters:

In [None]:
>>> model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten (Flatten)            (None, 784)               0
_________________________________________________________________
dense (Dense)                (None, 300)               235500
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1010
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________

Note that Dense layers often have a lot of parameters. For example, the first hidden layer has 784 × 300 connection weights, plus 300 bias terms, which adds up to 235,500 parameters! This gives the model quite a lot of flexibility to fit the training data, but it also means that the model runs the risk of overfitting. 
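The parameter counts reported by summary() can be reproduced by hand. The helper `dense_param_count` is a hypothetical name used only for this sketch.

```python
def dense_param_count(n_inputs, n_neurons):
    """Connection weights plus one bias term per neuron."""
    return n_inputs * n_neurons + n_neurons


layer_sizes = [28 * 28, 300, 100, 10]   # flattened input, two hidden layers, output
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [235500, 30100, 1010]
print(sum(counts))  # 266610
```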

You can easily get a model's list of layers and fetch a layer by its index, or you can fetch it by name:

In [None]:
>>> hidden1 = model.layers[1]
>>> hidden1.name
'dense'
>>> model.get_layer('dense') is hidden1
True

All the parameters of a layer can be accessed using its get_weights() and set_weights() methods. For a Dense layer, this includes both the connection weights and the bias terms:

In [None]:
>>> weights, biases = hidden1.get_weights()
>>> weights.shape
(784, 300)
>>> biases.shape
(300,)

Notice that the Dense layer initialized the connection weights randomly (which is needed to break symmetry), and the biases were initialized to zeros, which is fine. 

#### **Compiling the model** 

After a model is created, you must call its **compile()** method to specify the loss function and the optimizer to use. Optionally, you can specify a list of extra metrics to compute during training and evaluation:

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

We use the **"sparse_categorical_crossentropy"** loss because we have sparse labels (i.e., for each instance, there is just a target class index, from 0 to 9 in this case), and the classes are exclusive. If instead we had one target probability per class for each instance (such as one-hot vectors), then we would need to use the "categorical_crossentropy" loss instead. If we were doing binary classification, then we would use the "sigmoid" activation function in the output layer instead of "softmax", and we would use the "binary_crossentropy" loss. 
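The difference between the two label formats can be illustrated with NumPy. This sketch assumes strictly positive predicted probabilities (no clipping is done) and uses made-up values; the two functions are simplified stand-ins written for this example, not the Keras implementations.

```python
import numpy as np


def sparse_categorical_crossentropy(class_ids, proba):
    """Labels are class indices: pick the probability of the true class."""
    picked = proba[np.arange(len(class_ids)), class_ids]
    return -np.mean(np.log(picked))


def categorical_crossentropy(one_hot, proba):
    """Labels are one target probability per class (e.g., one-hot vectors)."""
    return -np.mean(np.sum(one_hot * np.log(proba), axis=1))


proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])     # predicted class probabilities
class_ids = np.array([0, 1])            # sparse labels
one_hot = np.eye(3)[class_ids]          # the same labels as one-hot vectors

print(np.isclose(sparse_categorical_crossentropy(class_ids, proba),
                 categorical_crossentropy(one_hot, proba)))  # True
```

Both formats lead to the same loss value; the sparse version simply avoids materializing the one-hot vectors.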

Regarding the optimizer, **"sgd"** means that we will train the model using simple Stochastic Gradient Descent. In other words, Keras will perform the backpropagation algorithm described earlier (i.e., reverse-mode autodiff plus Gradient Descent). 

Finally, since this is a classifier, it's useful to measure its **"accuracy"** during training and evaluation. 

#### **Training and evaluating the model** 

Now the model is ready to be trained. For this we simply need to call its **fit()** method:

In [None]:
>>> history = model.fit(X_train, y_train, epochs=30,
...                     validation_data=(X_valid, y_valid))
...
Train on 55000 samples, validate on 5000 samples
Epoch 1/30
55000/55000 [======] - 3s - loss: 0.7218 - accuracy: 0.7660 - val_loss: 0.4973 - val_accuracy: 0.8366
Epoch 2/30
55000/55000 [======] - 2s - loss: 0.4840 - accuracy: 0.8327 - val_loss: 0.4456 - val_accuracy: 0.8480
[...]
Epoch 30/30
55000/55000 [======] - 3s - loss: 0.2252 - accuracy: 0.9192 - val_loss: 0.2999 - val_accuracy: 0.8926

We pass it the input features (X_train) and the target classes (y_train), as well as the number of epochs to train. We also pass a validation set (this is optional). Keras will measure the loss and the extra metrics on this set at the end of each epoch, which is very useful to see how well the model really performs. 

The fit() method returns a **History object** containing the training parameters (history.params), the list of epochs (history.epoch), and most importantly a dictionary (history.history) containing the loss and extra metrics measured at the end of each epoch on the training set and on the validation set. If you use this dictionary to create a pandas DataFrame and call its plot() method, you get the learning curves: 

**Figure 10-12. Learning curves**   
![Figure10-12.jpg](./10.Chapter-10/Figure10-12.jpg)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1)
plt.show()

You can see that both the training accuracy and the validation accuracy steadily increase during training, while the training loss and the validation loss decrease. Good! The validation curves are close to the training curves, which means there is not too much overfitting. 

Once you are satisfied with your model's validation accuracy, you should evaluate it on the test set to estimate the generalization error before you deploy the model to production. You can easily do this using the **evaluate()** method:

In [None]:
>>> model.evaluate(X_test, y_test)
10000/10000 [==========] - 0s
[0.3340, 0.8851]

#### **Using the model to make predictions** 

Next, we can use the model's **predict()** method to make predictions on new instances:

In [None]:
>>> X_new = X_test[:3]
>>> y_proba = model.predict(X_new)
>>> y_proba.round(2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.03, 0.  , 0.01, 0.  , 0.96],
       [0.  , 0.  , 0.98, 0.  , 0.02, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ]],
      dtype=float32)

For each instance the model estimates one probability per class. If you only care about the class with the highest estimated probability, take the argmax of each row. (Older versions of tf.keras offered a **predict_classes()** method for this, but it was removed in TensorFlow 2.6.)

In [None]:
>>> y_pred = np.argmax(y_proba, axis=-1)
>>> y_pred
array([9, 2, 1])
>>> np.array(class_names)[y_pred]
array(['Ankle boot', 'Pullover', 'Trouser'], dtype='<U11')

**Figure 10-13. Correctly classified Fashion MNIST images**   
![Figure10-13.jpg](./10.Chapter-10/Figure10-13.jpg) 

Now you know how to use the Sequential API to build, train, evaluate, and use a classification MLP. 

---

### **Building a Regression MLP Using the Sequential API** 

Let's switch to the California housing problem and tackle it using a regression neural network. We will use Scikit-Learn's fetch_california_housing() function to load the data. After loading the data, we split it into a training set, a validation set, and a test set, and we scale all the features:

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

Using the Sequential API to build, train, evaluate, and use a regression MLP is quite similar to what we did for classification. The main differences are the fact that the output layer has a single neuron (since we only want to predict a single value) and uses no activation function, and the loss function is the mean squared error. Since the dataset is quite noisy, we just use a single hidden layer with fewer neurons than before, to avoid overfitting:

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])
model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)

---

### **Building Complex Models Using the Functional API** 

Although Sequential models are extremely common, it is sometimes useful to build neural networks with more complex topologies, or with multiple inputs or outputs. For this purpose, Keras offers the **Functional API**. 

One example of a nonsequential neural network is a **Wide & Deep neural network**. This neural network architecture was introduced in a 2016 paper by Heng-Tze Cheng et al. It connects all or part of the inputs directly to the output layer. This architecture makes it possible for the neural network to learn both deep patterns (using the deep path) and simple rules (through the short path). 

**Figure 10-14. Wide & Deep neural network**   
![Figure10-14.jpg](./10.Chapter-10/Figure10-14.jpg) 

Let's build such a neural network to tackle the California housing problem:

In [None]:
input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])

Let's go through each line: 

- First, we need to create an **Input** object. This is a specification of the kind of input the model will get, including its shape. 

- Next, we create a Dense layer with 30 neurons, using the ReLU activation function. Notice that as soon as it is created, we call it like a function, passing it the input. This is why this is called the **Functional API**. At this point we are only telling Keras how to connect the layers; no actual data is being processed yet. 

- We then create a second hidden layer, and again we use it as a function. 

- Next, we create a **Concatenate** layer, and once again we immediately use it like a function, to concatenate the input and the output of the second hidden layer. 

- Then we create the output layer, with a single neuron and no activation function, and we call it like a function. 

- Lastly, we create a Keras **Model**, specifying which inputs and outputs to use. 

Once you have built the Keras model, everything is exactly like earlier: you must compile the model, train it, evaluate it, and use it to make predictions. 
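
Before training, it can help to check the layer sizes of the Wide & Deep model by hand. The sketch below assumes the 8 input features of the California housing dataset and the layer sizes used in the code above. The helper `dense_params` is a hypothetical name introduced for illustration, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 8                              # California housing has 8 features
hidden1 = dense_params(n_features, 30)      # 8*30 + 30 = 270
hidden2 = dense_params(30, 30)              # 30*30 + 30 = 930
concat_width = n_features + 30              # wide path (8) + deep path (30) = 38
output = dense_params(concat_width, 1)      # 38*1 + 1 = 39
total = hidden1 + hidden2 + output
print(total)  # 1239
```

If the model is wired as described, `model.summary()` should report the same per-layer counts. Note that the Concatenate layer itself has no parameters; it only widens the input to the output layer.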

**Figure 10-15. Handling multiple inputs**   
![Figure10-15.jpg](./10.Chapter-10/Figure10-15.jpg) 

What if you want to send a subset of the features through the wide path and a different subset through the deep path? In this case, one solution is to use **multiple inputs**:

In [None]:
input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1, name="output")(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])

Now we can compile the model as usual, but when we call the fit() method, instead of passing a single input matrix X_train, we must pass a pair of matrices (X_train_A, X_train_B):

In [None]:
# Use learning_rate; the older lr alias is deprecated in recent Keras versions
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]

history = model.fit((X_train_A, X_train_B), y_train, epochs=20,
                    validation_data=((X_valid_A, X_valid_B), y_valid))
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((X_new_A, X_new_B))
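
The slices above send overlapping feature subsets down the two paths. A small sketch with plain Python lists makes the split explicit. The feature names are those of the scikit-learn California housing dataset, and the list slicing mirrors `X_train[:, :5]` and `X_train[:, 2:]`.

```python
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

wide_features = feature_names[:5]   # columns 0 to 4 -> input_A (shape [5])
deep_features = feature_names[2:]   # columns 2 to 7 -> input_B (shape [6])
shared = [name for name in wide_features if name in deep_features]

print(wide_features)
print(deep_features)
print(shared)   # columns 2 to 4 go through both paths
```

The two slice widths (5 and 6) are what the `shape=[5]` and `shape=[6]` arguments of the two Input objects must match.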

There are many use cases in which you may want to have **multiple outputs**. Adding extra outputs is quite easy: just connect them to the appropriate layers and add them to your model's list of outputs. 

**Figure 10-16. Handling multiple outputs (auxiliary output for regularization)**   
![Figure10-16.jpg](./10.Chapter-10/Figure10-16.jpg)

In [None]:
# Same as above, up to the main output layer
output = keras.layers.Dense(1, name="main_output")(concat)
aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)
model = keras.Model(inputs=[input_A, input_B], outputs=[output, aux_output])

Each output will need its own loss function. Therefore, when we compile the model, we should pass a list of losses. By default, Keras will compute all these losses and simply add them up to get the final loss used for training. We care much more about the main output than about the auxiliary output (as it is just used for regularization), so we want to give the main output's loss a much greater weight:

In [None]:
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")

history = model.fit(
    [X_train_A, X_train_B], [y_train, y_train], epochs=20,
    validation_data=([X_valid_A, X_valid_B], [y_valid, y_valid]))
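
The effect of `loss_weights` is simple arithmetic: the loss used for training is the weighted sum of the per-output losses. A sketch with made-up loss values (not results from this model); `combined_loss` is a hypothetical helper name:

```python
def combined_loss(losses, weights):
    """Weighted sum of per-output losses."""
    return sum(w * l for w, l in zip(weights, losses))

main_loss, aux_loss = 0.40, 0.60          # hypothetical per-output MSE values
total = combined_loss([main_loss, aux_loss], [0.9, 0.1])
print(round(total, 2))  # 0.9*0.40 + 0.1*0.60 = 0.42
```

With these weights, an improvement on the main output reduces the total loss nine times as much as the same improvement on the auxiliary output.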

---

### **Using the Subclassing API to Build Dynamic Models** 

Both the Sequential API and the Functional API are declarative: you start by declaring which layers you want to use and how they should be connected, and only then can you start feeding the model some data for training or inference. This has many advantages, but the flip side is that it's static. Some models involve loops, varying shapes, conditional branching, and other dynamic behaviors. For such cases, or simply if you prefer a more imperative programming style, the **Subclassing API** is for you. 

Simply subclass the Model class, create the layers you need in the constructor, and use them to perform the computations you want in the call() method:

In [None]:
class WideAndDeepModel(keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)
        self.aux_output = keras.layers.Dense(1)
        
    def call(self, inputs):
        input_A, input_B = inputs
        hidden1 = self.hidden1(input_B)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([input_A, hidden2])
        main_output = self.main_output(concat)
        aux_output = self.aux_output(hidden2)
        return main_output, aux_output

model = WideAndDeepModel()

The big difference is that you can do pretty much anything you want in the call() method: for loops, if statements, low-level TensorFlow operations—your imagination is the limit! This makes it a great API for researchers experimenting with new ideas. 

This extra flexibility does come at a cost: your model's architecture is hidden within the call() method, so Keras cannot easily inspect it; it cannot save or clone it; and when you call the summary() method, you only get a list of layers, without any information on how they are connected to each other. Moreover, Keras cannot check types and shapes ahead of time, and it is easier to make mistakes. So unless you really need that extra flexibility, you should probably stick to the Sequential API or the Functional API. 

---

### **Saving and Restoring a Model** 

When using the Sequential API or the Functional API, saving a trained Keras model is as simple as it gets:

In [None]:
model.save("my_keras_model.h5")

Keras will use the HDF5 format to save both the model's architecture (including every layer's hyperparameters) and the values of all the model parameters for every layer. It also saves the optimizer (including its hyperparameters and any state it may have). 

Loading the model is just as easy:

In [None]:
model = keras.models.load_model("my_keras_model.h5")

---

### **Using Callbacks** 

The fit() method accepts a callbacks argument that lets you specify a list of objects that Keras will call at the start and end of training, at the start and end of each epoch, and even before and after processing each batch. For example, the **ModelCheckpoint** callback saves checkpoints of your model at regular intervals during training, by default at the end of each epoch:

In [None]:
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5")
history = model.fit(X_train, y_train, epochs=10, callbacks=[checkpoint_cb])

Moreover, if you use a validation set during training, you can set save_best_only=True when creating the ModelCheckpoint. In this case, it will only save your model when its performance on the validation set is the best so far. This way, you do not need to worry about training for too long and overfitting the training set: after training, simply restore the last saved model, which will be the best model on the validation set. This is a simple way to implement early stopping. 
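
The bookkeeping behind save_best_only=True can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the callback's actual implementation. The validation losses are made up, and epochs are numbered from 0.

```python
def epochs_saved(val_losses):
    """Return the epochs at which a 'save best only' checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:          # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved

val_losses = [0.90, 0.70, 0.75, 0.65, 0.68, 0.66]   # hypothetical validation losses
print(epochs_saved(val_losses))  # [0, 1, 3]
```

The file on disk after training corresponds to the last entry in that list, which is the epoch with the lowest validation loss.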

Another way to implement early stopping is to use the **EarlyStopping** callback. It will interrupt training when it measures no progress on the validation set for a number of epochs (defined by the patience argument), and it will optionally roll back to the best model:

In [None]:
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5",
                                                save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb])
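
The role of the patience argument can be illustrated with a small simulation. This is a sketch of the general early-stopping logic under simplifying assumptions (any strict decrease counts as progress, epochs numbered from 0), not the Keras source code. `simulate_early_stopping` is a hypothetical helper.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0                                  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:              # no progress for `patience` epochs
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch    # ran out of epochs without stopping early

val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]   # hypothetical validation losses
print(simulate_early_stopping(val_losses, patience=3))  # (5, 2)
```

Here training stops at epoch 5, three epochs after the best one (epoch 2). Rolling back to the best weights corresponds to returning to the model as it was at epoch 2.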

---

### **Using TensorBoard for Visualization** 

TensorBoard is a great interactive visualization tool that you can use to view the learning curves during training, compare learning curves between multiple runs, visualize the computation graph, analyze training statistics, view images generated by your model, and more! 

To use it, you must modify your program so that it outputs the data you want to visualize to special binary log files called event files. The TensorBoard server will monitor the log directory, and it will automatically pick up the changes and update the visualizations. 

Let's define the root log directory we will use for our TensorBoard logs, plus a small function that will generate a subdirectory path based on the current date and time:

In [None]:
import os

root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()

Keras provides a nice **TensorBoard()** callback:

In [None]:
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])

After running the program a second time, you will end up with a directory structure where each run has its own subdirectory containing event files. 

Next, you need to start the TensorBoard server. You can do this by running a command in a terminal, or by using TensorBoard directly within Jupyter:

In [None]:
%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006

You should see TensorBoard's web interface. Click the SCALARS tab to view the learning curves. 

**Figure 10-17. Visualizing learning curves with TensorBoard**   
![Figure10-17.jpg](./10.Chapter-10/Figure10-17.jpg) 

---

### **Fine-Tuning Neural Network Hyperparameters** 

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network architecture, but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more. How do you know what combination of hyperparameters is the best for your task? 

One option is to simply try many combinations of hyperparameters and see which one works best on the validation set (or use K-fold cross-validation). You can use **GridSearchCV** or **RandomizedSearchCV** to explore the hyperparameter space. To do this, we need to wrap our Keras models in objects that mimic regular Scikit-Learn regressors. 

First, create a function that will build and compile a Keras model, given a set of hyperparameters:

In [None]:
def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    # Use learning_rate; the older lr alias is deprecated in recent Keras versions
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model

Next, create a **KerasRegressor** based on this build_model() function:

In [None]:
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

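The KerasRegressor object is a thin wrapper around the Keras model built by build_model(). Now we can use this object like a regular Scikit-Learn regressor. (The keras.wrappers.scikit_learn module was removed in TensorFlow 2.12; on recent versions, the separate SciKeras package provides an equivalent wrapper.) Let's use RandomizedSearchCV to explore the hyperparameter space: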

In [None]:
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": np.arange(1, 100),
    "learning_rate": reciprocal(3e-4, 3e-2),
}

rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
rnd_search_cv.fit(X_train, y_train, epochs=100,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])

When it's over, you can access the best parameters found, the best score, and the trained Keras model:

In [None]:
>>> rnd_search_cv.best_params_
{'learning_rate': 0.0033625641252688094, 'n_hidden': 2, 'n_neurons': 42}
>>> rnd_search_cv.best_score_
-0.3189529188278931
>>> model = rnd_search_cv.best_estimator_.model
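
The `reciprocal` distribution used for the learning rate draws values whose logarithm is uniformly distributed, so each order of magnitude in the range is equally likely. A minimal standard-library sketch of the same idea follows; the helper name `sample_log_uniform` is hypothetical, and the seed is arbitrary.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_log_uniform(3e-4, 3e-2, rng) for _ in range(1000)]

# 3e-3 is the geometric midpoint of the range, so about half the samples fall below it
below_midpoint = sum(s < 3e-3 for s in samples) / len(samples)
print(min(samples), max(samples), below_midpoint)
```

A plain uniform distribution over the same range would place roughly 90% of its samples above 3e-3, which is why a log-uniform distribution is the usual choice for scale-type hyperparameters such as the learning rate.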

**Number of Hidden Layers** 

For many problems, you can begin with a single hidden layer and get reasonable results. An MLP with just one hidden layer can theoretically model even the most complex functions, provided it has enough neurons. But for complex problems, deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data. 

In summary, for many problems you can start with just one or two hidden layers and the neural network will work just fine. For more complex problems, you can ramp up the number of hidden layers until you start overfitting the training set. 

**Number of Neurons per Hidden Layer** 

The number of neurons in the input and output layers is determined by the type of input and output your task requires. As for the hidden layers, it used to be common to size them to form a pyramid, with fewer and fewer neurons at each layer. However, this practice has been largely abandoned because it seems that using the same number of neurons in all hidden layers performs just as well in most cases, or even better. 

In practice, it's often simpler and more efficient to pick a model with more layers and neurons than you actually need, then use early stopping and other regularization techniques to prevent it from overfitting. This has been dubbed the **"stretch pants"** approach: instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size. 

**Learning Rate, Batch Size, and Other Hyperparameters** 

The **learning rate** is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate (i.e., the learning rate above which the training algorithm diverges). One way to find a good learning rate is to train the model for a few hundred iterations, starting with a very low learning rate (e.g., 10⁻⁵) and gradually increasing it up to a very large value (e.g., 10). If you plot the loss as a function of the learning rate (using a log scale), you should see it dropping at first. But after a while, the learning rate will be too large, so the loss will shoot back up.
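
One way to generate such a sweep is to multiply the learning rate by a constant factor at each iteration, which spaces the values evenly on a log scale. The sketch below only builds the schedule; it does not train a model. The function name and the choice of 500 iterations are assumptions made for illustration.

```python
def exponential_lr_schedule(lr_min, lr_max, n_iterations):
    """Learning rates growing by a constant factor from lr_min to lr_max."""
    factor = (lr_max / lr_min) ** (1 / (n_iterations - 1))
    return [lr_min * factor ** i for i in range(n_iterations)]

rates = exponential_lr_schedule(1e-5, 10, 500)
print(rates[0], rates[-1])   # first and last learning rates of the sweep
```

In a real sweep, each value in `rates` would be applied for one training iteration while the loss is recorded, giving the loss-versus-learning-rate curve described above.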