# Chapter 10: Introduction to Artificial Neural Networks with Keras

**Artificial neural networks (ANNs)**: A Machine Learning model inspired by the networks of biological neurons found in our brains.

## 10.1 From Biological to Artificial Neurons

### 10.1.1 Biological Neurons

Cell body -> Axon -> Telodendria -> Synaptic terminals (synapses) => Next cell's dendrites/body

> Note: See Figure 10-1 in book

Biological neurons produce short electrical impulses called action potentials (APs, or just signals) which travel along the axons and make the synapses release chemical signals called neurotransmitters.

When a neuron receives a sufficient amount of these neurotransmitters, it fires its own electrical impulses.

### 10.1.2 Logical Computations with Neurons

Artificial neuron - It has one or more binary (on/off) inputs and one binary output. The artificial neuron activates its output when more than a certain number of its inputs are active.

> Note: See Figure 10-3 in book.

> Note: 2 input signals are needed to activate neuron C.

1. Identity function: if neuron A is activated, neuron C gets activated as well (it receives two input signals from neuron A). But if neuron A is off, then neuron C is off as well.

2. Logical AND: neuron C is activated only when both neurons A and B are activated (a single input signal is not enough to activate neuron C).

3. Logical OR: neuron C gets activated if either neuron A or B is activated (or both).

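4. Logical NOT: assuming an input connection can inhibit the neuron's activity, neuron C is activated only if neuron A is active and neuron B is off. If neuron A is active all the time, this is a logical NOT: neuron C is active when neuron B is off, and vice versa.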
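
The four networks above can be sketched in plain Python. This is only an illustration: the wiring (how many connections each neuron sends to C) and the way inhibition is modeled are assumptions, since the figure itself is not reproduced in these notes. The helper names are hypothetical.

```python
def neuron(*inputs, threshold=2):
    """Binary neuron: outputs 1 when at least `threshold` inputs are active."""
    return int(sum(inputs) >= threshold)

def identity(a):
    # Assumed wiring: A sends two connections to C.
    return neuron(a, a)

def logical_and(a, b):
    # One connection each, so both A and B must be active.
    return neuron(a, b)

def logical_or(a, b):
    # Assumed wiring: A and B each send two connections to C.
    return neuron(a, a, b, b)

def a_and_not_b(a, b):
    # Assumption: an active inhibitory input from B blocks C outright.
    if b:
        return 0
    return neuron(a, a)

def logical_not(b):
    # With A always active, C is simply the negation of B.
    return a_and_not_b(1, b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```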

### 10.1.3 The Perceptron

**Perceptron** - Simplest ANN architecture, based on a *threshold logic unit (TLU)*, sometimes called a *linear threshold unit (LTU)*. The inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight.

The TLU computes a weighted sum of its inputs $(z=w_1x_1 + w_2x_2 + ... + w_nx_n = \mathbf{x}^T \mathbf{w})$, then applies a **step function** to that sum and outputs the result: $h_w(\mathbf{x}) = \text{step}(z), \text{where } z=\mathbf{x}^T \mathbf{w}$.

The most common step function used in Perceptrons is the **Heaviside step function**; sometimes the sign function is used instead.

A single TLU can be used for simple linear binary classification. It computes a linear combination of the inputs, and if the result exceeds a threshold, it outputs the positive class; otherwise it outputs the negative class.
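
A minimal NumPy sketch of a single TLU may make the computation concrete. The weights and inputs are made up for illustration, and the threshold is folded into the weights by adding a constant feature $x_0 = 1$ (an assumption; the formula above has no explicit bias term).

```python
import numpy as np

def heaviside(z):
    """Heaviside step: 0 where z < 0, 1 where z >= 0."""
    return np.where(z >= 0, 1, 0)

def tlu(X, w, step=heaviside):
    """Weighted sum of the inputs followed by a step function."""
    z = X @ w
    return step(z)

# Hypothetical weights: the first entry acts as a bias (threshold) term.
w = np.array([-1.0, 0.5, 0.5])

# Each row is one instance; the leading 1.0 is the constant bias feature.
X = np.array([
    [1.0, 0.0, 0.0],   # z = -1.0
    [1.0, 2.0, 1.0],   # z =  0.5
    [1.0, 3.0, 0.0],   # z =  0.5
])

outputs = tlu(X, w)
print(outputs)
```

Passing `step=np.sign` instead would give the sign-function variant, with outputs in {-1, 0, 1}.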

Hebb's rule (Hebbian learning) - "Cells that fire together, wire together." The connection weight between two neurons tends to increase when they fire simultaneously.

> Note: The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns. However, **if the training instances are linearly separable**, the algorithm converges to a solution (this is the *Perceptron convergence theorem*).
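
As a rough illustration of convergence on linearly separable data, here is a sketch of the classic Perceptron learning rule, $w \leftarrow w + \eta (y - \hat{y}) x$, which these notes do not spell out. The function name, the toy dataset (logical AND), and the learning rate are assumptions chosen for the example.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Train a single TLU with the Perceptron learning rule."""
    Xb = np.c_[np.ones(len(X)), X]      # prepend a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, target in zip(Xb, y):
            y_hat = int(xi @ w >= 0)    # Heaviside step on the weighted sum
            if y_hat != target:
                # Reinforce connections that would have reduced the error.
                w += eta * (target - y_hat) * xi
                errors += 1
        if errors == 0:                 # a full pass with no mistakes: converged
            break
    return w

# Logical AND is linearly separable, so the rule is expected to converge.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w = train_perceptron(X, y_and)
preds = (np.c_[np.ones(len(X)), X] @ w >= 0).astype(int)
print(preds)
```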

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
# np.int was removed in NumPy 1.24; the built-in int does the same job
y = (iris.target == 0).astype(int)  # Iris setosa?

per_clf = Perceptron()
per_clf.fit(X, y)

# Predict the class of a flower with petal length 2 and petal width 0.5
y_pred = per_clf.predict([[2, 0.5]])
y_pred
```

Output:

```
array([0])
```

> Note: Scikit-Learn's `Perceptron` is equivalent to using `SGDClassifier` with the following hyperparameters:
>
> - `loss="perceptron"`
> - `learning_rate="constant"`
> - `eta0=1` (the learning rate)
> - `penalty=None` (no regularization)
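
The equivalence can be checked with a short sketch. It reuses the Iris setup from the example above; fixing `random_state=42` on both estimators is an assumption made here so that both see the same shuffling.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]            # petal length, petal width
y = (iris.target == 0).astype(int)  # Iris setosa?

per_clf = Perceptron(random_state=42)
sgd_clf = SGDClassifier(
    loss="perceptron",
    learning_rate="constant",
    eta0=1,           # the learning rate
    penalty=None,     # no regularization
    random_state=42,
)

per_clf.fit(X, y)
sgd_clf.fit(X, y)

# Both models are expected to make the same predictions on the training set.
same_predictions = np.array_equal(per_clf.predict(X), sgd_clf.predict(X))
print(same_predictions)
```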

> Note: Perceptrons do not output a class probability; they make predictions based on a hard threshold.

### 10.1.4 The Multilayer Perceptron and Backpropagation

An MLP is composed of: 
- One (passthrough) **input layer**
- One or more layers of TLUs called **hidden layers**
- One final layer of TLUs called the **output layer**
- Every layer except the output layer includes a bias neuron and is fully connected to the next layer (bias neurons are usually left implicit, but they are always there)

> Note: The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a **feedforward neural network (FNN)**.

When an ANN contains a deep stack of hidden layers, it is called a **deep neural network (DNN)**. Deep Learning studies such networks, which may have tens or even hundreds of layers.

**Backpropagation** - It is Gradient Descent using an efficient technique for computing the gradients automatically. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.

In detail:

1. It handles one mini-batch at a time and goes through the full training set multiple times. Each pass is called an **epoch**.

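2. **Forward pass**: Each mini-batch is passed to the network's input layer, which sends it on to the first hidden layer. The algorithm computes the output of each layer and passes it to the next layer, until it reaches the output layer. All intermediate results are preserved, since they are needed for the backward pass.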

3. The algorithm measures the network's output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).

4. Computes how much each output connection contributed to the error. This is done analytically by applying the **chain rule** from calculus.

5. Measures how much of these error contributions (**error gradient**) came from each connection in the layer below, using the chain rule, working backward until reaching the input layer.

6. Performs a Gradient Descent step to tweak all connection weights in the network, using the error gradients it just computed.
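
The six steps can be sketched in NumPy for a tiny MLP with one hidden layer. Everything specific here is an assumption made for illustration: the toy dataset, the layer sizes, sigmoid activations, a mean squared error loss, and the learning rate. It is a sketch of the idea, not how Keras implements it.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass: returns the hidden activations and the network output."""
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)
    return a1, y_hat

def mse(y_hat, y):
    return 0.5 * np.mean((y_hat - y) ** 2)

# Hypothetical toy data: 200 points, labeled 1 when x1 + x2 > 0.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

# Random initialization of the connection weights.
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)

eta, batch_size = 0.5, 32
losses = [mse(forward(X, W1, b1, W2, b2)[1], y)]  # loss before training

for epoch in range(50):                            # step 1: several epochs
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):     # one mini-batch at a time
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        a1, y_hat = forward(Xb, W1, b1, W2, b2)    # step 2: forward pass
        # Steps 3-4: error gradient at the output layer (chain rule
        # through the MSE loss and the output sigmoid).
        delta2 = (y_hat - yb) * y_hat * (1 - y_hat) / len(Xb)
        # Step 5: propagate the error gradient back to the hidden layer.
        delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
        # Step 6: Gradient Descent step on every weight and bias.
        W2 -= eta * a1.T @ delta2
        b2 -= eta * delta2.sum(axis=0)
        W1 -= eta * Xb.T @ delta1
        b1 -= eta * delta1.sum(axis=0)
    losses.append(mse(forward(X, W1, b1, W2, b2)[1], y))

print(f"loss before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Note that `delta1` is computed before `W2` is updated, so the backward pass uses the same weights as the forward pass.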

> Note: It is important to **initialize all the hidden layers' connection weights randomly**, or else training will fail.

> For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and backpropagation will affect them in exactly the same way, so they will remain identical.

> If instead you randomly initialize the weights, you **break the symmetry** and allow backpropagation to train a diverse team of neurons.
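
The symmetry problem can be observed directly. The sketch below trains the same small network twice, once from all-zero weights and once from random weights, and then checks whether the hidden neurons (the columns of the first weight matrix) ended up identical. The network size, toy data, and helper names are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, W2, X, y, eta=0.5, steps=100):
    """Full-batch Gradient Descent on a one-hidden-layer MLP; returns W1."""
    W1, W2 = W1.copy(), W2.copy()
    b1, b2 = np.zeros(W1.shape[1]), np.zeros(1)
    for _ in range(steps):
        a1 = sigmoid(X @ W1 + b1)
        y_hat = sigmoid(a1 @ W2 + b2)
        delta2 = (y_hat - y) * y_hat * (1 - y_hat) / len(X)
        delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
        W2 -= eta * a1.T @ delta2
        b2 -= eta * delta2.sum(axis=0)
        W1 -= eta * X.T @ delta1
        b1 -= eta * delta1.sum(axis=0)
    return W1

def neurons_identical(W1):
    """True when every hidden neuron has the same incoming weights."""
    return np.allclose(W1, W1[:, :1])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, :1] * X[:, 1:] > 0).astype(float)  # hypothetical XOR-like labels

W1_zero = train(np.zeros((2, 3)), np.zeros((3, 1)), X, y)
W1_rand = train(rng.normal(size=(2, 3)), rng.normal(size=(3, 1)), X, y)

print("zero init, neurons identical:  ", neurons_identical(W1_zero))
print("random init, neurons identical:", neurons_identical(W1_rand))
```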

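The key change that made backpropagation work was to **replace the step function with the logistic (sigmoid) function**, $\sigma(z) = 1 / (1 + \exp(-z))$. The step function (Heaviside or sign) contains only flat segments, so there is no gradient to work with, and Gradient Descent cannot move on a flat surface. The logistic function has a well-defined nonzero derivative everywhere.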

Other popular choices are:
- Hyperbolic tangent function: $\tanh(z) = 2\sigma(2z) - 1$
    - S-shaped, continuous, differentiable
    - Output value ranges from -1 to 1
    - Tends to make each layer's output centered around 0 at beginning of training
    - Often helps speed up convergence

- Rectified Linear Unit function: $\text{ReLU}(z) = \max(0, z)$
    - Continuous
    - Not differentiable at $z=0$
    - Derivative is 0 for $z<0$
    - In practice, it works well and is fast to compute
    - Has become the default
    - Does not have a maximum output value, which helps reduce some issues during Gradient Descent
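
A short NumPy sketch of these activation functions and their derivatives follows. It also checks the identity $\tanh(z) = 2\sigma(2z) - 1$ numerically. Setting the ReLU derivative to 0 at $z = 0$ is a convention assumed here, since the function is not differentiable at that point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)               # strictly positive for every finite z

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def relu_grad(z):
    return (z > 0).astype(float)     # 0 for z < 0, 1 for z > 0; 0 at z = 0 by convention

z = np.linspace(-5, 5, 11)

# tanh expressed through the logistic function, as in the formula above.
tanh_from_sigmoid = 2 * sigmoid(2 * z) - 1
identity_holds = np.allclose(np.tanh(z), tanh_from_sigmoid)

print("tanh(z) == 2*sigmoid(2z) - 1:", identity_holds)
print("relu:", relu(np.array([-2.0, 0.0, 3.0])))
```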

> Note: If you chain several linear transformations, all you get is a linear transformation.

> If $f(x) = 2x + 3$ and $g(x)= 5x - 1$, then $f(g(x)) = 2(5x - 1) + 3 = 10x + 1$.

> So if you don't have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can't solve very complex problems with that.

> Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function.
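
The collapse of stacked linear layers can be verified numerically. In this sketch the layer shapes and random values are arbitrary assumptions; the point is that two linear layers with weights $W_1, W_2$ and biases $b_1, b_2$ compute the same thing as one layer with weights $W_1 W_2$ and bias $b_1 W_2 + b_2$.

```python
import numpy as np

# Scalar example from the note above.
def f(x):
    return 2 * x + 3

def g(x):
    return 5 * x - 1

print([f(g(x)) for x in range(3)], [10 * x + 1 for x in range(3)])

# Same idea with two linear "layers" and no activation in between.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
X = rng.normal(size=(5, 3))

two_layers = (X @ W1 + b1) @ W2 + b2

# Equivalent single layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b

collapsed = np.allclose(two_layers, one_layer)
print("two linear layers equal one:", collapsed)
```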

### 10.1.5 Regression MLPs

### 10.1.6 Classification MLPs

## 10.2 Implementing MLPs with Keras

### 10.2.1 Installing TensorFlow 2

### 10.2.2 Building an Image Classifier Using the Sequential API

### 10.2.3 Building a Regression MLP Using the Sequential API

### 10.2.4 Building Complex Models Using the Functional API

### 10.2.5 Using the Subclassing API to Build Dynamic Models

### 10.2.6 Saving and Restoring a Model

### 10.2.7 Using Callbacks

### 10.2.8 Using TensorBoard for Visualization

## 10.3 Fine-Tuning Neural Networks Hyperparameters

### 10.3.1 Number of Hidden Layers

### 10.3.2 Number of Neurons per Hidden Layer

### 10.3.3 Learning Rate, Batch Size, and Other Hyperparameters