# Chapter 10: Introduction to Artificial Neural Networks

Artificial Neural Networks (ANNs) are machine learning models inspired by the networks of biological neurons in our brains. They are useful for tackling complex machine learning tasks.

## From Biological to Artificial Neurons

The first artificial neural network model was introduced in 1943 by Warren McCulloch and Walter Pitts. In their paper ["A logical calculus of the ideas immanent in nervous activity"](https://link.springer.com/article/10.1007/BF02478259), they presented a computational model loosely based on how neurons in the brain might work. Interest in ANNs then went through long quiet periods. It was not until computers, particularly GPUs, became powerful enough to train large neural networks that ANNs became practical for complex problems.

### Biological Neurons

Biological neurons are cells with many branching extensions called <i>dendrites</i> and one very long extension called an <i>axon</i>. Near its end, the axon splits into many branches called <i>telodendria</i>, whose tips carry small structures called <i>synaptic terminals</i>. Through these terminals, a neuron passes short electrical impulses called <i>signals</i> to the neurons it is connected to. When a neuron receives enough signals within a few milliseconds, it fires its own signals.

### Logical Computations with Neurons

Warren McCulloch and Walter Pitts proposed a simple model of the biological neuron, which became known as an <i>artificial neuron</i>. Each artificial neuron has one or more binary inputs and one binary output, which is either active or inactive. The neuron activates its output when enough of its inputs are active. The figure below shows some example networks built from such neurons.

<img src="https://drive.google.com/uc?export=view&id=1hGflagjs6QtHxt87AMyEfPq0COIXlzhG" width="500px">
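As a rough sketch of how such neurons can compute logic, the snippet below models a neuron that activates when at least two of its input connections are active. The threshold of two, the helper name `mp_neuron`, and the particular wirings are assumptions made for illustration; they are not taken from the original paper and may differ from the networks in the figure.

```python
def mp_neuron(*connections, threshold=2):
    """Return True when at least `threshold` input connections are active.

    Hypothetical helper: the threshold of two is an illustrative assumption.
    """
    return sum(bool(c) for c in connections) >= threshold


def logical_and(a, b):
    # One connection from each input: both must be active to reach the threshold.
    return mp_neuron(a, b)


def logical_or(a, b):
    # Two connections from each input: either input alone reaches the threshold.
    return mp_neuron(a, a, b, b)


# Print the truth table for both networks.
for a in (False, True):
    for b in (False, True):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Doubling a connection is what lets a single active input trigger the neuron, which is why the OR network wires each input twice.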

### The Perceptron

<i>The Perceptron</i> is one of the simplest ANN architectures. It is based on a slightly different type of artificial neuron called a <i>linear threshold unit</i> (LTU). The inputs and output are numbers (instead of binary on/off values), and each input is associated with a weight. The LTU computes the weighted sum of its inputs,

$$ z = w_1\,x_1 + w_2\,x_2 + \cdots + w_n\,x_n = \mathbf{w}^{\,T} \cdot \mathbf{x}, $$

then applies a <i>step function</i> to that sum and outputs the result. The most common step function is the <i>Heaviside step function</i>, given by

$$ \text{heaviside}\,(z) = \left\{ \begin{matrix}
0 && \text{if}\; z < 0 \\
1 && \text{if}\; z \geq 0
\end{matrix} \right. $$

Sometimes the sign function is used, given by

$$ \text{sgn}\,(z) = \left\{ \begin{matrix}
-1 && \text{if}\; z < 0 \\
0 && \text{if}\; z = 0 \\
+1 && \text{if}\; z > 0
\end{matrix} \right. $$
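The weighted sum and the two step functions above can be sketched in a few lines of NumPy. The function names (`heaviside`, `sgn`, `ltu`) and the example weights and inputs are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def heaviside(z):
    # 0 for z < 0, 1 for z >= 0, matching the definition above.
    return np.where(z >= 0, 1, 0)


def sgn(z):
    # -1, 0, or +1 depending on the sign of z.
    return np.sign(z)


def ltu(x, w, step=heaviside):
    # Weighted sum z = w^T x, followed by the chosen step function.
    z = np.dot(w, x)
    return step(z)


# Hypothetical weights and inputs, chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([2.0, 1.0, -0.5])

# Here z = 0.5*2.0 + (-1.0)*1.0 + 2.0*(-0.5) = -1.0
print("Heaviside output:", ltu(x, w))
print("Sign output:", ltu(x, w, step=sgn))
```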

A single LTU can be used for linear binary classification, just like a Logistic Regression classifier. Training a single LTU means finding suitable values for its weight vector, $\mathbf{w}$. A Perceptron is a single layer of LTUs, each connected to all of the input nodes, with one input node per feature.

Perceptrons are trained using a variant of Hebb's rule (or <i>Hebbian learning</i>) that takes the prediction error into account: it strengthens connections that help produce the correct output and reduces the influence of connections that lead to an incorrect output. The weights are initialized to zero; then, for each training instance, they are updated using the rule

$$ w_{i,\,j}^{(\text{next step})} = w_{i,\,j} + \eta \left( y_j - \hat{y}_j \right) x_i $$

where $w_{i,\,j}$ is the weight of the connection from the i<sup>th</sup> input neuron to the j<sup>th</sup> output neuron, $x_i$ is the i<sup>th</sup> input value of the current training instance, $\hat{y}_j$ is the output of the j<sup>th</sup> output neuron for the current training instance, $y_j$ is the target output of the j<sup>th</sup> output neuron for the current training instance, and $\eta$ is the learning rate.
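A minimal sketch of this update rule for a single output neuron is shown below, trained on the logical AND function as a toy linearly separable dataset. Several choices here are assumptions made for illustration rather than part of the text: the constant input $x_0 = 1$ that lets one weight act as a bias term, the learning rate of 0.5, the epoch limit, and the helper names `train_perceptron` and `predict`.

```python
import numpy as np


def heaviside(z):
    # 0 for z < 0, 1 for z >= 0.
    return np.where(z >= 0, 1, 0)


def train_perceptron(X, y, eta=0.5, n_epochs=20):
    # Prepend a constant input x_0 = 1 so that w[0] acts as a bias term
    # (an assumption of this sketch; the text does not discuss a bias).
    Xb = np.c_[np.ones(len(X)), X]
    w = np.zeros(Xb.shape[1])  # weights start at zero
    for _ in range(n_epochs):
        n_errors = 0
        for x_i, target in zip(Xb, y):
            y_hat = heaviside(np.dot(w, x_i))
            # Update rule: w <- w + eta * (y - y_hat) * x
            w += eta * (target - y_hat) * x_i
            n_errors += int(target != y_hat)
        if n_errors == 0:  # a full pass with no mistakes: stop early
            break
    return w


def predict(X, w):
    Xb = np.c_[np.ones(len(X)), X]
    return heaviside(Xb @ w)


# Logical AND: a tiny, linearly separable toy dataset.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = train_perceptron(X, y)
print("weights:", w)
print("predictions:", predict(X, w))
```

Note that the weights change only when the prediction is wrong, because $y_j - \hat{y}_j$ is zero for a correctly classified instance.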

A single Perceptron has a linear decision boundary, so it cannot learn patterns that are not linearly separable. If the training data is linearly separable, however, Frank Rosenblatt showed that this algorithm converges to a solution. This is known as the <i>Perceptron convergence theorem</i>.

```python
# Scikit-Learn has its own Perceptron implementation.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0)    # is the flower an Iris setosa?

per_clf = Perceptron(max_iter=10, tol=1e-3)
per_clf.fit(X, y)

per_clf.score(X, y)       # accuracy on the training set
```

```
1.0
```

Training a single Perceptron is very similar to Stochastic Gradient Descent. In fact, Scikit-Learn's `Perceptron` is equivalent to an `SGDClassifier` with the `loss` hyperparameter set to `'perceptron'`, a constant learning rate (`learning_rate='constant'`, `eta0=1`), and no regularization (`penalty=None`).
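One way to check this relationship is to fit both estimators on the same data and compare the results. The sketch below reuses the Iris petal features; the `SGDClassifier` settings beyond `loss` are the ones Scikit-Learn's documentation lists for the equivalence, while the seed and stopping values in `common` are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length, petal width
y = (iris.target == 0)    # Iris setosa or not

# Same seed and stopping settings for both models (illustrative values).
common = dict(max_iter=1000, tol=1e-3, random_state=42)

per_clf = Perceptron(**common)
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1.0, penalty=None, **common)

per_clf.fit(X, y)
sgd_clf.fit(X, y)

# Compare the learned parameters and the predictions of the two models.
print("Perceptron:   ", per_clf.coef_, per_clf.intercept_)
print("SGDClassifier:", sgd_clf.coef_, sgd_clf.intercept_)
print("Same predictions:", (per_clf.predict(X) == sgd_clf.predict(X)).all())
```

Setting `random_state` explicitly on both matters here, since the two classes do not share the same default seed.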

Unlike Logistic Regression classifiers, Perceptrons do not output the probability that an instance belongs to a class; they only output the predicted class. Perceptrons also cannot correctly classify datasets that are not linearly separable. To handle such datasets, you need an ANN architecture called a <i>Multi-Layer Perceptron</i> (MLP).
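The XOR function is a classic example of a dataset that is not linearly separable. The sketch below fits Scikit-Learn's `Perceptron` on its four points; the hyperparameter values are arbitrary choices for illustration. Because no straight line separates the two classes, a linear classifier can get at most three of the four points right, however long it trains.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# XOR: the output is 1 only when exactly one input is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# tol=None disables early stopping, so all max_iter epochs are run.
per_clf = Perceptron(max_iter=1000, tol=None, random_state=42)
per_clf.fit(X, y)

# Training accuracy stays below 1.0 because the classes are not separable.
print("training accuracy:", per_clf.score(X, y))
```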