# Introduction to Artificial Neural Networks

"In this chapter, we will introduce artificial neural networks, starting with a quick tour of the very first ANN architectures. Then, we will present *Multi-Layer Perceptrons* and implement one using TensorFlow to tackle the MNIST digit classification problem."

In the biological world, "individual biological neurons seem to behave in a rather simple way, but they are organized in a vast network of billions of neurons, each neuron typically connected to thousands of other neurons. Highly complex computations can be performed by a vast network of fairly simple neurons, much like a complex anthill can emerge from the combined efforts of simple ants."

### Logical Computations with Neurons

"Warren McCulloch and Walter Pitts proposed a very simple model of the biological neuron, which later became known as an *artificial neuron*: it has one or more binary (on/off) inputs and one binary output. The artificial neuron simply activates its output when more than a certain number of its inputs are active. McCulloch and Pitts showed that even with such a simplified model, it is possible to build a network of artificial neurons that computes any logical proposition you want."

![ANNs performing simple logical computations](./one.jpg)
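As a rough illustration of the idea, the sketch below models such a neuron in plain Python and wires up a few logic gates. The function names, the "at least `threshold` active inputs" convention, and the inhibitory-input veto are assumptions made for this example, not details taken from the text.

```python
def mp_neuron(excitatory, threshold, inhibitory=()):
    """Binary neuron: fires (1) when at least `threshold` excitatory inputs
    are active and no inhibitory input is active (assumed convention)."""
    if any(inhibitory):
        return 0
    return int(sum(excitatory) >= threshold)

def logical_and(a, b):
    # Both inputs must be active
    return mp_neuron([a, b], threshold=2)

def logical_or(a, b):
    # One active input is enough
    return mp_neuron([a, b], threshold=1)

def a_and_not_b(a, b):
    # b is wired as an inhibitory input, so it vetoes the output when active
    return mp_neuron([a], threshold=1, inhibitory=[b])

# Print the truth table: a, b, AND, OR, A AND NOT B
for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```

Combining neurons like these, layer after layer, is what lets a network express more complicated logical propositions.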

### The Perceptron

"The perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a *threshold logic unit (TLU)*, or sometimes a linear threshold unit: the inputs and output are now numbers... and each input connection is associated with a weight. The TLU computes the weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{w}^T \mathbf{x}$), then applies a *step function* to that sum and outputs the result."

![Threshold Logic Unit](./two.jpg)

"A single TLU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class." For example, given a linearly separable dataset with two features (plus x0 = 1 for the bias term), the TLU computes the linear combination of the inputs and passes it through the step function.

"So how is a Perceptron trained? The Perceptron training algorithm proposed by Frank Rosenblatt was largely inspired by *Hebb's rule*," commonly paraphrased as "cells that fire together, wire together". The idea is that "the connection weight between two neurons is increased whenever they have the same output. Perceptrons are trained using a variant of this rule that takes into account the error made by the network; it does not reinforce connections that lead to the wrong output."

**"The perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction."**
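The TLU computation and the training procedure quoted above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it uses the standard perceptron update $w \leftarrow w + \eta\,(y - \hat{y})\,x$, and the learning rate `eta`, the epoch count, the helper names, and the toy OR dataset are all assumptions chosen for the example.

```python
import numpy as np

def step(z):
    """Heaviside step function: 1 where z >= 0, otherwise 0."""
    return np.where(z >= 0, 1, 0)

def predict(X, w):
    """TLU output for each row of X; w[0] is the bias weight (paired with x0 = 1)."""
    X_b = np.c_[np.ones(len(X)), X]   # prepend the constant input x0 = 1
    return step(X_b @ w)

def train_perceptron(X, y, eta=0.1, n_epochs=20):
    """Perceptron learning rule, applied one training instance at a time."""
    X_b = np.c_[np.ones(len(X)), X]
    w = np.zeros(X_b.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X_b, y):
            y_hat = int(x_i @ w >= 0)
            # No change when the prediction is right; otherwise nudge the
            # weights of the active inputs toward the correct output.
            w += eta * (y_i - y_hat) * x_i
    return w

# Toy linearly separable problem: logical OR of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])

w = train_perceptron(X, y)
print(predict(X, w))  # [0 1 1 1]
```

Because the data is linearly separable, the weights stop changing after a few passes; on data that is not linearly separable, this rule would keep oscillating.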

### Backpropagation

We can create a multi-layer perceptron by stacking layers of neurons and connecting every neuron in one layer to every neuron in the next. Multi-layer perceptrons, or ***MLPs***, can solve more complex problems than a single perceptron, including classifying data that is not linearly separable.
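XOR is the classic example of data that no single TLU can separate. The sketch below shows a two-layer network of step-function neurons that computes it. The weights are picked by hand for illustration (they are assumptions, not values from the text): one hidden neuron acts as AND, the other as OR, and the output fires when OR is active but AND is not.

```python
import numpy as np

def step(z):
    """Heaviside step function: 1 where z >= 0, otherwise 0."""
    return (z >= 0).astype(int)

# Hidden layer: column 0 behaves like AND, column 1 like OR
W_hidden = np.array([[1.0, 1.0],    # weights from x1 to (AND, OR)
                     [1.0, 1.0]])   # weights from x2 to (AND, OR)
b_hidden = np.array([-1.5, -0.5])

# Output neuron: active when OR fires and AND does not
W_out = np.array([-1.0, 1.0])
b_out = -0.5

def xor_mlp(X):
    hidden = step(X @ W_hidden + b_hidden)
    return step(hidden @ W_out + b_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_out = xor_mlp(X)
print(xor_out)  # [0 1 1 0]
```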

How do we train these networks, though? The trick is to use **backpropagation**. "Today we would describe it as Gradient Descent using reverse-mode autodiff."

The idea is to feed the network a training instance and compute the output of every neuron in each layer (the forward pass). We then measure the output error, e.g. abs(y - y_hat), and compute "how much each neuron in the last hidden layer contributed to each output neuron's error." That gives us an error contribution for every neuron in the last hidden layer. Next, we compute how much each neuron in the *second to last* hidden layer contributed to those error contributions.

The same step is applied recursively to each hidden layer, working backwards through the network until it reaches the first one. At that point we know how the error changes with respect to every connection weight, so a Gradient Descent step can move each weight a small amount in the direction that reduces the error.
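The forward pass, backward pass, and weight update described above can be sketched in NumPy for a network with a single hidden layer. Several choices here are assumptions made to keep the example small: sigmoid activations, a squared-error loss instead of the absolute error, four hidden neurons, the learning rate `eta`, and the XOR toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)        # hidden layer outputs
    y_hat = sigmoid(a1 @ W2 + b2)    # network outputs
    return a1, y_hat

def mse_loss(X, y, params):
    _, y_hat = forward(X, params)
    return 0.5 * np.sum((y_hat - y) ** 2) / len(X)

def backward(X, y, params):
    """Return gradients of mse_loss with respect to W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    a1, y_hat = forward(X, params)
    # Error signal at the output layer (chain rule through the sigmoid)
    delta2 = (y_hat - y) / len(X) * y_hat * (1 - y_hat)
    # Propagate it back: how much each hidden neuron contributed
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    return [X.T @ delta1, delta1.sum(axis=0), a1.T @ delta2, delta2.sum(axis=0)]

# XOR: a small dataset that is not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
params = [rng.normal(size=(2, 4)), np.zeros(4), rng.normal(size=(4, 1)), np.zeros(1)]

eta = 1.0  # learning rate (assumed value)
initial_loss = mse_loss(X, y, params)
for _ in range(5000):
    grads = backward(X, y, params)
    # Gradient Descent: step each parameter against its gradient
    params = [p - eta * g for p, g in zip(params, grads)]
final_loss = mse_loss(X, y, params)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In practice, frameworks derive the `backward` step automatically through reverse-mode autodiff, so it is rarely written by hand like this.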

There is an issue here, though. The step function used in the TLU is flat everywhere except at the threshold, so its derivative is zero (or undefined) and gives Gradient Descent nothing to work with. A solution is to use the sigmoid function instead; tanh and ReLU are common alternatives.
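To see the difference, the following sketch evaluates the derivative of each activation at a few points. The function names are illustrative, and treating the ReLU derivative as 0 at exactly z = 0 is a common convention assumed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 0 for z < 0 and 1 for z > 0; using 0 at z == 0 is a convention
    return (z > 0).astype(float)

def step_grad(z):
    # Flat on both sides of the threshold, so the gradient carries no signal
    return np.zeros_like(z)

z = np.array([-2.0, 0.0, 2.0])
for name, grad in [("step", step_grad), ("sigmoid", sigmoid_grad),
                   ("tanh", tanh_grad), ("relu", relu_grad)]:
    print(f"{name:8s}", np.round(grad(z), 3))
```

The step function's row is all zeros, while the other three give nonzero slopes that backpropagation can use.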

### Training with Sklearn

The easiest way to train an MLP is to use sklearn. "The `MLPClassifier` class makes it fairly easy to train a deep neural network with any number of hidden layers and a softmax output layer to output estimated class probabilities."

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

# Hold out a third of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
```

```python
# Two hidden layers (300 and 100 neurons) with ReLU activations
dnn_clf = MLPClassifier(hidden_layer_sizes=(300, 100), activation='relu', max_iter=20, batch_size=50)
dnn_clf.fit(X_train, y_train)
```

```
MLPClassifier(batch_size=50, hidden_layer_sizes=(300, 100), max_iter=20)
```

```python
from sklearn.metrics import accuracy_score

y_pred = dnn_clf.predict(X_test)
accuracy_score(y_test, y_pred)
```

```
1.0
```

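### Training a DNN Using Plain TensorFlow

"If you want more control over the architecture of the network, you may prefer to use TensorFlow's lower-level Python API." The example below takes a shortcut and builds the network with `tf.keras`, TensorFlow's high-level API, rather than with low-level operations.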

```python
import tensorflow as tf

# Load MNIST and scale pixel values from [0, 255] to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0
```

```python
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 image -> 784 inputs
    tf.keras.layers.Dense(300, activation='relu'),    # first hidden layer
    tf.keras.layers.Dense(100, activation='relu'),    # second hidden layer
    tf.keras.layers.Dense(10, activation='softmax')   # one output per digit class
])
```

```python
# Labels are integer class ids, so use the sparse variant of cross-entropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
```

```
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

<tensorflow.python.keras.callbacks.History at 0x258eecc7760>
```

```python
model.evaluate(x_test, y_test)  # returns [loss, accuracy]
```

```
[0.07510446012020111, 0.9794999957084656]
```

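```python
import numpy as np

# Predict the first four test images; argmax picks the most probable digit
sample = x_test[:4]
np.argmax(model.predict(sample), axis=1)
```

```
array([7, 2, 1, 0], dtype=int64)
```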