# Introduction to Artificial Neural Networks

## From Biological to Artificial Neurons

* ANNs were first introduced back in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts.

* We are now witnessing a wave of interest in ANNs.

* There is now a huge quantity of data available to train neural networks.

* The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time.

* The training algorithms have been improved.

* Some theoretical limitations of ANNs have turned out to be benign in practice.

* ANNs seem to have entered a virtuous circle of funding and progress.

### Biological Neurons

* Biological neurons receive short electrical impulses called signals from other neurons via synapses.

### Logical Computations with Neurons

* The artificial neuron simply activates its output when more than a certain number of its inputs are active.
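
As a rough illustration of the idea above, the sketch below models such a neuron in plain Python. The names `mp_neuron`, `logical_and` and `logical_or` are made up for this example, and the neuron is assumed to fire when at least `threshold` of its binary inputs are active.

```python
def mp_neuron(inputs, threshold):
    """Return 1 when at least `threshold` binary inputs are active, else 0."""
    return int(sum(inputs) >= threshold)

def logical_and(a, b):
    # Both inputs must be active for the neuron to fire.
    return mp_neuron([a, b], threshold=2)

def logical_or(a, b):
    # A single active input is enough for the neuron to fire.
    return mp_neuron([a, b], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Changing only the threshold turns the same unit into a different logical gate, which is what lets networks of such neurons compute logical propositions.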

### The Perceptron

* The perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt.

* Linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values), and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs, then applies a step function to that sum and outputs the result.

* A single LTU can be used for simple linear binary classification.

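* A Perceptron is trained with a variant of Hebb's rule (the connection weight between two neurons is increased whenever they have the same output) that also takes the prediction error into account: connections that lead to the wrong output are not reinforced.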
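
To make the LTU and its training rule more concrete, here is a minimal NumPy sketch. It is independent of the scikit-learn `Perceptron` class used in the next cell. The helper names `ltu_predict` and `train_perceptron`, the learning rate `eta`, and the toy AND dataset are all assumptions chosen for illustration.

```python
import numpy as np

def ltu_predict(X, w, b):
    """Weighted sum of the inputs followed by a Heaviside step function."""
    return (X @ w + b >= 0).astype(int)

def train_perceptron(X, y, eta=1.0, n_epochs=10):
    """Perceptron learning rule: w += eta * (target - prediction) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - ltu_predict(x_i, w, b)  # 0 when the prediction is right
            w += eta * error * x_i
            b += eta * error
    return w, b

# Toy, linearly separable data: the logical AND of two inputs.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1])

w, b = train_perceptron(X_toy, y_toy)
print("weights:", w, "bias:", b)
print("predictions:", ltu_predict(X_toy, w, b))
```

The weights change only when a prediction is wrong, so training settles once every instance of this separable toy set is classified correctly.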

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]
y = (iris.target == 0).astype(int)  # np.int was removed from recent NumPy releases

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])

In [2]:
y_pred

array([1])

### Multi-Layer Perceptron and Backpropagation

* An MLP is composed of one input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer.

* Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

* For each training instance, the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection, and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).

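* In order for this algorithm to work properly, the authors made a key change to the MLP's architecture: they replaced the step function with the logistic function. The step function contains only flat segments, so it provides no gradient to work with, whereas the logistic function has a well-defined nonzero derivative everywhere.

* Activation functions: step function, sigmoid (logistic) function, tanh function, ReLU function.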
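
The following NumPy sketch compares these activation functions by estimating their derivatives numerically. The helper names (`step`, `sigmoid`, `relu`, `derivative`) and the sample points are chosen for illustration only. It is meant to show why Gradient Descent cannot make progress through a step function.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def derivative(f, z, eps=1e-6):
    """Central-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Sample points chosen away from 0, where step and ReLU are not differentiable.
z = np.array([-2.0, -0.5, 0.5, 2.0])
for f in (step, sigmoid, np.tanh, relu):
    print(f.__name__, derivative(f, z))
```

At these points the estimated derivative of the step function is zero everywhere, while the sigmoid and tanh derivatives are strictly positive and the ReLU derivative is 1 for positive inputs.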

## Training an MLP with TensorFlow's High-Level API



In [4]:
# Common imports
import numpy as np
import os
import tensorflow as tf  # needed by reset_graph() below

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data")

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [5]:
X_train = mnist.train.images
X_test = mnist.test.images
y_train = mnist.train.labels.astype("int")
y_test = mnist.test.labels.astype("int")

In [7]:
import tensorflow as tf

config = tf.contrib.learn.RunConfig(tf_random_seed=42)

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10, 
                                        feature_columns=feature_columns)

dnn_clf.fit(x=X_train, y=y_train, batch_size=50, steps=40000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x125c26b38>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/var/folders/__/2zbjvd9s6w54xb911mgsqj6w0000gn/T/tmps9a3c8vc'}
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instruct

DNNClassifier(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x125c26d30>, 'hidden_units': [300, 100], 'feature_columns': (_RealValuedColumn(column_name='', dimension=784, default_value=None, dtype=tf.float32, normalizer=None),), 'optimizer': None, 'activation_fn': <function relu at 0x1167bdae8>, 'dropout': None, 'gradient_clip_norm': None, 'embedding_lr_multipliers': None, 'input_layer_min_slice_size': None})

In [9]:
from sklearn.metrics import accuracy_score
y_pred = list(dnn_clf.predict(X_test))
accuracy_score(y_test, y_pred)

Instructions for updating:
Please switch to predict_classes, or set `outputs` argument.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
INFO:tensorflow:Restoring parameters from /var/folders/__/2zbjvd9s6w54xb911mgsqj6w0000gn/T/tmps9a3c8vc/model.ckpt-40000


0.9829

## Training a DNN Using Plain TensorFlow

* The first step is the construction phase, building the TensorFlow graph. The second step is the execution phase, where we actually run the graph to train the model.

### Construction Phase

In [10]:
n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

In [16]:
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

# Helper that builds one fully connected layer; it is used below to create
# the two hidden layers and the output layer.

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="weight")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        z = tf.matmul(X, W) + b
        if activation == "relu":
            return tf.nn.relu(z)
        else:
            return z

1. First we create a name scope using the name of the layer: it will contain all the computation nodes for this neuron layer.

2. Next, we get the number of inputs by looking up the input matrix's shape and getting the size of the second dimension (the first dimension is for instances).

3. The next three lines create a W variable that will hold the weights matrix. It will be a 2D tensor containing all the connection weights between each input and each neuron. Its shape will be (n_inputs, n_neurons). It will be initialized randomly, using a truncated normal (Gaussian) distribution with a standard deviation of 2/sqrt(n_inputs). Using this specific standard deviation helps the algorithm converge much faster.

4. The next line creates a b variable for biases, initialized to 0 (no symmetry issue in this case), with one bias parameter per neuron.
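
The initialization steps above can be mimicked in plain NumPy, which makes the shapes and value ranges easy to inspect. This is only a sketch: `truncated_normal` and `init_layer` are hypothetical helpers, and the truncation assumes that values farther than two standard deviations from the mean are redrawn, as `tf.truncated_normal` does.

```python
import numpy as np

def truncated_normal(shape, stddev, rng):
    """Draw from N(0, stddev^2), redrawing values beyond 2 standard deviations."""
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(values) > 2 * stddev
    return values

def init_layer(n_inputs, n_neurons, rng):
    stddev = 2 / np.sqrt(n_inputs)           # same standard deviation as in the notes
    W = truncated_normal((n_inputs, n_neurons), stddev, rng)
    b = np.zeros(n_neurons)                  # one bias per neuron, all starting at 0
    return W, b

rng = np.random.default_rng(42)
W, b = init_layer(28 * 28, 300, rng)

# A random batch of 50 fake "images" stands in for real MNIST data here.
X_batch = rng.random((50, 28 * 28))
Z = X_batch @ W + b                          # weighted sums for the whole batch
print(W.shape, b.shape, Z.shape)
```

The random weights break the symmetry between neurons, which is why the biases can safely start at zero.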

In [17]:
# create the deep neural network.

with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
    hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")
    logits = neuron_layer(hidden2, n_outputs, "outputs")



In [19]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
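# The sparse_softmax_cross_entropy_with_logits() function is equivalent to applying the softmax
# activation function and then computing the cross entropy. It expects the raw logits
# (before softmax) and integer class labels rather than one-hot vectors.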
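
To see what this loss computes, here is a small NumPy sketch of softmax followed by cross entropy with integer labels. The function name `sparse_softmax_cross_entropy` and the demo arrays are illustrative assumptions, not TensorFlow's implementation.

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    """Per-instance cross entropy between softmax(logits) and integer labels."""
    # Subtracting the row maximum keeps the exponentials numerically stable.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the target class for each instance.
    return -log_probs[np.arange(len(labels)), labels]

logits_demo = np.array([[2.0, 1.0, 0.1],
                        [0.0, 0.0, 0.0]])
labels_demo = np.array([0, 2])

xentropy_demo = sparse_softmax_cross_entropy(labels_demo, logits_demo)
print("per-instance loss:", xentropy_demo)
print("mean loss:", xentropy_demo.mean())
```

For the second instance all logits are equal, so the softmax is uniform over three classes and the loss is log(3).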



In [20]:
# Define a GradientDescentOptimizer
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [21]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
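
The evaluation nodes above count a prediction as correct when the target class has the highest logit, then average the results. A rough NumPy equivalent is sketched below. `top1_accuracy` and the demo arrays are made up for illustration, and ties between logits are not handled the way `tf.nn.in_top_k` handles them.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of instances whose highest logit matches the label."""
    correct = np.argmax(logits, axis=1) == labels
    return correct.astype(np.float32).mean()

logits_demo = np.array([[0.1, 2.0, -1.0],
                        [1.5, 0.2, 0.3],
                        [0.0, 0.1, 0.9]])
labels_demo = np.array([1, 0, 1])

print(top1_accuracy(logits_demo, labels_demo))
```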
    

In [22]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

### Execution Phase


In [24]:
n_epochs = 40
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y:y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y:y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)
    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Train accuracy: 0.92 Test accuracy: 0.9111
1 Train accuracy: 0.94 Test accuracy: 0.9301
2 Train accuracy: 0.92 Test accuracy: 0.9388
3 Train accuracy: 0.98 Test accuracy: 0.9453
4 Train accuracy: 0.92 Test accuracy: 0.9499
5 Train accuracy: 0.92 Test accuracy: 0.9547
6 Train accuracy: 0.96 Test accuracy: 0.9578
7 Train accuracy: 1.0 Test accuracy: 0.9602
8 Train accuracy: 0.94 Test accuracy: 0.9615
9 Train accuracy: 1.0 Test accuracy: 0.9637
10 Train accuracy: 0.98 Test accuracy: 0.9658
11 Train accuracy: 0.98 Test accuracy: 0.9663
12 Train accuracy: 1.0 Test accuracy: 0.967
13 Train accuracy: 0.96 Test accuracy: 0.9692
14 Train accuracy: 0.96 Test accuracy: 0.9698
15 Train accuracy: 0.98 Test accuracy: 0.9709
16 Train accuracy: 0.98 Test accuracy: 0.9711
17 Train accuracy: 0.98 Test accuracy: 0.9719
18 Train accuracy: 1.0 Test accuracy: 0.9727
19 Train accuracy: 0.96 Test accuracy: 0.9732
20 Train accuracy: 0.96 Test accuracy: 0.9729
21 Train accuracy: 1.0 Test accuracy: 0.9735
22 T

### Using the Neural Network

* We can use the trained neural network to make predictions.

In [26]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_new_scaled = mnist.test.images[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

print("Predicted classes:", y_pred)
print("Actual classes: ", mnist.test.labels[:20])

INFO:tensorflow:Restoring parameters from ./my_model_final.ckpt
Predicted classes: [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]
Actual classes:  [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]


## Fine-Tuning Neural Network Hyperparameters

* The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak.

### Number of Hidden Layers

* For many problems, we can just begin with a single hidden layer and we will get reasonable results.

* Deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.

* In summary, for many problems we can start with just one or two hidden layers and it will work just fine.

* For more complex problems, we can gradually ramp up the number of hidden layers, until we start overfitting the training set.

### Number of Neurons per Hidden Layer

* The number of neurons in the input and output layers is determined by the type of input and output our task requires.

* As for the hidden layers, a common practice is to size them to form a funnel, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features.

* A simpler approach is to pick a model with more layers and neurons than we actually need, then use early stopping to prevent it from overfitting.
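
Early stopping itself can be sketched in a few lines of plain Python. The helper `early_stopping_epoch`, the `patience` parameter, and the validation losses below are all hypothetical, invented purely to illustrate the logic: training stops once the validation loss has not improved for `patience` consecutive epochs, and the model from the best epoch is kept.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: remember this epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1        # never triggered: ran to the end

# Hypothetical validation losses, invented for illustration only.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch)
```

In a real training loop the model would be checkpointed whenever a new best loss is seen, and restored from that checkpoint after stopping.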

### Activation Functions

* In most cases we can use the ReLU activation function in the hidden layers. It is a bit faster to compute than other activation functions, and Gradient Descent does not get stuck as much on plateaus.

* For the output layer, the softmax activation function is generally a good choice for classification tasks.