In this chapter, we will go over the first _artificial neural networks_ (ANNs), then present _Multi-Layer Perceptrons_ (MLPs) and implement one in TensorFlow to tackle the MNIST dataset.

## The Perceptron
The perceptron is based on an artificial neuron called a _linear threshold unit_ (LTU): it computes a weighted sum of its inputs and then applies a step function to that sum.
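
To make the LTU concrete, here is a minimal NumPy sketch. The function name `ltu` and the hand-picked weights are illustrative assumptions, not something from Scikit-Learn; the weights are chosen so that the unit computes a logical AND of two binary inputs.

```python
import numpy as np

def ltu(x, w, b):
    """Linear threshold unit: step(x . w + b), where step(z) is 1 if z >= 0, else 0."""
    z = np.dot(x, w) + b          # weighted sum of the inputs plus a bias term
    return (z >= 0).astype(int)   # Heaviside step function

# Hypothetical weights, picked by hand so the unit behaves like a logical AND.
w_and = np.array([1.0, 1.0])
b_and = -1.5

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_outputs = ltu(inputs, w_and, b_and)
print(and_outputs)  # [0 0 0 1]
```

Training a perceptron amounts to finding such weights automatically instead of setting them by hand.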

An example of using a Perceptron with Scikit-Learn:

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2,3)] # Petal length and width
y = (iris.target == 0).astype(int) # Iris Setosa? (np.int was removed in NumPy 1.24)
per_clf = Perceptron(random_state=42)
per_clf.fit(X,y)

y_pred = per_clf.predict([[2, 0.5]])

print("Prediction: ", y_pred)

Prediction:  [1]




This is equivalent to using an `SGDClassifier` with `loss="perceptron"`, `learning_rate="constant"`, `eta0=1`, and `penalty=None`. Recall that, unlike Logistic Regression classifiers, which output class probabilities, Perceptrons make predictions based on a hard threshold.

Because of the perceptron's limitations (for example, a single perceptron cannot solve the XOR problem), much of the research on _connectionism_ (the study of neural networks) was dropped. However, it turns out that some of these limitations can be eliminated by stacking multiple perceptrons. The resulting network is called a _Multi-Layer Perceptron_ (MLP).
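
As an illustration of what stacking buys you, the sketch below wires three LTUs together so that they compute XOR, which no single LTU can do. The weights are set by hand for the sake of the example (an assumption, nothing is trained here): one hidden unit acts as OR, the other as AND, and the output unit fires when OR is on and AND is off.

```python
import numpy as np

def step(z):
    """Heaviside step function used by each LTU."""
    return (z >= 0).astype(int)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 computes OR, column 1 computes AND (hand-picked weights).
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])

# Output unit: active when OR is 1 and AND is 0, which is exactly XOR.
w_out = np.array([1.0, -1.0])
b_out = -0.5

hidden = step(X_xor @ W_hidden + b_hidden)
xor_outputs = step(hidden @ w_out + b_out)
print(xor_outputs)  # [0 1 1 0]
```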

## Multi-Layer Perceptron and Backpropagation

An MLP is composed of one passthrough input layer, one or more layers of LTUs called _hidden layers_, and a final layer of LTUs called the _output layer_. When an _Artificial Neural Network_ (ANN) has two or more hidden layers, it is called a _deep neural network_ (DNN).

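For many years, researchers couldn't find a way to train MLPs, but then they formulated the _backpropagation_ training algorithm. Today we would describe it as Gradient Descent using reverse-mode autodiff: for each training instance, the algorithm makes a prediction (forward pass), measures the error, goes back through each layer to measure the error contribution of each connection (reverse pass), and then tweaks the connection weights to reduce the error.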
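
The forward and reverse passes can be sketched by hand for a tiny network. This is only an illustration under assumptions of my own: random made-up data, one tanh hidden layer, a linear output, and a mean squared error loss. The gradient obtained with the chain rule is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 made-up samples with 3 features
y = rng.normal(size=(5, 1))   # made-up regression targets
W1 = rng.normal(size=(3, 4))
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

def forward(W1, b1, W2, b2):
    """Forward pass: returns hidden activations, outputs and the MSE loss."""
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    loss = np.mean((out - y) ** 2)
    return h, out, loss

h, out, loss = forward(W1, b1, W2, b2)

# Reverse pass: apply the chain rule from the loss back to each parameter.
d_out = 2 * (out - y) / len(X)   # gradient of the loss w.r.t. the outputs
grad_W2 = h.T @ d_out
grad_b2 = d_out.sum(axis=0)
d_h = d_out @ W2.T               # error propagated back to the hidden layer
d_z1 = d_h * (1 - h ** 2)        # derivative of tanh is 1 - tanh^2
grad_W1 = X.T @ d_z1
grad_b1 = d_z1.sum(axis=0)

# Sanity check: compare one entry with a finite-difference estimate.
eps = 1e-6
W1_shifted = W1.copy()
W1_shifted[0, 0] += eps
numeric = (forward(W1_shifted, b1, W2, b2)[2] - loss) / eps
print(grad_W1[0, 0], numeric)
```

A Gradient Descent step would then subtract a small multiple of each gradient from the corresponding parameter.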

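For this algorithm to work, the step function in each LTU is replaced with a differentiable activation function, since the step function has only flat segments and so provides no gradient to work with. Aside from the well-known logistic function, two other popular choices are the hyperbolic tangent (tanh) function and the ReLU function.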
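
For reference, the three activation functions can be written in a few lines of NumPy. The helper names `logistic` and `relu` are my own; `np.tanh` is NumPy's built-in.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print("logistic:", logistic(z))
print("tanh:    ", np.tanh(z))   # output in (-1, 1)
print("relu:    ", relu(z))
```

Note that tanh is just a rescaled logistic function: `tanh(z) = 2 * logistic(2 * z) - 1`.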

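An MLP is often used for classification, with each output corresponding to a different binary class. When the classes are exclusive, the output layer typically uses a shared softmax function instead of individual activation functions. Because the signal flows in one direction only (from the inputs to the outputs), this architecture is called a _feedforward neural network_ (FNN).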
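
Here is a small NumPy sketch of the softmax function, assuming a batch of raw output scores (logits) with one row per instance. The example values are made up; subtracting the row maximum is a common trick for numerical stability and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn each row of scores into a probability distribution over classes."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits_example = np.array([[2.0, 1.0, 0.1],
                           [0.0, 0.0, 0.0]])
probas = softmax(logits_example)
print(probas)
print(probas.sum(axis=1))  # each row sums to 1
```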

## Training an MLP with TensorFlow's High-Level API

The easiest way to train an MLP with TensorFlow is to use its high-level API, TF.Learn. The `DNNClassifier` class makes it trivial to train a DNN with any number of hidden layers and a softmax output layer. For example, let's build a DNN for classification with two hidden layers (one with 300 neurons, the other with 100) and a softmax output layer with 10 neurons:

In [7]:
# Load MNIST dataset
import os
import pandas as pd

def load_MNIST_data(path='.'):    
    csv_path = os.path.join(path, "mnist_784.csv")
    return pd.read_csv(csv_path)

mnist_pd = load_MNIST_data()
mnist = mnist_pd.values

In [8]:
# Get the data and separate it!
X, y = mnist[:,0:784], mnist[:,784:]

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [10]:
import tensorflow as tf

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                        feature_columns=feature_columns)
dnn_clf.fit(x=X_train, y=y_train, batch_size=50, steps=40000)

Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
Instructions for updating:
Please replace uses of any Estimator from tf.contrib.learn with an Estimator from tf.estimator.*
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f947fe28d68>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/tmpqaz

INFO:tensorflow:global_step/sec: 213.491
INFO:tensorflow:loss = 0.3777306, step = 3201 (0.468 sec)
INFO:tensorflow:global_step/sec: 242.188
INFO:tensorflow:loss = 0.20093453, step = 3301 (0.414 sec)
INFO:tensorflow:global_step/sec: 214.31
INFO:tensorflow:loss = 0.12598404, step = 3401 (0.465 sec)
INFO:tensorflow:global_step/sec: 211.739
INFO:tensorflow:loss = 0.50144124, step = 3501 (0.472 sec)
INFO:tensorflow:global_step/sec: 231.036
INFO:tensorflow:loss = 0.54182696, step = 3601 (0.437 sec)
INFO:tensorflow:global_step/sec: 226.952
INFO:tensorflow:loss = 0.6525817, step = 3701 (0.438 sec)
INFO:tensorflow:global_step/sec: 216.168
INFO:tensorflow:loss = 0.3483754, step = 3801 (0.462 sec)
INFO:tensorflow:global_step/sec: 214.653
INFO:tensorflow:loss = 0.43588853, step = 3901 (0.466 sec)
INFO:tensorflow:global_step/sec: 224.031
INFO:tensorflow:loss = 0.15058702, step = 4001 (0.446 sec)
INFO:tensorflow:global_step/sec: 234.098
INFO:tensorflow:loss = 0.1812924, step = 4101 (0.429 sec)

INFO:tensorflow:loss = 0.18160439, step = 11401 (0.415 sec)
INFO:tensorflow:global_step/sec: 243.164
INFO:tensorflow:loss = 0.03378515, step = 11501 (0.411 sec)
INFO:tensorflow:global_step/sec: 241.99
INFO:tensorflow:loss = 0.26104668, step = 11601 (0.413 sec)
INFO:tensorflow:global_step/sec: 243.618
INFO:tensorflow:loss = 0.13653375, step = 11701 (0.412 sec)
INFO:tensorflow:global_step/sec: 194.186
INFO:tensorflow:loss = 0.10583084, step = 11801 (0.514 sec)
INFO:tensorflow:global_step/sec: 244.588
INFO:tensorflow:loss = 0.21622434, step = 11901 (0.409 sec)
INFO:tensorflow:global_step/sec: 237.411
INFO:tensorflow:loss = 0.1935614, step = 12001 (0.421 sec)
INFO:tensorflow:global_step/sec: 246.062
INFO:tensorflow:loss = 0.26123866, step = 12101 (0.409 sec)
INFO:tensorflow:global_step/sec: 217.353
INFO:tensorflow:loss = 0.26927236, step = 12201 (0.457 sec)
INFO:tensorflow:global_step/sec: 242.139
INFO:tensorflow:loss = 0.28119397, step = 12301 (0.414 sec)

INFO:tensorflow:global_step/sec: 239.677
INFO:tensorflow:loss = 0.25415817, step = 19601 (0.418 sec)
INFO:tensorflow:global_step/sec: 240.469
INFO:tensorflow:loss = 0.23967786, step = 19701 (0.415 sec)
INFO:tensorflow:global_step/sec: 246.438
INFO:tensorflow:loss = 0.06210378, step = 19801 (0.406 sec)
INFO:tensorflow:global_step/sec: 237.673
INFO:tensorflow:loss = 0.059527006, step = 19901 (0.420 sec)
INFO:tensorflow:global_step/sec: 233.649
INFO:tensorflow:loss = 0.15518238, step = 20001 (0.428 sec)
INFO:tensorflow:global_step/sec: 226.836
INFO:tensorflow:loss = 0.06099398, step = 20101 (0.440 sec)
INFO:tensorflow:global_step/sec: 235.1
INFO:tensorflow:loss = 0.15587409, step = 20201 (0.426 sec)
INFO:tensorflow:global_step/sec: 242.379
INFO:tensorflow:loss = 0.19869548, step = 20301 (0.413 sec)
INFO:tensorflow:global_step/sec: 234.102
INFO:tensorflow:loss = 0.047822677, step = 20401 (0.427 sec)
INFO:tensorflow:global_step/sec: 243.04
INFO:tensorflow:loss = 0.04219195, step = 20501 (0.

INFO:tensorflow:loss = 0.044660844, step = 27701 (0.423 sec)
INFO:tensorflow:global_step/sec: 230.724
INFO:tensorflow:loss = 0.02414952, step = 27801 (0.435 sec)
INFO:tensorflow:global_step/sec: 237.831
INFO:tensorflow:loss = 0.051004022, step = 27901 (0.419 sec)
INFO:tensorflow:global_step/sec: 230.385
INFO:tensorflow:loss = 0.08218043, step = 28001 (0.434 sec)
INFO:tensorflow:global_step/sec: 242.393
INFO:tensorflow:loss = 0.043567184, step = 28101 (0.413 sec)
INFO:tensorflow:global_step/sec: 233.277
INFO:tensorflow:loss = 0.020149836, step = 28201 (0.429 sec)
INFO:tensorflow:global_step/sec: 239.069
INFO:tensorflow:loss = 0.040030416, step = 28301 (0.418 sec)
INFO:tensorflow:global_step/sec: 225.235
INFO:tensorflow:loss = 0.2103261, step = 28401 (0.444 sec)
INFO:tensorflow:global_step/sec: 234.762
INFO:tensorflow:loss = 0.088026196, step = 28501 (0.426 sec)
INFO:tensorflow:global_step/sec: 230.041
INFO:tensorflow:loss = 0.020665465, step = 28601 (0.434 sec)

INFO:tensorflow:loss = 0.11828631, step = 35801 (0.554 sec)
INFO:tensorflow:global_step/sec: 216.234
INFO:tensorflow:loss = 0.08582124, step = 35901 (0.462 sec)
INFO:tensorflow:global_step/sec: 232.332
INFO:tensorflow:loss = 0.114619635, step = 36001 (0.430 sec)
INFO:tensorflow:global_step/sec: 219.133
INFO:tensorflow:loss = 0.015041902, step = 36101 (0.460 sec)
INFO:tensorflow:global_step/sec: 201
INFO:tensorflow:loss = 0.040207185, step = 36201 (0.495 sec)
INFO:tensorflow:global_step/sec: 226.414
INFO:tensorflow:loss = 0.10164454, step = 36301 (0.441 sec)
INFO:tensorflow:global_step/sec: 213.88
INFO:tensorflow:loss = 0.065593235, step = 36401 (0.467 sec)
INFO:tensorflow:global_step/sec: 205.757
INFO:tensorflow:loss = 0.08369605, step = 36501 (0.486 sec)
INFO:tensorflow:global_step/sec: 204.982
INFO:tensorflow:loss = 0.08001676, step = 36601 (0.487 sec)
INFO:tensorflow:global_step/sec: 211.545
INFO:tensorflow:loss = 0.05954801, step = 36701 (0.473 sec)

DNNClassifier(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x7f948069efd0>, 'hidden_units': [300, 100], 'feature_columns': (_RealValuedColumn(column_name='', dimension=784, default_value=None, dtype=tf.int64, normalizer=None),), 'optimizer': None, 'activation_fn': <function relu at 0x7f9491575510>, 'dropout': None, 'gradient_clip_norm': None, 'embedding_lr_multipliers': None, 'input_layer_min_slice_size': None})

Running this code on MNIST gives a model that achieves about 94.7% accuracy on the test set:

In [11]:
from sklearn.metrics import accuracy_score

y_pred = list(dnn_clf.predict(X_test))
print("Accuracy is: ", accuracy_score(y_test, y_pred))

Instructions for updating:
Please switch to predict_classes, or set `outputs` argument.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpqazjp8wa/model.ckpt-40000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Accuracy is:  0.9466


The TF.Learn library also provides some convenience functions to evaluate models:

In [12]:
print("Evaluation is: ", dnn_clf.evaluate(X_test, y_test))

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
INFO:tensorflow:Starting evaluation at 2019-06-06-15:09:42
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpqazjp8wa/model.ckpt-40000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-06-06-15:09:42
INFO:tensorflow:Saving dict for global step 40000: accuracy = 0.9466, global_s

## Training a DNN Using Plain TensorFlow

We will now use TensorFlow's lower-level API to train a DNN on the MNIST dataset. The first step is the construction phase, which builds the TensorFlow graph; the second is the execution phase, which runs the graph to train the model.

## Construction Phase

First, import the libraries and specify the number of inputs and outputs, and the number of neurons in each hidden layer.

In [1]:
import tensorflow as tf
import numpy as np

n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

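Next, use placeholder nodes to represent the training data and targets. The shape of `X` is only partially defined: it will be a 2D tensor with instances along the first dimension and features along the second, but the number of instances per batch is not yet known, hence `None`.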

In [2]:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

Now we create the neural network. The placeholder `X` will act as the input layer; during execution, it will be replaced with one training batch at a time. We need two hidden layers and an output layer, so let's write a `neuron_layer()` function that we'll use to create one layer at a time.

In [7]:
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        z = tf.matmul(X, W) + b
        if activation == "relu":
            return tf.nn.relu(z)
        else:
            return z

Now that we have the function to create a neuron layer, let's create a DNN!

In [8]:
with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
    hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")
    logits = neuron_layer(hidden2, n_outputs, "outputs")

Notice that we use another name scope for clarity! Also note that logits is the output of the network _before_ going through softmax activation.
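
To see how the shapes flow through the three layers, here is a plain NumPy analogue of the same forward pass. It is only a sketch: `numpy_layer` is a hypothetical helper, the input batch is random rather than real MNIST data, and the weights are drawn from an ordinary normal distribution instead of a truncated one.

```python
import numpy as np

rng = np.random.default_rng(42)

def numpy_layer(X, n_neurons, activation=None):
    """NumPy stand-in for neuron_layer(): z = X.W + b, optionally followed by ReLU."""
    n_inputs = X.shape[1]
    stddev = 2 / np.sqrt(n_inputs)  # same scale as in neuron_layer()
    W = rng.normal(scale=stddev, size=(n_inputs, n_neurons))
    b = np.zeros(n_neurons)
    z = X @ W + b
    return np.maximum(z, 0) if activation == "relu" else z

X_batch = rng.random((50, 28 * 28))   # a fake batch of 50 flattened images
hidden1 = numpy_layer(X_batch, 300, activation="relu")
hidden2 = numpy_layer(hidden1, 100, activation="relu")
logits = numpy_layer(hidden2, 10)
print(hidden1.shape, hidden2.shape, logits.shape)  # (50, 300) (50, 100) (50, 10)
```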

TensorFlow comes with ready-made functions for standard neural network layers, so there is often no need to define your own. For example, instead of using `neuron_layer()`, you can use `fully_connected()` to create a fully connected layer (it uses the ReLU activation function by default). Let's tweak the code as follows:

In [3]:
from tensorflow.contrib.layers import fully_connected

with tf.name_scope('dnn'):
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, scope="outputs",
                            activation_fn=None)

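Now that we have our neural network, we need to define the cost function to train it. We'll use cross entropy, computed with `sparse_softmax_cross_entropy_with_logits()`: it takes the logits (before the softmax activation) and expects labels in the form of integer class indices.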

In [4]:
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
                    labels = y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
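
To clarify what this loss computes, here is a NumPy sketch of the same quantity: the negative log of the softmax probability assigned to the correct class, one value per instance. The function name and the example values are illustrative assumptions; this is not TensorFlow's implementation.

```python
import numpy as np

def sparse_softmax_xentropy(labels, logits):
    """Cross entropy per instance, from integer labels and raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probas = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probas[np.arange(len(labels)), labels]

logits_demo = np.array([[5.0, 0.0, 0.0],    # confident in class 0
                        [0.0, 0.0, 0.0]])   # no preference between classes
labels_demo = np.array([0, 2])
xent = sparse_softmax_xentropy(labels_demo, logits_demo)
mean_loss = xent.mean()
print(xent, mean_loss)
```

The confident, correct prediction gets a loss near zero, while the uninformative one gets `log(3)`, the loss of a uniform guess over three classes.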

All we need now is a `GradientDescentOptimizer` that tweaks the model parameters to minimize the loss:

In [5]:
learning_rate = 0.01

with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
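
Conceptually, each run of the training op moves the parameters a small step against the gradient of the loss. The sketch below shows that update rule on a toy quadratic loss; the function `gradient_descent`, the toy loss and the chosen learning rate are assumptions for illustration and have nothing to do with TensorFlow internals.

```python
import numpy as np

def gradient_descent(grad_fn, theta, learning_rate=0.01, n_steps=1000):
    """Repeatedly apply theta <- theta - learning_rate * gradient."""
    for _ in range(n_steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Toy loss: ||theta - target||^2, whose gradient is 2 * (theta - target).
target = np.array([3.0, -1.0])
theta_final = gradient_descent(lambda t: 2 * (t - target),
                               theta=np.zeros(2),
                               learning_rate=0.1,
                               n_steps=100)
print(theta_final)
```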

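The last step of the construction phase is to specify how to evaluate the model. We'll use simple accuracy as our performance measure: `in_top_k()` checks, for each instance, whether the highest logit corresponds to the target class.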

In [6]:
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
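
With `k=1`, this evaluation boils down to comparing the index of the largest logit with the label. A rough NumPy equivalent is sketched below; `accuracy_top1` is a hypothetical helper, and it may differ from `in_top_k()` in how exact ties between logits are handled.

```python
import numpy as np

def accuracy_top1(logits, labels):
    """Fraction of instances whose highest-scoring class matches the label."""
    correct = np.argmax(logits, axis=1) == labels
    return correct.astype(np.float32).mean()

logits_demo = np.array([[0.1, 0.9],
                        [0.8, 0.2],
                        [0.3, 0.7]])
labels_demo = np.array([1, 0, 0])
acc = accuracy_top1(logits_demo, labels_demo)
print(acc)  # two out of three predictions are correct
```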

As always, create a node to initialize all variables and a `Saver` to save the trained model parameters to disk.

In [8]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

That concludes the construction phase. Now for the execution phase: first, load MNIST!

In [9]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


Now we define the number of epochs and the size of the mini-batches:

In [10]:
n_epochs = 400
batch_size = 50
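
The training loop further down relies on `mnist.train.next_batch()` to serve mini-batches. If you load the data some other way, a generator along these lines can play the same role. This is a sketch with a hypothetical helper name and a tiny made-up dataset; it shuffles once per call and drops the last incomplete batch.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs covering one epoch."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X) - batch_size + 1, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X_demo = np.arange(20).reshape(10, 2)   # 10 fake instances with 2 features
y_demo = np.arange(10)                  # fake labels
batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng))
print(len(batches))  # 10 // 4 = 2 full batches
```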

And now we train the model!

In [14]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                            y: mnist.test.labels})
        
        print(epoch,  'Train accuracy:', acc_train, 'Test accuracy:', acc_test)
        
    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Train accuracy: 0.92 Test accuracy: 0.9055
1 Train accuracy: 0.9 Test accuracy: 0.9223
2 Train accuracy: 1.0 Test accuracy: 0.9307
3 Train accuracy: 0.9 Test accuracy: 0.9366
4 Train accuracy: 0.98 Test accuracy: 0.9424
5 Train accuracy: 0.96 Test accuracy: 0.949
6 Train accuracy: 0.98 Test accuracy: 0.9505
7 Train accuracy: 0.9 Test accuracy: 0.9542
8 Train accuracy: 0.96 Test accuracy: 0.9559
9 Train accuracy: 0.98 Test accuracy: 0.959
10 Train accuracy: 0.92 Test accuracy: 0.9604
11 Train accuracy: 0.96 Test accuracy: 0.9613
12 Train accuracy: 1.0 Test accuracy: 0.9627
13 Train accuracy: 1.0 Test accuracy: 0.9645
14 Train accuracy: 0.96 Test accuracy: 0.9656
15 Train accuracy: 0.96 Test accuracy: 0.9674
16 Train accuracy: 0.92 Test accuracy: 0.9669
17 Train accuracy: 0.94 Test accuracy: 0.9684
18 Train accuracy: 1.0 Test accuracy: 0.9689
19 Train accuracy: 0.96 Test accuracy: 0.9693
20 Train accuracy: 1.0 Test accuracy: 0.9699
21 Train accuracy: 0.98 Test accuracy: 0.9706

181 Train accuracy: 1.0 Test accuracy: 0.9792
182 Train accuracy: 1.0 Test accuracy: 0.9795
183 Train accuracy: 1.0 Test accuracy: 0.9794
184 Train accuracy: 1.0 Test accuracy: 0.979
185 Train accuracy: 1.0 Test accuracy: 0.9792
186 Train accuracy: 1.0 Test accuracy: 0.9797
187 Train accuracy: 1.0 Test accuracy: 0.9797
188 Train accuracy: 1.0 Test accuracy: 0.979
189 Train accuracy: 1.0 Test accuracy: 0.979
190 Train accuracy: 1.0 Test accuracy: 0.9797
191 Train accuracy: 1.0 Test accuracy: 0.9797
192 Train accuracy: 1.0 Test accuracy: 0.9794
193 Train accuracy: 1.0 Test accuracy: 0.9794
194 Train accuracy: 1.0 Test accuracy: 0.9796
195 Train accuracy: 1.0 Test accuracy: 0.9797
196 Train accuracy: 1.0 Test accuracy: 0.9791
197 Train accuracy: 1.0 Test accuracy: 0.98
198 Train accuracy: 1.0 Test accuracy: 0.9792
199 Train accuracy: 1.0 Test accuracy: 0.9798
200 Train accuracy: 1.0 Test accuracy: 0.9794
201 Train accuracy: 1.0 Test accuracy: 0.9797
202 Train accuracy: 1.0 Test accuracy: 

360 Train accuracy: 1.0 Test accuracy: 0.9793
361 Train accuracy: 1.0 Test accuracy: 0.9789
362 Train accuracy: 1.0 Test accuracy: 0.9791
363 Train accuracy: 1.0 Test accuracy: 0.9793
364 Train accuracy: 1.0 Test accuracy: 0.9792
365 Train accuracy: 1.0 Test accuracy: 0.979
366 Train accuracy: 1.0 Test accuracy: 0.9791
367 Train accuracy: 1.0 Test accuracy: 0.979
368 Train accuracy: 1.0 Test accuracy: 0.979
369 Train accuracy: 1.0 Test accuracy: 0.9791
370 Train accuracy: 1.0 Test accuracy: 0.9793
371 Train accuracy: 1.0 Test accuracy: 0.9792
372 Train accuracy: 1.0 Test accuracy: 0.9792
373 Train accuracy: 1.0 Test accuracy: 0.9792
374 Train accuracy: 1.0 Test accuracy: 0.9794
375 Train accuracy: 1.0 Test accuracy: 0.979
376 Train accuracy: 1.0 Test accuracy: 0.9791
377 Train accuracy: 1.0 Test accuracy: 0.9793
378 Train accuracy: 1.0 Test accuracy: 0.9794
379 Train accuracy: 1.0 Test accuracy: 0.9791
380 Train accuracy: 1.0 Test accuracy: 0.9793
381 Train accuracy: 1.0 Test accuracy:

Once the network is trained, you can use it to make predictions. To do that, you reuse the same construction phase, but change the execution phase to the following:

In [None]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_new_scaled = [] # Placeholder: replace with new images (scaled from 0 to 1)
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

## Fine-Tuning Neural Network Hyperparameters

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any _network topology_ (how the neurons are interconnected), but even in a simple MLP you can change the number of layers, the number of neurons per layer, the activation function used in each layer, the weight initialization logic, and more.

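Of course, you can use a classic grid search with cross-validation to find good hyperparameters, but since there are many hyperparameters to tune and training a neural network on a large dataset takes a long time, you will only be able to explore a tiny part of the hyperparameter space in a reasonable amount of time. You are better off using a randomized search. Another option is to use a tool such as Oscar, which implements more complex algorithms to help you find a good set of hyperparameters quickly.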
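
A randomized search can be sketched in a few lines of plain Python. Everything here is illustrative: the search space, the helper names, and especially `fake_validation_score`, which stands in for actually training a network and measuring its validation accuracy.

```python
import random

def sample_hyperparameters(rng):
    """Draw one random configuration from a hypothetical search space."""
    return {
        "n_hidden_layers": rng.randint(1, 5),
        "n_neurons": rng.choice([50, 100, 200, 300]),
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform between 1e-4 and 1e-1
        "activation": rng.choice(["relu", "tanh"]),
    }

def random_search(evaluate, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_hyperparameters(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def fake_validation_score(params):
    """Placeholder for 'train the model, return its validation score'."""
    return -abs(params["n_neurons"] - 200) - abs(params["n_hidden_layers"] - 2)

best_params, best_score = random_search(fake_validation_score, n_trials=20)
print(best_params, best_score)
```

Sampling the learning rate on a log scale is a common choice, since useful values typically span several orders of magnitude.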

It helps to have an idea of what values are reasonable for each hyperparameter, so let's go ahead and explore a few.

## Number of Hidden Layers

For many problems, you can start with just one hidden layer and get decent results. For a long time, this convinced researchers that there was no need to investigate any deeper neural networks. But they overlooked the fact that deep networks have a much higher _parameter efficiency_ than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.

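Real-world data is often structured hierarchically, and DNNs take advantage of this: lower hidden layers model low-level structures, while higher layers combine them into more abstract ones. Not only does this hierarchical architecture help DNNs converge faster to a good solution, it also improves their ability to generalize to new datasets. For example, if you have already trained a model to recognize faces and now want to train a new network to recognize hairstyles, you can kickstart the training by reusing the lower layers of the first network: instead of randomly initializing the weights and biases of the first few layers of the new network, you initialize them with the values learned by the lower layers of the first network.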

In short, for many problems you can start with one or two hidden layers. For more complex problems, gradually ramp up the number of hidden layers until you start overfitting the training set. Very complex tasks, such as large image classification, can require dozens of layers.

## Number of Neurons per Hidden Layer

Obviously, the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, MNIST requires 28 × 28 = 784 input neurons and 10 output neurons. For the hidden layers, a common practice used to be to size them to form a funnel, with fewer and fewer neurons at each layer. This practice is not as common now, and you may simply use the same size for all hidden layers: that gives you just one hyperparameter to tune instead of one per layer. Just like for the number of layers, you can increase the number of neurons gradually until the network starts to overfit.
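
When comparing layer sizes, it helps to know how many parameters each choice implies. The helper below (a hypothetical name) counts the weights and biases of a fully connected network; the second architecture, with two equally sized hidden layers of 200 neurons, is just an example for comparison.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

funnel = count_parameters([784, 300, 100, 10])    # the network used in this chapter
uniform = count_parameters([784, 200, 200, 10])   # same size for both hidden layers
print(funnel, uniform)  # 266610 199210
```

In both cases, most parameters sit in the first hidden layer, because of the 784 inputs.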

A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping, or other regularization techniques such as _dropout_, to prevent it from overfitting.
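
The logic of early stopping can be sketched independently of any framework: track the best validation loss seen so far and stop once it has not improved for a given number of epochs. The function name, the `patience` parameter and the loss values below are assumptions made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: you would save the model here
        elif epoch - best_epoch >= patience:
            break                                 # no progress for `patience` epochs
    return best_epoch, best_loss

# Made-up validation losses, one per epoch.
val_losses = [0.9, 0.6, 0.5, 0.45, 0.47, 0.46, 0.50, 0.40]
best_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, best_loss)  # 3 0.45
```

In this made-up run, training stops at epoch 6, so the lower loss at epoch 7 is never seen: that is the trade-off controlled by `patience`.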

## Activation Functions

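In most cases you can use the ReLU activation function in the hidden layers. It is a bit faster to compute than other activation functions, and Gradient Descent does not get stuck as much on plateaus, because ReLU does not saturate for large input values.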
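
The difference shows up in the derivatives. In this small sketch (helper names are my own), the gradient of the logistic function vanishes for large inputs, while the gradient of ReLU stays at 1 for any positive input.

```python
import numpy as np

def logistic_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([0.5, 5.0, 20.0])
print("logistic gradient:", logistic_grad(z))  # shrinks toward 0 as z grows
print("ReLU gradient:    ", relu_grad(z))      # stays at 1
```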

For the output layer, softmax is generally a good choice for classification tasks when the classes are mutually exclusive. For regression tasks, you can simply use no activation function at all.

