# Tuning Neural Networks 

## Number of Hidden Layers

For many problems, you can just begin with a single hidden layer and you will get
reasonable results. It has actually been shown that an MLP with just one hidden layer
can model even the most complex functions provided it has enough neurons.


However, deep networks are usually much more parameter-efficient than shallow ones:
they can model complex functions with far fewer neurons. To see why, suppose you are
asked to draw a forest using some drawing software, but you are forbidden to use
copy/paste. You would have to draw each tree individually, branch by branch, leaf by
leaf. If you could instead draw one leaf, copy/paste it to draw a branch, then
copy/paste that branch to create a tree, and finally copy/paste this tree to make a
forest, you would be finished in no time.

Real-world data is often structured in such a hierarchical way, and deep neural
networks (DNNs) automatically take advantage of this fact: lower hidden layers model
low-level structures (e.g., line segments of various shapes and orientations),
intermediate hidden layers combine these low-level structures to model
intermediate-level structures (e.g., squares, circles), and the highest hidden layers
and the output layer combine these intermediate structures to model high-level
structures (e.g., faces).

Not only does this hierarchical architecture help DNNs converge faster to a good
solution, it also improves their ability to generalize to new datasets. For example,
if you have already trained a model to recognize faces in pictures, and you now want
to train a new neural network to recognize hairstyles, then you can kickstart
training by reusing the lower layers of the first network. Instead of randomly
initializing the weights and biases of the first few layers of the new neural
network, you can initialize them to the values of the weights and biases of the lower
layers of the first network. This way the network will not have to learn from scratch
all the low-level structures that occur in most pictures; it will only have to learn
the higher-level structures (e.g., hairstyles). This is called transfer learning.
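
The weight-reuse step can be sketched without any framework. The snippet below is a minimal illustration, assuming a network is just a list of layers and each layer is a dict holding `weights` and `biases`. The helpers `init_layer`, `build_network`, and `reuse_lower_layers`, as well as the layer sizes, are hypothetical and chosen only for this example.

```python
import copy
import random


def init_layer(n_in, n_out, rng):
    """Randomly initialize one dense layer (hypothetical representation)."""
    return {
        "weights": [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)],
        "biases": [0.0] * n_out,
    }


def build_network(layer_sizes, rng):
    """Build a list of dense layers from sizes such as [64, 32, 16, 2]."""
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


def reuse_lower_layers(pretrained, new_network, n_reused):
    """Copy the first n_reused layers of `pretrained` into `new_network`."""
    for i in range(n_reused):
        # The copied layers must have the same shape in both networks.
        assert len(pretrained[i]["weights"]) == len(new_network[i]["weights"])
        assert len(pretrained[i]["biases"]) == len(new_network[i]["biases"])
        new_network[i] = copy.deepcopy(pretrained[i])
    return new_network


rng = random.Random(42)
face_net = build_network([64, 32, 16, 2], rng)  # stands in for an already trained face model
hair_net = build_network([64, 32, 16, 5], rng)  # new task with a different output layer
hair_net = reuse_lower_layers(face_net, hair_net, n_reused=2)
```

Only the two lower layers are copied; the output layer of `hair_net` keeps its random initialization because it has to learn the new task. In practice, the reused layers are often frozen for the first few epochs so that the new upper layers can settle before everything is fine-tuned.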


## Number of Neurons per Hidden Layer

The number of neurons in the input and output layers is determined by the type of
input and output your task requires. For example, the MNIST task requires
28 × 28 = 784 input neurons and 10 output neurons.
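
To get a feel for how layer sizes translate into model size, the sketch below counts the weights and biases of a fully connected network. The 784 inputs and 10 outputs come from MNIST; the hidden layer sizes (300 and 100) and the helper name `count_parameters` are assumptions made for illustration.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per neuron
    return total


# 784 inputs and 10 outputs are fixed by MNIST; the hidden sizes are illustrative.
mnist_sizes = [784, 300, 100, 10]
n_params = count_parameters(mnist_sizes)
print(n_params)  # 266610
```

Most of these parameters sit in the first weight matrix (784 × 300), so the size of the first hidden layer dominates the total in this configuration.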

Just like for the number of layers, you can try increasing the number of neurons
gradually until the network starts overfitting. In general you will get more bang for
the buck by increasing the number of layers than the number of neurons per layer.
Unfortunately, finding the perfect number of neurons is still somewhat of a dark art.

A simpler approach is to pick a model with more layers and neurons than you actually
need, then use early stopping (and other regularization techniques, such as dropout)
to prevent it from overfitting.
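
Early stopping itself is simple to express. The sketch below is a minimal, framework-free version, assuming you already have one validation loss per epoch; the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping once the validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss = -1, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss


# Made-up validation losses: improving at first, then fluctuating.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49, 0.60]
best_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, best_loss)  # 3 0.5
```

With `patience=3` the loop stops at epoch 6 and never sees the slightly better loss at epoch 7, which shows the trade-off behind the patience setting. In a real training loop you would also save the model weights at the best epoch and restore them after stopping.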

Depending on the dataset, it can sometimes help to make the first hidden layer bigger
than the others.


## Learning Rate, Batch Size and Other Hyperparameters

- The learning rate is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate (i.e., the learning rate above which the training algorithm diverges). So a simple approach for tuning the learning rate is to start with a large value that makes the training algorithm diverge, then divide this value by 3 and try again, and repeat until the training algorithm stops diverging. At that point, you generally won’t be too far from the optimal learning rate.
- Choosing a better optimizer than plain old Mini-batch Gradient Descent (and tuning its hyperparameters) is also quite important.
- The batch size can also have a significant impact on your model’s performance and the training time. In general the optimal batch size will be lower than 32.
- In general, the ReLU activation function will be a good default for all hidden layers. For the output layer, it really depends on your task.
- The number of training iterations does not actually need to be tweaked: just use early stopping instead.
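
The divide-by-3 search for the learning rate described in the first bullet can be sketched on a toy problem. The example below assumes plain gradient descent on the one-dimensional quadratic f(w) = 0.5 · curvature · w², where divergence is easy to detect; the names `diverges` and `find_learning_rate`, the starting value of 10, and the curvature of 4 are all illustrative assumptions rather than recommended settings.

```python
def diverges(learning_rate, curvature=4.0, n_steps=50):
    """Run gradient descent on f(w) = 0.5 * curvature * w**2 and report
    whether |w| grew instead of shrinking."""
    w = 1.0
    for _ in range(n_steps):
        gradient = curvature * w
        w -= learning_rate * gradient
        if abs(w) > 1e6:
            return True  # the iterate blew up
    return abs(w) > 1.0


def find_learning_rate(start=10.0, factor=3.0):
    """Divide the learning rate by `factor` until training stops diverging."""
    learning_rate = start
    while diverges(learning_rate):
        learning_rate /= factor
    return learning_rate


lr = find_learning_rate()
print(round(lr, 4))  # 0.3704
```

On this toy problem gradient descent diverges for learning rates above 0.5, and a rate of 0.25 reaches the minimum in a single step, so the value found by the search (10 / 27 ≈ 0.37) lands between the optimum and the divergence threshold. For a real network, `diverges` would be replaced by a short training run that checks whether the loss blows up.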