# Params

## Dropout

### Dropout randomly deactivates a fraction of a layer’s neurons at each training step. The theory behind it is that neural networks have so much freedom across their many layers that one layer can develop a bad behavior while the next layer learns to compensate for it. This is not an ideal use of neurons. With dropout, there is a high probability that the neurons “fixing” the problem will be absent in a given training round. The bad behavior of the offending layer therefore becomes obvious, and its weights evolve toward a better behavior. Dropout also spreads the information flow throughout the network, giving all weights a roughly equal share of training, which can help keep the model balanced.
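To make the mechanism concrete, here is a minimal pure-Python sketch of "inverted" dropout, a common formulation in which surviving activations are scaled up so their expected sum is unchanged. The `dropout` helper, its `rate` argument, and the sample values are illustrative assumptions, not a library API.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout on a list of floats (illustrative helper, not a library API)."""
    if not training or rate == 0.0:
        # At inference time every neuron stays active and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - rate
    # Each neuron is dropped independently with probability `rate`.
    # Survivors are divided by keep_prob so the expected total is preserved.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # seeded only to make the demo repeatable
layer_output = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]
for step in range(3):
    # A different random subset of neurons is silenced at each training step.
    print(step, dropout(layer_output, rate=0.5, rng=rng))
```

Because a different subset is silenced each round, a downstream neuron cannot rely on any one upstream neuron always being present.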

## Batch normalization

### Batch normalization normalizes neuron outputs across a training batch by subtracting the batch mean and dividing by the batch standard deviation. Doing only that, however, would swing the pendulum too far in one direction: with a perfectly centered, unit-width distribution everywhere, all neurons would have the same behavior. The trick is to introduce two additional learnable parameters per neuron, called scale and center, and to apply them to the normalized value:

- Normalized = (input - batch mean) / batch standard deviation
- Output = scale * Normalized + center

### This way, the network decides, through machine learning, how much centering and rescaling to apply at each neuron.
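As a rough illustration, the sketch below normalizes one neuron's outputs over a batch and then applies a scale and a center. It assumes the convention documented for the Keras `BatchNormalization` layer, where the output is `scale * normalized + center`. The `batch_norm` function, the `eps` constant, and the sample batch are assumptions made for this example.

```python
import math

def batch_norm(batch, scale=1.0, center=0.0, eps=1e-3):
    """Normalize one neuron's outputs over a batch, then rescale and recenter.

    `scale` and `center` stand in for the two learnable parameters;
    `eps` guards against division by zero (value chosen for illustration).
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [scale * n + center for n in normalized]

batch = [2.0, 4.0, 6.0, 8.0]
print(batch_norm(batch))                         # zero mean, roughly unit width
print(batch_norm(batch, scale=2.0, center=0.5))  # width and offset set by the parameters
```

In a real network, `scale` and `center` would be updated by gradient descent along with the other weights.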

### The problem with batch normalization is that at prediction time you do not have training batches over which to compute the statistics of your neurons’ outputs, but you still need those values. Therefore, during training, each neuron’s output mean and variance are accumulated across a “sufficient” number of batches using an exponential moving average. These accumulated statistics are then used at inference time.
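A minimal sketch of that bookkeeping follows. It assumes the update rule `moving = momentum * moving + (1 - momentum) * batch_value`, which is the form the Keras `BatchNormalization` documentation describes. The `RunningStats` class, the `momentum` value of 0.9, and the toy batches are hypothetical choices for illustration.

```python
import math

class RunningStats:
    """Tracks an exponential moving average of batch mean and variance."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.moving_mean = 0.0  # starts at zero
        self.moving_var = 1.0   # starts at one

    def update(self, batch):
        # Called once per training batch with one neuron's outputs.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
        m = self.momentum
        self.moving_mean = m * self.moving_mean + (1.0 - m) * mean
        self.moving_var = m * self.moving_var + (1.0 - m) * var

    def normalize_for_inference(self, x, eps=1e-3):
        # At prediction time there is no batch, so the stored statistics are used.
        return (x - self.moving_mean) / math.sqrt(self.moving_var + eps)

stats = RunningStats(momentum=0.9)
for batch in [[2.0, 4.0, 6.0], [1.0, 3.0, 5.0], [3.0, 5.0, 7.0]]:
    stats.update(batch)
print(stats.moving_mean, stats.moving_var)
print(stats.normalize_for_inference(4.0))
```

With only three batches the moving averages are still far from the true statistics, which is why a sufficient number of training batches is needed before the stored values are reliable.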

### Batch normalization is applied to the output of a layer before the activation function. So, rather than setting activation='relu' in the Dense layer’s constructor, omit the activation function there, add a BatchNormalization layer, and then add a separate Activation layer.
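The layer ordering described above might look like the following in Keras. This is a sketch that assumes TensorFlow 2.x with `tf.keras`; the input size, layer widths, and softmax output layer are made up for the example.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                # hypothetical 20-feature input
    # No activation here: batch norm must see the raw layer output.
    # use_bias=False because the batch norm offset already acts as a bias.
    tf.keras.layers.Dense(64, use_bias=False),
    tf.keras.layers.BatchNormalization(),       # center=True, scale=True by default
    tf.keras.layers.Activation('relu'),         # activation applied after normalization
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.summary()
```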

### If you use center=True in batch norm, you do not need biases in the preceding layer (in Keras, pass use_bias=False to that layer). The batch norm offset plays the role of a bias.
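One way to see why the bias is redundant: subtracting the batch mean cancels any constant added to every value in the batch. The short check below illustrates this in plain Python; the `normalize` helper and the sample numbers are assumptions for the demonstration.

```python
import math

def normalize(batch, eps=1e-3):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

pre_activations = [0.2, -1.0, 3.5, 0.7]  # one neuron's outputs over a batch
bias = 5.0                               # a bias the layer might have learned

without_bias = normalize(pre_activations)
with_bias = normalize([x + bias for x in pre_activations])

# The bias shifts the batch mean by the same amount, so it cancels out.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff)  # differs from zero only by floating-point rounding
```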

### If the activation function is scale-invariant (i.e., its shape does not change when you zoom in on it), you can set scale=False. ReLU is scale-invariant; sigmoid is not.
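Scale invariance can be checked numerically: for a positive factor `k`, `relu(k * x)` equals `k * relu(x)`, so a scale applied before ReLU can be absorbed by the next layer's weights, whereas sigmoid has no such property. The helper functions and test points below are illustrative.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

k = 3.0  # any positive scale factor
xs = [-2.0, -0.5, 0.0, 0.5, 2.0]

# ReLU: scaling the input is the same as scaling the output.
relu_commutes = all(math.isclose(relu(k * x), k * relu(x)) for x in xs)
# Sigmoid: scaling the input changes the shape of the curve.
sigmoid_commutes = all(math.isclose(sigmoid(k * x), k * sigmoid(x)) for x in xs)

print(relu_commutes, sigmoid_commutes)
```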