# Improving the Way Neural Networks Learn

This chapter covers an improved cost function (the cross-entropy cost function), regularization methods, better weight initialization, and better choices of hyper-parameters.

## Cross-Entropy Cost Function

People tend to learn very quickly from large mistakes and more slowly when their errors are less well-defined.
Neural networks do not always behave this way: a sigmoid neuron trained with the quadratic cost can learn more slowly when it is grossly wrong than when it is only marginally wrong.
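
The slowdown can be seen numerically. The sketch below assumes a single sigmoid neuron with one input, the quadratic cost $ C = \frac{(a - y)^2}{2} $, and illustrative starting values for the weight and bias (none of these values come from the notes above). It only evaluates the size of the weight gradient at two starting points; it is not a full training loop.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def quadratic_weight_gradient(w, b, x, y):
    """dC/dw for C = (a - y)^2 / 2, where a = sigmoid(w * x + b).

    By the chain rule: dC/dw = (a - y) * sigma'(z) * x,
    with sigma'(z) = a * (1 - a).
    """
    a = sigmoid(w * x + b)
    return (a - y) * a * (1.0 - a) * x


# Assumed toy setup: input 1, desired output 0.
x, y = 1.0, 0.0

# Marginally wrong start: output is roughly 0.82.
mildly_wrong = quadratic_weight_gradient(0.6, 0.9, x, y)

# Grossly wrong start: output is roughly 0.98.
grossly_wrong = quadratic_weight_gradient(2.0, 2.0, x, y)

print(f"gradient when mildly wrong:  {mildly_wrong:.4f}")
print(f"gradient when grossly wrong: {grossly_wrong:.4f}")
```

The grossly wrong neuron gets the smaller gradient because the sigmoid is nearly flat there, so $ \sigma'(z) $ is close to zero and each gradient-descent step barely moves the weight.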

### The Equation and Its Properties

The cross-entropy cost function is: $ C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y)\ln(1 - a) \right] $, where $ n $ is the total number of training examples, the sum is over all training inputs $ x $, $ y $ is the desired output for $ x $, and $ a = \sigma(z) $ is the neuron's actual output.
This is a suitable cost function because:

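1. It is non-negative, $ C \ge 0 $: every logarithm is taken of a number between 0 and 1, so each term in the sum is negative or zero, and the leading minus sign makes the total non-negative.
2. If the neuron's output is close to the desired output for all training inputs $ x $ (with desired outputs of 0 or 1), then the cost is close to zero.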
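
Both properties can be checked on a small example. This is a minimal sketch using only the standard library; the function name `cross_entropy_cost` and the sample outputs are made up for illustration, and the outputs are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def cross_entropy_cost(outputs, targets):
    """C = -(1/n) * sum over examples of [y ln a + (1 - y) ln(1 - a)].

    Assumes every output a satisfies 0 < a < 1.
    """
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / n


# Illustrative desired outputs and two sets of neuron outputs.
targets = [0.0, 1.0, 1.0, 0.0]
close_outputs = [0.01, 0.99, 0.98, 0.02]  # near the targets
far_outputs = [0.90, 0.20, 0.10, 0.80]    # far from the targets

close_cost = cross_entropy_cost(close_outputs, targets)
far_cost = cross_entropy_cost(far_outputs, targets)

print(f"cost when outputs are close: {close_cost:.4f}")
print(f"cost when outputs are far:   {far_cost:.4f}")
```

The cost is small but still positive when the outputs are close to the targets, and much larger when they are far away.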

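These properties are what make the cross-entropy a reasonable cost function, but the quadratic cost function satisfies them as well. What makes the cross-entropy less susceptible to learning slowly than the quadratic cost is the form of its gradient, given below.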

The partial derivative of the cross-entropy cost with respect to the weight $ w_j $ is: $ \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z)x_j}{\sigma(z)(1-\sigma(z))} (\sigma(z) - y) $, which simplifies to: $ \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (\sigma(z) - y) $ because $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $.
The $ \sigma'(z) $ factor cancels, so the gradient is controlled by the error $ \sigma(z) - y $: the larger the error, the faster the neuron learns.
It follows that $ \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y) $.
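
One way to gain confidence in the simplified formulas is to compare them with finite-difference estimates of the cost's slope. The sketch below assumes a single sigmoid neuron with two inputs; the training data, weights, bias, and all function names are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(w, b, x):
    """a = sigma(z) with z = sum_j w_j * x_j + b."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def cost(w, b, xs, ys):
    """Cross-entropy cost averaged over the training examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        a = neuron_output(w, b, x)
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / len(xs)


def analytic_gradients(w, b, xs, ys):
    """dC/dw_j = (1/n) sum_x x_j (sigma(z) - y); dC/db = (1/n) sum_x (sigma(z) - y)."""
    n = len(xs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        error = neuron_output(w, b, x) - y
        for j, xj in enumerate(x):
            grad_w[j] += xj * error / n
        grad_b += error / n
    return grad_w, grad_b


def numeric_gradients(w, b, xs, ys, eps=1e-6):
    """Central finite differences of the cost, one parameter at a time."""
    grad_w = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad_w.append((cost(w_plus, b, xs, ys) - cost(w_minus, b, xs, ys)) / (2 * eps))
    grad_b = (cost(w, b + eps, xs, ys) - cost(w, b - eps, xs, ys)) / (2 * eps)
    return grad_w, grad_b


# Made-up training set of three examples with two inputs each.
xs = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3]]
ys = [0.0, 1.0, 1.0]
w, b = [0.4, -0.3], 0.1

analytic_w, analytic_b = analytic_gradients(w, b, xs, ys)
numeric_w, numeric_b = numeric_gradients(w, b, xs, ys)

print("analytic:", analytic_w, analytic_b)
print("numeric: ", numeric_w, numeric_b)
```

The two sets of values should agree to several decimal places. Note that `analytic_gradients` never evaluates $ \sigma'(z) $, which reflects the cancellation in the derivation.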