# Improving the Way Neural Networks Learn

This chapter discusses an improved cost function, the cross-entropy cost function, regularization methods, initializing the weights better, and choosing better hyper-parameters. 

## Cross-Entropy Cost Function

When we make large mistakes, we tend to learn quickly; when our errors are smaller or less well-defined, we learn more slowly.
This is not always the case with neural networks: with the quadratic cost, a sigmoid neuron that is grossly wrong can learn more slowly than one that is only marginally wrong.

### The Equation and Its Properties

The cross-entropy cost function is: $ C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y)\ln(1 - a) \right] $, where $ n $ is the total number of training examples, the sum is over all training inputs $ x $, $ y $ is the desired output, and $ a $ is the neuron's actual output.
This is a suitable cost function because:

1. It is non-negative, $ C \geq 0 $.
2. If the neuron's output is close to the desired output for all inputs, $ x $, then this function will be close to zero.

These properties, especially #2, are what make cross-entropy a sensible cost function.
Its advantage over the quadratic cost function is that it is less susceptible to learning slowdown.

The partial derivative of the cross-entropy function with respect to the weight is: $ \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z)x_j}{\sigma(z)(1-\sigma(z))} (\sigma(z) - y) $, which when simplified is: $ \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (\sigma(z) - y) $ because $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $.
It follows that $ \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y) $.
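
To make the effect of the cancelled $ \sigma'(z) $ factor concrete, the following sketch compares the weight gradient of a single sigmoid neuron under both costs. The setup (one input $ x = 1 $, desired output $ y = 0 $, and a badly wrong starting weight and bias of $ 2.0 $) is an illustrative assumption, not one of the experiments in this chapter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_grad_w(w, b, x, y):
    """dC/dw for C = (a - y)^2 / 2 with a = sigmoid(w*x + b)."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1.0 - a) * x   # includes the sigma'(z) factor

def cross_entropy_grad_w(w, b, x, y):
    """dC/dw for the cross-entropy cost; the sigma'(z) factor has cancelled."""
    a = sigmoid(w * x + b)
    return (a - y) * x

# Assumed toy setup: the neuron outputs about 0.98 when it should output 0.
w, b, x, y = 2.0, 2.0, 1.0, 0.0
g_quad = quadratic_grad_w(w, b, x, y)
g_xent = cross_entropy_grad_w(w, b, x, y)
print(g_quad, g_xent)
```

In this saturated case the cross-entropy gradient is simply the error $ \sigma(z) - y $, while the quadratic gradient is scaled down by $ \sigma'(z) $, which is small when the neuron is saturated.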

The equation for a multi-neuron multi-layer neural network is as follows: $ C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1 - y_j) \ln(1 - a^L_j) \right] $.

In [1]:
import json, random, sys
import numpy as np

In [2]:
class QuadraticCost(object):
    @staticmethod
    def fn(a, y):
        return 0.5 * np.linalg.norm(a - y) ** 2
    
    @staticmethod
    def delta(z, a, y):
        return (a - y) * sigmoid_prime(z)
    
class CrossEntropyCost(object):
    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log( 1- a)))
    
    @staticmethod
    def delta(z, a, y):
        return (a - y)

In [3]:
#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.
    """
    with open(filename, "r") as f:
        data = json.load(f)
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.
    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

In [7]:
sys.path.insert(0, './code')
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
import network2

In [9]:
net = network2.Network([784, 30, 10], cost=CrossEntropyCost)
net.large_weight_initializer()
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, 
       monitor_evaluation_accuracy=True)

Epoch 0 training complete
Accuracy on evaluation data: 9113 / 10000

Epoch 1 training complete
Accuracy on evaluation data: 9235 / 10000

Epoch 2 training complete
Accuracy on evaluation data: 9291 / 10000

Epoch 3 training complete
Accuracy on evaluation data: 9345 / 10000

Epoch 4 training complete
Accuracy on evaluation data: 9373 / 10000

Epoch 5 training complete
Accuracy on evaluation data: 9380 / 10000

Epoch 6 training complete
Accuracy on evaluation data: 9396 / 10000

Epoch 7 training complete
Accuracy on evaluation data: 9387 / 10000

Epoch 8 training complete
Accuracy on evaluation data: 9406 / 10000

Epoch 9 training complete
Accuracy on evaluation data: 9424 / 10000

Epoch 10 training complete
Accuracy on evaluation data: 9453 / 10000

Epoch 11 training complete
Accuracy on evaluation data: 9452 / 10000

Epoch 12 training complete
Accuracy on evaluation data: 9467 / 10000

Epoch 13 training complete
Accuracy on evaluation data: 9419 / 10000

Epoch 14 training complete
Acc

([],
 [9113,
  9235,
  9291,
  9345,
  9373,
  9380,
  9396,
  9387,
  9406,
  9424,
  9453,
  9452,
  9467,
  9419,
  9473,
  9449,
  9478,
  9473,
  9472,
  9483,
  9467,
  9441,
  9430,
  9465,
  9481,
  9487,
  9468,
  9500,
  9480,
  9486],
 [],
 [])

## Softmax

Softmax achieves the same goal as cross-entropy: reducing learning slowdown.
It does so by defining a new type of output layer for the network.
Like a sigmoid layer, it forms the weighted inputs $ z^L_j = \sum_k w^L_{jk} a^{L-1}_k + b^L_j $; then, instead of the sigmoid function, the *softmax function* is applied to the $ z^L_j $.
The activation $ a^L_j $ of the $ j^{th} $ output is $ a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} $.

Note that $ \sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1 $, which means that when one activation increases, the others must decrease, and vice versa.
The output of this function can be thought of as a probability distribution.
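
As a minimal sketch of the softmax function described above, the snippet below computes the activations for one vector of weighted inputs. The input values are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the definition in these notes.

```python
import numpy as np

def softmax(z):
    """Softmax of a vector of weighted inputs z."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([2.5, -1.0, 3.2, 0.5])   # hypothetical weighted inputs
a = softmax(z)
print(a, a.sum())

# Increasing one weighted input raises its activation and lowers all the others.
z_bumped = z.copy()
z_bumped[0] += 1.0
a_bumped = softmax(z_bumped)
print(a_bumped)
```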

### Log Likelihood Cost Function
Let the log-likelihood cost function be $ C \equiv - \ln a^L_y $, where $ a^L_y $ is the activation of the output neuron in layer $ L $ that corresponds to the desired output $ y $.
This means that when the network is confident that the input is a $ y $, the activation $ a^L_y $ is close to $ 1 $ and the cost is close to $ 0 $.
When the network is not doing a good job, $ a^L_y $ is smaller and the cost $ - \ln a^L_y $ will be larger.
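
A small numeric check of this behaviour, using two hypothetical sets of softmax outputs for a three-class problem whose correct class is index 1:

```python
import math

def log_likelihood_cost(activations, y):
    """Cost -ln(a_y) for output activations that sum to 1 and correct class y."""
    return -math.log(activations[y])

confident = [0.01, 0.97, 0.02]   # hypothetical outputs, network is confident and right
unsure = [0.40, 0.35, 0.25]      # hypothetical outputs, network is unsure

cost_confident = log_likelihood_cost(confident, 1)
cost_unsure = log_likelihood_cost(unsure, 1)
print(cost_confident, cost_unsure)
```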

In conclusion, softmax and the log-likelihood cost are used when the output activations are to be interpreted as probabilities.

## Overfitting and Regularization

The more parameters a model has, the more susceptible the model is to overfitting.

### How to Recognize When a Model is Overfitted

There are multiple metrics that can help one realize when a model is overfitted.
The first step is to look at the accuracy on the test data; if it is poor, the model may be overfitted.
It is suspicious if the test accuracy plateaus prematurely (the model stops improving after only a small number of epochs).
I don't think that this is a definitive metric for overfitting, but even if the model isn't overfitted, something is wrong because the model isn't accomplishing its end goal.

Another way that overfitting can be found is by looking at the cost on the test data.
Recall that the purpose of the neural network is to minimize the cost.
If you see that the cost drops, and then increases as the epochs progress, then the model is being overfitted.
This can be confirmed if the cost on the training data decreases as the cost on the test data increases.

Yet another sign of overfitting can be found in the accuracy on the training data.
If the accuracy approaches 100% too quickly (or even at all), then the model is probably overfitted.

### Methods to Reduce Overfitting

One way to limit overfitting is to keep track of the accuracy on data the network was not trained on, and to stop training once that accuracy stops increasing.
We use the `validation_data` set to measure overfitting.
The `validation_data` is not found in the `test_data` or the `training_data`.
When the classification accuracy on the `validation_data` has saturated, we stop training the model.
This process is called *early stopping*. 

### `validation_data`

`validation_data` is used to set the hyper-parameters.
Why don't we use the `test_data` to set the hyper-parameters?
This is to make sure that the model doesn't overfit the `test_data`.
After a good set of hyper-parameters is found, the final performance is evaluated on the `test_data`.
The approach of using the `validation_data` as a type of training data that helps to learn good hyper-parameters is called the *hold out* method because the `validation_data` is held out from the `training_data`.

One of the best ways of reducing overfitting is to increase the size of `training_data`.

In [7]:
import sys
sys.path.insert(0, './code')
import mnist_loader, network2

# train the model with only the first 1,000 training images

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 10], cost = network2.CrossEntropyCost)
net.large_weight_initializer()
net.SGD(
    training_data[:1000], 
    400, 
    10, 
    0.5, 
    evaluation_data = test_data,
    monitor_evaluation_accuracy = True,
    monitor_training_cost = True
)

Epoch 0 training complete
Cost on training data: 1.8937980110453954
Accuracy on evaluation data: 5278 / 10000

Epoch 1 training complete
Cost on training data: 1.3844676075671265
Accuracy on evaluation data: 6636 / 10000

Epoch 2 training complete
Cost on training data: 1.1532897293201265
Accuracy on evaluation data: 7114 / 10000

Epoch 3 training complete
Cost on training data: 0.9592227754186763
Accuracy on evaluation data: 7387 / 10000

Epoch 4 training complete
Cost on training data: 0.8759523682530747
Accuracy on evaluation data: 7423 / 10000

Epoch 5 training complete
Cost on training data: 0.7387848786497457
Accuracy on evaluation data: 7635 / 10000

Epoch 6 training complete
Cost on training data: 0.6459295164151101
Accuracy on evaluation data: 7865 / 10000

Epoch 7 training complete
Cost on training data: 0.5986429580352461
Accuracy on evaluation data: 7871 / 10000

Epoch 8 training complete
Cost on training data: 0.5378963522395405
Accuracy on evaluation data: 7967 / 10000

E

([],
 [5278,
  6636,
  7114,
  7387,
  7423,
  7635,
  7865,
  7871,
  7967,
  7971,
  7993,
  8053,
  8036,
  8090,
  8092,
  8034,
  8094,
  8108,
  8131,
  8141,
  8139,
  8118,
  8118,
  8150,
  8149,
  8152,
  8155,
  8159,
  8164,
  8149,
  8182,
  8164,
  8181,
  8179,
  8175,
  8183,
  8177,
  8183,
  8215,
  8189,
  8194,
  8211,
  8213,
  8229,
  8222,
  8221,
  8207,
  8203,
  8234,
  8236,
  8240,
  8231,
  8226,
  8230,
  8223,
  8228,
  8230,
  8234,
  8215,
  8227,
  8230,
  8219,
  8225,
  8238,
  8228,
  8225,
  8235,
  8232,
  8227,
  8239,
  8231,
  8231,
  8230,
  8245,
  8240,
  8246,
  8232,
  8249,
  8236,
  8243,
  8239,
  8256,
  8247,
  8255,
  8252,
  8255,
  8252,
  8253,
  8254,
  8253,
  8252,
  8259,
  8250,
  8260,
  8254,
  8258,
  8256,
  8251,
  8255,
  8255,
  8253,
  8247,
  8257,
  8252,
  8253,
  8259,
  8258,
  8250,
  8253,
  8256,
  8258,
  8254,
  8252,
  8252,
  8252,
  8253,
  8246,
  8251,
  8250,
  8253,
  8251,
  8260,
  8253,
  8254,
  8

## Regularization

Regularization is a technique for reducing overfitting when neither increasing the size of the `training_data` nor reducing the size of the network (another way to reduce overfitting) is an option.
One regularization technique is called *weight decay* or *L2 regularization*.

### *Weight Decay* (*L2 Regularization*)

This technique involves adding an extra term to the cost function, which is called the *regularization term*.
The regularized cross-entropy function is: $ C = -\frac{1}{n} \sum_{xj} \left[y_j \ln a^L_j + (1 - y_j) \ln(1 - a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2 $.
The *regularization term* is $ \frac{\lambda}{2n} \sum_w w^2 $, where $ \lambda \gt 0 $ is the *regularization parameter* and $ n $ is the size of the training set.
To generalize how to *regularize* a cost function, let $ C = C_0 + \frac{\lambda}{2n} \sum_w w^2 $ where $ C $ is the *regularized* cost function, and $ C_0 $ is the original *unregularized* cost function.

One can look at regularization as finding a compromise between finding small weights and minimizing the original cost function; the importance of these two factors depends on the value of $ \lambda $.
When $ \lambda $ is small, minimizing the original cost function is preferred; and when $ \lambda $ is large the small weights are preferred.

#### *Regularized* Gradient Descent

When using backpropagation, the only difference between a *regularized* and an *unregularized* cost function is in the partial derivatives. The *regularized* partial derivatives are: $ \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w $ and $ \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b} $.
The learning rule for the biases is the same for *regularized* and *unregularized* cost function, but it changes for the weights: $ w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial C_0}{\partial w} $.
The only difference is that the weight $ w $ is rescaled by a factor $ \left( 1 - \frac{\eta \lambda}{n} \right) $.
The name *weight decay* comes from this property, because the factor makes the weight smaller.

#### *Regularized* Stochastic Gradient Descent

The equation for the *regularized* stochastic gradient descent is: $ w \rightarrow \left(1 - \frac{\eta \lambda}{n} \right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w} $, where the sum is over the training examples $ x $ in the mini-batch and $ C_x $ is the *unregularized* cost for each training example.
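
The update rule can be sketched as a standalone function. The function name and the example values are assumptions for illustration; with a zero gradient the step isolates the weight-decay factor $ \left(1 - \frac{\eta \lambda}{n} \right) $.

```python
import numpy as np

def regularized_sgd_step(w, grad_sum, eta, lmbda, n, m):
    """One L2-regularized mini-batch update of the weights.

    w        -- weight array
    grad_sum -- sum over the mini-batch of dC_x/dw (unregularized)
    eta      -- learning rate
    lmbda    -- regularization parameter
    n        -- size of the full training set
    m        -- size of the mini-batch
    """
    decay = 1.0 - eta * lmbda / n
    return decay * w - (eta / m) * grad_sum

w = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(w)
# With no gradient signal, every weight shrinks by the same factor.
w_new = regularized_sgd_step(w, zero_grad, eta=0.5, lmbda=5.0, n=50000, m=10)
print(w_new)
```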

In [10]:
# train the model with regularization and a small training_data set
import mnist_loader, network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

net = network2.Network(
    [784, 30, 10],
    cost = network2.CrossEntropyCost
)
net.large_weight_initializer()
net.SGD(
    training_data[:1000],
    400,
    10,
    0.5,
    evaluation_data = test_data,
    lmbda = 0.1,
    monitor_evaluation_cost = True,
    monitor_evaluation_accuracy = True,
    monitor_training_cost = True,
    monitor_training_accuracy = True
)

Epoch 0 training complete
Cost on training data: 3.1093118559161024
Accuracy on training data: 629 / 1000
Cost on evaluation data: 2.4374085774393746
Accuracy on evaluation data: 5290 / 10000

Epoch 1 training complete
Cost on training data: 2.4756915124491066
Accuracy on training data: 786 / 1000
Cost on evaluation data: 1.8765149887329786
Accuracy on evaluation data: 6690 / 10000

Epoch 2 training complete
Cost on training data: 2.263067702333121
Accuracy on training data: 833 / 1000
Cost on evaluation data: 1.7794002969003386
Accuracy on evaluation data: 6905 / 10000

Epoch 3 training complete
Cost on training data: 2.0616368508614893
Accuracy on training data: 872 / 1000
Cost on evaluation data: 1.6342959983529344
Accuracy on evaluation data: 7336 / 10000

Epoch 4 training complete
Cost on training data: 1.8992574421272295
Accuracy on training data: 901 / 1000
Cost on evaluation data: 1.5198826263664664
Accuracy on evaluation data: 7597 / 10000

Epoch 5 training complete
Cost on tr

([2.4374085774393746,
  1.8765149887329786,
  1.7794002969003386,
  1.6342959983529344,
  1.5198826263664664,
  1.4600010151048566,
  1.4231405794377416,
  1.3957942251437045,
  1.3950703560897051,
  1.3536018442452293,
  1.3536994533152651,
  1.3602396250591193,
  1.3504967367686207,
  1.341945692237333,
  1.310829694956184,
  1.3211836742294636,
  1.3327887177519562,
  1.302170611677719,
  1.3072866391718785,
  1.2999129180713995,
  1.2953852603797671,
  1.2964146300969608,
  1.3251401573192871,
  1.2972418492112934,
  1.305965444272756,
  1.3062330982641408,
  1.3266184227745921,
  1.3119714562962712,
  1.3114572966685203,
  1.3108039840214631,
  1.3110856345223105,
  1.2986315739941245,
  1.3015682774135116,
  1.3118830810975821,
  1.3167083374215169,
  1.3084092524000437,
  1.3042997865592694,
  1.2970537130522122,
  1.3078298636695138,
  1.2999754871564573,
  1.2942152145694825,
  1.309535797380428,
  1.3005896950798275,
  1.3107243311992205,
  1.299577304821208,
  1.286881901722

In [12]:
# train the model with regularization and full training_data
import mnist_loader, network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

net = network2.Network(
    [784, 30, 10],
    cost = network2.CrossEntropyCost
)
net.large_weight_initializer()
net.SGD(
    training_data,
    30,
    10,
    0.5,
    evaluation_data = test_data,
    lmbda = 5.0,
    monitor_evaluation_accuracy = True,
    monitor_training_accuracy = True
)

Epoch 0 training complete
Accuracy on training data: 45571 / 50000
Accuracy on evaluation data: 9173 / 10000

Epoch 1 training complete
Accuracy on training data: 46683 / 50000
Accuracy on evaluation data: 9303 / 10000

Epoch 2 training complete
Accuracy on training data: 47243 / 50000
Accuracy on evaluation data: 9409 / 10000

Epoch 3 training complete
Accuracy on training data: 47603 / 50000
Accuracy on evaluation data: 9471 / 10000

Epoch 4 training complete
Accuracy on training data: 47706 / 50000
Accuracy on evaluation data: 9495 / 10000

Epoch 5 training complete
Accuracy on training data: 47738 / 50000
Accuracy on evaluation data: 9521 / 10000

Epoch 6 training complete
Accuracy on training data: 47975 / 50000
Accuracy on evaluation data: 9586 / 10000

Epoch 7 training complete
Accuracy on training data: 47994 / 50000
Accuracy on evaluation data: 9573 / 10000

Epoch 8 training complete
Accuracy on training data: 48129 / 50000
Accuracy on evaluation data: 9589 / 10000

Epoch 9 tr

([],
 [9173,
  9303,
  9409,
  9471,
  9495,
  9521,
  9586,
  9573,
  9589,
  9535,
  9555,
  9600,
  9610,
  9617,
  9565,
  9534,
  9567,
  9586,
  9584,
  9578,
  9511,
  9623,
  9589,
  9596,
  9593,
  9593,
  9581,
  9575,
  9503,
  9600],
 [],
 [45571,
  46683,
  47243,
  47603,
  47706,
  47738,
  47975,
  47994,
  48129,
  47945,
  48063,
  48308,
  48294,
  48322,
  48139,
  48030,
  48019,
  48231,
  48224,
  48220,
  47795,
  48458,
  48413,
  48239,
  48335,
  48419,
  48322,
  48328,
  47905,
  48392])

## Why Does Regularization Help Reduce Overfitting?

A common story that explains why regularization works is:

> The smaller weights are, the lower their complexity, which means they
provide a simpler and more powerful explanation for the data, therefore
these smaller weights should be preferred.

There are two ways that one can design a model: 

  1. Make a model that fits your data points exactly, usually with a high-order polynomial.
  2. Make a simple model that doesn't fit your data points exactly, but will hopefully scale better than your high-order polynomial model will.
  
Some say that in science we should go with the simplest explanation unless compelled not to.
This is because it is probably not a coincidence that a simple model can explain complex data; the simple model must "be expressing some underlying truth about the phenomenon."
When we use a simplified model, it is not prone to change with small variations here and there; it generalizes better.
"An unregularized network can use large weights to learn a complex model that carries a lot of information about the noise in the training data."

People refer to the idea of preferring simpler explanations as "Occam's Razor."
This does not mean that the simpler model will always be right; sometimes the more complex model is better at predicting than the simpler model.
The ultimate test of a model is not whether a model is simple or not, but rather it is how well the model predicts new phenomena.

Empirically, regularized neural networks usually generalize better than unregularized networks.
However, no one completely understands why regularization works; in practice it acts as something of a magic wand that helps the neural network generalize.

## Other Techniques for Regularization

### L1 Regularization

This technique will modify the unregularized cost function by adding the sum of the absolute values of the weights, which is represented by: $ C = C_0 + \frac{\lambda}{n} \sum_w |w| $, where $ C_0 $ is the original cost function, $ w $ is the weights, $ n $ is the number of samples, and $ \lambda $ is the *regularization parameter*.

Just like L2 regularization, the goal of L1 regularization is to shrink the large weights.
They differ in how the weights are shrunk.
L1 regularization shrinks the weights by a constant amount toward $ 0 $, but in L2 regularization the weights shrink by an amount that is proportional to $ w $.
This means that when a weight has a large magnitude, $ |w| $, L1 regularization shrinks the weight less than L2 regularization does.
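
The difference can be checked numerically. The sketch below applies only the regularization part of each update, assuming the usual L1 rule $ w \rightarrow w - \frac{\eta \lambda}{n} \operatorname{sgn}(w) $ (which follows from differentiating $ |w| $ but is not written out in these notes) and illustrative values for $ \eta $, $ \lambda $, and $ n $.

```python
import numpy as np

def l1_shrink(w, eta, lmbda, n):
    """Regularization part of the L1 update: a fixed-size step toward 0."""
    return w - (eta * lmbda / n) * np.sign(w)

def l2_shrink(w, eta, lmbda, n):
    """Regularization part of the L2 update: rescale each weight toward 0."""
    return (1.0 - eta * lmbda / n) * w

w = np.array([5.0, 0.1, -0.1, -5.0])
eta, lmbda, n = 0.5, 5.0, 1000          # illustrative values only

l1_step = np.abs(w - l1_shrink(w, eta, lmbda, n))   # same for every weight
l2_step = np.abs(w - l2_shrink(w, eta, lmbda, n))   # grows with |w|
print(l1_step)
print(l2_step)
```

For the large weights the L2 step is bigger than the L1 step, and for the small weights it is the other way around.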

To conclude, L1 regularization "concentrates the weights of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero."

### Dropout

Dropout does not rely on modifying the cost function; it modifies the network itself.
It randomly (and temporarily) deletes half of the hidden neurons in the network, while leaving the input and output neurons as they are.
After the input has been fed forward through the modified network and the results have been backpropagated, the dropped-out neurons are restored and a new random subset is dropped out.
Once training is complete, the weights outgoing from the hidden neurons are halved because twice as many hidden neurons are active as there were during training.

Imagine that dropout is training multiple different neural networks, and then once training is finished it averages all of the neural networks.
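
A minimal sketch of the forward pass under dropout, assuming a single hidden layer and randomly generated weights (the layer sizes, the function names, and the use of NumPy's `default_rng` are all illustrative choices, and backpropagation is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_train(x, w1, b1, w2, b2, rng):
    """Forward pass with a random half of the hidden neurons dropped."""
    hidden = sigmoid(w1 @ x + b1)
    n_hidden = hidden.shape[0]
    mask = np.zeros((n_hidden, 1))
    keep = rng.choice(n_hidden, size=n_hidden // 2, replace=False)
    mask[keep] = 1.0          # the other half contributes nothing on this pass
    return sigmoid(w2 @ (hidden * mask) + b2), mask

def forward_test(x, w1, b1, w2, b2):
    """Forward pass with every hidden neuron active and outgoing weights halved."""
    hidden = sigmoid(w1 @ x + b1)
    return sigmoid((0.5 * w2) @ hidden + b2)

x = rng.standard_normal((4, 1))
w1, b1 = rng.standard_normal((6, 4)), rng.standard_normal((6, 1))
w2, b2 = rng.standard_normal((2, 6)), rng.standard_normal((2, 1))

out_train, mask = forward_train(x, w1, b1, w2, b2, rng)
out_test = forward_test(x, w1, b1, w2, b2)
```

A fresh mask is drawn on every call to `forward_train`, which corresponds to restoring the dropped neurons and dropping a new subset.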

### Artificially Increasing the Training Set Size

Intuitively it makes sense that the neural network should perform better when there is more data available.
In the MNIST training data, one can artificially augment the number of training examples by modifying the images slightly.
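
One simple modification is to shift an image by a pixel or two. The helper below is a hypothetical sketch that works on a 2-D array; the images returned by `mnist_loader` are column vectors, so they would need to be reshaped to $ 28 \times 28 $ first and flattened again afterwards.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Return a copy of a 2-D image shifted by (dx, dy) pixels, padding with zeros.

    Positive dx moves the image right; positive dy moves it down.
    """
    shifted = np.zeros_like(image)
    h, w = image.shape
    # Part of the original image that is still visible after the shift.
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted

image = np.zeros((28, 28))
image[10, 10] = 1.0                  # a single lit pixel as a stand-in for a digit
right = shift_image(image, 1, 0)     # one pixel to the right
up = shift_image(image, 0, -1)       # one pixel up
```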

## Weight Initialization

An *ad hoc* way of initializing weights and biases is to randomly assign them based on an independent Gaussian distribution with mean $ 0 $ and standard deviation $ 1 $.
This can be a problem because, with this distribution, the weighted input $ z $ to a hidden neuron can easily be either $ z \gg 1 $ or $ z \ll -1 $ (much greater than or much less than), which means that the neuron's output $ \sigma(z) $ will be saturated and learning will be slowed down.

A solution to this issue is to "initialize the weights as Gaussian random variables with mean $ 0 $ and standard deviation $ 1 / \sqrt{ n_{in} } $, which makes the neurons less likely to saturate."
Even though this is used for the weights, the biases are initialized in the previous manner, as Gaussian random variables with mean $ 0 $ and standard deviation $ 1 $.

In most cases changing the initial weights will only speed up learning, but in other cases it will also improve the final performance of the network.
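
The effect on the weighted input $ z $ can be estimated by sampling. This sketch assumes a neuron with $ n_{in} = 1000 $ inputs that are all equal to $ 1 $, a deliberately extreme case chosen for illustration, and compares the spread of $ z $ under the two initialization schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
x = np.ones(n_in)          # assumed worst case: every input is switched on
trials = 2000

# z = w . x + b with weights and bias all drawn with standard deviation 1.
z_large = rng.standard_normal((trials, n_in)) @ x + rng.standard_normal(trials)

# Same, but with the weights scaled by 1 / sqrt(n_in).
z_scaled = (rng.standard_normal((trials, n_in)) / np.sqrt(n_in)) @ x \
    + rng.standard_normal(trials)

print(z_large.std(), z_scaled.std())
```

With unscaled weights the standard deviation of $ z $ is about $ \sqrt{1001} \approx 31.6 $, so $ |z| $ is usually large and the neuron saturates; with scaled weights it is about $ \sqrt{2} \approx 1.4 $.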

In [14]:
import json, random, sys

import numpy as np

# the network with all of the improvements found in this chapter

class CrossEntropyCost(object):
    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))
    
    @staticmethod
    def delta(z, a, y):
        return (a - y)
    
class QuadraticCost(object):
    @staticmethod
    def fn(a, y):
        return 0.5 * np.linalg.norm(a - y) ** 2
    
    @staticmethod
    def delta(z, a, y):
        return (a - y) * sigmoid_prime(z)

class Network(object):
    def __init__(self, sizes, cost = CrossEntropyCost):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost = cost
        
    def default_weight_initializer(self):
        # Gaussian weights with standard deviation 1 / sqrt(n_in); biases with standard deviation 1.
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x) / np.sqrt(x)
                       for x, y in zip(self.sizes[:-1], self.sizes[1:])]
        
    def large_weight_initializer(self):
        # Gaussian weights and biases, both with standard deviation 1.
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                       for x, y in zip(self.sizes[:-1], self.sizes[1:])]
        
    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a
    
    def SGD(self, training_data, epochs, mini_batch_size, eta, lmbda = 0.0,
           evaluation_data = None,
           monitor_evaluation_cost = False,
           monitor_evaluation_accuracy = False,
           monitor_training_cost = False,
           monitor_training_accuracy = False
        ):
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k + mini_batch_size] 
                for k in range(0, n, mini_batch_size)
            ]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data)
                )
            print('Epoch {} training complete'.format(j))
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print('Cost on training data: {}'.format(cost))
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert = True)
                training_accuracy.append(accuracy)
                print('Accuracy on training data: {} / {}'.format(accuracy, n))
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert = True)
                evaluation_cost.append(cost)
                print('Cost on evaluation data: {}'.format(cost))
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                # reuse the value computed above instead of evaluating twice
                print('Accuracy on evaluation data: {} / {}'.format(
                    accuracy, n_data
                ))
            print()
        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy
    
    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())

        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        cost += 0.5*(lmbda/len(data))*sum(
            np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        with open(filename, "w") as f:
            json.dump(data, f)

def load(filename):
    with open(filename, "r") as f:
        data = json.load(f)
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

## How to Choose a Network's Hyper-Parameters

### Broad Strategy

One way to help choose hyper-parameters is to scale down the volume of training data.
This will not only reduce the time that it takes the network to learn, but it will also help you see which parameters need to be tuned and what sort of effect they have.
Using the example of the MNIST dataset, you can also reduce the data to classify only `0`'s and `1`'s.
One should also start with a small number (or none) of hidden layers, and then increase them over time.
The more hidden layers a network has, the more complex it is and the longer it will take to learn.

### Learning Rate

When analyzing the learning rate, we are mainly concerned with the cost as the epochs progress.
We can visualize this by plotting the cost for various learning rates, with cost on the y-axis and epochs on the x-axis.
Recall that the goal of the network is to minimize the cost, thus the optimal learning rate will show the cost decreasing as the epochs progress.
If the learning rate is too high, the cost will jump around, and never really decrease because it will overshoot the minimum.
If the learning rate is too low, stochastic gradient descent will be slow.

#### Process

  1. Find a learning rate at which the cost immediately begins decreasing (as opposed to oscillating or increasing).
  2. Increase the learning rate by an order of magnitude at a time, e.g. if you started with $ \eta = 0.01 $ then you should try $ \eta = 0.1, 1.0, \ldots $. 
  Increase the learning rate until the cost oscillates or increases within the first few epochs.
  3. To optimize the learning rate, you can find the largest value of $ \eta $ where the cost decreases during the first few epochs, this gives you a threshold value for $ \eta $.
  
**Note:** Pick an $ \eta $ that is no larger than the threshold value.
An $ \eta $ somewhat below the threshold is more likely to remain usable over many epochs (there will not be much of a learning slowdown as the epochs progress).
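
The search for the threshold can be illustrated on a toy problem. The sketch below stands in for a real network with gradient descent on the one-dimensional cost $ C(w) = w^2 $, which is purely an assumption made so that the example runs instantly; the candidate learning rates follow the order-of-magnitude steps described above.

```python
def cost_history(eta, steps=5, w=1.0):
    """Costs C(w) = w**2 along a few gradient-descent steps starting from w."""
    costs = [w ** 2]
    for _ in range(steps):
        w -= eta * 2.0 * w      # the gradient of w**2 is 2w
        costs.append(w ** 2)
    return costs

def decreases(costs):
    """True if the cost goes down at every step."""
    return all(later < earlier for earlier, later in zip(costs, costs[1:]))

candidates = [0.01, 0.1, 1.0, 10.0]
# Largest candidate for which the cost decreases during the first few steps.
threshold = max(eta for eta in candidates if decreases(cost_history(eta)))
print(threshold)
```

On this toy cost, $ \eta = 1.0 $ makes $ w $ flip sign forever without the cost changing, and $ \eta = 10.0 $ makes the cost grow, so the threshold among the candidates is $ 0.1 $.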

This is the only hyper-parameter that we choose by looking at the cost on the training data rather than the accuracy on the validation data, because $ \eta $ only incidentally affects the final classification accuracy.
The "primary purpose [of the learning rate] is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big."

### Number of Epochs

We can use the process of early stopping to find the optimal number of epochs to train.
Early stopping means computing the classification accuracy on the validation data after each epoch and stopping training when that accuracy stops improving.
To be more precise, we say a network has *stopped improving* when its accuracy hasn't improved in the last ten epochs, because we don't want to stop prematurely.
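
The no-improvement-in-ten rule can be sketched as a function over a list of per-epoch validation accuracies. The function name, the `patience` parameter, and the sample accuracies are hypothetical; in practice the check would run inside the training loop.

```python
def early_stopping_epoch(validation_accuracies, patience=10):
    """Return the epoch index at which training would stop.

    Training stops once `patience` consecutive epochs have passed without
    beating the best validation accuracy seen so far. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("-inf")
    epochs_since_best = 0
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best:
            best = accuracy
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return len(validation_accuracies) - 1

# Made-up accuracies with a short patience so the rule triggers quickly.
stop_at = early_stopping_epoch([90, 92, 93, 93, 92, 93, 91], patience=3)
print(stop_at)
```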

This approach is good for preliminary exploration of a network, but it may be too aggressive for some networks.

### Learning Rate Schedule

A learning rate schedule allows us to vary the learning rate $ \eta $ as the epochs progress.
When a network begins to learn, the weights are usually badly wrong, but then as the network improves the weights start to get better.
To optimize the learning rate for this kind of behavior, the learning rate should initially be large then it should decrease as the network improves.

One approach is to follow the early stopping method: hold $ \eta $ constant until the validation accuracy starts to get worse, then decrease $ \eta $ by some amount (maybe a factor of two or ten).
This schedule allows many different possibilities, which can be difficult to choose from.
Initially, use a single constant value for the learning rate, then experiment with a learning schedule.
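
One possible schedule of this kind is sketched below, assuming the learning rate is divided by a fixed factor whenever the validation accuracy has gone a few epochs without a new best. The function name, `patience`, `factor`, and the sample accuracies are illustrative assumptions.

```python
def learning_rate_schedule(validation_accuracies, eta0, patience=3, factor=2.0):
    """Learning rate in effect for each epoch, given per-epoch validation accuracy.

    Whenever `patience` epochs pass without a new best accuracy, the rate is
    divided by `factor` and the counter restarts.
    """
    eta = eta0
    best = float("-inf")
    stale = 0
    rates = []
    for accuracy in validation_accuracies:
        rates.append(eta)              # rate that was used for this epoch
        if accuracy > best:
            best, stale = accuracy, 0
        else:
            stale += 1
            if stale >= patience:
                eta /= factor
                stale = 0
    return rates

rates = learning_rate_schedule([90, 92, 92, 91, 93, 93, 93], eta0=0.5, patience=2)
print(rates)
```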

### Regularization Parameter

To begin, set the regularization parameter $ \lambda = 0 $ so that you can find an appropriate $ \eta $.
Once an adequate $ \eta $ is found, start with $ \lambda = 1.0 $ and then increase or decrease by factors of $ 10 $.
After this, re-optimize $ \eta $.

### Mini-batch Size

*Online learning* is when you have a mini-batch size of $ 1 $.
Naively it may seem that *online learning* is optimal, but depending on your linear algebra library and hardware, it could actually be faster to compute a mini-batch of size $ 100 $ than of size $ 1 $ because it is possible to compute the gradient update for all examples in a mini-batch simultaneously.
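
The idea of handling a whole mini-batch at once can be shown for the feedforward step of one layer. This sketch uses random weights and inputs with the layer sizes from this chapter; it only checks that the matrix form gives the same activations as looping over examples, and makes no claim about actual timings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = rng.standard_normal((30, 784))
b = rng.standard_normal((30, 1))
batch = [rng.random((784, 1)) for _ in range(100)]   # 100 made-up inputs

# One example at a time: 100 separate matrix-vector products.
one_by_one = np.hstack([sigmoid(w @ x + b) for x in batch])

# Whole mini-batch at once: stack the inputs as columns of a 784 x 100 matrix.
X = np.hstack(batch)
all_at_once = sigmoid(w @ X + b)      # b broadcasts across the 100 columns
```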

The issue with choosing a mini-batch size is that if it is too small you don't take advantage of the benefits of good matrix libraries, and if it is too large you are not updating your weights often enough.
The good news is that the other hyper-parameters don't heavily influence the effect of the mini-batch size (the other parameters don't need to be optimized in order to find a good mini-batch size).

As you scale $ \eta $, plot the validation accuracy versus *time* (elapsed, not epoch!) for multiple mini-batch sizes. 
Choose the mini-batch size that gives you the most rapid improvement in performance.

### Automated Techniques

Hand tuning hyper-parameters is a good exercise in learning the idiosyncrasies of neural networks and their behavior.
However, this can be very time consuming.
One approach to automated hyper-parameter tuning is a *grid search*, which will systematically search through a grid in hyper-parameter space.
Another method is [hyperopt](https://github.com/jaberg/hyperopt), which is a Bayesian approach to automatically optimize hyper-parameters.
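
A grid search can be sketched in a few lines. Here `grid_search` and `toy_score` are hypothetical names; `toy_score` is a stand-in for "train a network with these hyper-parameters and return its validation accuracy", which would be far too slow to run as an inline example.

```python
import itertools

def grid_search(evaluate, grid):
    """Try every combination in `grid` and return (best_params, best_score).

    evaluate -- callable taking keyword hyper-parameters and returning a score
                where larger is better (for example validation accuracy)
    grid     -- dict mapping hyper-parameter name to a list of candidate values
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(eta, lmbda):
    """Made-up score that peaks at eta = 0.5 and lmbda = 5.0."""
    return -((eta - 0.5) ** 2) - ((lmbda - 5.0) ** 2)

best_params, best_score = grid_search(
    toy_score, {"eta": [0.05, 0.5, 5.0], "lmbda": [0.5, 5.0, 50.0]}
)
print(best_params, best_score)
```

The cost of a grid search grows multiplicatively with the number of values tried per hyper-parameter, which is one reason to start from the coarse, order-of-magnitude grids used earlier in this section.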