## Regularisation
Regularisation is a set of methods for getting neural networks to <em>generalise</em> rather than <em>memorise</em>. The goal of regularisation is the combat overfitting in neural networks. A neural network that overfits is a neural network that has learned the <em>noise</em> in the dataset rather than the <em>true</em> signal.

In maths, stats and computer science, regularisation generally means to 'add information in order to solve an ill-posed problem'.

In [14]:
import sys
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0 : 1000].reshape(1000, 28 * 28) / 255, y_train[0 : 1000])

one_hot_labels = np.zeros((len(labels), 10))

for i, label in enumerate(labels):
    one_hot_labels[i][label] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28 * 28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, label in enumerate(y_test):
    test_labels[i][label] = 1

    
relu = lambda x: (x > 0) * x
reluDerivative = lambda x: x >= 0

alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

np.random.seed(1)
weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    error, correct_count = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i : i + 1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        error += np.sum((labels[i : i + 1] - layer_2) ** 2)
        correct_count += int(np.argmax(layer_2) == np.argmax(labels[i : i + 1]))
        layer_2_delta = labels[i : i + 1] - layer_2
        layer_1_delta = np.dot(layer_2_delta, weights_1_2.T) * reluDerivative(layer_1)
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    sys.stdout.write("\r"+ \
                     " I:" + str(j)+ \
                     " Error:" + str(error / float(len(images)))[0:5] +\
                     " Correct:" + str(correct_count / float(len(images))))

 I:349 Error:0.108 Correct:1.099

Now that training is done, we can test that it generalised well enough to handle the unseen dataset: 

In [19]:
if (j % 10 == 0 or j == iterations - 1):
    error, correct_count = (0.0, 0)
    for i in range(len(test_images)):
        layer_0 = test_images[i : i + 1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        error += np.sum((layer_2 - test_labels[i : i + 1]) ** 2)
        correct_count += int(np.argmax(layer_2) == np.argmax(test_labels[i : i + 1]))
    sys.stdout.write(" Test-Err:" + str(error / float(len(test_images)))[0:5] + \
                     " Test-Acc:" + str(correct_count / float(len(test_images))))
    print()

 Test-Err:0.653 Test-Acc:0.7073


<em>Test accuracy</em> &mdash; the accuracy of the predictions made by the neural network on the training dataset.

### Example of Overfitting:
The following snippet trains the neural network on varying number of iterations and logs both the train accuracy and the test accuracy. Note that this takes a while to complete.

In [36]:
import sys
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0 : 1000].reshape(1000, 28 * 28) / 255, y_train[0 : 1000])

one_hot_labels = np.zeros((len(labels), 10))

for i, label in enumerate(labels):
    one_hot_labels[i][label] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28 * 28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, label in enumerate(y_test):
    test_labels[i][label] = 1

    
relu = lambda x: (x > 0) * x
reluDerivative = lambda x: x >= 0

alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

np.random.seed(1)
weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

for k in [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]:
    for j in range(k):
        error, correct_count = (0.0, 0)
        for i in range(len(images)):
            layer_0 = images[i : i + 1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)
            error += np.sum((labels[i : i + 1] - layer_2) ** 2)
            correct_count += int(np.argmax(layer_2) == np.argmax(labels[i : i + 1]))
            layer_2_delta = labels[i : i + 1] - layer_2
            layer_1_delta = np.dot(layer_2_delta, weights_1_2.T) * reluDerivative(layer_1)
            weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    test_error, test_correct_count = (0.0, 0)
    for i in range(len(test_images)):
        layer_0 = test_images[i : i + 1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        test_error += np.sum((layer_2 - test_labels[i : i + 1]) ** 2)
        test_correct_count += int(np.argmax(layer_2) == np.argmax(test_labels[i : i + 1]))
    
    print("\r"+ \
          " Iteration: {}".format(j + 1) + \
          " | Train Error: " + str(error / float(len(images)))[0:5] + \
          " | Train Acc: " + str(correct_count / float(len(images))) + \
          " | Test Error: " + str(test_error / float(len(test_images)))[0:5] + \
          " | Test Acc: " + str(test_correct_count / float(len(test_images))))
          
    
    
    


 Iteration: 1 | Train Error: 0.722 | Train Acc:0.537 | Test Error: 0.601 | Test Acc:0.6488
 Iteration: 10 | Train Error: 0.312 | Train Acc:0.901 | Test Error: 0.420 | Test Acc:0.8114
 Iteration: 20 | Train Error: 0.232 | Train Acc:0.946 | Test Error: 0.417 | Test Acc:0.8066
 Iteration: 30 | Train Error: 0.194 | Train Acc:0.967 | Test Error: 0.448 | Test Acc:0.7921
 Iteration: 40 | Train Error: 0.166 | Train Acc:0.984 | Test Error: 0.482 | Test Acc:0.7706
 Iteration: 50 | Train Error: 0.145 | Train Acc:0.991 | Test Error: 0.513 | Test Acc:0.7558
 Iteration: 60 | Train Error: 0.127 | Train Acc:0.998 | Test Error: 0.544 | Test Acc:0.7446
 Iteration: 70 | Train Error: 0.115 | Train Acc:0.999 | Test Error: 0.600 | Test Acc:0.723
 Iteration: 80 | Train Error: 0.108 | Train Acc:0.999 | Test Error: 0.662 | Test Acc:0.704
 Iteration: 90 | Train Error: 0.105 | Train Acc:0.998 | Test Error: 0.712 | Test Acc:0.6932
 Iteration: 100 | Train Error: 0.104 | Train Acc:0.999 | Test Error: 0.745 | Test A

Notice how the test accuracy peaks within the first 20 or so iterations, and then declines as the number of iterations grows. This happens while training accuracy converges to 100% and the training error converges to 0.

How would we get neural networks to train only on the true signal rather than overfit to the noise in the training dataset? A simple strategy is <em>early stopping</em> &mdash; just not letting the network train long enough to overfit the training dataset. To determine when to stop, we run the model against <em>validation datasets</em> and stop at the peak accuracy.



### Dropout Regularisation:
A regularisation technique that involves randomly setting node values to zero (as well as their associated delta). This is simple but happens to be a state-of-the-art technique for combating overfitting. From a very high level, it works by training smaller subsections of the neural network &mdash; which don't overfit as easily (simply because smaller neural networks have less expressive power and can only capture prominent features). The randomness will mean that many different subsections will be targeted. A bigger neural network tends to overfit because it's able to capture, to a greater degree, the finer details of each sample in the training data which prevents it from generalising (the finer details often part of the noise).

An interesting note: any two intially randomised neural networks will learn differently. It's quite unlikely that because of the initial random configuration of their weights that any two neural networks will learn to overfit to <em>same</em> noise.
Now, suppose we have 100 neural networks each with randomised weights. It happens that when neural networks train, they will capture the broader signal first, and then start placing greater weights on the noise. With 100 different neural networks capturing different noisy aspects, taking the average of them will tend to cancel out their mistakes and leave the common signal.


In [None]:
import sys
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0 : 1000].reshape(1000, 28 * 28) / 255, y_train[0 : 1000])

one_hot_labels = np.zeros((len(labels), 10))

for i, label in enumerate(labels):
    one_hot_labels[i][label] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28 * 28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, label in enumerate(y_test):
    test_labels[i][label] = 1

def relu(x):
    return (x >= 0) * x

def relu2deriv(x):
    return (x > 0)

alpha, iterations, hidden_size = (0.005, 300, 100)
pixels_per_image, num_labels = (784, 10)

np.random.seed(1)
weights_0_1 = 0.2 * np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0,0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        # NEW CODE: dropout_mask is a randomised matrix of 1s and 0s (50% chance for either 1 or 0 at each entry)
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        # NEW CODE: switching off some nodes in layer_1 by applying the dropout_mask
        # We're multiplying by 2 because we are turning off ~50% of the nodes in layer_1. If we don't, the layer_2
        # prediction will be approximately halved. This means that layer_2_delta will be larger in magnitude. 
        # Multiplying by 2 will cheaply escape this effect.
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1,weights_1_2)
        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))
        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    if(j % 10 == 0):
        test_error = 0.0
        test_correct_cnt = 0
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0,weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)
            test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
    
        sys.stdout.write("\n" + \
            "I:" + str(j) + \
            " Test-Err:" + str(test_error/ float(len(test_images)))[0:5] +\
            " Test-Acc:" + str(test_correct_cnt/ float(len(test_images)))+\
            " Train-Err:" + str(error/ float(len(images)))[0:5] +\
            " Train-Acc:" + str(correct_cnt/ float(len(images))))



I:0 Test-Err:0.638 Test-Acc:0.6325 Train-Err:0.902 Train-Acc:0.395
I:10 Test-Err:0.371 Test-Acc:0.8314 Train-Err:0.392 Train-Acc:0.819
I:20 Test-Err:0.316 Test-Acc:0.8551 Train-Err:0.320 Train-Acc:0.884
I:30 Test-Err:0.307 Test-Acc:0.8661 Train-Err:0.276 Train-Acc:0.918
I:40 Test-Err:0.293 Test-Acc:0.8721 Train-Err:0.259 Train-Acc:0.931
I:50 Test-Err:0.281 Test-Acc:0.8749 Train-Err:0.234 Train-Acc:0.94
I:60 Test-Err:0.290 Test-Acc:0.8791 Train-Err:0.229 Train-Acc:0.957
I:70 Test-Err:0.295 Test-Acc:0.8758 Train-Err:0.226 Train-Acc:0.958
I:80 Test-Err:0.290 Test-Acc:0.8782 Train-Err:0.210 Train-Acc:0.961
I:90 Test-Err:0.285 Test-Acc:0.8809 Train-Err:0.204 Train-Acc:0.95
I:100 Test-Err:0.277 Test-Acc:0.8808 Train-Err:0.192 Train-Acc:0.97
I:110 Test-Err:0.284 Test-Acc:0.8786 Train-Err:0.183 Train-Acc:0.975
I:120 Test-Err:0.277 Test-Acc:0.8809 Train-Err:0.188 Train-Acc:0.97
I:130 Test-Err:0.281 Test-Acc:0.8784 Train-Err:0.187 Train-Acc:0.963
I:140 Test-Err:0.289 Test-Acc:0.8777 Train-Err:0

#### Differences after implementing dropout:
- Testing accuracy doesn't peak and drop as bad as it did without any regularisation
- Traning accuracy takes longer to converge to 100%
