In [1]:
# default_exp reg
# default_cls_lvl 2

# Learning Signal & Ignoring Noise: Introduction to Regularization & Batching
> 

In this Chapter:
- Overfitting
- Dropout
- Batch Gradient Descent

> "With four Parameters I can fit an Elephant, & with five I can make him wiggle his trunk." — John von Neumann, Mathematician, Physicist, Computer Scientist, & Polymath.

## 3-Layer Network on MNIST
### Let's return to the MNIST Dataset & Attempt to Classify it with the New Network

- How do you know the network is creating good correlation?
- If we froze one weight, train the network until the error was sufficiently small, than unfroze the weight in an attempt to optimize against it, it won't change because the network had already learnt the existing correlation in the data.
- **What if the network had figured out a way to accurately predict the games in the training dataset, but it somehow forgot to include a valuable input?**
- **Overfitting is extremely common in Neural Networks**.
- The More Powerful the Neural Networks expressive power (more layers & weights), the more prone the network is to overfit.
- We're going to study the basics of **Regularization**, which is key to combatting overfitting in neural networks.
- We're going to train our latest & greatest neural network with 3 layers on the MNIST Dataset:

In [114]:
import sys
import numpy as np
import keras

In [115]:
from keras import datasets

In [116]:
(X_train, y_train), (X_test, y_test) = datasets.mnist.load_data()

In [117]:
X_train.shape, y_train.shape

((60000, 28, 28), (60000,))

In [118]:
X, y = X_train[:1000].reshape(1000, 28*28)/255., y_train[:1000]; X.shape

(1000, 784)

In [119]:
# one-hot encoding y
y_ = np.zeros((y.shape[0], 10))

In [120]:
for index, value in enumerate(y):
    y_[index][value] = 1
y = y_

In [121]:
# same for test
X_test, y_test = X_test[:1000].reshape(1000, 28*28)/255., y_test[:1000]
one_hot_y_test = np.zeros((y_test.shape[0], 10))
for index, value in enumerate(y_test):
    one_hot_y_test[index][value] = 1

In [69]:
def softmax(v):
    bottom = np.sum([np.exp(v_i) for v_i in v])
    return np.array([np.exp(v_i)/bottom for v_i in v])
relu = lambda x: (x >= 0) * x
grad_relu = lambda x: (x >= 0)
sigmoid = lambda x: 1/(1+np.exp(-x)) 
lr, epochs, hidden_size, total_pixels, num_labels = 0.005, 300, 100, 784, 10

In [70]:
# Initializing two weight matrices
W_1 = 0.2*np.random.random((total_pixels,hidden_size)) - 0.1
W_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

In [69]:
for j in range(epochs):
    error, correct_count = (0.0, 0)
    
    for i in range(len(X)):
        layer_0 = X[i:i+1]
        layer_1 = relu(layer_0.dot(W_1))
        layer_2 = softmax(layer_1.dot(W_2))
        
        correct_count += int(np.argmax(layer_2) == np.argmax(one_hot_y[i:i+1]))
        
        layer_2_delta = (layer_2 - one_hot_y[i:i+1])
        layer_1_delta = layer_2_delta.dot(W_2.T)*grad_relu(layer_1)
        
        error += np.sum(layer_2_delta ** 2)
        
        W_2 -= lr * layer_1.T.dot(layer_2_delta) 
        W_1 -= lr * layer_0.T.dot(layer_1_delta)
        
    print("\r"+" I:"+str(j)+" Error:" + str(error/float(len(X)))[0:5] + " Correct:" + str(correct_count/float(len(X))))

 I:0 Error:0.802 Correct:0.501
 I:1 Error:0.451 Correct:0.766
 I:2 Error:0.270 Correct:0.837
 I:3 Error:0.203 Correct:0.88
 I:4 Error:0.168 Correct:0.891
 I:5 Error:0.145 Correct:0.912
 I:6 Error:0.128 Correct:0.921
 I:7 Error:0.115 Correct:0.929
 I:8 Error:0.104 Correct:0.936
 I:9 Error:0.095 Correct:0.945
 I:10 Error:0.086 Correct:0.952
 I:11 Error:0.079 Correct:0.96
 I:12 Error:0.072 Correct:0.962
 I:13 Error:0.065 Correct:0.967
 I:14 Error:0.059 Correct:0.972
 I:15 Error:0.053 Correct:0.974
 I:16 Error:0.048 Correct:0.977
 I:17 Error:0.043 Correct:0.984
 I:18 Error:0.038 Correct:0.986
 I:19 Error:0.034 Correct:0.99
 I:20 Error:0.030 Correct:0.992
 I:21 Error:0.026 Correct:0.993
 I:22 Error:0.023 Correct:0.996
 I:23 Error:0.020 Correct:0.996
 I:24 Error:0.017 Correct:0.996
 I:25 Error:0.014 Correct:0.997
 I:26 Error:0.012 Correct:1.0
 I:27 Error:0.010 Correct:1.0
 I:28 Error:0.009 Correct:1.0
 I:29 Error:0.008 Correct:1.0
 I:30 Error:0.007 Correct:1.0
 I:31 Error:0.006 Correct:1.0
 

## Well, That was easy!
### The Neural Network learned to predict all 1,000 images

- We've reached 100% accuracy on the sample of 1000 images.
- But how well will it do on an image that wasn't part of the original sample of 1,000 images?
- Let's Evaluate the network on the test set:

In [70]:
error, correct_count = (0.0, 0)

for i in range(len(X_test)):
    layer_0 = X_test[i:i+1]
    layer_1 = relu(layer_0.dot(W_1))
    layer_2 = softmax(layer_1.dot(W_2))
    
    correct_count += int(np.argmax(layer_2) == np.argmax(one_hot_y_test[i:i+1]))
    layer_2_delta = (layer_2 - one_hot_y_test[i:i+1])
    error += np.sum(layer_2_delta ** 2)

print("Error:" + str(error/float(len(X_test)))[0:5] + " Correct:" + str(correct_count/float(len(X_test))))

Error:0.292 Correct:0.823


- Book's Results is:
    - Error: .653
    - Correct: .7073
- I think that my results are better due to the use of the softmax function on the last layer.

- **The Network did horribly!**
- Why does it do so terribly on these new testing images when it learned to predict with a 100% accuracy on the training set?
- This **0.823** is called the **test accuracy**.
- This number is important because it simulates how the neural network will do in production (the real world).
    - **This is the score that matters, the test accuracy.**

## Memorization vs. Generalizatio
### Memorizing 1,000 is easier than generalizing to all images

- Let's rewind on how a neural network actually learns:
    - It Adjusts each number on each weight matrix so the overall error is minimized.
- **If we trained the NN to predict labels on 1,000 images, which it did perfectly, why does it work on other images at all**?
- The NN is guaranteed to work well on a new image only if the new image is nearly identical images in the training data set.
    - Because the NN learned to transform the input data to output data for a specific data set with a specific overall configuration
- If The NN works only on nearly identical data points (in comparison to training set), then what's the purpose of it anyway.
    - **We want a Neural Network that can work well on images different from the training set data points**
    - That's what we call: **Generalization**

## Overfitting in Neural Networks
### Neural Networks can get worse if you train them too much!

- For some reason, the test accuracy went up for the first 20 iterations & then slowly decreased as the network trained more and more.
- This is common in neural networks.
- Overfitting is over-optimizing for the training data points,
    - **Just like when you're molding a material for 3 forks you keep molding them until you get a very specific shape that works for all 3 but has nothing to do with the shape of a general-purpose fork.
- **You can Visualize Weights as High Dimensional Shapes**.
- As you train, this shapes molds around the shape of the data, learning one pattern after another.
- A more official definition of a neural network that overfits:
    - A Neural Network that has **learned the noise** in the dataset.

## Where Overfitting Comes from?
### What causes neural networks to overfit?

- Consider these two dog pictures:

<div style="text-align:center;"><img style="width:50%" src="static/imgs/08/Dogs.png" /></div>

- **Everything** that makes these pictures **unique** **beyond** what captures **the essense of "Dog" is** included in the term **"Noise"**.
- On the Left: The Pillow & the background are both Noise.
- On the Right: the Blackness can also be considered Noise.
    - It's really the edges that tells you it's a Dog.
- How do you get NNs to train only on the Signal (the essense of a dog) & Ignore the Noise?
    - One Way is **Early Stopping**.
        - It turns out a large amount of noise comes in the fine grained details of an image.
        - & most of the signal is found on the general shape & perhaps color of the image.

## The Simplest Regularization: Early Stopping
### Stop training the network what it starts getting worse

- **You don't let the network train long enough to learn the details.**
- **Early Stopping** is the cheapest form of regularization, and you're in a pinch, it can be quite effective.
- **Regularization** is a subset of methods for getting a model to generalize to new data points.
    - Instead of just memorizing the training data.
- It's a subset of methods that helps the neural network learn the signal and ignore the noise.
    - Often done by **increasing the difficulty** for a model to learn the fine-grained details in teh training data.
- You know how to stop by using the **validation set** while training, and stop when the validation score gets worse.

## Industry Standard Regularization: Dropout
### The Method: Randomly turn off neurons (setting them to 0) during training

- This causes the neural network to train execlusively using **random subsections of the neural network**.
- This Regularization method is generally accepted as the go-to, state-of-the-art regularization technique for the vast majority of networks.
- Its methodology is simple & inexpensive, although the intuitions behind why it works are a bit more complex.
- **Why Dropout Works?**
    - **Dropout makes a big network train like a little one by randomly training little subsections of the network at a time.
        - **and little networks don't overfit**.
- The Smaller a Neural Network is, the less it's able to overfit.
    - **Because small neural network have a smaller number of weights, meaning the network's hypothesis space is small**
    - **Because small neural networks don't have much expressive power**.
- Small Neural Networks have enough room to only capture the big, obvious, high-level features.
- The Notion of room/capacity is very important to keep in your mind.
- Remember the Clay analogy?
    - **Imagine if the clay was made of sticky rocks the size of dimes**.
        - Those stones are much like **weights**.
    - **Now Imagine a clay made of millions & millions of small stones.**
        - This is a big Model.
- How do you get the power of a large neural network with the resistance to overfitting of the small neural network?
    - **Dropout**.

## Why Dropout works? Ensembling works?
### Dropout is a form of training a bunch of networks and averaging them

- Although it's likely that large, unregularized neural networks will overfit to noise, It's unlikely they will overfit to the same noise.
    - Because they start randomly.
- Neural networks, even though they're randomly generated, still start by learning the biggest, most broadly sweeping features before learning much about the noise
- If you allowed a bunch of overfitted neural networks to vote equally, their noise will tend to cancel out, revealing only what they all learned in common ..
    - **The Signal**.

## Dropout in Code
### Here's how to use dropout in the real world

In [121]:
for j in range(epochs):
    error, correct_count = (0.0, 0)
    
    for i in range(len(X)):
        layer_0 = X[i:i+1]
        layer_1 = relu(layer_0.dot(W_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = layer_1.dot(W_2)
    
        correct_count += int(np.argmax(layer_2) == np.argmax(one_hot_y[i:i+1]))
        
        layer_2_delta = (layer_2 - one_hot_y[i:i+1])
        layer_1_delta = layer_2_delta.dot(W_2.T)*grad_relu(layer_1)
        layer_1_delta *= dropout_mask
        
        error += np.sum(layer_2_delta ** 2)
        
        W_2 -= lr * layer_1.T.dot(layer_2_delta) 
        W_1 -= lr * layer_0.T.dot(layer_1_delta)
    
    if j % 10 == 0:
        test_error, test_correct_count = 0, 0
        
        for i in range(len(X_test)):
            layer_0 = X_test[i:i+1]
            layer_1 = relu(layer_0.dot(W_1))
            layer_2 = softmax(layer_1.dot(W_2))

            test_correct_count += int(np.argmax(layer_2) == np.argmax(one_hot_y_test[i:i+1]))
            layer_2_delta = (layer_2 - one_hot_y_test[i:i+1])
            error += np.sum(layer_2_delta ** 2)
        
    
        print("\n" + "I:" + str(j) + " Test-Err:" + str(test_error/ float(len(X_test)))[0:5] + " Test-Acc:" + str(test_correct_count/ float(len(X_test)))+ " Train-Err:" + str(error/ float(len(X)))[0:5] + " Train-Acc:" + str(correct_count/ float(len(X))))



I:0 Test-Err:0.0 Test-Acc:0.6 Train-Err:1.764 Train-Acc:0.386

I:10 Test-Err:0.0 Test-Acc:0.745 Train-Err:1.286 Train-Acc:0.765

I:20 Test-Err:0.0 Test-Acc:0.776 Train-Err:1.228 Train-Acc:0.81

I:30 Test-Err:0.0 Test-Acc:0.794 Train-Err:1.208 Train-Acc:0.822

I:40 Test-Err:0.0 Test-Acc:0.804 Train-Err:1.174 Train-Acc:0.843

I:50 Test-Err:0.0 Test-Acc:0.825 Train-Err:1.162 Train-Acc:0.854

I:60 Test-Err:0.0 Test-Acc:0.814 Train-Err:1.145 Train-Acc:0.866

I:70 Test-Err:0.0 Test-Acc:0.814 Train-Err:1.144 Train-Acc:0.882

I:80 Test-Err:0.0 Test-Acc:0.807 Train-Err:1.134 Train-Acc:0.864

I:90 Test-Err:0.0 Test-Acc:0.823 Train-Err:1.133 Train-Acc:0.869

I:100 Test-Err:0.0 Test-Acc:0.814 Train-Err:1.133 Train-Acc:0.866

I:110 Test-Err:0.0 Test-Acc:0.809 Train-Err:1.125 Train-Acc:0.893

I:120 Test-Err:0.0 Test-Acc:0.817 Train-Err:1.127 Train-Acc:0.879

I:130 Test-Err:0.0 Test-Acc:0.801 Train-Err:1.125 Train-Acc:0.888

I:140 Test-Err:0.0 Test-Acc:0.796 Train-Err:1.105 Train-Acc:0.896

I:150 Te

- A dropout mask uses what's called a 50% bernoulli distribution.
- You multiply `layer_1` by 2, why do you do this ?
    - Remember that `layer_2` will perform a weighted sum of `layer_1`
        - Even though it's weighted, It's still a sum over the values of `layer_1`.
    - If you turn off half of the nodes in `layer_1`, `layer_2` would increase its sensitivity to `layer_1`.
    - But @ test time, you no longer would need dropout, this would throw off `layer_2` sensitivity.

- After introducing Dropout, Not only the network peak at a score of 80%, it also doesn't overfit nearly as badly.
- Notice also that the dropout slows down training accuracy, it previously converged to 100% pretty fastly, now, it finishes at 90%.
- **Dropout is Noise**, we are introducing noise to the network to help it concentrate its training on the true signal and avoid memorizing data-point-specific noise.

## Batch Gradient Descent
### Here's a Method for increasing the Speed of training & the rate of convergence 

- Also called: mini-batched stochastic gradient descent.
- It's something that's largely taken for granted in neural network training.
- It's a simple concept that doesn't get more advanced even with the most state-of-the-art neural networks.
- Previously, we trained one training example each iteration.
- Now, let's train 100 training examples at a time, **averaging the weight updates among all 100 examples**.
- As it turns out, **individual training examples are very noisy** in terms of the weight updates they generate.
    - Thus, averaging them makes for a smoother learning process.
- Let's do this in code:

In [49]:
batch_size = 100
lr, epochs = .001, 300

- **prev & for every epoch, we predict a data point @ a time**
- **now & for every epoch, we predict a batch @ a time**

In [97]:
# let's do the book implementation from scratch while debugging.
import sys
import numpy as np
import keras

In [98]:
# getting the necessary data.
from keras import datasets

In [99]:
np.random.seed(1)

In [100]:
(X_train, y_train), (X_test, y_test) = datasets.mnist.load_data()

In [101]:
X, y = (X_train[:1000].reshape((1000, -1))/255.), y_train[:1000]
X_test = (X_test.reshape((X_test.shape[0], -1))/255.)

In [102]:
# one-hotting y
one_hots = np.zeros((y.shape[0], 10))
for i, _ in enumerate(one_hots):
    one_hots[i][y[i]] = 1
y = one_hots

In [103]:
# one-hotting y
one_hots = np.zeros((y_test.shape[0], 10))
for i, _ in enumerate(one_hots):
    one_hots[i][y_test[i]] = 1
y_test = one_hots

In [104]:
def relu(x):
    return (x >= 0) * x

def grad_relu(x):
    return x >= 0

In [105]:
batch_size = 100
lr, epochs = .001, 300
pixels_per_image, num_labels, hidden_size = 784, 10, 100

In [106]:
# Initialize Weights
W_1 = .2 * np.random.random((pixels_count, hidden_size)) - .1
W_2 = .2 * np.random.random((hidden_size, num_labels)) - .1

In [107]:
for j in range(epochs):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(X) / batch_size)):
        batch_start, batch_end = ((i * batch_size),((i+1)*batch_size))

        layer_0 = X[batch_start:batch_end]
        layer_1 = relu(np.dot(layer_0,W_1))
        dropout_mask = np.random.randint(2,size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1,W_2)

        error += np.sum((y[batch_start:batch_end] - layer_2) ** 2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(y[batch_start+k:batch_start+k+1]))

            layer_2_delta = (y[batch_start:batch_end]-layer_2)/batch_size
            layer_1_delta = layer_2_delta.dot(W_2.T)* grad_relu(layer_1)
            layer_1_delta *= dropout_mask

            W_2 += lr * layer_1.T.dot(layer_2_delta)
            W_1 += lr * layer_0.T.dot(layer_1_delta)
            
    if(j%10 == 0):
        test_error = 0.0
        test_correct_cnt = 0

        for i in range(len(X_test)):
            layer_0 = X_test[i:i+1]
            layer_1 = relu(np.dot(layer_0, W_1))
            layer_2 = np.dot(layer_1, W_2)

            test_error += np.sum((y_test[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(y_test[i:i+1]))

        sys.stdout.write("\n" + \
                         "I:" + str(j) + \
                         " Test-Err:" + str(test_error/ float(len(X_test)))[0:5] +\
                         " Test-Acc:" + str(test_correct_cnt/ float(len(X_test)))+\
                         " Train-Err:" + str(error/ float(len(X)))[0:5] +\
                         " Train-Acc:" + str(correct_cnt/ float(len(X))))


I:0 Test-Err:0.815 Test-Acc:0.3832 Train-Err:1.284 Train-Acc:0.165
I:10 Test-Err:0.568 Test-Acc:0.7173 Train-Err:0.591 Train-Acc:0.672
I:20 Test-Err:0.510 Test-Acc:0.7571 Train-Err:0.532 Train-Acc:0.729
I:30 Test-Err:0.485 Test-Acc:0.7793 Train-Err:0.498 Train-Acc:0.754
I:40 Test-Err:0.468 Test-Acc:0.7877 Train-Err:0.489 Train-Acc:0.749
I:50 Test-Err:0.458 Test-Acc:0.793 Train-Err:0.468 Train-Acc:0.775
I:60 Test-Err:0.452 Test-Acc:0.7995 Train-Err:0.452 Train-Acc:0.799
I:70 Test-Err:0.446 Test-Acc:0.803 Train-Err:0.453 Train-Acc:0.792
I:80 Test-Err:0.451 Test-Acc:0.7968 Train-Err:0.457 Train-Acc:0.786
I:90 Test-Err:0.447 Test-Acc:0.795 Train-Err:0.454 Train-Acc:0.799
I:100 Test-Err:0.448 Test-Acc:0.793 Train-Err:0.447 Train-Acc:0.796
I:110 Test-Err:0.441 Test-Acc:0.7943 Train-Err:0.426 Train-Acc:0.816
I:120 Test-Err:0.442 Test-Acc:0.7966 Train-Err:0.431 Train-Acc:0.813
I:130 Test-Err:0.441 Test-Acc:0.7906 Train-Err:0.434 Train-Acc:0.816
I:140 Test-Err:0.447 Test-Acc:0.7874 Train-Err:0

- The Concept behind this is the fact that we don't alter the network's parameters each data point prediction at a time.
    - we instead predict a whole batch, get an average delta and update all parameters.
        - This means we get rid of point level noise, by averging a batch of prediction, we have a better sense of what direction we should move the internal parameters.
- Notice that the learning rate is 20 times larger than before.
    - Because we are much more confident in the direction the weights should take to change.
- Because the example take an average of a noisy signal (the average weight change over 100 training examples), it can take bigger steps.
- You'll generally see **batching ranging from 8 to as high as 256**.
- Generally, Researchers take numbers randomly until they find a `batch_size` & `lr` pair that seems to work well.

- In the Following Chapters, we'll pivot from sets of tools that are universally applicable to nearly all neural networks, to special purpose architectures that are advantageous for modeling specific types of phenomena in data. 