# Batch Normalization

<img src="https://api.ning.com/files/EXPL4V-n0-S-UQNnNq6bext-hycLoK-u6aEjnY7J2UyCWgn3eFDfbFC0T*6wIFSowUo2bxbUThjv1YqkRXddrKjFeLP8ZXqE/N2.jpg" width="400" height="50">


## Introduction

The purpose of Batch Normalization is to reduce __Internal Covariate Shift__: the change in the distribution of each layer's inputs during training, caused by the parameters of the preceding layers constantly changing. In practice, Batch Normalization makes it possible to use higher learning rates and to be less careful about weight initialization.
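
As a rough illustration of the transform itself, the following NumPy sketch normalizes each feature of a mini-batch to zero mean and unit variance and then applies a learned scale and shift. The function name `batch_norm_forward` and the parameters `gamma`, `beta`, and `eps` are illustrative choices, not names from this notebook, and the sketch covers only the training-time forward pass (no running statistics).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), feature by feature."""
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # scale and shift (learned in a real layer)

# Synthetic mini-batch: 60 examples, 100 features, deliberately off-center
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(60, 100))
y = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
```

With `gamma` set to ones and `beta` to zeros, every column of `y` ends up approximately standardized, regardless of how the incoming distribution has shifted.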

In [19]:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import BatchNormalization
from keras.initializers import RandomNormal
from keras import metrics
from keras.utils import np_utils
from keras.callbacks import Callback
from keras import backend as K

import numpy as np

## Experiment 1: Activations Over Time

### Data Set 

The experiment uses the MNIST dataset of handwritten digits (28x28 grayscale images, 10 classes).

### Neural Network Architecture
1. 3 fully connected hidden layers
2. 100 activations per hidden layer
3. Each hidden layer uses a sigmoid nonlinearity
4. Weights are initialized to small random Gaussian values
5. The last hidden layer is followed by a fully connected layer with 10 activations (one per class) and a cross-entropy loss

### Training
The network is trained for 50,000 steps, with 60 examples per mini-batch.
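
Because Keras' `fit` counts epochs rather than steps, it can help to convert the step budget into full passes over the data. The arithmetic below is a small sketch that assumes the full 60,000-image training split returned by `mnist.load_data()` is used; the variable names are illustrative.

```python
TRAIN_EXAMPLES = 60_000  # size of the standard MNIST training split (assumed fully used)
NUM_STEPS      = 50_000  # mini-batch updates, as stated above
BATCH_SIZE     = 60      # examples per mini-batch

steps_per_epoch   = TRAIN_EXAMPLES // BATCH_SIZE   # updates in one pass over the data
equivalent_epochs = NUM_STEPS // steps_per_epoch   # passes needed to reach NUM_STEPS
examples_seen     = NUM_STEPS * BATCH_SIZE         # total examples processed
```

Under that assumption, 50,000 steps of 60 examples correspond to 50 epochs.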

### Experimental Observation 
The baseline network (without Batch Normalization) is compared against the same network with Batch Normalization applied at every layer.

### Graphs
1. Test Accuracy of the MNIST Network trained with and without Batch Normalization vs. the number of training steps.
2. The evolution of the input distribution to a typical sigmoid over the course of training, shown as its 15th, 50th, and 85th percentiles.
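
One way to produce the data for the second graph is to record, at regular intervals, the percentiles of the values fed into a single sigmoid unit. The sketch below uses synthetic, deliberately drifting inputs in place of real layer inputs; the helper `input_percentiles` and the choice of unit 0 are hypothetical.

```python
import numpy as np

def input_percentiles(pre_activations, unit=0, qs=(15, 50, 85)):
    """Return the requested percentiles of one unit's inputs over a mini-batch."""
    return np.percentile(pre_activations[:, unit], qs)

rng = np.random.default_rng(0)
history = []
for step in range(3):
    # Synthetic stand-in for a layer's inputs: mean and spread drift with training
    z = rng.normal(loc=0.5 * step, scale=1.0 + step, size=(60, 100))
    history.append(input_percentiles(z))

history = np.array(history)  # shape (steps, 3): 15th, 50th, 85th percentile columns
```

Plotting the three columns of `history` against the training step gives one band per percentile; in a real run, `z` would come from the inputs to a hidden layer's activation.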

### Preliminaries


In [15]:
# Setting the seed for reproducibility
seed = 7
np.random.seed( seed )

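# Initializing Hyperparameters
NUM_EPOCHS  = 50000  # number of training steps (mini-batch updates) described above, not full passes over the data
BATCH_COUNT = 60     # number of examples per mini-batch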

# Getting the MNIST Data
( X_train, y_train ), ( X_test, y_test ) = mnist.load_data()
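
A fully connected network trained with a categorical cross-entropy loss generally expects flat input vectors and one-hot labels, while `mnist.load_data()` returns 28x28 `uint8` images and integer labels. The sketch below shows one possible preprocessing step; the helper `preprocess` is hypothetical, and synthetic arrays with MNIST-like shapes stand in for the real data.

```python
import numpy as np

def preprocess(images, labels, num_classes=10):
    """Flatten 28x28 uint8 images to 784-vectors in [0, 1] and one-hot encode labels."""
    x = images.reshape(len(images), -1).astype("float32") / 255.0
    y = np.eye(num_classes, dtype="float32")[labels]
    return x, y

# Synthetic stand-in with the same shapes and dtypes as the MNIST arrays
images = np.random.randint(0, 256, size=(8, 28, 28), dtype=np.uint8)
labels = np.array([0, 1, 2, 3, 4, 5, 6, 9], dtype=np.uint8)
x, y = preprocess(images, labels)
```

The same function could be applied to `X_train`, `y_train` and `X_test`, `y_test` before fitting.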

### Model Creation

In [26]:
def create_mnist_model( apply_batchnormalization = False ): 
    '''Creates the model for the first experiment, with Batch Normalization
       optionally enabled after every layer.
    '''

    mnist_model = Sequential()
    
    # Input Layer (expects each 28x28 image flattened to a 784-vector)
    mnist_model.add( Dense( units = 1000,
                            input_shape = ( 784, ),
                            activation  = K.sigmoid
                            ))
    
    if ( apply_batchnormalization ):
        mnist_model.add( BatchNormalization() )

    # 3 Hidden Layers
    for i in range( 3 ):
        mnist_model.add( Dense( units              = 100,
                                activation         = K.sigmoid,
                                kernel_initializer = RandomNormal()))
        if ( apply_batchnormalization ):
            mnist_model.add( BatchNormalization() )
            
        
    # Output Layer with 10 Units for each digit and a Softmax Activation
    mnist_model.add(Dense( units = 10 , activation= K.softmax ))
    
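    # Cross-entropy loss as listed in the architecture; plain SGD is an
    # assumption, since the optimizer is not specified in this notebook.
    mnist_model.compile( loss      = 'categorical_crossentropy',
                         optimizer = 'sgd',
                         metrics   = [ metrics.categorical_accuracy ])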
    
    return mnist_model

### Model Fitting

## Experiment 2: ImageNet Classification