### Dropout is a regularization techinque which can simply prevent overfitting. It drops out some nodes/neuron randomly during training. This helps in avoiding the network to closely align with the input samples(overfitting).  we can even call the dropout as ensemble methods or bagging

### What is Ensemble Methods or Bagging? Why we call Dropout is one of them?

Bagging or Ensemble is an idea to train several different models independent of each other and vote on all model outputs to choose the prediction. 

### How come Ensemble Methods generalize to the test set ?

Before answering this question, let's define how to choose a different model.

1) Using a Diffirent alogrithms or different hyper parameters
2) Using a different constructed datasets(a subsets) from original datasets

As per my exploration., the point 2 provides better generalization but there is no proper definition by it. The objective of point 2 here is to choose different subsets of samples constructed from orginal dataset of same size which means there is a high probability that each dataset missing some examples from original dataset and contains several duplicate samples.Remembers One classic example given by ian goodfellow in his deep learning book. say we need to predict the number 8, Model-1 with distribution 8,6,8 where it learns circle/loop on top is number 8. Model-2 with distribution 9 , 9 , 8 learns circle/loop on bottom is number 8. if we combine and mean the score of 2 models we get the prediction 8. Since each model has slightly different features from one another , this approach seems to be generalize well with test set.

### What is the Problem with the ensemble methods stated above? 
    
Simple, the more memory and computations is needed especially for larger network since it need to train multiple models for prediction. What if we create a approximation of this process in a single training loop i mean in O(N) loop. That's where drop out comes in.

### What is Dropout  and how does it can be acheived?

Dropout provides an inexpensive approximation to training and evaluating the bagged ensemble of exponentially many neural networks. Here our objective is to drop some percentage(is a hyperparameter to be configured) of neurons/node during forward propagation. So to acheive this we create mask vector usually a binomial vector with 0's and 1's and multiply it with the layer outcome. The zeros in mask vector helps to randomly drop features/neurons from the given layer. In other words, Dropout is a regularization technique where during each iteration of gradient descent, we drop a set of neurons selected at random. By drop, what we mean is that we essentially act as if they do not exist.

Each neuron is dropped at random with some fixed probability 1-p, and kept with probability p. The value p may be different for each layer in the neural network. A value of 0.5 for the hidden layers, and 0.0 for input layer (no dropout) has been shown to work well on a wide range of tasks [1].

During evaluation (and prediction), we do not ignore any neurons, i.e. no dropout is applied. Instead, the output of each neuron is multiplied by p. This is done so that the input to the next layer has the same expected value.

To state with the real world example, from the book of deep learning. the power of droput arises from the fact that the masking noise is applied to hidden units. If the model learns a hidden unit h, that detects a face by finding the nose, then dropping h corresponds to erasing the information that there is a nose in the image. The model must learn another h, that either redundantly encodes the presence of a nose or detects the face by another feature, such as the mouth.

Also, it said that dropout is less effective with extremely few labeled training samples. 

### Let's start implementing the dropout , for that let's copy our sequential class from last session

In [2]:
import numpy as np
import ipdb

from scipy.io import loadmat
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline

In [62]:


#This cell block has the List of Activation Functions with derivatives
class sigmoid():
    def __init__(self):
        pass
    
    def __call__(self,h):
        return 1/(1+np.exp(-h)) 
    
    def derivative(self,h):
        return h * (1-h)   
    
    def diff(self, h ,y):
        return y - h

class relu():
    def __init__(self):
        pass
    
    def __call__(self,h):
        return h * (h >0)
    
    def derivative(self,h):
        return 1. * (h >0)


class tanh():
    def __init__(self):
        pass
    
    def __call__(self,h):
        return np.tanh(h)
    
    def derivative(self,h):
        return 1. - np.power(h,2)

class leakyrelu():
    def __init__(self):
        pass
    
    def __call__(self,h, alpha=0.1):
        return np.maximum(h ,alpha)
    
    def derivative(self,h):
        d =  1. (h>0)
        d[d <= 0] = alpha
        return d       
#softmax function as an activation function that maps the accumulated evidences to a probability distribution over the classes
class softmax():
    def __init__(self):
        pass
        
    def __call__(self,h):
        expo = np.exp(h)
        result = expo/np.sum(expo,axis=1, keepdims=True)  
        return result
    
    def derivative(self,h):
        return None
    
    def diff(self, h, y):
        dscore = h
        dscore[range(len(y)), y.ravel()-1] -= 1        
        dscore /= len(y)
        return dscore

#This Cell Block has List of Loss functions
def binaryLoss(y, p,lweights = None, reg=1e-3):
    return np.mean(-(y * np.log(p) + (1-y)*np.log(1-p)))    
#y -actual output p - predicted output reg - regularization strength
def crossEntropyForSoftMax(y, p,lweights = None, reg=1e-3): 
    #select the right propbolity for loss  
    correct_prob = -np.log(p[range(len(y)), y.ravel()-1])
    dataloss = np.sum(correct_prob)/len(y)   
    
    #regularization can be defined by 1/2 * Reg * np.sum(w*2)
    regloss= 0
    
    if lweights is not None:
        for weight in lweights:
            regloss +=  0.5*reg* np.sum(np.square(weight))
        
    return dataloss+regloss  


class Sequential:
    def __init__(self, layers, epochs, lr,  loss = binaryLoss, reg =1e-3):
        self.layers,self.epochs, self.lr,self.loss,self.reg = layers,epochs, lr ,loss,reg
        
    def __call__(self, X, y,X_valid, y_valid): 
        #assign weights
        self.X,self.y,self.X_valid,self.y_valid = X,y,X_valid, y_valid    
        
        inputdim = X.shape[1]
        np.random.seed(0)     
        #initialize layers
        for layer in self.layers:
            inputdim = layer(inputdim,self.lr, self.reg) 
            
        return self    
    
    def predict(self, X_input, y_input, training = False):
        
        h = X_input  
        layerweigths = []
        #compute hidden units
        for layer in self.layers:  
            h = layer.forward(h,training) 
            layerweigths.append(layer.w)
        
        loss = self.loss(y_input, h, layerweigths, self.reg)         
        
        return h,loss 
    
    
          
    def fit(self):  
        
        for i in range(self.epochs):
            
            if((i%1000) == 0):
                valid_h, valid_loss = self.predict(self.X_valid,self.y_valid)     
                h, loss = self.predict(self.X,self.y, True)  
                train_accuracy = np.mean(np.argmax(h ,axis=1)+1 == self.y.ravel())
                val_accuracy = np.mean(np.argmax(valid_h ,axis=1)+1 == self.y_valid.ravel())
               
                print(f'Epoch# {i} Training Loss:{loss} Validation Loss: {valid_loss} Training Accuracy:{train_accuracy} Validation Accuracy:{val_accuracy}') 
            else :    h, loss = self.predict(self.X,self.y, True)  
                
            #compute the error  
            error = self.layers[-1].activation.diff(h, self.y) 
            
            #back propagate the error - this formula is influenced by andrew ng course  
            for i in reversed(range(0,len(self.layers))):  
                h = self.X if i == 0 else self.layers[i-1].h
                error = self.layers[i].backward(error, h)          
               
            #update the weights   
            for layer in reversed(self.layers):
                layer.step()                


### We going to provide option to drop out in our layer class, where the forward method will accept 2 extra parameters one is dropout probabilty, other is flag to state whether it is training or not. since dropout vector to be mulitplied only in training mode.

In [84]:
from abc import ABC, abstractmethod

    

class Layer(ABC):
    def __init__(self,   outdim, activation, drop_prob=None): 
        self.outdim = outdim
        self.activation = activation
        self.dropoutprob = drop_prob
          
        
    def __call__(self, inputdim, lr= 0.01, reg=1e-3):  
        self.inputdim,self.lr,self.reg= inputdim,lr,reg         
        self.w = ( 0.01 * np.random.random((self.inputdim, self.outdim)))
        self.b = (0.01 * np.random.random((1,self.outdim)))       
        return self.outdim
        
   
    def forward(self, x, training = False):    
        h = self.activation(x)    
        if(training is False):            
            self.h =  h if self.dropoutprob is None else self.dropoutprob * h
        elif(self.dropoutprob is not None): 
            self.mask = np.random.binomial(1, 1 - self.dropoutprob, size=x.shape)/self.dropoutprob
            self.h = h *  self.mask
            
        return self.h
    
    def backward(self, d, h):   
        #given an output value from a neuron, we need to calculate it’s slope.
      
        #Apply the derivative of our activation function to the output layer error
        derivative = self.activation.derivative(self.h)
        
        if(derivative is None):  delta = d 
        else: delta = d * derivative if self.dropoutprob is None  else d * derivative * self.dropoutprob           
        
            
        self.dw = np.dot(h.T, delta)
        self.db = np.sum(delta, axis =0, keepdims=True)      
        
        #Use the delta output  to figure out how much our hidden layer contributed to the output error 
        #by performing a dot product with our weight matrix
        error = delta.dot(self.w.T)               
        
        self.dw += self.reg * self.w    
        
        return error
    
    def step(self):      
        self.w +=  -self.lr * self.dw
        self.b +=  -self.lr * self.db
        
class Dense(Layer):
    def __init__(self, outdim, activation = sigmoid,drop_prob=None):        
        super().__init__(outdim,activation,drop_prob)        
      
        
    def forward(self,x,training = False):        
        #linear 
        h = np.dot(x, self.w)  + self.b
        return super().forward(h)

In [85]:
data = loadmat("data\ex3data1.mat")
print(data['X'].shape)
print(data['y'].shape)

X = data['X']
y =  data['y']

(5000, 400)
(5000, 1)


In [86]:
X_train, X_valid, y_train, y_valid = train_test_split(
            X,y, test_size=0.20, random_state=42)
print(X_train.shape)
print(X_valid.shape)

(4000, 400)
(1000, 400)


In [87]:
model = Sequential([    
    Dense(100, activation = relu(), drop_prob = 0.25),   
    Dense(10, activation = softmax())    
], epochs =10000, lr= 0.1, reg= 1e-3, loss= crossEntropyForSoftMax)(X_train,y_train,X_valid, y_valid)

In [88]:
model.fit()

Epoch# 0 Training Loss:2.3033357673476655 Validation Loss: 2.3035524398788154 Training Accuracy:0.10375 Validation Accuracy:0.085
Epoch# 1000 Training Loss:0.8832151542393043 Validation Loss: 0.8959211433730334 Training Accuracy:0.801 Validation Accuracy:0.784
Epoch# 2000 Training Loss:0.5078076851084635 Validation Loss: 0.5610542637853289 Training Accuracy:0.889 Validation Accuracy:0.879
Epoch# 3000 Training Loss:0.4417533138054067 Validation Loss: 0.5151852124058112 Training Accuracy:0.912 Validation Accuracy:0.895
Epoch# 4000 Training Loss:0.41323353836784255 Validation Loss: 0.5001637084673154 Training Accuracy:0.92775 Validation Accuracy:0.899
Epoch# 5000 Training Loss:0.3971062961958248 Validation Loss: 0.4936247708886639 Training Accuracy:0.931 Validation Accuracy:0.906
Epoch# 6000 Training Loss:0.3868218364961822 Validation Loss: 0.4903509228238794 Training Accuracy:0.93725 Validation Accuracy:0.91
Epoch# 7000 Training Loss:0.3796409285714771 Validation Loss: 0.4886172989430282

### Without Droupout

In [89]:
model = Sequential([    
    Dense(100, activation = relu()),   
    Dense(10, activation = softmax())    
], epochs =10000, lr= 0.1, reg= 1e-3, loss= crossEntropyForSoftMax)(X_train,y_train,X_valid, y_valid)
model.fit()

Epoch# 0 Training Loss:2.3036310355372773 Validation Loss: 2.3041550030303695 Training Accuracy:0.10475 Validation Accuracy:0.085
Epoch# 1000 Training Loss:0.2948425367948705 Validation Loss: 0.3996565177455286 Training Accuracy:0.93075 Validation Accuracy:0.907
Epoch# 2000 Training Loss:0.2308697192764801 Validation Loss: 0.3840779523124036 Training Accuracy:0.95175 Validation Accuracy:0.92
Epoch# 3000 Training Loss:0.19830037834508418 Validation Loss: 0.38122552253673886 Training Accuracy:0.9685 Validation Accuracy:0.919
Epoch# 4000 Training Loss:0.17708025333789545 Validation Loss: 0.3790356183890529 Training Accuracy:0.977 Validation Accuracy:0.923
Epoch# 5000 Training Loss:0.16288755469974536 Validation Loss: 0.37694162154958005 Training Accuracy:0.983 Validation Accuracy:0.922
Epoch# 6000 Training Loss:0.15380883906196735 Validation Loss: 0.37555439485528996 Training Accuracy:0.9895 Validation Accuracy:0.926
Epoch# 7000 Training Loss:0.14805186356499744 Validation Loss: 0.3738054

### Still the loss is decreasing and it means there will be a increase in accuracy when epochs are increased, if you see here the dropout not performs better when compare to the accuracy without dropout. It will be more effective with slightly deeper networks and with more training samples. But you can notice the difference between the training accuracy and validation accuracy is minimal

### In the next session, let see how do we perform batch norm(a normalization technique) on hidden layers 