In [None]:
import numpy as np

**Module** is an abstract class that defines the fundamental methods needed to train a neural network. You do not need to change anything here; just read the comments.
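As a rough sketch of what such a base class might look like, assuming it mirrors the `Criterion` interface defined later in this notebook (`forward`/`backward` delegating to `updateOutput`/`updateGradInput`). The `accGradParameters`, `train` and `evaluate` names are assumptions made for illustration, not something this notebook confirms.

```python
import numpy as np


class Module(object):
    """Hypothetical base class; naming follows the Criterion class in this notebook."""

    def __init__(self):
        self.output = None      # result of the last forward pass
        self.gradInput = None   # gradient of the loss w.r.t. the last input
        self.training = True    # some layers behave differently in evaluation mode

    def forward(self, inpt):
        return self.updateOutput(inpt)

    def backward(self, inpt, gradOutput):
        # Propagate the gradient to the input, then accumulate parameter gradients.
        self.updateGradInput(inpt, gradOutput)
        self.accGradParameters(inpt, gradOutput)
        return self.gradInput

    def updateOutput(self, inpt):
        # Identity by default; subclasses override this.
        self.output = inpt
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        # Identity by default; subclasses override this.
        self.gradInput = gradOutput
        return self.gradInput

    def accGradParameters(self, inpt, gradOutput):
        # Parameter-free modules have nothing to accumulate.
        pass

    def train(self):
        self.training = True

    def evaluate(self):
        self.training = False
```

Layers then only need to override `updateOutput` and `updateGradInput` (plus `accGradParameters` if they have weights).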

# Sequential container

**Define** the forward and backward pass procedures.
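One possible shape for the container, as a sketch only: the forward pass chains the modules, and the backward pass walks them in reverse, handing each module the input it saw during the forward pass. The `add` method and the toy `Scale` layer are hypothetical names used for the demonstration.

```python
import numpy as np


class Scale(object):
    """Toy layer (hypothetical): multiplies its input by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, inpt):
        self.output = inpt * self.factor
        return self.output

    def backward(self, inpt, gradOutput):
        self.gradInput = gradOutput * self.factor
        return self.gradInput


class Sequential(object):
    """Feeds the output of each module into the next one."""

    def __init__(self):
        self.modules = []

    def add(self, module):
        self.modules.append(module)

    def forward(self, inpt):
        self.output = inpt
        for module in self.modules:
            self.output = module.forward(self.output)
        return self.output

    def backward(self, inpt, gradOutput):
        # Module i received the output of module i-1 during the forward pass
        # (or the original input when i == 0).
        grad = gradOutput
        for i in range(len(self.modules) - 1, -1, -1):
            layer_input = self.modules[i - 1].output if i > 0 else inpt
            grad = self.modules[i].backward(layer_input, grad)
        self.gradInput = grad
        return self.gradInput


net = Sequential()
net.add(Scale(2.0))
net.add(Scale(3.0))
x = np.array([[1.0, -1.0]])
y = net.forward(x)                     # every entry is multiplied by 2 * 3
g = net.backward(x, np.ones_like(x))   # the derivative of 6x is 6 everywhere
```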

# Layers

- input:   **`batch_size x n_feats1`**
- output: **`batch_size x n_feats2`**
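The shapes above suggest a fully connected (linear) layer, although the layer is not named here. Under that assumption, a minimal sketch could look like this; the weight layout `(n_out, n_in)`, the uniform initialisation and the `seed` argument are all choices made for the example.

```python
import numpy as np


class Linear(object):
    """Hypothetical fully connected layer: output = input @ W.T + b."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        stdv = 1.0 / np.sqrt(n_in)
        self.W = rng.uniform(-stdv, stdv, size=(n_out, n_in))
        self.b = rng.uniform(-stdv, stdv, size=n_out)
        self.gradW = np.zeros_like(self.W)
        self.gradb = np.zeros_like(self.b)

    def updateOutput(self, inpt):
        # (batch_size, n_in) @ (n_in, n_out) -> (batch_size, n_out)
        self.output = inpt @ self.W.T + self.b
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        # (batch_size, n_out) @ (n_out, n_in) -> (batch_size, n_in)
        self.gradInput = gradOutput @ self.W
        return self.gradInput

    def accGradParameters(self, inpt, gradOutput):
        # Gradients are summed over the batch; no division by batch size here.
        self.gradW += gradOutput.T @ inpt
        self.gradb += gradOutput.sum(axis=0)


layer = Linear(n_in=4, n_out=3)
x = np.ones((5, 4))
out = layer.updateOutput(x)
grad_in = layer.updateGradInput(x, np.ones_like(out))
layer.accGradParameters(x, np.ones_like(out))
```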

This one is probably the hardest, but like the others it takes only about 5 lines of code in total.
- input:   **`batch_size x n_feats`**
- output: **`batch_size x n_feats`**

One of the most significant recent ideas to influence neural networks is [**Batch normalization**](http://arxiv.org/abs/1502.03167). The idea is simple, yet effective: the features should be whitened ($mean = 0$, $std = 1$) all the way through the network. This improves convergence for deep models, letting them train in days rather than weeks. **You are** to implement one part of the layer: mean subtraction. That is, the module should calculate the mean value of every feature (every column) and subtract it.

Note that you need to estimate the mean over the dataset to be able to predict on test examples. The right way is to create a variable that holds the mean smoothed over batches (exponential smoothing works well) and use it when forwarding test examples.

When training, calculate the mean as follows:
```
    mean_to_subtract = self.old_mean * alpha + batch_mean * (1 - alpha)
```
When evaluating (`self.training == False`), set $alpha = 1$, so that only the stored mean is subtracted.


- input:   **`batch_size x n_feats`**
- output: **`batch_size x n_feats`**
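The mean-subtraction rule described above can be sketched as follows. The class name, the default `alpha=0.9` and starting the running mean at zero are assumptions made for the example. In training mode the subtracted mean depends on the current batch with weight `1 - alpha`, which is why the backward pass subtracts that fraction of the mean gradient.

```python
import numpy as np


class BatchMeanSubtraction(object):
    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.old_mean = None    # exponentially smoothed mean over batches
        self.training = True

    def updateOutput(self, inpt):
        if self.old_mean is None:
            # Assumption: the running mean starts at zero.
            self.old_mean = np.zeros(inpt.shape[1])
        if self.training:
            batch_mean = inpt.mean(axis=0)
            mean_to_subtract = self.old_mean * self.alpha + batch_mean * (1 - self.alpha)
            self.old_mean = mean_to_subtract
        else:
            # Equivalent to alpha = 1: only the stored mean is used.
            mean_to_subtract = self.old_mean
        self.output = inpt - mean_to_subtract
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        if self.training:
            # The batch mean enters the output with weight (1 - alpha).
            self.gradInput = gradOutput - (1 - self.alpha) * gradOutput.mean(axis=0)
        else:
            self.gradInput = gradOutput
        return self.gradInput


layer = BatchMeanSubtraction(alpha=0.5)
x = np.array([[1.0, 2.0], [3.0, 6.0]])
out_train = layer.updateOutput(x)   # batch mean is [2, 4]; subtracts 0.5 * [2, 4]
layer.training = False
out_eval = layer.updateOutput(x)    # subtracts the stored mean [1, 2]
```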

Implement [**dropout**](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf). The idea and the implementation are really simple: just multiply the input by a $Bernoulli(p)$ mask.

This is a very cool regularizer. In fact, when you see your net overfitting, try adding more dropout. It is hard to test, since every `forward` call samples a new mask; that is the only reason we need the `fix_mask` parameter.

While training (`self.training == True`), it should sample a mask on each iteration (for every batch). When testing, this module should implement the identity transform, i.e. `self.output = input`.

- input:   **`batch_size x n_feats`**
- output: **`batch_size x n_feats`**
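A minimal sketch of the behaviour described above, taken literally: a $Bernoulli(p)$ mask in training, identity in evaluation. It assumes `p` is the probability of keeping a unit and that `fix_mask` is a boolean attribute; the `seed` argument is added only to make the example reproducible.

```python
import numpy as np


class Dropout(object):
    def __init__(self, p=0.5, seed=None):
        self.p = p                # probability that a unit is kept (mask ~ Bernoulli(p))
        self.mask = None
        self.fix_mask = False     # when True, reuse the previously sampled mask
        self.training = True
        self.rng = np.random.default_rng(seed)

    def updateOutput(self, inpt):
        if not self.training:
            # Identity transform at test time.
            self.output = inpt
            return self.output
        need_new_mask = (
            self.mask is None
            or not self.fix_mask
            or self.mask.shape != inpt.shape
        )
        if need_new_mask:
            self.mask = self.rng.binomial(1, self.p, size=inpt.shape)
        self.output = inpt * self.mask
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        if self.training:
            # Dropped units receive no gradient.
            self.gradInput = gradOutput * self.mask
        else:
            self.gradInput = gradOutput
        return self.gradInput


layer = Dropout(p=0.5, seed=0)
layer.fix_mask = True
x = np.ones((4, 3))
out1 = layer.updateOutput(x)
out2 = layer.updateOutput(x)   # same mask as out1 because fix_mask is set
```

Many implementations also divide the kept activations by `p` during training ("inverted dropout") so that the expected activation matches at test time; the sketch leaves that out because the description above does not ask for it.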

# Activation functions

Here's the complete example for the **Rectified Linear Unit** non-linearity (aka **ReLU**): 
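A minimal ReLU sketch, assuming the same `updateOutput`/`updateGradInput` interface that the criterions in this notebook use:

```python
import numpy as np


class ReLU(object):
    def updateOutput(self, inpt):
        self.output = np.maximum(inpt, 0)
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        # The gradient passes through only where the input was positive.
        self.gradInput = np.multiply(gradOutput, inpt > 0)
        return self.gradInput

    def __repr__(self):
        return 'ReLU'


relu = ReLU()
x = np.array([[-1.0, 0.0, 2.0]])
relu_out = relu.updateOutput(x)
relu_grad = relu.updateGradInput(x, np.ones_like(x))
```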

Implement [**Leaky Rectified Linear Unit**](http://en.wikipedia.org/wiki%2FRectifier_%28neural_networks%29%23Leaky_ReLUs). Experiment with the slope.
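One way the leaky variant might be written is shown below. The default `slope=0.03` is an arbitrary value chosen for the example, and the gradient at exactly zero is taken from the negative branch.

```python
import numpy as np


class LeakyReLU(object):
    def __init__(self, slope=0.03):
        self.slope = slope

    def updateOutput(self, inpt):
        # x for positive inputs, slope * x otherwise.
        self.output = np.where(inpt > 0, inpt, self.slope * inpt)
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        # Derivative is 1 for positive inputs and `slope` otherwise.
        self.gradInput = gradOutput * np.where(inpt > 0, 1.0, self.slope)
        return self.gradInput

    def __repr__(self):
        return 'LeakyReLU'


leaky = LeakyReLU(slope=0.1)
x = np.array([[-2.0, 0.0, 3.0]])
leaky_out = leaky.updateOutput(x)
leaky_grad = leaky.updateGradInput(x, np.ones_like(x))
```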

In [1]:
%load_ext autoreload
%autoreload 2

Implement [**Exponential Linear Units**](http://arxiv.org/abs/1511.07289) activations.
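For reference, ELU is $f(x) = x$ for $x > 0$ and $f(x) = \alpha (e^x - 1)$ otherwise. A sketch under the same interface assumptions as the other activations follows; `alpha=1.0` is just a default for the example. The exponent is clipped to non-positive values so that large positive inputs do not overflow inside `np.exp`.

```python
import numpy as np


class ELU(object):
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def updateOutput(self, inpt):
        # np.minimum keeps exp() from overflowing on the branch that is discarded.
        negative_part = self.alpha * (np.exp(np.minimum(inpt, 0)) - 1)
        self.output = np.where(inpt > 0, inpt, negative_part)
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        # Derivative is 1 for positive inputs and alpha * exp(x) otherwise.
        local_grad = np.where(inpt > 0, 1.0, self.alpha * np.exp(np.minimum(inpt, 0)))
        self.gradInput = gradOutput * local_grad
        return self.gradInput

    def __repr__(self):
        return 'ELU'


elu = ELU(alpha=1.0)
x = np.array([[-1.0, 0.0, 2.0]])
elu_out = elu.updateOutput(x)
elu_grad = elu.updateGradInput(x, np.ones_like(x))
```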

0.8047189562170507
[[1.e+00 1.e-15 1.e-15]
 [2.e-01 8.e-01 1.e-15]]
[[-0.5  0.   0. ]
 [-2.5  0.   0. ]]


Implement [**SoftPlus**](https://en.wikipedia.org/wiki%2FRectifier_%28neural_networks%29) activations. Notice how closely they resemble ReLU.
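SoftPlus is $f(x) = \log(1 + e^x)$ and its derivative is the logistic sigmoid. The sketch below uses `np.logaddexp` to stay numerically stable for large inputs. The class layout is an assumption that matches the other activations in this notebook.

```python
import numpy as np


class SoftPlus(object):
    def updateOutput(self, inpt):
        # log(exp(0) + exp(x)) == log(1 + exp(x)), computed without overflow.
        self.output = np.logaddexp(0, inpt)
        return self.output

    def updateGradInput(self, inpt, gradOutput):
        # sigmoid(x) = exp(x) / (1 + exp(x)) = exp(x - softplus(x))
        sigmoid = np.exp(inpt - np.logaddexp(0, inpt))
        self.gradInput = gradOutput * sigmoid
        return self.gradInput

    def __repr__(self):
        return 'SoftPlus'


softplus = SoftPlus()
x = np.array([[-1000.0, 0.0, 1000.0]])
sp_out = softplus.updateOutput(x)
sp_grad = softplus.updateGradInput(x, np.ones_like(x))
```

For large positive inputs the output approaches `x` and the gradient approaches 1; for large negative inputs both approach 0, which is the ReLU-like behaviour mentioned above.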

# Criterions

Criterions are used to score the model's answers.

In [None]:
class Criterion(object):
    def __init__(self):
        self.output = None
        self.gradInput = None
    
    def forward(self, inpt, target):
        """
            Given an input and a target, compute the loss function 
            associated to the criterion and return the result.
            
            For consistency this function should not be overridden,
            all the code goes in `updateOutput`.
        """
        return self.updateOutput(inpt, target)

    def backward(self, inpt, target):
        """
            Given an input and a target, compute the gradients of the loss function
            associated to the criterion and return the result. 

            For consistency this function should not be overridden,
            all the code goes in `updateGradInput`.
        """
        return self.updateGradInput(inpt, target)
    
    def updateOutput(self, inpt, target):
        """
        Function to override.
        """
        return self.output

    def updateGradInput(self, inpt, target):
        """
        Function to override.
        """
        return self.gradInput   

    def __repr__(self):
        """
        Pretty printing. Should be overridden in every module if you want
        to have a readable description.
        """
        return 'Criterion'

The **MSECriterion**, the basic L2 loss usually used for regression, is implemented here for you.

In [None]:
class MSECriterion(Criterion):
    def __init__(self):
        super(MSECriterion, self).__init__()

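    def updateOutput(self, inpt, target):
        # Sum of squared errors divided by the batch size
        # (assumption: same normalisation as ClassNLLCriterion below).
        self.output = np.sum(np.power(inpt - target, 2)) / inpt.shape[0]

        return self.output

    def updateGradInput(self, inpt, target):
        # Derivative of the loss above with respect to inpt.
        self.gradInput = (inpt - target) * 2 / inpt.shape[0]

        return self.gradInput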

    def __repr__(self):
        return 'MSECriterion'

Your task is to implement the **ClassNLLCriterion**. It should implement [multiclass log loss](https://www.kaggle.com/wiki/MultiClassLogLoss). Although that formula contains a sum over `y` (the target), remember that the targets are one-hot encoded, which simplifies the computation a lot: only the log-probability of the correct class contributes for each example. Note that criterions are the only place where you divide by the batch size.

In [None]:
class ClassNLLCriterion(Criterion):
    def __init__(self):
        super(ClassNLLCriterion, self).__init__()
    
    def updateOutput(self, inpt, target): 
        
        # Use this trick to avoid numerical errors
        input_clamp = np.maximum(1e-15, np.minimum(inpt, 1 - 1e-15) )
        
        self.output = -np.multiply(target, np.log(input_clamp)).sum() / inpt.shape[0]
        
        return self.output

    def updateGradInput(self, inpt, target):
        
        # Use this trick to avoid numerical errors
        input_clamp = np.maximum(1e-15, np.minimum(inpt, 1 - 1e-15) )
                
        self.gradInput = -target / input_clamp / inpt.shape[0]
        
        return self.gradInput
    
    def __repr__(self):
        return 'ClassNLLCriterion'
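A quick way to gain confidence in a criterion is to compare its analytic gradient with a finite-difference estimate. The sketch below restates the two formulas above as plain functions and checks them on a small hand-made batch; `nll_loss`, `nll_grad` and `numeric_grad` are hypothetical helper names, and the probabilities are kept well away from the clamping thresholds so the check is meaningful.

```python
import numpy as np


def nll_loss(probs, target, eps=1e-15):
    # Same computation as ClassNLLCriterion.updateOutput.
    clamped = np.clip(probs, eps, 1 - eps)
    return -np.sum(target * np.log(clamped)) / probs.shape[0]


def nll_grad(probs, target, eps=1e-15):
    # Same computation as ClassNLLCriterion.updateGradInput.
    clamped = np.clip(probs, eps, 1 - eps)
    return -target / clamped / probs.shape[0]


def numeric_grad(f, x, h=1e-6):
    # Central finite difference, one entry of x at a time.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = h
        grad[idx] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


probs = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])
target = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

analytic = nll_grad(probs, target)
numeric = numeric_grad(lambda p: nll_loss(p, target), probs)
max_diff = np.abs(analytic - numeric).max()   # should be tiny if the gradient is right
```

The same `numeric_grad` helper works for any criterion or layer, as long as the function passed in returns a scalar.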