## Introduction:

In this notebook, we apply a multilayer perceptron (MLP) to classify the MNIST dataset. We start with a simple example, then train on MNIST.

An MLP is a feed-forward neural network with at least one hidden layer and a nonlinear activation function (usually sigmoid or tanh). The model built in this notebook has a single hidden layer.

The implementation follows this tutorial:
http://deeplearning.net/tutorial/deeplearning.pdf

- Typical hidden layer of an MLP: units are fully connected and have a sigmoidal activation function. The weight matrix W has shape (n_in, n_out) and the bias vector b has shape (n_out,).
- The tanh activation function is used here.
- The hidden unit activation is given by: tanh(dot(input, W) + b)
- The weights are initialized so that, early in training, each neuron operates in a regime of its activation function where information can easily be propagated both upward (activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs).

- In this implementation, intermediate layers usually use tanh or the
   sigmoid function as their activation (defined here by a ``HiddenLayer`` class), while the
   top layer is a softmax layer (defined here by a ``LogisticRegression`` class).
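
The forward pass described in the bullets above can be sketched in plain Python, without Theano or NumPy, so that it runs anywhere. This is only an illustration under assumptions: the helper names (`init_hidden_weights`, `hidden_forward`, `softmax`), the toy layer sizes, and the zero-initialized output weights are hypothetical choices made for this example, not part of the notebook's code.

```python
import math
import random


def init_hidden_weights(rng, n_in, n_out):
    """Sample W uniformly from [-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]."""
    bound = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-bound, bound) for _ in range(n_out)]
            for _ in range(n_in)]


def hidden_forward(x, W, b):
    """Compute tanh(dot(x, W) + b) for one input vector x of length n_in."""
    return [
        math.tanh(sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(1234)
n_in, n_hidden, n_out = 4, 3, 2            # toy sizes, not the MNIST ones
W_h = init_hidden_weights(rng, n_in, n_hidden)
b_h = [0.0] * n_hidden
W_o = [[0.0] * n_out for _ in range(n_hidden)]  # assumption: zero-initialized top layer
b_o = [0.0] * n_out

x = [0.5, -1.0, 0.25, 2.0]
h = hidden_forward(x, W_h, b_h)             # hidden activations, each in (-1, 1)
scores = [sum(h[i] * W_o[i][j] for i in range(n_hidden)) + b_o[j]
          for j in range(n_out)]
probs = softmax(scores)                     # class probabilities, sum to 1
```

With a zero-initialized top layer every class starts with the same probability, so all of the initial signal comes from the randomly initialized hidden layer.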

In [69]:
import numpy
import os
import sys
import timeit
import theano
import theano.tensor as T
from logistic_sgd import LogisticRegression, load_data

## MLP:

Let's construct the MLP class, which builds on the previous logistic regression implementation.

The MLP implementation consists of two main parts:
1. Init:
    the class constructor, which initializes the model parameters and defines the loss and regularization terms.
2. Test MLP:
    a separate function that trains and tests the model on the MNIST dataset, optimizing and updating the parameters using SGD.
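
As a rough illustration of how the loss terms set up in the constructor combine into one training cost, here is a plain-Python sketch. The helper names and the toy weight and probability values are hypothetical, and the mean negative log likelihood is an assumption about what the imported `LogisticRegression` class computes.

```python
import math


def l1_norm(weight_matrices):
    """Sum of absolute values of all weights (L1 regularization term)."""
    return sum(abs(w) for W in weight_matrices for row in W for w in row)


def l2_sqr(weight_matrices):
    """Sum of squared weights (squared L2 norm)."""
    return sum(w * w for W in weight_matrices for row in W for w in row)


def negative_log_likelihood(probs, labels):
    """Mean of -log p(correct class) over a minibatch (assumed definition)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


# Toy values, chosen only to make the arithmetic easy to follow.
W_hidden = [[0.5, -1.0], [2.0, 0.0]]
W_output = [[1.0], [-3.0]]
probs = [[0.5, 0.5], [0.25, 0.75]]   # predicted class probabilities per example
labels = [0, 1]                      # true class per example

L1_reg, L2_reg = 0.00, 0.0001        # the default values used by test_mlp
nll = negative_log_likelihood(probs, labels)
L1 = l1_norm([W_hidden, W_output])   # 7.5
L2 = l2_sqr([W_hidden, W_output])    # 15.25
cost = nll + L1_reg * L1 + L2_reg * L2
```

Note that the squared L2 norm (15.25 here) is a sum of squared weights and is not the square of the L1 norm (7.5 squared is 56.25). Biases are left out of both terms, as in the notebook's `MLP` class.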

In [80]:
class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
    
        self.input = input
      
        ## initialize the weights with the range suited to the tanh activation function
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            
            ## for the sigmoid activation function, scale the initial weights by 4
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)
        
        ## init bias vector .. 
        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b
        
        ## compute the layer output: activation(dot(input, W) + b)
        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
        # parameters of the model
        self.params = [self.W, self.b]

In [81]:
class MLP(object):
  
    def __init__(self, rng, input, n_in, n_hidden, n_out):
        ## init hidden layer with tanh activation function
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,
            n_in=n_in,
            n_out=n_hidden,
            activation=T.tanh
        )
        # The logistic regression layer gets as input the hidden units of the hidden layer    
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )
        
        ## L1 norm of the weights, used as a regularization term in the cost
        self.L1 = (
            abs(self.hiddenLayer.W).sum()
            + abs(self.logRegressionLayer.W).sum()
        )

        ## squared L2 norm of the weights (sum of squared entries, not the square of L1)
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )
        
        
        # same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers
        # it is made of (hidden, output)
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params

        # keep track of model input
        self.input = input

## 2. Test_MLP:

This function trains and tests the MLP on the MNIST dataset using minibatch stochastic gradient descent (SGD).
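
To make the minibatch SGD update concrete, the sketch below applies the same rule as the notebook, `param - learning_rate * gparam`, to a toy one-parameter problem in plain Python. The toy data, the squared-error loss, and the helper names `sgd_step` and `minibatch_slice` are assumptions made for illustration; the notebook itself lets Theano compute the gradients.

```python
def sgd_step(params, grads, learning_rate):
    """Return the updated parameters: param - learning_rate * grad."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def minibatch_slice(data, index, batch_size):
    """Same slicing as the notebook: data[index*bs : (index+1)*bs]."""
    return data[index * batch_size:(index + 1) * batch_size]


# Toy problem (assumption): fit y = w * x by minimizing mean squared error.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
batch_size = 2
n_train_batches = len(xs) // batch_size   # 3 minibatches per epoch

params = [0.0]                            # single weight w, starting at zero
for epoch in range(200):
    for index in range(n_train_batches):
        bx = minibatch_slice(xs, index, batch_size)
        by = minibatch_slice(ys, index, batch_size)
        # gradient of mean((w*x - y)^2) with respect to w
        grad = sum(2.0 * (params[0] * x - y) * x
                   for x, y in zip(bx, by)) / len(bx)
        params = sgd_step(params, [grad], learning_rate=0.01)
```

The same integer division explains the `2500/2500` in the training log further down: assuming the usual MNIST split of 50,000 training examples, a batch size of 20 gives `50000 // 20 = 2500` minibatches per epoch.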

This function contains three main steps:
1. Model Building (includes the update formula)
2. Model training 
3. Model testing 
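
The training step stops early using a "patience" counter. The plain-Python sketch below isolates that logic so it can be followed without Theano. It is a simplification under assumptions: the function `run_with_patience` is hypothetical, the validation loss for each epoch is supplied as a ready-made list instead of being computed by a model, and validation always happens once per epoch.

```python
def run_with_patience(validation_losses, n_train_batches,
                      patience=10, patience_increase=2,
                      improvement_threshold=0.995):
    """Replay the early-stopping loop on precomputed per-epoch losses.

    validation_losses[e] is the loss seen at the end of epoch e + 1.
    Returns (best_loss, best_iter, epochs_run).
    """
    best_loss = float('inf')
    best_iter = 0
    epochs_run = 0
    done = False
    for epoch, loss in enumerate(validation_losses, start=1):
        epochs_run = epoch
        for minibatch_index in range(n_train_batches):
            it = (epoch - 1) * n_train_batches + minibatch_index
            # validate after the last minibatch of each epoch
            if (it + 1) % n_train_batches == 0:
                if loss < best_loss:
                    # a large enough improvement extends the patience
                    if loss < best_loss * improvement_threshold:
                        patience = max(patience, it * patience_increase)
                    best_loss = loss
                    best_iter = it
            # stop once we have run out of patience
            if patience <= it:
                done = True
                break
        if done:
            break
    return best_loss, best_iter, epochs_run


losses = [0.10, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08]
result = run_with_patience(losses, n_train_batches=5, patience=10)
```

In this toy run the loss improves at the end of epoch 2 (iteration 9), which raises the patience to 18. No further improvement follows, so the loop stops at iteration 18, during epoch 4, and `result` is `(0.08, 9, 4)`.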

In [83]:
def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000,
             dataset='mnist.pkl.gz', batch_size=20, n_hidden=500):
    
    ## load the data .. 
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] // batch_size

    
    ######################
    #   MODEL BUILDING   #
    ######################

    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    x = T.matrix('x')  # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as 1D vector of
                        # [int] labels

    rng = numpy.random.RandomState(1234)

    # construct the MLP class
    classifier = MLP(
        rng=rng,
        input=x,
        n_in=28 * 28,
        n_hidden=n_hidden,
        n_out=10
    )

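    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and squared L2)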
    cost = (
        classifier.negative_log_likelihood(y)
        + L1_reg * classifier.L1
        + L2_reg * classifier.L2_sqr
    )
   
    # compiling a Theano function that computes the mistakes made by the
    # model on a minibatch of the test set
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size:(index + 1) * batch_size],
            y: test_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )

    ## model validation 
    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size:(index + 1) * batch_size],
            y: valid_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )

    # compute the gradient of cost with respect to theta (sorted in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

   
    ## model updates: one (param, new_value) pair per parameter
    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of same size, where each
    # element is a pair formed from the two lists :
    #    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
    # at the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
  
    ######################
    #   MODEL TRAINING   #
    ######################

    patience = 10000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is
                           # found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience // 2)
                                  # go through this many
                                  # minibatches before checking the network
                                  # on the validation set; in this case we
                                  # check every epoch

    best_validation_loss = numpy.inf
    best_iter = 0
    test_score = 0.
    start_time = timeit.default_timer()

    epoch = 0
    done_looping = False

    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in range(n_train_batches):

            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index

            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i
                                     in range(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)

                print(
                    'epoch %i, minibatch %i/%i, validation error %f %%' %
                    (
                        epoch,
                        minibatch_index + 1,
                        n_train_batches,
                        this_validation_loss * 100.
                    )
                )

                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    #improve patience if loss improvement is good enough
                    if (
                        this_validation_loss < best_validation_loss *
                        improvement_threshold
                    ):
                        patience = max(patience, iter * patience_increase)

                    best_validation_loss = this_validation_loss
                    best_iter = iter

                    # test it on the test set
                    test_losses = [test_model(i) for i
                                   in range(n_test_batches)]
                    test_score = numpy.mean(test_losses)

                    print(('     epoch %i, minibatch %i/%i, test error of '
                           'best model %f %%') %
                          (epoch, minibatch_index + 1, n_train_batches,
                           test_score * 100.))

            if patience <= iter:
                done_looping = True
                break
    ## print the output ..
    end_time = timeit.default_timer()
    print(('Optimization complete. Best validation score of %f %% '
           'obtained at iteration %i, with test performance %f %%') %
          (best_validation_loss * 100., best_iter + 1, test_score * 100.))

## Calling the function:

In [84]:
if __name__ == '__main__':
    test_mlp()

... loading data
... building the model
... training
epoch 1, minibatch 2500/2500, validation error 9.620000 %
     epoch 1, minibatch 2500/2500, test error of best model 10.090000 %
epoch 2, minibatch 2500/2500, validation error 8.610000 %
     epoch 2, minibatch 2500/2500, test error of best model 8.740000 %
epoch 3, minibatch 2500/2500, validation error 8.000000 %
     epoch 3, minibatch 2500/2500, test error of best model 8.160000 %
epoch 4, minibatch 2500/2500, validation error 7.600000 %
     epoch 4, minibatch 2500/2500, test error of best model 7.790000 %
epoch 5, minibatch 2500/2500, validation error 7.300000 %
     epoch 5, minibatch 2500/2500, test error of best model 7.590000 %
epoch 6, minibatch 2500/2500, validation error 7.020000 %
     epoch 6, minibatch 2500/2500, test error of best model 7.200000 %
epoch 7, minibatch 2500/2500, validation error 6.680000 %
     epoch 7, minibatch 2500/2500, test error of best model 6.990000 %
epoch 8, minibatch 2500/2500, validation er

epoch 68, minibatch 2500/2500, validation error 2.320000 %
epoch 69, minibatch 2500/2500, validation error 2.320000 %
epoch 70, minibatch 2500/2500, validation error 2.300000 %
     epoch 70, minibatch 2500/2500, test error of best model 2.330000 %
epoch 71, minibatch 2500/2500, validation error 2.290000 %
     epoch 71, minibatch 2500/2500, test error of best model 2.330000 %
epoch 72, minibatch 2500/2500, validation error 2.280000 %
     epoch 72, minibatch 2500/2500, test error of best model 2.300000 %
epoch 73, minibatch 2500/2500, validation error 2.260000 %
     epoch 73, minibatch 2500/2500, test error of best model 2.300000 %
epoch 74, minibatch 2500/2500, validation error 2.250000 %
     epoch 74, minibatch 2500/2500, test error of best model 2.300000 %
epoch 75, minibatch 2500/2500, validation error 2.230000 %
     epoch 75, minibatch 2500/2500, test error of best model 2.300000 %
epoch 76, minibatch 2500/2500, validation error 2.230000 %
epoch 77, minibatch 2500/2500, valida

KeyboardInterrupt: 

## Comment:

The MLP takes much longer to run than the simple logistic regression (LR) code.

This is likely due to the much larger number of parameters in the MLP, which is driven by the number of hidden units (here 500), so the structure of the MLP network is more complex. However, it is expected to give a lower classification error than LR.
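
The difference in size can be checked with a quick count. The sketch below uses the layer sizes from this notebook (784 inputs, 500 hidden units, 10 classes); the helper `dense_params` is a hypothetical name, and the LR model is assumed to be a single 784-to-10 layer.

```python
def dense_params(n_in, n_out):
    """Number of parameters in a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


n_in, n_hidden, n_out = 28 * 28, 500, 10

# MLP: hidden layer (784 -> 500) followed by the softmax layer (500 -> 10)
mlp_params = dense_params(n_in, n_hidden) + dense_params(n_hidden, n_out)

# Logistic regression: a single softmax layer (784 -> 10)
lr_params = dense_params(n_in, n_out)

ratio = mlp_params / lr_params
```

This gives 397,510 parameters for the MLP against 7,850 for LR, roughly 50 times more, which is consistent with the longer training time, although parameter count alone does not determine runtime.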