# Classifying MNIST digits using Logistic Regression

This notebook shows how to implement logistic regression with Theano. The plan is:

1) A brief introduction to logistic regression

2) Loading the data

3) Creating the logistic regression files

4) Running the model

## The model

So, what is logistic regression?

Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix $W$ and a bias vector $b$. Classification is done by projecting an input vector onto a set of hyperplanes, each of which corresponds to a class. The distance from the input to a hyperplane reflects the probability that the input is a member of the corresponding class.
Mathematically, the probability that an input vector $x$ is a member of a class $i$, a value of a stochastic variable $Y$, can be written as:
\begin{align}
P(Y = i \mid x, W,b) &= \mathrm{softmax}_i (Wx+b)\\
&= \frac{e^{W_i x + b_i}}{\sum_{j}{{e}^{W_j x + b_j }}}
\end{align}


The model’s prediction $y_{pred}$ is the class whose probability is maximal, specifically:
\begin{equation}
y_{pred} = \operatorname{argmax}_i P(Y = i \mid x,W,b)
\end{equation}
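
The two formulas above can be sketched in a few lines of NumPy. This is only an illustration, not the Theano code used later in the notebook: the toy shapes and the random values are assumptions, and examples are stored one per row, so the product is written `x @ W` rather than $Wx$.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical toy problem: 2 examples, 3 features, 4 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))   # one example per row
W = rng.normal(size=(3, 4))   # one column of weights per class
b = np.zeros(4)               # one bias per class

p_y_given_x = softmax(x @ W + b)      # P(Y = i | x, W, b) for every class i
y_pred = p_y_given_x.argmax(axis=1)   # class with the highest probability
```

Each row of `p_y_given_x` is a probability distribution over the classes, and `y_pred` holds one predicted class index per example.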


To learn the parameters $\theta = \{W, b\}$, we maximize the log-likelihood of the classifier given all the labels in a training set $\mathcal{D}$:
\begin{equation}
\mathcal{L}(\theta, \mathcal{D}) =
    \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)
\end{equation}
The likelihood of the correct class is not the same as the number of right predictions, but from the point of view of a randomly initialized classifier they are pretty similar. 

Since we usually speak in terms of minimizing a loss function, learning will thus attempt to minimize the negative log-likelihood (NLL).
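
As a rough illustration of the difference between likelihood and the number of right predictions, the sketch below computes the NLL and the accuracy for two sets of made-up class probabilities. The probability tables and labels are assumptions chosen for the example.

```python
import numpy as np

def negative_log_likelihood(p_y_given_x, y):
    """Sum of -log P(Y = y_i | x_i) over all examples."""
    # For each row, pick the probability assigned to the correct class.
    p_correct = p_y_given_x[np.arange(y.shape[0]), y]
    return -np.log(p_correct).sum()

labels = np.array([0, 1, 2])

# Hypothetical predictions: both models classify every example correctly,
# but the second one is more confident about the correct class.
hesitant = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.3, 0.4]])
confident = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])

nll_hesitant = negative_log_likelihood(hesitant, labels)
nll_confident = negative_log_likelihood(confident, labels)
acc_hesitant = (hesitant.argmax(axis=1) == labels).mean()
acc_confident = (confident.argmax(axis=1) == labels).mean()
```

Both models reach the same accuracy, yet the more confident one has the lower NLL. The loss keeps rewarding higher probability on the correct class even when the predicted label no longer changes.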

## Stochastic Gradient Descent
What is ordinary gradient descent? It is a simple algorithm in which we repeatedly take small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent, we consider the training data to be rolled into the loss function.

Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, we estimate the gradient from just a single example at a time.

The variant that is generally recommended for deep learning is a further twist on stochastic gradient descent using so-called “minibatches”. Minibatch SGD (MSGD) works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.
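
A minimal minibatch SGD loop for softmax regression might look like the following NumPy sketch. The two-blob toy data set, the learning rate, the batch size and the number of epochs are all assumptions made for illustration; the gradient is written out by hand rather than derived symbolically as Theano does.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_nll(W, b, x, y):
    p = softmax(x @ W + b)
    return -np.log(p[np.arange(len(y)), y]).mean()

# Hypothetical toy data: two Gaussian blobs in 2-D, one per class.
rng = np.random.default_rng(0)
n = 200
x = np.vstack([rng.normal(-2.0, 1.0, size=(n, 2)),
               rng.normal(+2.0, 1.0, size=(n, 2))])
y = np.repeat([0, 1], n)

W = np.zeros((2, 2))
b = np.zeros(2)
learning_rate = 0.1
batch_size = 20

loss_before = mean_nll(W, b, x, y)
for epoch in range(20):
    order = rng.permutation(len(y))            # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        delta = softmax(xb @ W + b)
        delta[np.arange(len(yb)), yb] -= 1.0   # p - one_hot(y)
        # Gradient of the mean NLL over the minibatch, then one small step.
        W -= learning_rate * xb.T @ delta / len(yb)
        b -= learning_rate * delta.mean(axis=0)
loss_after = mean_nll(W, b, x, y)
```

Setting `batch_size = 1` gives SGD in its purest form, and setting it to `len(y)` gives ordinary gradient descent.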

## Regularization
While training our model from data, we are trying to prepare it to do well on new examples, not the ones it has already seen. Plain MSGD training does not take this into account, and may overfit the training examples. One way to combat overfitting is regularization.

#### L2 regularization
L2 regularization involves adding an extra term to the loss function, which penalizes certain parameter configurations.

If our loss function is
\begin{equation}
NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)
\end{equation}
then the regularized loss will be:
\begin{equation}
E(\theta, \mathcal{D}) =  NLL(\theta, \mathcal{D}) + \lambda||\theta||_2^2
\end{equation}
where $||\theta||_2$ is the $L_2$ norm of $\theta$ and $\lambda$ is a hyperparameter that controls the strength of the penalty.
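
The regularized loss can be computed as in the sketch below. The parameter values, the NLL value and $\lambda$ are made-up numbers for illustration, and the helper names `l2_penalty` and `regularized_loss` are hypothetical.

```python
import numpy as np

def l2_penalty(params):
    """Squared L2 norm of theta, summed over all parameter arrays."""
    return sum((p ** 2).sum() for p in params)

def regularized_loss(nll, params, lam):
    """E = NLL + lambda * ||theta||_2^2."""
    return nll + lam * l2_penalty(params)

# Hypothetical parameters and unregularized loss.
W = np.array([[1.0, -2.0],
              [0.5, 0.0]])
b = np.array([0.0, 3.0])
nll = 1.25
lam = 0.01

penalty = l2_penalty([W, b])            # 1 + 4 + 0.25 + 0 + 0 + 9 = 14.25
E = regularized_loss(nll, [W, b], lam)  # 1.25 + 0.01 * 14.25 = 1.3925
```

Because the penalty grows with the squared size of every parameter, its gradient adds $2\lambda\theta$ to the gradient of the NLL, which pulls the parameters toward zero at each update.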

## Loading the data

You can either download the data from [Kaggle](https://www.kaggle.com/c/digit-recognizer/data), or from [The MNIST Database website](http://yann.lecun.com/exdb/mnist/).

However, to understand the use of pickled data, we will load it from the existing pickled file, [mnist.pkl.gz](mnist.pkl.gz).

## Pickle 

Pickle is used for serializing and de-serializing a Python object structure. Most Python objects can be pickled so that they can be saved on disk. For more information, see [this introduction](https://pythontips.com/2013/08/02/what-is-pickle-in-python/).
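
A round trip through a gzip-compressed pickle file can be sketched as follows. The tiny `train_set` tuple and the file name `toy.pkl.gz` are stand-ins invented for this example; they only mimic the `(input, target)` layout and are not the real MNIST file.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for one dataset split: (inputs, targets).
train_set = (np.zeros((5, 784), dtype="float32"), np.arange(5))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl.gz")

    # Serialize the tuple into a gzip-compressed file.
    with gzip.open(path, "wb") as f:
        pickle.dump(train_set, f)

    # De-serialize it again; the objects come back with the same structure.
    with gzip.open(path, "rb") as f:
        inputs, targets = pickle.load(f)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. A file that was pickled under Python 2 may also need `pickle.load(f, encoding='latin1')` when read under Python 3.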

#### Now let's get to it!

In [33]:
# Function for loading the dataset
# We define a function that takes the path to a file and loads it,
# or downloads the file from the University of Montreal's website if it is not available.
from code import dataset_loader as dl
# train_set, valid_set, test_set format: tuple(input, target)
# input is a numpy.ndarray of 2 dimensions (a matrix)
# where each row corresponds to an example. target is a
# numpy.ndarray of 1 dimension (vector) that has the same length as
# the number of rows in the input. It should give the target
# to the example with the same index in the input.

function

In [6]:
# Importing the useful libraries and packages

from __future__ import print_function

__docformat__ = 'restructuredtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T


We first need to define the logistic regression class.
It has a few basic methods:

1) <code>\_\_init\_\_</code>: initializes the model according to the number of inputs and outputs.

2) <code>negative_log_likelihood</code>: computes the mean loss over a given set of examples and their outcome variables (a minibatch).

3) <code>error</code>: returns the proportion of incorrectly labelled points.

Kindly check the file, [LogisticRegression.py](code/LogisticRegression.py) to see how it has been coded. We will simply import it.
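
To give a feel for what these three methods compute, here is a rough NumPy analogue. It is not the contents of LogisticRegression.py, which builds a symbolic Theano graph: the class name `NumpyLogisticRegression`, the `predict_proba` helper and the zero initialization are assumptions made for this sketch.

```python
import numpy as np

class NumpyLogisticRegression:
    """Hypothetical NumPy analogue; not the Theano class used in this notebook."""

    def __init__(self, n_in, n_out):
        # Weights of shape (n_in, n_out) and one bias per class, all zeros.
        self.W = np.zeros((n_in, n_out))
        self.b = np.zeros(n_out)

    def predict_proba(self, x):
        # Row-wise softmax of the linear scores.
        z = x @ self.W + self.b
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def negative_log_likelihood(self, x, y):
        # Mean of -log P(correct class) over the minibatch.
        p = self.predict_proba(x)
        return -np.log(p[np.arange(len(y)), y]).mean()

    def error(self, x, y):
        # Proportion of examples whose predicted class differs from the label.
        return (self.predict_proba(x).argmax(axis=1) != y).mean()

model = NumpyLogisticRegression(n_in=784, n_out=10)
x = np.zeros((4, 784))
y = np.array([0, 1, 2, 3])
nll = model.negative_log_likelihood(x, y)
err = model.error(x, y)
```

With all parameters at zero, every class gets probability 0.1, so `nll` equals $\log 10$. `argmax` then breaks the tie by picking class 0, which is right for only one of the four labels.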

In [23]:
from code import LogisticRegression

We start by allocating symbolic variables for the training inputs $x$ and their corresponding classes $y$. Note that $x$ and $y$ are defined outside the scope of the <b>LogisticRegression</b> object.

Since the class requires the input to build its graph, the input is passed as a parameter of the <code>\_\_init\_\_</code> function. This is useful when you want to connect instances of such classes to form a deep network: the output of one layer can be passed as the input of the layer above.

Finally, we define a (symbolic) cost variable to minimize, using the instance method <b>classifier.negative_log_likelihood</b>.

In [12]:
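# The cost we minimize during training is the negative log-likelihood of
# the model, in symbolic form. This assumes that the symbolic label vector y
# and a LogisticRegression instance named classifier have already been created.
cost = classifier.negative_log_likelihood(y)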

In [13]:
cost

Elemwise{neg,no_inplace}.0

In [14]:
# Symbolic gradients of the cost with respect to the weights and the biases
g_W = T.grad(cost=cost, wrt=classifier.W)
g_b = T.grad(cost=cost, wrt=classifier.b)

In [15]:
g_W

dot.0
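
`T.grad` derives these gradients symbolically. As a sanity check on what they represent, the NumPy sketch below writes the gradients of the mean NLL by hand and compares one entry against a finite-difference estimate. The toy shapes, random data and the name `analytic_grads` are assumptions for illustration, and the sketch assumes the cost is the mean NLL over the minibatch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_nll(W, b, x, y):
    p = softmax(x @ W + b)
    return -np.log(p[np.arange(len(y)), y]).mean()

def analytic_grads(W, b, x, y):
    """Gradients of the mean NLL with respect to W and b."""
    delta = softmax(x @ W + b)
    delta[np.arange(len(y)), y] -= 1.0   # p - one_hot(y)
    delta /= len(y)
    return x.T @ delta, delta.sum(axis=0)

# Hypothetical toy problem: 6 examples, 3 features, 4 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
y = rng.integers(0, 4, size=6)
W = rng.normal(size=(3, 4))
b = rng.normal(size=4)

g_W, g_b = analytic_grads(W, b, x, y)

# Central finite difference on one entry of W as an independent check.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (mean_nll(W_plus, b, x, y) - mean_nll(W_minus, b, x, y)) / (2 * eps)
```

`numeric` should agree with `g_W[1, 2]` to several decimal places. The gradient with respect to `W` has the same shape as `W`, which is what lets an update such as `W - learning_rate * g_W` be applied directly.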