# Deep learning alone, without any framework (on the MNIST data set)

The purpose of this notebook is to train a three-layer neural network without any framework (Keras is used only to load the data).
Forward propagation takes 3 lines of Python
and backpropagation takes 4 lines.

The goal is to reproduce the results given on the [MNIST site](http://yann.lecun.com/exdb/mnist/): in less than 2 minutes, you will build and train a fully connected neural network (NN)
with a test error of about 2% on the [MNIST database](http://yann.lecun.com/exdb/mnist/).

Let's begin by loading the data.

In [1]:
import keras
import time

import numpy as np
np.random.seed(42)       

from keras.datasets import mnist
#from keras.datasets import fashion_mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()


X_train = X_train/255
X_test = X_test/255

x_train = X_train.reshape(60000, 784)  # reshape input from (28,28) to 784
x_test = X_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

num_classes = 10
Y_train = keras.utils.to_categorical(y_train, num_classes)
Y_test = keras.utils.to_categorical(y_test, num_classes)



We begin by defining an activation function (see for instance https://github.com/mlelarge/dataflowr/tree/master/Notebooks)  
and the cross-entropy loss (see for instance https://deepnotes.io/softmax-crossentropy or 
https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/),  
together with their derivative or gradient.

In [2]:
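class MyReLU(object):
    @staticmethod
    def forward(x):
        return np.maximum(0, x)           # element-wise max(0, x)
    @staticmethod
    def backward(x):
        return np.maximum(0, np.sign(x))  # derivative: 1 where x > 0, else 0

class MyCrossEntropy(object):
    @staticmethod
    def Cost(y, yt):
        # Softmax over the class axis (axis 0), then cross entropy per example.
        exps = np.exp(y - np.max(y, axis=0))  # shift by the max to avoid overflow
        exps /= np.sum(exps, axis=0)          # normalize: exps now holds probabilities
        return -np.sum(yt*np.log(exps), axis=0), exps
    @staticmethod
    def Grad_Cost(y, yt):
        # Gradient of the cost with respect to the pre-softmax scores y: softmax(y) - yt.
        exps = np.exp(y - np.max(y, axis=0))
        return exps/np.sum(exps, axis=0) - yt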
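
The backpropagation cells further down use `exps - yt` as the gradient of the cost with respect to the scores fed to the softmax. The following is a minimal sketch, not part of the original notebook, that checks this identity numerically with central finite differences. The helper name `softmax_cross_entropy` and the small random inputs are assumptions made for illustration only.

```python
import numpy as np

def softmax_cross_entropy(y, yt):
    """Per-example cross entropy of softmax(y) against one-hot targets yt.

    y and yt have shape (num_classes, batch); the softmax runs over axis 0.
    """
    exps = np.exp(y - np.max(y, axis=0, keepdims=True))  # shift for stability
    probs = exps / np.sum(exps, axis=0, keepdims=True)
    return -np.sum(yt * np.log(probs), axis=0), probs

rng = np.random.default_rng(0)
y = rng.standard_normal((10, 4))             # hypothetical scores: 10 classes, 4 examples
yt = np.eye(10)[:, rng.integers(0, 10, 4)]   # one-hot targets, one column per example

_, probs = softmax_cross_entropy(y, yt)
analytic = probs - yt                        # claimed gradient of the summed cost

# Central finite differences on every score.
eps = 1e-6
numeric = np.zeros_like(y)
for k in range(y.shape[0]):
    for j in range(y.shape[1]):
        d = np.zeros_like(y)
        d[k, j] = eps
        cost_plus = softmax_cross_entropy(y + d, yt)[0].sum()
        cost_minus = softmax_cross_entropy(y - d, yt)[0].sum()
        numeric[k, j] = (cost_plus - cost_minus) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The printed value should be tiny compared with the gradient entries, which are of order one.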

Initialization

In [3]:
n,p = x_train.shape
batch_size = 128

n1 = 300
sig = 0.05
W1 = sig*np.random.randn(n1,p+1)
W2 = sig*np.random.randn(10,n1+1)

Propagation is performed in parallel on all the examples in the mini-batch.

In [4]:
ind = np.random.choice(np.arange(n),batch_size, replace=False)

x  = x_train[ind,]
a1 = W1@np.r_[x.T, np.ones((1,batch_size))]
h1 = MyReLU.forward(a1)
a2 = W2@np.r_[h1, np.ones((1,batch_size))]
y = MyReLU.forward(a2)

Sanity check: print the shape of every array.

In [5]:
print(x.shape)
print(W1.shape)
print(a1.shape)
print(h1.shape)
print(W2.shape)
print(a2.shape)
print(y.shape)

(128, 784)
(300, 785)
(300, 128)
(300, 128)
(10, 301)
(10, 128)
(10, 128)
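
The shapes above show one extra column in `W1` and `W2` (785 and 301): the propagation cell stacks a row of ones under its input, so the last column of each weight matrix acts as a bias. As an illustrative sketch, with small made-up sizes rather than the MNIST ones, the augmented product can be compared with an explicit weights-plus-bias form:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, batch = 5, 3, 4                    # small hypothetical sizes
x = rng.standard_normal((batch, p))       # one example per row, as in x_train
W = rng.standard_normal((n1, p + 1))      # last column plays the role of the bias

# Augmented form used in the notebook: stack a row of ones under x.T.
a_aug = W @ np.r_[x.T, np.ones((1, batch))]

# Explicit form: separate weight matrix and bias vector.
weights, bias = W[:, :p], W[:, p:]        # bias has shape (n1, 1)
a_explicit = weights @ x.T + bias         # bias broadcasts over the batch axis

print(np.allclose(a_aug, a_explicit))
```

This is also why the backpropagation cell uses `W2[:,0:n1]`: the bias column does not feed back into the hidden layer.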


In [6]:
err, exps = MyCrossEntropy.Cost(y,Y_train[ind,].T)
errT = np.sum(err)
print(errT)

298.89242651165546


Backpropagation with L2 weight decay

In [7]:
lambd = 0.0001

dJ_dy  = exps - Y_train[ind,].T
dJ_da2 = dJ_dy * MyReLU.backward(a2)
gradW2 = dJ_da2@np.r_[h1, np.ones((1,batch_size))].T + lambd*W2
dJ_da1 = W2[:,0:n1].T@dJ_da2 * MyReLU.backward(a1)
gradW1 = dJ_da1@np.r_[x.T, np.ones((1,batch_size))].T + lambd*W1

In [8]:
print(exps.shape)
print(dJ_dy.shape)
print(dJ_da2.shape)
print(gradW2.shape)
print(dJ_da1.shape)
print(gradW1.shape)

(10, 128)
(10, 128)
(10, 128)
(10, 301)
(300, 128)
(300, 785)
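
Matching shapes do not prove that the gradients are right. A common extra check is to compare them with finite differences on a tiny network. The sketch below is an illustration under stated assumptions, not the notebook's own code: the sizes are made up, the helper names (`with_ones`, `cost`, `gradients`, `numeric_grad`) are hypothetical, and the weight decay term `lambd*W` is assumed to come from a penalty of `(lambd/2)` times the squared norm of the weights.

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

def with_ones(h):
    """Append a row of ones so the last weight column acts as a bias."""
    return np.r_[h, np.ones((1, h.shape[1]))]

def softmax(y):
    exps = np.exp(y - y.max(axis=0, keepdims=True))
    return exps / exps.sum(axis=0, keepdims=True)

def cost(W1, W2, x, yt, lambd):
    """Summed softmax cross entropy plus (lambd/2) * squared weight norm."""
    h1 = relu(W1 @ with_ones(x.T))
    y = relu(W2 @ with_ones(h1))          # ReLU on the output, as in cell In [4]
    data_term = -np.sum(yt * np.log(softmax(y)))
    decay_term = 0.5 * lambd * (np.sum(W1**2) + np.sum(W2**2))
    return data_term + decay_term

def gradients(W1, W2, x, yt, lambd):
    """Same backpropagation steps as cell In [7]."""
    n1 = W1.shape[0]
    a1 = W1 @ with_ones(x.T)
    h1 = relu(a1)
    a2 = W2 @ with_ones(h1)
    dJ_da2 = (softmax(relu(a2)) - yt) * (a2 > 0)
    gradW2 = dJ_da2 @ with_ones(h1).T + lambd * W2
    dJ_da1 = (W2[:, :n1].T @ dJ_da2) * (a1 > 0)
    gradW1 = dJ_da1 @ with_ones(x.T).T + lambd * W1
    return gradW1, gradW2

def numeric_grad(W, f, eps=1e-6):
    """Central finite differences of the scalar function f with respect to W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old                      # restore the entry
        g[idx] = (up - down) / (2 * eps)
    return g

rng = np.random.default_rng(0)
p, n1, n_out, batch, lambd = 6, 5, 3, 4, 1e-2   # small hypothetical sizes
x = rng.standard_normal((batch, p))
yt = np.eye(n_out)[:, rng.integers(0, n_out, batch)]
W1 = 0.5 * rng.standard_normal((n1, p + 1))
W2 = 0.5 * rng.standard_normal((n_out, n1 + 1))

def current_cost():
    return cost(W1, W2, x, yt, lambd)

gradW1, gradW2 = gradients(W1, W2, x, yt, lambd)
err1 = np.max(np.abs(numeric_grad(W1, current_cost) - gradW1))
err2 = np.max(np.abs(numeric_grad(W2, current_cost) - gradW2))
print(err1, err2)
```

Both printed errors should be several orders of magnitude smaller than the gradient entries. ReLU is not differentiable at zero, so a pre-activation that happens to lie within `eps` of zero could spoil the comparison; with random data this is very unlikely.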


Weight update

In [9]:
stepsize = 0.001/batch_size
W1 = W1 - stepsize*gradW1
W2 = W2 - stepsize*gradW2

Let's loop.
First, some initialization is needed.

In [17]:
n,p = x_train.shape
batch_size = 128

n1 = 300;
sig = 0.05
W1 = sig*np.random.randn(n1,p+1)
W2 = sig*np.random.randn(10,n1+1)

lambd = 0.01

n_epoch = 50
nb_ite_max = int(np.round(n_epoch*n/batch_size))
print(nb_ite_max)

stepsize = 0.1/batch_size

23438


Compute the initial error rate

In [18]:
nt,p = x_test.shape

a1 = W1@np.r_[x_test.T, np.ones((1,nt))]
h1 = MyReLU.forward(a1)
y = W2@np.r_[h1, np.ones((1,nt))]

winner = y.max(0).reshape((nt,1))
yp = (y == np.outer(np.ones((10,1)),winner)).astype(int)
print(100*np.sum(np.abs(yp - Y_test.T))/nt/2)

88.54
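
The error rate above one-hot encodes the winning class and counts mismatches against `Y_test`; each misclassified example contributes two mismatches, hence the division by 2. As a sketch on random stand-in data (the arrays below are not the MNIST outputs), the same number can be obtained with `np.argmax`, assuming no ties between class scores:

```python
import numpy as np

rng = np.random.default_rng(0)
nt, num_classes = 1000, 10                       # hypothetical test-set size
y = rng.standard_normal((num_classes, nt))       # stand-in for the network output
labels = rng.integers(0, num_classes, nt)        # stand-in for y_test
Y = np.eye(num_classes)[labels]                  # one-hot rows, like Y_test

# Notebook formula: one-hot encode the winning class, then count mismatches.
winner = y.max(0).reshape((nt, 1))
yp = (y == np.outer(np.ones((10, 1)), winner)).astype(int)
err_onehot = 100 * np.sum(np.abs(yp - Y.T)) / nt / 2

# Equivalent formulation: compare the index of the largest score with the label.
err_argmax = 100 * np.mean(np.argmax(y, axis=0) != labels)

print(err_onehot, err_argmax)
```

With random scores and 10 classes, both values should be close to 90%, which is consistent with the initial error rate of an untrained network.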


Training takes about two minutes.

In [19]:
t0=time.time()
for i in range(nb_ite_max):
#                                                                  RANDOMLY PICKING THE MINIBATCH
    ind = np.random.choice(np.arange(n),batch_size, replace=False)
#                                                                  PROPAGATION
    x  = x_train[ind,]
    a1 = W1@np.r_[x.T, np.ones((1,batch_size))]
    h1 = MyReLU.forward(a1)
    y  = W2@np.r_[h1, np.ones((1,batch_size))]
#                                                                  COST    
    err, exps = MyCrossEntropy.Cost(y,Y_train[ind,].T)
    errT = np.sum(err)
#                                                                  BACK PROPAGATION    
    dJ_da2 = exps - Y_train[ind,].T
    gradW2 = dJ_da2@np.r_[h1, np.ones((1,batch_size))].T + lambd*W2
    dJ_da1 = W2[:,0:n1].T@dJ_da2 * MyReLU.backward(a1)
    gradW1 = dJ_da1@np.r_[x.T, np.ones((1,batch_size))].T + lambd*W1
#                                                                  GRADIENT UPDATE
    W1 = W1 - stepsize*gradW1
    W2 = W2 - stepsize*gradW2

 #   print(errT)
print('total computing time: '+str(time.time()-t0))

total computing time: 112.7305588722229


Compute the final error rate

In [20]:
a1 = W1@np.r_[x_test.T, np.ones((1,nt))]
h1 = MyReLU.forward(a1)
y = W2@np.r_[h1, np.ones((1,nt))]

winner = y.max(0).reshape((nt,1))
yp = (y == np.outer(np.ones((10,1)),winner)).astype(int)
err_rate = 100*np.sum(np.abs(yp - Y_test.T))/nt/2
print('Test error rate:  %4.2f %%'% err_rate)

Test error rate:  2.01 %


## Conclusion
Of course, this is far from finished, and many improvements are waiting to be coded.
Still, we hope that these few lines give a good idea of the nature of deep learning.