# Moving to Shallow Neural Networks

In this tutorial, you'll implement a shallow neural network to classify digits ranging from 0 to 9. The dataset you'll use is quite famous, it's called 'MNIST' http://yann.lecun.com/exdb/mnist/. A French guy put it up, he's very famous in the DL comunity, he's called Yann Lecun and is now both head of the Facebook AI reseach program and head of something in the University of New York...


###Â First step

As a first step, I invite you to discover what is MNIST. You might find [this notebook](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/dataset_MNIST.ipynb) to be usefull, but feel to browse the web.

Once you get the idea, you can download the dataset 

In [28]:
# Download the dataset in this directory (does that work on Windows OS ?)
#! wget http://deeplearning.net/data/mnist/mnist.pkl.gz

In [29]:
import pickle as cPickle
import gzip, numpy
import numpy as np

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f, encoding='latin1') #added for Python 3
f.close()

def to_one_hot(y, n_classes=10): # You might want to use this as some point...
    _y = np.zeros((len(y), n_classes))
    _y[np.arange(len(y)), y] = 1
    return _y

X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]
X_test,  y_test  = test_set[0],  test_set[1]
y_train = to_one_hot(y_train)
y_valid = to_one_hot(y_valid)
y_test = to_one_hot(y_test)

---
# You can now implement a 2 layers NN

Now that you have the data, you can build the a shallow neural network (SNN). I expect your SNN to have two layers. 
    - Layer 1 has 20 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood (wich is also the cross entropy)
    
You'll need to comment your work such that I understand that you understand what you are doing

### 1 - Define Parameters

In [30]:
# HELPER 
def softmax(Z):
    """Z is a vector eg. [1,2,3]
    return: the vector softmax(Z) eg. [.09, .24, .67]
    """
    return np.exp(Z) / np.exp(Z).sum(axis=0)

n_hidden = 20    
# Define the variables here (initialize the weights with the np.random.normal module):
W1, b1 = np.random.normal(size=(X_train.shape[1], n_hidden)), np.random.normal(size=n_hidden)
W2, b2 = np.random.normal(size=(n_hidden ,y_train.shape[1])), np.random.normal(size=y_train.shape[1])

### 2 - Define Model

In [31]:
def sigmoid(x):
    return 1.0/(1+np.exp(-x))

def Pred(X, W1, b1, W2, b2):
    """Explanations ...
    Here we do the vectorized version
    Arguments:
        X: An input image (as a vector)(shape is <50000,784>) 
        Layer1 : Contain W1 (784,20) and b1 (20)
        Layer2 : Contain W2 (20,10) and b2 (10)
    Returns : P, a matrix (50000, 10)
    """
    A1 = np.dot(X, W1) + b1 # A1 = W1t*X + b1
    A2 = sigmoid(A1)
    Z = np.dot(A2, W2) + b2
    P = softmax(Z.T).T # A2 = softmax(A1t*W2) + b2 
    #We needed to transpose one time to apply the softmax and then another one to put the matrix in the right pass order
    return P

def loss(P, Y):
    """Explanations : 
    Here we do the vectorized version
    Arguments:
        P: The prediction vector corresponding to an image (X^s)
        Y: The ground truth of an image
    Returns: a scalar
    """
    return -np.sum(np.multiply(Y, np.log(P)))

### 3 - Define Derivatives

In [32]:
def dW1(W1, b1, W2, P, Y, X):
    """Explanations ??
    Vectorized version
    Returns: A vector which is the derivative of the loss with respect to W1
    The formula that we need to implement is :
    X*A2*(A-A2)*W2*(P-Y)
    """
    A2 = sigmoid(np.dot(X, W1) + b1)
    new_A2 = np.zeros(A2.shape[1])
    # We do the following to have a vector of size 20 based on A2*(1-A2) if we use the matrix A2 as a all,
    # it doesn't work
    for i in range(A2.shape[1]):
        new_A2[i] = np.matmul(A2[i], 1-A2[i])
    dW1 = np.dot(X.T, np.dot((P-Y), np.multiply(W2.T, new_A2)))
    return dW1


def db1(W1, b1, W2, P, Y, X):
    """Explanations ??
    Vectorized version
    Arguments:
        L is the loss af a sample (a scalar)
    Returns: A scalar which is the derivative of the Loss with respect to b1
    The formula that we need to implement is :
    A2*(A-A2)*W2*(P-Y)
    """
    A2 = sigmoid(np.dot(X, W1) + b1)
    new_A2 = np.zeros(A2.shape[1])
    for i in range(A2.shape[1]):
        new_A2[i] = np.matmul(A2[i], 1-A2[i])
    db1 = np.sum(np.dot((P-Y), np.multiply(W2.T, new_A2)), axis=0)
    return db1

def dW2(W1, b1, W2, P, Y, X):
    """
    The formula that we need to implement is :
    A2*(P-Y)
    """
    A2 = sigmoid(np.dot(X, W1) + b1)
    return np.dot(A2.T, (P-Y))


def db2(P, Y):
    """
    The formula that we need to implement is :
    P-Y
    """
    return np.sum(P-Y, axis=0)

### 4 - Train you model

You may use Standard Gradient Descent (SGD) to train your model. (Experiment with many learning rates)

In [33]:
learning_rate = 0.00005
tol = 0.001 # tolerance for convergence
while True:

    #Forward Propagation
    P = Pred(X_train, W1, b1, W2, b2)
    previous_loss = loss(P, y_train)/len(X_train)
    #Backward Propagation
    W1 -= learning_rate * dW1(W1, b1, W2, P, y_train, X_train)
    b1 -= learning_rate * db1(W1, b1, W2, P, y_train, X_train)
    W2 -= learning_rate * dW2(W1, b1, W2, P, y_train, X_train)
    b2 -= learning_rate * db2(P, y_train)
    new_loss = loss(Pred(X_train, W1, b1, W2, b2), y_train)/len(X_train)
    print("Loss" , new_loss)
    print("Accuracy" ,np.sum(np.equal(np.argmax(Pred(X_test, W1, b1, W2, b2), axis=1),np.argmax(y_test, axis = 1)))/len(y_test))
    if (np.abs(previous_loss - new_loss) < tol):
        break
    

Loss 4.08379457984
Accuracy 0.1369
Loss 4.64370226353
Accuracy 0.0927
Loss 3.24699558219
Accuracy 0.1808
Loss 2.49906779982
Accuracy 0.3466
Loss 2.22878569368
Accuracy 0.3216
Loss 1.83012316056
Accuracy 0.4114
Loss 1.45737038149
Accuracy 0.5012
Loss 1.49464879178
Accuracy 0.53
Loss 1.5260663479
Accuracy 0.4656
Loss 1.7326525182
Accuracy 0.4303
Loss 2.03984953014
Accuracy 0.4136
Loss 2.42895875938
Accuracy 0.4121
Loss 1.40088166299
Accuracy 0.5507
Loss 1.37330561038
Accuracy 0.5672
Loss 1.4125726401
Accuracy 0.5524
Loss 1.19836654098
Accuracy 0.6121
Loss 1.13808432584
Accuracy 0.6228
Loss 1.13271637765
Accuracy 0.6281
Loss 1.09664074298
Accuracy 0.6414
Loss 1.23683528443
Accuracy 0.5896
Loss 1.08955500856
Accuracy 0.643
Loss 1.13476486213
Accuracy 0.6232
Loss 1.0411162639
Accuracy 0.6562
Loss 1.09907590824
Accuracy 0.6398
Loss 0.99878013147
Accuracy 0.6701
Loss 1.03010662383
Accuracy 0.6699
Loss 0.91479986406
Accuracy 0.7005
Loss 0.899695102226
Accuracy 0.7112
Loss 0.874057471555
Accura

### 5 - Test the accuracy of your model on the Test set

In [34]:
print("Loss" , loss(P, y_train)/len(X_train))
print("Accuracy" ,np.sum(np.equal(np.argmax(Pred(X_test, W1, b1, W2, b2), axis=1),np.argmax(y_test, axis = 1)))/len(y_test))

Loss 0.650512069427
Accuracy 0.792


---
# You can now go Deeper

Build a deeper model trained with SGD (You don't need to use the biases here)
    - Layer 1 has 10 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a sigmoid activation
    - Layer 3 has 10 neurons with a sigmoid activation
    - Layer 4 has 10 neurons with a sigmoid activation
    - Layer 5 has 10 neurons with a sigmoid activation
    - Layer 6 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood

Is it converging ? Why ? What's wrong ?