# Moving to Shallow Neural Networks

In this tutorial, you'll implement a shallow neural network to classify digits ranging from 0 to 9. The dataset you'll use is quite famous, it's called 'MNIST' http://yann.lecun.com/exdb/mnist/. A French guy put it up, he's very famous in the DL comunity, he's called Yann Lecun and is now both head of the Facebook AI reseach program and head of something in the University of New York...


### First step

As a first step, I invite you to discover what is MNIST. You might find [this notebook](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/dataset_MNIST.ipynb) to be usefull, but feel to browse the web.

Once you get the idea, you can download the dataset 

In [125]:
# Download the dataset in this directory (does that work on Windows OS ?)
#! wget http://deeplearning.net/data/mnist/mnist.pkl.gz

In [126]:
import pickle as cPickle
import gzip, numpy
import numpy as np

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f, encoding='latin1') #added for Python 3
f.close()

def to_one_hot(y, n_classes=10): # You might want to use this as some point...
    _y = np.zeros((len(y), n_classes))
    _y[np.arange(len(y)), y] = 1
    return _y

X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]
X_test,  y_test  = test_set[0],  test_set[1]
y_train = to_one_hot(y_train)
y_valid = to_one_hot(y_valid)
y_test = to_one_hot(y_test)

---
# You can now implement a 2 layers NN

Now that you have the data, you can build the a shallow neural network (SNN). I expect your SNN to have two layers. 
    - Layer 1 has 20 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood (wich is also the cross entropy)
    
You'll need to comment your work such that I understand that you understand what you are doing

### 1 - Define Parameters

In [127]:
# HELPER 
def softmax(Z):
    """Z is a vector eg. [1,2,3]
    return: the vector softmax(Z) eg. [.09, .24, .67]
    """
    return np.exp(Z) / np.exp(Z).sum(axis=0)

n_hidden = 20    
# Define the variables here (initialize the weights with the np.random.normal module):
W1, b1 = np.random.normal(size=(X_train.shape[1], n_hidden)), np.random.normal(size=n_hidden)
W2, b2 = np.random.normal(size=(n_hidden ,y_train.shape[1])), np.random.normal(size=y_train.shape[1])

### 2 - Define Model

In [128]:
def sigmoid(x):
    return 1.0/(1+np.exp(-x))

def Pred(X, W1, b1, W2, b2):
    """Explanations ...
    Here we do the vectorized version
    Arguments:
        X: An input image (as a vector)(shape is <50000,784>) 
        Layer1 : Contain W1 (784,20) and b1 (20)
        Layer2 : Contain W2 (20,10) and b2 (10)
    Returns : A2, a matrix (50000, 10)
    """
    A1 = np.dot(X, W1) + b1 # A1 = W1t*X + b1
    A2 = sigmoid(A1)
    Z = np.dot(A2, W2) + b2
    P = softmax(Z.T).T # A2 = softmax(A1t*W2) + b2 
    #We needed to transpose one time to apply the softmax and then another one to put the matrix in the right pass order
    return P

def loss(P, Y):
    """Explanations : 
    Here we do the vectorized version
    Arguments:
        P: The prediction vector corresponding to an image (X^s)
        Y: The ground truth of an image
    Returns: a vector ???
    """
    return -np.sum(np.multiply(Y, np.log(P)))

### 3 - Define Derivatives

In [129]:
def dW1(W1, b1, W2, P, Y, X):
    """Explanations ??
    Returns: A vector which is the derivative of the loss with respect to W1
    """
    A2 = sigmoid(np.dot(X, W1) + b1)
    #ones = np.ones(A2.shape)
    new_A2 = np.zeros(A2.shape[1])
    for i in range(A2.shape[1]):
        new_A2[i] = np.matmul(A2[i], 1-A2[i])
    dW1 = np.dot(X.T, np.dot((P-Y), np.multiply(W2.T, new_A2)))
    return dW1
    #return np.dot(X.T, np.dot((P-Y), W2.T))
    #return np.dot(X.T, np.dot(A2, np.dot((1-A2).T, np.dot(P-Y, W2.T))))


def db1(W1, b1, W2, P, Y, X):
    """Explanations ??
    Arguments:
        L is the loss af a sample (a scalar)
    Returns: A scalar which is the derivative of the Loss with respect to b1
    """
    A2 = sigmoid(np.dot(X, W1) + b1)
    new_A2 = np.zeros(A2.shape[1])
    for i in range(A2.shape[1]):
        new_A2[i] = np.matmul(A2[i], 1-A2[i])
    db1 = np.sum(np.dot((P-Y), np.multiply(W2.T, new_A2)), axis=0)
    return db1
    #return np.sum(np.dot((P-Y), W2.T), axis=0)
    #return np.sum(np.dot(A1, np.dot((1-A1).T, np.dot(P-Y, W2.T))), axis=0)


def dW2(W1, b1, W2, P, Y, X):
    A2 = sigmoid(np.dot(X, W1) + b1)
    return np.dot(A2.T, (P-Y))


def db2(P, Y):
    return np.sum(P-Y, axis=0)

### 4 - Train you model

You may use Standard Gradient Descent (SGD) to train your model. (Experiment with many learning rates)

In [130]:
learning_rate = 0.00005

for i in range(200):
    
    #Forward Propagation
    P = Pred(X_train, W1, b1, W2, b2)
    
    #Backward Propagation
    W1 -= learning_rate * dW1(W1, b1, W2, P, y_train, X_train)
    b1 -= learning_rate * db1(W1, b1, W2, P, y_train, X_train)
    W2 -= learning_rate * dW2(W1, b1, W2, P, y_train, X_train)
    b2 -= learning_rate * db2(P, y_train)
    print("Loss" , loss(P, y_train)/len(X_train))
    print("accuracy" ,np.sum(np.equal(np.argmax(Pred(X_test, W1, b1, W2, b2), axis=1),np.argmax(y_test, axis = 1)))/len(y_test))

Loss 6.21066170245
accuracy 0.1198
Loss 4.920450328
accuracy 0.1254
Loss 4.51000556403
accuracy 0.1847
Loss 3.82689735372
accuracy 0.1998
Loss 3.16926412587
accuracy 0.2651
Loss 2.66229461102
accuracy 0.2584
Loss 2.33947196555
accuracy 0.2861
Loss 2.13615158757
accuracy 0.304
Loss 2.03407702893
accuracy 0.319
Loss 1.97787098321
accuracy 0.3248
Loss 1.9482180027
accuracy 0.3348
Loss 1.91380723487
accuracy 0.3397
Loss 1.88095659142
accuracy 0.348
Loss 1.84678558841
accuracy 0.3545
Loss 1.8175169571
accuracy 0.3614
Loss 1.78861479676
accuracy 0.3829
Loss 1.76143494098
accuracy 0.3915
Loss 1.73481601675
accuracy 0.3997
Loss 1.71007466387
accuracy 0.4103
Loss 1.68666837093
accuracy 0.4284
Loss 1.66438632104
accuracy 0.4361
Loss 1.6434621785
accuracy 0.4589
Loss 1.62357655539
accuracy 0.4644
Loss 1.60431541802
accuracy 0.4745
Loss 1.58552829995
accuracy 0.4849
Loss 1.5674830577
accuracy 0.4926
Loss 1.5501596092
accuracy 0.5009
Loss 1.53345032039
accuracy 0.5076
Loss 1.51738411905
accuracy 0.

### 5 - Test the accuracy of your model on the Test set

---
# You can now go Deeper

Build a deeper model trained with SGD (You don't need to use the biases here)
    - Layer 1 has 10 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a sigmoid activation
    - Layer 3 has 10 neurons with a sigmoid activation
    - Layer 4 has 10 neurons with a sigmoid activation
    - Layer 5 has 10 neurons with a sigmoid activation
    - Layer 6 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood

Is it converging ? Why ? What's wrong ?