### Instructions

1. Take the framework code from the lesson and paste it into this notebook, or (even better) into a separate Python module
1. Define and train a one-layered perceptron, observing training and validation accuracy during training
1. Try to understand if overfitting took place, and adjust layer parameters to improve accuracy
1. Repeat previous steps for 2- and 3-layered perceptrons. Try to experiment with different activation functions between layers.
1. Try to answer the following questions:
    - Does the inter-layer activation function affect network performance?
    - Do we need a 2- or 3-layered network for this task?
    - Did you experience any problems training the network, especially as the number of layers increased?
    - How do weights of the network behave during training? You may plot max abs value of weights vs. epoch to understand the relation.
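
For the last question, one possible way to record the largest absolute weight after each epoch is sketched below. The helper name `max_abs_weight` and the `ToyLayer` class are illustrative assumptions and not part of the course framework; the only requirement is that each trainable layer exposes a `W` array.

```python
import numpy as np

def max_abs_weight(layers):
    """Return the largest |w| over all layers that expose a weight matrix `W`."""
    return max(np.abs(l.W).max() for l in layers if hasattr(l, "W"))

# Hypothetical stand-in for a trainable layer, used only to exercise the helper.
class ToyLayer:
    def __init__(self, W):
        self.W = np.asarray(W, dtype=float)

toy_layers = [ToyLayer([[0.1, -0.5], [0.2, 0.3]]), ToyLayer([[-1.5, 0.4]])]

history = []                                  # one entry per epoch in a real training loop
history.append(max_abs_weight(toy_layers))
print(history)
```

In a real run the `append` call would sit at the end of each epoch, and `history` could then be plotted against the epoch index.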

# Multi-layer Perceptron

This project is based on the [AI for beginners](https://github.com/microsoft/AI-For-Beginners) course from Microsoft and uses the same framework to build the network, although the derivatives are calculated in a different way, following my own mathematical resolution of an N-layered perceptron.

For building a perceptron framework we need 3 elements, in the form of classes:

- Hidden layer
- Output layer
- Stackable network framework

Both layer types have a linear forward pass, which applies the weights and bias of the layer to its input. The hidden layer then has an activation forward pass that applies whichever activation function you choose, while the activation forward pass of the output layer is always a softmax, so that the network outputs probabilities.
Both also have a backward pass that calculates the derivatives of the weights and bias, and an update method that applies them.

The stackable network framework lets us stack multiple hidden layers and an output layer in one perceptron model and runs the forward and backward passes over all of the layers automatically.

The loss function is a cross entropy function and the output layer uses a softmax activation function, since that is what I assumed for my mathematical resolution, but any hidden layer activation function can be used. In particular I will be using tanh since it is a simple function to differentiate. 
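
As a quick sanity check of why tanh is convenient, the sketch below compares the closed-form derivative 1 - tanh(z)^2 with a central finite difference. The sample points and the step size `eps` are arbitrary choices made for illustration.

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 13)
y = np.tanh(z)
analytic = 1.0 - y * y                        # closed form: tanh'(z) = 1 - tanh(z)^2

eps = 1e-6                                    # step for the central difference
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

max_error = np.abs(analytic - numeric).max()
print(max_error)
```

The derivative only needs the output `y` of the forward pass, so a layer can compute and store it while running forward.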

## Vector batches for SGD

MNIST digits are arrays of shape (784,)

To use SGD we will use a batch of input vectors instead of a single one. A batch of vectors is an array of shape (b,784)

The reasoning was done with column vectors of shape (784,), so this new approach requires us to transpose the matrix multiplications.
Transposing a product reverses the order of its factors and transposes each of them (the input vector is already stored as a row).

By reversing the order of all terms and creating the weight matrix already transposed in `__init__` (swapping its input and output dimensions), we obtain
the same matrix multiplications, now with row vectors of shape (1,784), for the forward and backward passes.

Now that we have the multiplication for row vectors, we have to accommodate a batch of vectors. The result of a linear transformation of a batch is a matrix in which every row is the transformation of one vector of the batch. The bias now needs to be a row vector of shape (1,nout), which numpy broadcasts across every row of the resulting matrix.
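
The broadcasting behaviour can be checked with a small sketch. The sizes and the random data are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
b, nin, nout = 4, 784, 10                     # illustrative sizes
X = rng.normal(size=(b, nin))                 # batch of row vectors
W = rng.normal(size=(nin, nout))
bias = rng.normal(size=(1, nout))

Z = X @ W + bias                              # bias is broadcast over the b rows

# Row i of Z equals the transformation of the i-th vector on its own.
row_by_row = np.stack([X[i] @ W + bias[0] for i in range(b)])
print(Z.shape, np.allclose(Z, row_by_row))
```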

When moving from a single vector to a batch, the vector-vector product that gives the weight derivatives becomes a matrix-matrix product, which is simply the sum of the vector-vector products of every pair in the batch.
If the batch has 4 vectors, each (1,784) vector becomes a (4,784) matrix, and the product of the two matrices is the sum over all 4 pairs of vector-vector products.
The bias derivatives do not sum themselves in this way: the derivative of the bias is a (b,...) matrix, so we must sum over axis=0 to obtain the proper shape.

This means that all operations can stay the same for a batch-batch multiplication as they were for a single vector when doing forward and backward passes.
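
A minimal numerical check of this claim, assuming random stand-in data: the single matrix product over the batch matches the explicit sum of per-sample vector-vector (outer) products, while the bias derivative needs its own sum over the batch axis.

```python
import numpy as np

rng = np.random.default_rng(0)
b, m_prev, m = 4, 5, 3                        # illustrative sizes
Y_prev = rng.normal(size=(b, m_prev))         # inputs y(n-1), one row per sample
Ada = rng.normal(size=(b, m))                 # A(n) * a'(n), one row per sample

dW_batch = Y_prev.T @ Ada                     # single matrix product, shape (m_prev, m)
dW_sum = sum(np.outer(Y_prev[i], Ada[i]) for i in range(b))

db = Ada.sum(axis=0)                          # bias: explicit sum over the batch axis
print(np.allclose(dW_batch, dW_sum), db.shape)
```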


In a batch of b vectors, the dimensions of the vectors for layer n are

Forward
    - input vector   (b,m(n-1))
    - Weights        (m(n-1),m(n))
    - bias           (1,m(n))
    - output         (b,m(n))

Backward
    - derivative activation  (b,m(n))
    - A coefficient          (b,m(n-1))
    - A "previous"(n+1)      (b,m(n))
    - Weights derivatives    (m(n-1),m(n))
    - bias derivatives       (b,m(n)), summed over axis=0 to (m(n),)
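
The shapes can be traced with a standalone sketch of one tanh layer. All sizes are illustrative, and `A_next` is a random stand-in for the coefficient that would normally arrive from layer n+1.

```python
import numpy as np

rng = np.random.default_rng(0)
b, m_prev, m = 4, 6, 3                        # b, m(n-1), m(n): illustrative sizes

x = rng.normal(size=(b, m_prev))              # input
W = rng.normal(size=(m_prev, m))              # weights
bias = np.zeros((1, m))                       # bias

y = np.tanh(x @ W + bias)                     # forward output
da = 1 - y * y                                # derivative of the activation
A_next = rng.normal(size=(b, m))              # stand-in for the A received from layer n+1

Ada = A_next * da
dW = x.T @ Ada                                # weight derivatives, same shape as W
db = Ada.sum(axis=0)                          # bias derivatives after summing the batch
A = Ada @ W.T                                 # coefficient handed to layer n-1

shapes = {"y": y.shape, "da": da.shape, "dW": dW.shape, "db": db.shape, "A": A.shape}
print(shapes)
```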


In [4]:
import gzip
import pickle
import pandas as pd
with gzip.open('mnist.pkl.gz', 'rb') as mnist_pickle:
    #MNIST = pickle.load(mnist_pickle,encoding='latin1')
    training_data, validation_data, test_data = pickle.load(mnist_pickle,encoding='latin1')
#MNIST = pd.read_pickle('mnist.pkl.gz',compression='gzip')



FileNotFoundError: [Errno 2] No such file or directory: 'mnist.pkl.gz'

In [5]:
import numpy as np
#from sklearn.datasets import make_classification
# pick the seed for reproducibility - change it to explore the effects of random variations
np.random.seed(0)
import random
import time

In [6]:
class Hidden_Layer():
    def __init__(self,input_dimension,output_dimension):
        self.W = np.random.normal(0,1.0/np.sqrt(input_dimension), (input_dimension,output_dimension))
        self.b = np.zeros((1,output_dimension))
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)
    def linear_forward(self,x):
        self.yn_1 = x                       #storing y(n-1) for later
        return np.dot(x,self.W) + self.b    #returns the linear transformation
    
    def activation_forward(self,z):
        y = np.tanh(z)
        self.da = 1-y*y                     #storing a'(n) for layer
        return y                            #tanh squishes values between -1 and 1
    
    def backward(self,A_prev):
        
        Ada = np.multiply(A_prev,self.da)
        
        self.dW = np.matmul(self.yn_1.T,Ada)
        self.db = Ada.sum(axis=0)

        return np.matmul(Ada,self.W.T)
        
    def update(self,lr):
        self.W -= lr*self.dW
        self.b -= lr*self.db


In [16]:
class Output_Layer():
    def __init__(self,input_dimension,output_dimension):
        self.W = np.random.normal(0,1.0/np.sqrt(input_dimension), (input_dimension,output_dimension))
        self.b = np.zeros((1,output_dimension))
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)
        
    def linear_forward(self,x):
        self.yn_1 = x                       #storing y(n-1) for later
        return np.dot(x,self.W) + self.b    #returns the linear transformation
    
    def activation_forward(self,z):        
        #adding a constant >1 in front of z makes small increases in z bring greater increases in p
        expz = np.exp(z)
        Z = expz.sum(axis=1,keepdims=True)
        self.p = expz / Z                   #storing p for the backward pass
        return self.p

    
    def backward(self,labels):
        
        #labels is vector of shape (b,)
        
        #I get p from self.p since I stored in the forward pass
        
        #p_of_f = \sum_alpha -(d_gamma_i - p_i)    (b,m(N))
        
        #p is a matrix of shape (b,m(N)) and to transform it into p_of_f I need to iterate over every row based on the order
        #of the labels -> 0th label 0th row, 1st label 1st row
        #and from every row transform the probability matching the label value into p_of_f = p-1
        #the others stay as p_of_f = p
        
        p_of_f = self.p.copy()              #copy so the p returned by the forward pass is not modified in place
        for i,lab in enumerate(labels):
            p_of_f[i,lab] -= 1
        
        self.db = p_of_f.sum(axis=0)
        self.dW = np.matmul(self.yn_1.T,p_of_f)

        return np.matmul(p_of_f,self.W.T)
        
    def update(self,lr):
        self.W -= lr*self.dW
        self.b -= lr*self.db
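
The `p - 1` rule used in the backward pass can be verified numerically. The sketch below is independent of the class above: the function names and the random test data are assumptions made for illustration. Note that it differentiates the cross entropy summed over the batch, which is what a gradient of the form "p with 1 subtracted at the true class" corresponds to when no division by the batch size is applied.

```python
import numpy as np

def softmax(z):
    expz = np.exp(z - z.max(axis=1, keepdims=True))   # shift for numerical stability
    return expz / expz.sum(axis=1, keepdims=True)

def summed_cross_entropy(z, labels):
    p = softmax(z)
    return -np.log(p[np.arange(len(labels)), labels]).sum()

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 2])

# Analytic gradient: p with 1 subtracted at the true class of each row.
analytic = softmax(z)
analytic[np.arange(len(labels)), labels] -= 1

# Central finite differences, one entry of z at a time.
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        zp, zm = z.copy(), z.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        numeric[i, j] = (summed_cross_entropy(zp, labels)
                         - summed_cross_entropy(zm, labels)) / (2 * eps)

max_error = np.abs(analytic - numeric).max()
print(max_error)
```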


In [17]:
class Network:
    def __init__(self):
        self.layers = []
        
    def add(self,l):
        self.layers.append(l)
        
    def forward(self,y):
        for l in self.layers:
            #x = layer.linear_forward(x)
            #x = layer.activation_forward(x)
            #or compose them x = layer.activation_forward(layer.linear_forward(x))
            
            y = l.activation_forward(l.linear_forward(y))
        return y                 #returns p
    
    def backward(self,z):
        for l in self.layers[::-1]:
            z = l.backward(z)
        #return z               #I don't need backward to return A(1)
    
    def update(self,lr):
        for l in self.layers:
            if hasattr(l,'update'):
                l.update(lr)


## Training the model

The first test we can do is create a simple one-layer network and train it on just enough batches to cover our training set exactly once. One such pass over the training set is known as a training epoch.
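
One way to split a dataset into batches that cover it exactly once is sketched below. The generator name `iterate_minibatches` and the optional shuffling are assumptions added for illustration; the training cell later in this notebook simply slices the data in order.

```python
import numpy as np

def iterate_minibatches(x, labels, batch_size, rng=None):
    """Yield (x_batch, label_batch) pairs that together cover the data once."""
    idx = np.arange(len(x))
    if rng is not None:
        rng.shuffle(idx)                      # optional shuffling between epochs
    for start in range(0, len(x), batch_size):
        sel = idx[start:start + batch_size]
        yield x[sel], labels[sel]

x_demo = np.arange(20, dtype=float).reshape(10, 2)
labels_demo = np.arange(10)
batches = list(iterate_minibatches(x_demo, labels_demo, batch_size=4,
                                   rng=np.random.default_rng(0)))
seen = np.sort(np.concatenate([lb for _, lb in batches]))
print(len(batches), seen)
```

The last batch is smaller when the batch size does not divide the dataset size, which the summed gradients handle without any change.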

___

# Using AI for beginners method

For building a perceptron framework we need 5 elements, which will be in the form of classes:

- linear transformation
- hidden layer activation function
- softmax activation function
- loss function
- stackable network framework

All elements except the network framework have a forward pass to calculate the probabilities of the classes and a backward pass to calculate the derivatives of the weights through backpropagation.
The loss function is a cross entropy function and the output layer uses a softmax activation function, since that is what I assumed for my mathematical resolution, but any hidden layer activation function can be used. In particular I will be using tanh since it is a simple function to differentiate. 


In [274]:
class Linear:

    def __init__(self,input_dimension,output_dimension):
        self.W = np.random.normal(0,1.0/np.sqrt(input_dimension), (output_dimension,input_dimension))
        self.b = np.zeros((output_dimension))        

    def forward(self,x): 
        return np.dot(self.W,x) + self.b 

    def backward(self):
        return 
    
x = np.array([1,-2,0,-1,2])
lll = Linear(5,3)
print(lll.forward(x))

[-0.88184657 -0.74112329 -0.16958856]


In [276]:
class Softmax:
    def forward(self,z):
        expz = np.exp(z)
        Z = expz.sum(keepdims=True)
        self.p = expz / Z
        return self.p

    def backward(self):
        return 1

softm = Softmax()
print(softm.forward(x))

[0.23412166 0.01165623 0.08612854 0.03168492 0.63640865]
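
For large inputs `np.exp(z)` can overflow. A common remedy, shown here as a sketch and not as part of the class above, is to subtract the maximum before exponentiating, which leaves the probabilities unchanged. The function name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(z):
    shifted = z - np.max(z)                   # largest entry becomes 0, so exp cannot overflow
    expz = np.exp(shifted)
    return expz / expz.sum()

x = np.array([1, -2, 0, -1, 2])
p = stable_softmax(x)
p_large = stable_softmax(x + 1000.0)          # a naive exp(1002) would overflow to inf
print(p, np.allclose(p, p_large))
```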


In [277]:
class Activation_function:
    def forward(self,z):
        self.y = np.tanh(z)
        return self.y #tanh squishes values between -1 and 1 

activ = Activation_function()
activ.forward(x)

array([ 0.76159416, -0.96402758,  0.        , -0.76159416,  0.96402758])

In [278]:
class loss_function:
    def forward(self,p,labels):
        p_of_true = p[np.arange(len(labels)), labels]
        return -np.log(p_of_true).mean()

crossloss = loss_function()
p = np.array([[0.01,0.1,0.7,0.1,0.05,0.04],[0.84,0.01,0.01,0.05,0.05,0.04]])
l = np.array([2,0])
crossloss.forward(p,l) 

#Might be better as a function since I am not calling it to calculate derivatives only the loss itself, so it gets called only 
#once at the end of the training epoch or training process

0.2655141655417551
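
Following the comment in the cell above, the loss could also be written as a plain function. This is a sketch under that assumption; the name `cross_entropy_loss` and the `eps` clipping guard against `log(0)` are additions that are not in the original class.

```python
import numpy as np

def cross_entropy_loss(p, labels, eps=1e-12):
    """Mean negative log-probability of the true class; eps guards against log(0)."""
    p_of_true = p[np.arange(len(labels)), labels]
    return -np.log(np.clip(p_of_true, eps, 1.0)).mean()

p = np.array([[0.01, 0.1, 0.7, 0.1, 0.05, 0.04],
              [0.84, 0.01, 0.01, 0.05, 0.05, 0.04]])
l = np.array([2, 0])
loss = cross_entropy_loss(p, l)
print(loss)
```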

_____

## Training

In [34]:
%matplotlib nbagg
import matplotlib.pyplot as plt 
from matplotlib import gridspec
from sklearn.datasets import make_classification
import numpy as np
# pick the seed for reproducibility - change it to explore the effects of random variations
np.random.seed(0)
import random

In [35]:
n = 100
X, Y = make_classification(n_samples = n, n_features=2,
                           n_redundant=0, n_informative=2, flip_y=0.2)
X = X.astype(np.float32)
Y = Y.astype(np.int32)

# Split into train and test dataset
train_x, test_x = np.split(X, [n*8//10])
train_labels, test_labels = np.split(Y, [n*8//10])

In [36]:
#in progress
net = Network()
net.add(Output_Layer(2,2))
learning_rate = 0.1


pred = np.argmax(net.forward(train_x),axis=1)
acc = (pred==train_labels).mean()
print("Initial accuracy: ",acc)

batch_size=4
for i in range(0,len(train_x),batch_size):
    xb = train_x[i:i+batch_size]
    yb = train_labels[i:i+batch_size]
    
    # forward pass
    #z = lin.forward(xb)
    #p = softmax.forward(z)
    #loss = cross_ent_loss.forward(p,yb)
    
    p = net.forward(xb)
    
    
    # backward pass
    net.backward(yb)
    net.update(learning_rate)
    #dp = cross_ent_loss.backward(loss)
    #dz = softmax.backward(dp)
    #dx = lin.backward(dz)
    #lin.update(learning_rate)
    
pred = np.argmax(net.forward(train_x),axis=1)
acc = (pred==train_labels).mean()
print("Final accuracy: ",acc)


Initial accuracy:  0.2125
Final accuracy:  0.825


_____

In [None]:
#class definition in progress for (784,) as input vector
class Hidden_Layer_singlevec():
    def __init__(self,input_dimension,output_dimension):
        self.W = np.random.normal(0,1.0/np.sqrt(input_dimension), (output_dimension,input_dimension))
        self.b = np.zeros(output_dimension)     #shape (m(n),) so it adds directly to W.x
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)
    def linear_forward(self,x):
        self.yn_1 = x                       #storing y(n-1) for later
        return np.dot(self.W,x) + self.b    #returns the linear transformation
    
    def activation_forward(self,z):
        y = np.tanh(z)
        self.da = 1-y*y                     #storing a'(n) for layer
        return y                            #tanh squishes values between -1 and 1
    
    def backward(self,A_prev):
        #takes A(n) as input, calculates and stores the derivatives, returns A(n-1)
        #the A values are not stored since they are fed as inputs
        
        Ada = np.multiply(A_prev,self.da)       #A(n)*a'(n), shape (m(n),)
        
        self.dW = np.outer(Ada,self.yn_1)       #shape (m(n),m(n-1)), same as W
        self.db = Ada
        
        return np.matmul(self.W.T,Ada)          #A(n-1), shape (m(n-1),)
        
    def update(self,lr):
        #the derivatives are already stored, so W and b are changed in place
        self.W -= lr*self.dW
        self.b -= lr*self.db
        
        
        
hid = Hidden_Layer(5,3)
print(1-hid.activation_forward(hid.linear_forward(x))*hid.activation_forward(hid.linear_forward(x)),hid.da)
