### Instructions

1. Take the framework code from the lesson and paste it into this notebook, or (even better) into a separate Python module
1. Define and train one-layered perceptron, observing training and validation accuracy during training
1. Try to understand if overfitting took place, and adjust layer parameters to improve accuracy
1. Repeat previous steps for 2- and 3-layered perceptrons. Try to experiment with different activation functions between layers.
1. Try to answer the following questions:
    - Does the inter-layer activation function affect network performance?
    - Do we need 2- or 3-layered network for this task?
    - Did you experience any problems training the network? Especially as the number of layers increased.
    - How do weights of the network behave during training? You may plot max abs value of weights vs. epoch to understand the relation.

This project is based on the [AI for beginners](https://github.com/microsoft/AI-For-Beginners) course from microsoft and uses the same framework to build the network, although the derivatives are calculated in a different way, following my own mathematical resolution of an N-layered perceptron. 

For building a perceptron framework we need 5 elements, which will be in the form of classes:

- linear transformation
- hidden layer activation function
- softmax activation function
- loss function ??????????????????????????
- stackable network framework

All element except the network framework have a forward pass to calculate the probabilities of the classes and a backward pass to calculate the derivatives of the weights through backpropagation. 
The loss function is a cross entropy function and the output layer uses a softmax activation function, since that is what I assumed for my mathematical resolution, but any hidden layer activation function can be used. In particular I will be using tanh since it is a simple function to differentiate. 


In [113]:
import numpy as np
#from sklearn.datasets import make_classification
# pick the seed for reproducibility - change it to explore the effects of random variations
np.random.seed(0)
import random
import time

In [114]:
#MNIST digits are arrays of shape (784,)             (np.array([1,2,3,4]))
#a batch of digits is an array of shape (n,784)
#We will transpose them because the reasoning was done with column vectors

In [210]:
class Linear:
    def __init__(self,input_dimension,output_dimension):
        self.W = np.random.normal(0,1.0/np.sqrt(input_dimension), (output_dimension,input_dimension))
        self.b = np.zeros((output_dimension))
        
    def forward(self,x): 
        return np.dot(self.W,x) + self.b
    
    def backward(self):
        return 
    
lll = Linear(5,3)
x = np.array([1,2,-1,0,1]) 
print(lll.forward(x))


[ 0.97889949  0.21807771 -0.47248609]


In [207]:
class Softmax:
    def forward(self,z):
        expz = np.exp(z)
        Z = expz.sum(keepdims=True)
        self.p = expz / Z
        return self.p
    
    def backward(self):
        return self.p
softm = Softmax()
print(softm.forward(x),softm.backward())

[0.18719356 0.29192151 0.33898731 0.03704098 0.14485663] [0.18719356 0.29192151 0.33898731 0.03704098 0.14485663]


In [209]:
class Activation_function:
    def forward(self,z):
        self.y = np.tanh(z)
        return self.y #tanh squishes values between -1 and 1
    
    
activ = Activation_function()
activ.forward(x)

array([ 0.76159416,  0.96402758, -0.76159416,  0.        ,  0.76159416])

In [155]:
class loss_function:
    def forward(self,p,labels):
        p_of_true = p[np.arange(len(labels)), labels]
        return -np.log(p_of_true).mean()
crossloss = loss_function()
p = np.array([[0.01,0.1,0.7,0.1,0.05,0.04],[0.84,0.01,0.01,0.05,0.05,0.04]])
l = np.array([2,0])
crossloss.forward(p,l) 

#Might be better as a function since I am not calling it to calculate derivatives only the loss itself, so it gets called only 
#once at the end of the training epoch or training process


0.2655141655417551

In [None]:
class Network:
    def __init__(self):
        self.layers = []
        
    def add(self,l):
        self.layers.append(l)
        
    def forward(self,y):
        for l in self.layers:
            y = l.forward(y)
        return y
    
    def backward(self,z):
        for l in self.layers[::-1]:
            z = l.backward(z)
        return z
    
    def update(self,lr):
        for l in self.layers:
            if 'update' in l.__dir__():
                l.update(lr)


In [166]:
Lis = ['a','b','c','d']
for n,m in zip(Lis,Lis[1:]):
    print(n,m)

a b
b c
c d


In [177]:
#Backpropagation using my method requires to calculate the A coefficients and then the derivatives

#The A coefficients can be calculated backward pass, and are different for the Nth layer and the rest

#For the rest, they require a matrix multiplication with the transposed weights of the same layer and the vector resulting from
#an elemnt-wise multiplication of the previous A coefficient and the derivative of the activation function of the same layer. 

#In layer N, instead of the element-wise multiplication we get the probability of failure vector, which is a vector of the same
#size as softmax output

#The derivatives are also split between the Nth layer and the rest. 

#In the Nth layer they are calculated as a matrix multiplication of the probability of failure and the output of the N-1 layer 
#for the weights, and simply the probability of failure for the bias.

#In the rest they are calculated as a matrix multiplication of the element-wise multiplication of the previous layer's A 
#coefficient and the derivative of the activation function of the same layer, and the output of the following layer. 

#So basically, softmax has its own backward pass since it is layer N, and the activation function has the recursive formula with
#the element-wise multiplications. 

In [None]:
#Maybe redo the structure so the classes are only Hidden layer, Output layer and Network
#Network is the same, just stacking layers

#Hidden layer contains modules for a linear forward, activation forward, derivatives backward
#Output layer contains modules for a linear forward, activation forward (same name as hidden but uses softmax), 
#update backward, to change weights of the layer. 

#Now a forward pass has half as many layers since instead of being separated into linear and activation they are together, but
#in the loop you just do

#x = layer.linear_forward(x)
#x = layer.activation_forward(x)

#or compose them x = layer.activation_forward(layer.linear_forward(x))

#The benefits of this are that a single layer has access to both the weights w(n) the y(n-1) and the a'(n), which are needed to 
#calculate derivatives. The y(n-1) is also stored since it is the input used for a linar forward and as such is stored 

In [213]:
V = 2 
for n in range(4):
    V = V*2
    print(V)
    V = V*3
    print(V)

4
12
24
72
144
432
864
2592


In [None]:
#MNIST digits are arrays of shape (784,)             (np.array([1,2,3,4]))
#a batch of digits is an array of shape (n,784)
#The reasoning was done with column vectors (784,) so this new approach requires us to transpose the matrix multiplications.
#This involves changing the order and then transposing both vector and matrix. (vector is already transposed)

#by changing the order of all terms and transposing the matrix at __init__ (changing input and output dimensions) we achieve
#the same matrix multiplications, now with vectors of shape (,784)

#Now that we have the multiplication for row vectors, we have to accomodate for multiple vectors. In the case of MNIST, multiple
#vectors become a matrix of shape (n,784), and the result of a linear transformation is a matrix, where every row is the result
#of the transformation of one vector of the batch. Now the a bias needs to be a vector of shape (1,nout), and numpy understands 
#to add it in every row of the resulting matrix since it is a row vector.

#vector vector multiplication when changing to batch matrix is simply the sum of all pairs of vector vector multiplication. 
#If the batch has 4 vectors, instead of (1,784) each vector is a matrix (4,784) and their multiplication is a sum over all 
#4 pairs. 

#This means that all operations can stay the same when doing forward and backward passes, and that the derivatives will
#automatically sum themselves when calculating them using a batch of vectors. 

#In a batch of b vectors

#layer n
#forward
    #input vector   (b,m(n))
    #Weights        (m(n-1),m(n))
    #bias           (1,m(n))
    #previous input (b,m(n-1))
#backward
    #derivative activation (b,m(n))
    #A coefficient         (b,m(n-1))
    #A "previous"(n+1)     (b,m(n))
    #Weights derivatives   (m(n-1),m(n))
    #bias derivatives      (b,m(n-1))


In [243]:
class Hidden_Layer():
    def __init__(self,input_dimension,output_dimension,batch):
        self.W = np.random.normal(0,1.0/np.sqrt(input_dimension), (output_dimension,input_dimension))
        self.b = np.zeros((1,output_dimension))
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)
    def linear_forward(self,x):
        self.yn_1 = x                       #storing y(n-1) for later
        return np.dot(x,self.W) + self.b    #returns the linear transformation
    
    def activation_forward(self,z):
        y = np.tanh(z)
        self.da = 1-y*y                     #storing a'(n) for layer
        return y                            #tanh squishes values between -1 and 1
    
    def backward(self,A_prev):
        #calculate A of this layer
        #AN1 = np.matmul(WN1.T,np.multiply(AN,a'N1))
        
        #calculate derivatives of this layer using An_1
        #np.matmul(np.multiply(AN,a'N1),yN2.T)
        #np.multiply(AN,a'N1)
        
        #store derivatives but not any A since they are fed as inputs
        
        #return A
        
        Ada = np.multiply(A_prev,self.da)
        
        A = np.matmul(self.W.T,Ada)
        
        dW = np.matmul(Ada,self.yn_1.T)
        db = Ada
        
        #takes An as input
        #calculates and stores derivatives
        #returns An-1
        
        return A
        
    def update(lr):
        #the derivatives are already stored
        #remember the - signs that can be removed
        
        #change W and b by derivatives no returning anything
        
        
        
hid = Hidden_Layer(5,3)
print(1-hid.activation_forward(hid.linear_forward(x))*hid.activation_forward(hid.linear_forward(x)),hid.da)


[0.20406182 0.29158037 0.01810237] [0.20406182 0.29158037 0.01810237]


In [None]:
training_data[0]

In [None]:
#class definition in progress for (784,) as input vector
class Hidden_Layer_singlevec():
    def __init__(self,input_dimension,output_dimension,batch):
        self.W = np.random.normal(0,1.0/np.sqrt(input_dimension), (output_dimension,input_dimension))
        self.b = np.zeros((1,output_dimension))
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)
    def linear_forward(self,x):
        self.yn_1 = x                       #storing y(n-1) for later
        return np.dot(self.W,x) + self.b    #returns the linear transformation
    
    def activation_forward(self,z):
        y = np.tanh(z)
        self.da = 1-y*y                     #storing a'(n) for layer
        return y                            #tanh squishes values between -1 and 1
    
    def backward(self,A_prev):
        #calculate A of this layer
        #AN1 = np.matmul(WN1.T,np.multiply(AN,a'N1))
        
        #calculate derivatives of this layer using An_1
        #np.matmul(np.multiply(AN,a'N1),yN2.T)
        #np.multiply(AN,a'N1)
        
        #store derivatives but not any A since they are fed as inputs
        
        #return A
        
        Ada = np.multiply(A_prev,self.da)
        
        A = np.matmul(self.W.T,Ada)
        
        dW = np.matmul(Ada,self.yn_1.T)
        db = Ada
        
        #takes An as input
        #calculates and stores derivatives
        #returns An-1
        
        return A
        
    def update(lr):
        #the derivatives are already stored
        #remember the - signs that can be removed
        
        #change W and b by derivatives no returning anything
        
        
        
hid = Hidden_Layer(5,3)
print(1-hid.activation_forward(hid.linear_forward(x))*hid.activation_forward(hid.linear_forward(x)),hid.da)
