# Multilayer Perceptron

<img src="image/multilayer.png" width='400'>

# FeedForward Neural Networks

>A feedforward neural network is an artificial neural network where connections between the nodes do not form a cycle.

Training this kind of network requires three steps.

1. Forward propagation
2. Computing cost
3. Backpropagation


## 1. Forward Propagation

Let `X` be the input vector to the neural network, so that `a[0] = X`.
Now, we need to calculate `a[l]` for every layer `l` in the network.
Before calculating the activation `a[l]`, we calculate an intermediate value `z[l]`. Each element `k` of `z[l]` is the weighted sum of the activations of the previous layer, `l-1`, plus the bias of neuron `k` in layer `l`.

We can calculate `z[l]` from the following equation:

<img src="image/e1.png" width='450'>

Now that we have `z[l]`, we can compute `a[l]` easily by applying the activation function `g[l]` element-wise to the vector `z[l]`.

<img src="image/e2.png" width='300'>

This is repeated for every layer, which can be pictured as follows:

<img src="image/g1.png" width='270'>
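
As a rough illustration of the forward pass described above, the following sketch computes `z[l]` and `a[l]` layer by layer for one input vector. The layer sizes, the random weights, and the helper name `forward` are assumptions made for this example, and sigmoid is assumed as the activation `g[l]` in every layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the lists of z[l] and a[l] for a column-vector input x."""
    activations = [x]       # a[0] = X
    pre_activations = []    # z[1], ..., z[L]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # z[l] = W[l] a[l-1] + b[l]
        pre_activations.append(z)
        activations.append(sigmoid(z))   # a[l] = g[l](z[l])
    return pre_activations, activations

rng = np.random.default_rng(0)
sizes = [3, 2, 1]  # hypothetical layer sizes: 3 inputs, 2 hidden neurons, 1 output
weights = [rng.normal(size=(sizes[i], sizes[i - 1])) for i in range(1, len(sizes))]
biases = [np.zeros((sizes[i], 1)) for i in range(1, len(sizes))]

x = np.array([[0.5], [-1.0], [2.0]])
zs, activations = forward(x, weights, biases)
print([a.shape for a in activations])  # [(3, 1), (2, 1), (1, 1)]
```

Each `W[l]` has shape `(n[l], n[l-1])`, so the product `W[l] a[l-1]` has one entry per neuron in layer `l`.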



## 2. Cost Function

We will use the cross-entropy cost function:

<img src="image/cost.png" width='300'>

Here,

$\hat{y}$ = predicted output

$y$ = real output

$\log$ refers to the natural logarithm ( $\ln$ )
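
A minimal sketch of this cost for a single example, assuming the binary form that `compute_cost` uses later in this notebook. The function name `binary_cross_entropy` and the `eps` clipping (which guards against `log(0)`) are additions for this example and are not part of the notebook's implementation.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip the prediction away from 0 and 1 so the logarithm stays finite
    # (an assumption of this sketch, not something the notebook does).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(binary_cross_entropy(1, 0.9))  # confident and correct: about 0.105
print(binary_cross_entropy(1, 0.1))  # confident and wrong: about 2.303
```

The cost grows quickly as the prediction moves away from the true label, which is what drives the weight updates in the next section.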


## 3. Backpropagation

The goal of backpropagation is to compute the partial derivatives of the `cost function` ($C$) with respect to any `weight` ($w$) or `bias` ($b$) in the network.

Once we have these partial derivatives, we update each weight and bias by subtracting the product of a constant `alpha` ( $\alpha$ ) and the partial derivative of the cost function with respect to that parameter. Here `alpha` is the learning rate, which we already used in XGBoost and other algorithms. This way of updating the weights is known as the gradient descent algorithm.

<img src="image/update.png" width='300'>

Visual of how it works: 
<img src="image/gradient.png" width='350'>
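
To make the update rule concrete, here is a toy sketch with a single parameter. The cost $C(w) = (w - 3)^2$, the starting point, and the step count are made up for illustration. Only the update `w = w - alpha * dC/dw` corresponds to the rule above.

```python
def grad(w):
    # Derivative of the hypothetical cost C(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

alpha = 0.1  # learning rate
w = 0.0      # arbitrary starting value
for step in range(50):
    w = w - alpha * grad(w)  # gradient descent update

print(w)  # very close to the minimiser w = 3
```

A larger `alpha` moves faster but can overshoot the minimum. A smaller one is more stable but needs more steps.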


### Chain Rule 
If $y = f(u)$ and $u = g(x)$ are both differentiable functions, then

$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x}$
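
The chain rule can be sanity-checked numerically. The sketch below uses the hypothetical pair $f(u) = \sin(u)$ and $g(x) = x^2$, chosen only for illustration, and compares the chain-rule derivative with a central finite difference.

```python
import math

def g(x):
    return x ** 2

def f(u):
    return math.sin(u)

x = 1.5
# Chain rule: dy/dx = f'(g(x)) * g'(x) = cos(x^2) * 2x
analytic = math.cos(g(x)) * 2 * x

# Central finite difference of the composed function f(g(x))
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(analytic, numeric)
```

The two values should agree to several decimal places. The same kind of check is a common way to validate a backpropagation implementation.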

**Partial derivatives of the cost function `C` with respect to `w[3]` and `b[3]`, using the chain rule:**
<img src="image/w3.png" width='350'>

Also 

<img src="image/w2.png" width='350'>

And 

<img src="image/w1.png" width='350'>


## The partial derivative

<img src="image/f.png" width='350'>

So, to calculate the partial derivatives of `C` with respect to `w[l]` and `b[l]`, we need to calculate:

<img src="image/f2.png" width='350'>


## For the last layer `L`

<img src="image/zL.png" width='350'>

Where .* represents element-wise multiplication of the matrices, also known as the Hadamard product. We multiply element-wise to make sure that all the dimensions of our matrix multiplications match up as expected.

**Derivative of the activation function:**

<img src="image/a.png" width='350'>

Combining both:

<img src="image/az.png" width='350'>
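
As a worked version of this step, assume a sigmoid output layer and the cross-entropy cost from section 2, which is what the implementation later in this notebook uses. Writing $a = a^{[L]}$ and $z = z^{[L]}$ for a single output neuron, the two factors are

$$\frac{\partial C}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a - y}{a(1-a)}, \qquad \frac{\partial a}{\partial z} = a(1-a)$$

and multiplying them cancels the $a(1-a)$ term:

$$\frac{\partial C}{\partial z} = \frac{a - y}{a(1-a)} \cdot a(1-a) = a - y$$

This is why `compute_derivatives` below can set `dz` for the last layer to `a[L] - y` directly.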


### Partial derivative of C with respect to `z[l]`

We want the partial derivative of `C` with respect to `z[l]` in terms of the partial derivative of `C` with respect to `z[l+1]`, so that once we have the derivative for `z[L]`, we can calculate it for `z[L-1]`, `z[L-2]`, and so on.
We can express `C` as a function of `z[l + 1]` for any given `l`.

Therefore, we can write:

<img src="image/zl2.png" width='350'>
<img src="image/zl3.png" width='350'>

Putting it all together we get:
<img src="image/zl4.png" width='350'>

Note that we have adjusted the terms to make sure our matrix multiplication dimensions match as expected. Here ‘.’ represents the matrix multiplication operation and .* represents the element-wise product as above.
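
The recursion can be checked against a finite-difference gradient. The sketch below is a compact, standalone version that assumes sigmoid activations and the cross-entropy cost. The function names, layer sizes, and random data are made up for this example, and the lists are zero-indexed, so `Ws[l]` holds `W[l+1]`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(x, y, Ws, bs):
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return (-(y * np.log(a) + (1 - y) * np.log(1 - a))).item()

def gradients(x, y, Ws, bs):
    # Forward pass, caching every activation a[0], ..., a[L]
    activations = [x]
    for W, b in zip(Ws, bs):
        activations.append(sigmoid(W @ activations[-1] + b))

    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    dz = activations[-1] - y                 # dC/dz[L] = a[L] - y
    for l in range(len(Ws) - 1, -1, -1):
        dWs[l] = dz @ activations[l].T       # dC/dW = dz . a_prev^T
        dbs[l] = dz                          # dC/db = dz
        if l > 0:
            a_prev = activations[l]
            # dz[l] = (W[l+1]^T . dz[l+1]) .* g'(z[l]), with g' = a(1 - a) for sigmoid
            dz = (Ws[l].T @ dz) * a_prev * (1 - a_prev)
    return dWs, dbs

rng = np.random.default_rng(0)
sizes = [3, 2, 1]  # hypothetical architecture
Ws = [rng.normal(size=(sizes[i], sizes[i - 1])) for i in range(1, len(sizes))]
bs = [rng.normal(size=(sizes[i], 1)) for i in range(1, len(sizes))]
x, y = rng.normal(size=(3, 1)), 1

dWs, dbs = gradients(x, y, Ws, bs)

# Finite-difference estimate for one first-layer weight
h = 1e-6
W_plus = [W.copy() for W in Ws]
W_minus = [W.copy() for W in Ws]
W_plus[0][0, 0] += h
W_minus[0][0, 0] -= h
numeric = (cost(x, y, W_plus, bs) - cost(x, y, W_minus, bs)) / (2 * h)
gap = abs(numeric - dWs[0][0, 0])
print(gap)  # should be tiny if the recursion is right
```

A small gap for a weight in the first layer suggests the derivative has been propagated correctly through every later layer.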


Reference: https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [2]:
def sigmoid(x):
    return 1/(1 + np.exp(-x))

In [3]:
def sigmoid_prime(x):
    return sigmoid(x)*(1.0 - sigmoid(x))

In [44]:
class NeuralNetwork(object):
    
    def __init__(self, architecture):
        #architecture - numpy array with ith element representing the number of neurons in the ith layer.
        
        #Initialize the network architecture
        self.L = architecture.size - 1 #The index of the last layer L
        self.n = architecture #n stores the number of neurons in each layer
        self.input_size = self.n[0] #input_size is the number of neurons in the first layer
        self.output_size = self.n[self.L] #output_size is the number of neurons in the last layer

        #Parameters will store the weights and biases
        self.parameters = {}
        
        #Initialize the network weights and biases:
        for i in range (1, self.L + 1): 
            #Initialize weights to small random values
            self.parameters['W' + str(i)] = np.random.randn(self.n[i], self.n[i - 1]) * 0.01
            
            #Initialize rest of the parameters to 1
            self.parameters['b' + str(i)] = np.ones((self.n[i], 1))
            self.parameters['z' + str(i)] = np.ones((self.n[i], 1))
            self.parameters['a' + str(i)] = np.ones((self.n[i], 1))
        print(self.parameters)
        #As we started the loop from 1, we haven't initialized a[0] (it has one entry per input neuron):
        self.parameters['a0'] = np.ones((self.n[0], 1))
        
        #Initialize the cost:
        self.parameters['C'] = 1
        
        #Create a dictionary for storing the derivatives:
        self.derivatives = {}
        
        #Learning rate
        self.alpha = 0.01
            
    def forward_propagate(self, X):
        #Note that X here, is just one training example
        self.parameters['a0'] = X
        
        #Calculate the activations for every layer l
        for l in range(1, self.L + 1):
            self.parameters['z' + str(l)] = np.add(np.dot(self.parameters['W' + str(l)], 
                                                          self.parameters['a' + str(l - 1)]),    
                                                   self.parameters['b' + str(l)])
            self.parameters['a' + str(l)] = sigmoid(self.parameters['z' + str(l)])
        
    def compute_cost(self, y):
        self.parameters['C'] = -(y*np.log(self.parameters['a' + str(self.L)]) + \
                                 (1-y)*np.log( 1 - self.parameters['a' + str(self.L)]))
    
    def compute_derivatives(self, y):
        #Partial derivatives of the cost function with respect to z[L], W[L] and b[L]:        
        #dzL
        self.derivatives['dz' + str(self.L)] = self.parameters['a' + str(self.L)] - y
        #dWL
        self.derivatives['dW' + str(self.L)] = np.dot(self.derivatives['dz' + str(self.L)], 
                                                      np.transpose(self.parameters['a' + str(self.L - 1)]))
        #dbL
        self.derivatives['db' + str(self.L)] = self.derivatives['dz' + str(self.L)]

        #Partial derivatives of the cost function with respect to z[l], W[l] and b[l]
        for l in range(self.L-1, 0, -1):
            self.derivatives['dz' + str(l)] = np.dot(
                np.transpose(self.parameters['W' + str(l + 1)]), 
                self.derivatives['dz' + str(l + 1)])*sigmoid_prime(self.parameters['z' + str(l)])
            self.derivatives['dW' + str(l)] = np.dot(
                self.derivatives['dz' + str(l)], 
                np.transpose(self.parameters['a' + str(l - 1)]))
            self.derivatives['db' + str(l)] = self.derivatives['dz' + str(l)]
            
    def update_parameters(self):
        for l in range(1, self.L+1):
            self.parameters['W' + str(l)] -= self.alpha*self.derivatives['dW' + str(l)]
            self.parameters['b' + str(l)] -= self.alpha*self.derivatives['db' + str(l)]
        
    def predict(self, x):
        self.forward_propagate(x)
        return self.parameters['a' + str(self.L)]
        
    def fit(self, X, Y, num_iter):
        for epoch in range(num_iter):
            c = 0    #Running sum of the per-example cost
            n_c = 0  #Number of correctly classified examples
            for i in range(X.shape[0]):
                x = X[i].reshape((X[i].size, 1))
                y = Y[i]
                self.forward_propagate(x)
                self.compute_cost(y)
                c += self.parameters['C']
                #Reuse the activation from the forward pass instead of running predict again
                y_pred = (self.parameters['a' + str(self.L)] > 0.5)
                if y_pred == y:
                    n_c += 1
                self.compute_derivatives(y)
                self.update_parameters()
            
            #Average cost and accuracy over the training set for this pass
            c = c/X.shape[0]
            acc = (n_c/X.shape[0])*100
            print('Iteration: ', epoch)
            print("Cost: ", c)
            print("Accuracy:", acc)

In [45]:
dataset = pd.read_csv('wheat-seeds-binary.csv')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Area                     140 non-null    float64
 1   Perimeter                140 non-null    float64
 2   Compactness              140 non-null    float64
 3   Length of Kernel         140 non-null    float64
 4   Width of Kernel          140 non-null    float64
 5   Asymmetry Coefficient    140 non-null    float64
 6   Length of Kernel Groove  140 non-null    float64
 7   Class                    140 non-null    int64  
dtypes: float64(7), int64(1)
memory usage: 8.9 KB


In [46]:
shuffled_dataset = dataset.sample(frac=1).reset_index(drop=True)
shuffled_dataset['Class'] = shuffled_dataset['Class'] - 1

X = shuffled_dataset.iloc[:, 0:-1].values
y = shuffled_dataset.iloc[:, -1].values

In [48]:
sc_X = StandardScaler()
X = sc_X.fit_transform(X)

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [50]:
architecture = np.array([7, 2, 1])

In [51]:
classifier = NeuralNetwork(architecture)

{'W1': array([[ 0.00100735,  0.02481865, -0.0116685 ,  0.0212643 , -0.00084768,
         0.01209944, -0.0057515 ],
       [ 0.00236923, -0.00109906, -0.01647006,  0.00413856, -0.0037008 ,
         0.004628  ,  0.0068499 ]]), 'b1': array([[1.],
       [1.]]), 'z1': array([[1.],
       [1.]]), 'a1': array([[1.],
       [1.]]), 'W2': array([[0.00583551, 0.00133127]]), 'b2': array([[1.]]), 'z2': array([[1.]]), 'a2': array([[1.]])}


In [52]:
classifier.fit(X_train, y_train, 20)

Iteration:  0
Cost:  [[0.75614427]]
Accuracy: 53.06122448979592
Iteration:  1
Cost:  [[0.71851975]]
Accuracy: 53.06122448979592
Iteration:  2
Cost:  [[0.70017197]]
Accuracy: 53.06122448979592
Iteration:  3
Cost:  [[0.68899374]]
Accuracy: 53.06122448979592
Iteration:  4
Cost:  [[0.67987434]]
Accuracy: 53.06122448979592
Iteration:  5
Cost:  [[0.67062775]]
Accuracy: 54.08163265306123
Iteration:  6
Cost:  [[0.66020066]]
Accuracy: 68.36734693877551
Iteration:  7
Cost:  [[0.64801616]]
Accuracy: 88.77551020408163
Iteration:  8
Cost:  [[0.6337837]]
Accuracy: 89.79591836734694
Iteration:  9
Cost:  [[0.61747034]]
Accuracy: 88.77551020408163
Iteration:  10
Cost:  [[0.59929194]]
Accuracy: 87.75510204081633
Iteration:  11
Cost:  [[0.57966485]]
Accuracy: 89.79591836734694
Iteration:  12
Cost:  [[0.55911628]]
Accuracy: 89.79591836734694
Iteration:  13
Cost:  [[0.53818453]]
Accuracy: 89.79591836734694
Iteration:  14
Cost:  [[0.51734208]]
Accuracy: 89.79591836734694
Iteration:  15
Cost:  [[0.49695592]]

In [43]:
acc = 0
n_c = 0
for i in range(0, X_test.shape[0]):
    x = X_test[i].reshape((X_test[i].size, 1))
    y = y_test[i]
    y_pred = classifier.predict(x)
    y_pred = (y_pred > 0.5).item()  #Convert the 1x1 boolean array to a Python bool
    print('Expected: %d Got: %d' %(y, y_pred))
    if y_pred == y:
        n_c += 1

acc = (n_c/X_test.shape[0])*100
print("Test Accuracy", acc)

Expected: 1 Got: 0
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 1 Got: 0
Expected: 1 Got: 1
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 1 Got: 1
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 0 Got: 0
Expected: 0 Got: 0
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 1 Got: 0
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 1 Got: 0
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 0 Got: 0
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 1 Got: 0
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 0 Got: 0
Expected: 0 Got: 0
Expected: 1 Got: 1
Expected: 1 Got: 1
Expected: 0 Got: 0
Expected: 1 Got: 1
Test Accuracy 88.09523809523809
