# Introduction to Machine Learning
### By **[NimbleBox](https://www.nimblebox.ai)**


[<img src="./assets/nbx.jpeg" alt="NimbleBox.ai logo" width="600"/>](https://www.nimblebox.ai)


## What will we do
 
In this notebook we are going to implement a neural network from scratch in python and this time we are going to build a different neural network then the one that we used to explain neural networks. This time we are going to solve a classification problem. So there will be some minor changes to the network to use it for Classification. Let's look at them.
 
## Softmax 
 
A softmax function takes a vector of input and gives out a vector with probabilities that obviously add up to one because they are probabilities and in our case a softmax unit at the end will tell us the probability of the class that the neural network thinks it is.
 
Softmax is calculated by taking the elementwise exponent of the vector and then dividing elementwise by the sum of the vector after taking the exponent.
 
<img src="./assets/softmax.png" width=150>
 
 
so let's take an example in our implementation we will be using the iris dataset which has 3 flowers['setosa', 'versicolor', 'virginica'] in the same order so for one example the true output looks like this.
 
$$ Y = [0, 1, 0] $$
 
This means that the second node in our neural network denotes a particular flower which is versicolor and we will be expecting a softmax output something like this.
 
$$ Y_hat = [0.3, 0.95, 0.2] $$
 
We pick the largest number in the vector to decide which class the neural network thinks this example denotes to.
 
## Loss Function
 
We will also use a new loss function named **Cross Entropy loss** to measure the classification loss between multiple classes.
 
the formula for the loss is this where $ y_o $ is Y or the labels and $ P_o $ is the predicted label which you will also see in the below implementation. 
 
<img src="./assets/cross_entropy.png" width=150>
 
 
we will change the input units or the inputs($ X $) to 4 and the number of hidden units to 5 and the output is going to be 3 units as there are 3 flowers Now let's see the architecture of the neural network.
 
<img src="./assets/nn.png" width=700>
 
The Forward, Backward and Gradient steps are going to remain the same, the only difference is that now instead of a scalar they are going to be matrices. Let's have a look over these steps again and I will be mentioning the matrix dimensions in front of all the variables this time.
 
### Forward Propagation step
 
As we know that we will be using a sigmoid function instead of $ g() $ and a softmax function at the end I will also be replacing that.
 
1. $ Z_1 = W_1[5,4] * X[4,150] + b_1[5, 1]$
2. $ A_1 = sigmoid(Z)[5, 150] $
3. $ Z_2 = softmax(W_2[3, 5] * A_1[5,150] + b_2[3 ,1])$
4. $ Y\_hat = Z_2[3, 150] $
 
### Backward Propagation step
 
For backward Propagation we only used $ W1 $ as our example and the change their is that instead of calculating the derivative for a scalar we will be calculating the derivative for the whole vector or matrix but element wise. Let's take an example and suppose we had a vector A like the one below. 
 
$$ A = [2, 4, 5] $$
 
So if say do an element wise square on $ A $, The result will be.
 
$$ A^2 = [4, 16, 25] $$
 
And as the loss function is changed the derivative of $ ∂E/∂Y\_hat $ is also changed.
 
1. $ ∂E/∂Y\_hat = Y_hat - Y $
2. $ ∂Y\_hat/∂A = W_2 $
3. $ ∂A/∂Z = A*(1 - A) $
4. $ ∂Z/∂W_1 = X $
 
finally the whole derivative for $ ∂E/∂W_1 $ will be.
 
$$ ∂E/∂W_1 = (Y_hat - Y) * W_2 * A*(1 - A) * X $$
 
### Gradient descent
 
$$ W_1[5, 4] = W_1[5, 4] - α[1, 1] * ∂E/∂W_1[5, 4] $$ 
$$ b_1[5, 1] = b_1[5, 1] - α[1, 1] * ∂E/∂b_1[5, 1] $$
$$ W_2[3, 5] = W_2[3, 5] - α[1, 1] * ∂E/∂W_2[3, 5] $$
$$ b_2[3, 1] = b_2[3, 1] - α[1, 1] * ∂E/∂b_2[3, 1] $$
 
Where $ α $ is going to be a scalar which we will broadcast to match the shape of our matrix to which it subtracts with. 
 
## Implementation
 

In [2]:
import numpy as np
from sklearn import datasets, preprocessing

def sigmoid(X):
  return 1 / (1 + np.exp(-X))

def sigmoid_der(x):
  return sigmoid(x) * (1 - sigmoid(x))

def softmax(X):
  e = np.exp(X - X.max())
  return (e / e.sum(axis=1, keepdims=True))

def loss(Y, Y_hat):
  loss = Y * np.log(Y_hat)
  return -np.sum(loss)

def neural_network_train(X, Y, num_iteration=1200):
  # Random initializing the weights and bias

  W_1 = np.random.randn(5, 4)
  b_1 = np.random.randn(1, 5)
  
  W_2 = np.random.randn(3, 5)
  b_2 = np.random.randn(1, 3)

  # Defining the learning rate

  lr = 1e-3

  for iteration in range(num_iteration):
      # Forward Propagation

    Z_1 = np.dot(X, W_1.T) + b_1
    A_1 = sigmoid(Z_1)
    Z_2 = np.dot(A_1, W_2.T) + b_2
    Y_hat = softmax(Z_2)

    # Backward Propagation

    dE_dY_hat = Y_hat - Y
    dY_hat_dW_2 = A_1

    dE_dW_2 = np.dot(dY_hat_dW_2.T, dE_dY_hat)

    dE_db_2 = dE_dY_hat

    dZ_2_dA_1 = W_2
    dE_dA_1 = np.dot(dE_dY_hat, dZ_2_dA_1)
    dA_1_dZ_1 = sigmoid_der(Z_1)
    dZ_1_dW_1 = X
    dE_dW_1 = np.dot(dZ_1_dW_1.T, dA_1_dZ_1 * dE_dA_1)

    dE_db_1 = dE_dA_1 * dA_1_dZ_1

    # Gradient Descent

    W_2 = W_2 - lr * dE_dW_2.T
    b_2 = b_2 - lr * dE_db_2.sum(axis=0)

    W_1 = W_1 - lr * dE_dW_1.T
    b_1 = b_1 - lr * dE_db_1.sum(axis=0)


    if iteration % 5 == 0:
      print("iteration : ", iteration, "   loss : ", loss(Y, Y_hat))

if __name__ == "__main__":
  X, Y = datasets.load_iris(return_X_y=True)

  # Preprocessing the data to have a mean 0 and variance 1
  X = preprocessing.scale(X)
  # Y has the shape (150, ) rather than (150, 1)
  Y = Y.reshape(150,1)
  neural_network_train(X, Y)


iteration :  0    loss :  714.248039960841
iteration :  5    loss :  574.7556644840092
iteration :  10    loss :  526.8132917923971
iteration :  15    loss :  507.89264044261654
iteration :  20    loss :  500.36320086007197
iteration :  25    loss :  497.3204452129758
iteration :  30    loss :  496.02228412648174
iteration :  35    loss :  495.41901638734294
iteration :  40    loss :  495.10576238347045
iteration :  45    loss :  494.9210980924519
iteration :  50    loss :  494.7984504907715
iteration :  55    loss :  494.70951852235754
iteration :  60    loss :  494.6416982207869
iteration :  65    loss :  494.58880638800525
iteration :  70    loss :  494.54728260335895
iteration :  75    loss :  494.5146887324421
iteration :  80    loss :  494.4891492113893
iteration :  85    loss :  494.4691519159778
iteration :  90    loss :  494.4534718772057
iteration :  95    loss :  494.44112900863485
iteration :  100    loss :  494.43135205100435
iteration :  105    loss :  494.4235432844798
i

iteration :  1065    loss :  494.3756082408439
iteration :  1070    loss :  494.37560719883857
iteration :  1075    loss :  494.37560617504
iteration :  1080    loss :  494.3756051690532
iteration :  1085    loss :  494.3756041804931
iteration :  1090    loss :  494.375603208985
iteration :  1095    loss :  494.37560225416365
iteration :  1100    loss :  494.3756013156731
iteration :  1105    loss :  494.37560039316656
iteration :  1110    loss :  494.3755994863062
iteration :  1115    loss :  494.3755985947624
iteration :  1120    loss :  494.3755977182141
iteration :  1125    loss :  494.3755968563483
iteration :  1130    loss :  494.37559600885953
iteration :  1135    loss :  494.37559517545026
iteration :  1140    loss :  494.37559435583023
iteration :  1145    loss :  494.37559354971614
iteration :  1150    loss :  494.3755927568319
iteration :  1155    loss :  494.37559197690797
iteration :  1160    loss :  494.3755912096815
iteration :  1165    loss :  494.375590454896
iteration

## What to do next
 
As you can see that we haven't chosen the best parameters. There are a lot of things that you can tune. let's mention some of them.
 
- Learning rate.
- Number of hidden units.
- Number of hidden layers.
- weight initialization method.
- activation function used: You can use ReLU or any other activation function instead of sigmoid.
- number of iterations.
- Type of classification loss.
- Different optimization algorithm such as mini-batch gradient descent, Adam or RMSprop.