# 2 LAYER NEURAL NETWORK

## 1 - Packages ##

Let's first import all the packages that you will need during this assignment.
- [numpy](www.numpy.org) is the fundamental package for scientific computing with Python.
- [sklearn](http://scikit-learn.org/stable/) provides simple and efficient tools for data mining and data analysis. 
- [matplotlib](http://matplotlib.org) is a library for plotting graphs in Python.

In [185]:
# Package imports
import numpy as np
import matplotlib.pyplot as plt

import sklearn
import sklearn.linear_model
%matplotlib inline

np.random.seed(1) # set a seed so that the results are consistent

## 2 - Dataset ##

First, let's get the dataset you will work on. The following code loads the 10-class "DIGITS" dataset, converts it into a 2-class problem (digits 0-4 become class 0, digits 5-9 become class 1), and stores the result in the variables `X` and `Y`.

In [223]:
import pandas as pad
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

#help(sklearn.datasets.load_digits)

## 2.1 - DATA DESCRIPTION ##
Help on function load_digits in module sklearn.datasets.base:

load_digits(n_class=10, return_X_y=False)
    Load and return the digits dataset (classification).
    
    Each datapoint is a 8x8 image of a digit.
    
    =================   ==============
    Classes                         10
    Samples per class             ~180
    Samples total                 1797
    Dimensionality                  64
    Features             integers 0-16
    =================   ==============
    
   -Number of Instances 
    - 1797

   -Number of Attributes
	- 64 input+1 class attribute
    
   -Missing Attribute Values
	- None
    
    This is a copy of the test set of the UCI ML hand-written digits datasets
    http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

For Each Attribute:
- All input attributes are integers in the range 0..16.
- The last attribute is the class code 0..9

Class: No of examples in set
- 0:  178
- 1:  182
- 2:  177
- 3:  183
- 4:  181
- 5:  182
- 6:  181
- 7:  179
- 8:  174
- 9:  180

In [188]:
dataset = load_digits()
temp = dataset.data
# Transposing the Data
X=temp.T
new_Y = dataset.target
Y=[]
# Converting Multiclass problem to Binary By assigning classes 0,1,2,3,4 as 0 and 5,6,7,8,9 as 1
t=0
while t<len(new_Y):
    if new_Y[t] in [0,1,2,3,4]:
        Y.append(0)
    else:
        Y.append(1)
    t+=1
        
Y = np.array(Y)
Y = np.array(Y).reshape(1,Y.shape[0])


[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
[0 1 2 ... 8 9 8]
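
The label conversion loop above can also be written without an explicit loop. The following is a minimal sketch, assuming only NumPy; the array `digit_labels` is a hypothetical stand-in for `dataset.target`.

```python
import numpy as np

# Hypothetical stand-in for dataset.target (digit labels 0..9)
digit_labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8])

# Digits 0-4 map to class 0, digits 5-9 map to class 1
binary_labels = (digit_labels >= 5).astype(int)

# Reshape to a row vector of shape (1, number of examples), as done for Y
binary_labels = binary_labels.reshape(1, -1)

print(binary_labels)        # one row of 0/1 labels
print(binary_labels.shape)  # (1, 11)
```

The comparison `digit_labels >= 5` produces the same 0/1 assignment as the `while` loop, since every label is an integer between 0 and 9.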


## 3 - NORMALIZATION OF DATA ##
Every attribute takes a value between 0 and 16, so we can normalize the data by simply dividing every value by 16.

In [189]:
X=X/16
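
As a quick sanity check of this step, the sketch below scales a small made-up array (`pixels`, not part of the original notebook) and confirms that the result lies in the range 0 to 1.

```python
import numpy as np

# Hypothetical pixel intensities in the documented range 0..16
pixels = np.array([[0.0, 5.0, 16.0],
                   [8.0, 12.0, 1.0]])

# Dividing by the known maximum maps every value into [0, 1]
pixels_scaled = pixels / 16

print(pixels_scaled.min(), pixels_scaled.max())  # 0.0 1.0
```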

## 4 - Neural Network model

Logistic regression did not work well, so we are going to train a neural network with two hidden layers.

**Mathematically**:

For one example $x^{(i)}$:
$$z^{[1] (i)} =  W^{[1]} x^{(i)} + b^{[1]}\tag{1}$$ 
$$a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}$$
$$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2]}\tag{3}$$
$$a^{[2] (i)} = \mathrm{relu}(z^{[2] (i)})\tag{4}$$
$$z^{[3] (i)} = W^{[3]} a^{[2] (i)} + b^{[3]}\tag{5}$$
$$\hat{y}^{(i)} = a^{[3] (i)} = \sigma(z^{[3] (i)})\tag{6}$$
$$y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[3](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{7}$$

Given the predictions on all the examples, you can also compute the cost $J$ as follows: 
$$J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[3] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[3] (i)}\right) \right) \tag{8}$$

**Reminder**: The general methodology to build a Neural Network is to:
    1. Define the neural network structure ( # of input units,  # of hidden units, etc). 
    2. Initialize the model's parameters
    3. Loop:
        - Implement forward propagation
        - Compute loss
        - Implement backward propagation to get the gradients
        - Update parameters (gradient descent)

You often build helper functions to compute steps 1-3 and then merge them into one function we call `nn_model()`. Once you've built `nn_model()` and learnt the right parameters, you can make predictions on new data.

### 4.1 - Defining the neural network structure ####

**To Do**: Define  variables:
    - n_x: the size of the input layer
    - n_h1: the size of the first hidden layer (set this to 8) 
    - n_h2: the size of the second hidden layer (set this to 3) 
    - n_y: the size of the output layer

**Hint**: Use shapes of X and Y to find n_x and n_y. Also, hard code the hidden layer sizes to be 8 and 3.

In [190]:

def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    
    Returns:
    n_x -- the size of the input layer
    n_h1 -- the size of the first hidden layer
    n_h2 -- the size of the second hidden layer
    n_y -- the size of the output layer
    """
    ### START CODE HERE ### (≈ 4 lines of code)
    n_x  = X.shape[0] # size of input layer
    n_h1 = 8
    n_h2 = 3
    n_y  = Y.shape[0] # size of output layer
    ### END CODE HERE ###
    return (n_x, n_h1, n_h2, n_y)

(n_x, n_h1, n_h2, n_y) = layer_sizes(X, Y)
print("The size of the input layer is: n_x = " + str(n_x))
print("The size of the first hidden layer is: n_h1 = " + str(n_h1))
print("The size of the second hidden layer is: n_h2 = " + str(n_h2))
print("The size of the output layer is: n_y = " + str(n_y))

The size of the input layer is: n_x = 64
The size of the first hidden layer is: n_h1 = 8
The size of the second hidden layer is: n_h2 = 3
The size of the output layer is: n_y = 1
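
With the layer sizes printed above, the number of trainable parameters can be tallied directly. This is an illustrative sketch; the helper `count_parameters` is hypothetical and not part of the assignment.

```python
def count_parameters(layer_dims):
    """Count weights and biases of a fully connected network.

    layer_dims -- list of layer sizes, e.g. [n_x, n_h1, n_h2, n_y]
    """
    total = 0
    for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]):
        total += n_out * n_in  # weight matrix of shape (n_out, n_in)
        total += n_out         # bias vector of shape (n_out, 1)
    return total

print(count_parameters([64, 8, 3, 1]))  # 551
```

The total breaks down as 520 parameters for the first layer, 27 for the second and 4 for the output layer.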


### 4.2 - Initialize the model's parameters ####

**To Do**: Implement the function `initialize_parameters()`.

**Instructions**:
- You will initialize the weights matrices with random values. 
    - Use: `np.random.randn(a,b) * 0.005` to randomly initialize a matrix of shape (a,b) with small values (0.005 is the scale factor used in the code below).
- You will initialize the bias vectors as zeros. 
    - Use: `np.zeros((a,b))` to initialize a matrix of shape (a,b) with zeros.
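
To see why the weights are drawn at random rather than set to zero, the sketch below compares the two choices on a tiny tanh layer. The layer sizes, the seed and the input `x_demo` are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
x_demo = rng.rand(4, 5)            # 4 features, 5 examples (made-up data)

W_zero = np.zeros((3, 4))          # every hidden unit starts identical
W_rand = rng.randn(3, 4) * 0.01    # small random values break the symmetry

A_zero = np.tanh(np.dot(W_zero, x_demo))
A_rand = np.tanh(np.dot(W_rand, x_demo))

# With zero weights all hidden units produce identical activations
print(np.allclose(A_zero[0], A_zero[1]))  # True
# With random weights the hidden units compute different activations
print(np.allclose(A_rand[0], A_rand[1]))  # False
```

Hidden units that start out identical compute the same function of the input, which is the motivation for the random initialization of the weight matrices. The biases can still start at zero.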

In [191]:
# GRADED FUNCTION: initialize_parameters

def initialize_parameters(n_x, n_h1, n_h2, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h1 -- the size of the first hidden layer
    n_h2 -- the size of the second hidden layer
    n_y -- size of the output layer
    
    Returns:
    params -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h1, n_x) 
                    b1 -- bias vector of shape (n_h1, 1)
                    W2 -- weight matrix of shape (n_h2, n_h1)
                    b2 -- bias vector of shape (n_h2, 1)
                    W3 -- weight matrix of shape (n_y, n_h2)
                    b3 -- bias vector of shape (n_y, 1)
    """
    
    np.random.seed(2) # we set up a seed so that your output matches ours although the initialization is random.
    
    ### START CODE HERE ### (≈ 6 lines of code)
    W1 = np.random.randn(n_h1,n_x)*0.005
    b1 = np.zeros((n_h1,1))
    W2 = np.random.randn(n_h2,n_h1)*0.005
    b2 = np.zeros((n_h2,1))
    W3 = np.random.randn(n_y,n_h2)*0.005
    b3 = np.zeros((n_y,1))
    ### END CODE HERE ###
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    #parameters["W1"]
    

    return parameters

In [192]:
parameters = initialize_parameters(n_x, n_h1, n_h2, n_y)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("W3 = " + str(parameters["W3"]))
print("b3 = " + str(parameters["b3"]))

W1 = [[-2.08378924e-03 -2.81334136e-04 -1.06809805e-02  8.20135404e-03
  -8.96717793e-03 -4.20873683e-03  2.51440709e-03 -6.22644043e-03
  -5.28976109e-03 -4.54503807e-03  2.75727022e-03  1.14610401e-02
   2.07696965e-04 -5.58962723e-03  2.69529160e-03 -2.98079850e-03
  -9.56524826e-05  5.87500610e-03 -3.73935475e-03  4.51262549e-05
  -4.39053947e-03 -7.82170852e-04  1.28285226e-03 -4.94389524e-03
  -1.69410983e-03 -1.18092015e-03 -3.18827506e-03 -5.93806143e-03
  -7.10608614e-03 -7.67475978e-04 -1.34528480e-03  1.11568339e-02
  -1.21738379e-02  5.63632524e-04  1.85222268e-03  6.79816931e-03
   2.50928603e-03 -4.22106852e-03  4.88073580e-08  2.71176286e-03
  -1.56754098e-03  3.85505869e-03 -9.34045327e-03  8.65592333e-03
   7.33839005e-03 -1.67838669e-03  3.05670390e-03  2.39852959e-04
  -4.14567645e-03  4.38551092e-04  5.00182943e-03 -1.90546259e-03
  -1.87834712e-03 -3.72353814e-04  2.16748165e-03  6.39189615e-03
  -3.17339653e-03  2.54198121e-03  1.08058003e-03 -9.29306193e-03
  -2.

### 4.3 - Forward and Backward Propagation ####

**To Do**: Implement `forward_propagation()`.

**Instructions**:
- Look above at the mathematical representation of your classifier.
- You can use the function `np.tanh()`. It is part of the numpy library.
- The ReLU function can be implemented directly with `np.maximum()`, so we are going to use that.
- The steps you have to implement are:
    1. Retrieve each parameter from the dictionary "parameters" (which is the output of `initialize_parameters()`) by using `parameters[".."]`.
    2. Implement Forward Propagation. Compute $Z^{[1]}, A^{[1]}, Z^{[2]}$, $A^{[2]}, Z^{[3]}$ and $A^{[3]}$ (the vector of all your predictions on all the examples in the training set).
- Values needed in the backpropagation are stored in "`cache`". The `cache` will be given as an input to the backpropagation function.

In [193]:
# GRADED FUNCTION: forward_propagation
def sigmoid(x):
    """
    Compute the sigmoid of x

    Arguments:
    x -- A scalar or numpy array of any size

    Return:
    s -- sigmoid(x)
    """
    
    ### START CODE HERE ### (≈ 1 line of code)
    s = 1.0/(1+np.exp(-x))
    ### END CODE HERE ###
    
    return s
    
def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)
    
    Returns:
    A3 -- The sigmoid output of the third (output layer) activation
    cache -- a dictionary containing "Z1", "A1", "Z2", "A2", "Z3" and "A3"
    """
    # Retrieve each parameter from the dictionary "parameters"
    ### START CODE HERE ### (≈ 6 lines of code)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    ### END CODE HERE ###
    
    # Implement Forward Propagation to calculate A3 (probabilities)
    ### START CODE HERE ### (≈ 6 lines of code)
    Z1 = np.dot(W1,X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2,A1) + b2
    A2 = np.maximum(0,Z2)
    Z3 = np.dot(W3,A2) + b3
    A3 = sigmoid(Z3)
    ### END CODE HERE ###
    
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2,
             "Z3": Z3,
             "A3": A3}
    
    return A3, cache

In [194]:
A3, cache = forward_propagation(X, parameters)
print(parameters["W2"].shape, cache['A1'].shape)

# Note: we use the mean here just to make sure that your output matches ours. 
print(np.mean(cache['Z1']) ,np.mean(cache['A1']),np.mean(cache['Z2']),np.mean(cache['A2']),np.mean(cache['Z3']),np.mean(cache['A3']))

(3, 8) (8, 1797)
-0.0007381787070461341 -0.000737816134359491 -4.44102897046795e-05 5.021025118428967e-05 7.440008365130357e-07 0.5000001860002091


Now that you have computed $A^{[3]}$ (in the Python variable "`A3`"), which contains $a^{[3](i)}$ for every example, you can compute the cost function as follows:

$$J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[3] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[3] (i)}\right) \right) \tag{9}$$

**Exercise**: Implement `compute_cost()` to compute the value of the cost $J$.

**Instructions**:
- There are many ways to implement the cross-entropy loss. To help you, here is how we would have implemented
$- \sum\limits_{i=1}^{m}  y^{(i)}\log(a^{[3](i)})$:
```python
logprobs = np.multiply(np.log(A3),Y)
cost = - np.sum(logprobs)                # no need to use a for loop!
```

(you can use either `np.multiply()` and then `np.sum()` or directly `np.dot()`).
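
The two options mentioned above can be compared on a small example. This is a minimal sketch with made-up values; `A_demo` and `Y_demo` are hypothetical and only stand in for `A3` and `Y`.

```python
import numpy as np

# Made-up predictions and labels, both of shape (1, m)
A_demo = np.array([[0.9, 0.2, 0.7, 0.4]])
Y_demo = np.array([[1, 0, 1, 0]])
m = Y_demo.shape[1]

# Variant 1: elementwise product followed by a sum
logprobs = np.multiply(np.log(A_demo), Y_demo) + np.multiply(np.log(1 - A_demo), 1 - Y_demo)
cost_sum = -np.sum(logprobs) / m

# Variant 2: the same sums expressed as dot products
cost_dot = -(np.dot(np.log(A_demo), Y_demo.T) + np.dot(np.log(1 - A_demo), (1 - Y_demo).T)) / m
cost_dot = cost_dot.item()  # (1, 1) array -> Python float

print(np.isclose(cost_sum, cost_dot))  # True
```

Note that the `np.dot()` variant returns an array of shape (1, 1), so it has to be reduced to a scalar before it can be used as the cost.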


In [195]:
# GRADED FUNCTION: compute_cost

def compute_cost(A3, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (9)
    
    Arguments:
    A3 -- The sigmoid output of the third (output layer) activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2, b2, W3 and b3
    
    Returns:
    cost -- cross-entropy cost given equation (9)
    """
    
    m = Y.shape[1] # number of examples

    # Compute the cross-entropy cost
    ### START CODE HERE ### (≈ 2 lines of code)
    logprobs = np.multiply(np.log(A3), Y) + np.multiply((1 - Y), np.log(1 - A3))
    cost = -np.sum(logprobs) / m
    ### END CODE HERE ###
    
    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. 
                                # E.g., turns [[17]] into 17 
    assert(isinstance(cost, float))
    
    return cost

In [196]:
print("cost = " + str(compute_cost(A3, Y, parameters)))

cost = 0.6931472014006419


Using the cache computed during forward propagation, you can now implement backward propagation.

**To Do**: Implement the function `backward_propagation()`.

**Instructions**:
Backpropagation is usually the hardest (most mathematical) part in deep learning. To help you, recall the slide from the lecture on backpropagation. That slide lists six vectorized equations for a network with one hidden layer; with two hidden layers you need nine, three per layer (one each for $dZ^{[l]}$, $dW^{[l]}$ and $db^{[l]}$).  
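
For reference, the vectorized gradients for this architecture can be written out as follows. This is a sketch derived from the forward pass implemented in `forward_propagation()` and the cross-entropy cost, not a reproduction of the lecture slide. Here $dZ^{[l]}$ denotes the gradient of the per-example loss with respect to $Z^{[l]}$, the $\frac{1}{m}$ averaging is applied in the $dW^{[l]}$ and $db^{[l]}$ terms, and $*$ denotes elementwise multiplication.

$$\begin{aligned}
dZ^{[3]} &= A^{[3]} - Y \\
dW^{[3]} &= \frac{1}{m}\, dZ^{[3]} A^{[2]T} \\
db^{[3]} &= \frac{1}{m} \sum\limits_{i=1}^{m} dZ^{[3](i)} \\
dZ^{[2]} &= W^{[3]T} dZ^{[3]} * g^{[2]'}(Z^{[2]}) \\
dW^{[2]} &= \frac{1}{m}\, dZ^{[2]} A^{[1]T} \\
db^{[2]} &= \frac{1}{m} \sum\limits_{i=1}^{m} dZ^{[2](i)} \\
dZ^{[1]} &= W^{[2]T} dZ^{[2]} * \left(1 - A^{[1]} * A^{[1]}\right) \\
dW^{[1]} &= \frac{1}{m}\, dZ^{[1]} X^{T} \\
db^{[1]} &= \frac{1}{m} \sum\limits_{i=1}^{m} dZ^{[1](i)}
\end{aligned}$$

The first line follows from combining the sigmoid output with the cross-entropy loss; $g^{[2]'}$ is the ReLU derivative and $1 - A^{[1]} * A^{[1]}$ is the tanh derivative.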

<!--
$\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } = \frac{1}{m} (a^{[2](i)} - y^{(i)})$

$\frac{\partial \mathcal{J} }{ \partial W_2 } = \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } a^{[1] (i) T} $

$\frac{\partial \mathcal{J} }{ \partial b_2 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)}}}$

$\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } =  W_2^T \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } * ( 1 - a^{[1] (i) 2}) $

$\frac{\partial \mathcal{J} }{ \partial W_1 } = \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} }  X^T $

$\frac{\partial \mathcal{J} _i }{ \partial b_1 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)}}}$

- Note that $*$ denotes elementwise multiplication.
- The notation you will use is common in deep learning coding:
    - dW1 = $\frac{\partial \mathcal{J} }{ \partial W_1 }$
    - db1 = $\frac{\partial \mathcal{J} }{ \partial b_1 }$
    - dW2 = $\frac{\partial \mathcal{J} }{ \partial W_2 }$
    - db2 = $\frac{\partial \mathcal{J} }{ \partial b_2 }$
    
-->

- Tips:
    - To compute dZ1 you'll need to compute $g^{[1]'}(Z^{[1]})$. Since $g^{[1]}(.)$ is the tanh activation function, if $a = g^{[1]}(z)$ then $g^{[1]'}(z) = 1-a^2$. So you can compute 
    $g^{[1]'}(Z^{[1]})$ using `(1 - np.power(A1, 2))`.
    - To compute dZ2 you'll need to compute $g^{[2]'}(Z^{[2]})$. Since $g^{[2]}(.)$ is the ReLU activation function, if $a = g^{[2]}(z)$ then $g^{[2]'}(z) = \begin{cases} 1 & \mbox{if } a > 0 \\ 0 & \mbox{otherwise } \end{cases}$. 
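
The derivative rules for the two hidden activations (tanh and ReLU) can be checked numerically with a central finite difference. This is a minimal sketch; the values in `z` and the step `eps` are made-up choices, and `z` deliberately avoids 0, where ReLU is not differentiable.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.3, 1.5])  # made-up pre-activations, none equal to 0
eps = 1e-6

# tanh: analytic derivative 1 - a^2 versus a central finite difference
a_tanh = np.tanh(z)
tanh_grad = 1 - np.power(a_tanh, 2)
tanh_numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

# ReLU: analytic derivative is 1 where the activation is positive, else 0
a_relu = np.maximum(0, z)
relu_grad = (a_relu > 0).astype(float)
relu_numeric = (np.maximum(0, z + eps) - np.maximum(0, z - eps)) / (2 * eps)

print(np.allclose(tanh_grad, tanh_numeric))  # True
print(np.allclose(relu_grad, relu_numeric))  # True
```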

In [207]:
# GRADED FUNCTION: backward_propagation

def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.
    
    Arguments:
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2", "A2", "Z3" and "A3".
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    
    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # First, retrieve W1, W2 and W3 from the dictionary "parameters".
    ### START CODE HERE ### (≈ 3 lines of code)
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    ### END CODE HERE ###
        
    # Retrieve also A1, A2 and A3 from dictionary "cache".
    ### START CODE HERE ### (≈ 3 lines of code)
    A1 = cache["A1"]
    A2 = cache["A2"]
    A3 = cache["A3"]
    ### END CODE HERE ###
    
    # Backward propagation: calculate dW1, db1, dW2, db2, dW3, db3. 
    ### START CODE HERE ### (≈ 12 lines of code, corresponding to 9 equations on slide above)
    dZ3 = A3-Y
    dW3 = 1/m*(np.dot(dZ3,A2.T))
    db3 = 1/m*(np.sum(dZ3,axis=1, keepdims=True))
    # A2_ is the derivative of ReLU evaluated at Z2: 1 where A2 > 0, 0 elsewhere
    A2_ = (A2 > 0).astype(float)
    dZ2 = np.multiply(np.dot(W3.T,dZ3),A2_)
    dW2 = 1/m*(np.dot(dZ2,A1.T))
    db2 = 1/m*(np.sum(dZ2,axis=1, keepdims=True))
    dZ1 = np.multiply(np.dot(W2.T,dZ2),(1-np.power(A1,2)))
    dW1 = 1/m*(np.dot(dZ1,X.T))
    db1 = 1/m*(np.sum(dZ1,axis=1, keepdims=True))
    ### END CODE HERE ###
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2,
             "dW3": dW3,
             "db3": db3 }
    
    return grads

In [208]:
grads = backward_propagation(parameters, cache, X, Y)
print ("dW1 = "+ str(grads["dW1"]))
print ("db1 = "+ str(grads["db1"]))
print ("dW2 = "+ str(grads["dW2"]))
print ("db2 = "+ str(grads["db2"]))
print ("dW3 = "+ str(grads["dW3"]))
print ("db3 = "+ str(grads["db3"]))

dW1 = [[ 0.00000000e+00  7.44235309e-03  4.11940999e-02  7.49772984e-02
   1.10941781e-01 -1.35342860e-01 -1.19212270e-01 -1.89098272e-02
  -3.18077742e-04  4.57549549e-02 -4.23725077e-02  2.02711323e-01
   3.60238256e-01 -1.13386184e-02 -9.05929928e-02 -7.43671458e-03
   1.58925807e-04  4.63246082e-02  8.48518063e-02  3.69466296e-01
   4.58835274e-01  1.37352359e-01  3.62265330e-02  4.20471096e-03
   2.70929961e-04  1.06837549e-01  1.15548433e-01  6.41237988e-02
   2.22396313e-01  1.20434107e-01  9.32984906e-02  2.70900884e-04
   0.00000000e+00  1.65186325e-01 -5.31501268e-03 -1.10162678e-01
   7.40512999e-02  3.98073188e-02  4.34973882e-02  0.00000000e+00
   1.56729810e-03  1.19611948e-01  7.45101151e-02  1.77869152e-01
   2.82019764e-01  9.22573355e-02 -6.95409616e-02 -4.22466884e-03
   7.06799768e-04  4.14917320e-02  8.25639552e-02  1.53915958e-01
   5.75110669e-01  2.32511632e-01 -3.34931564e-02  1.00061025e-02
   1.60965805e-04  8.95955534e-03  6.29271780e-03  1.26101717e-01
   4

In [209]:
# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, learning_rate = 0.2):
    """
    Updates parameters using the gradient descent update rule
    theta = theta - learning_rate * d_theta
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients 
    learning_rate -- step size used for the gradient descent update
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """
    # Retrieve each parameter from the dictionary "parameters"
    ### START CODE HERE ### (≈ 6 lines of code)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    ### END CODE HERE ###
    
    # Retrieve each gradient from the dictionary "grads"
    ### START CODE HERE ### (≈ 6 lines of code)
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]
    dW3 = grads["dW3"]
    db3 = grads["db3"]
    ## END CODE HERE ###
    
    # Update rule for each parameter
    ### START CODE HERE ### (≈ 6 lines of code)
    W1 = W1 - learning_rate*dW1
    b1 = b1 - learning_rate*db1
    W2 = W2 - learning_rate*dW2
    b2 = b2 - learning_rate*db2
    W3 = W3 - learning_rate*dW3
    b3 = b3 - learning_rate*db3
    ### END CODE HERE ###
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    
    return parameters

In [210]:

parameters = update_parameters(parameters, grads)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("W3 = " + str(parameters["W3"]))
print("b3 = " + str(parameters["b3"]))

W1 = [[-2.08378924e-03 -3.73836403e-02  4.90730518e-02  2.50722223e-01
   5.10237386e-01 -7.57073995e-02  1.38563145e-01  3.86032009e-02
   3.86622201e-02  2.02145704e-01  6.03407930e-01  2.14000404e-01
   1.35563794e-02  8.72799268e-02  2.43121908e-01  7.67999373e-02
  -5.91457459e-05 -7.47511240e-02 -2.23639948e-01 -8.45604958e-01
  -1.41470417e-02 -2.39986373e-01 -4.08739788e-02 -3.13309991e-02
  -7.34331807e-03 -4.36166690e-01 -6.59279451e-01 -3.51072226e-01
   4.25180416e-01 -1.42388198e-01 -1.07484030e-01  1.66346075e-03
  -1.21738379e-02 -4.81202290e-02 -2.16093181e-01  5.98695932e-01
  -3.04509247e-01 -2.04646562e-01  1.72620978e-01  2.71176286e-03
  -3.84949471e-03 -1.34689474e-01 -1.31355929e-01 -7.59965330e-01
   4.29625130e-01  2.09123730e-01  1.60573434e-01  1.31738170e-02
  -6.90884577e-03 -2.01561818e-01  6.63541586e-02 -5.62328676e-01
  -2.74997247e-01  4.45127148e-01  2.32788047e-01 -3.79947199e-02
  -3.90525592e-03  1.06385408e-02  3.82604435e-01  2.37581254e-01
  -3.

### 4.4 - Integrate parts 4.1, 4.2 and 4.3 in nn_model() ####

**Question**: Build your neural network model in `nn_model()`.

**Instructions**: The neural network model has to use the previous functions in the right order.

In [211]:
# GRADED FUNCTION: nn_model

def nn_model(X, Y, n_h1, n_h2, num_iterations = 10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h1 -- size of first hidden layer
    n_h2 -- size of second hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    
    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[-1]
    
    # Initialize parameters, then retrieve W1, b1, W2, b2, W3, b3. Inputs: "n_x, n_h1, n_h2, n_y". Outputs = "W1, b1, W2, b2, W3, b3, parameters".
    ### START CODE HERE ### (≈ 7 lines of code)
    parameters = initialize_parameters(n_x,n_h1,n_h2,n_y)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    ### END CODE HERE ###
    
    
    # Loop (gradient descent)

    for i in range(0, num_iterations):
         
        ### START CODE HERE ### (≈ 4 lines of code)
        # Forward propagation. Inputs: "X, parameters". Outputs: "A3, cache".
        A3, cache = forward_propagation(X,parameters)
        
        # Cost function. Inputs: "A3, Y, parameters". Outputs: "cost".
        cost = compute_cost(A3,Y,parameters)
 
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters,cache,X,Y)
 
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters,grads)
        
        ### END CODE HERE ###
        
        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    return parameters

In [212]:
parameters = nn_model(X, Y, 8,3, num_iterations=10000, print_cost=True)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("W3 = " + str(parameters["W3"]))
print("b3 = " + str(parameters["b3"]))

Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.693143
Cost after iteration 2000: 0.693142
Cost after iteration 3000: 0.693106
Cost after iteration 4000: 0.204442
Cost after iteration 5000: 0.042095
Cost after iteration 6000: 0.013784
Cost after iteration 7000: 0.003936
Cost after iteration 8000: 0.001978
Cost after iteration 9000: 0.001226
W1 = [[-2.08378924e-03 -3.58951697e-02  5.73118718e-02  2.65717683e-01
   5.32425743e-01 -1.02775972e-01  1.14720691e-01  3.48212355e-02
   3.85986045e-02  2.11296695e-01  5.94933429e-01  2.54542669e-01
   8.56040305e-02  8.50122031e-02  2.25003309e-01  7.53125944e-02
  -2.73605845e-05 -6.54862023e-02 -2.06669586e-01 -7.71711699e-01
   7.76200131e-02 -2.12515902e-01 -3.36286723e-02 -3.04900569e-02
  -7.28913207e-03 -4.14799180e-01 -6.36169765e-01 -3.38247467e-01
   4.69659679e-01 -1.18301377e-01 -8.88243315e-02  1.71764093e-03
  -1.21738379e-02 -1.50829641e-02 -2.17156183e-01  5.76663397e-01
  -2.89698987e-01 -1.96685098e-01  1.8132045

### 4.5 Predictions

**Question**: Use your model to make predictions by building `predict()`.
Use forward propagation to compute the predictions.

**Reminder**: predictions = $y_{prediction} = \mathbb 1 \text{{activation > 0.5}} = \begin{cases}
      1 & \text{if}\ activation > 0.5 \\
      0 & \text{otherwise}
    \end{cases}$  
    
As an example, if you would like to set the entries of a matrix X to 0 and 1 based on a threshold you would do: ```X_new = (X > threshold)```
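
A small illustration of this thresholding, using made-up activations (the array `activations` is hypothetical):

```python
import numpy as np

# Made-up activations standing in for the output of the sigmoid layer
activations = np.array([[0.1, 0.5, 0.51, 0.97]])
threshold = 0.5

# The comparison yields booleans; multiplying by 1 converts them to 0/1 integers
labels = 1 * (activations > threshold)

print(labels)  # [[0 0 1 1]]
```

Because the comparison is strict, an activation of exactly 0.5 is assigned to class 0.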

In [213]:
# GRADED FUNCTION: predict

def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    X -- input data of size (n_x, m)
    
    Returns
    predictions -- vector of predictions of our model (0: digits 0-4 / 1: digits 5-9)
    """
    
    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    ### START CODE HERE ### (≈ 2 lines of code)
    A3, cache = forward_propagation(X,parameters)
    predictions = 1*(A3>0.5)
    ### END CODE HERE ###
    
    return predictions

In [214]:
# Build a model with an n_h1-dimensional first hidden layer and an n_h2-dimensional second hidden layer
parameters = nn_model(X, Y, n_h1 = 16,n_h2 = 4, num_iterations = 20000, print_cost=True)

Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.693143
Cost after iteration 2000: 0.693131
Cost after iteration 3000: 0.279291
Cost after iteration 4000: 0.093416
Cost after iteration 5000: 0.023652
Cost after iteration 6000: 0.007348
Cost after iteration 7000: 0.004574
Cost after iteration 8000: 0.003466
Cost after iteration 9000: 0.002756
Cost after iteration 10000: 0.002127
Cost after iteration 11000: 0.001480
Cost after iteration 12000: 0.000948
Cost after iteration 13000: 0.000638
Cost after iteration 14000: 0.000471
Cost after iteration 15000: 0.000373
Cost after iteration 16000: 0.000308
Cost after iteration 17000: 0.000262
Cost after iteration 18000: 0.000228
Cost after iteration 19000: 0.000201


In [215]:
# Print accuracy
predictions = predict(parameters, X)
print(predictions)
print( X.shape, predictions.shape)
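# Correctly predicted ones plus correctly predicted zeros; .item() turns the (1, 1) array into a scalar
accuracy = (np.dot(Y, predictions.T) + np.dot(1 - Y, 1 - predictions.T)).item() / Y.size * 100
print('Accuracy: %d' % accuracy + '%')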

[[0 0 0 ... 1 1 1]]
(64, 1797) (1, 1797)
Accuracy: 100%
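
The accuracy expression in the cell above adds the number of correctly predicted ones, `np.dot(Y, predictions.T)`, to the number of correctly predicted zeros, `np.dot(1-Y, 1-predictions.T)`. The sketch below, on made-up labels (`Y_demo` and `pred_demo` are hypothetical), suggests that this matches the fraction of positions where prediction and label agree.

```python
import numpy as np

Y_demo = np.array([[0, 1, 1, 0, 1]])     # made-up true labels
pred_demo = np.array([[0, 1, 0, 0, 1]])  # made-up predictions (one mistake)

# Correct ones plus correct zeros, divided by the number of examples
matches = np.dot(Y_demo, pred_demo.T) + np.dot(1 - Y_demo, 1 - pred_demo.T)
acc_dot = matches.item() / Y_demo.size * 100

# Equivalent: mean of the elementwise agreement
acc_mean = np.mean(Y_demo == pred_demo) * 100

print(np.isclose(acc_dot, acc_mean))  # True
```

Note that the accuracy reported in this notebook is measured on the same examples the model was trained on.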


**Optional questions**:

Some optional/ungraded questions that you can explore if you wish: 
- What happens when you change the tanh activation for a sigmoid activation or a ReLU activation?
- Play with the learning_rate. What happens?
- What if we change the dataset?
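
To get a feel for the learning-rate question without retraining the network, the sketch below runs plain gradient descent on the one-dimensional function $f(w) = w^2$. The function, the starting point and the three rates are illustrative assumptions and are unrelated to the digits data.

```python
def gradient_descent_1d(learning_rate, num_iterations=20, w_start=1.0):
    """Minimize f(w) = w**2 with the update w = w - learning_rate * f'(w)."""
    w = w_start
    for _ in range(num_iterations):
        grad = 2 * w               # derivative of w**2
        w = w - learning_rate * grad
    return w

for lr in (0.01, 0.2, 1.1):
    print(lr, gradient_descent_1d(lr))
# For this function, a small rate converges slowly, a moderate rate converges
# quickly, and a rate above 1.0 makes the iterates grow in magnitude.
```

For the neural network itself the thresholds will differ, so the rates here should not be read as recommendations for `update_parameters()`.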

## 5 - COMPARISON WITH LOGISTIC REGRESSION  ##

In [222]:
import pandas as pad
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
dataset = load_digits()
X = dataset.data
new_Y = dataset.target
Y=[]
# transforming the 10 classes into a binary problem: 0,1,2,3,4 become 0 and 5,6,7,8,9 become 1
t=0
while t<len(new_Y):
    if new_Y[t] in [0,1,2,3,4]:
        Y.append(0)
    else:
        Y.append(1)
    t+=1
        

model = LogisticRegression()
model.fit(X,Y)

prediction=model.predict(X)

print("Accuracy : "+str(100*accuracy_score(Y,prediction))+"%")

Accuracy : 91.15191986644408%
