# Build a Multi-Layer Neural Network


## Weights Initialization
---
First, the weights of each layer need to be initialized. Note that, by convention, the input is not counted as a layer, but the output is.

For layer $l$:
- weight $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$
- bias $b^{[l]}$ has shape $(n^{[l]}, 1)$

where $n^{[0]}$ equals the number of input features.
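The shape rules above can be checked with a short sketch. It pushes a zero-filled batch through layers of the stated shapes and records the shape of each result; the batch size `m = 4` is an arbitrary value chosen for illustration.

```python
import numpy as np

layers_dim = [2, 5, 1]  # same layer sizes as the example in this notebook
m = 4                   # hypothetical batch size, chosen only for illustration

A = np.zeros((layers_dim[0], m))  # A^[0] = X has shape (n^[0], m)
shapes = []
for l in range(1, len(layers_dim)):
    W = np.zeros((layers_dim[l], layers_dim[l - 1]))  # (n^[l], n^[l-1])
    b = np.zeros((layers_dim[l], 1))                  # (n^[l], 1)
    A = np.dot(W, A) + b  # b is broadcast across the m columns
    shapes.append(A.shape)

print(shapes)  # [(5, 4), (1, 4)]
```

Each layer's output has shape $(n^{[l]}, m)$, which is why the bias only needs a single column.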

In [2]:
import numpy as np

In [5]:
def weights_init(layers_dim):
    params = {}
    
    n = len(layers_dim)
    for i in range(1, n):
        params['W' + str(i)] = np.random.randn(layers_dim[i], layers_dim[i-1])
        params['b' + str(i)] = np.random.randn(layers_dim[i], 1)
    return params

In [6]:
p = weights_init([2, 5, 1])
p

{'W1': array([[-1.20260404,  1.51803722],
        [ 1.31068456,  3.23918964],
        [ 0.89989452,  0.42286449],
        [ 0.83184057, -0.17971632],
        [-1.13190191, -1.41989977]]),
 'b1': array([[ 0.33066558],
        [ 0.69946101],
        [-0.42476741],
        [ 0.19228531],
        [ 0.61643758]]),
 'W2': array([[ 0.43247981,  1.33887682, -0.54990187,  0.42008628,  0.22424862]]),
 'b2': array([[0.14716179]])}

# Forward Propagation
---
## Equations of a Multi-Layer Network
---
$$ Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]} $$

$$ A^{[l]} = g^{[l]}(Z^{[l]}) $$

where $l$ is the layer index, $g^{[l]}$ is the activation function of layer $l$, and $A^{[0]} = X$ is the input.

In [31]:
def sigmoid(x):
    return 1/(1 + np.exp(-x))


def relu(x):
    return np.maximum(x, 0)

In [33]:
x = np.array([-1.2, -2.0, 1.3])

sx = sigmoid(x)
rx = relu(x)

print(sx, rx)

[0.23147522 0.11920292 0.78583498] [0.  0.  1.3]


In [47]:
def forward(X, params):
    # hidden layers use relu as the activation; the last layer uses sigmoid
    n_layers = len(params) // 2  # each layer contributes one W and one b
    A = X
    cache = {}
    for i in range(1, n_layers):
        W, b = params['W'+str(i)], params['b'+str(i)]
        Z = np.dot(W, A) + b
        A = relu(Z)
        cache['Z'+str(i)] = Z
        cache['A'+str(i)] = A
    
    # last layer: index it by n_layers rather than i+1, so that a
    # single-layer network (where the loop above never runs) also works
    W, b = params['W'+str(n_layers)], params['b'+str(n_layers)]
    Z = np.dot(W, A) + b
    A = sigmoid(Z)
    cache['Z'+str(n_layers)] = Z
    cache['A'+str(n_layers)] = A
    
    return cache, A

In [49]:
X = np.array([1., 1.]).reshape(2, 1)
cache, A = forward(X, p)

In [50]:
cache

{'Z1': array([[ 0.64609876],
        [ 5.24933522],
        [ 0.8979916 ],
        [ 0.84440957],
        [-1.9353641 ]]),
 'A1': array([[0.64609876],
        [5.24933522],
        [0.8979916 ],
        [0.84440957],
        [0.        ]]),
 'Z2': array([[7.31571729]]),
 'A2': array([[0.99933544]])}

In [51]:
A

array([[0.99933544]])

# Cost Function
---
As before, we treat this as a binary classification problem, so the cost over a batch of $m$ examples is:
$$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) $$

where $a^{[L](i)}$ is the predicted value for example $i$ and $y^{(i)}$ is its actual label.

In [68]:
def compute_cost(A, Y):
    """
    For binary classification, both A and Y would have shape (1, m), where m is the batch size
    """
    assert A.shape == Y.shape
    m = A.shape[1]
    s = np.dot(Y, np.log(A.T)) + np.dot(1-Y, np.log((1 - A).T))
    loss = -s/m
    return np.squeeze(loss)

In [69]:
A = np.array([[0.9, 0.3]])
Y = np.array([[1, 0]])

loss = compute_cost(A, Y)
print(loss)

0.23101772979827936
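As a hand check of the output above, substituting $A = [0.9, 0.3]$ and $Y = [1, 0]$ into the cost formula with $m = 2$ (natural logarithm, values rounded) gives:

$$ -\frac{1}{2}\left[\left(1\cdot\log 0.9 + 0\cdot\log 0.1\right) + \left(0\cdot\log 0.3 + 1\cdot\log 0.7\right)\right] = -\frac{1}{2}\left(\log 0.9 + \log 0.7\right) \approx -\frac{1}{2}\left(-0.10536 - 0.35667\right) \approx 0.23102 $$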


# Backward Propagation
---
<img src='images/backprop_kiank.png' style="width:800px;height:250px;">
<caption><center> **[source]**: https://github.com/enggen/Deep-Learning-Coursera </center></caption>

The gradients can be computed recursively, from the last layer back to the first ($*$ denotes element-wise multiplication):

$$ dZ^{[l]} = dA^{[l]} * g^{[l]'}(Z^{[l]}) $$
$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} $$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} $$
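The recursion above can be sketched in code as follows. This is a minimal illustration, not a definitive implementation, and it rests on several assumptions: the function name `backward` and the `grads` dictionary keys are hypothetical; the hidden layers use ReLU and the last layer uses sigmoid, as in the forward pass; and the recursion is seeded with $dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}$, the derivative of the cross-entropy cost, which the text above does not state explicitly. The `forward` function is repeated in compact form only to keep the block self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0)

def forward(X, params):
    # compact restatement of the forward pass: relu hidden layers, sigmoid output
    n_layers = len(params) // 2
    A, cache = X, {}
    for l in range(1, n_layers + 1):
        Z = np.dot(params['W' + str(l)], A) + params['b' + str(l)]
        A = sigmoid(Z) if l == n_layers else relu(Z)
        cache['Z' + str(l)], cache['A' + str(l)] = Z, A
    return cache, A

def backward(X, Y, params, cache):
    # hypothetical helper: returns {'dW1': ..., 'db1': ..., ...}
    n_layers = len(params) // 2
    m = X.shape[1]
    grads = {}
    AL = cache['A' + str(n_layers)]
    # assumed seed: derivative of the cross-entropy cost w.r.t. A^[L]
    dA = -(Y / AL) + (1 - Y) / (1 - AL)
    for l in range(n_layers, 0, -1):
        Z = cache['Z' + str(l)]
        A_prev = X if l == 1 else cache['A' + str(l - 1)]
        if l == n_layers:
            s = sigmoid(Z)
            dZ = dA * s * (1 - s)  # sigmoid'(Z) = s * (1 - s)
        else:
            dZ = dA * (Z > 0)      # relu'(Z) is 1 where Z > 0, else 0
        grads['dW' + str(l)] = np.dot(dZ, A_prev.T) / m
        grads['db' + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = np.dot(params['W' + str(l)].T, dZ)  # becomes dA^[l-1]
    return grads

# usage on a small random network (sizes chosen only for illustration)
rng = np.random.default_rng(0)
params = {'W1': rng.standard_normal((3, 2)), 'b1': rng.standard_normal((3, 1)),
          'W2': rng.standard_normal((1, 3)), 'b2': rng.standard_normal((1, 1))}
X = rng.standard_normal((2, 4))
Y = np.array([[1., 0., 1., 0.]])

cache, AL = forward(X, params)
grads = backward(X, Y, params, cache)
for name, g in grads.items():
    print(name, g.shape)
```

Each gradient has the same shape as the parameter it belongs to. Note that the seed for $dA^{[L]}$ divides by $A^{[L]}$ and $1 - A^{[L]}$, so it becomes numerically unstable when the prediction saturates at 0 or 1; a common alternative for a sigmoid output is to use the simplified form $dZ^{[L]} = A^{[L]} - Y$ directly.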
