# Building a Neural Network

## 1 - The general methodology to build a Neural Network

1. Define the neural network structure (# of input units, # of hidden units, etc)
2. Initialize the model's parameters
3. Loop
    - Forward propagation
    - Compute loss
    - Backward propagation to get the gradients
    - Update parameters (gradient descent)
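Steps 1 and 2 above can be sketched in NumPy as follows. This is a minimal sketch, not a definitive implementation: the function name `initialize_parameters`, the layer sizes, the `0.01` scaling factor, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=0):
    """Small random weights and zero biases for a one-hidden-layer network.

    n_x, n_h, n_y: number of input, hidden, and output units (assumed names).
    """
    rng = np.random.default_rng(seed)
    return {
        # Random weights break the symmetry between hidden units.
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(n_x=2, n_h=4, n_y=1)
print({name: value.shape for name, value in params.items()})
```

Each weight matrix has shape (units in this layer, units in the previous layer), so that `W @ x` maps a column vector of inputs to a column vector of pre-activations.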

## 2 - General Idea of Loop

### 2.1 - Neural Network Model

<img src="./img/neural_network_model.png" style="width:540px;">

### 2.2 - Forward Propagation

For one example $x^{(i)}$:

$$z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}$$

$$a^{[1](i)} = \tanh(z^{[1](i)})$$

$$z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}$$

$$\hat{y}^{(i)} = a^{[2](i)} = \sigma(z^{[2](i)})$$

$$y^{(i)}_{prediction}= 
\begin{cases}
    1, & \text{if } \hat{y}^{(i)} > 0.5 \\
    0, & \text{otherwise}
\end{cases}$$
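The forward pass above can be written for all examples at once by stacking the $x^{(i)}$ as columns of a matrix. The sketch below assumes that layout (`X` of shape `(n_x, m)`); the function names and the random test data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, W1, b1, W2, b2):
    """X has shape (n_x, m): one column per example."""
    Z1 = W1 @ X + b1      # (n_h, m); b1 broadcasts across the m columns
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2     # (n_y, m)
    A2 = sigmoid(Z2)      # y_hat for every example
    return Z1, A1, Z2, A2

def predict(A2):
    """Threshold the output activation at 0.5."""
    return (A2 > 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))                      # 2 features, 5 examples
W1, b1 = rng.standard_normal((4, 2)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
_, _, _, A2 = forward_propagation(X, W1, b1, W2, b2)
print(A2.shape, predict(A2))
```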

### 2.3 - Cost Function

$$J = - \frac{1}{m} \sum_{i = 1}^{m} \Big( y^{(i)}\log(a^{[2] (i)}) + (1-y^{(i)})\log(1- a^{[2] (i)}) \Big)$$
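As a rough illustration, the cross-entropy cost can be computed in one vectorized expression. The name `compute_cost` and the sample activations and labels below are made up for the example; the sketch also assumes the activations are strictly between 0 and 1 so that the logarithms are finite.

```python
import numpy as np

def compute_cost(A2, Y):
    """Average cross-entropy over m examples; A2 and Y have shape (1, m)."""
    m = Y.shape[1]
    log_likelihood = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    return float(-np.sum(log_likelihood) / m)

A2 = np.array([[0.9, 0.2, 0.5]])   # predicted probabilities (illustrative)
Y = np.array([[1, 0, 1]])          # true labels (illustrative)
cost = compute_cost(A2, Y)
print(cost)
```

Only one of the two terms is active per example: when $y^{(i)} = 1$ the cost penalizes a small $a^{[2](i)}$, and when $y^{(i)} = 0$ it penalizes a large one.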

### 2.4 - Backpropagation

<img src="./img/grad_summary.png" style="width:600px;">


### 2.5 - Update parameters

General gradient descent rule: $\theta = \theta - \alpha \frac{\partial J}{\partial \theta}$, where $\alpha$ is the learning rate and $\theta$ represents a parameter.
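The update rule applies to every parameter in the same way. A minimal sketch, assuming parameters and gradients are stored in dictionaries where the gradient of `"W"` is keyed `"dW"` (a naming convention chosen here, not prescribed by the text):

```python
import numpy as np

def update_parameters(params, grads, learning_rate=0.1):
    """Return theta - alpha * dJ/dtheta for every parameter theta."""
    return {
        name: value - learning_rate * grads["d" + name]
        for name, value in params.items()
    }

# Illustrative values only.
params = {"W": np.array([[1.0, -2.0]]), "b": np.array([[0.5]])}
grads = {"dW": np.array([[0.5, -1.0]]), "db": np.array([[1.0]])}
new_params = update_parameters(params, grads, learning_rate=0.1)
print(new_params)
```

Returning a new dictionary instead of modifying `params` in place keeps the previous values available, which is convenient when comparing costs before and after a step.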


## 3 - Further explanation

### 3.1 - Backpropagation

Calculating the partial derivative of the error with respect to a weight $w_{ij}$ is done using the chain rule twice:

$$\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}}$$

Here $J$ is the cost function.

For each neuron $j$, its output $o_{j}$ is defined as
$$o_j = \varphi(net_j) = \varphi\Big(\sum_{k=1}^{n}w_{kj}o_k\Big)$$
The activation function $\varphi$ is non-linear and differentiable. Common activation functions include sigmoid and tanh.

The input $net_j$ to a neuron is the weighted sum of the outputs $o_k$ of the neurons in the previous layer.

#### The third factor of $\frac{\partial J}{\partial w_{ij}}$:

$$\frac{\partial net_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \Bigg(\sum_{k=1}^{n}w_{kj}o_k\Bigg) = \frac{\partial}{\partial w_{ij}} w_{ij}o_i = o_i $$

If the neuron is in the first layer after the input layer, $o_i$ is just $x_i$.

#### The second factor of $\frac{\partial J}{\partial w_{ij}}$:

$$\frac{\partial o_j}{\partial net_j} = \frac{\partial}{\partial net_j} \varphi(net_j) = \varphi'(net_j)$$

The derivative of the output of neuron $j$ with respect to its input is simply the derivative of the activation function.

#### The first factor of $\frac{\partial J}{\partial w_{ij}}$:
When the neuron is in the output layer, its output is the network's prediction, $o_j = \hat{y}$, so
$$\frac{\partial J}{\partial o_j} = \frac{\partial J(\hat{y})}{\partial \hat{y}} = J'(\hat{y})$$

When $j$ is in an arbitrary inner layer of the network, finding the derivative of $J$ with respect to $o_j$ is less obvious.

Consider $J$ as a function of the inputs of all neurons $L = \{u, v, \dots, w\}$ that receive input from neuron $j$:

$$\frac{\partial J(o_j)}{\partial o_j} = \frac{\partial J(net_u, net_v, ..., net_w)}{\partial o_j}
= \sum_{l \in L} \Bigg(\frac{\partial J}{\partial net_l} \frac{\partial net_l}{\partial o_j}\Bigg)
= \sum_{l \in L} \Bigg(\frac{\partial J}{\partial o_l} \frac{\partial o_l}{\partial net_l} \frac{\partial net_l}{\partial o_j}\Bigg)
= \sum_{l \in L} \Bigg(\frac{\partial J}{\partial o_l} \frac{\partial o_l}{\partial net_l}w_{jl}\Bigg)$$

Therefore, the derivative with respect to $o_j$ can be calculated if all the derivatives with respect to the outputs $o_l$ of the next layer are known.

#### Putting it together

$$\frac{\partial J}{\partial w_{ij}} = \delta_j o_i$$
with
$$\delta_j = \frac{\partial J}{\partial o_j}\frac{\partial o_j}{\partial net_j} =
\begin{cases}
    J'(o_j) \varphi'(net_j), & \text{if $j$ is an output neuron} \\
    (\sum_{l \in L}w_{jl}\delta_l)\varphi'(net_j), & \text{if $j$ is an inner neuron}
\end{cases}$$
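The recursion for $\delta_j$ can be checked numerically on a small network. The sketch below makes several assumptions that are not part of the derivation: it uses a squared-error cost $J = \frac{1}{2}\sum_j (o_j - t_j)^2$ with a made-up target `t`, sigmoid activations in every layer, no bias terms (matching $net_j = \sum_k w_{kj} o_k$), and stores weights so that `W[i, j]` is $w_{ij}$.

```python
import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    """Return the outputs o of every layer, starting with the input x."""
    outputs = [x]
    for W in weights:                       # W[i, j] = w_ij (from i to j)
        outputs.append(phi(outputs[-1] @ W))
    return outputs

def cost(weights, x, t):
    return 0.5 * np.sum((forward(weights, x)[-1] - t) ** 2)

def gradients(weights, x, t):
    outputs = forward(weights, x)
    o_out = outputs[-1]
    # Output neurons: delta_j = J'(o_j) * phi'(net_j), with phi' = o (1 - o).
    delta = (o_out - t) * o_out * (1 - o_out)
    grads = [None] * len(weights)
    for layer in reversed(range(len(weights))):
        # dJ/dw_ij = o_i * delta_j
        grads[layer] = np.outer(outputs[layer], delta)
        if layer > 0:
            o = outputs[layer]
            # Inner neurons: delta_j = (sum_l w_jl * delta_l) * phi'(net_j).
            delta = (weights[layer] @ delta) * o * (1 - o)
    return grads

rng = np.random.default_rng(1)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
x, t = rng.standard_normal(3), np.array([0.0, 1.0])
grads = gradients(weights, x, t)

# Central finite difference on a single first-layer weight.
eps = 1e-6
plus = [W.copy() for W in weights]
minus = [W.copy() for W in weights]
plus[0][2, 1] += eps
minus[0][2, 1] -= eps
numeric = (cost(plus, x, t) - cost(minus, x, t)) / (2 * eps)
print(numeric, grads[0][2, 1])
```

The two printed numbers should agree closely, which suggests that propagating $\delta$ backwards layer by layer reproduces the derivative obtained by perturbing the weight directly.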

#### About the summary of gradients in 2.4

Sigmoid is used in the output layer. Thus $\varphi(z) = \frac{1}{1 + e^{-z}}$, which has the convenient derivative
$$\frac{d\varphi}{dz}(z) = \varphi(z)(1-\varphi(z))$$

Thus

$$dz^{[2]} = J'(o_j) \varphi'(net_j) = \frac{\partial}{\partial a}\Big(-y\log a - (1-y)\log(1-a)\Big)\, a(1-a) =
\Big(\frac{-y}{a} + \frac{1-y}{1-a}\Big) a(1-a) = a - y = a^{[2]} - y$$
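The simplification to $a^{[2]} - y$ can be sanity-checked with a finite difference on the per-example loss. The values of `z` and `y` below are arbitrary choices for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Cross-entropy loss of one example as a function of z = z^[2]."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6   # arbitrary test point
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y    # the claimed dz^[2] = a^[2] - y
print(numeric, analytic)
```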

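$$dz^{[1]} = \Big(\sum_{l \in L}w_{jl}\delta_l\Big)\varphi'(net_j) = W^{[2]T}dz^{[2]} * g^{[1]'}(z^{[1]})$$

where $*$ denotes the elementwise product and $g^{[1]} = \tanh$ is the hidden-layer activation.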

From

$$z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}$$

$$z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}$$

we know that

$$dW^{[2]} = dz^{[2]}a^{[1]T}$$

$$db^{[2]} = dz^{[2]}$$

$$dW^{[1]} = dz^{[1]}x^{T}$$

$$db^{[1]} = dz^{[1]}$$
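The four gradients above are written for a single example. A vectorized sketch over $m$ examples, assuming the examples are stacked as columns and the gradients are averaged to match the $\frac{1}{m}$ in the cost, might look like this. The function names, layer sizes, and random data are illustrative assumptions; the final lines compare one entry against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    A1 = np.tanh(p["W1"] @ X + p["b1"])
    A2 = sigmoid(p["W2"] @ A1 + p["b2"])
    return A1, A2

def cost(X, Y, p):
    _, A2 = forward(X, p)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(X, Y, p):
    m = X.shape[1]
    A1, A2 = forward(X, p)
    dZ2 = A2 - Y                                # sigmoid + cross-entropy
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)     # tanh'(z) = 1 - tanh(z)^2
    return {
        "W1": dZ1 @ X.T / m,
        "b1": dZ1.sum(axis=1, keepdims=True) / m,
        "W2": dZ2 @ A1.T / m,
        "b2": dZ2.sum(axis=1, keepdims=True) / m,
    }

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6))                 # 2 features, 6 examples
Y = (rng.random((1, 6)) > 0.5).astype(float)
p = {"W1": rng.standard_normal((3, 2)), "b1": np.zeros((3, 1)),
     "W2": rng.standard_normal((1, 3)), "b2": np.zeros((1, 1))}
grads = backward(X, Y, p)

# Central finite difference on one entry of W1.
eps = 1e-6
p_plus = {k: v.copy() for k, v in p.items()}
p_minus = {k: v.copy() for k, v in p.items()}
p_plus["W1"][0, 1] += eps
p_minus["W1"][0, 1] -= eps
numeric = (cost(X, Y, p_plus) - cost(X, Y, p_minus)) / (2 * eps)
print(numeric, grads["W1"][0, 1])
```

Summing `dZ` across columns for the bias gradients follows from the bias being shared by all examples, so every example contributes to its derivative.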

# References:

- Coursera Deep Learning, Andrew Ng, https://www.coursera.org/learn/neural-networks-deep-learning/programming/wRuwL/planar-data-classification-with-a-hidden-layer
- Wikipedia contributors, "Backpropagation," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Backpropagation&oldid=871032328 (accessed December 1, 2018).
