## Model Representation
### Components of Neural Network
#### A Single Logistic Neuron Unit
- Input features
    - The $x_0$ input is the "bias unit" and is always equal to 1
- Input Wire
- Neuron
- Output Wire
- Output Hypothesis $h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}}$

![W4-LOGISTIC-UNIT](Plots/W4-LOGISTIC-UNIT.png)
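
As a minimal sketch of the unit above, the hypothesis can be evaluated in Octave as follows. The values of `theta` and `x` are made up for illustration; only the formula comes from these notes.

```octave
% Sigmoid activation g(z) = 1 / (1 + e^(-z)), applied element-wise
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters and inputs (illustrative values only)
theta = [-1; 2; 0.5];      % theta_0 (bias weight), theta_1, theta_2
x     = [1; 0.5; 1];       % x_0 = 1 is the bias unit

h = sigmoid(theta' * x);   % h_theta(x) = g(theta^T x)
```

Here `h` always lies strictly between 0 and 1, which is what lets it be read as a probability.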

#### Activation Units
Nodes in the intermediate/hidden layers
- $a^{(j)}_i$: Activation of unit $i$ in layer $j$ (be aware of the notation)
- $\Theta^{(j)}$: matrix of weights controlling the function mapping from layer $j$ to layer $j+1$; the entry $\Theta^{(j)}_{b,a}$ is the weight from node $a$ in layer $j$ to node $b$ in layer $j+1$
    - Dimension of $\Theta^{(j)}$: $s_{j+1} \times (s_j + 1)$ if layer $j$ has $s_j$ units and layer $j+1$ has $s_{j+1}$ units
    - Each layer has an **additional bias unit** $x_0$/$a_0$

### Forward Propagation - Compute the Hypothesis
In the neural network below, we have 3 nodes in the input layer and 3 nodes in the second (hidden) layer, so the dimension of $\Theta^{(1)}$ is $3 \times 4$ (12 weights, including those for the bias unit).

The computation of the second layer (hidden layer) is shown with $a^{(2)}_1,a^{(2)}_2,a^{(2)}_3$.

The final output $h_{\Theta}(x)$ is computed as $a^{(3)}_1$.

![W4-FORWARD-PROP](Plots/W4-FORWARD-PROP.png)
![W4-NN-FP](Plots/W4-NN-FP.png)
- On the right side, the figure shows the vectorized computation. Note: 
    - Remember to add bias unit (1) to each layer
    - Use $z^{(j)}_k$ to represent the product inside the $g()$ function
        - $z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$
    - The $g()$ in the screenshot stands for the sigmoid function $g(z) = \frac{1}{1+e^{-z}}$
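
The vectorized steps above can be sketched in Octave for the 3-3-1 network described in this section. The weights and the input are random or made-up values, used only to show the shapes and the order of operations.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical weights for a 3-3-1 network (random, for shape checking only)
Theta1 = rand(3, 4) - 0.5;   % layer 1 -> 2: s_2 x (s_1 + 1) = 3 x 4
Theta2 = rand(1, 4) - 0.5;   % layer 2 -> 3: s_3 x (s_2 + 1) = 1 x 4

x  = [0.2; 0.7; 0.1];        % one example with 3 features (made up)

a1 = [1; x];                 % add bias unit a_0^(1) = 1
z2 = Theta1 * a1;            % z^(2) = Theta^(1) a^(1)
a2 = [1; sigmoid(z2)];       % activate, then add bias unit a_0^(2) = 1
z3 = Theta2 * a2;            % z^(3) = Theta^(2) a^(2)
h  = sigmoid(z3);            % h_Theta(x) = a^(3)
```

Note that the bias unit is prepended after applying $g()$, so the bias entry stays exactly 1.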

### Examples - From AND/OR to XNOR Operation
XNOR: gives 1 when $x_1=x_2$

An important reference is the curve of the Sigmoid Function
![W4-SIGMOID](Plots/W4-SIGMOID.png)
![W4-NN-OPT](Plots/W4-NN-OPT.png)
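
A small sketch of how AND, NOR and OR units can be combined into XNOR. The weight values below are an assumption (large-magnitude weights often used for this example); they are not stated in the text of these notes.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Assumed weights: each row is [bias weight, weight for x1, weight for x2]
Theta1 = [-30  20  20;    % hidden unit 1: x1 AND x2
           10 -20 -20];   % hidden unit 2: (NOT x1) AND (NOT x2)
Theta2 = [-10  20  20];   % output unit: a1 OR a2

X  = [0 0; 0 1; 1 0; 1 1];             % all four input combinations
A1 = [ones(4, 1) X];                   % add bias column
A2 = [ones(4, 1) sigmoid(A1 * Theta1')];
H  = sigmoid(A2 * Theta2');            % one output per input row

xnor = round(H);                       % [1; 0; 0; 1]
```

Because the weighted sums are far from 0, the sigmoid saturates and each unit behaves almost like a logic gate.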

### Multiclass Classification
As in ordinary logistic regression, we use the **One vs All** method
- The hypothesis output $h_{\Theta}(x)$ is a vector instead of a single number; the class with the largest probability is marked with 1

![W4-NN-MM](Plots/W4-NN-MM.png)
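
A minimal sketch of turning the output vector into a class label. The vector `h` is a made-up network output for a 4-class problem.

```octave
% Hypothetical output of a 4-class network for one example
h = [0.05; 0.10; 0.80; 0.05];

[~, k] = max(h);             % index of the most probable class
y_pred = zeros(size(h));
y_pred(k) = 1;               % one-hot vector, here [0; 0; 1; 0]
```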

### Backpropagation
#### Cost Function
##### Notation
- $L$: the total number of layers in the network
- $s_l$: the number of units (excluding the bias unit) in layer $l$
- $K$: the number of output units/classes

##### Cost of Weight Matrix
![W5-COST](Plots/W5-COST.png)
- Compared with regularized logistic regression, there is an extra summation to account for the multiple output nodes
    - It loops over the output nodes (for example, 1 to 10)
- $\Theta_{j,i}^{(l)}$: the weight that maps from node $i$ in layer $l$ to node $j$ in the next layer 
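
For reference, the cost in the figure is usually written as below. This is a transcription of the standard regularized form rather than a copy of the image; $m$ (the number of training examples) and $\lambda$ (the regularization strength) are symbols not defined elsewhere in these notes.

$$
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y^{(i)}_k \log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y^{(i)}_k\right) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta^{(l)}_{j,i}\right)^2
$$

The regularization sum starts at $i=1$, so the weights attached to the bias units are not penalized.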

#### Backpropagation Algorithm - Minimize Cost Function $J(\Theta)$
##### Reference
- [Backpropagation in CS231n](http://cs231n.github.io/optimization-2/)
- [J.G. Makin at UC Berkeley](https://inst.eecs.berkeley.edu/~cs182/sp06/notes/backprop.pdf)

![W4-NN-BP](Plots/W4-NN-BP.png)

- **Accumulate Losses**: Perform forward propagation to find $a^{(L)}$ (or $h_{\theta}(x)$) of one $x^{(i)}$, and compute the loss with the corresponding $y^{(i)}$
    - One pair at a time
- **$\delta^{(l)}_j = \frac{\partial}{\partial z^{(l)}_j} cost(i)$: "Error" of cost for $a^{(l)}_j$**
    - On the last layer, the cost can be thought of as roughly $cost^{(L)}_j \approx \frac{1}{2}[a^{(L)}_j - y_j]^2$; with the logistic (cross-entropy) cost and a sigmoid output, the derivative with respect to $z^{(L)}_j$ is exactly $\delta^{(L)}_j = a^{(L)}_j - y_j$
    - After finding the error of the output layer $\delta^{(L)}$, we can compute the $\delta^{(l)}$ of the hidden layers through back-propagation
        - The error of a node is gathered from all the nodes in the next layer that it is connected to, weighted by the weight matrix
        - Thus, the $\delta^{(l+1)}$ term in the formula below is not a single number, but **all the errors** in the next layer
![W5-DELTA](Plots/W5-DELTA.png)
    - The second part of the formula comes from $g'$, the derivative of the sigmoid
    ![W5-GPRIME](Plots/W5-GPRIME.png)
    - [Proof with the derivative of Sigmoid](http://mathworld.wolfram.com/SigmoidFunction.html): with $y(x) = \frac{1}{1+e^{-x}}$, $\frac{d}{dx} \left(\frac{1}{1+e^{-x}}\right) = \frac{e^{-x}}{(1+e^{-x})^2} = y(x)\,[1-y(x)]$

- Vectorize the accumulation of $\Delta$
![W5-DELTA2](Plots/W5-DELTA2.png)
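
The steps above can be sketched for a single training pair on a 3-3-1 network. The weights, `x` and `y` are made-up values, and the cost is assumed to be the unregularized logistic cost, so this is an illustration of the update rules rather than a complete implementation.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical 3-3-1 network and a single training pair (x, y)
Theta1 = rand(3, 4) - 0.5;
Theta2 = rand(1, 4) - 0.5;
x = [0.2; 0.7; 0.1];
y = 1;

% Forward propagation
a1 = [1; x];
a2 = [1; sigmoid(Theta1 * a1)];
a3 = sigmoid(Theta2 * a2);

% Errors, from the output layer backwards
delta3 = a3 - y;                                  % output layer
delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));  % g'(z) = a .* (1 - a)
delta2 = delta2(2:end);                           % drop the bias unit's entry

% Accumulate gradients (start from zero, add once per training example)
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
Delta1 = Delta1 + delta2 * a1';
Delta2 = Delta2 + delta3 * a2';
```

With $m$ examples, the accumulation lines would run once per example and the totals would then be divided by $m$.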


## Implementation
### Unrolling Parameters - Vectorization
Both the weight matrices $\Theta^{(l)}$ and the gradient matrices $\Delta^{(l)}$ are in matrix form, so we need to unroll/vectorize them into one long (column) vector
- When running forward propagation/backpropagation, the matrix form helps with the computation
- Advanced optimization methods (like `fminunc()`) require vector input

```octave
% Unrolling
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];
deltaVector = [D1(:); D2(:); D3(:)];

% Reshape back to the original form
% (assumes Theta1 and Theta2 are 10x11 and Theta3 is 1x11, 231 elements in total)
Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);
```

### Gradient Checking
**Ensure that backpropagation works correctly**
- Once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector
- Gradient checking is very **computationally expensive**, so disable it before training

Based on the two-sided difference formula
![W5-GRAD-MATH](Plots/W5-GRAD-MATH.png)

Generalize to the $\Theta$ matrix:
![W5-GR-CHECK](Plots/W5-GR-CHECK.png)



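```octave
% Octave/MATLAB code example
% Assumes theta is the unrolled parameter vector with n elements and
% J is a function handle returning the cost for a given theta
epsilon = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;   % "+=" works in Octave only
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
```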
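
One way to do the comparison is a relative difference between the two gradient vectors. The sketch below uses a hypothetical quadratic cost with a known gradient in place of the network's $J(\Theta)$, so `gradTrue` plays the role of `deltaVector`; the relative-difference measure is a common choice, not something prescribed by these notes.

```octave
% Hypothetical cost with a known gradient, standing in for the network's J
J        = @(t) sum(t .^ 2) / 2;
theta    = [0.5; -1.2; 2.0];
gradTrue = theta;                 % analytic gradient of J at theta

epsilon = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:numel(theta)
  step = zeros(size(theta));
  step(i) = epsilon;              % perturb one parameter at a time
  gradApprox(i) = (J(theta + step) - J(theta - step)) / (2 * epsilon);
end

% Relative difference; a very small value suggests the two gradients agree
relDiff = norm(gradApprox - gradTrue) / norm(gradApprox + gradTrue);
```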

### Random Initialization
Initializing all weights in $\Theta$ to the same value (such as 0) **does not work well for neural networks**. 
- During backpropagation, all nodes in a layer receive the same update, so they keep computing the same value
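
To see the symmetry problem concretely, the sketch below gives every weight of a hypothetical 2-2-1 network the same value (0.5, chosen arbitrarily) and computes the gradient for one made-up example. Both hidden units end up with identical activations and identical gradient rows.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Every weight starts at the same value (illustrative choice)
Theta1 = 0.5 * ones(2, 3);
Theta2 = 0.5 * ones(1, 3);
x = [0.3; 0.9];
y = 1;

% Forward propagation
a1 = [1; x];
a2 = [1; sigmoid(Theta1 * a1)];   % a2(2) and a2(3) are equal
a3 = sigmoid(Theta2 * a2);

% Backpropagation for this single example
delta3 = a3 - y;
delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));
delta2 = delta2(2:end);
grad1  = delta2 * a1';            % both rows are the same
```

Since the two rows of `grad1` match, a gradient step changes both hidden units in the same way and they never learn different features.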

**We need to initialize weights randomly and avoid symmetry**
![W5-RA-INIT](Plots/W5-RA-INIT.png)


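```octave
% Assumes Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.
INIT_EPSILON = 0.12;  % example value (an assumption); pick a small positive number
% rand() is uniform on (0, 1), so each weight falls in [-INIT_EPSILON, INIT_EPSILON]
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```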

## Guideline to Build up a Neural Network
### Pick a network architecture (Connectivity Pattern)
- Number of input units = dimension of the features $x^{(i)}$
- Number of output units = number of classes
- Number of hidden units per layer = usually the more the better
    - Must be balanced against the cost of computation, which increases with more hidden units
    - Default: 1 hidden layer
    - If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer

### Training a Neural Network
1. Randomly initialize the weights
2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
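
The steps above can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the XNOR data set, the 2-2-1 architecture, the values of `INIT_EPSILON`, `alpha` and `numIters`, and the use of plain batch gradient descent without regularization. Step 5 (gradient checking) is left out to keep the sketch short.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data: XNOR truth table (m = 4 examples, 2 features)
X = [0 0; 0 1; 1 0; 1 1];
y = [1; 0; 0; 1];
m = size(X, 1);

% Step 1: random initialization in [-INIT_EPSILON, INIT_EPSILON]
INIT_EPSILON = 0.5;                          % assumed value
Theta1 = rand(2, 3) * 2 * INIT_EPSILON - INIT_EPSILON;
Theta2 = rand(1, 3) * 2 * INIT_EPSILON - INIT_EPSILON;

alpha = 0.5;                                 % assumed learning rate
numIters = 2000;                             % assumed iteration count
costHistory = zeros(numIters, 1);

for iter = 1:numIters
  % Step 2: forward propagation for all examples at once
  A1 = [ones(m, 1) X];                       % m x 3
  A2 = [ones(m, 1) sigmoid(A1 * Theta1')];   % m x 3
  H  = sigmoid(A2 * Theta2');                % m x 1

  % Step 3: unregularized logistic cost
  costHistory(iter) = -mean(y .* log(H) + (1 - y) .* log(1 - H));

  % Step 4: backpropagation
  D3 = H - y;                                % m x 1
  D2 = (D3 * Theta2) .* (A2 .* (1 - A2));    % m x 3
  D2 = D2(:, 2:end);                         % m x 2, drop the bias column
  grad1 = (D2' * A1) / m;                    % 2 x 3
  grad2 = (D3' * A2) / m;                    % 1 x 3

  % Step 6: gradient descent update
  Theta1 = Theta1 - alpha * grad1;
  Theta2 = Theta2 - alpha * grad2;
end
```

Plotting `costHistory` is a simple way to check that the cost goes down; how well the network fits the data depends on the random starting point and the assumed settings.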