### 1. Basic structure of a neural network
- A simple example with only inputs, one hidden layer, and the output:
![1](images/5.1.1.png)

- Simple example with inputs, two hidden layers, and the output:
![2](images/5.1.2.png)

- For multi-class classification:
![3](images/5.1.3.png)

### 2. Model
- For the first hidden layer:
$$ a_{k}^{(1)} = g\left(\mathbf{w}_{k}^{(1)T} \mathbf{x} + b_{k}^{(1)}\right) $$
, where $k = 1,2,3,\dots,K$ ($K$ is the total number of neurons in the first layer), $\mathbf{x} = (x_{1}, x_{2}, \dots, x_{n})^T$ is the input vector ($n$ is the total number of features in the original dataset), and $\mathbf{w}_{k}^{(1)}$ and $b_{k}^{(1)}$ are the weight vector and bias of neuron $k$.
- For the other hidden layers:
$$ a_{k}^{(j)} = g\left(\mathbf{w}_{k}^{(j)T} \mathbf{a}^{(j-1)} + b_{k}^{(j)}\right) $$
, where $j = 2,3,\dots,J-1$ indicates which layer, $\mathbf{a}^{(j-1)}$ is the vector of activations of layer $j-1$, and layer $J$ is the output layer.
- For the last layer:
$$ h(\mathbf{x}) = g\left(\mathbf{w}^{(J)T} \mathbf{a}^{(J-1)} + b^{(J)}\right) $$
<p align="center">
<img src=images/5.1.4.png width="300" height="150" alt="5" align=centering>
- Note: $g(x)$ and $h(x)$ can be any of several functions:
  - (1) Sigmoid function: $ g(x) = \frac{1}{1 + e^{-x}} $;
  - (2) Hyperbolic tangent: $ g(x) = \tanh(x) $;
  - (3) Identity (linear): $ g(x) = x $;
  - (4) Sign (step) function, as used by the perceptron:
  $$ 
  g(x) = 
  \begin{cases}
  1,\quad x\geq 0\\
  -1, \quad x<0
  \end{cases}
  \tag{1}
  $$
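
As a rough illustration of the layer formulas above, the following sketch in plain Python computes each activation as $g$ applied to a weighted sum plus a bias, and chains the layers together. The helper names (`dense_layer`, `forward`), the toy weights, and the choice of tanh for the hidden layer and sigmoid for the output are assumptions made for this example only.

```python
import math


def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    """Linear activation: g(x) = x."""
    return x


def sign_step(x):
    """Perceptron activation: +1 for x >= 0, otherwise -1."""
    return 1.0 if x >= 0 else -1.0


def dense_layer(inputs, weights, biases, g):
    """Compute a_k = g(w_k . inputs + b_k) for every neuron k in one layer."""
    return [
        g(sum(w * a for w, a in zip(w_k, inputs)) + b_k)
        for w_k, b_k in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed x through a list of (weights, biases, activation) layers."""
    a = x
    for weights, biases, g in layers:
        a = dense_layer(a, weights, biases, g)
    return a


# Hypothetical toy network: 2 inputs -> 2 hidden neurons (tanh) -> 1 output (sigmoid).
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], math.tanh),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
output = forward([1.0, 1.0], layers)
```

Each layer is stored as one weight row per neuron, so adding a hidden layer only means appending another tuple to `layers`.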

### 3. Strategy - cost function
Just like many other machine learning models, the strategy to pick the optimal model is to minimize the cost function.
- Regression:
$$ C = \frac{1}{N} \sum_{i=1}^{N} (y_{i}-f(\mathbf{x}_{i}))^2 $$
- Classification:
$$ C = -\sum_{i=1}^{N} \left[ y_{i}\log f(\mathbf{x}_{i}) + (1-y_{i})\log(1-f(\mathbf{x}_{i})) \right] $$
, where $N$ is the number of training samples.
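
The two cost functions can be sketched in a few lines of plain Python. The function names and the clipping constant `eps` are assumptions of this example. The cross-entropy is written with a leading minus sign (the negative log-likelihood), which is the form that gets minimized.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - f(x_i))^2."""
    n = len(y_true)
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log-likelihood for labels in {0, 1} and predictions in (0, 1)."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        f = min(max(f, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(f) + (1 - y) * math.log(1 - f)
    return -total


regression_cost = mse([1.0, 2.0], [1.0, 4.0])
uncertain_cost = binary_cross_entropy([1, 0], [0.5, 0.5])
confident_cost = binary_cross_entropy([1, 0], [0.9, 0.1])
```

Predictions that are confident and correct give a smaller cross-entropy than uncertain ones, which is the behavior the minimization relies on.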

<p align="center">
<img src=images/5.1.5.png width="500" height="200" alt="5" align=centering>

<p align="center">
<img src=images/5.1.6.png width="500" height="200" alt="5" align=centering>

<p align="center">
<img src=images/5.1.7.png width="500" height="200" alt="5" align=centering>

### 4. Algorithm - gradient descent and backpropagation
- Use the gradient descent method:
  - Compute the negative gradient, $ - \nabla C(\mathbf{W})$;
  - Update the weights by adding a small multiple of the negative gradient (the step size is the learning rate).
- Interpret $ - \nabla C(\mathbf{W})$:
  - Sign: tells us the direction each weight should move (increase or decrease);
  - Magnitude: indicates which weights and biases matter most, that is, which nudges cause the fastest change in the cost function.
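
The update rule can be sketched on a toy cost function. The cost $C(w_1, w_2) = (w_1 - 3)^2 + (w_2 + 1)^2$, the learning rate `lr`, and the function names are hypothetical choices for illustration and are not part of the network described here.

```python
def grad_C(w):
    """Gradient of the toy cost C(w) = (w[0] - 3)^2 + (w[1] + 1)^2."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


def gradient_descent(w, lr=0.1, steps=100):
    """Repeatedly move the weights along the negative gradient."""
    for _ in range(steps):
        g = grad_C(w)
        # Add the negative gradient, scaled by the learning rate.
        w = [w_i + lr * (-g_i) for w_i, g_i in zip(w, g)]
    return w


w_opt = gradient_descent([0.0, 0.0])
```

Starting from the origin, the weights move toward the minimizer $(3, -1)$ of the toy cost. The sign of each gradient component decides whether that weight goes up or down, and its magnitude decides by how much.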

- How does one training sample affect (train) the weights?
  - Start from the last layer:
  <p align="center">
<img src=images/5.1.8.png width="500" height="300" alt="5" align=centering>
  - There are three ways to move the output closer to the label: change $\beta$, change $w$, or change $a_{k}^{(J-1)}$.
<p align="center">
<img src=images/5.1.9.png width="500" height="250" alt="5" align=centering>

Since we cannot change $a_{k}^{(J-1)}$ directly, we must trace back to the previous layers.
  - Consider other classes:
  <p align="center">
<img src=images/5.1.10.png width="400" height="250" alt="5" align=centering>
  - Averaging all the changes that we want to make gives the desired changes to $a_{k}^{(J-1)}$. Repeating the same process, we change the weights and $a_{k}^{(J-2)}$ to achieve the desired changes in $a_{k}^{(J-1)}$.
  <p align="center">
<img src=images/5.1.11.png width="100" height="200" alt="5" align=centering>

- How do all the training samples affect (train) the weights?
<p align="center">
<img src=images/5.1.12.png width="400" height="200" alt="5" align=centering>

These averages are proportional to $ - \nabla C(\mathbf{W})$, so that we have, for example,
$$
- \nabla C(\mathbf{W}) =
\left[
\begin{matrix}
-0.08\\
+0.12\\
-0.06\\
\vdots\\
+0.04
\end{matrix}
\right]
$$

- Mini-batch (stochastic) gradient descent randomly divides the samples into small groups and trains the neural network on one group at a time, which makes each gradient calculation much faster.
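
A minimal sketch of the mini-batch idea, using a one-variable linear model $f(x) = wx + b$ instead of a full network so that the gradient stays short. The data, batch size, learning rate, and helper names are all assumptions of this example.

```python
import random


def minibatches(samples, batch_size, rng):
    """Shuffle the samples and yield them in small groups."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


def batch_gradient(w, b, batch):
    """Gradient of the mean squared error of f(x) = w*x + b over one mini-batch."""
    m = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / m
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / m
    return grad_w, grad_b


rng = random.Random(0)
# Hypothetical noise-free data generated from y = 2x + 1.
samples = [(k / 10, 2 * (k / 10) + 1) for k in range(-20, 21)]

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(200):
    for batch in minibatches(samples, batch_size=8, rng=rng):
        grad_w, grad_b = batch_gradient(w, b, batch)
        # Each update uses only the gradient estimated from one small group.
        w -= lr * grad_w
        b -= lr * grad_b
```

Each update touches only 8 samples rather than all 41, so a single step is cheaper. The price is that the gradient is only an estimate of the full-data gradient.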

### 5. Detailed calculation behind the algorithm
- Think about a simple example first:
  - We want to know how changes in the weight impact the cost function:
  <p align="center">
<img src=images/5.1.13.png width="400" height="250" alt="5" align=centering>
  - With $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$, $a^{(L)} = \sigma(z^{(L)})$, and $C_{0} = (a^{(L)} - y)^2$, the chain rule gives:
  $$ \frac{\partial C_{0}}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_{0}}{\partial a^{(L)}} $$
  - Calculate the derivatives:
  $$ \frac{\partial C_{0}}{\partial a^{(L)}} = 2(a^{(L)} - y) $$
  $$ \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma^{'} (z^{(L)}) $$
  $$ \frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}$$
  therefore,
  $$ \frac{\partial C_{0}}{\partial w^{(L)}} = 2(a^{(L)} - y)\sigma^{'} (z^{(L)})a^{(L-1)}$$
  similarly,
  $$ \frac{\partial C_{0}}{\partial b^{(L)}} = 2(a^{(L)} - y) \sigma^{'} (z^{(L)})$$
  $$ \frac{\partial C_{0}}{\partial a^{(L-1)}} = 2(a^{(L)} - y) \sigma^{'} (z^{(L)})w^{(L)}$$
  - For all the training examples, we have:
  $$ \frac{\partial C}{\partial w^{(L)}} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_{k}}{\partial w^{(L)}} $$
  - Overall, the gradient is:
$$
\nabla C(\mathbf{W}) =
\left[
\begin{matrix}
\frac{\partial C}{\partial w^{(1)}}\\
\frac{\partial C}{\partial b^{(1)}}\\
\vdots\\
\frac{\partial C}{\partial w^{(L)}}\\
\frac{\partial C}{\partial b^{(L)}}
\end{matrix}
\right]
$$
<p align="center">
<img src=images/5.1.14.png width="400" height="200" alt="5" align=centering>
<p align="center">
<img src=images/5.1.15.png width="400" height="200" alt="5" align=centering>
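
The chain-rule formulas above can be checked numerically on a single-neuron chain. This sketch assumes $\sigma$ is the sigmoid and $C_{0} = (a^{(L)} - y)^2$, which matches the derivative $2(a^{(L)} - y)$ used in this section. The variable names and the sample values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, a_prev, y):
    """C_0 = (sigma(w * a_prev + b) - y)^2 for a single training example."""
    return (sigmoid(w * a_prev + b) - y) ** 2


def gradients(w, b, a_prev, y):
    """Analytic derivatives of C_0 obtained from the chain rule."""
    z = w * a_prev + b
    a = sigmoid(z)
    dC_da = 2 * (a - y)       # dC_0 / da
    da_dz = a * (1 - a)       # sigma'(z) for the sigmoid
    delta = dC_da * da_dz     # dC_0 / dz
    # Returns dC_0/dw, dC_0/db, dC_0/da_prev.
    return delta * a_prev, delta, delta * w


w, b, a_prev, y = 0.7, -0.3, 0.5, 1.0
dw, db, da_prev = gradients(w, b, a_prev, y)

# Central finite differences as an independent check of each derivative.
h = 1e-6
num_dw = (cost(w + h, b, a_prev, y) - cost(w - h, b, a_prev, y)) / (2 * h)
num_db = (cost(w, b + h, a_prev, y) - cost(w, b - h, a_prev, y)) / (2 * h)
num_da_prev = (cost(w, b, a_prev + h, y) - cost(w, b, a_prev - h, y)) / (2 * h)
```

The analytic and finite-difference values should agree closely. Comparing the two is a common way to catch mistakes in a hand-written backpropagation.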

### 6. Perceptron