#### New Representations in Multi-layer Perceptron
<font size = 2>
    
Now we look at a **Multi-layer Perceptron** with multiple outputs and multiple hidden layers, still using **Sigmoid** as the activation function.
    
<div>
<img src="Backpropagation_1.png" style="zoom:40%"/>
</div>
    
In each symbol, the uppercase superscript denotes the **layer** and the subscript still denotes the node index. A weight carries two subscripts, which identify the previous-layer node and the current-layer node that it connects.
    
Gray is input layer **$I$**. Blue is hidden layer **$J$**. Orange is output layer **$K$**.
    
For example: 
    
**$w^{K}_{jk}$** means the weight in current layer **$K$** connecting the **j-th** node in previous layer **$J$** and the **k-th** node in current layer **$K$**.

**$O^{J}_{j}$** means the output of the **j-th** node in previous layer **$J$**.
    
**$O^{K}_{k}$** means the output of the **k-th** node in current layer **$K$**.
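To make the indexing concrete, here is a minimal sketch of one layer's forward pass in plain Python. It assumes no bias terms (the formulas in this section do not use any), and the layer sizes and numeric values are made up for illustration. `weights[j][k]` plays the role of $w^{K}_{jk}$ and `prev_outputs[j]` plays the role of $O^{J}_{j}$.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, the activation used throughout this section."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(prev_outputs, weights):
    """Compute the outputs O^K_k of one layer.

    prev_outputs[j] stands for O^J_j.
    weights[j][k] stands for w^K_{jk}: it connects node j of the
    previous layer to node k of the current layer.
    """
    n_current = len(weights[0])
    outputs = []
    for k in range(n_current):
        z = sum(prev_outputs[j] * weights[j][k] for j in range(len(prev_outputs)))
        outputs.append(sigmoid(z))
    return outputs

# Hypothetical sizes: 2 nodes in layer J, 3 nodes in layer K.
O_J = [0.5, 0.9]
w_K = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]
O_K = layer_forward(O_J, w_K)
print(O_K)
```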

#### Output Layer
<font size = 2>
    
Using the conclusion from the **Multiple Perceptron** example, we can simplify the result for the single-layer, multi-output perceptron:

$$ \frac{\partial{E}}{\partial{w^{1}_{jk}}} = (O^{1}_{k} - t_{k}) O^{1}_{k} (1 - O^{1}_{k}) x^{0}_{j} $$
    
with:

$$\delta^{K}_{k} = (O^{1}_{k} - t_{k}) O^{1}_{k} (1 - O^{1}_{k})$$
    
into:
    
$$ \frac{\partial{E}}{\partial{w^{1}_{jk}}} = \delta^{K}_{k} x^{0}_{j} $$
    
Now we consider the situation where the current layer is layer **$K$**, the previous layer is layer **$J$**. Substitute the input **$x^{0}_{j}$** with the output from previous hidden layer **$J$**, which is **$O^{J}_{j}$**, and substitute the 1st-layer weight **$w^{1}_{jk}$** with current weight **$w^{K}_{jk}$**:
    
 $$\frac{\partial{E}}{\partial{w^{K}_{jk}}} = \delta^{K}_{k} O^{J}_{j}$$
    
**$w^{K}_{jk}$** is the weight that links the previous layer **$J$** and the current layer **$K$**.

**$\delta^{K}_{k}$** carries the information from the current layer **$K$** through to the end of the network, i.e. the **Error**. 

**$O^{J}_{j}$** is the output of the previous layer **$J$**, which is taken as the input of the current layer **$K$**.

So to calculate **$\frac{\partial{E}}{\partial{w^{K}_{jk}}}$**, we only need the output of the previous layer **$O^{J}_{j}$** and the term **$\delta^{K}_{k}$**, which can be computed iteratively, layer by layer.
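The output-layer rule can be checked numerically. The sketch below assumes the squared error $E = \frac{1}{2}\sum_k (O^{K}_{k} - t_{k})^2$ (an assumption, since the definition of $E$ is given elsewhere), uses the sigmoid derivative $O(1 - O)$, and compares $\delta^{K}_{k} O^{J}_{j}$ against a central finite difference. All numeric values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(O_J, w_K):
    """Outputs O^K_k of the output layer, given previous-layer outputs O^J_j."""
    return [sigmoid(sum(O_J[j] * w_K[j][k] for j in range(len(O_J))))
            for k in range(len(w_K[0]))]

def error(O_K, t):
    """Assumed error: E = 1/2 * sum_k (O^K_k - t_k)^2."""
    return 0.5 * sum((o - tk) ** 2 for o, tk in zip(O_K, t))

def output_layer_grad(O_J, w_K, t):
    """Return (dE/dw^K_{jk} as a nested list, delta^K as a list)."""
    O_K = forward(O_J, w_K)
    delta_K = [(o - tk) * o * (1.0 - o) for o, tk in zip(O_K, t)]
    grad = [[delta_K[k] * O_J[j] for k in range(len(delta_K))]
            for j in range(len(O_J))]
    return grad, delta_K

# Illustrative values only.
O_J = [0.5, 0.9]
w_K = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]
t = [1.0, 0.0, 1.0]
grad, delta_K = output_layer_grad(O_J, w_K, t)

# Central finite difference on the single weight w^K_{12} (j=1, k=2).
eps = 1e-6
w_plus = [row[:] for row in w_K]
w_minus = [row[:] for row in w_K]
w_plus[1][2] += eps
w_minus[1][2] -= eps
numeric = (error(forward(O_J, w_plus), t) - error(forward(O_J, w_minus), t)) / (2 * eps)
print(grad[1][2], numeric)
```

The two printed numbers should agree to several decimal places, which supports reading $\delta^{K}_{k}$ as everything "downstream" of the weight.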

#### Hidden layer
<font size = 2>
    
Recall the graph of the multi-layer network.

<div>
<img src="Backpropagation_1.png" style="zoom:40%"/>
</div>
    
Now we want to calculate the gradient of **Error** w.r.t. the weight **$w^{J}_{ij}$**. Applying backpropagation and the chain rule gives the result (derivation omitted):
    
$$ \frac{\partial{E}}{\partial{w^{J}_{ij}}} = O^{I}_{i} O^{J}_{j} (1 - O^{J}_{j}) \sum^{m}_{k = 0} (\delta^{K}_{k} w^{K}_{jk}) $$
    
$O^{I}_{i}$ is the output of the **i-th** node of layer **$I$**, the layer preceding **$J$**.

$O^{J}_{j}$ is the output of the **j-th** node of hidden layer **$J$**.
    
$\sum^{m}_{k = 0} (\delta^{K}_{k} w^{K}_{jk})$ is $\frac{\partial{E}}{\partial{O^{J}_{j}}}$, the gradient of **Error** w.r.t. the output of node **j**. It collects the $\delta^{K}_{k}$ of every node in the next layer, each weighted by the connection $w^{K}_{jk}$.
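The omitted derivation can be sketched as follows. It assumes the squared error $E = \frac{1}{2}\sum_k (O^{K}_{k} - t_{k})^2$ and introduces the pre-activations $z^{J}_{j} = \sum_i O^{I}_{i} w^{J}_{ij}$ and $z^{K}_{k} = \sum_j O^{J}_{j} w^{K}_{jk}$ (names chosen here for illustration), so that $O = \sigma(z)$ and $\sigma'(z) = O(1 - O)$:

$$
\begin{aligned}
\frac{\partial{E}}{\partial{w^{J}_{ij}}}
&= \sum^{m}_{k = 0} \frac{\partial{E}}{\partial{O^{K}_{k}}} \frac{\partial{O^{K}_{k}}}{\partial{z^{K}_{k}}} \frac{\partial{z^{K}_{k}}}{\partial{O^{J}_{j}}} \frac{\partial{O^{J}_{j}}}{\partial{z^{J}_{j}}} \frac{\partial{z^{J}_{j}}}{\partial{w^{J}_{ij}}} \\
&= \sum^{m}_{k = 0} (O^{K}_{k} - t_{k}) \, O^{K}_{k} (1 - O^{K}_{k}) \, w^{K}_{jk} \, O^{J}_{j} (1 - O^{J}_{j}) \, O^{I}_{i} \\
&= O^{I}_{i} O^{J}_{j} (1 - O^{J}_{j}) \sum^{m}_{k = 0} (\delta^{K}_{k} w^{K}_{jk})
\end{aligned}
$$

The sum over $k$ appears because node $j$ feeds every node of layer $K$, so its output influences the error along all of those paths.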
    
Substituting, we get a form similar to the output layer:

$$ \delta^{J}_{j} = O^{J}_{j} (1 - O^{J}_{j}) \sum^{m}_{k = 0} (\delta^{K}_{k} w^{K}_{jk}) $$
    
$$\frac{\partial{E}}{\partial{w^{J}_{ij}}} = \delta^{J}_{j} O^{I}_{i} $$
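As a sanity check on the hidden-layer rule, the sketch below builds a tiny network $I \to J \to K$ in plain Python, computes $\delta^{J}_{j} O^{I}_{i}$, and compares it with a central finite difference. It assumes the squared error $E = \frac{1}{2}\sum_k (O^{K}_{k} - t_{k})^2$ and no bias terms; sizes and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(prev, w):
    """Sigmoid layer: w[a][b] connects previous node a to current node b."""
    return [sigmoid(sum(prev[a] * w[a][b] for a in range(len(prev))))
            for b in range(len(w[0]))]

def error(O_I, w_J, w_K, t):
    """Assumed error E = 1/2 * sum_k (O^K_k - t_k)^2 for the net I -> J -> K."""
    O_K = layer(layer(O_I, w_J), w_K)
    return 0.5 * sum((o - tk) ** 2 for o, tk in zip(O_K, t))

def hidden_layer_grad(O_I, w_J, w_K, t):
    """Return dE/dw^J_{ij} as a nested list indexed [i][j]."""
    O_J = layer(O_I, w_J)
    O_K = layer(O_J, w_K)
    # Output-layer deltas, then propagate them back through w^K.
    delta_K = [(O_K[k] - t[k]) * O_K[k] * (1.0 - O_K[k]) for k in range(len(O_K))]
    delta_J = [O_J[j] * (1.0 - O_J[j])
               * sum(delta_K[k] * w_K[j][k] for k in range(len(delta_K)))
               for j in range(len(O_J))]
    return [[delta_J[j] * O_I[i] for j in range(len(delta_J))]
            for i in range(len(O_I))]

# Illustrative values: 3 nodes in I, 2 in J, 2 in K.
O_I = [0.2, 0.7, 0.4]
w_J = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.6]]
w_K = [[0.3, -0.1], [0.2, 0.7]]
t = [1.0, 0.0]
grad_J = hidden_layer_grad(O_I, w_J, w_K, t)

# Central finite difference on w^J_{01} (i=0, j=1).
eps = 1e-6
w_plus = [row[:] for row in w_J]
w_minus = [row[:] for row in w_J]
w_plus[0][1] += eps
w_minus[0][1] -= eps
numeric = (error(O_I, w_plus, w_K, t) - error(O_I, w_minus, w_K, t)) / (2 * eps)
print(grad_J[0][1], numeric)
```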

#### Conclusion
<font size = 2>
    
In conclusion, we can calculate the gradient of **Error** w.r.t. any weight in any layer with the following rules:

(1) For the weight $w^{K}_{jk}$ of an output layer $K$ node k:

$$\frac{\partial{E}}{\partial{w^{K}_{jk}}} = \delta^{K}_{k} O^{J}_{j}$$
    
with:
    
$$\delta^{K}_{k} = (O^{K}_{k} - t_{k}) O^{K}_{k} (1 - O^{K}_{k})$$
    
(2) For the weight $w^{J}_{ij}$ of a hidden layer $J$ node j:
    
$$\frac{\partial{E}}{\partial{w^{J}_{ij}}} = \delta^{J}_{j} O^{I}_{i} $$

with:  
  
$$ \delta^{J}_{j} = O^{J}_{j} (1 - O^{J}_{j}) \sum^{m}_{k = 0} (\delta^{K}_{k} w^{K}_{jk}) $$
    
(3) For more hidden layers, the process can be iterated. For example: for the weight $w^{I}_{hi}$ of a hidden layer $I$ node i:
    
$$\frac{\partial{E}}{\partial{w^{I}_{hi}}} = \delta^{I}_{i} O^{H}_{h} $$

with:  
  
$$ \delta^{I}_{i} = O^{I}_{i} (1 - O^{I}_{i}) \sum^{n}_{j = 0} (\delta^{J}_{j} w^{J}_{ij}) $$
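The three rules amount to one loop that walks the layers from the output back to the input, reusing the $\delta$ of the layer it just left. The following is a minimal sketch in plain Python, assuming the squared error $E = \frac{1}{2}\sum_k (O^{K}_{k} - t_{k})^2$, sigmoid activations everywhere, and no bias terms. The function names and the example network $H \to I \to J \to K$ are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(prev, w):
    """Sigmoid layer: w[a][b] connects previous node a to current node b."""
    return [sigmoid(sum(prev[a] * w[a][b] for a in range(len(prev))))
            for b in range(len(w[0]))]

def total_error(x, weights, t):
    """Assumed error E = 1/2 * sum_k (O_k - t_k)^2 at the last layer."""
    out = x
    for w in weights:
        out = layer(out, w)
    return 0.5 * sum((o - tk) ** 2 for o, tk in zip(out, t))

def backprop(x, weights, t):
    """Return one gradient matrix per weight matrix, in the same order."""
    # Forward pass, keeping every layer's outputs.
    outputs = [x]
    for w in weights:
        outputs.append(layer(outputs[-1], w))

    # Rule (1): delta of the output layer.
    out = outputs[-1]
    delta = [(o - tk) * o * (1.0 - o) for o, tk in zip(out, t)]

    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        prev = outputs[l]  # outputs of the layer feeding weights[l]
        # Gradient = delta of current layer times output of previous layer.
        grads[l] = [[delta[b] * prev[a] for b in range(len(delta))]
                    for a in range(len(prev))]
        if l > 0:
            # Rules (2) and (3): push delta one layer back.
            w = weights[l]
            delta = [prev[a] * (1.0 - prev[a])
                     * sum(delta[b] * w[a][b] for b in range(len(delta)))
                     for a in range(len(prev))]
    return grads

# Illustrative network H -> I -> J -> K with sizes 2, 3, 2, 2.
x = [0.3, 0.8]
weights = [
    [[0.1, -0.2, 0.4], [0.5, 0.3, -0.6]],      # w^I, shape 2 x 3
    [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]],    # w^J, shape 3 x 2
    [[0.6, -0.1], [0.2, 0.9]],                 # w^K, shape 2 x 2
]
t = [1.0, 0.0]
grads = backprop(x, weights, t)

# Finite-difference check on one first-layer weight, w^I_{10}.
eps = 1e-6
plus = [[row[:] for row in w] for w in weights]
minus = [[row[:] for row in w] for w in weights]
plus[0][1][0] += eps
minus[0][1][0] -= eps
numeric = (total_error(x, plus, t) - total_error(x, minus, t)) / (2 * eps)
print(grads[0][1][0], numeric)
```

Only the current `delta` is kept between iterations, which is why the cost of the backward pass is comparable to that of the forward pass.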


#### Backpropagation with Chain-Rule

In [6]:
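import torch

# Scalar input and parameters; requires_grad=True marks the tensors to differentiate against
x = torch.tensor(1.)
w1 = torch.tensor(2., requires_grad=True)
b1 = torch.tensor(1., requires_grad=True)
w2 = torch.tensor(2., requires_grad=True)
b2 = torch.tensor(1., requires_grad=True)

# Two chained linear maps: y = (x * w1 + b1) * w2 + b2
y = (x * w1 + b1) * w2 + b2

# autograd applies the chain rule and returns dy/d(param) for each listed parameter
grad_w1, grad_b1, grad_w2, grad_b2 = torch.autograd.grad(y, [w1, b1, w2, b2])
print(grad_w1, grad_b1, grad_w2, grad_b2)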

tensor(2.) tensor(2.) tensor(3.) tensor(1.)


The gradients above can be derived by hand with the chain rule:

grad_w1:

$$
\begin{equation}
\begin{aligned}
\frac{\partial{y}}{\partial{w_{1}}} &= \frac{\partial{[(xw_{1} + b_{1})w_{2} + b_{2}}]}{\partial{w_{1}}} \\
&= \frac{\partial{[(xw_{1} + b_{1})w_{2} + b_{2}}]}{\partial{(xw_{1} + b_{1})}} \frac{\partial{(xw_{1} + b_{1})}}{\partial{w_{1}}} \\
&= w_{2}x
\end{aligned}
\end{equation}
$$

grad_b1:

$$
\begin{equation}
\begin{aligned}
\frac{\partial{y}}{\partial{b_{1}}} &= \frac{\partial{[(xw_{1} + b_{1})w_{2} + b_{2}}]}{\partial{b_{1}}} \\
&= \frac{\partial{[(xw_{1} + b_{1})w_{2} + b_{2}}]}{\partial{(xw_{1} + b_{1})}} \frac{\partial{(xw_{1} + b_{1})}}{\partial{b_{1}}} \\
&= w_{2} \cdot 1
\end{aligned}
\end{equation}
$$

grad_w2:

$$
\begin{equation}
\begin{aligned}
\frac{\partial{y}}{\partial{w_{2}}} &= \frac{\partial{[(xw_{1} + b_{1})w_{2} + b_{2}}]}{\partial{w_{2}}} \\
&= xw_{1} + b_{1} \\
\end{aligned}
\end{equation}
$$

grad_b2:

$$
\begin{equation}
\begin{aligned}
\frac{\partial{y}}{\partial{b_{2}}} &= \frac{\partial{[(xw_{1} + b_{1})w_{2} + b_{2}}]}{\partial{b_{2}}} \\
&= 1 \\
\end{aligned}
\end{equation}
$$
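The four closed forms can also be evaluated directly, without autograd. The sketch below is plain Python with the same values as the cell above; the helper names `f` and `analytic_grads` are made up for illustration. It also cross-checks each result with a central finite difference.

```python
def f(x, w1, b1, w2, b2):
    """Same expression as in the autograd example."""
    return (x * w1 + b1) * w2 + b2

def analytic_grads(x, w1, b1, w2, b2):
    """Hand-derived gradients (dy/dw1, dy/db1, dy/dw2, dy/db2)."""
    h = x * w1 + b1      # inner expression
    dy_dh = w2           # outer derivative, shared by w1 and b1
    return (dy_dh * x,   # dy/dw1 = w2 * x
            dy_dh * 1.0, # dy/db1 = w2 * 1
            h,           # dy/dw2 = x * w1 + b1
            1.0)         # dy/db2 = 1

args = [1.0, 2.0, 1.0, 2.0, 1.0]  # x, w1, b1, w2, b2
grads = analytic_grads(*args)
print(grads)  # (2.0, 2.0, 3.0, 1.0)

# Central finite differences over the four parameters (positions 1..4).
eps = 1e-6
numeric = []
for p in range(1, 5):
    plus, minus = args[:], args[:]
    plus[p] += eps
    minus[p] -= eps
    numeric.append((f(*plus) - f(*minus)) / (2 * eps))
print(numeric)
```

The analytic tuple matches the `tensor(2.) tensor(2.) tensor(3.) tensor(1.)` output printed by `torch.autograd.grad`.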