In [2]:
'''Multiple Perceptron'''
import torch
import torch.nn.functional as F

#### Multiple Perceptron
<font size = 2>
    
A **linear single-layer perceptron** computes the weighted sum of the inputs **$x$** with the weights **$w$** in a single layer. With several output nodes, each node has its own set of weights.
    
<div>
<img src="MultiPerceptron_1.png" style="zoom:40%"/>
</div>
    
The notation follows a few conventions. The superscript is the layer index and the subscript is the node index within that layer, e.g. **$x^{0}_{1}$** is node 1 of layer 0. Weights carry two subscript indices: the first is the index of the connected node in the previous layer, and the second is the index of the connected node in the current layer. E.g. **$w^{1}_{10}$** is the layer-1 weight connecting node 1 of the previous layer to node 0 of the current layer. The output after the activation function is written **$O^{1}_{0}$**, following the same convention. **$\sigma$** is the sigmoid activation, and **$E$** is the total error between the outputs **$O^{1}_{k}$** and the labels **$t_{k}$**, here the sum of squared errors (the MSE up to a constant factor):

$$O^{1}_{k} =\sigma \left( \sum^{n}_{j = 0} {w^{1}_{jk} x^{0}_{j}} \right)$$
$$E = \sum^{m}_{k = 0} \frac{1}{2} (O^{1}_{k} - t_{k}) ^ 2$$
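The two formulas above can be sketched directly in PyTorch. This is a minimal illustration, not part of the original notebook: the sizes, the fixed seed, and the names `x0`, `w1`, `o_loop`, `o1`, and `error` are assumptions chosen to mirror the symbols **$x^{0}_{j}$**, **$w^{1}_{jk}$**, **$O^{1}_{k}$**, and **$E$**.

```python
import torch

torch.manual_seed(0)  # fixed seed, only so the sketch is reproducible

n_inputs, n_outputs = 5, 3               # hypothetical layer sizes
x0 = torch.randn(n_inputs)               # x^0_j: input nodes of layer 0
w1 = torch.randn(n_inputs, n_outputs)    # w^1_{jk}: row j = input node, column k = output node
t = torch.ones(n_outputs)                # t_k: labels

# O^1_k = sigma(sum_j w^1_{jk} x^0_j), written out node by node
o_loop = torch.stack(
    [torch.sigmoid((w1[:, k] * x0).sum()) for k in range(n_outputs)]
)

# The same computation as a single vector-matrix product
o1 = torch.sigmoid(x0 @ w1)

# E = sum_k 1/2 (O^1_k - t_k)^2
error = 0.5 * ((o1 - t) ** 2).sum()

print(o_loop, o1, error)
```

The loop and the matrix product give the same outputs; the matrix form is what the code cell later in this notebook uses, with the weight matrix stored transposed.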

#### Gradient Descent Update
<font size = 2>
    
The weights of a neural network are updated by gradient descent:

$$w^{(\tau + 1)} = w^{(\tau)} - \eta \frac{\partial{E}}{\partial{w^{(\tau)}}}$$
    
where **$w^{(\tau)}$** is the current weight, **$w^{(\tau + 1)}$** is the updated weight, **$E$** is the error, and **$\eta$** is the learning rate.
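A single update step of this rule might look as follows. This is a sketch under stated assumptions: the learning rate `eta = 0.1`, the tensor sizes, and the helper `error_fn` are hypothetical and only serve to illustrate the formula; the gradient itself is obtained from autograd.

```python
import torch

torch.manual_seed(0)  # fixed seed, only so the sketch is reproducible

x = torch.randn(5)                         # inputs (hypothetical size)
t = torch.ones(3)                          # labels
w = torch.randn(5, 3, requires_grad=True)  # w^(tau): current weights
eta = 0.1                                  # assumed learning rate


def error_fn(weights):
    """E = sum_k 1/2 (O_k - t_k)^2 for a sigmoid output layer."""
    o = torch.sigmoid(x @ weights)
    return 0.5 * ((o - t) ** 2).sum()


e_before = error_fn(w)
e_before.backward()            # fills w.grad with dE/dw at w^(tau)

with torch.no_grad():          # the update itself must not be tracked by autograd
    w -= eta * w.grad          # w^(tau+1) = w^(tau) - eta * dE/dw
w.grad.zero_()                 # clear the gradient before any further step

e_after = error_fn(w)
print(e_before.item(), e_after.item())
```

With a small enough learning rate the error after the step is lower than before it; repeating the step is what training amounts to.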

#### Derivative of Multiple Perceptron
<font size = 2>

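To apply the update, we first need the gradient of **$E$** with respect to the current weight **$w^{(\tau)}$**. Writing **$x^{1}_{k} = \sum_{j} w^{1}_{jk} x^{0}_{j}$** for the weighted input of node $k$, so that **$O^{1}_{k} = \sigma(x^{1}_{k})$**, the chain rule gives: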

$$
\begin{equation}
\begin{aligned}
\frac{\partial{E}}{\partial{w^{1}_{jk}}} &= \frac{\partial{E}}{\partial{O^{1}_{k}}} \frac{\partial{O^{1}_{k}}}{\partial{w^{1}_{jk}}}\\
&= \frac{\partial{\left( \sum^{m}_{k = 0} \frac{1}{2} (O^{1}_{k} - t_{k}) ^ 2 \right)}}{\partial{O^{1}_{k}}} \frac{\partial{O^{1}_{k}}}{\partial{w^{1}_{jk}}} \\
&= (O^{1}_{k} - t_{k}) \frac{\partial{O^{1}_{k}}}{\partial{w^{1}_{jk}}} \\
&= (O^{1}_{k} - t_{k}) \frac{\partial{\sigma (x^{1}_{k})}}{\partial{x^{1}_{k}}} \frac{\partial{x^{1}_{k}}}{\partial{w^{1}_{jk}}} \\
&= (O^{1}_{k} - t_{k}) O^{1}_{k} (1 - O^{1}_{k}) \frac{\partial{x^{1}_{k}}}{\partial{w^{1}_{jk}}} \\
&= (O^{1}_{k} - t_{k}) O^{1}_{k} (1 - O^{1}_{k}) \frac{\partial{\sum x^{0}_{j} w^{1}_{jk}}}{\partial{w^{1}_{jk}}} \\
&= (O^{1}_{k} - t_{k}) O^{1}_{k} (1 - O^{1}_{k}) x^{0}_{j}
\end{aligned}
\end{equation}
$$

Recap:
$$\frac{\partial{\sigma (x)}}{\partial{x}} = \sigma (x) (1 - \sigma (x))$$

As the derivation shows, the gradient with respect to a specific weight **$w^{1}_{jk}$** depends only on the two nodes it connects: the **j-node** input **$x^{0}_{j}$** of the previous layer and the **k-node** factor **$(O^{1}_{k} - t_{k}) O^{1}_{k} (1 - O^{1}_{k})$** of the current layer. Of the summed **Error**, only the term **$\frac{1}{2} (O^{1}_{k} - t_{k}) ^ 2$** with label **$t_{k}$** contributes.
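The closed-form result can be checked against autograd. The sketch below assumes the sigmoid derivative **$\sigma (x) (1 - \sigma (x))$** from the recap, so the k-node factor is $(O^{1}_{k} - t_{k}) O^{1}_{k} (1 - O^{1}_{k})$. The names `x0`, `w1`, `delta`, and `manual_grad` and the tensor sizes are illustrative assumptions, not part of the original notebook.

```python
import torch

torch.manual_seed(0)  # fixed seed, only so the sketch is reproducible

x0 = torch.randn(5)                          # x^0_j
t = torch.ones(3)                            # t_k
w1 = torch.randn(5, 3, requires_grad=True)   # w^1_{jk}: row j, column k

# Forward pass and error E = sum_k 1/2 (O_k - t_k)^2
o1 = torch.sigmoid(x0 @ w1)
error = 0.5 * ((o1 - t) ** 2).sum()
error.backward()                             # autograd gradient lands in w1.grad

with torch.no_grad():
    # k-node factor: (O_k - t_k) * O_k * (1 - O_k), shape [3]
    delta = (o1 - t) * o1 * (1 - o1)
    # dE/dw_{jk} = x^0_j * delta_k, i.e. an outer product of shape [5, 3]
    manual_grad = torch.outer(x0, delta)

print(torch.allclose(manual_grad, w1.grad, atol=1e-6))
```

The outer product makes the locality visible: entry $(j, k)$ of the gradient uses only input $j$ and the factor of output node $k$.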

In [10]:
# input x: 2 samples, each with 5 features
# e.g. a person with features: age, weight, height, running speed, eyesight
x = torch.randn(2, 5)
# weight w: 3 neurons, each with 5 weights to match the 5 input features
# we want the gradient w.r.t. w, so set 'requires_grad' to True
w = torch.randn(3, 5, requires_grad=True)
# output o: x @ w.t() = [2,5] @ [5,3] = [2,3]
o = torch.sigmoid(x @ w.t())
print('Output is:', o, o.type(), o.shape)
# target t: 3 values per sample
# e.g. probabilities of each person's performance being good / medium / bad
t = torch.ones(2, 3)
# build the error/loss
# F.mse_loss takes (input, target) and by default averages the squared errors
# over all elements, so e is a scalar, i.e. a single objective function
e = F.mse_loss(o, t)
# differentiate with .backward()
# because e is a scalar, we don't need to pass the 'gradient' argument
# (the weighting/selector that is required for non-scalar outputs)
e.backward(retain_graph=True)
grad = w.grad
print('The gradients of Error w.r.t weights are:', grad)

Output is: tensor([[0.1087, 0.9168, 0.2259],
        [0.1224, 0.1651, 0.0901]], grad_fn=<SigmoidBackward0>) torch.FloatTensor torch.Size([2, 3])
The gradients of Error w.r.t weights are: tensor([[-0.0924,  0.0375,  0.0431,  0.0051,  0.0153],
        [-0.0562,  0.0503,  0.0045,  0.0157, -0.0096],
        [-0.1114,  0.0267,  0.0667, -0.0024,  0.0313]])
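Note that `F.mse_loss` uses `reduction='mean'` by default, so it averages the squared errors over all elements instead of summing them with a factor of $\frac{1}{2}$ as in **$E$**. The gradients therefore differ from the derived formula by a constant factor of $2/N$, where $N$ is the number of output elements. The sketch below checks this relation; the seed and the names `delta`, `grad_sum_half`, and `manual` are assumptions for illustration, and the printed values of the cell above are not reproduced here because that cell used unseeded random inputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed, only so the sketch is reproducible

x = torch.randn(2, 5)                        # 2 samples, 5 features
w = torch.randn(3, 5, requires_grad=True)    # 3 neurons, 5 weights each
t = torch.ones(2, 3)                         # targets

o = torch.sigmoid(x @ w.t())                 # [2, 3]
F.mse_loss(o, t).backward()                  # mean over all 6 elements

with torch.no_grad():
    # per-sample k-node factor (O - t) * O * (1 - O), shape [2, 3]
    delta = (o - t) * o * (1 - o)
    # gradient of E = sum 1/2 (O - t)^2, summed over samples: [3, 2] @ [2, 5] = [3, 5]
    grad_sum_half = delta.t() @ x
    # mean-reduced MSE rescales that gradient by 2 / N
    manual = (2.0 / o.numel()) * grad_sum_half

print(torch.allclose(manual, w.grad, atol=1e-6))
```

Passing `reduction='sum'` and multiplying the loss by `0.5` would match **$E$** exactly; the direction of the gradient is the same either way, only its scale changes.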
