In [2]:
'''Single Perceptron'''
import torch
import torch.nn.functional as F

#### Single Perceptron
<font size = 2>
    
A **single perceptron** computes, in a single layer, the weighted sum of the inputs **$x$** with the weights **$w$** and passes the result through an activation function.
    
<div>
<img src="SinglePerceptron_1.png" style="zoom:40%"/>
</div>
    
The variables follow a few notation conventions. The superscript gives the layer and the subscript gives the index of the node within that layer. E.g. **$x^{0}_{1}$** represents node 1 in layer 0. The weight **$w$** is slightly different because it carries two subscripts: the first is the index of the linked node in the previous layer, and the second is the index of the linked node in the current layer. E.g. **$w^{1}_{10}$** represents the weight in layer 1 that links node 1 of the previous layer to node 0 of the current layer. The activation function/output is written **$O^{1}_{0}$** under the same convention. **$E$** is the error between the output **$O^{1}_{0}$** and the label **$t$**, which in this case is the halved squared error:

$$O^{1}_{0} =\sigma \left( \sum^{N}_{j = 0} {w^{1}_{j0} x^{0}_{j}} \right)$$
$$E = \sum^{M}_{i = 0} \frac{1}{2} (O^{1}_{i} - t) ^ 2$$

Since this case has only one output, **$E$** simplifies to:

$$E = \frac{1}{2} (O^{1}_{0} - t) ^ 2$$
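
As a minimal sketch of the two formulas above, the forward pass and the halved squared error can be written out directly in PyTorch. The values of `x`, `w` and `t` are made up purely for illustration and are not taken from the notebook.

```python
import torch

# Hypothetical values, chosen only for illustration
x = torch.tensor([0.5, -1.0, 2.0])   # inputs x^0_j
w = torch.tensor([0.1, 0.4, -0.3])   # weights w^1_{j0}
t = torch.tensor(1.0)                # label t

# Weighted sum of inputs, then sigmoid: O^1_0 = sigma(sum_j w^1_{j0} x^0_j)
z = torch.dot(w, x)
o = torch.sigmoid(z)

# Halved squared error for a single output: E = 1/2 (O^1_0 - t)^2
e = 0.5 * (o - t) ** 2

print('weighted sum:', z.item())
print('output:', o.item())
print('error:', e.item())
```

The factor $\frac{1}{2}$ is only a convenience: it cancels the 2 that appears when the square is differentiated.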

#### Gradient Descent Update
<font size = 2>
    
The weights of a neural network are updated by gradient descent:

$$w^{(\tau + 1)} = w^{(\tau)} - \eta \frac{\partial{E}}{\partial{w^{(\tau)}}}$$
    
where **$w^{(\tau)}$** is the current weight, **$w^{(\tau + 1)}$** is the updated weight, **$E$** is the error and **$\eta$** is the learning rate.
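
The update rule can be sketched as a short loop. This is an illustration under assumptions, not part of the original notebook: the inputs, initial weights, learning rate `eta = 0.5` and the 50 steps are arbitrary choices, and the gradient is obtained from `torch.autograd.grad` rather than derived by hand.

```python
import torch

# Hypothetical data and hyperparameters, chosen only for illustration
x = torch.tensor([0.5, -1.0, 2.0])
t = torch.tensor(1.0)
w = torch.tensor([0.1, 0.4, -0.3], requires_grad=True)
eta = 0.5      # assumed learning rate
errors = []    # error recorded before each update

for step in range(50):
    # forward pass and halved squared error
    o = torch.sigmoid(torch.dot(w, x))
    e = 0.5 * (o - t) ** 2
    errors.append(e.item())

    # dE/dw at the current weights w^(tau)
    (grad,) = torch.autograd.grad(e, w)

    # w^(tau+1) = w^(tau) - eta * dE/dw, done outside the autograd graph
    with torch.no_grad():
        w -= eta * grad

print('first error:', errors[0])
print('last error:', errors[-1])
```

With these values the recorded error shrinks over the steps, because every update pushes the output closer to the label.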

#### Derivative of Single Perceptron
<font size = 2>

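To apply the update, we first need the gradient of the error with respect to the current weight **$w^{(\tau)}$**. Writing **$x^{1}_{0} = \sum_{j} w^{1}_{j0} x^{0}_{j}$** for the weighted input to the activation, so that **$O^{1}_{0} = \sigma (x^{1}_{0})$**, the chain rule gives: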

$$
\begin{equation}
\begin{aligned}
\frac{\partial{E}}{\partial{w^{1}_{j0}}} &= \frac{\partial{E}}{\partial{O^{1}_{0}}} \frac{\partial{O^{1}_{0}}}{\partial{w^{1}_{j0}}}\\
&= \frac{\partial{E}}{\partial{\sigma (x^{1}_{0})}} \frac{\partial{\sigma (x^{1}_{0})}}{\partial{w^{1}_{j0}}} \\
&= (O^{1}_{0} - t) \frac{\partial{\sigma (x^{1}_{0})}}{\partial{x^{1}_{0}}} \frac{\partial{x^{1}_{0}}}{\partial{w^{1}_{j0}}} \\
&= (O^{1}_{0} - t) O^{1}_{0} (1 - O^{1}_{0}) \frac{\partial{x^{1}_{0}}}{\partial{w^{1}_{j0}}} \\
&= (O^{1}_{0} - t) O^{1}_{0} (1 - O^{1}_{0}) \frac{\partial{\sum_{k} w^{1}_{k0} x^{0}_{k}}}{\partial{w^{1}_{j0}}} \\
&= (O^{1}_{0} - t) O^{1}_{0} (1 - O^{1}_{0}) x^{0}_{j}
\end{aligned}
\end{equation}
$$

Recap:
$$\frac{\partial{\sigma (x)}}{\partial{x}} = \sigma (x) (1 - \sigma (x))$$
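
Combining the chain rule with the sigmoid derivative in the recap gives $\frac{\partial E}{\partial w^{1}_{j0}} = (O^{1}_{0} - t)\, O^{1}_{0}\, (1 - O^{1}_{0})\, x^{0}_{j}$. The sketch below compares this closed form with what autograd computes for $E = \frac{1}{2}(O^{1}_{0} - t)^2$. The random seed, the 5-dimensional input and the tolerance are arbitrary assumptions made for the check.

```python
import torch

torch.manual_seed(0)                      # arbitrary seed, for repeatability
x = torch.randn(5)                        # inputs x^0_j
w = torch.randn(5, requires_grad=True)    # weights w^1_{j0}
t = torch.tensor(1.0)                     # label t

# forward pass and halved squared error
o = torch.sigmoid(torch.dot(w, x))
e = 0.5 * (o - t) ** 2

# gradient from autograd
e.backward()

# gradient from the closed form: (O - t) * O * (1 - O) * x_j
with torch.no_grad():
    analytic = (o - t) * o * (1 - o) * x

print('autograd:', w.grad)
print('analytic:', analytic)
print('match:', torch.allclose(w.grad, analytic, atol=1e-6))
```

Note that the check uses the halved squared error explicitly. `F.mse_loss` omits the factor $\frac{1}{2}$, so its gradients are twice the closed form.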

In [3]:
# input x: 1 sample with 5 features, shape [1,5]
# e.g. a person with features: age, weight, height, running velocity, eyesight degree
x = torch.randn(1,5)
# weight w: 1 neuron with 5 weights, one per input feature, shape [1,5]
# we want the gradient w.r.t. w, so set 'requires_grad' to True
w = torch.randn(1,5, requires_grad = True)
# output o: sigmoid(x @ w.t()), where [1,5] @ [5,1] = [1,1]
o = torch.sigmoid(x@w.t())
print('Output is:', o, o.type(), o.shape)
# target t: one scalar label per sample, shape [1,1]
# e.g. a value indicating which class a person belongs to:
# group A, or one of groups B/C/D/E
t = torch.ones(1,1).float()
# error/loss: F.mse_loss(input, target) takes the prediction first, then the label
# note: mse_loss averages (o - t)^2 without the 1/2 factor used in E,
# so these gradients are twice those of the hand derivation
e = F.mse_loss(o, t)
# compute the gradients with .backward()
# e is a scalar, so the 'gradient' argument of backward() is not needed;
# retain_graph = True keeps the graph so backward() can be called again
e.backward(retain_graph = True)
grad = w.grad
print('The gradients of Error w.r.t weights are:', grad)

Output is: tensor([[0.7573]], grad_fn=<SigmoidBackward0>) torch.FloatTensor torch.Size([1, 1])
The gradients of Error w.r.t weights are: tensor([[ 0.0088,  0.0459, -0.0724, -0.0967, -0.1758]])
