In [1]:
import torch
import torch.nn as nn
import numpy as np

$y=xA^T+b$
Here is a very simple example: $y=ax+b$, where the weight $a$ is 0.8693 and the bias $b$ is -0.2126 in this case.

torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)  
$y = xA^T+b$  
Variables:
- weight (torch.Tensor): shape (out_features, in_features)
- bias (torch.Tensor): shape (out_features)

example:
```python
m = nn.Linear(20,30)
print(m.weight.shape)
```
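prints `torch.Size([30, 20])`.  
So [1 x 30] = [1 x 20] x [20 x 30] + [30], where the bias of shape [30] is added to the single output row.  
$A^T$ is [20 x 30].  
$A$ is therefore [30 x 20].  
`orthogonal_` (`nn.init.orthogonal_`) initializes this matrix so that $A^TA=I$ (here the 20 x 20 identity, since $A$ has more rows than columns).  
The `gain` argument of `orthogonal_` is multiplied into every element, so with a gain $g$ the product becomes $A^TA=g^2I$.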
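
The shape bookkeeping and the orthogonality claim can be checked with a short sketch. The seed and the gain of 2.0 are illustrative assumptions, not values from the notes above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

m = nn.Linear(20, 30)
x = torch.randn(1, 20)

with torch.no_grad():
    # The layer computes x @ A^T + b, with A stored as m.weight (30 x 20).
    manual = x @ m.weight.T + m.bias
    out = m(x)
    shapes_match = manual.shape == out.shape == torch.Size([1, 30])
    outputs_match = torch.allclose(manual, out, atol=1e-5)

    # Orthogonal init: the columns of A become orthonormal, then every
    # element is scaled by gain, so A^T A = gain^2 * I (20 x 20).
    gain = 2.0  # illustrative value
    nn.init.orthogonal_(m.weight, gain=gain)
    gram = m.weight.T @ m.weight
    is_scaled_identity = torch.allclose(gram, gain**2 * torch.eye(20), atol=1e-4)

print(shapes_match, outputs_match, is_scaled_identity)
```

All three checks should print `True`; with `gain=1.0` the last check reduces to $A^TA=I$.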

In [3]:
model = nn.Linear(1, 1)

print(model.weight.size())
print(model.weight)
print(model.bias)
loss_function = nn.MSELoss()

torch.Size([1, 1])
Parameter containing:
tensor([[0.8693]], requires_grad=True)
Parameter containing:
tensor([-0.2126], requires_grad=True)


Next, we feed some inputs into this layer. As the result shows, the layer computes exactly  
$y=0.8693x-0.2126$

In [28]:
inputs = torch.tensor([[1.0], [2.0], [3.0], [4.0]], requires_grad=True)
# targets = torch.tensor([[2.0], [4.0], [6.0], [8.0]])
predictions = model(inputs)
print(predictions)

tensor([[0.6567],
        [1.5259],
        [2.3952],
        [3.2644]], grad_fn=<AddmmBackward0>)
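
The output above can be reproduced by hand. The sketch below copies the printed weight and bias into a fresh layer (hard-coding them is an assumption made for repeatability, since the real initialization is random) and compares the layer output with $0.8693x-0.2126$.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
with torch.no_grad():
    # Overwrite the random initialization with the values printed above.
    layer.weight.fill_(0.8693)
    layer.bias.fill_(-0.2126)

x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
by_layer = layer(x).detach()
by_hand = 0.8693 * x - 0.2126  # y = a*x + b, element by element

print(torch.allclose(by_layer, by_hand, atol=1e-6))
```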


The gradient of the sum of the predictions with respect to the weight is:

$\frac{\partial \sum(predictions)}{\partial weight} = \frac{\partial \sum(inputs \cdot weight+bias)}{\partial weight} = \sum(inputs)$  

For the inputs above this is $1+2+3+4=10$.

You can also check the gradient with respect to each input, $\frac{\partial \sum(predictions)}{\partial input_i} = weight$, which is 0.8693 here.

In [24]:
model.zero_grad()
predictions.backward(torch.ones_like(predictions))
print(model.weight.grad)
print(inputs.grad)

tensor([[10.]])
tensor([[0.8693],
        [0.8693],
        [0.8693],
        [0.8693]])


However, we do not want to move the weight in the direction that makes the predictions smaller; we want to minimize the loss function. So now we compute the loss. The MSE loss is:  

$L(y_{pred}, y_{target}) = \frac{1}{N} \sum_i (y_{pred_i} - y_{target_i})^2$  

Its gradient with respect to the weight, by the chain rule, is:

$\frac{\partial L}{\partial weight} = \frac{2}{N} \sum_i (y_{pred_i} - y_{target_i}) \cdot input_i$

In [29]:
model.zero_grad()
targets = torch.tensor([[2.0], [4.0], [6.0], [8.0]])
loss = loss_function(predictions, targets)
print(loss)
loss.backward()
print(model.weight.grad)
print(inputs.grad)

tensor(10.8365, grad_fn=<MseLossBackward0>)
tensor([[-18.0241]])
tensor([[-0.5839],
        [-1.0753],
        [-1.5668],
        [-2.0582]])
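
These numbers can be reproduced without autograd, using $\frac{\partial L}{\partial weight} = \frac{2}{N}\sum_i (y_{pred_i} - y_{target_i}) \cdot input_i$ and $\frac{\partial L}{\partial input_i} = \frac{2}{N}(y_{pred_i} - y_{target_i}) \cdot weight$. The sketch assumes the rounded weight and bias printed earlier, so the results should agree with the output above only to about three decimal places.

```python
import torch

w, b = 0.8693, -0.2126  # rounded values printed earlier in the notebook
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_target = torch.tensor([2.0, 4.0, 6.0, 8.0])

y_pred = w * x + b
residual = y_pred - y_target
n = x.numel()

loss = (residual ** 2).mean()               # MSE
grad_w = (2.0 / n) * (residual * x).sum()   # dL/dweight
grad_x = (2.0 / n) * residual * w           # dL/dinput, one value per sample

print(loss, grad_w, grad_x)
```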


We now have the gradient of the loss with respect to the weight. The gradient points in the direction in which the loss increases, so gradient descent steps in the opposite direction.
The SGD update is $weight = weight - lr \times gradient$  
With $lr = 0.01$ this gives
$0.8693 - 0.01 \times (-18.0241) = 1.0495$

In [30]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer.step()
print(model.weight)
print(model.weight.grad)
print(inputs.grad)

Parameter containing:
tensor([[1.0495]], requires_grad=True)
tensor([[-18.0241]])
tensor([[-0.5839],
        [-1.0753],
        [-1.5668],
        [-2.0582]])
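
As a cross-check, the sketch below applies the update rule $weight = weight - lr \times gradient$ by hand and compares it with what `optimizer.step()` does. It again hard-codes the printed weight and bias (an assumption for repeatability) and uses plain SGD without momentum or weight decay. Note that the bias is updated by the same rule.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
with torch.no_grad():
    layer.weight.fill_(0.8693)
    layer.bias.fill_(-0.2126)

x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])
loss = nn.MSELoss()(layer(x), y)
loss.backward()

lr = 0.01
# Manual update: parameter - lr * gradient.
expected_weight = layer.weight.detach().clone() - lr * layer.weight.grad
expected_bias = layer.bias.detach().clone() - lr * layer.bias.grad

# The optimizer performs the same update in place.
torch.optim.SGD(layer.parameters(), lr=lr).step()

print(torch.allclose(layer.weight.detach(), expected_weight),
      torch.allclose(layer.bias.detach(), expected_bias))
```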


In [31]:
# Clear the gradients before the next iteration; otherwise new gradients are added to the existing ones.
optimizer.zero_grad()
print(model.weight.grad)

None
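
The accumulation behaviour is easy to see in isolation. This sketch uses a fresh layer and the sum of the predictions as the quantity to differentiate, so the weight gradient is $\sum(inputs)=10$ regardless of the random initialization.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])

layer(x).sum().backward()
first = layer.weight.grad.clone()    # sum(inputs) = 10

layer(x).sum().backward()            # no zero_grad() in between
second = layer.weight.grad.clone()   # accumulated: 10 + 10 = 20

layer.zero_grad()                    # clear before the next backward pass
layer(x).sum().backward()
third = layer.weight.grad.clone()    # back to 10

print(first.item(), second.item(), third.item())  # 10.0 20.0 10.0
```

Whether `zero_grad()` leaves `grad` as `None` (as in the output above) or as a tensor of zeros depends on the PyTorch version and its `set_to_none` argument; either way the next `backward()` starts from a clean state.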
