# PyTorch Autograd

Reference:

[Autograd API Doc](https://pytorch.org/docs/stable/autograd.html)

[Back Propagation](https://medium.com/ai-academy-taiwan/bacn-propagation-3946e8ed8c55)

## Gradients and partial derivatives

During neural network training, gradient descent is a common way of determining how the weights within a neural network should be adjusted. By calculating the partial derivative of the loss function with respect to a given weight, we can see the direction in which that weight should be adjusted. When the partial derivative reaches 0, the weight is possibly at its optimal value.
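
As a minimal sketch of this idea, the plain-Python loop below runs gradient descent on a single weight. The loss $L(w) = (w - 3)^2$, the starting value, and the learning rate are illustrative assumptions and are not part of the network discussed in this document.

```python
# Hypothetical one-weight example: minimize L(w) = (w - 3)^2.
def loss(w):
    return (w - 3.0) ** 2

def dloss_dw(w):
    # Analytical derivative of the loss with respect to w
    return 2.0 * (w - 3.0)

w = 0.0              # assumed initial weight
learning_rate = 0.1  # assumed step size

for step in range(50):
    grad = dloss_dw(w)
    # Step against the gradient: a positive derivative means w is too large
    w = w - learning_rate * grad

print(w, dloss_dw(w))
```

After enough steps the derivative approaches 0 and the weight stops changing, which is the stopping condition described above.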

Consider the following simple neural network with 2 input features and 2 layers:

<img src="img/SimpleNeuralNetwork.jpg" width="900" height="400">


To find the impact of $w^{1}_{11}$ on the loss function, calculate the partial derivative
$\frac{\partial L}{\partial w^{1}_{11}}$

\begin{align}
&\hat{y}: \text{ground truth}, \quad y: \text{model output}\\
&L = \frac{1}{2n} \sum_{k=1}^n (y_k - \hat{y}_k)^2, \quad n = \text{batch size}\\
&\text{Assume batch size } n = 1 \text{, then } L = \frac{1}{2}(y - \hat{y})^2
\end{align}
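
The loss definition above can be checked numerically. The sketch below uses made-up values for the outputs and the ground truth (both are assumptions) and compares the formula with half of PyTorch's built-in mean squared error.

```python
import torch
import torch.nn.functional as F

# Following this document's notation: y is the model output, y_hat the ground truth.
y = torch.tensor([0.2, 1.5, -0.3])     # assumed model outputs
y_hat = torch.tensor([0.0, 1.0, 0.5])  # assumed ground truth
n = y.numel()                          # batch size

# L = 1/(2n) * sum_k (y_k - y_hat_k)^2
L_manual = ((y - y_hat) ** 2).sum() / (2 * n)

# mse_loss averages the squared errors, so half of it is the same quantity
L_builtin = 0.5 * F.mse_loss(y, y_hat)

print(L_manual.item(), L_builtin.item())
```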

\begin{align}
\frac{\partial L}{\partial w^{1}_{11}} &= \frac{\partial L}{\partial z^3_1} \frac{\partial z^3_1}{\partial w^{1}_{11}}\\
&= \frac{\partial L}{\partial z^3_1} \left( \frac{\partial z^3_1}{\partial a^2_1} \frac{\partial a^2_1}{\partial z^2_1} \frac{\partial z^2_1}{\partial a^1_1} \frac{\partial a^1_1}{\partial z^1_1} \frac{\partial z^1_1}{\partial w^1_{11}} + \frac{\partial z^3_1}{\partial a^2_2} \frac{\partial a^2_2}{\partial z^2_2} \frac{\partial z^2_2}{\partial a^1_1} \frac{\partial a^1_1}{\partial z^1_1} \frac{\partial z^1_1}{\partial w^1_{11}} \right) \\
&= \frac{\partial L}{\partial z^3_1} \left( \frac{\partial z^3_1}{\partial a^2_1} \frac{\partial a^2_1}{\partial z^2_1} \frac{\partial z^2_1}{\partial a^1_1} + \frac{\partial z^3_1}{\partial a^2_2} \frac{\partial a^2_2}{\partial z^2_2} \frac{\partial z^2_2}{\partial a^1_1} \right) \left( \frac{\partial a^1_1}{\partial z^1_1} \frac{\partial z^1_1}{\partial w^1_{11}} \right) \\
&= \frac{\partial L}{\partial z^3_1} \left( \sum_{j=1}^2 \frac{\partial z^3_1}{\partial a^2_j} \frac{\partial a^2_j}{\partial z^2_j} \frac{\partial z^2_j}{\partial a^1_1} \right) \left( \frac{\partial a^1_1}{\partial z^1_1} \frac{\partial z^1_1}{\partial w^1_{11}} \right) \\
&= \frac{\partial L}{\partial y} \left( \sum_{j=1}^2 \frac{\partial z^3_1}{\partial a^2_j} \frac{\partial a^2_j}{\partial z^2_j} \frac{\partial z^2_j}{\partial a^1_1} \right) \left( \frac{\partial a^1_1}{\partial z^1_1} \frac{\partial z^1_1}{\partial w^1_{11}} \right) \quad \text{since } y = z^3_1
\end{align}

\begin{align}
&\text{Substitute:}\\
&\frac{\partial L}{\partial y} = \frac{\partial}{\partial y}\left[\frac{1}{2}(y-\hat{y})^2\right] = y - \hat{y}\\
&\frac{\partial z^3_1}{\partial a^2_j}=\frac{\partial}{\partial a^2_j}(w^3_{11} a^2_1 + w^3_{12} a^2_2) = w^3_{1j}\\
&\frac{\partial z^2_j}{\partial a^1_1} = w^2_{j1} \quad \text{(same weight convention, one layer earlier)}\\
&\frac{\partial a^l_i}{\partial z^l_i} = \frac{\partial f(z^l_i)}{\partial z^l_i}, \quad \text{where } f \text{ is the activation function}
\end{align}

\begin{align}
\frac{\partial L}{\partial w^{1}_{11}} = (y - \hat{y}) \left( \sum_{j=1}^2 w^3_{1j}\frac{\partial f(z^2_j)}{\partial z^2_j} w^2_{j1} \right) \frac{\partial f(z^1_1)}{\partial z^1_1}\frac{\partial z^1_1}{\partial w^1_{11}}
\end{align}
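
The chain-rule expansion above can be checked against autograd. The sketch below is not taken from the figure: it assumes two neurons per hidden layer, a sigmoid activation for $f$, no bias terms, and random weights, so that $\frac{\partial z^1_1}{\partial w^1_{11}}$ is simply the first input feature. The names ```W1```, ```W2```, ```W3``` and ```df``` are hypothetical.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64  # double precision keeps the comparison tight

def df(z):
    # Analytical derivative of the assumed sigmoid activation
    s = torch.sigmoid(z)
    return s * (1 - s)

x = torch.randn(2, dtype=dtype)            # 2 input features
y_hat = torch.tensor(0.5, dtype=dtype)     # ground truth (document's notation)
W1 = torch.randn(2, 2, dtype=dtype, requires_grad=True)  # W1[i, j] plays the role of w^1_{i+1, j+1}
W2 = torch.randn(2, 2, dtype=dtype, requires_grad=True)
W3 = torch.randn(1, 2, dtype=dtype, requires_grad=True)

# Forward pass: z = W a, a = f(z), and y = z^3_1
z1 = W1 @ x
a1 = torch.sigmoid(z1)
z2 = W2 @ a1
a2 = torch.sigmoid(z2)
y = (W3 @ a2)[0]
L = 0.5 * (y - y_hat) ** 2

L.backward()  # autograd result ends up in W1.grad

with torch.no_grad():
    # (y - y_hat) * sum_j w^3_{1j} f'(z^2_j) w^2_{j1} * f'(z^1_1) * x_1
    inner = sum(W3[0, j] * df(z2[j]) * W2[j, 0] for j in range(2))
    manual = (y - y_hat) * inner * df(z1[0]) * x[0]

print(manual.item(), W1.grad[0, 0].item())
```

Under these assumptions the hand-derived value and the entry of ```W1.grad``` for $w^1_{11}$ agree up to floating-point error.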

## Purpose of autograd

Solving all of these partial derivatives by brute force would take a large amount of resources. However, after the chain-rule expansion, the remaining partial derivatives are those of the activation functions with respect to their inputs, which usually have analytical solutions and can be easily calculated.
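
To illustrate the difference, the sketch below compares a closed-form activation derivative with a brute-force numerical estimate. The sigmoid function and the evaluation point are assumptions chosen for illustration; the document does not say which activation the network uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Closed-form derivative: f'(z) = f(z) * (1 - f(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_grad(f, z, eps=1e-6):
    # Brute-force central difference: needs two extra function evaluations
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = 0.5  # assumed evaluation point
analytic = sigmoid_grad(z)
numeric = numerical_grad(sigmoid, z)
print(analytic, numeric)
```

The analytical form reuses the value already computed in the forward pass, while the numerical estimate needs extra evaluations for every input and is only approximate.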

PyTorch tensors can record the functions they have been passed through, from the input layer $x$ all the way to the output layer $y$. After a tensor $x$ has been processed by multiple layers of neurons and weights, generating an output tensor $y$, tensor $y$ holds a reference (its ```grad_fn```) to the chain of operations it has been through, in reverse order.

In [None]:
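import torch

# Start from a leaf tensor that tracks gradients
x1 = torch.arange(0, 5, 1, dtype=torch.float32, requires_grad=True)
print("Initial x = ", x1)
# Each operation below is recorded; the printed tensor shows it as grad_fn
x1 = x1 + 1
print("x + 1 = ", x1)
x1 *= 2
print("2*(x + 1) = ", x1)
x1 **= 2
print("pow(2*(x + 1), 2) = ", x1)
x1 /= 2
print("pow(2*(x + 1), 2)/2 = ", x1)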
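
The recorded history can also be walked explicitly. The sketch below repeats the same arithmetic out-of-place and follows ```grad_fn.next_functions``` from the result back to the leaf tensor. Following only the first input of each node is a simplification that works here because every step has a single tensor input.

```python
import torch

x = torch.arange(0, 5, 1, dtype=torch.float32, requires_grad=True)
y = ((x + 1) * 2) ** 2 / 2  # same operations as above, written out-of-place

node = y.grad_fn
names = []
while node is not None:
    names.append(type(node).__name__)
    # Follow the first input of each recorded operation
    node = node.next_functions[0][0] if node.next_functions else None

# Names run from the last operation back to the leaf tensor
print(" <- ".join(names))

# backward() replays this chain in reverse; here y = 2(x+1)^2, so dy/dx = 4(x+1)
y.sum().backward()
print(x.grad)
```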

## Actual usage of autograd

Suppose we are training this model:

In [None]:
import torch

torch.manual_seed(0)
BATCH_SIZE = 2
DIM_IN = 7
HIDDEN_SIZE = 4
DIM_OUT = 2

class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()
        self.layer1 = torch.nn.Linear(DIM_IN, HIDDEN_SIZE)
        self.relu = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(HIDDEN_SIZE, DIM_OUT)
        # For a 2-D input, the output of a linear layer has grad_fn=<AddmmBackward0>,
        # because torch.nn.Linear computes bias + input @ weight.T with addmm.
        # addmm ref: https://pytorch.org/docs/stable/generated/torch.addmm.html
    
    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x
    
some_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)     # random tensor as an input to the model
ideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)  # random ground truth 
loss_fn = torch.nn.L1Loss()

model = TinyModel()
optimizer = torch.optim.Adam(model.parameters())

In [None]:

predicted_output = model(some_input)
loss = loss_fn(predicted_output, ideal_output)


At this point, the model has produced a prediction for ```some_input``` using its randomly initialized, untrained parameters. The loss function has also measured how far the model output deviates from ```ideal_output```. But the gradients of the model have not been calculated yet, which means the model has no information about how to adjust its weights to reduce the loss.
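
What the forward pass does leave behind is the recorded graph. As a small self-contained sketch (it uses a single ```torch.nn.Linear``` layer with the same shape as ```layer1``` rather than the full ```TinyModel```, which is an assumption made to keep it short), the output and the loss both carry a ```grad_fn```:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(7, 4)   # same shape as layer1 in TinyModel
some_input = torch.randn(2, 7)  # batch of 2 samples, no gradient tracking
target = torch.randn(2, 4)      # hypothetical ground truth for this sketch

output = layer(some_input)
loss = torch.nn.L1Loss()(output, target)

print(output.requires_grad)  # True, because the output depends on the layer's parameters
print(output.grad_fn)        # the operation that produced the output
print(loss.grad_fn)          # the last operation recorded for the loss
```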


In [None]:
print(f"Current weights of model's layer1: (in {DIM_IN} out {HIDDEN_SIZE})")
model.layer1.weight

In [None]:
print("Current gradient of model's layer1:")
print(model.layer1.weight.grad)


So let's calculate the gradients using the ```backward()``` method of the loss tensor.

In [None]:
loss.backward()
print("Gradients of model's layer1 after loss.backward() method call:")
print(model.layer1.weight.grad)
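
One detail worth knowing is that ```backward()``` adds to any gradient already stored in ```.grad``` instead of overwriting it. The sketch below shows this on a hypothetical two-element weight tensor that is not part of ```TinyModel```.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # hypothetical weights

loss = (w ** 2).sum()
loss.backward()
first = w.grad.clone()   # d(sum w^2)/dw = 2w

loss = (w ** 2).sum()    # rebuild the graph for a second pass
loss.backward()
second = w.grad.clone()  # the new gradient was added to the old one

w.grad.zero_()           # reset before the next iteration
print(first, second, w.grad)
```

This accumulation is the reason gradients are reset between training steps.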

With gradients known, we can call the ```optimizer.step``` method to adjust the weights in the network.

In [None]:
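optimizer.step()       # update the weights using the gradients computed by backward()
print(model.layer1.weight)
optimizer.zero_grad()  # reset the gradients so they do not accumulate into the next backward() call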
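
Putting the steps together, a training loop repeats forward pass, ```backward()```, and ```optimizer.step()```. The sketch below is a self-contained variant that uses ```torch.nn.Sequential``` with the same layer sizes instead of ```TinyModel```. The learning rate and the number of steps are assumptions chosen so that the effect is visible quickly.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(7, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 2),
)
inputs = torch.randn(2, 7)   # random input batch
targets = torch.randn(2, 2)  # random ground truth
loss_fn = torch.nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # lr is an assumption

losses = []
for step in range(200):
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)  # forward pass records the graph
    loss.backward()                         # autograd fills .grad for every parameter
    optimizer.step()                        # adjust the weights using those gradients
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```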

## Conclusion
Although model training requires calculating many partial derivatives, the autograd feature provided by PyTorch makes the process painless and simple. By recording all the mathematical operations a tensor has gone through, from the input layer of the model all the way to its output layer and the loss function, autograd can compute the complicated partial derivatives analytically with the chain rule.

By calling the ```backward``` method on the tensor returned by the loss function, the gradients of the model are computed, and the optimizer can then adjust the model weights accordingly.