# PyTorch's Autograd Functionality

For large neural networks, implementing gradient descent by hand would be complicated, because every derivative would have to be worked out manually. In this notebook we show that PyTorch can automatically compute the derivatives of the functions you run a tensor through, as long as they are built from differentiable PyTorch operations.

Let's start with a simple example. We have a single value $x$ (a scalar, stored here as a tensor with one element) and a function $f_1(x) = x^2$. The derivative of $f_1$ with respect to $x$ is: $\frac{\partial f_1}{\partial x} = 2x$.

So for our value $x = 5$, the gradient of $f_1$ with respect to $x$ is $2 \cdot 5 = 10$.
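
One way to sanity-check this value without any autograd machinery is a central finite difference. The sketch below is plain Python. The helper name `numerical_derivative` and the step size `eps` are illustrative choices and not part of PyTorch.

```python
def f1(x):
    """The example function f1(x) = x^2."""
    return x ** 2


def numerical_derivative(f, x, eps=1e-6):
    """Approximate df/dx at x with a central difference (hypothetical helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


approx = numerical_derivative(f1, 5.0)
print(approx)  # close to the analytic value 2 * 5 = 10
```

The result only approximates the derivative, which is one reason automatic differentiation is preferred for training: it evaluates the exact derivative formulas instead.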

We will now show how to compute this gradient automatically with PyTorch.

In [1]:
import torch

# Define the vector as a PyTorch tensor. We also need to set the attribute 'requires_grad' to True
# so that PyTorch keeps track of the operations applied to this tensor.
x = torch.tensor([5.0], requires_grad=True)

# Define the function f1(x) = x^2
def f1(x):
    return x**2

# Apply the function to the tensor
y = f1(x)

# Compute the gradients
y.backward()

# Access the gradients
gradients = x.grad

# Now print the gradient to the console. Its value should be 10, as we have calculated before.
print(gradients)

tensor([10.])


This seems like magic: how does PyTorch know that the tensor was run through the function $f_1(x) = x^2$? It works because Python, as well as C++ (the language the performance-critical core of PyTorch is written in), allows standard operators such as plus and minus to be overloaded. Every time an operator or a function is applied to a tensor that requires gradients, PyTorch records the operation in a computation graph. When we then ask for a gradient, PyTorch walks back through the recorded operations in this graph and applies the chain rule.
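
To make the idea of operator overloading concrete, here is a toy sketch in plain Python. The `Value` class is hypothetical and is not how PyTorch is implemented internally. It only shows how overloading `*` and `+` lets an object remember where it came from, so the chain rule can be applied afterwards.

```python
class Value:
    """A toy scalar that records how it was computed (not PyTorch's implementation)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each entry is (parent_value, local derivative of self with respect to parent)
        self.parents = parents

    def __mul__(self, other):
        # d(a*b)/da = b and d(a*b)/db = a
        return Value(self.data * other.data,
                     parents=((self, other.data), (other, self.data)))

    def __add__(self, other):
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Value(self.data + other.data,
                     parents=((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate the incoming gradient, then pass it on to the parents
        self.grad += upstream
        for parent, local_grad in self.parents:
            parent.backward(upstream * local_grad)


x = Value(5.0)
y = x * x          # the overloaded * records both operands
y.backward()
print(x.grad)      # 10.0
```

This recursion visits every path through the graph separately, which is fine for a small example but inefficient for large graphs.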

With the chain rule it is also possible to compute gradients for multiple chained functions. For example, we can define a second function $f_2(x) = \ln(x)$, run $x$ first through $f_1(x)$, and then run the result through $f_2(x)$.

First we run $x$ through $f_1(x)$ and name the result $h$ ("hidden").

$$ h = f_1(x) = x^2 $$

The derivative of $f_1$ with respect to $x$ is $\frac{\partial f_1}{\partial x} = 2x$, as we already know.

Next we run $h$ through $f_2$.

$$ f_2(h) = \ln(h) $$

The derivative of $f_2$ with respect to $h$ is $\frac{\partial f_2}{\partial h} = \frac{1}{h} = \frac{1}{x^2}$

Now we build a chain of functions defined as $f(x) = f_2(f_1(x))$ which means we first run $x$ through $f_1$ and then run the results through $f_2$. How can we compute the derivative of $f$ with respect to $x$? We can simply do this by multiplying the derivative of $f_2$ with respect to its own input $h$ and the derivative of $f_1$ with respect to its input $x$. This looks like this:

$$ \frac{\partial f}{\partial x} = \frac{\partial f_2}{\partial h} \cdot \frac{\partial h}{\partial x} 
= \frac{1}{x^2} \cdot 2x = \frac{2}{x}
$$

As $h$ is the same as $f_1(x)$ we could also write it like this:

$$ \frac{\partial f}{\partial x} = \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial x} $$

So for $x = 5$ we should get the following results:

$$ \frac{\partial f_2}{\partial h} = \frac{1}{5^2} = 0.04 $$

$$ \frac{\partial h}{\partial x} = 2 \cdot 5 = 10 $$

$$ \frac{\partial f}{\partial x} = \frac{\partial f_2}{\partial h} \cdot \frac{\partial h}{\partial x}
= 0.04 \cdot 10 = 0.4 = \frac{2}{5} = \frac{2}{x}
$$
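
The two local derivatives and their product can be checked numerically. The following plain-Python sketch computes the chain rule product for $x = 5$ and compares it with a central finite difference of $f(x) = \ln(x^2)$. The variable names and the step size `eps` are illustrative assumptions.

```python
import math

x = 5.0
h = x ** 2                      # h = f1(x)

df2_dh = 1.0 / h                # derivative of ln(h) with respect to h
dh_dx = 2.0 * x                 # derivative of x^2 with respect to x
chain_rule = df2_dh * dh_dx     # product of the two local derivatives

# Independent check: central finite difference of f(x) = ln(x^2)
eps = 1e-6
finite_diff = (math.log((x + eps) ** 2) - math.log((x - eps) ** 2)) / (2 * eps)

print(df2_dh, dh_dx)            # the two local derivatives
print(chain_rule, finite_diff)  # both approximately 2 / x
```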

Now let's implement this with PyTorch.

In [2]:
# This is our input data
x = torch.tensor([5.0], requires_grad=True)

# Define the function f2(x) = ln(x)
def f2(x):
    return torch.log(x)

# Apply the functions to the tensor sequentially
h = f1(x)
z = f2(h)

print("h", h)
print("z", z)

# Compute the gradient of z with respect to the intermediate result h (df2 / dh).
# torch.autograd.grad returns a tuple with one gradient per input.
(grad_z_h,) = torch.autograd.grad(z, h, grad_outputs=torch.ones_like(z), retain_graph=True)
gradient_z = grad_z_h.item()
print(f"Gradient of f2 with respect to h: {gradient_z:.4f}")

# Compute the gradient of h with respect to x (df1 / dx)
h.backward(torch.ones_like(h), retain_graph=True)
gradient_h = x.grad.item()
print(f"Gradient of f1 with respect to x: {gradient_h:.4f}")

# Use the chain rule to compute the gradient of the chained function
gradient_f = gradient_z * gradient_h
print(f"Chain rule, gradient of f with respect to x: {gradient_f:.4f}")

# Reset the gradients. backward() adds to x.grad instead of overwriting it,
# so we need to delete the old gradients before computing another one
x.grad.zero_()

# Let autograd compute the gradient of the whole chain directly (df / dx)
z.backward(torch.ones_like(z))
print(f"Autograd, gradient of f with respect to x: {x.grad.item():.4f}")

h tensor([25.], grad_fn=<PowBackward0>)
z tensor([3.2189], grad_fn=<LogBackward0>)
Gradient of f2 with respect to h: 0.0400
Gradient of f1 with respect to x: 10.0000
Chain rule, gradient of f with respect to x: 0.4000
Autograd, gradient of f with respect to x: 0.4000


If we print both $h$ and $z$ to the console we will see the following outputs:

```
h tensor([25.], grad_fn=<PowBackward0>)
z tensor([3.2189], grad_fn=<LogBackward0>)
```

Each line shows two things. The first is the value of the tensor after it was run through $f_1$ (for $h$) or through $f_1$ and then $f_2$ (for $z$). The second, `grad_fn`, is the backward function that PyTorch recorded for the operation that produced the tensor. The first operation raised $x$ to the power of two, so its gradient function is named `PowBackward0`. The next operation took the logarithm of the result, so its gradient function is named `LogBackward0`.
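
The recorded operations can also be inspected programmatically. The sketch below follows the `grad_fn` links from $z$ back towards $x$ and collects the node names. It assumes a simple chain in which every node has exactly one input, which holds for this example but not for graphs in general.

```python
import torch

x = torch.tensor([5.0], requires_grad=True)
h = x ** 2
z = torch.log(h)

# Start at the node that produced z and follow the links back towards x
node = z.grad_fn
names = []
while node is not None:
    names.append(node.name())
    # next_functions holds (node, input_index) pairs; this chain has one input per node
    node = node.next_functions[0][0] if node.next_functions else None

print(names)  # starts with 'LogBackward0', then 'PowBackward0'
```

The last entry in the list is the node that accumulates the gradient into `x.grad`, since $x$ is the leaf tensor at the start of the chain.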

## More about Autograd

If you want to read more about PyTorch's autograd functionality, I highly recommend the following article from the PyTorch documentation:
https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

Another very useful site for computing derivatives is WolframAlpha. Check out what you get when you enter the following term into the search field: `df/dx ln(x^2)`
https://www.wolframalpha.com

## Using Autograd to optimize parameters

Now that we have seen how autograd works, we want to use it to train a simple linear function to fit our data. First we define a linear function and an MSE loss function that compares the predicted y-values to the true y-values.

In [3]:
def f_linear(x: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
    """
    This is a linear function of form y=ax+b
    :param x: Input of the model
    :param params: Model parameters [a, b]
    """
    return params[0] * x + params[1]

def mse_loss(y: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    """
    This is the element-wise squared error (y-y_hat)^2=(y-(ax+b))^2.
    Taking the mean of the result gives the MSE loss.
    :param y: True y-values
    :param y_hat: Predicted y-values
    """
    return (y - y_hat)**2

Next we define the data points, the true model parameters, and initial model parameters that differ from the true ones. We run the x-values through the true function and through our model and then compute the loss with the MSE loss function. Because we want to optimize the model parameters later, PyTorch needs to keep track of their gradients, so we set the attribute `requires_grad` to `True` for them. The x- and y-values do not change during training. They are treated as constants and do not need to be optimized, so there is no need to compute gradients for them.
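
The effect of `requires_grad` can be observed directly on the tensors. This short sketch uses the same values as the cell that follows and writes the linear function inline. It shows that a result needs gradients as soon as one of its inputs does, and that only tensors created directly by the user are leaf tensors.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])                            # constant input
true_params = torch.tensor([0.5, 0.5])                       # constant
model_params = torch.tensor([1.0, 1.0], requires_grad=True)  # to be optimized

y = true_params[0] * x + true_params[1]
y_hat = model_params[0] * x + model_params[1]

print(x.requires_grad, y.requires_grad)                 # False False
print(model_params.requires_grad, y_hat.requires_grad)  # True True
print(model_params.is_leaf, y_hat.is_leaf)              # True False
```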

In [9]:
# The input data
x = torch.tensor([1.0, 2.0, 3.0])

# The model's params need to be optimized during training, so gradients should be computed
model_params = torch.tensor([1.0, 1.0], requires_grad=True)

# The true parameters are fixed, so we do not need to compute gradients for them
true_params = torch.tensor([0.5, 0.5])

# Get true y-values and the predicted y-values of our model
y = f_linear(x, true_params)
y_hat = f_linear(x, model_params)

print("True y-values:", y)
print("Predicted y-values:", y_hat)

# Compute the loss between the predicted y-values and the true y-values
loss = mse_loss(y, y_hat).mean()
print("MSE Loss:", loss)

True y-values: tensor([1.0000, 1.5000, 2.0000])
Predicted y-values: tensor([2., 3., 4.], grad_fn=<AddBackward0>)
MSE Loss: tensor(2.4167, grad_fn=<MeanBackward0>)


Now we need to compute the gradients by calling the `backward` function on our loss value. Then we can access the gradients of the model's parameters.

In [5]:
# Compute the gradient of the loss function with respect to the model parameters
loss.backward()
model_params_grad = model_params.grad
print("Gradients for model parameters:", model_params_grad)

Gradients for model parameters: tensor([6.6667, 3.0000])
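
These values can be reproduced by hand. For the mean squared error over $n$ data points, the derivatives are $\frac{\partial L}{\partial a} = \frac{2}{n} \sum_i (\hat{y}_i - y_i) \cdot x_i$ and $\frac{\partial L}{\partial b} = \frac{2}{n} \sum_i (\hat{y}_i - y_i)$. The sketch below compares these formulas with autograd, writing the model and the loss inline. The variable names are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0, 1.5, 2.0])
params = torch.tensor([1.0, 1.0], requires_grad=True)

y_hat = params[0] * x + params[1]
loss = ((y - y_hat) ** 2).mean()
loss.backward()

# Hand-derived gradients of the mean squared error
residual = (y_hat - y).detach()
grad_a = (2 * residual * x).mean()   # dL/da
grad_b = (2 * residual).mean()       # dL/db
manual = torch.stack([grad_a, grad_b])

print(params.grad)  # gradients computed by autograd
print(manual)       # gradients computed from the formulas
```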


We can now use the computed gradients to perform the gradient descent step.

In [6]:
# Define a learning rate and perform the gradient descent step
alpha = 0.05
model_params = model_params - alpha * model_params_grad

# Print the updated parameters
print("Updated model parameters:", model_params)

Updated model parameters: tensor([0.6667, 0.8500], grad_fn=<SubBackward0>)


The updated model parameters are now closer to the true parameter values. If we repeat this step many times, we should end up with parameters that are very close to the true parameter values.

## Training loop with PyTorch's Autograd Module

In this section we build a training loop that performs gradient descent multiple times. We repeat the following steps for a number of epochs:

- Run data through the model
- Compute the loss
- Compute the gradients of our parameters
- Perform gradient descent step

In [7]:
# These are the initial model parameters
model_params = torch.tensor([1.0, 1.0])

# This is our input data
x = torch.tensor([1.0, 2.0, 3.0])

# These are the true y-values
y = torch.tensor([1.0, 1.5, 2.0])

# This is the learning rate for our training loop
alpha = 0.05

for epoch in range(50):
    # Detach the parameters from the previous epoch's graph and mark them as a fresh
    # leaf tensor. The update at the end of the loop produces a non-leaf tensor, and
    # PyTorch only populates .grad for leaf tensors. x is a constant and needs no copy.
    model_params = model_params.clone().detach().requires_grad_(True)

    # Run x through the model
    y_hat = f_linear(x, model_params)

    # Compute loss
    loss = mse_loss(y, y_hat).mean()

    # Print some information about the learning progress
    print(f"Epoch {epoch}: loss={loss.item()}, param_a={model_params[0]}, param_b={model_params[1]}")

    # Compute gradients (backpropagation step)
    loss.backward()

    # Perform gradient descent
    model_params = model_params - alpha * model_params.grad

print("\nModel parameters after training:", model_params)

Epoch 0: loss=2.4166667461395264, param_a=1.0, param_b=1.0
Epoch 1: loss=0.4854629337787628, param_a=0.6666666269302368, param_b=0.8500000238418579
Epoch 2: loss=0.10228254646062851, param_a=0.5188888311386108, param_b=0.7816666960716248
Epoch 3: loss=0.026139602065086365, param_a=0.4537407159805298, param_b=0.7497222423553467
Epoch 4: loss=0.010897613130509853, param_a=0.42538395524024963, param_b=0.734001874923706
Epoch 5: loss=0.007738022133708, param_a=0.4134044051170349, param_b=0.72552490234375
Epoch 6: loss=0.006978096440434456, param_a=0.4087107181549072, param_b=0.7202915549278259
Epoch 7: loss=0.0066972956992685795, param_a=0.40725407004356384, param_b=0.7165202498435974
Epoch 8: loss=0.006514647975564003, param_a=0.4072314500808716, param_b=0.7134174108505249
Epoch 9: loss=0.006354494486004114, param_a=0.40783995389938354, param_b=0.710629403591156
Epoch 10: loss=0.006201764103025198, param_a=0.4087221026420593, param_b=0.7079984545707703
Epoch 11: loss=0.006053395569324493,

Before the model was trained the initial values for the parameters were a=1 and b=1. After the training the parameters have become much closer to the true parameter values a=0.5 and b=0.5.
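
As an alternative sketch, the update can be done in place inside `torch.no_grad()`, so that the parameters stay a leaf tensor and do not have to be re-created in every epoch. The example writes the model and the loss inline, and the number of epochs (2000) is an assumption chosen so that the parameters get very close to the true values.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0, 1.5, 2.0])
params = torch.tensor([1.0, 1.0], requires_grad=True)
alpha = 0.05

for epoch in range(2000):
    y_hat = params[0] * x + params[1]
    loss = ((y - y_hat) ** 2).mean()
    loss.backward()

    # Update in place without recording the step in the autograd graph,
    # so params stays a leaf tensor and no clone/detach is needed
    with torch.no_grad():
        params -= alpha * params.grad

    # backward() adds to .grad, so clear it before the next epoch
    params.grad.zero_()

print(params, loss.item())  # parameters close to a=0.5 and b=0.5
```

Convergence is slow along one direction of this loss surface, which is why many more than 50 epochs are used here with the same learning rate.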

The big advantage of PyTorch's autograd module is that we do not have to implement the derivatives ourselves. It is enough to define the model and the loss function. PyTorch computes the gradients automatically.