In [181]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [182]:
import torch
from torch import Tensor

## Differentiating tensors
In part 01 we looked at how tensors can be used to store data and perform vectorised operations.
In PyTorch it is possible to differentiate the results of such operations with respect to the input variables, if desired.

To do this, the variable we want to differentiate with respect to (a `Tensor`, referred to here as the dependent variable) must be initialised with `requires_grad` set to `True`. Then, whenever the tensor is used in an operation, PyTorch records that operation in the `grad_fn` of the result, which eventually allows the result to be differentiated.

In [183]:
dep_var = torch.tensor([6.0], requires_grad=True)
dep_var

tensor([6.], requires_grad=True)

In [184]:
data = torch.randn(1)
data

tensor([-1.4449])

In [185]:
value = data*dep_var
value

tensor([-8.6692], grad_fn=<MulBackward0>)

Note that the value has a `grad_fn`, which means it can be differentiated.

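`torch.autograd` contains various functions to help with this, e.g. `torch.autograd.grad`. Since `value = data*dep_var`, the gradient with respect to `dep_var` is simply `data`. (The output shown below was recorded in a separate session, in which `data` was 0.7087 rather than -1.4449.)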

In [7]:
from torch.autograd import grad

In [8]:
grad(outputs=value, inputs=dep_var)

(tensor([0.7087]),)
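
As a quick sanity check, the sketch below confirms that the gradient of `data*dep_var` with respect to `dep_var` equals `data`. The fixed seed and the `.backward()` variant (which stores the gradient in `dep_var.grad` rather than returning it) are additions for illustration and are not part of the cells above.

```python
import torch
from torch.autograd import grad

torch.manual_seed(0)  # fixed seed for reproducibility (an assumption, not in the original)

dep_var = torch.tensor([6.0], requires_grad=True)
data = torch.randn(1)

# Functional route: grad returns a tuple with one gradient per input
value = data * dep_var
(d_value,) = grad(outputs=value, inputs=dep_var)
print(torch.allclose(d_value, data))  # True: d(data*dep_var)/d(dep_var) = data

# Alternative route: backward() accumulates the gradient into dep_var.grad
value = data * dep_var  # rebuild the graph, since the first one was freed by grad
value.backward()
print(torch.allclose(dep_var.grad, data))  # True
```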

### Differentiating functions
Above, we applied a function to a tensor to get a result, which we then differentiated with respect to the dependent input. It is also possible to differentiate functions directly for given inputs, without first computing the results of the functions ourselves. I don't particularly like this.

Note that `torch.autograd.functional.jacobian` also computes the gradient with respect to `data`, even though `data` wasn't set to require gradient.

These methods are mainly designed for *functional* programming, but you'd be better off looking into the [functorch](https://pytorch.org/functorch/stable/) extension for PyTorch (whose functionality has since been merged into PyTorch itself as `torch.func`), or just using [JAX](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html).

In [9]:
def func(data,dep_var):
    return data*dep_var

In [10]:
torch.autograd.functional.jacobian(func, (data,dep_var))

(tensor([[6.]]), tensor([[0.7087]]))

In [11]:
torch.autograd.functional.hessian(func, (data,dep_var))

((tensor([[0.]]), tensor([[1.]])), (tensor([[1.]]), tensor([[0.]])))
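
To make the shapes of these results more concrete, here is a minimal sketch that applies `torch.autograd.functional.jacobian` to three-element inputs. The input values are illustrative assumptions. Each returned Jacobian has shape `output.shape + input.shape`, and because the function is an element-wise product, both Jacobians are diagonal.

```python
import torch
from torch.autograd.functional import jacobian


def func(data, dep_var):
    return data * dep_var


# Illustrative inputs (not from the original notebook)
data = torch.tensor([0.5, -1.0, 2.0])
dep_var = torch.tensor([6.0, 3.0, 1.0])

# One Jacobian per input, returned in the same order as the inputs
jac_data, jac_dep_var = jacobian(func, (data, dep_var))

print(jac_data.shape, jac_dep_var.shape)  # torch.Size([3, 3]) torch.Size([3, 3])

# d(data_i * dep_var_i)/d(data_j) is dep_var_i when i == j, and zero otherwise
print(torch.allclose(jac_data, torch.diag(dep_var)))   # True
print(torch.allclose(jac_dep_var, torch.diag(data)))   # True
```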

### Multiple variables and second derivative

In [118]:
dep_var_0 = torch.tensor([6.0], requires_grad=True)
dep_var_1 = torch.tensor([-2.0], requires_grad=True)

In [119]:
data = torch.randn(1)
data

tensor([-0.7005])

In [120]:
value = (dep_var_0**data)+dep_var_1.square()
value

tensor([4.2850], grad_fn=<AddBackward0>)

To differentiate w.r.t. multiple variables, include them as a tuple in the inputs. The `retain_graph` argument allows us to recompute the gradient, if necessary, and the `create_graph` argument makes the output also have a `grad_fn`, if applicable.

In [121]:
jac = grad(outputs=value, inputs=(dep_var_0, dep_var_1), retain_graph=True, create_graph=True)
jac

(tensor([-0.0333], grad_fn=<WhereBackward0>),
 tensor([-4.], grad_fn=<MulBackward0>))

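Since the gradient has a `grad_fn`, we can compute the second derivative, too. Note that we don't get the full Hessian matrix, though: when several outputs are passed, `grad` sums their gradients, so each returned entry is the sum of one row of the Hessian. Here the off-diagonal terms are zero, so this coincides with the diagonal.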

In [122]:
grad(outputs=jac, inputs=(dep_var_0, dep_var_1), retain_graph=True, create_graph=True)

(tensor([0.0094], grad_fn=<WhereBackward0>),
 tensor([2.], grad_fn=<MulBackward0>))
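
To see what this kind of call returns when the variables interact, the sketch below uses a function with a cross term. The function `x0**2 * x1` and the names `x0`, `x1` are illustrative assumptions. Passing both first derivatives as outputs makes `grad` sum their gradients, so the result is the row sums of the Hessian rather than its diagonal.

```python
import torch
from torch.autograd import grad

x0 = torch.tensor([6.0], requires_grad=True)
x1 = torch.tensor([-2.0], requires_grad=True)

# Illustrative function with a cross term: f = x0^2 * x1
value = x0.square() * x1

# First derivatives: (2*x0*x1, x0^2) = (-24, 36)
jac = grad(outputs=value, inputs=(x0, x1), create_graph=True)

# Differentiating both outputs at once sums their gradients:
#   d(j0 + j1)/dx0 = 2*x1 + 2*x0 = 8
#   d(j0 + j1)/dx1 = 2*x0 + 0    = 12
second = grad(outputs=jac, inputs=(x0, x1))
print(second)  # (tensor([8.]), tensor([12.]))

# For comparison, the Hessian is [[2*x1, 2*x0], [2*x0, 0]] = [[-4, 12], [12, 0]],
# whose diagonal is (-4, 0) and whose row sums are (8, 12).
```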

#### Full Hessian

If you know that you'll later be computing the Hessian, or even just Jacobians, I find it best to keep all the dependent variables in a single `Tensor`:

In [123]:
dep_vars = torch.tensor([6.0, -2.0], requires_grad=True)

In [124]:
value = (dep_vars[0]**data)+dep_vars[1].square()
value

tensor([4.2850], grad_fn=<AddBackward0>)

We compute the Jacobian as normal. The gradient now comes back as a single tensor (wrapped in a one-element tuple), rather than as a tuple with one tensor per variable.

In [125]:
jac = grad(outputs=value, inputs=dep_vars, retain_graph=True, create_graph=True)
jac

(tensor([-0.0333, -4.0000], grad_fn=<AddBackward0>),)

Now we try to compute the Hessian:

In [126]:
grad(outputs=jac, inputs=dep_vars, retain_graph=True, create_graph=True)

RuntimeError: grad can be implicitly created only for scalar outputs

Oh dear! Unless `grad_outputs` is supplied, the output needs to be a single value, and we provided a tensor with two values. Instead we can supply each value in turn:

In [131]:
(grad(outputs=jac[0][0], inputs=dep_vars, retain_graph=True, create_graph=True),
 grad(outputs=jac[0][1], inputs=dep_vars, retain_graph=True, create_graph=True))

((tensor([0.0094, 0.0000], grad_fn=<AddBackward0>),),
 (tensor([0., 2.], grad_fn=<AddBackward0>),))

So now we get the full Hessian matrix. (In this case the off-diagonals were zero, but they might not always be.)
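
For a case where the off-diagonals are not zero, the following sketch builds the Hessian row by row in the same way and compares it against `torch.autograd.functional.hessian`. The function `f` and its input are illustrative assumptions, chosen so that the expected matrix is easy to derive by hand.

```python
import torch
from torch.autograd import grad
from torch.autograd.functional import hessian


def f(x):
    # Illustrative function: f = x0^2 * x1 + x1^3
    return x[0].square() * x[1] + x[1] ** 3


x = torch.tensor([6.0, -2.0], requires_grad=True)

# First derivatives, keeping the graph so they can be differentiated again
(jac,) = grad(outputs=f(x), inputs=x, create_graph=True)

# One grad call per element of the Jacobian gives one Hessian row each
rows = [grad(outputs=jac[i], inputs=x, retain_graph=True)[0] for i in range(len(jac))]
manual = torch.stack(rows)

# Analytically: [[2*x1, 2*x0], [2*x0, 6*x1]] = [[-4, 12], [12, -12]]
print(manual)

reference = hessian(f, x.detach())
print(torch.allclose(manual, reference))  # True
```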

## Batched gradients and better Hessians
We already saw that the `grad` function has trouble dealing with non-scalar outputs, which meant we needed to call it twice and got a tuple back. A more general approach is to iterate over each element of the Jacobian and stack the Hessian rows into a tensor:

In [155]:
torch.stack([grad(outputs=j, inputs=dep_vars, retain_graph=True, create_graph=True)[0] for j in jac[0].unbind(0)])  # unbind allows us to iterate through the tensor along the specified dimension

tensor([[0.0094, 0.0000],
        [0.0000, 2.0000]], grad_fn=<StackBackward0>)

In [154]:
%%timeit
torch.stack([grad(outputs=j, inputs=dep_vars, retain_graph=True, create_graph=True)[0] for j in jac[0].unbind(0)])  # unbind allows us to iterate through the tensor along the specified dimension

256 µs ± 8.49 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


A slightly quicker way is to still feed in the full Jacobian, but use the `grad_outputs` argument to *switch on* the element we want to differentiate. `torch.eye` creates an identity matrix of a given size, and iterating through it provides one-hot vectors.

In [156]:
torch.eye(len(jac[0]))

tensor([[1., 0.],
        [0., 1.]])

In [158]:
torch.stack([grad(outputs=jac[0], inputs=dep_vars, grad_outputs=i, retain_graph=True, create_graph=True)[0] for i in torch.eye(len(jac[0])).unbind(0)])

tensor([[0.0094, 0.0000],
        [0.0000, 2.0000]], grad_fn=<StackBackward0>)

In [157]:
%%timeit 
torch.stack([grad(outputs=jac[0], inputs=dep_vars, grad_outputs=i, retain_graph=True, create_graph=True)[0] for i in torch.eye(len(jac[0])).unbind(0)])

235 µs ± 7.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
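
The role of `grad_outputs` may be easier to see on a linear map, where the Jacobian is known exactly. In the sketch below (the matrix `A` and vectors `v`, `w` are illustrative assumptions), `grad` returns the vector-Jacobian product: a one-hot vector picks out a single row of the Jacobian, while a vector of ones sums the rows.

```python
import torch
from torch.autograd import grad

# Illustrative linear map y = A @ x, whose Jacobian dy/dx is simply A (shape 3x2)
A = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x = torch.tensor([1.0, -1.0], requires_grad=True)
y = A @ x

# A one-hot grad_outputs selects one output, i.e. one row of the Jacobian
v = torch.tensor([0.0, 1.0, 0.0])
(row,) = grad(outputs=y, inputs=x, grad_outputs=v, retain_graph=True)
print(row)  # tensor([3., 4.]), the second row of A

# A vector of ones sums the rows of the Jacobian instead
w = torch.ones(3)
(summed,) = grad(outputs=y, inputs=x, grad_outputs=w)
print(summed)  # tensor([ 9., 12.]), the column sums of A
```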


### Vectorised method
The above method works, but still relies on a Python for-loop to make serial calls to `grad`. It would be better to batch all of the calls together instead. We can do this using the PyTorch `vmap` function, although it is still experimental.

`vmap` takes a function and returns a new function that maps the original over the leading dimension of its inputs: conceptually it unbinds the inputs, applies the function to each slice, and stacks the results into a tensor, but without the Python loop:

In [159]:
# NB: this is a private, experimental API, so the import path may change between PyTorch versions
from torch._vmap_internals import _vmap as vmap

In [161]:
vmap(lambda i: grad(outputs=jac[0], inputs=dep_vars, grad_outputs=i, retain_graph=True, create_graph=True)[0])(torch.eye(len(jac[0])))

tensor([[0.0094, 0.0000],
        [0.0000, 2.0000]], grad_fn=<AddBackward0>)

In [160]:
%%timeit
vmap(lambda i: grad(outputs=jac[0], inputs=dep_vars, grad_outputs=i, retain_graph=True, create_graph=True)[0])(torch.eye(len(jac[0])))

213 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
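
As a possible alternative to importing `vmap` directly, `torch.autograd.grad` accepts an `is_grads_batched` flag, which treats the first dimension of `grad_outputs` as a batch of vectors. The sketch below assumes a PyTorch version recent enough to provide that flag, and uses an illustrative function rather than the one from the cells above.

```python
import torch
from torch.autograd import grad

x = torch.tensor([6.0, -2.0], requires_grad=True)

# Illustrative function: f = x0^2 * x1 + x1^3
value = x[0].square() * x[1] + x[1] ** 3

# First derivatives, with a graph so that they can be differentiated again
(jac,) = grad(outputs=value, inputs=x, create_graph=True)

# Each row of the identity matrix is one one-hot vector; with
# is_grads_batched=True all rows are handled in a single grad call
(hess,) = grad(
    outputs=jac,
    inputs=x,
    grad_outputs=torch.eye(len(jac)),
    is_grads_batched=True,
)

# Analytically: [[2*x1, 2*x0], [2*x0, 6*x1]] = [[-4, 12], [12, -12]]
print(hess)
```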


### Batched gradients
The "non-scalar output" issue doesn't just apply to Hessians. What if we want to efficiently differentiate a batch of items independently?

In [168]:
dep_vars = torch.tensor([6.0, -2.0], requires_grad=True)

In [176]:
data = torch.rand(10,2)

In [177]:
values = (dep_vars[0]**data)+dep_vars[1].square()
values

tensor([[7.8651, 9.1787],
        [7.4021, 5.5802],
        [5.9790, 5.4469],
        [8.6321, 5.3500],
        [7.2844, 9.0974],
        [5.0533, 5.3217],
        [9.9660, 5.9545],
        [6.8146, 7.7673],
        [5.7039, 8.3140],
        [8.2068, 9.7840]], grad_fn=<AddBackward0>)

In [178]:
grad(outputs=values, inputs=dep_vars, retain_graph=True, create_graph=True)

RuntimeError: grad can be implicitly created only for scalar outputs

The trick is to reuse our vmap'd Jacobian. However, even if we iterate over each row of the output values, a row still isn't a scalar value; we need to iterate over each element.

Rather than having the iteration adapt to every possible output shape, it is more convenient to write a generalised Jacobian function that works for any shape, by flattening the outputs and reshaping the result.

In [179]:
from typing import Optional

def jacobian(y: Tensor, x: Tensor, create_graph: bool = False, allow_unused: bool = True) -> Optional[Tensor]:
    r"""
    Computes the Jacobian (dy/dx) of y with respect to variables x. x and y can have multiple elements.
    If y has multiple elements then computation is vectorised via vmap.

    Arguments:
        y: tensor to be differentiated
        x: dependent variables
        create_graph: If True, graph of the derivative will
            be constructed, allowing to compute higher order derivative products.
            Default: False.
        allow_unused: If False, specifying inputs that were not
            used when computing outputs (and therefore their grad is always
            zero) is an error. Default: True.

    Returns:
        dy/dx tensor of shape y.shape+x.shape, or None if y is empty
    """

    # numel() also handles a 0-dim (scalar) y, for which len() raises a TypeError
    if y.numel() == 0:
        return None
    flat_y = y.reshape(-1)

    def get_vjp(v: Tensor) -> Tensor:
        # v is a one-hot vector that selects a single element of flat_y
        return torch.autograd.grad(
            flat_y, x, v, retain_graph=True, create_graph=create_graph, allow_unused=allow_unused
        )[0].reshape(x.shape)

    # Each row of the identity matrix picks out one element of flat_y
    basis = torch.eye(flat_y.numel(), dtype=y.dtype, device=y.device)
    return vmap(get_vjp)(basis).reshape(y.shape + x.shape)

In [180]:
jacobian(values, dep_vars)

tensor([[[ 0.4861, -4.0000],
         [ 0.7922, -4.0000]],

        [[ 0.3875, -4.0000],
         [ 0.0673, -4.0000]],

        [[ 0.1257, -4.0000],
         [ 0.0497, -4.0000]],

        [[ 0.6605, -4.0000],
         [ 0.0377, -4.0000]],

        [[ 0.3633, -4.0000],
         [ 0.7723, -4.0000]],

        [[ 0.0051, -4.0000],
         [ 0.0343, -4.0000]],

        [[ 0.9912, -4.0000],
         [ 0.1218, -4.0000]],

        [[ 0.2709, -4.0000],
         [ 0.4648, -4.0000]],

        [[ 0.0845, -4.0000],
         [ 0.5866, -4.0000]],

        [[ 0.5622, -4.0000],
         [ 0.9443, -4.0000]]])

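This gives us the Jacobian of every element: the result has shape `values.shape + dep_vars.shape`, i.e. `(10, 2, 2)`, and the last dimension holds the derivatives with respect to `dep_vars[0]` and `dep_vars[1]`.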
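
One way to check a result like this is against the analytic derivatives. The sketch below is a loop-based reference (one `grad` call per output element, so it is not the vectorised function above) with a fixed seed added for reproducibility; both of those choices are assumptions made for illustration. For `values = dep_vars[0]**data + dep_vars[1]**2`, the derivatives are `data * dep_vars[0]**(data-1)` and `2*dep_vars[1]`.

```python
import torch
from torch.autograd import grad

torch.manual_seed(0)  # fixed seed for reproducibility (an assumption, not in the original)

dep_vars = torch.tensor([6.0, -2.0], requires_grad=True)
data = torch.rand(10, 2)
values = (dep_vars[0] ** data) + dep_vars[1].square()

# Loop-based reference: one scalar output per grad call
flat = values.reshape(-1)
rows = [grad(outputs=flat[i], inputs=dep_vars, retain_graph=True)[0] for i in range(flat.numel())]
jac = torch.stack(rows).reshape(values.shape + dep_vars.shape)
print(jac.shape)  # torch.Size([10, 2, 2])

# Analytic derivatives, stacked along the last dimension to match jac
x0, x1 = dep_vars.detach()
expected = torch.stack([data * x0 ** (data - 1), torch.full_like(data, 2 * x1.item())], dim=-1)
print(torch.allclose(jac, expected, atol=1e-5))  # True
```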

## No-grad contexts
Calculations involving tensors that track gradients incur an increased cost in terms of time and memory, since the computational graph must be recorded. In cases where gradient tracking isn't required, the context manager `no_grad` may be used:

In [168]:
dep_vars = torch.tensor([6.0, -2.0], requires_grad=True)

In [176]:
data = torch.rand(10,2)

In [187]:
%%timeit
(dep_vars[0]**data)+dep_vars[1].square()

14.3 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [191]:
%%timeit
with torch.no_grad():
    (dep_vars[0]**data)+dep_vars[1].square()

10.2 µs ± 47.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


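The `inference_mode` context is slightly faster still (9.71 µs versus 10.2 µs here), at the cost that tensors created inside it cannot later be used in autograd: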

In [192]:
%%timeit
with torch.inference_mode():
    (dep_vars[0]**data)+dep_vars[1].square()

9.71 µs ± 59.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
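
Beyond timing, the practical difference between these contexts shows up in the attributes of the results. The sketch below (variable names `tracked`, `untracked` and `inferred` are illustrative) compares the same computation with tracking on, under `no_grad`, and under `inference_mode`.

```python
import torch

dep_vars = torch.tensor([6.0, -2.0], requires_grad=True)
data = torch.rand(10, 2)

# Default: the result is part of the autograd graph
tracked = (dep_vars[0] ** data) + dep_vars[1].square()

# no_grad: the same computation, but no graph is recorded
with torch.no_grad():
    untracked = (dep_vars[0] ** data) + dep_vars[1].square()

# inference_mode: no graph, and the result is marked as an inference tensor
with torch.inference_mode():
    inferred = (dep_vars[0] ** data) + dep_vars[1].square()

print(tracked.requires_grad, tracked.grad_fn is not None)  # True True
print(untracked.requires_grad, untracked.grad_fn)          # False None
print(inferred.requires_grad, inferred.is_inference())     # False True
```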


## Modifying tensor with gradient
Once a tensor is set to require gradient (i.e. a leaf tensor created with `requires_grad=True`), in-place modifications to it outside of a `no_grad` context will result in a `RuntimeError`.
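
A minimal sketch of this behaviour, assuming a leaf tensor like the ones used above: the in-place update fails while gradient tracking is active, but is permitted inside `no_grad`, and the tensor still requires gradient afterwards.

```python
import torch

x = torch.tensor([6.0, -2.0], requires_grad=True)

error_message = None
try:
    x.add_(1.0)  # in-place update of a leaf tensor that requires grad
except RuntimeError as err:
    error_message = str(err)
print(error_message is not None)  # True: the update was rejected

# Inside no_grad the update is not tracked, so it is allowed
with torch.no_grad():
    x.add_(1.0)

print(x)  # tensor([ 7., -1.], requires_grad=True)
```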