In [1]:
%matplotlib notebook

# Session 2: Reverse mode differentiation

[TODO: intro text]

#### Remarks

* We will restrict attention to operator overloading
* We will restrict attention to PyTorch / Torch / FTorch

#### Notation

[TODO: Recall derivative notation]

[TODO: Recall `_d` notation]

[TODO: Mention `_b` notation]

## History

* Reverse mode (the subject of this session) was discovered by Linnainmaa in the 1970s.
* Speelpenning introduced the modern formulation of reverse mode in 1980.
* Griewank improved the feasibility of reverse mode in 1992 by introducing checkpointing.

## Idea

[TODO: Jacobian Transpose vector product (Again, matrix-free)]

[TODO: "Back-propagation"]

## PyTorch reverse mode demo

This demo builds upon the PyTorch Autograd tutorial: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html.

In [2]:
import torch

Consider two tensors, each a vector of length 2. We pass `requires_grad=True` to the constructor to mark these tensors as active for differentiation.

In [3]:
a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([6.0, 4.0], requires_grad=True)

Construct another tensor as a mathematical combination of the first two tensors.

In [4]:
Q = 3 * (a ** 3 - b * b / 3)
print(f"Q = {Q}")

Q = tensor([-12.,  65.], grad_fn=<MulBackward0>)


[TODO: Remark on `grad_fn`]
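As a minimal sketch of what `grad_fn` records, the snippet below rebuilds the tensors from the demo and inspects the computational graph. It assumes only the standard autograd attributes `is_leaf`, `grad_fn` and `next_functions`, and is an illustration rather than a full account of how the graph is stored.

```python
import torch

# Rebuild the tensors from the demo so that this snippet stands alone.
a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([6.0, 4.0], requires_grad=True)
Q = 3 * (a ** 3 - b * b / 3)

# Tensors created directly by the user are leaves of the graph
# and therefore have no grad_fn.
print(a.is_leaf, a.grad_fn)

# Q is the result of an operation, so it records the backward function
# of the last operation applied: the multiplication by 3.
print(Q.is_leaf, Q.grad_fn.name())

# next_functions links to the backward nodes of the operands, which is
# how the backward pass walks the graph from Q towards a and b.
operand_nodes = [fn.name() for fn, _ in Q.grad_fn.next_functions if fn is not None]
print(operand_nodes)
```

The name `MulBackward0` in the printed value of `Q` is therefore the backward function for the outermost operation in the expression, not for the expression as a whole.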

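Next, we compute derivatives of `Q` with its `backward` method. The `backward` method accepts a `gradient` argument for an external gradient, which must have the same shape as `Q`. It may be omitted only when the output is a scalar, in which case it defaults to one. For a non-scalar tensor such as `Q` it must be supplied explicitly. Supplying a tensor of ones, as here, weights each component of `Q` equally. Other values account for cases where gradient values are propagated from another model or subcomponent.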

In [5]:
external_gradient = torch.ones((2,))
Q.backward(gradient=external_gradient)

Calling the `backward` method on tensor `Q` computes gradients with respect to every tensor it depends on that was constructed with the `requires_grad=True` setting, and stores the result in the `grad` attribute of each such tensor. As such, we may read off $\mathrm{d}Q/\mathrm{d}a$ and $\mathrm{d}Q/\mathrm{d}b$ as follows:

In [6]:
print(f"dQ/da = {a.grad}")
print(f"dQ/db = {b.grad}")

dQ/da = tensor([36., 81.])
dQ/db = tensor([-12.,  -8.])
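To illustrate the role of the external gradient, the sketch below repeats the computation with a non-uniform weight vector. The vector `v` is a hypothetical stand-in for a gradient passed back from a downstream computation. The tensors are rebuilt first, because repeated calls to `backward` accumulate into the existing `grad` attributes rather than overwriting them.

```python
import torch

# Fresh tensors, so that gradients do not accumulate into earlier results.
a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([6.0, 4.0], requires_grad=True)
Q = 3 * (a ** 3 - b * b / 3)

# Hypothetical external gradient with unequal weights.
v = torch.tensor([1.0, 0.5])
Q.backward(gradient=v)

# backward computes the vector-Jacobian product with v. Q acts
# elementwise, so its Jacobians are diagonal and each gradient entry
# is simply scaled by the matching entry of v.
print(f"a.grad = {a.grad}")
print(f"b.grad = {b.grad}")
```

Compared with the all-ones case, the second entry of each gradient is halved. The same result would be obtained by differentiating the scalar `(v * Q).sum()`.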


#### Exercise 1

Convince yourself that these gradient values are correct. That is, differentiate the expression
$$Q=3(a^3-b^2/3)$$
with respect to $a$, substitute the values $a=(2,3)$ and $b=(6,4)$, and check the values match those above. Then do the same thing differentiating with respect to $b$.
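One way to cross-check a hand derivation numerically is to compare the autograd result against central finite differences. This is a sketch under stated assumptions: the helper `Q_fn` and the step size `h` are illustrative choices, and double precision is used to keep rounding error small.

```python
import torch

def Q_fn(a, b):
    """The expression from the demo, applied elementwise."""
    return 3 * (a ** 3 - b * b / 3)

# Double precision keeps the finite difference rounding error small.
a = torch.tensor([2.0, 3.0], dtype=torch.float64, requires_grad=True)
b = torch.tensor([6.0, 4.0], dtype=torch.float64, requires_grad=True)
Q_fn(a, b).backward(gradient=torch.ones(2, dtype=torch.float64))

h = 1e-6  # hypothetical step size
with torch.no_grad():
    # Q acts elementwise, so perturbing every entry at once still
    # isolates dQ_i/da_i (and likewise dQ_i/db_i).
    fd_a = (Q_fn(a + h, b) - Q_fn(a - h, b)) / (2 * h)
    fd_b = (Q_fn(a, b + h) - Q_fn(a, b - h)) / (2 * h)

print(torch.allclose(a.grad, fd_a, rtol=1e-6))
print(torch.allclose(b.grad, fd_b, rtol=1e-6))
```

Finite differences only approximate the derivative, so agreement is checked to a tolerance rather than exactly.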

## Mathematical background on adjoint methods

[TODO]

[TODO: Continuous vs. discrete adjoint]

## Differences with forward mode

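* Forward mode is more appropriate if $\#\text{inputs}\ll\#\text{outputs}$, since each forward pass yields the derivative of every output with respect to one input direction.
* Reverse mode is more appropriate if $\#\text{inputs}\gg\#\text{outputs}$, since each reverse pass yields the derivative of one weighted combination of outputs with respect to every input.
    * e.g., ODE/PDE-constrained optimisation (cost function), ML training (loss function), goal-oriented error estimation (quantity of interest)
* Forward mode is computed *eagerly*, i.e., derivatives are propagated alongside the primal computation, whereas reverse mode requires a separate backward pass once the primal run has completed.
* Reverse mode tends to have higher memory requirements, because intermediate values from the primal run must be stored (or recomputed) for use in the backward pass.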

[TODO: Simple examples]

## Validation: Dot product test

[TODO: ensure consistency of fwd and rev]
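A minimal sketch of the dot product test, assuming PyTorch 2.0 or later for `torch.func.jvp` and `torch.func.vjp`. For a map with Jacobian $J$, forward mode gives $Ju$ and reverse mode gives $J^Tv$, so the two are consistent if $\langle Ju,v\rangle=\langle u,J^Tv\rangle$ up to rounding error. The function `f` and the random vectors are hypothetical choices made only for illustration.

```python
import torch
from torch.func import jvp, vjp  # assumes PyTorch 2.0 or later

def f(x):
    """Hypothetical map from R^3 to R^2, used only for illustration."""
    return torch.stack([x[0] * x[1] + torch.sin(x[2]), x[0] ** 2 * x[2]])

torch.manual_seed(0)
x = torch.randn(3, dtype=torch.float64)  # point of linearisation
u = torch.randn(3, dtype=torch.float64)  # direction in input space
v = torch.randn(2, dtype=torch.float64)  # direction in output space

# Forward mode gives J u; reverse mode gives J^T v.
_, Ju = jvp(f, (x,), (u,))
_, vjp_fn = vjp(f, x)
(JTv,) = vjp_fn(v)

# The two inner products should agree up to rounding error.
lhs = torch.dot(Ju, v)
rhs = torch.dot(u, JTv)
print(torch.allclose(lhs, rhs))
```

A mismatch beyond rounding error would indicate that the forward and reverse derivatives are not transposes of one another.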

## Checkpointing

[TODO]

## Second order

[TODO: Hessian]

[TODO: Forward then reverse]

## Speelpenning demo

[TODO: Example for computing Hessian - see e.g., https://github.com/coin-or/ADOL-C/blob/master/ADOL-C/examples/speelpenning.cpp]

## Exercise: Sensitivity analysis

[TODO]

## Exercise: Online training ML

[TODO]

## Further applications

[TODO]

* Data assimilation
* Uncertainty quantification
* PDE-constrained optimisation
    * Demo in Firedrake
* Goal-oriented error estimation
    * Demo in Firedrake
* Perhaps showcase combined adaptation + optimisation
    * Demo in Firedrake

## References

* S. Linnainmaa. *Taylor expansion of the accumulated rounding error*. BIT, 16(2):146–160, 1976.
* B. Speelpenning. *Compiling fast partial derivatives of functions given by algorithms*. University of Illinois, 1980.
* A. Griewank. *Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation*. Optimization Methods & Software, 1:35–54, 1992.