In [None]:
%matplotlib inline

In [None]:
import torch

# Session 2: Reverse mode differentiation and operator overloading

## Preparations

#### Terminology

**Recall from session 1**: This course introduces the concept of *differentiable programming*, a.k.a. *automatic differentiation (AD)*, or *algorithmic differentiation*. We will use the acronym AD henceforth.

#### Notation

**Recall from session 1**: For a differentiable *mathematical* function $f:A\rightarrow\mathbb{R}$ with scalar input (i.e., a single value) from $A\subseteq\mathbb{R}$, we make use of both the Lagrange notation $f'(x)$ and Leibniz notation $\frac{\mathrm{d}f}{\mathrm{d}x}$ for its derivative.
We **do not** use the physics notation for derivatives, so if you ever see (e.g.) $\dot{x}$ then this is just a variable name, not the derivative of $x$.

**Recall from session 1**: When it comes to *forward* derivatives in code, we use the `_d` notation, which is standard in the AD literature.

When it comes to *reverse* mode derivatives in code, we use the `_b` notation (for "backward").

## Recall: directional derivative, a.k.a. Jacobian-vector product (JVP)

> Consider a vector-valued function $\mathbf{f}$ mapping from a subset $A\subseteq\mathbb{R}^n$ into $\mathbb{R}^m$, for some $m,n\in\mathbb{N}$:
> $$\mathbf{f}:A\rightarrow\mathbb R^m.$$
>
> Given input $\mathbf{x}\in A$ and a *seed vector* $\dot{\mathbf{x}}\in\mathbb{R}^n$, forward mode AD allows us to compute the *action* (matrix-vector product)
> $$\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{x}}.$$
> Again, think of the seed vector as being an input from outside of the part of the program being differentiated.
>
> Here $\nabla\mathbf{f}$ is referred to as the *Jacobian* for the map, so the above is known as a *Jacobian-vector product (JVP)*.

<div class="alert alert-block alert-warning">
<b>Note</b>
The computation is <em>matrix-free</em>. We don't actually need the Jacobian when we compute this product.
</div>
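
As a minimal sketch of what matrix-free means here, the following hand-written forward mode code computes a JVP line by line using the `_d` notation. The function $\mathbf{f}(x_1,x_2)=(x_1x_2,\sin x_1)$ and the names `f_d`, `x1_d`, etc. are assumptions chosen purely for illustration. The Jacobian is only assembled afterwards, to check the result.

```python
import math


def f_d(x1, x1_d, x2, x2_d):
    """Hand-written forward mode for f(x1, x2) = (x1 * x2, sin(x1))."""
    y1 = x1 * x2
    y1_d = x1_d * x2 + x1 * x2_d  # product rule
    y2 = math.sin(x1)
    y2_d = math.cos(x1) * x1_d  # chain rule
    return (y1, y2), (y1_d, y2_d)


x1, x2 = 2.0, 3.0
x1_d, x2_d = 1.0, -1.0  # seed vector, one entry per input
_, (y1_d, y2_d) = f_d(x1, x1_d, x2, x2_d)

# Explicit Jacobian, assembled here only to verify the matrix-free result
jacobian = [[x2, x1], [math.cos(x1), 0.0]]
expected = [row[0] * x1_d + row[1] * x2_d for row in jacobian]
print((y1_d, y2_d), expected)
```

Note that `f_d` never forms the Jacobian: each derivative statement sits next to the primal statement it differentiates.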

## Jacobian-transpose-vector product (JTVP)

Consider a vector-valued function $\mathbf{f}$ mapping from a subset $A\subseteq\mathbb{R}^n$ into $\mathbb{R}^m$, for some $m,n\in\mathbb{N}$:
$$\mathbf{f}:A\rightarrow\mathbb{R}^m.$$

Given $\mathbf{x}\in A$ and a seed vector $\dot{\mathbf{y}}\in\mathbb{R}^m$, reverse mode AD allows us to compute the *transpose action* (transposed matrix-vector product)
$$\nabla\mathbf{f}(\mathbf{x})^T\dot{\mathbf{y}}.$$

<div class="alert alert-block alert-info">
<b>Optional exercise</b>

Convince yourself that the JTVP is well defined.

<b>Solution</b>

<details>
We have $\nabla\mathbf{f}(\mathbf{x})\in\mathbb{R}^{m\times n}$, so $\nabla\mathbf{f}(\mathbf{x})^T\in\mathbb{R}^{n\times m}$. Since $\dot{\mathbf{y}}\in\mathbb{R}^m$, the dimensions are appropriate to take the JTVP.
</details>
</div>

<div class="alert alert-block alert-warning">
<b>Note</b>
Again, the computation is <em>matrix-free</em>. We don't actually need the Jacobian or its transpose when we compute this product.
</div>
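
To make the matrix-free JTVP concrete, here is a minimal sketch of hand-written reverse mode code using the `_b` notation. The function $\mathbf{f}(x_1,x_2)=(x_1x_2,\sin x_1)$ and the names `f_b`, `y1_b`, etc. are assumptions chosen purely for illustration. The transposed Jacobian is only assembled afterwards, to check the result.

```python
import math


def f_b(x1, x2, y1_b, y2_b):
    """Hand-written reverse mode for f(x1, x2) = (x1 * x2, sin(x1))."""
    # Forward sweep: run the primal computation
    y1 = x1 * x2
    y2 = math.sin(x1)
    # Reverse sweep: visit the statements in reverse order, accumulating adjoints
    x1_b = 0.0
    x2_b = 0.0
    x1_b += math.cos(x1) * y2_b  # adjoint of y2 = sin(x1)
    x1_b += x2 * y1_b  # adjoint of y1 = x1 * x2 with respect to x1
    x2_b += x1 * y1_b  # adjoint of y1 = x1 * x2 with respect to x2
    return (y1, y2), (x1_b, x2_b)


x1, x2 = 2.0, 3.0
y1_b, y2_b = 1.0, 0.5  # seed vector, one entry per output
_, (x1_b, x2_b) = f_b(x1, x2, y1_b, y2_b)

# Explicit Jacobian transpose, assembled here only to verify the result
jacobian_T = [[x2, math.cos(x1)], [x1, 0.0]]
expected = [row[0] * y1_b + row[1] * y2_b for row in jacobian_T]
print((x1_b, x2_b), expected)
```

The seed has one entry per output and the result has one entry per input, which is the opposite of the forward mode case.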

## Reverse mode: differences with forward mode

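* Forward mode is more appropriate if $\#\text{inputs}\ll\#\text{outputs}$.
* Reverse mode is more appropriate if $\#\text{inputs}\gg\#\text{outputs}$.
    * e.g., ODE/PDE-constrained optimisation (cost function), ML training (loss function), goal-oriented error estimation (quantity of interest)
* Forward mode derivatives are computed *eagerly*, alongside the primal computation, whereas reverse mode derivatives are computed in a separate backward pass after the primal run.
* Reverse mode tends to have higher memory requirements, because intermediate values from the primal run must be stored (or recomputed) for use in the backward pass.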
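
The input/output trade-off can be sketched for a scalar-valued function with $n$ inputs: forward mode needs one JVP per input to assemble the gradient, whereas reverse mode needs a single backward pass. The function `cost` below is a hypothetical example, and the sketch assumes PyTorch 2.0 or later for `torch.func.jvp`.

```python
import torch
from torch.func import jvp


def cost(x):
    """Illustrative scalar-valued function with n inputs and one output."""
    return torch.sum(torch.sin(x) * x)


n = 5
x = torch.linspace(0.0, 1.0, n)

# Forward mode: one JVP per input, seeding with each unit vector in turn
grad_forward = torch.zeros(n)
for i in range(n):
    seed = torch.zeros(n)
    seed[i] = 1.0
    _, grad_forward[i] = jvp(cost, (x,), (seed,))

# Reverse mode: a single backward pass yields the whole gradient
x_active = x.clone().requires_grad_(True)
cost(x_active).backward()
grad_reverse = x_active.grad

print(grad_forward)
print(grad_reverse)
```

Both approaches give the same gradient, but the forward mode loop grows with $n$ while the reverse mode cost does not.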

[TODO: Simple examples]

## Validation: the dot product test

<div class="alert alert-block alert-danger">
<b>TODO: define test; check consistency of Tapenade output</b>
</div>
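
One common formulation of the dot product test, sketched here as an assumption about what this section is intended to cover: if forward mode gives $\mathbf{y}_d=\nabla\mathbf{f}(\mathbf{x})\,\mathbf{x}_d$ and reverse mode gives $\mathbf{x}_b=\nabla\mathbf{f}(\mathbf{x})^T\mathbf{y}_b$, then $\mathbf{y}_b\cdot\mathbf{y}_d=\mathbf{x}_b\cdot\mathbf{x}_d$ should hold up to rounding error for arbitrary seeds. The function and the hand-written `f_d` and `f_b` below are illustrative only and are not Tapenade output.

```python
import math
import random


def f_d(x1, x1_d, x2, x2_d):
    """Forward mode (JVP) for f(x1, x2) = (x1 * x2, sin(x1))."""
    return x1_d * x2 + x1 * x2_d, math.cos(x1) * x1_d


def f_b(x1, x2, y1_b, y2_b):
    """Reverse mode (JTVP) for the same function."""
    return x2 * y1_b + math.cos(x1) * y2_b, x1 * y1_b


random.seed(0)
x1, x2 = 2.0, 3.0
x_d = [random.uniform(-1, 1) for _ in range(2)]  # random forward seed
y_b = [random.uniform(-1, 1) for _ in range(2)]  # random reverse seed

y_d = f_d(x1, x_d[0], x2, x_d[1])
x_b = f_b(x1, x2, y_b[0], y_b[1])

lhs = sum(yb * yd for yb, yd in zip(y_b, y_d))  # y_b . (J x_d)
rhs = sum(xb * xd for xb, xd in zip(x_b, x_d))  # (J^T y_b) . x_d
print(f"lhs = {lhs:.15e}, rhs = {rhs:.15e}, diff = {abs(lhs - rhs):.2e}")
```

A mismatch well above rounding error would suggest that the forward and reverse derivative codes are inconsistent with each other. The test does not by itself show that either is correct with respect to the primal function.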

## Approach 2: Operator overloading

<div class="alert alert-block alert-danger">
<b>TODO: gradient values, tape, overloaded operators, tape unrolling</b>
</div>
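
As a rough sketch of how these ingredients might fit together, the following pure Python example overloads `+` and `*` on a hypothetical `Var` class so that each operation is recorded on a tape, then unrolls the tape in reverse to accumulate gradient values. The class, the global `tape`, and the `sin` and `backward` helpers are all illustrative assumptions rather than the design of any particular tool.

```python
import math

# Global tape: one (output, [(input, local partial), ...]) entry per operation
tape = []


class Var:
    """Active variable carrying a value and an accumulated gradient value."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def __add__(self, other):
        out = Var(self.value + other.value)
        tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        tape.append((out, [(self, other.value), (other, self.value)]))
        return out


def sin(x):
    """Overloaded sine that records its local partial derivative."""
    out = Var(math.sin(x.value))
    tape.append((out, [(x, math.cos(x.value))]))
    return out


def backward(output):
    """Unroll the tape in reverse, accumulating gradient values."""
    output.grad = 1.0
    for out, inputs in reversed(tape):
        for var, partial in inputs:
            var.grad += partial * out.grad


x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2 + sin(x1)
backward(y)
print(x1.grad, x2.grad)
```

Here the user code `y = x1 * x2 + sin(x1)` is written as usual. The derivative bookkeeping happens behind the scenes because the operators are overloaded for `Var`.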

#### Operator overloading tools:

* LLVM
    * [Enzyme](https://enzyme.mit.edu) <!-- is a plugin that performs automatic differentiation (AD) of statically analyzable LLVM. By operating on the LLVM level Enzyme is able to perform AD across a variety of languages (C/C++, Fortran , Julia, etc.) and perform optimization prior to AD -->
* C/C++
    * About 2 dozen AD tools!
    * e.g., [ADIC](https://www.mcs.anl.gov/research/projects/adic), [ADOL-C](https://github.com/coin-or/ADOL-C), [Torch Autograd](https://pytorch.org/tutorials/advanced/cpp_autograd.html), [CoDiPack](https://github.com/SciCompKL/CoDiPack), [Sacado](https://docs.trilinos.org/dev/packages/sacado/doc/html/index.html), [dco/c++](https://nag.com/automatic-differentiation) [commercial]
* Fortran
    * [Differentia](https://github.com/Nicholaswogan/Differentia), [lots of abandonware...]
* Python
    * [PyADOL-C](https://github.com/b45ch1/pyadolc), [JAX](https://github.com/jax-ml/jax), [PyTorch Autograd](https://pytorch.org/docs/stable/autograd.html)
* Julia
    * About 2 dozen AD tools! https://juliadiff.org/
    * e.g., Enzyme, [Zygote](https://fluxml.ai/Zygote.jl/stable), [ForwardDiff](https://juliadiff.org/ForwardDiff.jl/stable)
    * [DifferentiationInterface](https://www.juliapackages.com/p/differentiationinterface)
* Domain-specific
    * [dolfin-adjoint/pyadjoint](https://github.com/dolfin-adjoint/pyadjoint) (Python/UFL - Firedrake & FEniCS)
* And many more! https://autodiff.org/?module=Tools

## PyTorch reverse mode demo

Builds upon the PyTorch Autograd tutorial: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html.

Consider two tensors, each a vector of length 2. We pass `requires_grad=True` to the constructor to mark these tensors as active for differentiation.

In [None]:
a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([6.0, 4.0], requires_grad=True)

Construct another tensor as a mathematical combination of the first two tensors.

In [None]:
Q = 3 * (a ** 3 - b * b / 3)
print(f"Q = {Q}")

<div class="alert alert-block alert-danger">
<b>TODO: Remark on `grad_fn`</b>
</div>
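
As a brief sketch of what `grad_fn` holds: a tensor produced by an operation on active tensors records a reference to the graph node that autograd uses for it in the backward pass, whereas tensors created directly by the user are *leaves* with no `grad_fn`. The snippet repeats the definitions above so that it can be run on its own.

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([6.0, 4.0], requires_grad=True)
Q = 3 * (a ** 3 - b * b / 3)

# Leaf tensors created by the user have no grad_fn
print(a.is_leaf, a.grad_fn)
# Q was produced by an operation, so it records the node used in the backward pass
print(Q.is_leaf, Q.grad_fn)
# Each node links to the nodes of the operations that produced its inputs
print(Q.grad_fn.next_functions)
```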

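Next, we compute derivatives of `Q` with its `backward` method. The `backward` method accepts an optional `gradient` argument for an external gradient, which plays the role of the seed vector in the JTVP. It may be omitted when the output is a scalar, in which case it defaults to one. For a non-scalar tensor such as `Q`, it must be provided and must have the same shape as `Q`. Here we use a tensor of ones, but the external gradient can be customised to account for cases where gradient values are propagated from another model or subcomponent.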

In [None]:
external_gradient = torch.ones((2,))
Q.backward(gradient=external_gradient)

Calling the `backward` method on tensor `Q` computes gradients with respect to each of its dependencies that was constructed with the `requires_grad=True` setting and accumulates them in the corresponding `.grad` attributes. As such, we may inspect $\mathrm{d}Q/\mathrm{d}a$ and $\mathrm{d}Q/\mathrm{d}b$ as follows:

In [None]:
print(f"dQ/da = {a.grad}")
print(f"dQ/db = {b.grad}")
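
To illustrate how the external gradient acts as a seed, the following sketch recomputes the gradients with a non-trivial seed and compares them against the all-ones case. It uses `torch.autograd.grad`, which returns the gradients rather than accumulating them into `.grad`. The seed values are arbitrary, and the definitions are repeated so that the snippet can be run on its own.

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([6.0, 4.0], requires_grad=True)
Q = 3 * (a ** 3 - b * b / 3)

# Seed of ones; retain the graph so that it can be traversed a second time
ones = torch.ones(2)
da_ones, db_ones = torch.autograd.grad(Q, (a, b), grad_outputs=ones, retain_graph=True)

# A different seed weights each component of Q before propagating backwards
seed = torch.tensor([0.5, 2.0])
da_seed, db_seed = torch.autograd.grad(Q, (a, b), grad_outputs=seed)

# Each Q[i] depends only on a[i] and b[i], so the Jacobians are diagonal
# and the JTVP reduces to an elementwise scaling by the seed.
print(torch.allclose(da_seed, seed * da_ones))
print(torch.allclose(db_seed, seed * db_ones))
```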

#### Exercise 1

Convince yourself that these gradient values are correct. That is, differentiate the expression
$$Q=3(a^3-b^2/3)$$
with respect to $a$, substitute the values $a=(2,3)$ and $b=(6,4)$, and check that the values match those above. Then do the same for differentiation with respect to $b$.

## Second order

<div class="alert alert-block alert-danger">
<b>TODO: how to calculate Hessian using forward then reverse </b>
</div>
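
One possible reading of "forward then reverse" is the forward-over-reverse approach: reverse mode provides the gradient function, and forward mode then differentiates that gradient along a direction $\mathbf{v}$, yielding a Hessian-vector product. The sketch below assumes PyTorch 2.0 or later for `torch.func`, and the function `f` is a hypothetical example chosen because its Hessian is known in closed form.

```python
import torch
from torch.func import grad, jvp


def f(x):
    """Illustrative scalar function with Hessian diag(6 * x)."""
    return torch.sum(x ** 3)


x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, 0.0, -1.0])  # direction in which to differentiate the gradient

# Reverse mode gives the gradient function; forward mode differentiates it along v
gradient, hvp = jvp(grad(f), (x,), (v,))

# For this f the Hessian is diag(6 * x), so the product can be checked directly
expected = 6 * x * v
print(hvp, expected)
```

Seeding with each unit vector in turn would recover the full Hessian one column at a time.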

## Speelpenning demo

<div class="alert alert-block alert-danger">
<b>TODO: example for computing Hessian - see e.g., https://github.com/coin-or/ADOL-C/blob/master/ADOL-C/examples/speelpenning.cpp</b>
</div>

## Mathematical background on adjoint methods

<div class="alert alert-block alert-danger">
<b>TODO: derivation; continuous vs. discrete </b>
</div>

## Checkpointing

<div class="alert alert-block alert-danger">
<b>TODO: overview </b>
</div>

## Exercise: Sensitivity analysis

<div class="alert alert-block alert-danger">
<b>TODO </b>
</div>

## Exercise: Online training ML

<div class="alert alert-block alert-danger">
<b>TODO </b>
</div>

## Further applications

<div class="alert alert-block alert-danger">
<b>TODO </b>
</div>

* Data assimilation
* Uncertainty quantification
* PDE-constrained optimisation
    * Demo in Firedrake
* Goal-oriented error estimation
    * Demo in Firedrake
* Perhaps showcase combined adaptation + optimisation
    * Demo in Firedrake

## References

* S. Linnainmaa. *Taylor expansion of the accumulated rounding error*. BIT, 16(2):146–160, 1976.
* B. Speelpenning. *Compiling fast partial derivatives of functions given by algorithms*. University of Illinois, 1980.
* A. Griewank. *Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation*. Optimization Methods & Software, 1:35–54, 1992.