%matplotlib notebook

# Session 1: Forward mode differentiation

This course introduces the concept of *differentiable programming*, a.k.a. *automatic differentiation (AD)*, or *algorithmic differentiation*. We will use the acronym AD henceforth.

#### Notation

For a mathematical function $f:A\rightarrow\mathbb{R}$ with scalar input (i.e., a single value) from $A\subseteq\mathbb{R}$, we make use of both the Lagrange notation $f'(x)$ and Leibniz notation $\frac{\mathrm{d}f}{\mathrm{d}x}$.
We **do not** use the physics notation for derivatives, so if you ever see (e.g.) $\dot{x}$ then this is just a variable name, not the derivative of $x$.

Similarly, for a vector-valued function $\mathbf{f}:A\rightarrow\mathbb{R}^m$ with scalar input and $m\in\mathbb{N}$ outputs, we have derivative notations $\mathbf{f}'(x)$ and $\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}x}$.

For a function with vector input (i.e., multiple inputs), we use partial derivative notation. For example, if $f:\mathbb{R}^2\rightarrow\mathbb{R}$ is written as $f=f(x,y)$ then we have the partial derivatives $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ with respect to the first and second components, respectively. More generally, for $f:\mathbb{R}^n\rightarrow\mathbb{R}$ with inputs $x_1,\dots,x_n$, we use
$$\nabla f=\left(\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right)$$
to denote the vector of all such partial derivatives. The same notation applies to vector-valued functions with multiple inputs.

When it comes to derivatives of code, we use the `_d` notation, which is standard in the AD literature. Its meaning will be described in due course.

## History

* The origins of AD date back to the 1950s.
* However, it found a wider audience in the 1980s, when it became more relevant thanks to advances in both computer power and modern programming languages.
* Forward mode (the subject of this session) was discovered by Wengert in 1964.
* Further developed by Griewank in the late 1980s.

## Idea

The idea of AD is to **treat a model as a sequence of elementary instructions** (e.g., addition, multiplication, exponentiation). Here a *model* could be a function or subroutine, code block, or a whole program. Elementary operations are well-understood and their derivatives are known. As such, the derivative of the whole model may be computed by composing the derivatives of each operation using the *chain rule*.
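
As a minimal sketch of this idea, the Python snippet below breaks a small function, $y=\sin(x^2)+x$, into elementary instructions and propagates a derivative alongside each one. The function itself and the names `model` and `model_d` are assumptions chosen for illustration; they do not come from the course repository.

```python
import math


def model(x):
    """Evaluate y = sin(x**2) + x as a sequence of elementary instructions."""
    u = x * x        # multiplication
    v = math.sin(u)  # intrinsic function
    y = v + x        # addition
    return y


def model_d(x, x_d):
    """Evaluate y and its derivative y_d, one elementary instruction at a time."""
    u = x * x
    u_d = 2.0 * x * x_d      # derivative of the multiplication (product rule)
    v = math.sin(u)
    v_d = math.cos(u) * u_d  # derivative of sin, chained with u_d
    y = v + x
    y_d = v_d + x_d          # derivative of the addition
    return y, y_d


x = 1.5
y, y_d = model_d(x, 1.0)
exact = 2.0 * x * math.cos(x * x) + 1.0  # derivative worked out by hand
print(y_d, exact)
```

Each derivative line uses only the known derivative of one elementary operation, which is what makes the process mechanical.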

#### Recap on A-level maths: the Chain Rule

Consider two composable, differentiable (mathematical) functions, $f$ and $g$, with composition $h=f\circ g$. By definition, this means
$$h(x)=(f\circ g)(x)=f(g(x)).$$

Then the *chain rule* states that the derivative of $h$ may be computed in terms of the derivatives of $f$ and $g$ using the formula
$$h'(x)=(f\circ g)'(x)=(f'\circ g)(x)\,g'(x)=f'(g(x))\,g'(x).$$

## Example

Consider two functions acting on real numbers, $f(x,y)=xy$ and $g(z)=(\sin(z),\cos(z))$. Here $f:\mathbb{R}^2\rightarrow\mathbb{R}$ takes two inputs and returns a single output, while $g:\mathbb{R}\rightarrow\mathbb{R}^2$ takes a single input and returns two outputs.

*Exercise*: Convince yourself that these functions may be composed in either order.

Consider the composition $h=g\circ f:\mathbb{R}^2\rightarrow\mathbb{R}^2$. Then we have
$$h(x,y)=(g\circ f)(x,y)=g(f(x,y))=(\sin(xy),\cos(xy)).$$

The derivatives of the component functions are
$$
\frac{\partial f}{\partial x}=y,
\quad\frac{\partial f}{\partial y}=x,
\quad\frac{\mathrm{d}g}{\mathrm{d}z}=(\cos(z),-\sin(z)).
$$

#### Exercise

Use the chain rule to work out the derivatives of each of the outputs with respect to each of the inputs, i.e.,
$$
\frac{\partial h_1}{\partial x},
\quad\frac{\partial h_1}{\partial y},
\quad\frac{\partial h_2}{\partial x},
\quad\frac{\partial h_2}{\partial y},
$$
where $h(x,y)=(h_1(x,y),h_2(x,y))$.

<details>
$$
\frac{\partial h_1}{\partial x}=y\cos(z)=y\cos(xy),
\quad\frac{\partial h_1}{\partial y}=x\cos(z)=x\cos(xy),
\quad\frac{\partial h_2}{\partial x}=-y\sin(z)=-y\sin(xy),
\quad\frac{\partial h_2}{\partial y}=-x\sin(z)=-x\sin(xy).
$$  
</details>

## Directional derivative, a.k.a. Jacobian-vector product (JVP)

Consider a vector-valued function $\mathbf{f}$ mapping from a subset $A\subseteq\mathbb{R}^n$ into $\mathbb{R}^m$, for some $m,n\in\mathbb{N}$:
$$\mathbf{f}:A\rightarrow\mathbb R^m.$$

Given input $\mathbf{x}\in A$ and a *seed vector* $\dot{\mathbf{x}}\in\mathbb{R}^n$, forward mode AD allows us to compute the *action* (matrix-vector product)
$$\dot{\mathbf{y}}:=\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{x}}.$$

Here $\nabla\mathbf{f}$ is referred to as the *Jacobian* for the map, so the above is known as a *Jacobian-vector product*.

*Note*: the computation is *matrix-free*. The product is evaluated without ever assembling the Jacobian matrix itself.
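
To make the matrix-free point concrete, here is a small sketch using a map that is not part of the course material: the polar-to-Cartesian transformation $(r,\theta)\mapsto(r\cos\theta,r\sin\theta)$. All function names are hypothetical. The first routine assembles the Jacobian and multiplies by the seed, the second propagates the seed through the operations directly, and the third approximates the directional derivative with a one-sided finite difference as a sanity check.

```python
import math


def polar_to_cartesian(r, theta):
    """Hypothetical example map F(r, theta) = (r cos(theta), r sin(theta))."""
    return (r * math.cos(theta), r * math.sin(theta))


def jvp_via_jacobian(r, theta, r_dot, theta_dot):
    """Assemble the 2x2 Jacobian explicitly, then multiply by the seed."""
    jac = [
        [math.cos(theta), -r * math.sin(theta)],
        [math.sin(theta), r * math.cos(theta)],
    ]
    seed = (r_dot, theta_dot)
    return tuple(sum(jac[i][j] * seed[j] for j in range(2)) for i in range(2))


def jvp_matrix_free(r, theta, r_dot, theta_dot):
    """Propagate the seed through each operation; no matrix is ever stored."""
    c, s = math.cos(theta), math.sin(theta)
    c_dot = -s * theta_dot
    s_dot = c * theta_dot
    return (r_dot * c + r * c_dot, r_dot * s + r * s_dot)


def directional_fd(r, theta, r_dot, theta_dot, eps=1e-6):
    """One-sided finite-difference approximation of the directional derivative."""
    y0 = polar_to_cartesian(r, theta)
    y1 = polar_to_cartesian(r + eps * r_dot, theta + eps * theta_dot)
    return tuple((b - a) / eps for a, b in zip(y0, y1))


args = (2.0, 0.3, 1.0, -0.5)
print(jvp_via_jacobian(*args))
print(jvp_matrix_free(*args))
print(directional_fd(*args))
```

The first two results should agree to rounding error, while the finite-difference value only agrees to roughly the size of `eps`.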

[TODO: DAG interpretation]

## Approach 1: Source transformation

High level idea: Given some (code) function `f(x)`, generate the code for function `f_d(x, x_d)`.

Often the difficult part is then hooking the differentiated code into the wider model/build system.

#### Source transformation tools:

* [Tapenade](https://tapenade.gitlabpages.inria.fr/userdoc/build/html/index.html) (C, Fortran, Julia*)
* [TAF](http://fastopt.com/products/taf) (Fortran) [commercial]
* [PSyAD](https://psyclone-adjoint.readthedocs.io/en/stable) (domain-specific)

*\*Work in progress*

## Exercise

Below we have the Fortran code for the example functions above, written as subroutines. You can find these in the repository at `session1/exercise1/subroutine1.f90` and `session1/exercise1/subroutine2.f90`, respectively.

```fortran
subroutine f(x, y, z)
  implicit none
  real, intent(in)  :: x, y
  real, intent(out) :: z
  z = x * y
end subroutine f
```

```fortran
subroutine g(z, v)
  implicit none
  real, intent(in)  :: z
  real, intent(out), dimension(2) :: v
  v = [sin(z), cos(z)]
end subroutine g
```

1. Either [install Tapenade](https://tapenade.gitlabpages.inria.fr/tapenade/docs/html/distrib/README-install.html)\* or visit the [Tapenade web interface](http://tapenade.inria.fr:8080/tapenade/index.jsp).
2. Apply Tapenade to each of these subroutines using its default setting, which will apply forward mode to compute the JVP for some seed vector.

\*Note that you will need to install Java if you don't already have it installed.

<details>

Running
```sh
cd session1/exercise1
tapenade subroutine1.f90
```
gives
```fortran
!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 23 Apr 2025 13:39
!
!  Differentiation of f in forward (tangent) mode:
!   variations   of useful results: z
!   with respect to varying inputs: x y
!   RW status of diff variables: x:in y:in z:out
SUBROUTINE F_D(x, xd, y, yd, z, zd)
  IMPLICIT NONE
  REAL, INTENT(IN) :: x, y
  REAL, INTENT(IN) :: xd, yd
  REAL, INTENT(OUT) :: z
  REAL, INTENT(OUT) :: zd
  zd = y*xd + x*yd
  z = x*y
END SUBROUTINE F_D
```

Running
```sh
cd session1/exercise1
tapenade subroutine2.f90
```
gives
```fortran
!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 23 Apr 2025 13:39
!
!  Differentiation of g in forward (tangent) mode:
!   variations   of useful results: v
!   with respect to varying inputs: z
!   RW status of diff variables: v:out z:in
SUBROUTINE G_D(z, zd, v, vd)
  IMPLICIT NONE
  REAL, INTENT(IN) :: z
  REAL, INTENT(IN) :: zd
  REAL, DIMENSION(2), INTENT(OUT) :: v
  REAL, DIMENSION(2), INTENT(OUT) :: vd
  INTRINSIC COS
  INTRINSIC SIN
  vd = (/COS(z)*zd, -(SIN(z)*zd)/)
  v = (/SIN(z), COS(z)/)
END SUBROUTINE G_D
```
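
The two generated subroutines can be chained by hand to differentiate the composite map $(x,y)\mapsto g(f(x,y))$. The sketch below is a Python transcription written for illustration; it is not Tapenade output, and the wrapper `h_d` is a hypothetical name. Outputs are returned rather than passed as `INTENT(OUT)` arguments.

```python
import math


def f_d(x, xd, y, yd):
    """Tangent of f(x, y) = x * y; mirrors the statements in F_D."""
    zd = y * xd + x * yd
    z = x * y
    return z, zd


def g_d(z, zd):
    """Tangent of g(z) = (sin(z), cos(z)); mirrors the statements in G_D."""
    vd = (math.cos(z) * zd, -(math.sin(z) * zd))
    v = (math.sin(z), math.cos(z))
    return v, vd


def h_d(x, xd, y, yd):
    """Tangent of the composition: feed the outputs of f_d straight into g_d."""
    z, zd = f_d(x, xd, y, yd)
    return g_d(z, zd)


# The seed (xd, yd) = (1, 0) picks out the derivatives with respect to x.
v, vd = h_d(2.0, 1.0, 3.0, 0.0)
print(v, vd)
```

Passing `zd` from one tangent routine to the next is the chain rule in action: no derivative of the composite map is ever written down by hand.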

</details>

## Approach 2: Operator overloading

[TODO]
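
Pending the full write-up, here is a minimal sketch of how operator overloading can deliver forward mode, assuming the common dual-number formulation: each value carries a derivative component, and arithmetic operators are overloaded to update both. The class `Dual` and the helper functions `sin`, `cos` and `h` are hypothetical names and are not taken from any of the tools listed below.

```python
import math


class Dual:
    """Value paired with a derivative component (hypothetical minimal class)."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants with zero derivative.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def sin(a):
    a = Dual._lift(a)
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)


def cos(a):
    a = Dual._lift(a)
    return Dual(math.cos(a.val), -math.sin(a.val) * a.dot)


def h(x, y):
    """Model code written as usual; it only ever sees overloaded operations."""
    z = x * y
    return sin(z), cos(z)


# Seeding x with dot=1 and y with dot=0 yields derivatives with respect to x.
h1, h2 = h(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(h1.dot, h2.dot)
```

In contrast to source transformation, no new code is generated: the model `h` is unchanged and the derivative falls out of the types passed in.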

#### Operator overloading tools:

* LLVM
    * [Enzyme](https://enzyme.mit.edu) <!-- is a plugin that performs automatic differentiation (AD) of statically analyzable LLVM. By operating on the LLVM level Enzyme is able to perform AD across a variety of languages (C/C++, Fortran , Julia, etc.) and perform optimization prior to AD -->
* C/C++
    * About 2 dozen AD tools!
    * e.g., [ADIC](https://www.mcs.anl.gov/research/projects/adic), [ADOL-C](https://github.com/coin-or/ADOL-C), [Torch Autograd](https://pytorch.org/tutorials/advanced/cpp_autograd.html), [CoDiPack](https://github.com/SciCompKL/CoDiPack), [Sacado](https://docs.trilinos.org/dev/packages/sacado/doc/html/index.html), [dco/c++](https://nag.com/automatic-differentiation) [commercial]
* Fortran
    * [Differentia](https://github.com/Nicholaswogan/Differentia), [lots of abandonware...]
* Python
    * [PyADOL-C](https://github.com/b45ch1/pyadolc), [JAX](https://github.com/jax-ml/jax), [PyTorch Autograd](https://pytorch.org/docs/stable/autograd.html)
* Julia
    * About 2 dozen AD tools! https://juliadiff.org/
    * e.g., Enzyme, [Zygote](https://fluxml.ai/Zygote.jl/stable), [ForwardDiff](https://juliadiff.org/ForwardDiff.jl/stable)
    * [DifferentiationInterface](https://www.juliapackages.com/p/differentiationinterface)
* Domain-specific
    * [dolfin-adjoint/pyadjoint](https://github.com/dolfin-adjoint/pyadjoint) (Python/UFL - Firedrake & FEniCS)
* And many more! https://autodiff.org/?module=Tools

## Exercise: ODE-constrained optimisation

[TODO: Single scalar input -> suitable for forward mode. Perhaps optimisation of theta-method timestepping scheme.]

## Validation: the Taylor test

[TODO]
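
Until this section is written, the following sketch shows one common formulation of a Taylor test, on the assumption that this is the test intended. If $\dot{y}=\nabla F(x)\dot{x}$ is computed correctly then the remainder $\|F(x+h\dot{x})-F(x)-h\dot{y}\|$ should shrink at second order in $h$, so halving $h$ should reduce it by roughly a factor of four. The map `F`, its hand-written tangent `F_d` and the helper `taylor_remainder` are hypothetical examples.

```python
import math


def F(x):
    """Hypothetical test map F(r, theta) = (r cos(theta), r sin(theta))."""
    r, theta = x
    return (r * math.cos(theta), r * math.sin(theta))


def F_d(x, x_d):
    """Hand-written tangent of F, i.e. the derivative code under test."""
    r, theta = x
    r_d, theta_d = x_d
    c, s = math.cos(theta), math.sin(theta)
    return (r_d * c - r * s * theta_d, r_d * s + r * c * theta_d)


def taylor_remainder(x, x_d, h):
    """Euclidean norm of F(x + h x_d) - F(x) - h F_d(x, x_d)."""
    x_pert = tuple(xi + h * di for xi, di in zip(x, x_d))
    y0, y1, y_d = F(x), F(x_pert), F_d(x, x_d)
    return math.sqrt(sum((b - a - h * d) ** 2 for a, b, d in zip(y0, y1, y_d)))


x, x_d = (2.0, 0.3), (1.0, -0.5)
steps = [1e-2 / 2**k for k in range(4)]
remainders = [taylor_remainder(x, x_d, h) for h in steps]
# Observed convergence rate between successive step sizes; expect about 2.
rates = [math.log2(remainders[k] / remainders[k + 1]) for k in range(3)]
print(rates)
```

A rate close to 1 instead of 2 would indicate that the derivative code is wrong, since the remainder would then be dominated by a first-order error term.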

## Question

Given a map $F$, some input $x$, and some seed $\dot{x}$, we have the Jacobian-vector product
$$\dot{y}:=\nabla F(x)\dot{x}.$$

*How can we use this to compute the full Jacobian matrix $\nabla F(x)$?*

<details>

$$\nabla F(x)=\nabla F(x)I_n=\nabla F(x)\begin{bmatrix}e_1,e_2,\dots,e_n\end{bmatrix}.$$

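Apply the JVP to each of the $n$ canonical unit vectors $e_1,\dots,e_n$ in turn: the product $\nabla F(x)e_i$ is the $i$-th column of the Jacobian.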
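
A minimal sketch of this procedure, using a hand-written JVP for the map $(x_1,x_2)\mapsto(\sin(x_1x_2),\cos(x_1x_2))$ from the earlier example. The names `jvp` and `full_jacobian` are hypothetical.

```python
import math


def jvp(x, x_dot):
    """Hand-written JVP for the map (x1, x2) -> (sin(x1*x2), cos(x1*x2))."""
    z = x[0] * x[1]
    z_dot = x[1] * x_dot[0] + x[0] * x_dot[1]
    return [math.cos(z) * z_dot, -math.sin(z) * z_dot]


def full_jacobian(jvp_fn, x):
    """Build the Jacobian column by column from n JVPs with unit seeds."""
    n = len(x)
    columns = []
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        columns.append(jvp_fn(x, e_i))  # i-th column of the Jacobian
    m = len(columns[0])
    # Transpose the list of columns into a row-major m x n matrix.
    return [[columns[j][i] for j in range(n)] for i in range(m)]


jac = full_jacobian(jvp, [2.0, 3.0])
for row in jac:
    print(row)
```

The cost is $n$ JVP evaluations, one per input, which is why forward mode suits models with few inputs.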

</details>


[TODO: Exercise in notebook]

## Sparse AD

We can compute the full Jacobian with
$$\nabla F(x)=\nabla F(x)I_n=\nabla F(x)\begin{bmatrix}e_1,e_2,\dots,e_n\end{bmatrix}.$$

* But what about when $n$ gets very large?
* And what about when the Jacobian is sparse?

#### What colour is your Jacobian?

* [TODO: orthogonal columns simple demo]
* [TODO: Colouring, nice image]
* [TODO: exercise?]
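
As a placeholder for the demos above, here is a minimal sketch of the idea behind colouring. It assumes a hypothetical map with a bidiagonal Jacobian: $F_i(x)=x_ix_{i+1}$ for $i<n-1$ and $F_{n-1}(x)=x_{n-1}^2$ (0-indexed). Column $j$ can only be nonzero in rows $j-1$ and $j$, so columns $j$ and $j+2$ never share a row. All even columns can therefore share one seed vector and all odd columns another, and two JVPs recover the whole Jacobian regardless of $n$. The function names are illustrative only.

```python
def jvp(x, x_dot):
    """Hand-written JVP of F_i = x_i * x_{i+1} (i < n-1), F_{n-1} = x_{n-1}**2."""
    n = len(x)
    y_dot = [x_dot[i] * x[i + 1] + x[i] * x_dot[i + 1] for i in range(n - 1)]
    y_dot.append(2.0 * x[n - 1] * x_dot[n - 1])
    return y_dot


def sparsity_rows(j, n):
    """Rows in which column j can be nonzero (structure assumed known)."""
    return [i for i in (j - 1, j) if 0 <= i < n]


def coloured_jacobian(x):
    """Recover the bidiagonal Jacobian from two JVPs instead of n."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for colour in (0, 1):
        # The seed is the sum of the unit vectors for all columns of this colour.
        seed = [1.0 if j % 2 == colour else 0.0 for j in range(n)]
        compressed = jvp(x, seed)
        # Each row of the result belongs to at most one column of this colour.
        for j in range(colour, n, 2):
            for i in sparsity_rows(j, n):
                jac[i][j] = compressed[i]
    return jac


x = [1.0, 2.0, 3.0, 4.0, 5.0]
for row in coloured_jacobian(x):
    print(row)
```

The sparsity pattern has to be known in advance for this to work; finding a good grouping of columns for a general pattern is the graph colouring problem that gives the technique its name.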

## Levels of abstraction

* Low-level: elementary operators, e.g., Tapenade, ADIC, ADOL-C.
* Medium-level: API calls, e.g., AD in PETSc.
* High-level: high-level maths, e.g., Pyadjoint/dolfin-adjoint.

## References

* R. E. Wengert. *A simple automatic derivative evaluation program.* Communications of the ACM, 7(8):463–464, 1964.
* A. Griewank. *Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation.* Optimization Methods & Software, 1:35–54, 1992.
* D. Cortild, et al. *A Brief Review of Automatic Differentiation.* (2023).