To use a gradient-based optimization method, as seen previously,
we need to be able to compute the **gradient** of the function.

We don't want to write out the gradient expression by hand each time we define a new function:
- it is hard work,
- it is highly error-prone,
- it is redundant (the function definition already "contains" its gradient expression).

**Automatic differentiation** algorithms recover, from the implementation of a function, its gradient with respect to any of its input parameters at a given evaluation point.

E.g., given a function $f: \mathbb{R}^N \to \mathbb{R}^M$ and a point $x \in \mathbb{R}^N$, such an algorithm should be able to return $Df(x)$.

# Finite difference schemes

Finite difference schemes can be seen as a first example of automatic differentiation.

For a given function $f: \mathbb{R}^N \to \mathbb{R}$, we can approximate its partial derivatives with a finite difference scheme, such as:

$$\forall i = 1, ..., N \quad \frac{\partial f}{\partial x_i}(x_1, ..., x_N) = \frac{f(x_1, ..., x_i+h, ..., x_N) - f(x_1, ..., x_i, ..., x_N)}{h} + \mathcal{O}(h)$$

for the **1st-order** version and:

$$\forall i = 1, ..., N \quad \frac{\partial f}{\partial x_i}(x_1, ..., x_N) = \frac{f(x_1, ..., x_i+h, ..., x_N) - f(x_1, ..., x_i-h, ..., x_N)}{2h} + \mathcal{O}(h^2)$$

for the **2nd-order** centered version.

**Pro:**
- no need to know how $f$ is computed (black box).

**Cons:**
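- $N+1$ evaluations of $f$ for the 1st-order scheme and $2N$ for the 2nd-order version,
- how do we choose $h$? The constant hidden in the big-O term depends on higher-order derivatives of $f$ (the second derivative for the 1st-order scheme, the third for the centered one)...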
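
The two schemes above can be sketched for a general $f: \mathbb{R}^N \to \mathbb{R}$ as follows. This is only a minimal illustration: the names `fd_gradient` and `quadratic` and the default step `h=1e-6` are assumptions chosen for the example, not part of the notebook.

```python
def fd_gradient(f, x, h=1e-6, centered=True):
    """Approximate the gradient of f: R^N -> R at x (a list of floats)."""
    x = list(x)
    grad = []
    if not centered:
        fx = f(x)  # shared by all components: N + 1 evaluations in total
    for i in range(len(x)):
        xp = x.copy()
        xp[i] += h
        if centered:
            xm = x.copy()
            xm[i] -= h
            # 2nd-order centered scheme: 2 evaluations per component, 2N in total
            grad.append((f(xp) - f(xm)) / (2 * h))
        else:
            # 1st-order scheme: 1 extra evaluation per component
            grad.append((f(xp) - fx) / h)
    return grad


# Hypothetical test function whose gradient is (2*x0 + 3*x1, 3*x0)
def quadratic(x):
    return x[0]**2 + 3 * x[0] * x[1]


print(fd_gradient(quadratic, [1.0, 2.0]))
print(fd_gradient(quadratic, [1.0, 2.0], centered=False))
```

Both calls should return values close to the exact gradient $(8, 3)$ at the point $(1, 2)$, the centered version being the more accurate of the two.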

## Example: a dummy function

To illustrate the automatic differentiation process, we use the following function $f : \mathbb{R}^2 \to \mathbb{R}$ defined by:

$$x, y \longmapsto \frac{x + \max(x, y)}{y + (x+y)^2}$$

In [1]:
def f(x, y):
    return (x + max(x, y)) / (y + (x + y)**2)

x = 2
y = 1

print(f"f({x}, {y}) = {f(x, y)}")

f(2, 1) = 0.4


The 2nd-order centered finite difference scheme approximates its partial derivatives as:

$$\frac{\partial f}{\partial x}(x, y) = \frac{f(x+h, y) - f(x-h, y)}{2h} + \mathcal{O}(h^2)$$
$$\frac{\partial f}{\partial y}(x, y) = \frac{f(x, y+h) - f(x, y-h)}{2h} + \mathcal{O}(h^2)$$

In [2]:
for p in range(16):
    h = 10**(-p)
    dfdx = (f(x+h, y) - f(x-h, y))/(2*h)
    dfdy = (f(x, y+h) - f(x, y-h))/(2*h)
    print(f"h = {h:6} ; dfdx = {dfdx:.16} ; dfdy = {dfdy:.16}")

h =      1 ; dfdx = -0.02352941176470588 ; dfdy = -0.3888888888888889
h =    0.1 ; dfdx = -0.03986374212365584 ; dfdy = -0.2808140800179723
h =   0.01 ; dfdx = -0.03999863997423692 ; dfdy = -0.2800081202074811
h =  0.001 ; dfdx = -0.03999998640000224 ; dfdy = -0.2800000811999936
h = 0.0001 ; dfdx = -0.03999999986348257 ; dfdy = -0.2800000008112979
h =  1e-05 ; dfdx = -0.03999999999837467 ; dfdy = -0.2800000000080516
h =  1e-06 ; dfdx = -0.04000000003445692 ; dfdy = -0.2800000000191538
h =  1e-07 ; dfdx = -0.03999999997894577 ; dfdy = -0.2800000001301761
h =  1e-08 ; dfdx = -0.04000000053405728 ; dfdy = -0.2799999981872858
h =  1e-09 ; dfdx = -0.04000000330961484 ; dfdy = -0.2800000231673039
h =  1e-10 ; dfdx = -0.03999994779846361 ; dfdy = -0.2800001897007576
h =  1e-11 ; dfdx = -0.03999856001968283 ; dfdy = -0.2799954712529029
h =  1e-12 ; dfdx = -0.03999578446212126 ; dfdy = -0.2800260023860801
h =  1e-13 ; dfdx = -0.03969047313034935 ; dfdy = -0.2792210906932269
h =  1e-14 ; dfdx = 

**Remark:** the finite difference approximation loses precision for small steps (from about `1e-6` downward): the two function values become nearly equal, and their subtraction suffers from catastrophic cancellation.
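
A common rule of thumb, which is not taken from this notebook and assumes that $f$ and its derivatives are of order one, balances the truncation error against the rounding error: $h \approx \varepsilon^{1/2}$ for the 1st-order scheme and $h \approx \varepsilon^{1/3}$ for the centered one, where $\varepsilon$ is the machine epsilon. A minimal sketch on the dummy function:

```python
import sys


def f(x, y):
    return (x + max(x, y)) / (y + (x + y)**2)


eps = sys.float_info.epsilon   # about 2.2e-16 in double precision
h_forward = eps ** (1 / 2)     # rule of thumb for the 1st-order scheme
h_centered = eps ** (1 / 3)    # rule of thumb for the 2nd-order centered scheme

x, y = 2, 1
dfdx = (f(x + h_centered, y) - f(x - h_centered, y)) / (2 * h_centered)
dfdy = (f(x, y + h_centered) - f(x, y - h_centered)) / (2 * h_centered)
print(f"h = {h_centered:.2e} ; dfdx = {dfdx:.12f} ; dfdy = {dfdy:.12f}")
```

The centered step obtained this way is close to `1e-5`, which is consistent with the most accurate rows of the table above.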

# The chain rule

Given the functions:
- $f: \mathbb{R}^C \to \mathbb{R}^D$,
- $g: \mathbb{R}^B \to \mathbb{R}^C$,
- $h: \mathbb{R}^A \to \mathbb{R}^B$,

we want to compute the derivative of $f \circ g \circ h$ at a given point $x \in \mathbb{R}^A$.

Using the **chain rule**, we get the following expression of the Jacobians $J$:

$$J_{f \circ g \circ h}(x) = J_f(g(h(x))) \cdot J_g(h(x)) \cdot J_h(x)$$

where
- $J_f (g(h(x))) \in \mathcal{M}_{D, C}$,
- $J_g (h(x)) \in \mathcal{M}_{C, B}$ and
- $J_h (x) \in \mathcal{M}_{B, A}$.


In **which order** should we evaluate this matrix product?

1. From left to right:
    $$J_{f \circ g \circ h}(x) = \underbrace{\underbrace{\underbrace{J_f(g(h(x)))}_1 \cdot J_g(h(x))}_2 \cdot J_h(x)}_3$$

2. From right to left:
    $$J_{f \circ g \circ h}(x) = \underbrace{J_f(g(h(x))) \cdot \underbrace{J_g(h(x)) \cdot \underbrace{J_h(x)}_1}_2}_3$$

## Function evaluation: the forward way

We call the **forward** way the order of evaluation used when computing $f(g(h(x)))$:

1. $x_1 = x$,
2. $x_2 = h(x_1)$,
3. $x_3 = g(x_2)$,
4. $x_4 = f(x_3)$.

## Gradient evaluation: forward accumulation

We follow the same order to evaluate the Jacobian (the product is taken from right to left):
1. $M_1 = J_h(x_1) \quad\quad\quad \in \mathcal{M}_{B, A}$,
2. $M_2 = J_g(x_2) \cdot M_1 \quad \in \mathcal{M}_{C, A}$,
3. $M_3 = J_f(x_3) \cdot M_2 \quad \in \mathcal{M}_{D, A}$.

$J_{f \circ g \circ h}(x) = M_3$
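
One common way to realise forward accumulation in code, which is not described in this notebook, is to carry a derivative alongside each value ("dual numbers"). Each pass propagates the derivative along one input direction, i.e. one column of the Jacobian. A minimal sketch on the dummy function introduced earlier, where the names `Dual`, `sqr` and `dmax` are hypothetical helpers:

```python
class Dual:
    """A value together with its derivative along one input direction."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __truediv__(self, other):
        # d(a/b) = da / b - a * db / b^2
        return Dual(self.val / other.val,
                    self.dot / other.val - self.val * other.dot / other.val**2)


def sqr(a):
    return Dual(a.val**2, 2 * a.val * a.dot)


def dmax(a, b):
    # the derivative follows the selected argument
    return a if a.val >= b.val else b


def f(x, y):
    return (x + dmax(x, y)) / (y + sqr(x + y))


# One forward pass per input direction
dfdx = f(Dual(2.0, 1.0), Dual(1.0, 0.0)).dot
dfdy = f(Dual(2.0, 0.0), Dual(1.0, 1.0)).dot
print(f"dfdx = {dfdx} ; dfdy = {dfdy}")
```

With $N$ inputs this approach needs $N$ passes, which is why it becomes expensive when the input dimension is large.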

## Gradient evaluation: reverse accumulation
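Also known as **backpropagation**: the Jacobian is evaluated in the opposite order, starting from the output (the product is taken from left to right).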

1. $M_1 = J_f(x_3) \quad\quad\quad \in \mathcal{M}_{D, C}$,
2. $M_2 = M_1 \cdot J_g(x_2) \quad \in \mathcal{M}_{D, B}$,
3. $M_3 = M_2 \cdot J_h(x_1) \quad \in \mathcal{M}_{D, A}$.

$J_{f \circ g \circ h}(x) = M_3$
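
Since the matrix product is associative, both accumulation orders return the same Jacobian; only the intermediate matrices differ. A minimal check in pure Python, with made-up integer Jacobians for $A = 2$, $B = 3$, $C = 2$, $D = 1$ (the values are purely illustrative):

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


# Hypothetical Jacobians evaluated at x_1, x_2 and x_3
J_h = [[1, 2], [0, 1], [3, -1]]   # B x A
J_g = [[2, 0, 1], [-1, 1, 0]]     # C x B
J_f = [[1, 4]]                    # D x C

forward = matmul(J_f, matmul(J_g, J_h))   # forward accumulation: J_h first
reverse = matmul(matmul(J_f, J_g), J_h)   # reverse accumulation: J_f first

print(forward)  # [[1, -1]]
print(reverse)  # [[1, -1]]
```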

## And for a loss function?

In a machine learning process, we want to differentiate a **loss function** (a function $\mathbb{R}^N \to \mathbb{R}$): here $D = 1$, while the other dimensions can be very large.

The **forward accumulation** then reads:
1. $M_1 = J_h(x_1) \quad\quad\quad \in \mathcal{M}_{B, A}$,
2. $M_2 = J_g(x_2) \cdot M_1 \quad \in \mathcal{M}_{C, A}$,
3. $M_3 = J_f(x_3) \cdot M_2 \quad \in \mathcal{M}_{1, A}$.

The **reverse accumulation** then reads:
1. $M_1 = J_f(x_3) \quad\quad\quad \in \mathcal{M}_{1, C}$,
2. $M_2 = M_1 \cdot J_g(x_2) \quad \in \mathcal{M}_{1, B}$,
3. $M_3 = M_2 \cdot J_h(x_1) \quad \in \mathcal{M}_{1, A}$.

Assuming we can compute the product with a Jacobian without storing the Jacobian matrix in memory, we get the following **complexities**:
- **forward** accumulation: about $\mathcal{O}(AB + ABC + AC)$ in computational cost and $\mathcal{O}(\max(AB, AC, A))$ in memory cost.
- **reverse** accumulation: about $\mathcal{O}(C + BC + AB)$ in computational cost and $\mathcal{O}(\max(A, B, C))$ in memory cost.

$\Longrightarrow$ **reverse accumulation** (**backpropagation**)!!!
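
To get a feeling for the gap, the sketch below counts the scalar multiplications of the two matrix products in each order. It ignores the cost of evaluating the Jacobians themselves, and the dimensions are arbitrary values chosen for illustration.

```python
def forward_cost(A, B, C, D=1):
    # J_g (C x B) times M_1 (B x A), then J_f (D x C) times M_2 (C x A)
    return C * B * A + D * C * A


def reverse_cost(A, B, C, D=1):
    # M_1 (D x C) times J_g (C x B), then M_2 (D x B) times J_h (B x A)
    return D * C * B + D * B * A


A, B, C = 1000, 500, 200   # hypothetical layer sizes
print(forward_cost(A, B, C))   # 100200000
print(reverse_cost(A, B, C))   # 600000
```

With $D = 1$, the reverse order only ever multiplies a row vector by a matrix, so the cubic term $ABC$ never appears.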

## Example: the dummy function

We recall our example function $f : \mathbb{R}^2 \to \mathbb{R}$ defined by:

$$x, y \longmapsto \frac{x + \max(x, y)}{y + (x+y)^2}$$

We first rewrite it as a composition of basic functions:

$$ f(x, y) = \mathrm{div}(\mathrm{add}(x, \mathrm{max}(x, y)), \mathrm{add}(y, \mathrm{sqr}(\mathrm{add}(x, y)))) $$

with the following function expressions and derivatives:
- $\mathrm{add}(a, b) = a + b$ with derivatives $\partial_a \mathrm{add}(a,b) = \partial_b \mathrm{add}(a,b) = 1$,
- $\mathrm{sqr}(a) = a^2$ with derivative $\partial_a \mathrm{sqr}(a) = 2a$,
- $\mathrm{div}(a, b) = \frac{a}{b}$ with derivatives $\partial_a \mathrm{div}(a,b) = \frac{1}{b}$ and $\partial_b \mathrm{div}(a,b) = - \frac{a}{b^2}$,
- $\mathrm{max}(a, b)$ with derivatives $\partial_a \mathrm{max}(a, b) = \mathbb{1}(a \geq b)$ and $\partial_b \mathrm{max}(a, b) = \mathbb{1}(b \geq a)$ (a convention at the tie $a = b$, where $\mathrm{max}$ is not differentiable).

### Forward pass

We first evaluate the function during the forward pass with $x = 2$ and $y = 1$:

1. $a_1 = x = 2$,  $b_1 = y = 1$, $v_1 = \mathrm{max}(a_1, b_1) = 2$
2. $a_2 = x = 2$, $b_2 = v_1 = 2$, $v_2 = \mathrm{add}(a_2, b_2) = 4$
3. $a_3 = x = 2$, $b_3 = y = 1$, $v_3 = \mathrm{add}(a_3, b_3) = 3$
4. $a_4 = v_3 = 3$, $v_4 = \mathrm{sqr}(a_4) = 9$
5. $a_5 = y = 1$, $b_5 = v_4 = 9$, $v_5 = \mathrm{add}(a_5, b_5) = 10$
6. $a_6 = v_2 = 4$, $b_6 = v_5 = 10$, $v_6 = \mathrm{div}(a_6, b_6) = 0.4$

Thus, we recover the result $f(2,1) = 0.4$.

### Backward pass

Now, we roll back the previous forward pass to compute the derivatives of $f$ at $x=2$ and $y=1$. Here $dv_i$ denotes the derivative of the output with respect to $v_i$ (with $dv_6 = 1$), and $da_i$, $db_i$ the derivatives with respect to the inputs of step $i$:

1. Step 6: $da_6 = \partial_a \mathrm{div}(a_6, b_6) \times dv_6 = \frac{1}{b_6} = 0.1$ and
   $db_6 = \partial_b \mathrm{div}(a_6, b_6) \times dv_6 = - \frac{a_6}{b_6^2} = -0.04$
2. Step 5: $da_5 = 1 \times dv_5 = db_6 = -0.04$ and
   $db_5 = 1 \times dv_5 = db_6 = -0.04$
3. Step 4: $da_4 = 2 a_4 \times dv_4 = 2 a_4 db_5 = -0.24$
4. Step 3: $da_3 = 1 \times dv_3 = da_4 = -0.24$ and
   $db_3 = 1 \times dv_3 = da_4 = -0.24$
5. Step 2: $da_2 = 1 \times dv_2 = da_6 = 0.1$ and
   $db_2 = 1 \times dv_2 = da_6 = 0.1$
6. Step 1: $da_1 = \mathbb{1}(a_1 \geq b_1) \times dv_1 = db_2 = 0.1$ and
   $db_1 = \mathbb{1}(b_1 \geq a_1) \times dv_1 = 0$

Finally, we get the derivative of $f$ with respect to $x$ and $y$ by summing the contributions to $dx$ and $dy$:

$$\frac{\partial f}{\partial x}(2, 1) = da_1 + da_2 + da_3 = 0.1 + 0.1 - 0.24 = -0.04$$

$$\frac{\partial f}{\partial y}(2, 1) = db_1 + db_3 + da_5 = 0 - 0.24 - 0.04 = -0.28$$

In [None]:
def f(x, y):
    return (x + max(x, y)) / (y + (x + y)**2)

x = 2
y = 1

# Forward pass
a1 = x; b1 = y;  v1 = max(a1, b1)
a2 = x; b2 = v1; v2 = a2 + b2
a3 = x; b3 = y; v3 = a3 + b3
a4 = v3; v4 = a4**2
a5 = y; b5 = v4; v5 = a5 + b5
a6 = v2; b6 = v5; v6 = a6 / b6
print(f"f({x}, {y}) = {v6}")

# Backward pass
da6 = 1 / b6; db6 = -a6 / b6**2
da5 = db6; db5 = db6
da4 = 2 * a4 * db5
da3 = da4; db3 = da4
da2 = da6; db2 = da6
da1 = (a1 >= b1) * db2; db1 = (b1 >= a1) * db2 # Booleans are convertible to int (False <=> 0, True <=> 1)
dfdx = da1 + da2 + da3; dfdy = db1 + db3 + da5
print(f"Df({x}, {y}) = ({dfdx}, {dfdy})")

# Gradient checking using finite differences
def grad_check(f, x, y, h):
    return ((f(x+h, y) - f(x-h, y)) / (2*h), 
            (f(x, y+h) - f(x, y-h)) / (2*h))
    
dfdx_fd2, dfdy_fd2 = grad_check(f, x, y, 1e-8)
print(f"Grad check: Df({x}, {y}) = ({dfdx_fd2}, {dfdy_fd2})")
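
The hand-written forward and backward passes above can be automated by recording, for each intermediate value, its inputs and the local partial derivatives. The sketch below is one possible minimal design and not a reference implementation: the names `Var`, `sqr`, `vmax` and `backward` are hypothetical helpers, and only the four basic functions of the dummy example are supported.

```python
class Var:
    """A scalar node of the computation graph."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (parent node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __truediv__(self, other):
        return Var(self.value / other.value,
                   ((self, 1.0 / other.value),
                    (other, -self.value / other.value**2)))


def sqr(a):
    return Var(a.value**2, ((a, 2.0 * a.value),))


def vmax(a, b):
    # same tie convention as in the text: both indicators use >=
    return Var(max(a.value, b.value),
               ((a, float(a.value >= b.value)),
                (b, float(b.value >= a.value))))


def backward(output):
    """Accumulate d(output)/d(node) in node.grad for every node of the graph."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)   # parents are appended before the node itself

    visit(output)
    output.grad = 1.0
    # Reverse topological order: a node is processed after all nodes using it
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


x, y = Var(2.0), Var(1.0)
out = (x + vmax(x, y)) / (y + sqr(x + y))
backward(out)
print(f"f = {out.value} ; dfdx = {x.grad} ; dfdy = {y.grad}")
```

The `+=` in `backward` plays the role of the final summation of contributions to $dx$ and $dy$ in the manual derivation.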