# Day 2: Automatic differentiation and an introduction to CUDA

## Automatic differentiation

Knowing the gradients of a multivariable function with respect to its arguments is a powerful tool in mathematical computing, e.g., for
- computing Hessians for optimization problems, or extracting uncertainties from a likelihood:
$$\begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_1 \partial x_2} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix}$$
- computing gradients to descend toward the minimum of a loss in machine learning:
$$\vec{x}_{n+1} = \vec{x}_n - \gamma \nabla L(\vec{x})|_{\vec{x}=\vec{x}_n}$$
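
The update rule above can be sketched in a few lines of Python. The quadratic loss, its hand-derived gradient, the step size `gamma`, and the number of steps are all assumptions chosen for illustration; they are not part of the lecture material.

```python
import numpy as np

def loss(x):
    # Hypothetical quadratic loss with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad_loss(x):
    # Gradient of the loss above, derived by hand.
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])

def gradient_descent(x0, gamma=0.1, n_steps=200):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # x_{n+1} = x_n - gamma * grad L(x_n)
        x = x - gamma * grad_loss(x)
    return x

x_min = gradient_descent([5.0, 5.0])
print(x_min, loss(x_min))
```

Here the gradient is written out by hand. Automatic differentiation is what lets us skip that step for more complicated losses.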


There are several ways to compute gradients:
- Use a symbolic language that maps each function to its definition and to the functions that give its derivatives. Here is an example using Mathematica:
![Derivatives in Mathematica](mathematica_screnshot_diff.png)
- Use numerical differentiation with small step sizes.
- Embed differentials into variable definitions in programming (**automatic differentiation** - topic of today!).
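
To make the second option concrete, here is a minimal sketch of numerical differentiation with a central finite difference. The test function (`math.sin`), the evaluation point, and the step sizes are arbitrary choices for illustration.

```python
import math

def central_difference(f, x, h):
    # Symmetric finite-difference estimate of f'(x).
    return (f(x + h) - f(x - h)) / (2.0 * h)

x0 = 1.0
exact = math.cos(x0)  # d/dx sin(x) = cos(x)

errors = {}
for h in (1e-1, 1e-3, 1e-5):
    errors[h] = abs(central_difference(math.sin, x0, h) - exact)
    print(f"h = {h:g}: error = {errors[h]:.2e}")
```

The truncation error shrinks as `h` decreases, but floating-point round-off grows for very small `h`, so the step size has to be tuned. Automatic differentiation avoids this trade-off because it propagates exact derivative rules rather than differences.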

Suppose we have a composite function $y = f(g(h(x)))$. Writing the intermediate results as
$$v=h(x)$$
$$u=g(v)$$
$$y=f(u)$$
the chain rule gives us
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{\partial u}{\partial v} \frac{\partial v}{\partial x}$$

This gives us two directions in which we can evaluate the product:
- Forward: $\frac{\partial y}{\partial u} \frac{\partial u}{\partial v} \frac{\partial v}{\partial x} \leftarrow$ start at the innermost derivative (closest to the input $x$) and multiply toward the left.
- Reverse: start at the outermost derivative (closest to the output $y$) and multiply toward the right $\rightarrow \frac{\partial y}{\partial u} \frac{\partial u}{\partial v} \frac{\partial v}{\partial x}$.

Why does it matter? Don't we have the same number of derivatives?

$\rightarrow$ That is correct, but if the composite function maps $U^n \to V^m$, the two directions need a different number of multiplications, because the derivatives are Jacobian matrices. For the sake of example, assume that the intermediate variables $u$ and $v$ both have dimension $k$, so that the three Jacobians in the chain rule have shapes $m \times k$, $k \times k$, and $k \times n$ (from left to right). Then,
- forward differentiation involves $(n \times k^2) + (n \times k \times m)$ multiplications, whereas
- reverse differentiation would need $(m \times k^2) + (n \times k \times m)$ multiplications.

If $n \approx m$, or $k$ is not large compared to $n$ or $m$, the direction does not matter much. On the other hand, if $k$ is large, it is better to use forward differentiation for $n \ll m$, and reverse differentiation for $n \gg m$.
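
The counts above can be reproduced with a short sketch. The helper names `matmul_cost` and `chain_costs` and the example dimensions are hypothetical, and the cost model assumes naive dense matrix multiplication.

```python
def matmul_cost(a_shape, b_shape):
    # Scalar multiplications in a naive (p x q) @ (q x r) matrix product.
    p, q = a_shape
    q_check, r = b_shape
    assert q == q_check, "inner dimensions must match"
    return p * q * r

def chain_costs(n, k, m):
    # Jacobian shapes, left to right: dy/du, du/dv, dv/dx.
    dy_du, du_dv, dv_dx = (m, k), (k, k), (k, n)
    # Forward: (du/dv @ dv/dx) first, then dy/du @ (k x n result).
    forward = matmul_cost(du_dv, dv_dx) + matmul_cost(dy_du, (k, n))
    # Reverse: (dy/du @ du/dv) first, then (m x k result) @ dv/dx.
    reverse = matmul_cost(dy_du, du_dv) + matmul_cost((m, k), dv_dx)
    return forward, reverse

for n, m in [(1, 1000), (1000, 1)]:
    fwd, rev = chain_costs(n, k=500, m=m)
    print(f"n = {n}, m = {m}: forward = {fwd}, reverse = {rev}")
```

With one input and 1000 outputs, forward mode needs 750,000 multiplications against 250,500,000 for reverse mode. Swapping $n$ and $m$ swaps the two numbers.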

In a typical neural network, training by gradient descent uses *forward propagation* to calculate the cost for a given loss function, and *backpropagation* (through reverse autodifferentiation - **why?**) to calculate the gradients of the cost with respect to the weights at each layer.

### Exercise: Dual numbers
Consider an extension of a real variable $r \to (r, \delta r)$. Denoting $u = (x, \delta x)$ and $v = (y, \delta y)$, define the minimal set of algebraic properties as follows:
- $-u = (-x, -\delta x)$
- $u^{-1} = (1/x, -\delta x/x^2)$
- $u+v = (x+y, \delta x + \delta y)$
- $u \times v = (x \times y, y \times \delta x + x \times \delta y)$
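
As a quick sanity check of these rules (a worked example that is not part of the original exercise), multiplying $u = (x, \delta x)$ by itself gives

$$u \times u = (x \times x, x \times \delta x + x \times \delta x) = (x^2, 2x \, \delta x)$$

so the second slot carries the derivative of $x^2$ multiplied by $\delta x$. For instance, $u = (3, 1)$ yields $(9, 6)$: the value of $x^2$ at $x = 3$, and its derivative $2x = 6$.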

For $f(x, y) = x^{2 y^3}$, write a simple Python script that uses dual numbers to compute $f(2,3)$ and $\frac{\partial f}{\partial y} |_{(x,y)=(2,3)}$.

#### Solution:

```python
import numpy as np

class DualValue:
    """A value paired with its differential, (r, delta r), for forward-mode differentiation."""

    def __init__(self, value, differential):
        self.value = value
        self.differential = differential

    @staticmethod
    def _lift(other):
        # Treat a plain number as a constant, i.e. with zero differential.
        return other if isinstance(other, DualValue) else DualValue(other, 0)

    # Arithmetic: the forward methods accept DualValue or plain numbers on the right;
    # the reflected (__r*__) methods handle a plain number on the left.
    def __add__(self, other):
        other = DualValue._lift(other)
        return DualValue(self.value + other.value, self.differential + other.differential)
    def __radd__(self, other):
        return DualValue(self.value + other, self.differential)
    def __sub__(self, other):
        other = DualValue._lift(other)
        return DualValue(self.value - other.value, self.differential - other.differential)
    def __rsub__(self, other):
        return DualValue(other - self.value, -self.differential)
    def __mul__(self, other):
        # Product rule.
        other = DualValue._lift(other)
        return DualValue(self.value * other.value, self.differential * other.value + self.value * other.differential)
    def __rmul__(self, other):
        return DualValue(self.value * other, self.differential * other)
    def __truediv__(self, other):
        # Quotient rule.
        other = DualValue._lift(other)
        return DualValue(self.value / other.value, (self.differential * other.value - self.value * other.differential) / (other.value * other.value))
    def __rtruediv__(self, other):
        return DualValue(other / self.value, -self.differential * other / (self.value * self.value))
    def __neg__(self):
        return DualValue(-self.value, -self.differential)

    # Comparisons only look at the value, not at the differential.
    def __lt__(self, other):
        return self.value < DualValue._lift(other).value
    def __le__(self, other):
        return self.value <= DualValue._lift(other).value
    def __eq__(self, other):
        return self.value == DualValue._lift(other).value
    def __ne__(self, other):
        return self.value != DualValue._lift(other).value
    def __gt__(self, other):
        return self.value > DualValue._lift(other).value
    def __ge__(self, other):
        return self.value >= DualValue._lift(other).value

    def __abs__(self):
        return DualValue(abs(self.value), self.differential * np.sign(self.value))
    def __str__(self):
        return str(self.value)

    def __pow__(self, other):
        if isinstance(other, DualValue):
            # d(a^b) = a^b * ln(a) * db + b * a^(b-1) * da
            return DualValue(self.value**other.value, self.value**other.value * other.differential * np.log(self.value) + self.differential * other.value * self.value**(other.value - 1))
        else:
            # Constant exponent: d(a^c) = c * a^(c-1) * da
            return DualValue(self.value**other, other * self.value**(other - 1) * self.differential)
    def __rpow__(self, other):
        # Constant base: d(c^b) = c^b * ln(c) * db
        return DualValue(other**self.value, other**self.value * self.differential * np.log(other))

x = DualValue(2, 0)  # x is held fixed, so its differential is 0
y = DualValue(3, 1)  # seed the differential with 1 to differentiate with respect to y

f = x**(2*y**3)
print(f.value, f.differential)
```

```
18014398509481984 6.742779949618588e+17
```
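
The printed values can be cross-checked against the derivative worked out by hand, $x^{2y^3} \ln(x) \cdot 6y^2$. The snippet below is only a sanity check added for illustration and uses nothing beyond the standard library.

```python
import math

x, y = 2.0, 3.0

# f(x, y) = x^(2 y^3), which is 2^54 at (2, 3).
f_value = x ** (2.0 * y ** 3)

# Analytic derivative with respect to y: x^(2 y^3) * ln(x) * 6 y^2.
df_dy = f_value * math.log(x) * 6.0 * y ** 2

print(f_value, df_dy)
```

Both numbers agree with the dual-number result up to floating-point rounding.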


```python
import numpy as np
import tensorflow as tf
import tensorflow.keras.layers
```