# Automatic differentiation

Many engineering applications require the derivative of a function. The simplest example is Newton's method, a fast iterative algorithm for finding a root of a nonlinear equation $f(x) = 0$:

$$
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.
$$
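The iteration above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `newton`, the stopping tolerance and the iteration cap are assumptions made for the example, and the derivative `df` still has to be supplied by hand.

```python
import math


def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x - f(x) / f'(x) until the step is smaller than tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root, math.sqrt(2))
```

Having to write `df` manually is exactly the step that the techniques below try to remove.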

More complicated applications also need derivatives in addition to function values. For instance, training a neural network minimizes a cost function by gradient descent, which requires the derivatives of the cost function with respect to the parameters (weights and biases).

### How to calculate the derivative of a function?

* symbolic differentiation: either the derivative of $f$ is known in closed form in advance, or a computer algebra system computes it
* numerical differentiation: approximate the value of $f'(x)$ with finite differences:
$$
f'(x)\approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}
$$
* automatic differentiation: use a mechanism that propagates the derivative together with the function value through every elementary operation, so the derivative is obtained automatically and is exact up to floating-point rounding.
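To get a feeling for the limits of numerical differentiation, the following sketch evaluates the central difference formula for $\sin$ at $x = 1$ with several step sizes. The helper name `central_difference` and the chosen step sizes are assumptions made for this illustration.

```python
import math


def central_difference(f, x, eps):
    """Approximate f'(x) by (f(x + eps) - f(x - eps)) / (2 * eps)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


exact = math.cos(1.0)  # the derivative of sin at x = 1
for eps in (1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11):
    approx = central_difference(math.sin, 1.0, eps)
    print(f'eps = {eps:.0e}   error = {abs(approx - exact):.2e}')
```

The error typically shrinks at first as the step gets smaller (truncation error) and then grows again for very small steps (floating-point cancellation), so the step size has to be tuned. Automatic differentiation avoids this trade-off.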

### Dual numbers

A dual number can be represented as a symbolic expression $a + b\cdot\varepsilon$, where $\varepsilon^2 = 0$. More precisely, a dual number is a pair of real numbers $(a, b)$, such that addition, multiplication, negation (and hence subtraction) and division (for $c \neq 0$) are defined as follows:

* $(a + b\varepsilon) + (c + d\varepsilon) = (a + c) + (b + d)\varepsilon$
* $(a + b\varepsilon) \cdot (c + d\varepsilon) = ac + (ad + bc)\varepsilon$
* $-(a + b\varepsilon) = -a + (-b)\varepsilon$
* $(a + b\varepsilon) \,/\, (c + d\varepsilon) = \frac{a}{c} + \frac{bc - ad}{c^2}\varepsilon$

The dual numbers form a two-dimensional commutative unital associative algebra over the real numbers.
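The multiplication and division rules are not arbitrary: both follow from expanding the expressions and using $\varepsilon^2 = 0$. A short derivation, using the same symbols as the list above:

$$
(a + b\varepsilon)(c + d\varepsilon) = ac + ad\varepsilon + bc\varepsilon + bd\varepsilon^2 = ac + (ad + bc)\varepsilon,
$$

$$
\frac{a + b\varepsilon}{c + d\varepsilon} = \frac{(a + b\varepsilon)(c - d\varepsilon)}{(c + d\varepsilon)(c - d\varepsilon)} = \frac{ac + (bc - ad)\varepsilon}{c^2} = \frac{a}{c} + \frac{bc - ad}{c^2}\varepsilon \qquad (c \neq 0).
$$

The $\varepsilon$ coefficients have the same shape as the product rule and the quotient rule of differentiation, which is what makes dual numbers useful here.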

Let us assume that $p$ is a polynomial of degree $n$:

$$
p(x) = a_0 + a_1x + a_2x^2 + \ldots + a_nx^n.
$$

Let $x = a + b\varepsilon$ be a dual number. Then

$$
p(x) = p(a + b\varepsilon) = p(a) + b\cdot p'(a)\cdot\varepsilon.
$$
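To see why this holds, apply the binomial theorem to a single power of the dual number; every term that contains $\varepsilon^2$ vanishes:

$$
(a + b\varepsilon)^k = a^k + k a^{k-1} b\,\varepsilon + \binom{k}{2} a^{k-2} b^2 \varepsilon^2 + \ldots = a^k + k a^{k-1} b\,\varepsilon.
$$

Summing over the terms of the polynomial (here $a_k$ are the coefficients of $p$, while $a$ is the real part of the dual number) gives

$$
p(a + b\varepsilon) = \sum_{k=0}^{n} a_k (a + b\varepsilon)^k = \sum_{k=0}^{n} a_k a^k + b\left(\sum_{k=1}^{n} k\,a_k a^{k-1}\right)\varepsilon = p(a) + b\,p'(a)\,\varepsilon.
$$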

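More generally, an analytic function is locally given by its convergent power series, so the same identity carries over: an analytic function $f$ maps the dual number $(a, b)$ to the dual number $(f(a), bf'(a))$. In particular, evaluating $f$ at $(a, 1)$ yields $f'(a)$ in the second component. This means that by extending the usual function definitions to dual numbers, the derivative is calculated along with the function value at little extra cost.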

As an example, the trigonometric sine function should be overridden with the following new definition:

$$
\sin x = \sin\,(a, b) := (\sin a, b\cdot \cos a).
$$

In [None]:
class Dual:
    def __init__(self, value, derivative):
        self.value = value
        self.derivative = derivative
        
    def __repr__(self):
        return f'{self.value} + {self.derivative}*ε'
        
    def __add__(self, that):
        return Dual(self.value + that.value, self.derivative + that.derivative)
    
    def __neg__(self):
        return Dual(-self.value, -self.derivative)
    
    def __sub__(self, that):
        return self + (-that)
    
    def __mul__(self, that):
        return Dual(self.value*that.value, self.value*that.derivative + self.derivative*that.value)
    
    def __truediv__(self, that):
        a = self.value / that.value
        b = (self.derivative*that.value - self.value*that.derivative) / (that.value ** 2)
        return Dual(a, b)
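The class above only combines a `Dual` with another `Dual`, so a constant has to be written as `Dual(c, 0)`. The sketch below shows one way plain numbers could be mixed in through the reflected operators `__radd__` and `__rmul__`. `DualNumber` and its `lift` helper are hypothetical names for this illustration, and only addition and multiplication are covered.

```python
class DualNumber:
    """Hypothetical variant of Dual that also accepts plain numbers as operands."""

    def __init__(self, value, derivative=0.0):
        self.value = value
        self.derivative = derivative

    @staticmethod
    def lift(x):
        # A constant c corresponds to the dual number c + 0*ε.
        return x if isinstance(x, DualNumber) else DualNumber(x, 0.0)

    def __add__(self, that):
        that = DualNumber.lift(that)
        return DualNumber(self.value + that.value, self.derivative + that.derivative)

    __radd__ = __add__  # addition is commutative

    def __mul__(self, that):
        that = DualNumber.lift(that)
        return DualNumber(self.value * that.value,
                          self.value * that.derivative + self.derivative * that.value)

    __rmul__ = __mul__  # multiplication is commutative


# p(x) = 3x^2 + 2x + 1, so p(2) = 17 and p'(2) = 6*2 + 2 = 14.
x = DualNumber(2.0, 1.0)
y = 3 * x * x + 2 * x + 1
print(y.value, y.derivative)
```

Seeding the derivative component with 1 means the second component of the result is the derivative of the polynomial at that point.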

In [None]:
import math


def sin(d):
    # sin x = sin(a, b) := (sin a, b * cos a).
    return Dual(math.sin(d.value), math.cos(d.value) * d.derivative)


def cos(d):
    return Dual(math.cos(d.value), -math.sin(d.value) * d.derivative)


def tan(d):
    return Dual(math.tan(d.value), d.derivative / math.pow(math.cos(d.value), 2))


def cot(d): 
    return Dual(1, 0) / tan(d)

In [None]:
# Can you implement the following functions for dual numbers?

def log(d):
    pass


def sqrt(d):
    pass


def exp(d):
    pass


def power(d, alpha):
    pass
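For comparison after trying the exercise yourself, here is one possible set of answers. Each function applies the rule $(a, b) \mapsto (f(a), bf'(a))$. The results are built with `type(d)`, so the functions should work with the `Dual` class above; the `Pair` namedtuple is a hypothetical stand-in with the same two fields, used only so the snippet runs on its own. `power` assumes a constant real exponent.

```python
import math
from collections import namedtuple


def log(d):
    # (log a)' = 1 / a, defined for a > 0
    return type(d)(math.log(d.value), d.derivative / d.value)


def sqrt(d):
    # (sqrt a)' = 1 / (2 * sqrt a), defined for a > 0
    root = math.sqrt(d.value)
    return type(d)(root, d.derivative / (2 * root))


def exp(d):
    # (exp a)' = exp a
    e = math.exp(d.value)
    return type(d)(e, d.derivative * e)


def power(d, alpha):
    # (a^alpha)' = alpha * a^(alpha - 1) for a constant exponent alpha
    return type(d)(d.value ** alpha, d.derivative * alpha * d.value ** (alpha - 1))


# Hypothetical stand-in for Dual, only needed to run this snippet in isolation.
Pair = namedtuple('Pair', ['value', 'derivative'])

y = sqrt(log(Pair(3.0, 1.0)))
print(y.value, y.derivative)
```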

$$
f(x) = \sqrt{\log x}, \qquad f'(3) = ?
$$

$$
f'(x) = \frac{1}{2}\frac{1}{\sqrt{\log x}}\cdot\frac{1}{x}
$$

In [None]:
def f(d):
    return sqrt(log(d))


x = Dual(3, 1)
y = f(x)

print(f'Function value at 3: {y.value}')
print(f'Derivative value at 3: {y.derivative}')

Links:

* [Dual number (Wikipedia)](https://en.wikipedia.org/wiki/Dual_number)
* [Automatic differentiation (Wikipedia)](https://en.wikipedia.org/wiki/Automatic_differentiation)
* [From scratch: reverse-mode automatic differentiation (in Python)](https://sidsite.com/posts/autodiff/)
* [Automatic differentiation in machine learning: a survey](https://arxiv.org/abs/1502.05767)
* [What is Automatic Differentiation?](https://www.youtube.com/watch?time_continue=2&v=wG_nF1awSSY&feature=emb_logo)
* [Reverse mode automatic differentiation](https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation)

In [None]:
import numpy as np
import tensorflow as tf

To differentiate automatically: 
* TensorFlow needs to remember what operations happen in what order during the forward pass. 
* During the backward pass, it traverses this list of operations in reverse order to compute gradients.
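The two steps above can be illustrated with a tiny pure-Python tape. This is only a sketch of the idea and not how TensorFlow is implemented internally; the names `Var`, `recorded_ops`, `record`, `mul_op`, `log_op`, `sqrt_op` and `backward` are all hypothetical.

```python
import math


class Var:
    """A value plus the bookkeeping needed for the backward pass."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, d(self)/d(parent))
        self.grad = 0.0


recorded_ops = []  # results of operations, in the order they were executed


def record(value, parents):
    node = Var(value, parents)
    recorded_ops.append(node)
    return node


def mul_op(a, b):
    return record(a.value * b.value, ((a, b.value), (b, a.value)))


def log_op(a):
    return record(math.log(a.value), ((a, 1.0 / a.value),))


def sqrt_op(a):
    root = math.sqrt(a.value)
    return record(root, ((a, 0.5 / root),))


def backward(output):
    output.grad = 1.0
    # Walk the recorded operations in reverse and apply the chain rule,
    # accumulating each node's gradient into the gradients of its inputs.
    for node in reversed(recorded_ops):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


x = Var(3.0)
y = sqrt_op(log_op(x))
backward(y)
print(y.value, x.grad)
```

A single backward sweep fills in the gradient of the output with respect to every input, which is why reverse mode suits functions with many parameters and one scalar output, such as a loss function.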

TensorFlow provides the `tf.GradientTape` API for automatic differentiation.

https://www.tensorflow.org/api_docs/python/tf/GradientTape

https://www.tensorflow.org/guide/autodiff

https://www.tensorflow.org/guide/advanced_autodiff

In [None]:
# This is the same function we have tested with our manually implemented Dual class
def f(x):
    return tf.sqrt(tf.math.log(x))


x = tf.Variable(3.0, dtype=tf.float64)

with tf.GradientTape() as tape:
    y = f(x)

In [None]:
# Derivative of y with respect to x, evaluated at x = 3
dy_dx = tape.gradient(y, x)

print(dy_dx.numpy())
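As a sanity check, the printed gradient can be compared with the closed-form derivative $f'(x) = \frac{1}{2}\frac{1}{\sqrt{\log x}}\cdot\frac{1}{x}$ given earlier. The variable names `x0` and `expected` are chosen only for this snippet.

```python
import math

x0 = 3.0
# f'(x) = 1 / (2 * x * sqrt(log x)) for f(x) = sqrt(log x)
expected = 1.0 / (2.0 * x0 * math.sqrt(math.log(x0)))
print(expected)
```

The value from `tape.gradient` and the derivative computed with the `Dual` class should both agree with this number up to floating-point rounding.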

In [None]:
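# Each line holds two feature values and a 0/1 label, separated by commas.
# The handle is not called f, so it does not shadow the function f defined above.
with open("data/log_reg_1.txt") as data_file:
    X = []
    y = []
    for line in data_file:
        x0, x1, label = line.split(',')
        X.append((float(x0), float(x1)))
        y.append(int(label))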
        
X = np.array(X)
y = np.expand_dims(np.array(y), 1)

In [None]:
# Copied from the logistic regression part
def activation(Z):
    return 1 / (1 + np.exp(-Z))


def calc_gradient(X, y, w, b):
    m = len(X)
    A = activation(np.matmul(X, w) + b)
    cost = (-1 / m) * np.sum(np.multiply(y, np.log(A)) + np.multiply(1 - y, np.log(1 - A)))
    
    dZ = A - y
    dw = (1 / m) * np.matmul(X.T, dZ)
    db = (1 / m) * np.sum(dZ)
    return cost, dw, db
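A hand-derived gradient like the one above can be checked against central finite differences. The sketch below restates the same cost and gradient formulas so that it runs on its own, and it uses small synthetic data (an assumption for illustration, not the notebook's data set). All names ending in `_demo`, as well as `cost_fn` and `analytic_gradient`, are hypothetical.

```python
import numpy as np


def cost_fn(X, y, w, b):
    # Same binary cross-entropy cost as in calc_gradient above.
    A = 1 / (1 + np.exp(-(X @ w + b)))
    return -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))


def analytic_gradient(X, y, w, b):
    # Same formulas for dw and db as in calc_gradient above.
    A = 1 / (1 + np.exp(-(X @ w + b)))
    dZ = A - y
    return X.T @ dZ / len(X), np.sum(dZ) / len(X)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = (X_demo[:, :1] + X_demo[:, 1:] > 0).astype(float)
w_demo = rng.normal(size=(2, 1))
b_demo = 0.5

dw_demo, db_demo = analytic_gradient(X_demo, y_demo, w_demo, b_demo)

# Central differences, one parameter at a time.
eps = 1e-6
numeric_dw = np.zeros_like(w_demo)
for i in range(w_demo.shape[0]):
    shift = np.zeros_like(w_demo)
    shift[i] = eps
    numeric_dw[i] = (cost_fn(X_demo, y_demo, w_demo + shift, b_demo)
                     - cost_fn(X_demo, y_demo, w_demo - shift, b_demo)) / (2 * eps)
numeric_db = (cost_fn(X_demo, y_demo, w_demo, b_demo + eps)
              - cost_fn(X_demo, y_demo, w_demo, b_demo - eps)) / (2 * eps)

print(np.max(np.abs(dw_demo - numeric_dw)), abs(db_demo - numeric_db))
```

This check needs two cost evaluations per parameter, which is affordable here but quickly becomes expensive for larger models.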

In [None]:
# Using zero initial values, calculate the gradient of the loss function with respect to w and b

_, nr_features = X.shape
w = np.zeros((nr_features, 1), dtype=np.float64)  # np.float_ was removed in NumPy 2.0
b = 0.0

_, dw, db = calc_gradient(X, y, w, b)

print(dw)
print(db)

In [None]:
# Calculate dw and db without explicit derivative calculation

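w = tf.Variable(tf.zeros((nr_features, 1), dtype=tf.float64), name='w')
b = tf.Variable(0.0, dtype=tf.float64, name='b')
x = tf.constant(X)  # the input data is not trained, so a constant is sufficient
m, _ = x.shape

# persistent=True allows tape.gradient to be called more than once
with tf.GradientTape(persistent=True) as tape:
    z = tf.add(tf.matmul(x, w), b)
    a = tf.sigmoid(z)
    # binary_crossentropy averages over the last axis; reduce_mean then averages over the samples
    loss = tf.reduce_mean(tf.losses.binary_crossentropy(y_true=y, y_pred=a))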

In [None]:
[dl_dw, dl_db] = tape.gradient(loss, [w, b])

print(dl_dw.numpy())
print(dl_db.numpy())

![Forward-mode automatic differentiation](images/forward_mode_autodiff.png)

![Reverse-mode automatic differentiation](images/reverse_mode_autodiff.png)