<!-- File automatically generated using DocOnce (https://github.com/doconce/doconce/):
doconce format ipynb ReverseAutoDiff.do.txt  -->

## Reverse-mode automatic differentiation
Automatic differentiation (autodiff) is the foundation of deep learning frameworks.
Deep learning models are typically trained with gradient-based methods,
and autodiff computes the required gradients automatically, even for large and intricate models.
Most deep learning frameworks use reverse-mode autodiff because it yields the gradient of a scalar loss with respect to all parameters in a single backward pass, and the result is exact up to floating-point rounding.

In this module, we will see how the chain rule generalizes to automatic differentiation of composite functions.
The key observation is that the functions we work with are compositions of elementary operations such as addition, subtraction,
multiplication, and division, each of which has a known local derivative.
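
As a sketch of what this generalization looks like (the bar notation $\bar{v}$ is a notational assumption introduced here, not taken from the referenced code): for an output $z$ and any node $v$ in the computational graph, write $\bar{v} = \partial z / \partial v$. Reverse mode starts from $\bar{z} = 1$ and propagates backwards through every node $u$ that takes $v$ as a direct input:

$$
\begin{align*}
\bar{z} &= 1\\
\bar{v} &= \sum_{u \,:\, v \text{ is a child of } u} \bar{u}\,\frac{\partial u}{\partial v}
\end{align*}
$$

Unrolling this recursion gives the rule used in the exercises: multiply the local derivatives along each path from $z$ to $v$, then add the contributions of all paths.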

**Notice.**

The initial examples and code base are modified from <https://sidsite.com/posts/autodiff/>

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (4, 2)
plt.rcParams['figure.dpi'] = 150

## Exercise 1: Simple example

We begin with a simple example where we want to compute the gradients of a function $d$ with respect to its children $a$ and $b$.
The point of this exercise is to show that you can use a computational graph directly to state the gradients.
Given the function

$$
\begin{align*}
a &= 4\\
b &= 3\\
c &= a + b\\
d &= a c
\end{align*}
$$

we can build a computational graph of the function $d$ shown in the figure below:

<!-- dom:FIGURE: [figures/auto-diff-simple-graph.png] <div id="fig:auto-diff-simple-graph"></div> The function $d$ is composed of basic operations (addition, multiplication). By representing the function as a graph, we can visualize the flow of information through the function. When computing gradients, we start at the end of the graph and work our way backwards, multiplying the gradients of the children and adding the branches.  -->
<!-- begin figure -->
<div id="fig:auto-diff-simple-graph"></div>

<img src="figures/auto-diff-simple-graph.png" ><p style="font-size: 0.9em"><i>Figure 1:  The function $d$ is composed of basic operations (addition, multiplication). By representing the function as a graph, we can visualize the flow of information through the function. When computing gradients, we start at the end of the graph and work our way backwards, multiplying the gradients of the children and adding the branches.</i></p>
<!-- end figure -->

Work through this simple example and find the gradients $\frac{\partial d}{\partial a}$ and $\frac{\partial d}{\partial b}$ using the graph.

<!-- --- begin solution of exercise --- -->
**Solution.**

<!-- Equation labels as ordinary links -->
<div id="_auto1"></div>

$$
\begin{equation}
\frac{\partial d}{\partial a} = c + a\frac{\partial c}{\partial a}
\label{_auto1} \tag{1}
\end{equation}
$$

<!-- Equation labels as ordinary links -->
<div id="_auto2"></div>

$$
\begin{equation}  
\frac{\partial d}{\partial b} = \frac{\partial d}{\partial c}\frac{\partial c}{\partial b}
\label{_auto2} \tag{2}
\end{equation}
$$
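
Plugging in $a = 4$, $b = 3$ and $c = 7$ gives concrete numbers. The snippet below is a minimal sketch that evaluates the two expressions above and compares them with central finite differences; the helper name `d_of` and the step size `eps` are choices made here for illustration.

```python
def d_of(a, b):
    """Forward pass of the example: c = a + b, d = a * c."""
    c = a + b
    return a * c

a, b = 4.0, 3.0
c = a + b

# Gradients read off the graph (multiply along paths, add branches).
dd_da = c + a * 1.0   # direct edge d -> a, plus the path d -> c -> a
dd_db = a * 1.0       # single path d -> c -> b

# Central finite differences as an independent numerical check.
eps = 1e-6
fd_da = (d_of(a + eps, b) - d_of(a - eps, b)) / (2 * eps)
fd_db = (d_of(a, b + eps) - d_of(a, b - eps)) / (2 * eps)

print(dd_da, dd_db)  # 11.0 4.0
```

The finite-difference values `fd_da` and `fd_db` agree with the graph-based values up to rounding error.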

<!-- --- end solution of exercise --- -->

## Exercise 2: Compute the partial derivatives automatically

Work through this simple example and implement a minimal version of a variable class and a method to compute gradients.

To do so we have created a `Variable` class that stores the value of the variable and the gradients with respect to its children.
The gradients are stored as a tuple of tuples, where each inner tuple contains a reference to a child variable and its corresponding partial
derivative.
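
To make this storage format concrete, here is a small sketch for $c = a + b$. It uses a hypothetical `Node` namedtuple as a stand-in for `Variable` (an assumption for illustration only; the actual class follows below).

```python
from collections import namedtuple

# Hypothetical stand-in for `Variable`: a value plus (child, local derivative) pairs.
Node = namedtuple("Node", ["value", "gradients"])

a = Node(4, ())  # leaf: no children
b = Node(3, ())  # leaf: no children

# c = a + b: the local derivative of a sum is 1 for each operand.
c = Node(a.value + b.value, ((a, 1), (b, 1)))

for child, local_derivative in c.gradients:
    print(child.value, local_derivative)  # prints "4 1", then "3 1"
```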

First, fill in the missing code in the `mul` function so that it returns the product together with the local derivatives with respect to its two children.
Then fill in the missing code in the `compute_gradients` function, which walks the graph and accumulates the total derivatives with respect to all child variables.

**a)**

In [2]:
from collections import defaultdict

class Variable:
    def __init__(self, value, gradients=None):
        self.value = value
        # a leaf variable only references itself; its derivative w.r.t. itself is 1
        self._gradients = gradients if gradients is not None else ((self, 1),)
        self._stored_gradients = None

    @property
    def gradients(self):
        return compute_gradients(self)
    
def add(a, b):
    "Create the variable that results from adding two variables."
    value = a.value + b.value    
    gradients = (
        (a, 1),  # the local derivative with respect to a is 1
        (b, 1)   # the local derivative with respect to b is 1
    )
    return Variable(value, gradients)

def mul(a, b):
    "Create the variable that results from multiplying two variables."
    # ---> TODO: fill in the missing code <---
    value = ___
    gradients = (
        (a, ___), # the local derivative with respect to a is b.value
        (b, ___)  # the local derivative with respect to b is a.value
    )
    return Variable(value, gradients)

def compute_gradients(variable):
    """ Compute the first derivatives of `variable` 
    with respect to child variables.
    """
    gradients = defaultdict(lambda: 0)
    
    def _compute_gradients(variable, total_gradient):
        for child_variable, child_gradient in variable._gradients:
            # ---> TODO: fill in the missing code <---
            # "Multiply the edges of a path":
            gradient = ___
            # "Add together the different paths":
            gradients[child_variable] = ___

            # if the child variable only has itself as a gradient 
            # we have reached the end of recursion
            criteria = (
                len(child_variable._gradients) == 1 and 
                child_variable._gradients[0][0] is child_variable
            )
            if not criteria:
                # recurse through graph:
                _compute_gradients(child_variable, gradient)
    
    _compute_gradients(variable, total_gradient=1)
    # (total_gradient=1 is from `variable` differentiated w.r.t. itself)
    return gradients

<!-- --- begin solution of exercise --- -->
**Solution.**

In [3]:
from collections import defaultdict

class Variable:
    def __init__(self, value, gradients=None):
        self.value = value
        # a leaf variable only references itself; its derivative w.r.t. itself is 1
        self._gradients = gradients if gradients is not None else ((self, 1),)
        self._stored_gradients = None

    @property
    def gradients(self):
        if self._stored_gradients is None:
            self._stored_gradients = dict(compute_gradients(self))
        return self._stored_gradients
    
def add(a, b):
    "Create the variable that results from adding two variables."
    value = a.value + b.value    
    gradients = (
        (a, 1),  # the partial derivative with respect to a is 1
        (b, 1)   # the partial derivative with respect to b is 1
    )
    return Variable(value, gradients)

def mul(a, b):
    "Create the variable that results from multiplying two variables."
    value = a.value * b.value
    gradients = (
        (a, b.value), # the partial derivative with respect to a is b.value
        (b, a.value)  # the partial derivative with respect to b is a.value
    )
    return Variable(value, gradients)

def compute_gradients(variable):
    """ Compute the first derivatives of `variable` 
    with respect to child variables.
    """
    gradients = defaultdict(lambda: 0)
    
    def _compute_gradients(variable, total_gradient):
        for child_variable, child_gradient in variable._gradients:
            # "Multiply the edges of a path":
            gradient = total_gradient * child_gradient
            # "Add together the different paths":
            gradients[child_variable] += gradient
            # if the child variable only has itself as a gradient 
            # we have reached the end of recursion
            criteria = (
                len(child_variable._gradients) == 1 and 
                child_variable._gradients[0][0] is child_variable
            )
            if not criteria:
                # recurse through graph:
                _compute_gradients(child_variable, gradient)
    
    _compute_gradients(variable, total_gradient=1)
    # (total_gradient=1 is from `variable` differentiated w.r.t. itself)
    return gradients

<!-- --- end solution of exercise --- -->

**b)**
Test the implementation against the value found by hand for the example in Exercise 1.

In [4]:
a = Variable(4)
# TODO: fill in the missing code
b = ___
c = ___
d = ___

assert d.gradients[a] == ___

<!-- --- begin solution of exercise --- -->
**Solution.**

In [5]:
a = Variable(4)
b = Variable(3)
c = add(a, b)
d = mul(a, c)

assert d.gradients[a] == 11

<!-- --- end solution of exercise --- -->
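
The function-style `add` and `mul` can also be exposed through Python operators, which lets expressions be written as ordinary arithmetic. The following is a minimal sketch under stated assumptions: the names `Var`, `exp` and `gradients` are hypothetical and are not the contents of `variable.py`. Leaves store an empty tuple of children, so the recursion stops without a separate termination test.

```python
import math
from collections import defaultdict

class Var:
    """Hypothetical operator-overloading variant of `Variable`."""
    def __init__(self, value, gradients=()):
        self.value = value
        self._gradients = gradients  # (child, local derivative) pairs; empty for leaves

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __truediv__(self, other):
        return Var(self.value / other.value,
                   ((self, 1.0 / other.value),
                    (other, -self.value / other.value**2)))

def exp(a):
    """Exponential; its local derivative equals its own value."""
    value = math.exp(a.value)
    return Var(value, ((a, value),))

def gradients(variable):
    """Sum over all paths of the product of local derivatives along each path."""
    grads = defaultdict(float)

    def walk(node, total):
        for child, local in node._gradients:
            g = total * local   # multiply the edges of a path
            grads[child] += g   # add together the different paths
            walk(child, g)

    walk(variable, 1.0)
    return grads

a, b = Var(4.0), Var(3.0)
d = a * (a + b)
print(gradients(d)[a], gradients(d)[b])  # 11.0 4.0
```

This sketch only combines `Var` instances with each other; mixing in plain Python numbers or using powers would need further methods such as `__rmul__` and `__pow__`, which are omitted here.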

## Linear rate neurons

Starting with the loss $\mathcal{L}=\frac{1}{2}(y - \hat{y})^2$ we have the computation graph in the figure below.

<!-- dom:FIGURE: [figures/auto-diff-graph.png] <div id="fig:auto-diff-graph"></div> -->
<!-- begin figure -->
<div id="fig:auto-diff-graph"></div>

<img src="figures/auto-diff-graph.png" ><p style="font-size: 0.9em"><i>Figure 2</i></p>
<!-- end figure -->

## Exercise 3: Validate the gradients

In this exercise we want to show the benefit of autodiff. 
In the previous module, LinearRegression, we had to derive the gradients of the function by hand; this is what autodiff will automate for us.
We only need to define the forward pass of our function (**fill in the blanks `___` in the functions `loss`, `sigma` and `y_hat` below**).
In this exercise we will use a predefined `Variable` class from `variable.py` to compute the gradients automatically.
We will compare the result with the analytical gradient (restricting ourselves to the partial derivative with respect to one variable, $w_1$).

In [6]:
# import a "complete" version of the above Variable class.
from variable import *

def loss(y, y_hat):
    return ___ # <---- FILL IN

def sigma(x):
    return ___ # <---- FILL IN

def y_hat(w_0, w_1, x_1):
    return ___ # <---- FILL IN

def dy_hat_dw_1(w_0, w_1, x_1):
    return x_1*y_hat(w_0, w_1, x_1) * (1 - y_hat(w_0, w_1, x_1))

def dloss_dw_1(y, w_0, w_1, x_1):
    return (y_hat(w_0, w_1, x_1) - y) * dy_hat_dw_1(w_0, w_1, x_1)

x_1 = Variable(0.1, name='x_1')
w_0 = Variable(4, name='w_0')
w_1 = Variable(3, name='w_1')
y = Variable(10, name='y')

def isclose(a, b):
    """
    Check whether two values (numbers or `Variable` instances) are numerically close.
    """
    a = a if not isinstance(a, Variable) else a.value
    b = b if not isinstance(b, Variable) else b.value
    return np.isclose(a, b)

# Test that the analytic and autodiff values of d_loss/d_w_1 agree
assert isclose(loss(y, y_hat(w_0, w_1, x_1)).gradients[w_1], dloss_dw_1(y, w_0, w_1, x_1))
print('success!!!!!')

<!-- --- begin solution of exercise --- -->
**Solution.**
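
The analytic reference used in `dloss_dw_1` follows from applying the chain rule by hand. Take $z = w_0 + w_1 x_1$ (the symbol $z$ is introduced here only for the derivation), $\hat{y} = \sigma(z)$, and the logistic function $\sigma(z) = 1/(1 + e^{-z})$, whose derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Then

$$
\begin{align*}
\frac{\partial \mathcal{L}}{\partial w_1}
&= \frac{\partial \mathcal{L}}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial z}\,\frac{\partial z}{\partial w_1}\\
&= (\hat{y} - y)\,\hat{y}(1 - \hat{y})\,x_1
\end{align*}
$$

The last two factors correspond to `dy_hat_dw_1`, and the full product to `dloss_dw_1`, in the code below.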

In [7]:
# import a "complete" version of the above Variable class.
from variable import *

def loss(y, y_hat):
    return 0.5 * (y - y_hat)**2

def sigma(x):
    return 1. / (1. + exp(-x))

def y_hat(w_0, w_1, x_1):
    return sigma(w_0 + w_1*x_1)

def dy_hat_dw_1(w_0, w_1, x_1):
    return x_1*y_hat(w_0, w_1, x_1) * (1 - y_hat(w_0, w_1, x_1))

def dloss_dw_1(y, w_0, w_1, x_1):
    return (y_hat(w_0, w_1, x_1) - y) * dy_hat_dw_1(w_0, w_1, x_1)

x_1 = Variable(0.1, name='x_1')
w_0 = Variable(4, name='w_0')
w_1 = Variable(3, name='w_1')
y = Variable(10, name='y')

def isclose(a, b):
    """
    Check whether two values (numbers or `Variable` instances) are numerically close.
    """
    a = a if not isinstance(a, Variable) else a.value
    b = b if not isinstance(b, Variable) else b.value
    return np.isclose(a, b)

# Test that the analytic and autodiff values of d_loss/d_w_1 agree
assert isclose(loss(y, y_hat(w_0, w_1, x_1)).gradients[w_1], dloss_dw_1(y, w_0, w_1, x_1))
print('success!!!!!')

<!-- --- end solution of exercise --- -->