# AD20 Milestone 2
Group 20: Lindsey Brown, Xinyue Wang, Kevin Yoon

# Table of Contents

**1. Introduction**

    1.1 Automatic Differentiation as a Solution to the Problem of Computing Derivatives
    1.2 Application of AD Techniques
    
**2. Background**

    2.1 Chain Rule
    2.2 Computational Graph Structure
    2.3 Dual Numbers
    2.4 Elementary Functions
    
**3. Package Usage**

    3.1 User Interaction
    3.2 Getting Started via Virtual Environments
    3.3 Importing AD20
    3.4 Instantiating AD20 Objects
    
**4. Software Organization**

    4.1 Directory Structure
    4.2 Modules and Functionality
    4.3 Testing and Coverage
    4.4 Package Distribution
    
**5. Implementation**

    5.1 Core Data Structures
    5.2 Classes
    5.3 Class Methods and Attributes
    5.4 External Dependencies
    5.5 Elementary Functions
    5.6 What Comes Next?

# 1. Introduction
The AD20 package performs forward-mode automatic differentiation of user-defined functions, evaluating both the function and its derivatives to machine precision.

## 1.1 Automatic Differentiation as a Solution to the Problem of Computing Derivatives

Differentiation is a fundamental operation for computational science. Used in a variety of applications from optimization to sensitivity analysis, differentiation is most useful when two conditions are met: it must be exact (up to machine precision) and computationally efficient.

Automatic differentiation (AD), also known as algorithmic or computational differentiation, computes the derivative of a function and is distinguished by its ability to handle complex combinations of functions without sacrificing accuracy. Regardless of how complex the function may be, AD takes advantage of the fact that it can be decomposed into a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.).

Through computing the derivatives of these basic elementary functions and repeatedly applying the chain rule, AD meets the two aforementioned conditions and distinguishes itself from other modes of differentiation, namely numerical differentiation and symbolic differentiation. 

**Numerical Differentiation through Finite Difference Methods:** 
This class of techniques uses the definition of a derivative,
$$\frac{df(x)}{dx} = \lim_{h \rightarrow 0} \frac{f(x+h)-f(x)}{h}$$
to approximate the derivative by evaluating the right hand side for small $h$.  Such a technique is easy to code because it requires only defining and evaluating $f$, but it has limitations in precision due to truncation and roundoff errors, alongside the challenge of choosing an appropriately sized $h$.  While more complex finite difference schemes have been used to increase accuracy, all numerical differentiation remains only approximate with sensitivity to the choice in step size.
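To make this trade-off concrete, here is a minimal sketch (not part of AD20; the helper name `forward_difference` is purely illustrative) that applies the forward difference to $\sin(x)$ at $x = 1$, where the exact derivative is $\cos(1)$, for three step sizes.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the one-sided finite difference (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)  # the true derivative of sin at x0

# Record the absolute error for a large, a moderate, and a tiny step size.
errors = {}
for h in (1e-1, 1e-8, 1e-15):
    approx = forward_difference(math.sin, x0, h)
    errors[h] = abs(approx - exact)
    print(f"h = {h:.0e}   approximation = {approx:.10f}   error = {errors[h]:.2e}")
```

The error should shrink as $h$ decreases from $10^{-1}$ to $10^{-8}$, where truncation error dominates, and then grow again at $10^{-15}$, where roundoff error in the subtraction takes over.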

**Symbolic Differentiation:**
Symbolic differentiation avoids the approximation error of numerical differentiation by manipulating expression trees to produce an exact formula for the derivative, which can then be evaluated to machine precision. However, these expression trees can grow quickly and become inefficient to evaluate.

Both of these options have been used in a variety of different applications to compute derivatives, but both have shortcomings that are addressed by automatic differentiation.  For this reason, for many computational applications, automatic differentiation is preferred:

- While numerical differentiation may be easy to implement and can flexibly handle any type of function, accuracy is sacrificed to truncation and rounding errors, since the method only estimates the derivative from small perturbations of the input. Automatic differentiation, by contrast, does not approximate the derivative through the choice of a small perturbation; it computes derivatives exactly to machine precision, thus avoiding these accuracy and stability problems.


- While symbolic differentiation may ensure accuracy up to machine precision, computational efficiency is sacrificed due to the nature of building complex expression trees. For complex functions, these expression trees can quickly become very large with many mathematical expressions. Unlike symbolic differentiation, automatic differentiation views functions as compositions of basic operations, remains accurate up to machine precision, and maintains computational efficiency since it does not require the buildup and evaluation of complex expression trees.
 
Thus, it is clear that automatic differentiation has advantages over other commonly used techniques for computing derivatives. These advantages make the use of AD attractive to many scientific applications. 

## 1.2 Application of AD Techniques

Through its improved accuracy and efficiency, AD has many different applications where accuracy, precision, and efficiency are crucial in computation. Some potential applications include: 

- Machine learning (ability to understand data and make models/predictions), where backpropagation is used to parameterize neural nets among other parameter optimization techniques
- Parameter optimization (ability to choose best parameter values under given conditions), where methods requiring derivatives may be used to find the optima
- Sensitivity analysis (ability to understand different factors and their impact), which requires computing partial derivatives with respect to different inputs and parameters
- Physical modeling (ability to visualize and depict data through models), where different physical properties are related through derivatives (for example, acceleration is the derivative of velocity)
- Probabilistic inference, where many sampling methods (for example, Hamiltonian Monte Carlo) are derivative based

This large range of applications motivates the development of a package that can easily be used to compute derivatives up to machine precision efficiently, precisely the problem solved by automatic differentiation.

# 2. Background

Here we discuss the mathematical concepts and computational process needed to perform automatic differentiation.

## 2.1 The Chain Rule

The chain rule forms the core concept behind automatic differentiation.  It gives the formula for the derivative of a composition of functions, obtained by taking derivatives sequentially from the outermost function to the innermost.  For a function $f\left(g\left(t\right)\right)$, we find the derivative of $f$ with respect to $t$ by multiplying a series of derivatives,
$$\dfrac{\partial f}{\partial t} = \dfrac{\partial f}{\partial g}\dfrac{\partial g}{\partial t}.$$

Similarly, AD performs differentiation by decomposing complex functions into combinations of simpler, more elementary functions then computing the derivatives of the elementary functions to piece them together to get the overall derivative. By expressing the function as a composition of elementary functions and operations, the derivative of the function can be calculated by applying the chain rule.

The chain rule can be generalized to multiple dimensions, just as automatic differentiation can compute gradients of functions of multiple inputs.  For example, for the function $f(g(u, v), h(u, v))$, we have
$$\dfrac{\partial f}{\partial u} = \dfrac{\partial f}{\partial g}\dfrac{\partial g}{\partial u} + \dfrac{\partial f}{\partial h}\dfrac{\partial h}{\partial u}$$
and 
$$\dfrac{\partial f}{\partial v} = \dfrac{\partial f}{\partial g}\dfrac{\partial g}{\partial v} + \dfrac{\partial f}{\partial h}\dfrac{\partial h}{\partial v}$$
where the gradient is given by
$$\nabla f = \left(\dfrac{\partial f}{\partial u}, \dfrac{\partial f}{\partial v}\right)$$
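As a quick check of these formulas, the sketch below uses an illustrative choice that does not come from the package: $f(g, h) = g h$ with $g = uv$ and $h = u + v$. It computes the gradient once through the chain rule and once from the expanded form $f(u, v) = u^2 v + u v^2$; the function names are hypothetical.

```python
def chain_rule_gradient(u, v):
    """Gradient of f(g, h) = g * h with g = u * v and h = u + v, via the chain rule."""
    g, h = u * v, u + v
    # Partial derivatives of the outer function f with respect to g and h.
    df_dg, df_dh = h, g
    # Partial derivatives of the inner functions with respect to u and v.
    dg_du, dg_dv = v, u
    dh_du, dh_dv = 1.0, 1.0
    df_du = df_dg * dg_du + df_dh * dh_du
    df_dv = df_dg * dg_dv + df_dh * dh_dv
    return df_du, df_dv

def direct_gradient(u, v):
    """Gradient of the expanded form f(u, v) = u**2 * v + u * v**2."""
    return 2 * u * v + v ** 2, u ** 2 + 2 * u * v

grad_chain = chain_rule_gradient(2.0, 3.0)
grad_direct = direct_gradient(2.0, 3.0)
print(grad_chain, grad_direct)
```

Both approaches should agree, which is the property forward-mode AD relies on at every step of a computation.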

In the most general form, for $f(g(x))$ with $x \in \mathbb{R}^m$ and $g: \mathbb{R}^m \rightarrow \mathbb{R}^n$, we have

$$ \nabla_x f = \sum_{i=1}^n \frac{\partial f}{\partial g_{i}}\nabla_x g_i $$

## 2.2 Graph Structure of Calculations

Automatic differentiation computes derivatives by constructing a function through a sequence of elementary operations and functions.  We can visualize the process of constructing a function in this way through a computational graph, where the nodes are functions and the directed edges describe the computations which connect the nodes.  By tracing a path from the input nodes to the final output function, we trace the same computations which are done by the automatic differentiation method to evaluate the function and its derivative.

### Example: The Computational Graph
Consider the example function $$f\left(x,y\right) = x^{3} + \sin(5y)$$
evaluated at $(x,y) = (1, \frac{\pi}{5})$.

 The evaluation trace looks like:

| Trace | Elementary Function | Numerical Value |  Elementary Function Derivative | $\nabla_{x}$ Value  | $\nabla_{y}$ Value  | 
| :---: | :-----------------: | :-------: | :-----------: | :-------------: | :-------------: | 
| $x_{1}$ | $x$ | $1$ | $\dot{x_{1}}$|$1$|$0$|
| $x_{2}$ | $y$ | $\frac{\pi}{5}$ | $\dot{x_{2}}$|$0$|$1$|
| $x_{3}$ | $x_{1}^3$ |$1$ |$3x_{1}^2\dot{x_{1}}$|$3$|$0$|
| $x_{4}$ | $5x_{2}$ | $\pi$ | $5\dot{x_{2}}$|$0$|$5$|
| $x_{5}$ | $\sin{x_{4}}$ |$\sin(\pi) = 0$ | $\cos(x_{4})\dot{x_{4}}$|$0$|$5\cos(\pi) = -5$|
| $x_{6}$ | $x_{3}+x_{5}$ |$1$| $\dot{x_{3}}+\dot{x_{5}}$ |$3$|$-5$|

The evaluation trace can be visualized with a computational graph.

![comp-graph](figs/compgraphppt.jpg)
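The same trace can be reproduced by hand in plain Python. This is only a sketch of the forward pass for this one function, not the package's implementation; each intermediate quantity is stored as a tuple of its value and its partial derivatives with respect to $x$ and $y$.

```python
import math

# Each trace entry is (value, d/dx, d/dy), mirroring the columns of the table.
x_val, y_val = 1.0, math.pi / 5

x1 = (x_val, 1.0, 0.0)                      # input x, seeded with dx/dx = 1
x2 = (y_val, 0.0, 1.0)                      # input y, seeded with dy/dy = 1
x3 = (x1[0] ** 3, 3 * x1[0] ** 2 * x1[1], 3 * x1[0] ** 2 * x1[2])          # x1^3
x4 = (5 * x2[0], 5 * x2[1], 5 * x2[2])                                     # 5 * x2
x5 = (math.sin(x4[0]), math.cos(x4[0]) * x4[1], math.cos(x4[0]) * x4[2])   # sin(x4)
x6 = (x3[0] + x5[0], x3[1] + x5[1], x3[2] + x5[2])                         # x3 + x5

value, df_dx, df_dy = x6
print(value, df_dx, df_dy)
```

Up to floating-point roundoff, the final entry gives $f = 1$ with $\partial f/\partial x = 3$ and $\partial f/\partial y = 5\cos(\pi) = -5$.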

Through the same fundamental process, building complex functions through the composition of elementary functions, automatic differentiation computes derivatives exactly to machine precision.  Our package exploits different data structures, classes, and methods to implement this technique.

## 2.3 Dual Numbers
Analogously to complex numbers, a dual number has a real part and a dual part.  We write $f = y + \epsilon y^{\prime}$, where $y$ is the real part and $y^{\prime}$ is the dual part.  The number $\epsilon$ is defined as a special constant such that $\epsilon^{2} = 0$.

#### Properties of Dual Numbers
Again by analogy to complex numbers, we can define several properties of the dual numbers, which are useful in performing calculations with them:
* Conjugate:  $f^{*} = y - \epsilon y^{\prime}$.
* Magnitude: $\left|f\right|^{2} = ff^{*} = \left(y+\epsilon y^{\prime}\right)\left(y-\epsilon y^{\prime}\right) = y^{2}$.
* Polar form: $f = y\left(1 + \epsilon\dfrac{y^{\prime}}{y}\right)$.

More significantly, the dual numbers have the property that they can be used to compute derivatives of functions.  If we wish to compute the derivative of $f(x)$, the real part of $f(x+x'\epsilon)$ gives us the value of the function while the dual part gives us the value of the derivative, as can be seen in the following example from lecture.

#### Example
Let $f(z)=z^{2}$.  We know that the derivative is $f^{\prime}(z) = 2z$.

Consider instead evaluating $f(z+z'\epsilon)$, where we have extended $z$ to the dual numbers,

$$ f(z + \epsilon z^{\prime}) = \left(z + \epsilon z^{\prime}\right)^{2}$$
$$= z^{2} + 2zz'\epsilon + z'^{2}\epsilon^{2} $$
Since $\epsilon^2 = 0$, 
$$ = z^{2} + 2zz^{\prime}\epsilon $$
$$ = f(z) + f'(z)z^{\prime}\epsilon $$

where we see that the real part is just the original function and the dual part is the derivative scaled by $z^{\prime}$.  Choosing the seed $z^{\prime} = 1$ makes the dual part exactly $f^{\prime}(z)$.

While our implementation of automatic differentiation does not use dual numbers explicitly, it is inspired by this structure.  Each object that we evaluate will contain an attribute of its value, corresponding to the real part of a dual number, and an attribute for its derivative, corresponding to the dual part of a dual number.
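The dual-number arithmetic above can be sketched in a few lines of Python. The `Dual` class below is a hypothetical illustration of the idea only (it is not part of AD20, which stores the value and derivative as attributes instead), and it supports just addition and multiplication.

```python
class Dual:
    """A dual number real + dual * eps, where eps**2 == 0."""

    def __init__(self, real, dual=0.0):
        self.real = real
        self.dual = dual

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.dual + other.dual)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since the eps**2 term vanishes.
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

    __rmul__ = __mul__

z = Dual(3.0, 1.0)   # z = 3 with seed z' = 1
f = z * z            # f(z) = z**2
print(f.real, f.dual)
```

With the seed $z^{\prime} = 1$, the real part of the result is $f(3) = 9$ and the dual part is $f^{\prime}(3) = 6$.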


## 2.4 Elementary Functions
Any complex equation can be broken into combinations of elementary operations and functions. The elementary operations include addition, subtraction, multiplication, division, and exponentiation.  The elementary functions include the natural exponential, natural logarithm, and the trigonometric functions.  These functions are considered elementary because they form the core set of elements used to build more complex functions.

We know how to explicitly compute the derivatives of both elementary operations and functions, given by the standard set of calculus rules.  We will not go into details about how to calculate the derivatives of those functions here, but more information can be found at the following link.

http://www.nabla.hr/FU-DerivativeA5.htm

Because every function we consider is composed of elementary operations and functions, and we can take the derivative of a composition of functions by applying the chain rule, we can take the derivative of any such function by knowing the derivatives of these elementary functions.  It is this principle that forms the core of the automatic differentiation algorithm implemented in this package.
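As a rough illustration of this principle, the sketch below pairs a few elementary functions with their known derivatives and differentiates a composition by applying the chain rule one function at a time. The names `ELEMENTARY` and `evaluate_composition` are hypothetical and are not taken from the package.

```python
import math

# Each elementary function is paired with its known derivative.
ELEMENTARY = {
    "sin": (math.sin, math.cos),
    "cos": (math.cos, lambda x: -math.sin(x)),
    "exp": (math.exp, math.exp),
    "log": (math.log, lambda x: 1.0 / x),
}

def evaluate_composition(names, x):
    """Evaluate a composition such as exp(sin(x)) and its derivative at x.

    `names` lists the functions from innermost to outermost.
    """
    value, derivative = x, 1.0   # the identity function: value x, derivative 1
    for name in names:
        func, dfunc = ELEMENTARY[name]
        # Chain rule: (func(u))' = func'(u) * u', using u before it is updated.
        value, derivative = func(value), dfunc(value) * derivative
    return value, derivative

val, der = evaluate_composition(["sin", "exp"], 0.5)   # exp(sin(x)) at x = 0.5
print(val, der)
```

The result should match the hand-derived derivative $e^{\sin(x)}\cos(x)$ evaluated at $x = 0.5$.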


# 3. Package Usage

## 3.1 User Interaction
Users should use ADnum objects to represent the mathematical objects whose value or derivative they would like to evaluate.  By forming ADnum objects for the function inputs, the elementary operations and functions defined for the ADnum class can be composed to create any desired function, which will also be an ADnum object with associated value and derivative attributes.  In practice, users create an ADnum object for each input variable and use the mathematical functions defined in the ADmath module to apply special functions.

## 3.2 Getting Started via Virtual Environments
For now, use the following sequence to access the module.

    git clone https://github.com/CS207-AD20/cs207-FinalProject.git

You have now downloaded our git repository of the AD20 project. Once the package is available through `pip`, use the following command instead.

    pip install AD20

For deployment purposes, we recommend using a virtual environment in case you may have different packages with different dependencies. For a quick ramp up, a virtual environment will make deployment easier by keeping multiple environments separate. 

Download virtualenv if you don't have it.

    sudo easy_install virtualenv

At the top of your directory, initialize a new virtual environment called `env`, activate the virtual environment, and install all necessary packages:

    virtualenv env
    source env/bin/activate
    pip install -r requirements.txt

`requirements.txt` has been created to help you install all the packages that you may need. The last command `pip install -r requirements.txt` will install those packages for you automatically.

Once finished, deactivate the virtual environment:

    deactivate
    

## 3.3 Importing AD20
In order to import and use ADnum, the user can use 
    
    import AD20
    
to import complete functionality of the package or 

	from AD20 import ADnum
    
	from AD20 import ADmath
    
	from AD20 import ADgraph

to use only specific modules.

## 3.4 Instantiating AD20
After importing AD20, a user creates a class instance of an ADnum with some value to be used as input to a more complex function. AD20 will be able to handle scalar or vector functions with scalar or vector inputs (i.e. one-dimensional vs two-dimensional).

### Scalar Functions with Scalar Inputs
The sequence is as follows:

1. Initialize a variable (e.g. `x`) with the specific value at which the function will be evaluated.
2. Create a function (e.g. `f`) from the variable and any elementary functions.
3. `f.val` returns the value of the function evaluated at that value.
4. `f.der` returns the derivative.

```python
    #simple scalar function, single variable
    >>> x = ADnum(2)
    >>> f = 3 * x**2
    >>> print(f.val)
    12
    >>> print(f.der)
    12
```


```python
    #complex scalar function, single variable
    >>> x = ADnum(1)
    >>> f = 3 * x**3 + ADmath.sin(x)
    >>> print(f.val)   # 3 + sin(1), approximately 3.8415
    >>> print(f.der)   # 9 + cos(1), approximately 9.5403
```

In the case of a function of more than one variable, the sequence is similar, except that:

1. each partial derivative is calculated separately (`f.der(x)` and `f.der(y)`)
2. the entire gradient is returned as a list when `f.gradient` is called
    
```python
    #complex scalar function, multi variable
    >>> x = ADnum(2)
    >>> y = ADnum(3)
    >>> f = 3 * x**3 + 2 * y**3
    >>> print(f.val)
    78
    >>> print(f.der(x))
    36
    >>> print(f.der(y))
    54
    >>> print(f.gradient)
    [36, 54]
```


### Scalar Functions with Vector Inputs
The sequence is as follows:

1. Initialize separate values for each variable as a list (make sure all the variables are of the same dimension).
2. Each derivative calculation will now return a list of derivatives, evaluated at all the separate values initialized earlier (it will have the same dimension as the input).

```python
    #scalar function, multi variable, vector inputs
    >>> x = ADnum([2,3])
    >>> y = ADnum([3,4])
    >>> f = x**3 + y**2
    >>> print(f.val)
    [17, 43]
    >>> print(f.der(x))
    [12, 27]
    >>> print(f.der(y))
    [6, 8]
    >>> print(f.gradient)
    [[12, 27], [6, 8]]
```

In each example, the inputs (`x` and `y`) and the function `f` are all ADnum objects, which have the attributes described in the class implementation below.  In particular, `x` is just an input variable holding the value at which the function is evaluated, while `f` represents a function of the input variables.  Special functions such as `ADmath.sin()` are defined to take an ADnum object as input and output an ADnum object.  Through more complex combinations of basic operations and elementary functions, the user can define any function of the input as an ADnum object.


# 4. Software Organization
We would like to let the user use all numerical operations defined in our AD20 package. The AD20 package contains the `ADnum` module, the `ADmath` module, and the `ADgraph` module.

For either a scalar or vector input (either as a numpy array or a list), we will convert the input into an `ADnum` object, which can interact with the other modules. `ADnum` will also contain an overloaded version of basic operations, including addition, subtraction, multiplication, division, and exponentiation, so that the value and derivative are correctly updated after combining ADnum objects through each of these operations.

For special functions, we will use `ADmath` to compute the numerical values and the corresponding derivatives. In particular, `ADmath` will contain functions abs, exp, log, sin, cos, and tan.

To show a calculation graph, we use `ADgraph` (and `ADtable`) to show the forward mode calculation process.

###  4.1 Directory Structure
    AD20/
        docs/
            Milestone 1.ipynb
            Milestone 2.ipynb
            requirements.txt
            figs/
        AD20/
            __init__.py
            ADnum/
                __init__.py
                ADnum.py
            ADmath/
                __init__.py
                ADmath.py
            ADgraph/
                __init__.py
                ADgraph.py
                ADtable.py
        Tests/
            __init__.py
            test_AD20.py
        README.md
        setup.py
        LICENSE

###  4.2 Modules and Functionality
Our package consists of three main modules:

- **ADnum:** Contains the `ADnum` class (fully described below).  Create `ADnum` objects, which (inspired by the dual numbers) are defined by the attributes of a value and a derivative, from numbers or tuples.  Define all of the numerical operations for `ADnum` objects, so that they correctly track all derivatives.

- **ADmath:** Define elementary functions for `ADnum` objects, correctly tracking all of the derivatives.

- **ADgraph:** Create `ADgraph` objects, which can be used to show the computation process in either a graph (ADgraph.py) or table (ADtable.py)

###  4.3 Testing and Coverage
The tests will be stored in the `Tests` directory (see the directory structure above).  We will use `pytest` to perform our testing, with `TravisCI` for continuous integration and `Coveralls` for verifying code coverage.

###  4.4 Package Distribution
We will distribute our package through `PyPI`, so that it can be installed with `pip`. This will allow the user to install the package by using the command

    pip install AD20

# 5. Implementation
Automatic differentiation will be implemented through `ADnum` objects: the functions we want to differentiate are built from `ADnum` objects, using the overloaded basic operations together with the special functions defined in the `ADmath` module.  Each such function is itself an `ADnum` object, and so has an associated value and derivative, which were updated as the object was constructed.

### 5.1 Core Data Structures
The main data structure used to represent the functions on which we are performing automatic differentiation will be tuples, with the first entry the value of the ADnum object and the second entry its derivative.  In the case of scalar input, the derivative is also a float.  For vector valued input, the derivative is the gradient of the function, stored as a numpy array.

In order to build and store computational graphs in the ADgraph module, we will use a dictionary to represent the graph, where the keys are the nodes of the graph, stored as ADnum objects, and the values associated with each key are the children of that node, stored as lists of ADnum objects.
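The dictionary layout described above might look like the following sketch for $f(x, y) = x^3 + \sin(5y)$. Plain strings stand in for the `ADnum` nodes, and the helper names `find_inputs` and `forward_order` are hypothetical; they only illustrate how input nodes and a forward evaluation order could be recovered from such a dictionary.

```python
# Nodes are plain strings here; in the package they would be ADnum objects.
graph = {
    "x": ["x**3"],
    "y": ["5*y"],
    "5*y": ["sin(5*y)"],
    "x**3": ["x**3 + sin(5*y)"],
    "sin(5*y)": ["x**3 + sin(5*y)"],
    "x**3 + sin(5*y)": [],
}

def find_inputs(graph):
    """Return the nodes that never appear as a child, i.e. the nodes with no parents."""
    children = {child for kids in graph.values() for child in kids}
    return sorted(node for node in graph if node not in children)

def forward_order(graph):
    """Return the nodes in an order where every node comes after all of its parents."""
    parents_left = {node: 0 for node in graph}
    for kids in graph.values():
        for child in kids:
            parents_left[child] += 1
    ready = [node for node, count in parents_left.items() if count == 0]
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for child in graph[node]:
            parents_left[child] -= 1
            if parents_left[child] == 0:
                ready.append(child)
    return order

inputs = find_inputs(graph)
order = forward_order(graph)
print(inputs)
print(order)
```

Storing children rather than parents makes it natural to walk the graph from the inputs toward the output, which is the direction forward mode needs.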

### 5.2 Implemented Classes
The main class will be implemented in the `ADnum` module, which will create `ADnum` objects.  It takes as input a single scalar input or a vector input (as either a numpy array or list) and outputs an `ADnum` object.  The `ADnum` objects will store the current value of the function and its derivative as attributes.  By combining simple `ADnum` objects with basic operations and simple functions, we can construct any function we like.  For example,

```python
#ADnum.py
import numpy as np

# Methods of the ADnum class
def __mul__(self, other):
    try:
        # product rule
        return ADnum(self.val*other.val, self.val*other.der+self.der*other.val)
    except AttributeError:  # when other is a constant
        return ADnum(self.val*other, other*self.der)

def __pow__(self, other, modulo=None):
    try:
        # d(u**v) = v*u**(v-1)*du + u**v*log(u)*dv
        return ADnum(self.val**other.val, other.val*(self.val**(other.val-1))*self.der+(self.val**other.val)*np.log(self.val)*other.der)
    except AttributeError:  # when other is a constant
        return ADnum(self.val**other, other*(self.val**(other-1))*self.der)
```

```python
    X = AD20.ADnum(4)
    Y = AD20.ADnum(0)
    F = X + ADmath.sin(Y)
```    
Here `F` is now an `ADnum` object, and `ADmath.sin()` is a specially defined sine function which takes as input an `ADnum` object and returns an `ADnum` object, which allows us to evaluate `F` and its derivative,

```python
    F.val = 4
    F.der = [1, 1] 
    X.val = 4
    X.der = 1
```

Notice that F.der now gives the gradient of F with respect to the input `ADnum` objects X and Y.
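One way such a gradient could be tracked is to seed each input with a unit vector marking its own position in the gradient. The sketch below assumes this seeding scheme and uses a hypothetical `GradNum` class in place of `ADnum`; it is not the package's implementation, and it supports only addition and sine.

```python
import math

class GradNum:
    """Hypothetical stand-in for ADnum: a value plus a list of partial derivatives."""

    def __init__(self, val, der):
        self.val = val
        self.der = list(der)

    def __add__(self, other):
        # The gradient of a sum is the elementwise sum of the gradients.
        return GradNum(self.val + other.val,
                       [a + b for a, b in zip(self.der, other.der)])

def sin(X):
    """Sine of a GradNum: chain rule applied to every partial derivative."""
    return GradNum(math.sin(X.val), [math.cos(X.val) * d for d in X.der])

# Seed each input with a unit vector marking its own position in the gradient.
X = GradNum(4.0, [1.0, 0.0])
Y = GradNum(0.0, [0.0, 1.0])
F = X + sin(Y)
print(F.val, F.der)
```

With these seeds, `F.val` is 4 and `F.der` holds the partial derivatives with respect to X and Y, namely 1 and $\cos(0) = 1$.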

In addition to the sine function used in the example above, the `ADmath` module will also implement the other trigonometric functions, the natural exponential, and the natural logarithm.  All of the functions defined in the `ADmath` module define elementary functions of `ADnum` objects, so that the output is also an `ADnum` object with the `val` and `der` attributes updated appropriately.  For example,

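```python
#ADmath.py
import numpy as np
from AD20 import ADnum as adn  # assumed import path for the module defining ADnum

def sin(X):
    try:
        # chain rule: d/dx sin(u) = cos(u) * u'
        return adn.ADnum(np.sin(X.val), np.cos(X.val)*X.der)
    except AttributeError:
        # X is a plain number: treat it as a constant with zero derivative
        X = adn.ADnum(X, 0)
        return sin(X)

def log(X):
    try:
        # chain rule: d/dx log(u) = u' / u
        return adn.ADnum(np.log(X.val), 1/X.val*X.der)
    except AttributeError:
        X = adn.ADnum(X, 0)
        return log(X)
```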

We will also implement a class, `ADgraph`, for computational graphs.  The constructor takes as input a dictionary, as described above, where the keys are nodes and the values are the children of the key node.  This can then be used to perform forward propagation and could later be extended to include back propagation.
 
### 5.3 Class Methods and Attributes
Each `ADnum` object will have two attributes for the two major functions desired of the class.  The `val` attribute will be the ADnum object evaluated at the given value and the `der` attribute will be its derivative. The constructor for this class sets the value of the object and optionally also sets the value of its derivative,

```python
#ADnum.py
class ADnum():
    def __init__(self, a, d = 1):
        self.val = a
        self.der = d
        self.graph = {}
```

In addition, each `ADnum` object will have a graph attribute, which stores the dictionary which can be used to build a computational graph in the ADgraph class.  The ADnum class will also include methods to overload basic operations: `__add__()`, `__radd__()`, `__mul__()`, `__rmul__()`, `__sub__()`, `__truediv__()`, and `__pow__()`.  As a result, adding, subtracting, multiplying, dividing, or exponentiating two `ADnum` objects returns an `ADnum` object, as does adding a constant to or multiplying a constant by an `ADnum` object.  For example, Y1, Y2, and Y3 would all be recognized as ADnum objects:

```python
    X1= ADnum(7)
    X2 = ADnum(15)
    Y1 = X1+X2
    Y2 = X1*X2+X1
    Y3 = 5*X1+X2+100
```

The resulting ADnum objects have both a value and derivative.
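To show how the reflected methods make expressions with constants on the left work, here is a simplified, single-variable sketch. The class name `MiniADnum` and its `_lift` helper are hypothetical; the real `ADnum` class also tracks a graph and supports more operations.

```python
class MiniADnum:
    """Simplified, single-variable stand-in for ADnum (illustrative only)."""

    def __init__(self, val, der=1.0):
        self.val = val
        self.der = der

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants, whose derivative is 0.
        return other if isinstance(other, MiniADnum) else MiniADnum(other, 0.0)

    def __add__(self, other):
        other = self._lift(other)
        return MiniADnum(self.val + other.val, self.der + other.der)

    __radd__ = __add__   # addition commutes, so 100 + x reuses x + 100

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (uv)' = u v' + u' v.
        return MiniADnum(self.val * other.val,
                         self.val * other.der + self.der * other.val)

    __rmul__ = __mul__   # multiplication commutes, so 5 * x reuses x * 5

x = MiniADnum(7.0)
y = 5 * x * x + x + 100   # y = 5x^2 + x + 100
print(y.val, y.der)
```

At $x = 7$ this gives $y = 352$ and $y^{\prime} = 10x + 1 = 71$. Python falls back to `__rmul__` and `__radd__` when the left operand is a plain number that does not know how to combine with the custom class.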

The `ADgraph` class will be constructed from a dictionary, stored in the attribute dict.  This class will also have an attribute inputs, which stores the nodes which have no parents.  This class will implement a deriv method which returns the derivative from the computational graph.

### 5.4 External Dependencies
In order to implement the elementary functions, our class will rely on numpy’s implementation of the trigonometric functions, exponential functions, and natural logarithms for evaluation of these special functions, as demonstrated in the definition of the sine function for `ADnum` objects above.

We will also use numpy to implement matrix and vector multiplication in cases where the function is either vector valued or takes a vector as an input.

### 5.5 Elementary Functions
As outlined above, all elementary operations will be defined for `ADnum` objects within the `ADnum` class.  A separate `ADmath` module will define the trigonometric, exponential, and logarithmic functions to be used on `ADnum` objects, so that each takes an `ADnum` object as input and returns an `ADnum` object.  Together these complete the set of definitions of all elementary operations and functions that can be composed to construct more complex functions.

### 5.6 What Comes Next?
Once this basic implementation is done, we hope to speed up the differentiation process by creating a computational graph that will be capable of storing the intermediate derivatives; this removes the need for the program to recompute derivatives for every distinct value the user passes in. Keeping track of the derivatives at every step and evaluating only when necessary will improve the efficiency of the program.

Another implementation may be the 