# Documentation: CS207 Group 21

# Introduction

Differentiation has ubiquitous applications in mathematics, the sciences, and engineering, so it is useful for computer programs to carry out differentiation automatically across a wide variety of cases. For computationally heavy projects, this ability becomes even more critical, as working out derivatives by hand is impractical. Although methods such as *numerical differentiation* and *symbolic differentiation* exist for determining derivatives computationally, both have limitations. In the following, we briefly review *numerical differentiation* and *symbolic differentiation* to highlight some of their difficulties before describing *automatic differentiation* and the advantages it offers over the other two methods.

### Numerical Differentiation
In *numerical differentiation*, the value of derivatives is approximated using the following formula:

$$
\frac{\partial{f(x)}}{\partial{x}} \approx \frac{f(x+h)-f(x)}{h}
$$

However, when $h$ is too small, the numerical approximation fluctuates about the analytical answer: the difference $f(x+h)-f(x)$ is dominated by round-off error arising from the limited precision of floating-point arithmetic. On the other hand, when $h$ is too large, the approximation becomes inaccurate because of the error inherent in the formula itself, known as truncation error.
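
The trade-off between the two error sources can be observed directly. The snippet below is a minimal sketch, not part of our package: the helper name `forward_difference`, the test function `sin`, and the three step sizes are illustrative assumptions.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the forward-difference formula above."""
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)  # analytical derivative of sin at x0

# Absolute error for a large, a moderate, and a very small step size.
errors = {h: abs(forward_difference(math.sin, x0, h) - exact)
          for h in (1e-1, 1e-8, 1e-15)}

for h, err in errors.items():
    print(f"h = {h:.0e}   absolute error = {err:.2e}")
```

With double-precision floats, the moderate step size should give the smallest error of the three: the largest step suffers from truncation error and the smallest from round-off error.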

### Symbolic Differentiation
In *symbolic differentiation*, expressions are manipulated automatically to obtain the required derivatives. At its heart, *symbolic differentiation* applies transformations that capture the various rules of differentiation in order to manipulate the expressions. However, *symbolic differentiation* requires careful and sometimes ingenious design, as careless manipulation can easily produce large expressions that consume a lot of computational power and time, a problem known as expression swell.

### Automatic Differentiation
As seen above, both *numerical differentiation* and *symbolic differentiation* have their respective issues when it comes to computing derivatives. These issues are further exacerbated when calculating higher order derivatives, where both errors and complexity increase. *Automatic differentiation* overcomes these issues by recognizing that every differentiation, no matter how complicated, can be executed in a stepwise fashion, with each step being an execution of either an elementary arithmetic operation (addition, subtraction, multiplication, division) or an elementary function (sin, sqrt, exp, log, etc.). To track the evaluation of each step, *automatic differentiation* produces computational graphs and evaluation traces. To compute the derivatives, *automatic differentiation* applies the chain rule repeatedly at all steps. By taking a stepwise approach and using the chain rule, *automatic differentiation* circumvents the issues encountered by both *numerical differentiation* and *symbolic differentiation* and automatically computes derivatives to a high level of precision. To further understand *automatic differentiation*, we present its mathematical background and essential ideas in the next section.

There are two primary _modes_ of automatic differentiation: _forward mode_ and _reverse mode_.  We cover each mode in greater depth in the Background section.

Note - In our research of automatic differentiation, we referred to the following resources:

Baydin, A.G., Pearlmutter, B.A., Radul, A. A. & Siskind, J.M. (2018). Automatic differentiation in machine learning: A survey. *Journal of Machine Learning Research, 18*, 1-42.

Geeraert, S., Lehalle, C.A., Pearlmutter, B., Pironneau, O. & Reghai, A. (2017). Mini-symposium on automatic differentiation and its applications in the financial industry. *ESAIM: Proceedings and Surveys* (pp. 1-10).

Berland, H. (2006). *Automatic differentiation* [PowerPoint Slides]. Retrieved from http://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf

Rufflewind (2016). Reverse-mode automatic differentiation: a tutorial. Retrieved Nov 19, 2019, from https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation

# Background

As mentioned before, *automatic differentiation* employs a stepwise approach and chain rule to automatically compute derivatives. We shall first state the chain rule in calculus before showing an example production of an evaluation trace and computational graph. Next, we discuss one mode of *automatic differentiation*, namely the forward mode. In particular, the demonstration of the use of chain rule at each step to determine derivatives will be shown here. Finally, we touch on the use of dual numbers in *automatic differentiation*. 

### Chain Rule 
For a function $f(u(t),v(t))$, the chain rule is given by

$$
\begin{align}
 \frac{\partial f}{\partial t} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial t} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial t}
\end{align}
$$

The chain rule is essential for automatic differentiation as the forward mode applies the chain rule repeatedly at each step of the evaluation trace in order to determine the derivatives at each step (see below).
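
As a brief worked illustration (the particular functions are chosen here only as an example), take $f(u,v) = uv$ with $u(t) = \sin(t)$ and $v(t) = t^{2}$. Applying the chain rule above gives

$$
\begin{align}
\frac{\partial f}{\partial t} &= v\cos(t) + u \cdot 2t \\
&= t^{2}\cos(t) + 2t\sin(t)
\end{align}
$$

which agrees with differentiating $t^{2}\sin(t)$ directly using the product rule.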

### Example Production of Evaluation Trace & Computational Graph
The most straightforward way to show the generation of an evaluation trace and computational graph is to consider an example. For this purpose, we study the following function 

$$
f(x,y) = sin(x) + 4y
$$

#### Evaluation Trace
The evaluation trace breaks the function into individual steps and builds up the function starting from the input variables. At each step, exactly one elementary arithmetic operation (addition, subtraction, multiplication, division) or elementary function (sin, sqrt, exp, log, etc.) is used to build the function for the next step. The evaluation trace for our function of interest is shown in the table below.

| Trace | Elementary Function | Current Value | Comment               | 
| :---: | :-----------------: | :-----------: | :-------------------: | 
| $x_{1}$ | $x_{1}$           | $x$           | Input x               |
| $x_{2}$ | $x_{2}$           | $y$           | Input y               |
| $x_{3}$ | $sin(x_{1})$      | $sin(x)$      | Elementary function   |
| $x_{4}$ | $4*x_{2}$         | $4y$          | Elementary arithmetic |
| $x_{5}$ | $x_{3}+x_{4}$     | $sin(x) + 4y$ | Elementary arithmetic |
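
The same trace can be written out as straight-line code, one assignment per table row. This is only an illustrative sketch; the function name `evaluation_trace` and the sample inputs are assumptions made for the example.

```python
import math

def evaluation_trace(x, y):
    """Return the intermediate values x1..x5 for f(x, y) = sin(x) + 4y."""
    x1 = x              # input x
    x2 = y              # input y
    x3 = math.sin(x1)   # elementary function
    x4 = 4 * x2         # elementary arithmetic
    x5 = x3 + x4        # elementary arithmetic
    return {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5}

trace = evaluation_trace(math.pi / 2, 3.0)
print(trace["x5"])  # sin(pi/2) + 4*3
```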


#### Computational Graph 
The computational graph translates the essence of the evaluation trace into a graph and captures the relationship between each step. Refer to the figure below for the computational graph of our function of interest.  

![computational-graph](Computational_Graph.png)

### Forward Mode
Armed with the knowledge of the chain rule, evaluation trace and computational graph, we can now consider the forward mode of *automatic differentiation*. The table below shows the earlier evaluation trace table that has now been expanded to include columns that store derivatives. At each step, the chain rule is applied to the elementary function to determine the elementary function derivative.

For instance, 

Trace $x_{3}$

$$
\begin{align}
\dot{x_{3}} &= \frac{\partial{sin(x_{1})}}{\partial{x_{1}}} \dot{x}_{1} \\
&= cos(x_{1})\dot{x}_{1}
\end{align} 
$$

Trace $x_{5}$

$$
\begin{align}
\dot{x}_{5} &= \frac{\partial{(x_{3}+x_{4})}}{\partial{x_{3}}} \dot{x}_{3} +  \frac{\partial{(x_{3}+x_{4})}}{\partial{x_{4}}} \dot{x}_{4} \\
&= \dot{x}_{3}+\dot{x}_{4}
\end{align} 
$$

| Trace | Elementary Function | Current Value | Elementary Function Derivative | $\nabla_{x}$ Value  | $\nabla_{y}$ Value  | 
| :---: | :-----------------: | :-----------: | :--------------------------: | :---------------------: | :---------------------: | 
| $x_{1}$ | $x_{1}$       | $x$           | $\dot{x}_{1}$             | $1$      | $0$ |
| $x_{2}$ | $x_{2}$       | $y$           | $\dot{x}_{2}$             | $0$      | $1$ |
| $x_{3}$ | $sin(x_{1})$  | $sin(x)$      | $cos(x_{1})\dot{x}_{1}$   | $cos(x)$ | $0$ |
| $x_{4}$ | $4*x_{2}$     | $4y$          | $4\dot{x}_{2}$            | $0$      | $4$ |
| $x_{5}$ | $x_{3}+x_{4}$ | $sin(x) + 4y$ | $\dot{x}_{3}+\dot{x}_{4}$ | $cos(x)$ | $4$ |

As seen from the table above, the derivative of elementary functions such as $sin$ has to be done manually and this has implications for our design of the *automatic differentiation* package later. Specifically speaking, we would need to define separate classes for each elementary function. For more details, refer to the Implementation section below.

In addition, the first and second rows have initial values of $(1,0)$ and $(0,1)$ respectively in the $\nabla_{x}$ and $\nabla_{y}$ columns. These are seed values for the stepwise propagation of the derivatives. The forward mode calculates the dot product between the gradient of our function and the seed vector (i.e. the directional derivative). In this case, we have a scalar function of two variables; for a vector function of vectors, the forward mode calculates the product of the Jacobian matrix ($J$) and the seed vector ($p$) (i.e. $J \cdot p$).
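
The role of the seed vector can be sketched in a few lines of code. This is a hand-written illustration for $f(x,y) = sin(x) + 4y$ only, not the package's implementation; the function name `forward_mode` and the evaluation point are assumptions.

```python
import math

def forward_mode(x, y, seed):
    """Propagate (value, derivative) pairs through the trace of sin(x) + 4y.

    `seed` is the pair (dx1, dx2) that initialises the derivative column.
    """
    dx1, dx2 = seed
    x1, x2 = x, y
    x3, dx3 = math.sin(x1), math.cos(x1) * dx1   # chain rule for sin
    x4, dx4 = 4 * x2, 4 * dx2                    # chain rule for 4 * x2
    x5, dx5 = x3 + x4, dx3 + dx4                 # chain rule for the sum
    return x5, dx5

# One pass per seed vector: (1, 0) gives df/dx and (0, 1) gives df/dy.
value, df_dx = forward_mode(0.0, 2.0, seed=(1, 0))
_, df_dy = forward_mode(0.0, 2.0, seed=(0, 1))

# A general seed gives the directional derivative grad(f) . seed.
_, directional = forward_mode(0.0, 2.0, seed=(2, 3))
```

At the point $(0, 2)$ the gradient is $(cos(0), 4) = (1, 4)$, so the seed $(2, 3)$ yields $2 \cdot 1 + 3 \cdot 4 = 14$.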

### Dual Numbers
Dual numbers extend the real number line in another direction by adding a second component. This extension is analogous to the extension of real numbers by imaginary numbers. The general form of a dual number is given by 

$$ x = a + \epsilon b, $$

where $\epsilon$ is defined as $\epsilon^2 = 0$, $a$ is the real part and $b$ is the dual part of the dual number.

In our *automatic differentiation* package, we can define a dual class that has two attributes. One of these attributes stores the value of the function while the other stores the value of the derivatives. This is similar to having a dual number with the value of a function as the real part and the value of derivatives as the dual part. Having such a dual number structure allows us to carry out the expected arithmetic operations between two dual instances.

#### Addition

$$ 
\begin{align}
(x +\epsilon \dot{x}) + (y +\epsilon \dot{y}) &= (x+y) + \epsilon(\dot{x}+\dot{y})
\end{align}
$$ 

#### Subtraction

$$ 
\begin{align}
(x +\epsilon \dot{x}) - (y +\epsilon \dot{y}) &= (x-y) + \epsilon(\dot{x}-\dot{y})
\end{align}
$$ 

#### Multiplication

$$ 
\begin{align}
(x +\epsilon \dot{x})*(y +\epsilon \dot{y}) &= xy+\epsilon x\dot{y}+\epsilon \dot{x}y+\epsilon^2\dot{x}\dot{y}\\
&= xy + \epsilon(x\dot{y} + \dot{x}y)
\end{align}
$$ 

#### Division

$$ 
\begin{align}
(x +\epsilon \dot{x}) / (y +\epsilon \dot{y}) &= \frac{(x +\epsilon \dot{x})(y -\epsilon \dot{y})}{(y +\epsilon \dot{y})(y - \epsilon \dot{y})} \\
&= \frac{xy-\epsilon x\dot{y}+\epsilon \dot{x}y-\epsilon^2\dot{x}\dot{y}}{y^2-\epsilon^2\dot{y}^2} \\
&= \frac{xy + \epsilon(-x\dot{y} + \dot{x}y)}{y^2} \\
&= \frac{x}{y} + \frac{\epsilon(y\dot{x}-x\dot{y})}{y^2} 
\end{align}
$$ 
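
The four rules above translate directly into operator overloads. The class below is a minimal sketch written for this section; the name `Dual` and its attributes `real` and `dual` are illustrative assumptions and do not reproduce the package's `DualNumber` class.

```python
class Dual:
    """Minimal dual number a + eps*b with eps**2 == 0."""

    def __init__(self, real, dual=0.0):
        self.real = real   # value of the function
        self.dual = dual   # value of the derivative

    def __add__(self, other):
        return Dual(self.real + other.real, self.dual + other.dual)

    def __sub__(self, other):
        return Dual(self.real - other.real, self.dual - other.dual)

    def __mul__(self, other):
        # xy + eps*(x*dy + dx*y)
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

    def __truediv__(self, other):
        # x/y + eps*(y*dx - x*dy) / y**2
        return Dual(self.real / other.real,
                    (other.real * self.dual - self.real * other.dual)
                    / other.real ** 2)

x = Dual(3.0, 1.0)   # seed dx = 1
y = Dual(2.0, 0.0)   # seed dy = 0
product = x * y      # d(xy)/dx = y
quotient = x / y     # d(x/y)/dx = 1/y
```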

In sum, this section covers the mathematical background and essential ideas of *automatic differentiation* for a scalar function with two variables. These basic concepts can be extended easily to higher dimensions if needed. In fact, our *automatic differentiation* package will not only handle scalar functions of scalar and vector values, but also vector functions of vectors.

## Reverse Mode

The reverse mode is fundamentally different from the forward mode in its approach to automatic differentiation. In particular, the reverse mode consists of both a forward pass and a reverse pass, with no chain rule applied in the forward pass (only partial derivatives are stored). A similar evaluation trace and computational graph can be derived for the reverse mode, but there are crucial differences. We revisit our earlier forward mode example and use the reverse mode instead to obtain the derivatives.

$$
f(x,y) = sin(x) + 4y
$$

For the evaluation trace, the forward pass of reverse mode produces the following.

| Trace | Elementary Function | Current Value | Derivative w.r.t $Child_{1}$ | $Child_{1}$ Value  | Derivative w.r.t $Child_{2}$ | $Child_{2}$ Value  | 
| :---: | :-----------------: | :-----------: | :--------------------------: | :---------------------: | :---------------------: |:---------------------: | 
| $x_{1}$ | $x_{1}$       | $x$           | $1$            | $1$      |  -  | - |
| $x_{2}$ | $x_{2}$       | $y$           | $1$            | $1$      |  -  | - |
| $x_{3}$ | $sin(x_{1})$  | $sin(x)$      | $cos(x_{1})$   | $cos(x)$ |  -  | - |
| $x_{4}$ | $4*x_{2}$     | $4y$          | $4$            | $4$      |  -  | - |
| $x_{5}$ | $x_{3}+x_{4}$ | $sin(x) + 4y$ | $1$            | $1$      | $1$ |$1$|

Carrying out the reverse pass, we have

$$
\bar{x}_{5} = \frac{\partial{f}}{\partial{{x}_{5}}} = 1
$$

$$
\bar{x}_{4} = \frac{\partial{f}}{\partial{{x}_{5}}}\frac{\partial{{x}_{5}}}{\partial{{x}_{4}}} = 1
$$

$$
\bar{x}_{3} = \frac{\partial{f}}{\partial{{x}_{5}}}\frac{\partial{{x}_{5}}}{\partial{{x}_{3}}} = 1
$$

$$
\bar{x}_{2} = \frac{\partial{f}}{\partial{{x}_{4}}}\frac{\partial{{x}_{4}}}{\partial{{x}_{2}}} = 4
$$

$$
\bar{x}_{1} = \frac{\partial{f}}{\partial{{x}_{3}}}\frac{\partial{{x}_{3}}}{\partial{{x}_{1}}} = cos(x)
$$

The result of a reverse mode is only determined after the reverse pass is done, and the value of each variable or parent node at each stage depends on the values of its children nodes (refer to the earlier computational graph for children nodes). 

Comparing the results of the reverse mode with those of the forward mode, we see that we arrive at the same result, as expected. However, for this scalar function the reverse mode calculates all elements of the Jacobian matrix (here a single row) in one reverse pass, whereas the forward mode requires a different seed vector, and hence a separate pass, for each input variable.

The reverse mode calculates the product of the transpose of the Jacobian matrix ($J$) and the seed vector ($p$) (i.e. $J^{T} \cdot p$). Overall, the reverse mode is more efficient than the forward mode when the number of inputs is greater than the number of functions.
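
The forward and reverse passes for our example can be sketched by hand as follows. This is an illustration for $f(x,y) = sin(x) + 4y$ only, not the package's implementation; the function name `reverse_mode` and the local variable names are assumptions.

```python
import math

def reverse_mode(x, y):
    """Return f(x, y) = sin(x) + 4y and its gradient via a reverse pass."""
    # Forward pass: compute values and store local partial derivatives only.
    x1, x2 = x, y
    x3 = math.sin(x1)
    x4 = 4 * x2
    x5 = x3 + x4
    d3_d1 = math.cos(x1)   # d x3 / d x1
    d4_d2 = 4.0            # d x4 / d x2
    d5_d3 = 1.0            # d x5 / d x3
    d5_d4 = 1.0            # d x5 / d x4

    # Reverse pass: start from the output and apply the chain rule backwards.
    x5_bar = 1.0
    x4_bar = x5_bar * d5_d4
    x3_bar = x5_bar * d5_d3
    x2_bar = x4_bar * d4_d2
    x1_bar = x3_bar * d3_d1
    return x5, (x1_bar, x2_bar)

value, gradient = reverse_mode(0.0, 2.0)
```

A single call returns both partial derivatives, whereas the forward mode needs one pass per input variable.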

# How to Use Package

## Installation

To begin, the user has to work in a `python` environment (preferably version >= 3.7). It is advisable for the user to create a new virtual environment for interacting with our package. To create a new virtual environment, enter the following command in the terminal:

`conda create -n env_autodiff python=3.7`

After which, activate the environment with the following command:

`conda activate env_autodiff`

Since we have used PyPI to host our package, users can download our Automatic Differentiation package with the following command in the terminal:

`pip install autodiffing`

As we have set up the pip package in a way such that the required dependencies will be installed during `pip install autodiffing`, users need not worry about not having the required dependencies when using the Automatic Differentiation package. 

If the required dependencies are not installed during pip install, users can refer to the contents of requirements.txt below and pip install the main dependencies themselves. Alternatively, users can visit https://pypi.org/project/autodiffing/#files to download the latest gzipped tar file and extract its contents to get the requirements.txt file. In the extracted folder containing the requirements.txt file, users need to run the following command in the terminal to install the required dependencies:

`pip install -r requirements.txt`

Our requirements.txt lists a number of packages that come with the installation of `python` version 3.7 and of our main packages, but the main packages that we require for our Automatic Differentiation package are: 

`numpy==1.17.4`\
`matplotlib==3.1.1`\
`scipy==1.3.2`\
`math`

`numpy` is essential for our Automatic Differentiation package as we require it for the calculation of our elementary functions, and for dealing with arrays and matrices when there are vector functions and vector inputs.

`matplotlib` is needed for any potential visualization of our outputs.

`scipy` is a good package to have for its optimization and linear algebra abilities.

`math`, which is part of the Python standard library, is used for a couple of numerical comparisons and to restrict the domains of certain functions.

## Using the Package

Once users have installed all the dependencies and the package itself, they may begin to use our package to quickly find derivatives of functions.  For this section, we walk through three different examples of how users can interact with the package for their purposes.

Note that we made some implementation and stylistic changes from Milestone 2 to the final documentation.  We will review a few of these changes here and will give greater justification in the software organization and implementation details sections for the changes.  First, the user should import necessary modules as:

```python
from AD.DualNumber import DualNumber
from AD import ElementaryFunctions as EF
from AD.Parallelized import Parallelized_AD
```

Whereas in Milestone 2, we limited user imports to the DualNumber and ElementaryFunctions modules, we now add functionality for vectorized inputs and outputs, which we implement in our Parallelized module (and specifically Parallelized_AD class).  Moreover, we extend our class to have reverse mode functionality, the background for which we described in the introduction and background sections.  Once users have imported our classes, they can initialize functions and variables as follows:

```python
# func and varis will eventually be inputs to our Automatic Differentiation implementation
func = ['_x + sin(_y)*_z', '_x + sin(_y)*exp(_z)']
varis = ['x', 'y', 'z']
```

Note that when the user initializes a function, he or she _must_ adhere to the format seen above.  That is, all variables must be preceded by a leading underscore '\_'.   This is a significant departure from our first two milestones, when we asked the user to set up variables as x = DualNumber(5), func = EF.Sin(x).  We believe our new user interface makes the software more intuitive and easier to use.  In particular, rather than having to set up a new DualNumber for _every_ new variable, the user may just list the variables in an array.  Then, the user may specify the function symbolically, which we believe more closely resembles a "natural" syntax.

That is, in the final documentation, we _hid_ the implementation details from the user and allowed them to interact with the class 'symbolically'.  We provide the user with a dictionary of functions, including, but not limited to, _sin_ , _cos_ , _log_ , and _logistic_ (as four examples).

So, for example, after the user has initialized the function and variables as above, he or she may use the reverse mode of automatic differentiation as follows:

```python
>>> PAD = Parallelized_AD(fun = func, var = varis)
>>> PAD.get_Jacobian([1,2,3])
array([[ 1.        , -1.24844051,  0.90929743],
       [ 1.        , -8.35853265, 18.26372704]])

>>> PAD.get_value([1,2,3])
array([ 3.72789228, 19.26372704])

>>> PAD.get_Jacobian([1,2,3], forward=True)
array([[ 1.        , -1.24844051,  0.90929743],
       [ 1.        , -8.35853265, 18.26372704]])
```

Note that the default differentiation mode is reverse mode, but this can be changed by setting forward=True as we did in the last call.  The user should always instantiate a Parallelized_AD object before taking the Jacobian or returning a value.  During initialization, the user should pass in one argument for the function, which may be vector or scalar valued, as an array of strings (one for each function), and one argument for the variables, as an array of strings holding each variable's name.

To extract the Jacobian, the user must specify a point at which to take the Jacobian.  In the example above, the Jacobian was taken at [x,y,z] = [1,2,3].  If, for example, the user took the Jacobian at [1,2], or some other invalid point, then our Parallelized_AD class would throw an error.


However, it is more likely that users will want to use our class for more complicated uses.  Here we wish to find the value of the function

$$f_1(x,y,z) = x + \sin(y)z, f_2(x,y,z) = x+\sin(y)\exp(z)$$

That is, our function output is a two-dimensional vector which takes in a three-dimensional vector as an input.  We demo our code below:


```python
# note: sys is used here only for the sake of the Jupyter notebook hosted on GitHub, to make this session interactive
import sys
from scipy.optimize import fsolve
import numpy as np

sys.path.insert(0, '../AutoDiff/AD')
import ElementaryFunctions as EF
from DualNumber import DualNumber
from Parallelized import Parallelized_AD
func = ['_x + sin(_y)*_z', '_x + sin(_y)*exp(_z)']
PAD = Parallelized_AD(fun = func, var = ['x', 'y', 'z'])

print("VALUE: ")
print(PAD.get_value([1,2,3]))
```

```
VALUE: 
[ 3.72789228 19.26372704]
```


Now suppose the user wants to add a third function, with a fourth variable $w$.  The new function will look like:

$$f_1(x,y,z,w) = x + \sin(y)z, f_2(x,y,z,w) = x+\sin(y)\exp(z), f_3(x,y,z,w) = w$$

Our package allows the user to simply add a variable with our add_var() method and add a function with the add_function() method.  Perhaps for some reason the user wants to calculate derivatives using the forward mode, instead of the reverse mode, in which case the user should specify `forward=True`.

```python
PAD.add_var('w')
PAD.add_function('_w')

print('JACOBIAN FORWARD:')
print(PAD.get_Jacobian([1,2,3,4], forward=True))

print("VALUE: ")
print(PAD.get_value([1,2,3,4]))
```

```
JACOBIAN FORWARD:
[[ 1.         -1.24844051  0.90929743  0.        ]
 [ 1.         -8.35853265 18.26372704  0.        ]
 [ 0.          0.          0.          1.        ]]
VALUE: 
[ 3.72789228 19.26372704  4.        ]
```


Users have access to our dictionary of functions, including `sin`, `cos`, `tan`, `exp`, `power`, `log`, `sqrt`, `arcSin`, `arcCos`, `arcTan`, `sinh`, `cosh`, `tanh`, and `logistic`.

Note that the syntax must be met _exactly_ (including capitalization!) during the string input to the function; failure to do so will result in either an `AssertionError` or erroneous results.  Our software can be used for a variety of purposes, including optimization and gradient descent.  This section gave users general guidelines for interacting with our package; however, there are a few more details worth mentioning:

1. We have implemented functionality for taking logarithms with different bases.  If, for example, the user wants to take log10 of some variable, he or she may specify `func = ['log(_x, 10)']` to get the base 10 log.  In general, the second argument to the log function can be used to change from the default base of $e$.  For exponential functions with different bases, users may simply specify `func = ['b**_x']` where `b` is the base.
2. All functions must be inputted as a list of strings, where each variable is preceded by an underscore.  The functions _must_ come from our dictionary of functions as shown above.  Corresponding variable names must be inputted as an array of strings.
3. The location at which the Jacobian or value of the function is evaluated must be specified as an array when calling PAD.get_Jacobian() or PAD.get_value().

Users may also compare two different Parallelized_AD objects using our overloaded `__eq__()` dunder method, which returns `True` if and only if _both_ the derivative and the value are equal and returns `False` otherwise.  That is, if one object PAD1 has value 0.4 and PAD2 has value 0.5, `==` will evaluate to `False`, even if they are built from the same function.  Similarly, if PAD1 and PAD2 have the same value but different derivatives, `==` will evaluate to `False`.  Thus, `==` will evaluate to `True` only when both the values and the Jacobians are equal (this is trivially the case if the same function is evaluated at the same location by two different objects).

# Software Organization
This section gives a high-level overview of our software organization, including our modules, test suites, and directory structure.  

## Directory Structure
At a high level, our directory structure is as follows:

```
AutoDiff/
│   README.md
│   LICENSE
│   setup.py
│
└───AD/
│   │   __init__.py
│   │   DualNumber.py
│   │   ElementaryFunctions.py
│   │   tests.py
│   │   Parallelized.py
│   │   Adam_py.py
│
│    driver.py
│    requirements.txt
```

## Modules

`DualNumber.py` contains a class to hold the basic DualNumber object, which the typical end user does _not_ interact with.  We, however, give users access to the class in case they wish to build their own automatic differentiation tools (or modify ours).  Note that this class was used in Milestones 1 and 2 as part of the base package and is capable of handling scalar functions of scalars (see Milestone 1 and 2 documentation for details).  This class also overloads basic operators, such as addition, subtraction, division, and multiplication, among others.

`ElementaryFunctions.py` is a module with functions defined for each of the elementary functions.  Like DualNumber, the typical end user will _not_ interact with this module unless they wish to build upon our base classes and modules.  This module contains all the elementary functions we listed in the _How to Use_ section, including, but not limited to, sine, cosine, tangent, exponential, hyperbolic, and logistic functions.  Since the ElementaryFunctions module contains functions for each of our base functions, we implement reverse and forward modes for each (detailed in Implementation section); that is, each function takes an argument for whether it is reverse or forward mode, and the Jacobian is updated differently for each.

`Parallelized.py` contains a class Parallelized_AD to give the user an interface for taking Jacobians.  This class enables users to use the reverse and forward modes of automatic differentiation.  We provide users with the get_Jacobian() and get_value() methods to calculate the Jacobian and value of a function at specific points.  Users may wish to add a function (i.e. another dimension to their vector) with the add_function() method and/or add a variable with the add_var() method.

`Adam.py` contains an example of a one-dimensional optimizer using our package's backend.

The `tests.py` module contains our unit tests for `DualNumber.py`, `ElementaryFunctions.py`, and `Parallelized.py`, which use the `pytest` module to test for functionality (since `Parallelized.py` depends on `ElementaryFunctions.py` and `DualNumber.py`, we combined the test suites).  We also implemented doctests, which we describe in further detail in the next section.

Lastly, we have a file `driver.py` which contains an example use of the package for evaluating Jacobians.


## Testing
Users who wish to run the test suites may do so by typing ```pytest AutoDiff/AD/tests.py``` into the terminal to run our test suites of the DualNumber class, for example.  We have integrated our tests with Travis and codecov, and these badges live within the README.md.  We have also provided doctests in our Parallelized, ElementaryFunctions, and DualNumber modules based on [doctest](https://docs.python.org/3/library/doctest.html) as described in lecture.  To run these and test for code coverage, just use ```pytest --doctest-modules --cov --cov-report term-missing tests.py```, for example.

## Demo 
Last, our `driver.py` file contains an example use of the package for evaluating a Jacobian in the forward and reverse mode.  Note that since our package can evaluate gradients, users may use our package for basic optimization tasks such as gradient descent.
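
A gradient descent loop of this kind can be sketched as follows. This is a self-contained illustration and not the contents of `driver.py`: the names `gradient_descent` and `example_grad` are hypothetical, and the hand-written gradient stands in for a gradient that would in practice be obtained from the package (for instance from a row of the Jacobian).

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient returned by `grad`."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point

def example_grad(point):
    """Stand-in gradient of f(x, y) = (x - 1)**2 + (y + 2)**2."""
    x, y = point
    return [2 * (x - 1), 2 * (y + 2)]

minimum = gradient_descent(example_grad, start=[0.0, 0.0])
print(minimum)  # approaches the minimiser (1, -2)
```

Keeping the gradient behind a plain callable means the optimizer does not need to know whether the derivatives come from the forward or the reverse mode.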

# Implementation details

## Classes and Data Structures

### Parallelized_AD()
The Parallelized_AD() class within the Parallelized module is the primary user-facing code in our package.  In particular, this class enables users to run automatic differentiation on vector functions of vectors (and scalars) using both the reverse and forward modes.  Below is an overview of our implementation:

![flowchart](AD-Flowchart.jpg)

The typical end user will interact strictly with the Parallelized_AD object by instantiating it with a string representation of the function and specifying the variable names.  During instantiation, no further functions are called.  The function and variable arrays are stored as the instance attributes self.function and self.varname.  Moreover, self.\_value and self.\_Jacobian are initialized to None.

Once the user asks for a value or Jacobian, through get_value() or get_Jacobian(), the string input must first be processed through our _preprocess()_ method to get it into a form that the rest of our code can use.  In particular, we convert the string 'sin(\_x)', for example, to EF.Sin(x), where EF is our elementary functions module (described later in this section).  Our preprocessing function maps function strings exactly to dual numbers and elementary functions, which is why it is important that users follow our guidelines for string inputs (e.g. 'sin(\_x)' not 'Sin(\_x)').
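
To illustrate the kind of string translation involved, the sketch below rewrites underscore-prefixed variables and known function names with regular expressions. It is an assumption-laden illustration, not our actual preprocess() code: the helper `translate`, the `FUNCTION_NAMES` mapping (including the backend names `EF.Cos` and `EF.Exp`), and the `variable[...]` indexing scheme are all hypothetical.

```python
import re

# Hypothetical mapping from user-facing names to backend function names.
FUNCTION_NAMES = {"sin": "EF.Sin", "cos": "EF.Cos", "exp": "EF.Exp"}

def translate(expression, variable_names):
    """Rewrite e.g. 'sin(_x)' as 'EF.Sin(variable[0])'."""
    # Replace each '_name' with an index into a list of variables.
    def replace_variable(match):
        name = match.group(1)
        return f"variable[{variable_names.index(name)}]"

    translated = re.sub(r"_([A-Za-z]\w*)", replace_variable, expression)

    # Replace each known function name that is directly followed by '('.
    def replace_function(match):
        return FUNCTION_NAMES[match.group(1)] + "("

    pattern = r"\b(" + "|".join(FUNCTION_NAMES) + r")\("
    return re.sub(pattern, replace_function, translated)

translated = translate("_x + sin(_y)*_z", ["x", "y", "z"])
print(translated)
```

Because the lookup is case-sensitive, a string such as 'Sin(\_x)' would pass through this sketch untranslated, which mirrors why exact syntax matters for the string inputs.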

Our get_Jacobian() method then loops through the processed functions and, for each function, calculates a gradient with respect to each of the variables.  For each core variable (e.g. 'x' is a core variable), get_Jacobian() sets up a new DualNumber and then accumulates derivatives by calling different elementary functions and overloaded operations.  The forward mode of the get_Jacobian method looks as follows:

```python
def get_Jacobian(self, loc, forward=False):
    # excerpt: method of the Parallelized_AD class (relies on numpy as np and DualNumber)
    assert (len(loc) == len(self.varname) or loc.shape[1] == len(self.varname))
    self._Jacobian = np.zeros((len(self.function), len(self.varname)))

    # for each function, if forward, do forward mode calculation, else do reverse
    # see documentation for details on reverse mode calculation
    for i, fun in enumerate(self.function):
        if forward:
            # pre-process each function to be differentiated
            translated_fun = self.preprocess(fun)
            # for each variable, take the derivative at the value specified
            for j in range(len(self.varname)):
                # seed: dual part 1 for variable j and 0 for all others
                self.variable = [DualNumber(value, dual=0) for value in loc]
                self.variable[j] = DualNumber(loc[j], dual=1)
                element = eval(translated_fun)
                self._Jacobian[i, j] = element.der
    return self._Jacobian
```


Note that for each individual function, a new core variable is set up, even if the variable was in another function.  For example, if the whole function is ['\_y', 'sin(\_y)'], the variable $y$ will be set up as a new core dual number for each of the functions.  While this is not the most efficient algorithmic implementation (we repeat the instantiation of the variables $m$ times for each of the $m$ functions), it justifies our implementation of the reverse mode, which attempts to solve this problem.

Note further that get_Jacobian() stores self.\_Jacobian and returns this value.  If the user wishes to later access the Jacobian, he or she may do so with self.Jacobian, as we have built a method with the property decorator which returns the private attribute self.\_Jacobian.

In the default reverse mode calculation, get_Jacobian again loops through all the variables but instantiates each core variable _only once_.  That is, rather than looping through each function and instantiating a new variable for each function, the reverse mode calculation computes _all_ the partial derivatives of a given function at once.  Unlike the forward mode, in which we must give different seed vectors for a given function to calculate each partial, the reverse mode differentiates each individual function with respect to _all_ of its variables within each loop.

We can see this implemented in our code; in particular, we first construct an array to store all the variables and then differentiate the input with respect to this array.  This is especially beneficial in machine learning applications, where the objective function is often scalar valued (or has far fewer dimensions than the input).

Additionally, we provide users the ability to augment existing functions by adding functions and variables.  The code for this is below:

```python
    def add_function(self, fun=None):
        if fun:
            assert isinstance(fun, str)
            # wrap a single function string in a list before appending
            self.function = [self.function] if isinstance(self.function, str) else self.function
            self.function.append(fun)
```
Since the preprocess() function is called only after the user requests a value or Jacobian, this method is sufficient to add a function.  We have a similar method for adding a variable.

### DualNumber()
The DualNumber() class within the DualNumber module once again makes up the core of our backend setup. Our package utilizes the DualNumber module for differentiation.  For example, suppose at some point we wish to take the derivative of a variable $x$ with respect to $x$ where $x=5$ (imagine this is one simple sub-problem within a larger differentiation problem).  Our Parallelized_AD class will then set up a new core variable $x$ as:

    x = DualNumber(5)

DualNumber will then store the variable value, 5, in the DualNumber data structure. This initialization will also store the derivative as `x._der = 1`. Alternatively, the derivative can be specified explicitly as a second argument:

    x = DualNumber(5,10)

This would initialize the derivative of this variable to 10. To see a more concrete case of when we'd want to set the derivative to something other than 1, consider the following example:

    x = DualNumber(5)
    y = x + x
    
Then, when we initialize y internally, we will set `y = DualNumber(x._val + x._val, x._der + x._der)`.  To generally protect the user from manipulating the values and derivatives of DualNumber objects, we have made these attributes private with the leading underscore (although this is not actually private, just Python 'private').  However, we have used the `@property` decorator on the `val()` and `der()` methods, which enables the user to access the value and derivative of their variable with `y.val` and `y.der`, while still keeping the attributes themselves private.  These properties are primarily for debugging, in case users wish to see exactly what derivatives and values are being stored.
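
The pattern described here can be sketched as follows. `ReadOnlyDual` is a hypothetical, stripped-down class written only for illustration; it assumes the default derivative of 1 mentioned above and is not the package's actual `DualNumber`.

```python
class ReadOnlyDual:
    """Hypothetical sketch: 'private' attributes exposed read-only."""

    def __init__(self, val, der=1):
        self._val = val
        self._der = der

    @property
    def val(self):
        return self._val

    @property
    def der(self):
        return self._der

    def __add__(self, other):
        # Internally the class works with the underscore attributes.
        return ReadOnlyDual(self._val + other._val, self._der + other._der)


x = ReadOnlyDual(5)
y = x + x
print(y.val, y.der)  # 10 2

try:
    y.der = 3  # no setter is defined, so assignment fails
except AttributeError:
    print("der is read-only")
```

Because no setter is defined for the properties, reading `y.der` works while assigning to it raises an `AttributeError`.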

Aside from initialization, the DualNumber class overloads many of the basic arithmetic operators, including, but not limited to, addition, subtraction, division, multiplication, and negation (technically a unary operator).

To demonstrate the functionality of this class, let's focus on the overloaded `__add__` operator. Note that there are two regimes we should consider: (1) the case in which both operands are DualNumbers and (2) the case in which one is a DualNumber and the other is a float.  To account for these cases, we have used the following design:

```python
    def __add__(self, other):
        try:
            # both operands are DualNumbers: add values and derivatives
            val2 = self.val + other.val
            der2 = self.der + other.der
            return DualNumber(val2, der2)
        except AttributeError:
            # other is a constant: only the value changes
            assert isinstance(other, (float, int)), "Check the type of objects in function!"
            val2 = self.val + other
            der2 = self.der
            return DualNumber(val2, der2)
```
            
The case in which both are DualNumbers is relatively straightforward: we simply add the values and the derivatives.  In the case in which one operand is not a DualNumber, however, we must check that it is a constant (either int or float) and then add the values only (since the derivative of a constant is zero, the derivative is unchanged).  We implement this check through a try-except design.

We used this as a simple but instructive example of the principles guiding our implementation for the DualNumber() class.  With the other basic operators, we use the same idea but have different rules for updating the value and derivative (use product rule for multiplication, for example).
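
As an illustration of how the same try-except design carries over to multiplication, here is a minimal sketch using the product rule. `ProductDual` is a hypothetical class, and mapping `__rmul__` to `__mul__` is an assumption made so that `2 * x` also works; the package's own implementation may differ.

```python
class ProductDual:
    """Hypothetical sketch of multiplication with the product rule."""

    def __init__(self, val, der=1):
        self.val = val
        self.der = der

    def __mul__(self, other):
        try:
            # Both operands are dual numbers: (uv)' = u'v + uv'
            return ProductDual(self.val * other.val,
                               self.der * other.val + self.val * other.der)
        except AttributeError:
            assert isinstance(other, (float, int)), "Check the type of objects in function!"
            # Constant factor: (c*u)' = c*u'
            return ProductDual(self.val * other, self.der * other)

    # Multiplication is commutative, so constant * dual reuses __mul__.
    __rmul__ = __mul__


x = ProductDual(3.0)
y = x * x  # value 9.0, derivative 6.0
z = 2 * x  # value 6.0, derivative 2
```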

### ElementaryFunctions
We also implement a module ElementaryFunctions which we use to define functions of core variables. Each of these functions constructs a DualNumber object.

Note that we should observe how the function is called when we have multiple variables. For example, suppose we define:
    
```python
import AD.ElementaryFunctions as EF
import AD.DualNumber as DN
a = DN.DualNumber(5)
b = EF.Exp(a)
c = EF.Cos(EF.Sqrt(a)) + b**2
```

If we were to trace how our package is calculating derivatives and values, note that in the first step, `a` is initialized to a value of 5.  We then create a new variable `b` which is `Exp(a)`.  If we were to write this mathematically, we'd have that

$$a=5, b=e^a$$

Then, `c` is defined as a more complicated combination of these two variables.  
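
To make the trace concrete, applying the chain rule by hand to this example gives the derivative that `c` should carry, assuming `a` is seeded with a derivative of 1:

$$
c = \cos(\sqrt{a}) + b^2, \qquad \frac{dc}{da} = -\frac{\sin(\sqrt{a})}{2\sqrt{a}} + 2b\,\frac{db}{da} = -\frac{\sin(\sqrt{a})}{2\sqrt{a}} + 2e^{2a}
$$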

Note that `c` is returned as a DualNumber object, as any of the elementary functions should return an object with the value and the derivative of the new variable.  Let's look at one of the elementary functions in ElementaryFunctions:

```python
def Sin(x):
    if data_type_check(x) == 0:
        if x._rev:
            z=DualNumber(np.sin(x._val),Reverse=True)
            x.children.append((np.cos(x._val),z))
            return z
        return DualNumber(np.sin(x._val),np.cos(x._val)*x._der)
    else:
        return DualNumber(np.sin(x),0)
```

This is fairly straightforward; we specify the function such that it returns a DualNumber() whose value is the sine function evaluated at the input, and whose derivative is the derivative of the sine function at that point multiplied by the derivative of the input (the chain rule).  If we are in reverse mode, we instead return a new variable $z$ which takes on the value $\sin(x)$, and append the pair of the local derivative $\cos(x)$ and the new variable to the children of $x$.

Our function `data_type_check(x)` is similar to our `assert` condition in the DualNumber initialization, where we check to make sure that the input x is either a DualNumber (return 0) or float/int.  If a string is entered, for example, `data_type_check(x)` would raise an error.

```python
def data_type_check(x):
    try:
        # a reverse-mode DualNumber may not have a derivative yet
        if x._der is None and x._rev:
            return 0  # returns 0 if x is DualNumber
        float(x._val) + float(x._der)
        return 0  # returns 0 if x is DualNumber
    except AttributeError:
        try:
            float(x)
            return 1  # returns 1 if x is real
        except (TypeError, ValueError):
            raise AttributeError('Input must be dual number or real number!')
```

Lastly, our functions throw errors when a value outside the domain is specified.  For example, if, somehow, the user wishes to find the value of $\arcsin(x)$ at $x=1000$, our function for arcsin in the ElementaryFunctions module would throw an error.
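
A domain check of this kind might look like the sketch below. The function name `arcsin_checked`, the choice of `ValueError`, and the separate check at the endpoints (where the derivative is undefined) are all assumptions made for illustration; the source does not specify which exception the package raises.

```python
import math


def arcsin_checked(x):
    """Hypothetical domain-checked arcsin returning (value, derivative)."""
    if not -1 <= x <= 1:
        raise ValueError("arcsin is only defined for inputs in [-1, 1]")
    if abs(x) == 1:
        # Assumption: reject the endpoints because 1/sqrt(1 - x^2) blows up.
        raise ValueError("the derivative of arcsin is undefined at x = -1 and x = 1")
    return math.asin(x), 1 / math.sqrt(1 - x * x)


value, derivative = arcsin_checked(0.5)

try:
    arcsin_checked(1000)
except ValueError as err:
    message = str(err)  # explains that the input is outside the domain
```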

In summary, these three modules, Parallelized, ElementaryFunctions and DualNumber, form the crux of our implementation.  Our core data structure is the DualNumber object, which overloads basic arithmetic operators and stores the private attributes `_val` and `_der`.  ElementaryFunctions is a module which defines a set of functions for the user to interact with.  It depends on DualNumber() and returns DualNumber objects.  Parallelized ties all these parts together with the Parallelized_AD() class, which provides users with a new symbolic interface to more easily interact with the package.  It enables users to use forward and reverse modes of automatic differentiation, and calls on ElementaryFunctions and DualNumber to accumulate Jacobians of vector-valued functions.

### External Dependencies
All the dependencies are listed in `setup.py` and are downloaded automatically during the initial installation through our `setup.py` script. We use numpy for calculating the values and derivatives of functions in the ElementaryFunctions module.  We also use pytest for our test suite, which has further external dependencies.  We do include `requirements.txt` but do not encourage users to use it.

# Reverse Mode Extension

We have gone over all the core parts of our reverse mode extension in the Implementation Details and Background sections; however, we will give an overview of the ideas in our extension again.

### Reverse Mode

The reverse mode is fundamentally different in its approach to automatic differentiation as compared to the forward mode. In particular, the reverse mode consists of both a forward pass and a reverse pass, with no chain rule applied in the forward pass (only partial derivatives are stored). The result of the reverse mode is only determined after the reverse pass is done, and the value of each variable or parent node at each stage depends on the values of its child nodes. This has important implications for the design of our package for reverse mode.

Like the forward mode, the Parallelized module for reverse mode calls on DualNumber and ElementaryFunctions.  However, unlike the forward mode, derivatives are accumulated at the end of the chain rule, as opposed to one by one from the beginning.  To see this, note the differences between our implementations of the forward and reverse modes:

```python
# REVERSE MODE
if len(loc) == len(self.varname):
    self.variable=[DualNumber(value,Reverse=True) for value in loc]
else:
    self.variable=[DualNumber(value,Reverse=True) for value in loc[0]]
translated_fun=self.preprocess(fun)
element=eval(translated_fun)
element.set_der(1)
for j in range(len(self.varname)):
    self._Jacobian[i,j]=self.variable[j].der
```

```python
# FORWARD MODE
translated_fun=self.preprocess(fun)
# for each variable, take the derivative at the value specified
for j in range(len(self.varname)):
    self.variable=[DualNumber(value,dual=0) for value in loc]   
    self.variable[j]=DualNumber(loc[j],dual=1) 
    element=eval(translated_fun)
    self._Jacobian[i,j]=element.der
```

Note that in reverse mode, we first set up the variables and evaluate the translated function, and only at the very end seed the output derivative with `set_der(1)` and read off each variable's derivative, whereas in the forward mode, we set up the variables and their seed derivatives together in each pass.

The accumulation of forward and reverse mode derivatives can be seen in how the elementary functions are implemented:

```python
def Sin(x):
    if data_type_check(x) == 0:
        if x._rev:
            z=DualNumber(np.sin(x._val),Reverse=True)
            x.children.append((np.cos(x._val),z))
            return z
        return DualNumber(np.sin(x._val),np.cos(x._val)*x._der)
    else:
        return DualNumber(np.sin(x),0)
```

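When `x._rev` is True, we append a pair of the local derivative and the new variable to the children of the input variable, and the derivative itself is evaluated later, once the reverse pass reaches that variable in the chain rule.  Otherwise, in forward mode, we simply return a DualNumber that already carries the derivative of this variable.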
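
The interplay between the stored children and the final derivative can be sketched as follows. `RevNode`, `sin` and `mul` are hypothetical stand-ins; in particular, the lazily evaluated `der` property is an assumption about how the accumulation could work, not a copy of the package's `DualNumber`.

```python
import math


class RevNode:
    """Hypothetical reverse-mode node; not the package's DualNumber."""

    def __init__(self, val):
        self.val = val
        self.children = []  # (local partial derivative, child node) pairs
        self._der = None    # filled in during the reverse pass

    def set_der(self, value):
        self._der = value

    @property
    def der(self):
        # d(out)/d(self) = sum over children of d(child)/d(self) * d(out)/d(child)
        if self._der is None:
            self._der = sum(weight * child.der
                            for weight, child in self.children)
        return self._der


def sin(x):
    z = RevNode(math.sin(x.val))
    x.children.append((math.cos(x.val), z))  # store the local partial only
    return z


def mul(x, y):
    z = RevNode(x.val * y.val)
    x.children.append((y.val, z))
    y.children.append((x.val, z))
    return z


x, y = RevNode(2.0), RevNode(3.0)
out = mul(x, sin(y))   # forward pass for f(x, y) = x * sin(y)
out.set_der(1.0)       # seed the output
grad = (x.der, y.der)  # reverse pass yields (sin(3), 2 * cos(3))
```

A single forward evaluation followed by one seeding of the output gives both partial derivatives, which is the contrast with the forward mode drawn above.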

The primary challenge for reverse mode is to ensure that its classes, methods and attributes are kept separate from those of forward mode, even though we would want them to share certain similarities. We ensured this by introducing a binary condition for whether we were using reverse or forward mode in the ElementaryFunctions and Parallelized modules.

# Future Features

### Visualization

When users interact with our package, it would be useful for them to have a way of visualizing the ongoing calculations or final results. We hope to include code that prints the status of ongoing calculations and meaningful results. In addition, we hope to define a new method called `post_process`, available in the classes of our package where visual output of key calculations or important results is possible. For instance, `post_process` within the reverse mode class might produce tables of the forward pass and reverse pass. The method `post_process` would primarily use the `matplotlib` library and take `directory_out` as an argument for users to indicate the directory in which they wish to save the visualization outputs.

### Hessian Matrices, Higher Order Derivatives
Ideally, we would like to implement the ability for users to calculate $n$-th order derivatives (including tensor products and other forms that cannot be represented with normal matrices).  Ideas about this already exist in the scientific literature, such as http://people.maths.ox.ac.uk/gilesm/files/devendra_AIAA-07.pdf.  To implement this, we'd have a function called `get_deriv` which takes a value and an argument for the order.  We would implement this using ideas similar to Parallelized, except we'd now 'parallelize' over the results of each step of differentiation.

This would enable users to calculate non-linear approximations to functions (such as Taylor approximations), among other applications.
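
For instance, with both the gradient and the Hessian $H_f$ available, the standard second-order Taylor approximation of a scalar-valued function $f$ around a point $a$ would be:

$$
f(x) \approx f(a) + \nabla f(a)^{T}(x-a) + \frac{1}{2}(x-a)^{T} H_f(a)\,(x-a)
$$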