In [None]:
%reload_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt

# Fix the random seed to facilitate grading
np.random.seed(1)

# HW1.1 - essentials of differentiation

## 1.a Finite-difference tests (11 pts) 

Autodifferentiation allows us to evaluate the derivative of $f$ at any point $x$. To test what we will implement below, we review a very common sanity test in mathematical programming: the finite-difference test. We can use it to verify, using essentially the definition of the derivative, that our gradient implementation is correct. 
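
As a warm-up, here is a minimal sketch of a forward-difference check on a function that is not part of the assignment. The names `h`, `h_grad`, and `forward_difference` are illustrative assumptions; the step size and tolerance simply mirror the values used in the skeleton.

```python
import numpy as np

def h(x):
    # Illustrative elementwise function (not the assignment's f)
    return np.tanh(x)

def h_grad(x):
    # Analytic derivative of tanh, applied elementwise
    return 1.0 - np.tanh(x) ** 2

def forward_difference(func, x, eps=1e-7):
    # Approximates the derivative by (func(x + eps) - func(x)) / eps, elementwise
    return (func(x + eps) - func(x)) / eps

x0 = np.linspace(-1.0, 1.0, 5)
fd = forward_difference(h, x0)

# The two derivative estimates should agree up to the forward-difference error
np.testing.assert_allclose(fd, h_grad(x0), atol=1e-5)
```

The check passes silently when the analytic derivative and the finite-difference estimate agree within the tolerance.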

Fill in the missing code in the skeleton below, then run the cell to confirm that it executes without errors. Then answer the follow-up questions.

**1.a.1** Implement direction-wise forward-differences test (5 pts)

In [None]:
def f(x):
    return x**2 + np.log(1 + x) - np.sin(x)

def f_grad(x):
    return 2*x + 1/(1 + x) - np.cos(x)

In [None]:
x = np.random.rand(5)

grad_f_ours = f_grad(x)
eps = 1e-7

for i in range(len(x)):
    ### ============= 1.a.1 Fill in below ==============
    ### Implement direction-wise forward-differences test
    ### using x as the evaluation point. 
    ### Compute:
    ### - grad_f_fd : gradient approximated using forward-difference method

    grad_f_fd = None

    ### ===========================

    np.testing.assert_allclose(grad_f_fd[i], grad_f_ours[i], atol=1e-5)

### Follow-up questions: 

**1.a.2** The precision of the test above is not great. The tolerance `atol=1e-5` is relatively loose. How can we increase precision? (2 pts)

Answer: 

**1.a.3** Let's say the function has `np.log(x + 0.001)` instead of `np.log(x + 1)`. Do you foresee any problems with this function? (2 pts)

Answer: 

**1.a.4** The chosen step size `eps` and tolerance `atol` are a bit arbitrary. Do you have an idea for implementing this test in a more principled way? (Hint: use the error rate of the finite-difference gradient estimate) (2 pts)

Answer: 

## 1.b From gradients to linear maps (24 pts)

To prepare for automatic differentiation, we now reimplement the ideas above using **linear maps**. Recall that the **directional derivative** of a function can be viewed as a linear map. We have:

$$
\partial f(x)[v] = \lim_{\delta \rightarrow 0} \frac{f(x+\delta v) - f(x)}{\delta}
$$

So, just like before, we can approximate this directional derivative using the **finite-difference** method.
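
To make the definition concrete, the sketch below compares a forward-difference quotient along a direction $v$ with the exact directional derivative $\langle \nabla q(x), v \rangle$ for a quadratic. The function `q`, the matrix `A`, and the chosen `delta` and tolerance are illustrative assumptions, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T  # symmetric, so the gradient of q is A @ x

def q(x):
    # Illustrative scalar-valued quadratic: q(x) = 0.5 * x^T A x
    return 0.5 * x @ A @ x

def q_lmap(x, v):
    # Directional derivative as a linear map in v: <grad q(x), v>
    return (A @ x) @ v

x0 = rng.standard_normal(4)
v0 = rng.standard_normal(4)

delta = 1e-6
dd_fd = (q(x0 + delta * v0) - q(x0)) / delta  # finite-difference quotient
dd_exact = q_lmap(x0, v0)

np.testing.assert_allclose(dd_fd, dd_exact, atol=1e-3)
```

Note that only two evaluations of `q` are needed, regardless of the dimension of `x0`, because the test probes a single direction.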

**1.b.1** Implement forward-difference test for the directional derivative (5 pts)

In [None]:
def f_lmap(x, v): 
    assert len(x) == len(v)
    return f_grad(x) * v

x = np.ones(5)

### ============= 1.b.1 Fill in below ==============
### Implement forward-difference test for the directional derivative.
### Use x as the evaluation point and sample a random direction v = np.random.rand(5)
### Compute:
### - grad_f_ours: directional derivative computed via the linear map
### - grad_f_est : directional derivative approximated using finite differences method

grad_f_ours = None
grad_f_est = None

### =========================== 

np.testing.assert_allclose(grad_f_ours, grad_f_est, rtol=1e-5) 

So far, we have focused on functions $f : \mathbb{R}^P \to \mathbb{R}$ that take a vector as input and produce a scalar as output. However, functions with **matrix or even tensor inputs/outputs** are commonplace in neural networks. In this example, let's consider a function that takes a matrix as input:

$$
y = f(W) = Wx, \quad W\in\mathbb{R}^{M\times D}
$$

You can think of this as a neural network with one layer and no activation function. Here, $x \in \mathbb{R}^D$ is the input, $W \in \mathbb{R}^{M \times D}$ contains the weights that we try to learn, and $y \in \mathbb{R}^M$ is the output. How would you differentiate $f(W)$ with respect to $W$?

This is where linear maps start to shine. Answer the questions below to convince yourself of the utility of linear maps. Then, implement the code below. 

**1.b.2** What is the expression of $\partial f(W)[V]$, the directional derivative of $f$ at $W \in \mathbb{R}^{M \times D}$ in the direction of $V \in \mathbb{R}^{M \times D}$? (5 pts)

Answer:


**1.b.3** In the previous question, we expressed the directional derivative $\partial f(W)[V]$ as a linear map. Now, we would like to express the same directional derivative using the standard gradient. Since the function takes a matrix as input, we introduce new variables: $w:=\mathrm{vec}(W)$, $v:=\mathrm{vec}(V)$. What is the expression of the standard gradient with respect to $w$?  (5 pts)

Hint: You can calculate the gradient by rewriting the expression of $\partial f(W)[V]$ in the inner-product form $\langle \cdot, v \rangle$, using the Kronecker–vec identity below. Here $\mathrm{vec}$ stacks the rows of the matrix, which matches NumPy's default `flatten()`.

$$
Vx = (I_M \otimes x^\top) \mathrm{vec}(V)
$$
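
The identity in the hint can be checked numerically. The sketch below assumes a row-major $\mathrm{vec}$ (NumPy's default `flatten()`), and also shows the column-major counterpart $Vx = (x^\top \otimes I_M)\,\mathrm{vec}(V)$ for comparison. The sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 3, 4
V = rng.standard_normal((M, D))
x = rng.standard_normal(D)

# Row-major vec (NumPy's default "C" order): V x = (I_M kron x^T) vec(V)
vec_row = V.flatten()
K_row = np.kron(np.eye(M), x[None, :])  # shape (M, M*D)
np.testing.assert_allclose(K_row @ vec_row, V @ x, atol=1e-12)

# Column-major vec ("F" order): V x = (x^T kron I_M) vec(V)
vec_col = V.flatten(order="F")
K_col = np.kron(x[None, :], np.eye(M))  # shape (M, M*D)
np.testing.assert_allclose(K_col @ vec_col, V @ x, atol=1e-12)
```

Both Kronecker matrices have $M \times MD$ entries, most of which are zero, whereas `V @ x` touches only the $MD$ entries of `V`.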

Answer:

Next, implement the above to verify the math and to observe the differences! 

In [None]:
def f(x, W):
    return W @ x

### ============= 1.b.2/1.b.3 Fill in below ==============
### Implement the directional derivative and the gradients that you found above
### Verify 1.b.2 and 1.b.3 using the provided code.

def f_lmap(x, W, V):
    # Implement the directional derivative via the linear map

    return None

def f_jac(x, W):
    # Implement the Jacobian with respect to vec(W)
    # Hint: use the Kronecker product (np.kron)

    return None
### =========================== 

In [None]:
m = 7
p = 8
W = np.random.rand(m, p)
V = np.random.rand(m, p)

x = np.random.rand(p)

J = f_jac(x, W)

plt.matshow(J)
plt.show()

lmap_naive = J @ V.flatten()
lmap_better = f_lmap(x, W, V)

np.testing.assert_allclose(lmap_naive, lmap_better)

**1.b.4** Implement the forward-difference test for f_lmap (5 pts)

In [None]:
### ============= 1.b.4 Fill in below ==============

### Implement forward-difference test for f_lmap.
### Compute:
### - lmap_fd : directional derivative approximated using finite differences method
### - lmap_ours: directional derivative computed via the linear map

lmap_fd = None
lmap_ours = None
### ===========================

np.testing.assert_allclose(lmap_fd, lmap_ours, atol=1e-10)

In [None]:
%timeit -n 100 -r 100 f_jac(x, W) @ V.flatten()
%timeit -n 100 -r 100 f_lmap(x, W, V)

**1.b.5** Comment on the following behavior. (4 pts) 
- Why does the forward-difference test pass to very high accuracy in this example?

    Answer:

- Why is `f_jac(x, W) @ V.flatten()` significantly slower than `f_lmap(x, W, V)`?

    Answer: 


## 1.c Chain rule from three perspectives. (15 pts)

Now that we understand the role of linear maps and gradients, we will start stitching together multiple operations. 

Let's define 

$$
f(W) = Wx,\quad g(u) = \log \sum_{j=1}^{M} e^{u_j},
$$

where $W \in \mathbb{R}^{M \times D}$ and $g:\mathbb{R}^M \to \mathbb{R}$ is the log-sum-exp operator. We consider the composite function

$$
L(W) = (g \circ f)(W)
$$ 

Our goal is to use three different methods to calculate the directional derivative $\partial L(W)[V]$ based on the chain rule, as shown in the implementation below. 
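
Since $g$ appears in every variant of the chain rule, it helps to be confident about its gradient, which is the softmax. The sketch below is one possible check using central differences; the helper names `logsumexp` and `softmax` and the max-shift for numerical stability are assumptions added for illustration, not part of the provided implementation.

```python
import numpy as np

def logsumexp(u):
    # Shift by max(u) so that exp never overflows; the result is unchanged
    m = np.max(u)
    return m + np.log(np.sum(np.exp(u - m)))

def softmax(u):
    # Gradient of log-sum-exp, computed with the same shift
    e = np.exp(u - np.max(u))
    return e / np.sum(e)

u1 = np.array([0.5, -1.0, 2.0])
eps = 1e-6

# Central differences along each coordinate direction
fd = np.array([
    (logsumexp(u1 + eps * e) - logsumexp(u1 - eps * e)) / (2 * eps)
    for e in np.eye(len(u1))
])
np.testing.assert_allclose(fd, softmax(u1), atol=1e-8)

# The shifted version stays finite where exp(1000) alone would overflow
big = np.array([1000.0, 1000.0])
np.testing.assert_allclose(logsumexp(big), 1000.0 + np.log(2.0))
```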

**Write the mathematical expressions of the three different ways to compute the directional derivative via the chain rule. You may use the provided implementation for inspiration.**

**1.c.1** Using the multiplication of the respective Jacobians. (5 pts)

Answer:

**1.c.2** Using the standard linear map (as seen in forward-mode autodiff). (5 pts)

Answer:

**1.c.3** Using adjoints (see also reverse-mode autodiff). (5 pts)

Answer:

In [None]:
def g_grad(y):
    # Gradient of log-sum-exp is the softmax; subtracting max(y) avoids overflow
    e = np.exp(y - np.max(y))
    return e / np.sum(e)

def g_lmap(y, v):
    return g_grad(y) @ v

def f_lmap(x, W, V):
    # argument W is kept for consistency although not used here
    return V @ x

def f_lmap_adj(x, W, u):
    # computes the outer product u x^T for 1-D arrays u, x (u @ x.T would give a scalar here)
    return np.outer(u, x)

In [None]:
J = f_jac(x, W)
grad_concat = g_grad(f(x, W)) @ J
grad_concat_V = grad_concat @ V.flatten()

grad_lmap = g_lmap(f(x, W), f_lmap(x, W, V))

grad_adjoint = f_lmap_adj(x, W, g_grad(f(x, W)))
grad_adjoint_V = np.trace(grad_adjoint.T @ V)

np.testing.assert_allclose(grad_concat_V, grad_lmap)
np.testing.assert_allclose(grad_concat_V, grad_adjoint_V)

## 1.d Towards forward-mode vs. reverse-mode autodifferentiation (20 pts)

Finally, we want to show how easy it is to get the gradient from the adjoint linear map as opposed to the normal linear map. 
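
The adjoint of a linear map is defined by the inner-product identity $\langle u, \partial f(W)[V] \rangle = \langle \partial f(W)^*[u], V \rangle$, where the right-hand side uses the Frobenius inner product. For $f(W) = Wx$ this can be checked numerically; the sketch below uses illustrative sizes and random data.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 3, 4
x = rng.standard_normal(D)
V = rng.standard_normal((M, D))
u = rng.standard_normal(M)

# Left-hand side: <u, df(W)[V]> with df(W)[V] = V x
lhs = u @ (V @ x)

# Right-hand side: <df(W)^*[u], V>_F with df(W)^*[u] = u x^T
rhs = np.sum(np.outer(u, x) * V)

np.testing.assert_allclose(lhs, rhs, rtol=1e-10, atol=1e-12)
```

The elementwise product followed by a sum computes the same Frobenius inner product as `np.trace(A.T @ V)` without forming the intermediate matrix product.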

Below, implement two ways of calculating the derivative of $L(W)$ w.r.t. $W$:

$$
\nabla_W L(W) \in \mathbb{R}^{m \times p}
$$

**1.d.1** (10 pts) First, compute the gradient by perturbing the weights along each coordinate direction of the weight matrix.
Specifically, for each matrix entry $(i, j)$, define the direction

$$
V_{ij} \in \mathbb{R}^{m \times p}
\quad \text{with} \quad
(V_{ij})_{kl} =
\begin{cases}
1, & (k,l) = (i,j), \\
0, & \text{otherwise}.
\end{cases}
$$

Using the chain rule with the standard linear map, compute each entry of the gradient as
$$
(\nabla_W L(W))_{ij} = \partial L(W)[V_{ij}] = \partial g(f(W))\big[\partial f(W)[V_{ij}]\big]
$$

In [None]:
### ============= 1.d.1 Fill in below ==============
### Compute the gradient using forward-mode autodifferentiation

grad_lmap_vec = None
### ===========================

**1.d.2** (10 pts) Second, use the adjoint operator to obtain the gradient vector in one step. Recall from the chain rule in adjoint form:

$$
\nabla_{W}L(W) = \partial f(W)^*\left[ \nabla g(f(W)) \right]
$$

In [None]:
### ============= 1.d.2 Fill in below ==============
### Compute the gradient using reverse-mode autodifferentiation

grad_adjoint_vec = None
### ===========================

np.testing.assert_allclose(grad_lmap_vec, grad_adjoint_vec)

## 1.e Packaging of functions (10 pts)

So far, all the functions you wrote live only in this notebook. In this last part of this notebook, we will follow a few standard coding practices to package things in a more usable form for what's coming next. 

In particular, it is better practice to: 
- move functions that are ready to use to Python modules (`.py` files).
- create a local Python package that can be imported from any subfolder, which avoids path problems.
- create appropriate unit tests (e.g., a finite-difference test).

**1.e.1** (5 pts) Create a Python package, for example `mytorch`. Move the relevant functions to this package, and create appropriate tests. After doing this, you should be able to install the package using `pip install -e mytorch`. Then show the output of your tests, as in the example below. One test that you should definitely be able to use is the finite-difference test that you created at the very beginning of this notebook.
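
As a rough idea of what a test module could look like, here is a minimal pytest-style sketch. The file name `mytorch/tests/test_finite_difference.py`, the helper `check_lmap`, and the tested function are all hypothetical; in an actual package the helper would more naturally be imported from the package rather than defined in the test file.

```python
# Hypothetical file: mytorch/tests/test_finite_difference.py
import numpy as np

def check_lmap(func, lmap, x, v, delta=1e-6, atol=1e-4):
    """Compare a linear map against a forward difference along direction v."""
    fd = (func(x + delta * v) - func(x)) / delta
    np.testing.assert_allclose(lmap(x, v), fd, atol=atol)

def test_square_lmap():
    # Elementwise square: the directional derivative along d is 2 * z * d
    rng = np.random.default_rng(0)
    x = rng.random(5)
    v = rng.random(5)
    check_lmap(lambda z: z ** 2, lambda z, d: 2 * z * d, x, v)
```

pytest collects functions whose names start with `test_`, so running it on the tests folder would pick up `test_square_lmap` automatically.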

In [None]:
!pytest mytorch/tests

**1.e.2** (5 pts) Answer the questions below about the package that you created. Note that there is no right or wrong answer; these questions are meant to make you reflect on what you have implemented.
- What tests did you implement?
- Are you sure that your code works based on these tests?
- Is the interface to your code straightforward (i.e., how many lines of code are required to run your tests)?

## Acknowledgment of Collaboration and/or Tool Use

Please choose from below (simply delete the lines that do not apply) and add a few additional notes

- “I worked alone on this assignment.”, or
- “I worked with ~~~~~~ [person or tool] on this assignment.” and/or
- “I received assistance from ~~~~~~ [person or tool] on this assignment.”

For the last two cases, specify how the person or tool helped you and explain why this amplified your learning process:

Answer: