In [None]:
%reload_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt

# Fix the random seed to facilitate grading
np.random.seed(1)

# HW1.2 - Autodifferentiation

In the previous notebook, we learned about some fundamental concepts for autodifferentiation; in particular 
- the computational benefit of seeing gradients as linear maps
- the computational benefit of using the adjoint operator for computing gradients.

Now, we will use these concepts in larger, more complex computation graphs. The goal is to better understand the computational benefits of the different approaches. 

In the previous notebook, you have already implemented several functions that are part of a standard neural network architecture. In the last part of the notebook, you exported the relevant tasks into their own module, making it easy to use these functions now.

## 2.a Forward-mode Autodifferentiation w.r.t. Inputs (20 pts)

For pedagogical reasons, we first consider the following neural network. 

$$
y = f(x, W_1, W_2, W_3) = f_3(f_2(f_1(x, W_1), W_2), W_3) = W_3 W_2 W_1 x
$$

In other words, we compose the linear map that we implemented in the previous notebook three times.

We solve a regression problem using the squared error loss

$$
L(x, z) = \| f(x) - z \|^2
$$

where $z$ are target data points from the dataset. At first, we minimize the loss w.r.t. the input variables $x$ (not the weights). To do this, we need a couple of ingredients. 

**2.a.1** (6 pts)
- Implement the linear map of the function f(x), returning both $d/dx$ and $d/dW$.
- Implement the loss and its gradient with respect to $y$. 
- Implement the directional derivative of the loss at $y$ in the direction $dy$ via linear map.
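
A directional derivative computed via a linear map can be checked numerically before it is used in a longer chain. The block below is a minimal sketch of such a check using central finite differences. The helper name `fd_directional_derivative` and the toy function are assumptions made for illustration; they are not part of `mytorch`.

```python
import numpy as np

def fd_directional_derivative(f, x, v, eps=1e-6):
    """Central finite-difference estimate of d/dt f(x + t*v) at t = 0."""
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

# Toy function with a known directional derivative:
# f(x) = sum(x**2)  =>  Df(x)[v] = 2 * <x, v>
rng = np.random.default_rng(0)
x_toy = rng.standard_normal((4, 1))
v_toy = rng.standard_normal((4, 1))

numeric = fd_directional_derivative(lambda a: np.sum(a ** 2), x_toy, v_toy)
exact = 2.0 * np.sum(x_toy * v_toy)

# The two values should agree up to finite-difference and rounding error
np.testing.assert_allclose(numeric, exact, rtol=1e-6, atol=1e-8)
```

The same pattern could be applied to any of the `*_lmap` functions in this notebook by passing the corresponding forward function as `f`.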

In [None]:
from mytorch.layers import linear

### ============= 2.a.1 Fill in below ==============

def linear_lmap(x, W, dx=None, dW=None):
    ### Implement the linear map of the function at x or W in the direction dx or dW
    ### return d/dx and d/dW
    
    lmap_x = None      # d/dx 
    lmap_W = None      # d/dW
    return lmap_x, lmap_W

def sqe(y, z):
    ### Implement the sqe loss
    
    return None

def sqe_grad(y, z):
    ### Implement the gradient of the loss w.r.t. y

    return None

def sqe_lmap(y, dy, z):
    ### Implement the directional derivative of the loss at y in the direction dy via a linear map
    
    return None

### ===========================

Let us create a random instance of this forward network and calculate the loss. The input variable satisfies $x \in \mathbb{R}^{p \times N}$, where $p$ is the input dimension and $N$ is the number of data points. To keep things simple, we first consider the single-input case with $N = 1$. We will extend this to the multiple-input (batch) case later.

In [None]:
m1, m2, m3, p = 3, 5, 1, 2

# We start with just one datapoint
N = 1

W_list = [np.random.rand(mi, mj) for mi, mj in zip([m1, m2, m3], [p, m1, m2])]
x = np.random.rand(p, N) * 2 * np.pi

z = np.random.rand(m3, N)
y = linear(linear(linear(x, W_list[0]), W_list[1]), W_list[2])

loss = sqe(y, z)
print(loss)

**2.a.2** Using the above computation chain, implement the forward pass to calculate the directional derivative of the loss with respect to $x$ in a randomly chosen direction $u$. (5 pts)

In [None]:
# create a random direction with the same shape as x
u = np.random.randn(x.size).reshape(x.shape)
u = u / np.linalg.norm(u)

### ============= 2.a.2 Fill in below ==============
### Compute the directional derivative of the loss at x in the above random direction u.

# forward pass
s_0 = x
s_1 = None
s_2 = None
s_3 = None

# create a variation direction 
t_0 = u
t_1 = None
t_2 = None
t_3 = None
t_4 = None

### ===========================
print(t_4)

**2.a.3** Compute the gradient of the loss with respect to x using the linear map, evaluated at $p$ canonical directions, and ensure it is the same as the analytical gradient. (5 pts)
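
To illustrate the idea of assembling a gradient from canonical directions, here is a minimal sketch on a toy scalar function (not the network above). The function `toy_loss_lmap` and all `*_toy` names are assumptions introduced only for this example.

```python
import numpy as np

def toy_loss_lmap(x, v):
    # Directional derivative of g(x) = sum(sin(x)) at x in the direction v
    return np.sum(np.cos(x) * v)

p_toy = 3
x_toy = np.array([[0.1], [0.7], [-1.2]])

# One evaluation of the linear map per canonical direction e_i
grad_toy = np.zeros((p_toy, 1))
for i in range(p_toy):
    e_i = np.zeros((p_toy, 1))
    e_i[i] = 1.0
    grad_toy[i] = toy_loss_lmap(x_toy, e_i)

# For this toy function the analytical gradient is cos(x)
np.testing.assert_allclose(grad_toy, np.cos(x_toy))
```

Note that the loop needs one evaluation of the linear map per input dimension, so the cost grows linearly with the number of inputs.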

In [None]:
grad_forward = np.zeros((p, 1))

### ============= 2.a.3 Fill in below ==============
### Do a forward pass in multiple unit directions to construct the gradient of the loss function with respect to x. Compare it with the analytical gradient.
### Compute:
### - grad_forward: the gradient of the loss function w.r.t x. computed via linear maps
### - grad_analytical: the analytical gradient of the loss function w.r.t x.

# the gradient of the loss function via linear maps
for i in range(p):
    grad_forward[i] = None

# analytical gradient
grad_analytical = None

### ===========================

np.testing.assert_allclose(grad_analytical, grad_forward)

**2.a.4** What are two suboptimal aspects of the above implementation?  (4 pts)

Answers:

## 2.b Reverse-mode Autodifferentiation w.r.t. Inputs (7 pts)

In the previous section, we have seen that computing the gradient using forward-mode differentiation is cumbersome. Following what you have seen in class, compute the gradient using the adjoint operator and the backpropagation algorithm (reverse-mode autodifferentiation). 

**2.b.1** Implement the adjoint operator of the linear map of the function f(x). (2 pts)
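
A useful sanity check for any adjoint is the inner-product identity $\langle A v, u \rangle = \langle v, A^* u \rangle$, which must hold for all $v$ and $u$. The sketch below applies it to a toy linear map from $\mathbb{R}^3$ to $\mathbb{R}^2$. The helper `adjoint_dot_test` and the toy maps are assumptions for illustration and are not the linear layer from this assignment.

```python
import numpy as np

def adjoint_dot_test(lmap, lmap_adjoint, v, u):
    """Return (<lmap(v), u>, <v, lmap_adjoint(u)>); both agree for a correct adjoint."""
    return np.sum(lmap(v) * u), np.sum(v * lmap_adjoint(u))

# Toy linear map R^3 -> R^2: keep the first two entries and scale them
def toy_lmap(v):
    return np.array([2.0 * v[0], -1.0 * v[1]])

# Its adjoint R^2 -> R^3: apply the same scaling and pad with a zero
def toy_lmap_adjoint(u):
    return np.array([2.0 * u[0], -1.0 * u[1], 0.0])

rng = np.random.default_rng(0)
v_toy = rng.standard_normal(3)
u_toy = rng.standard_normal(2)

lhs, rhs = adjoint_dot_test(toy_lmap, toy_lmap_adjoint, v_toy, u_toy)
np.testing.assert_allclose(lhs, rhs)
```

Because the identity only involves inner products, the same test can be run with random `v` and `u` of matching shapes for any pair of linear map and adjoint.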

In [None]:
### ============= 2.b.1 Fill in below ==============
def linear_lmap_adjoint(x, W, du):
    ### Implement the adjoint linear map of the function f(x) at x or W in the direction du
    ### return du/dx, du/dW
    
    lmap_adjoint_x = None      # du/dx 
    lmap_adjoint_W = None      # du/dW
    return lmap_adjoint_x, lmap_adjoint_W

### ===========================

**2.b.2** Implement the forward and backward pass of the computation chain to compute the gradient of the loss with respect to input x. (5 pts)


In [None]:
### ============= 2.b.2 Fill in below ==============
### Do forward and backward passes to compute the gradients of loss with respect to x.
### Compute:
### - grad_reverse: the gradient of loss function w.r.t x.

grad_reverse = None
### ===========================

np.testing.assert_allclose(grad_analytical, grad_reverse)

## 2.c Autodifferentiation w.r.t. Network Weights (30 pts)

So far, we have differentiated with respect to the inputs $x$. This is useful when solving optimal control or estimation problems. For example, $f(x)$ could represent our measurement or dynamics model, and we try to find the $x$ that minimizes the difference between a target and the outcome.

In machine learning, a more common task is to tune the network parameters in order to create a predictor based on given training data. In this part of the notebook, we adapt the above pipeline to neural network training.

Below, we compute the analytical gradients of the cost w.r.t. $W_0$, $W_1$, and $W_2$, so that you can check your implementation against them as you move forward.

In [None]:
# Below uses hard-coded gradients, for simplicity
# dL/dW_k = [2 W_{k+1}^T W_{k+2}^T ... W_K^T (yhat - z)] @ [W_{k-1} ... W_0 x]^T
yhat = W_list[2] @ W_list[1] @ W_list[0] @ x
grad_W0_analytical = 2 * W_list[1].T @ W_list[2].T @ (yhat - z) @ x.T
grad_W1_analytical = 2 * W_list[2].T @ (yhat - z) @ (W_list[0] @ x).T
grad_W2_analytical = 2 * (yhat - z) @ (W_list[1] @ W_list[0] @ x).T

**2.c.1** Below, calculate the gradient w.r.t. the network weights $W_k$ using a linear map. As in **1.d.1**, we first do this the "inefficient way" so that we see the benefit of backpropagation. (5 pts) 

(Hint: A perturbation in $W_k$ only affects layer $k$ and subsequent layers, so the forward-mode propagation should start at layer $k$.)
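
To get a feeling for why this approach is inefficient, we can count how many tangent propagations it needs: one per weight entry. The sketch below does this for the weight shapes used in this section (from `m1, m2, m3, p = 3, 5, 1, 2`). The variable names are assumptions for illustration.

```python
# Shapes (rows, cols) of W_list[0], W_list[1], W_list[2] in this section
weight_shapes = [(3, 2), (5, 3), (1, 5)]

# Forward mode needs one tangent propagation per canonical direction,
# i.e. one per entry of the weight matrix
forward_passes = [rows * cols for rows, cols in weight_shapes]

print(forward_passes)       # [6, 15, 5]
print(sum(forward_passes))  # 26
```

In contrast, reverse mode obtains the gradients for all three weight matrices from a single forward pass followed by a single backward pass.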

In [None]:
import itertools
    
k = 1  # We want gradient w.r.t W_list[k]
W = W_list[k]
grad_Wk_forward = np.zeros_like(W)

s_list = [s_0, s_1, s_2, s_3]

for m, n in itertools.product(range(W.shape[0]), range(W.shape[1])):
    V = np.zeros_like(W)
    V[m, n] = 1.0

    ### ============= 2.c.1 Fill in below ==============
    ### Populate the gradient element (m,n) using the forward pass.
    ### Compute:
    ### - grad_Wk_forward: the gradient of loss function w.r.t W_list[k] 
    
    grad_Wk_forward[m, n] = None
    
    ### ===========================
if k == 0:
    np.testing.assert_allclose(grad_Wk_forward, grad_W0_analytical)
elif k == 1:
    np.testing.assert_allclose(grad_Wk_forward, grad_W1_analytical)
elif k == 2:
    np.testing.assert_allclose(grad_Wk_forward, grad_W2_analytical)

**2.c.2** Now, compute the gradient of the loss function w.r.t. the network weights $W_k$ using backpropagation. (5 pts)

In [None]:
### ============= 2.c.2 Fill in below ==============
### Compute the gradient of the loss function w.r.t. the network weights $W_k$ using backpropagation.
### Compute:
### - grad_Wk_reverse: the gradient of loss function w.r.t W_list[k] using backpropagation.

grad_Wk_reverse = None
### ===========================
if k == 0:
    np.testing.assert_allclose(grad_Wk_reverse, grad_W0_analytical)
elif k == 1:
    np.testing.assert_allclose(grad_Wk_reverse, grad_W1_analytical)
elif k == 2:
    np.testing.assert_allclose(grad_Wk_reverse, grad_W2_analytical)

## Adding nonlinearities

Hopefully you find the above example useful to get a grasp of how autodifferentiation works, and to debug your code. Now let's move to more realistic examples so that you can actually train your own small neural network. The main missing elements, compared to standard neural networks, are biases and nonlinearities. For now, let's just add nonlinearities and see how far that can get us. 

As we have seen, to add nonlinearities, all we need is to define the derivative and the adjoint function, both defined as linear maps. Let's do this for two standard nonlinearities. 
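
As an illustration of the pattern, here is a minimal sketch for a third nonlinearity, $\tanh$, which is not part of this assignment. The function names `tanh_lmap` and `tanh_lmap_adjoint` are assumptions chosen to mirror the naming used in this notebook.

```python
import numpy as np

def tanh_lmap(x, v):
    # Linear map of tanh at x applied to the direction v: (1 - tanh(x)^2) * v
    return (1.0 - np.tanh(x) ** 2) * v

def tanh_lmap_adjoint(x, u):
    # The Jacobian of an elementwise function is diagonal, hence symmetric,
    # so the adjoint applies the same elementwise scaling.
    return (1.0 - np.tanh(x) ** 2) * u

rng = np.random.default_rng(0)
x_toy = rng.standard_normal((4, 2))
v_toy = rng.standard_normal((4, 2))
u_toy = rng.standard_normal((4, 2))

# Check the linear map against central finite differences
eps = 1e-6
fd = (np.tanh(x_toy + eps * v_toy) - np.tanh(x_toy - eps * v_toy)) / (2.0 * eps)
np.testing.assert_allclose(tanh_lmap(x_toy, v_toy), fd, rtol=1e-5, atol=1e-8)

# Check the adjoint via the inner-product identity <A v, u> = <v, A* u>
lhs = np.sum(tanh_lmap(x_toy, v_toy) * u_toy)
rhs = np.sum(v_toy * tanh_lmap_adjoint(x_toy, u_toy))
np.testing.assert_allclose(lhs, rhs)
```

Both checks are independent of the specific activation, so they can be reused for other elementwise nonlinearities.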

**2.c.3** Fill in the lmap and lmap_adjoint for the ReLU and the sigmoid activation functions below. (5 pts)

In [None]:
# relu
def relu(x):
    output = np.zeros_like(x)
    output[x >= 0] = x[x >= 0]
    return output

### ============= 2.c.3 Fill in below ==============
def relu_lmap(x, v):
    ### Implement the linear map for ReLU at x in the direction v
    ### by convention, set the gradient at zero to zero. 
    
    return None

def relu_lmap_adjoint(x, u):
    ### Implement the adjoint linear map of the ReLU at x in the direction u
    
    return None
### ===========================


x = np.linspace(-1, 1, 100)
v = np.ones_like(x)
plt.figure()
plt.plot(x, relu(x))
plt.plot(x, relu_lmap(x, v))
plt.xlabel("x")
plt.title("ReLU")

In [None]:
# sigmoid
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

### ============= 2.c.3 Fill in below ==============
def sigmoid_lmap(x, v):
    ### Implement the linear map for sigmoid at x in the direction v

    return None

def sigmoid_lmap_adjoint(x, u):
    ### Implement adjoint map of the sigmoid at x in direction u
    
    return None
### ===========================


x = np.linspace(-np.pi, np.pi, 100)
v = np.ones_like(x)

plt.figure()
plt.plot(x, sigmoid(x))
plt.plot(x, sigmoid_lmap(x, v))
plt.xlabel("x")
plt.title("Sigmoid")

Now we have all the ingredients to create our own little neural networks. 

For example, let's create the following neural network: two hidden layers of 128 units each, sigmoid activation functions, and mean squared error (MSE) as the loss function. The network takes a batch of inputs $x \in \mathbb{R}^{3 \times N}$ and produces outputs $y \in \mathbb{R}^{4 \times N}$, where $N$ is the batch size. We will use this network in the next practical to map from end effector positions in 3D to the four motor commands for our robot example.

$$
x \in \mathbb{R}^{3 \times N} \rightarrow \mathrm{linear}(W_1) \rightarrow \mathrm{sigmoid} \rightarrow \mathrm{linear}(W_2) \rightarrow \mathrm{sigmoid} \rightarrow \mathrm{linear}(W_3) = y \in \mathbb{R}^{4 \times N}
$$

To train the network, we use the mean squared error (MSE) loss:
$$
L(y,z) = \frac{1}{4N}\,\lVert y - z \rVert_F^2,
\qquad y, z \in \mathbb{R}^{4 \times N}.
$$

Here, $\lVert \cdot \rVert_F$ denotes the Frobenius norm, and the loss averages the squared error over both the output dimensions and the batch.
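
As a quick sanity check of this definition, the sketch below confirms on random toy data that the Frobenius-norm form equals `np.mean` of the squared error, and that the gradient with respect to $y$ is $2(y - z)/(4N)$. All `*_toy` names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_toy = 5
y_toy = rng.standard_normal((4, N_toy))
z_toy = rng.standard_normal((4, N_toy))

# The two ways of writing the MSE loss agree
loss_frobenius = np.linalg.norm(y_toy - z_toy, "fro") ** 2 / (4 * N_toy)
loss_mean = np.mean((y_toy - z_toy) ** 2)
np.testing.assert_allclose(loss_frobenius, loss_mean)

# Gradient w.r.t. y: 2 (y - z) / (4 N), where 4 N equals y.size
grad_y = 2.0 * (y_toy - z_toy) / y_toy.size

# Check one entry of the gradient by central finite differences
eps = 1e-6
E = np.zeros_like(y_toy)
E[1, 2] = 1.0
fd_entry = (np.mean((y_toy + eps * E - z_toy) ** 2)
            - np.mean((y_toy - eps * E - z_toy) ** 2)) / (2.0 * eps)
np.testing.assert_allclose(fd_entry, grad_y[1, 2], rtol=1e-6, atol=1e-9)
```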

In [None]:
from mytorch.layers import linear

m0 = 3
m1, n1 = 128, m0
m2, n2 = 128, m1
m3, n3 = 4, m2

W_list = [np.random.rand(mi, mj) for mi, mj in zip([m1, m2, m3], [n1, n2, n3])]

N = 100
x = np.random.rand(m0, N)
z = np.random.rand(4, N)
print(x.shape, z.shape)

forward = (linear, sigmoid, linear, sigmoid, linear) 
W_for_layer = (W_list[0], None, W_list[1], None, W_list[2])
lmap = (linear_lmap, sigmoid_lmap, linear_lmap, sigmoid_lmap, linear_lmap)
lmap_adjoint = (linear_lmap_adjoint, sigmoid_lmap_adjoint, linear_lmap_adjoint, sigmoid_lmap_adjoint, linear_lmap_adjoint)

In [None]:
# write out the analytical gradient for W_list[0], W_list[1], W_list[2] to ensure that you implement the correct gradient.
s1 = W_list[0] @ x
s2 = sigmoid(s1)
s3 = W_list[1] @ s2
s4 = sigmoid(s3)
y = W_list[2] @ s4

# mse loss
loss = np.mean((y-z)**2)

# backward (batch)
delta5 = 2 * (y - z) / y.size                         # dL/dy, (4, N)
grad_W2_analytical = delta5 @ s4.T                    # (4,N)@(N,128) -> (4,128)
delta4 = (W_list[2].T @ delta5) * s4 * (1 - s4)       # (128,N)
grad_W1_analytical = delta4 @ s2.T                    # (128,N)@(N,128) -> (128,128)
delta2 = (W_list[1].T @ delta4) * s2 * (1 - s2)       # (128,N)
grad_W0_analytical = delta2 @ x.T                     # (128,N)@(N,3) -> (128,3)

**2.c.4** Compute the gradients w.r.t. the network parameters $W_k$ using backpropagation. Fill in the missing parts in the forward and backward pass (10 pts) 

In [None]:
def compute_loss_and_grad(x_batch, z_batch, W_for_layer, forward, lmap_adjoint):
    ### ============= 2.c.4 Fill in below ==============
    ### Do forward and backward passes to compute the gradient w.r.t. the network weights $W_k$ using backpropagation.
    ### Fill in the missing parts in the forward and backward pass
    
    # Forward pass 
    s_list = [x_batch]

    for fun, W in zip(forward, W_for_layer):
        s = None                         # (forward propagate through layer)
        s_list.append(s)

    y = s_list[-1]

    # Loss
    loss = np.mean((y - z_batch) ** 2)

    # Backward pass
    r = None                             # (gradient of MSE loss w.r.t. y)
    rw_list = []

    for func, W, s in zip(reversed(lmap_adjoint), reversed(W_for_layer), reversed(s_list[:-1])):
        r, rw = None                     # (backward propagate through linear layer)
        rw_list.insert(0, rw)

    return loss, rw_list
### ===========================
loss, rW_list = compute_loss_and_grad(x, z, W_for_layer, forward, lmap_adjoint)

np.testing.assert_allclose(rW_list[0], grad_W0_analytical)  
np.testing.assert_allclose(rW_list[1], grad_W1_analytical)  
np.testing.assert_allclose(rW_list[2], grad_W2_analytical)  

**2.c.5** Implement the same network using PyTorch and ensure that the gradients are the same. (5 pts)

In [None]:
import torch

dtype = torch.float64
W_torch = []

# define the pytorch parameters from numpy
W_0 = torch.tensor(W_list[0], dtype=dtype, requires_grad=True)
W_1 = torch.tensor(W_list[1], dtype=dtype, requires_grad=True)
W_2 = torch.tensor(W_list[2], dtype=dtype, requires_grad=True)
xt = torch.tensor(x, dtype=dtype)
zt = torch.tensor(z, dtype=dtype)

### ============= 2.c.5 Fill in below ==============
### Use PyTorch to compute the gradient w.r.t. the network weights $W_k$.
### - W_k.grad.numpy(): the gradient of loss function w.r.t W_list[k] 


### ===========================

np.testing.assert_allclose(W_0.grad.numpy(), rW_list[0])
np.testing.assert_allclose(W_1.grad.numpy(), rW_list[1])
np.testing.assert_allclose(W_2.grad.numpy(), rW_list[2])

## 2.d Packaging of functions (10 pts)

**2.d.1** As in the last notebook, make sure to package some of this code so that you can reuse it in the next notebook. You will need the linear layers, the activation functions, and the squared error loss. Move the code to these modules and write corresponding tests. (5 pts)
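
As a rough idea of what such tests could look like, here is a minimal pytest-style sketch for the sigmoid activation. It defines a local copy of `sigmoid` so that the example is self-contained; in your package you would import it from wherever you placed it in `mytorch` instead. The test names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    # Local copy for illustration only; import your packaged version instead
    return 1.0 / (1.0 + np.exp(-x))

def test_sigmoid_known_values():
    # sigmoid(0) = 0.5 and sigmoid(x) + sigmoid(-x) = 1 for all x
    np.testing.assert_allclose(sigmoid(0.0), 0.5)
    x = np.linspace(-3, 3, 7)
    np.testing.assert_allclose(sigmoid(x) + sigmoid(-x), np.ones_like(x))

def test_sigmoid_preserves_shape():
    # Elementwise functions must return an array of the same shape
    x = np.zeros((3, 10))
    assert sigmoid(x).shape == x.shape
```

pytest collects functions whose names start with `test_` from files named `test_*.py` or `*_test.py`, so placing a file like this under `mytorch/tests` would make it part of the run below.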

In [None]:
!pytest mytorch/tests

**2.d.2** Answer the questions below about the package that you created. Note that there is no right or wrong answer; these questions are meant to make you reflect on what you have implemented. (5 pts)
- What tests did you implement? 
- Are you sure that your code works based on these tests? 
- Is the interface to your code straightforward (i.e., how many lines of code are required to run your tests?) 

## Acknowledgment of Collaboration and/or Tool Use

Please choose from below (simply delete the lines that do not apply) and add a few additional notes

- “I worked alone on this assignment.”, or
- “I worked with ~~~~~~ [person or tool] on this assignment.” and/or
- “I received assistance from ~~~~~~ [person or tool] on this assignment.”

For the last two cases, specify how the person or tool helped you and explain why this amplified your learning process:

_add answer here_

# References

[1] Elements of Differentiable Programming