# Part 3: Reverse Mode Automatic Differentiation with PyTorch

In [1]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

PyTorch implements Dynamic Reverse Mode Automatic Differentiation, much like we did in the previous exercise. The major difference between PyTorch and our simple example is that PyTorch works directly with matrices (`Tensor`s) rather than with scalars (although a matrix can of course represent a scalar).

In this tutorial, we'll explore PyTorch's AD implementation. Note that we're using the API of PyTorch 0.4 or later which simplifies use of AD (previous versions required wrapping all `Tensor`s that you wanted to compute gradients of in `Variable` objects; PyTorch 0.4 removes the need to do this and allows `Tensor`s themselves to track gradients).

We'll start with the simple example we tried earlier in the code block below:

__Task:__ Run the following code and verify the solution is correct

In [2]:
import torch

# set up the problem
x = torch.tensor(0.5, requires_grad=True) #set requires_grad = True to track gradient
y = torch.tensor(4.2, requires_grad=True)
z = x * y + torch.sin(x)

print("z = " + str(z.item()))

z.backward() # this goes through the computation graph and accumulates the gradients in the cached .grad attributes
print("dz/dx = " + str(x.grad.item()))
print("dz/dy = " + str(y.grad.item()))

z = 2.57942533493042
dz/dx = 5.077582359313965
dz/dy = 0.5
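
As a sanity check, the derivatives can also be worked out by hand: for $z = xy + \sin x$ we have $\partial z/\partial x = y + \cos x$ and $\partial z/\partial y = x$. The following sketch compares those hand-derived values with what PyTorch reports. The names `analytic_dzdx` and `analytic_dzdy` are illustrative and not part of the original exercise.

```python
import math
import torch

x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(4.2, requires_grad=True)
z = x * y + torch.sin(x)
z.backward()

# Hand-derived partial derivatives of z = x*y + sin(x) at x = 0.5, y = 4.2
analytic_dzdx = 4.2 + math.cos(0.5)  # dz/dx = y + cos(x)
analytic_dzdy = 0.5                  # dz/dy = x

# Both comparisons should print True (up to float32 rounding)
print(abs(x.grad.item() - analytic_dzdx) < 1e-5)
print(abs(y.grad.item() - analytic_dzdy) < 1e-6)
```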


As with our own AD implementation, PyTorch lets us differentiate through an algorithm.

__Task__: Use the block below to compute the gradient $\partial z/\partial x$ of the following pseudocode algorithm and store the result in the `dzdx` variable:

    x = 0.5
    z = 1
    i = 0
    while i<2:
        z = (z + i) * x * x
        i = i + 1

In [None]:
# set up the problem
x = torch.tensor(0.5, requires_grad=True) # set requires_grad=True to track the gradient
z = torch.tensor(1.0, requires_grad=True)
i = 0
while i < 2:
  z = (z + torch.tensor(float(i), requires_grad=True)) * x * x
  i = i + 1 # advance the loop counter, otherwise the loop never terminates

print("z = " + str(z.item()))

z.backward() # this goes through the computation graph and accumulates the gradients in the cached .grad attributes
dzdx = x.grad # store the result so the check in the next cell can find it
print("dz/dx = " + str(dzdx.item()))

In [None]:
assert dzdx
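
One way to gain confidence in a result like this is to compare it against a finite-difference estimate and against the derivative obtained by unrolling the loop by hand: two iterations give $z = (x^2 + 1)x^2 = x^4 + x^2$, so $\partial z/\partial x = 4x^3 + 2x$. The sketch below assumes a helper called `algorithm` (a hypothetical name) that wraps the pseudocode, and uses `float64` so that the finite-difference estimate is reasonably accurate.

```python
import torch

def algorithm(x):
    # Direct transcription of the pseudocode loop
    z = torch.tensor(1.0, dtype=x.dtype)
    for i in range(2):
        z = (z + i) * x * x
    return z

x = torch.tensor(0.5, dtype=torch.float64, requires_grad=True)
z = algorithm(x)
z.backward()

# Central finite-difference estimate of dz/dx (no graph needed here)
eps = 1e-6
with torch.no_grad():
    numeric = (algorithm(x + eps) - algorithm(x - eps)) / (2 * eps)

# Hand-derived value: d/dx (x^4 + x^2) evaluated at x = 0.5
analytic = 4 * 0.5**3 + 2 * 0.5

print(x.grad.item(), numeric.item(), analytic)
```

All three numbers should agree to several decimal places.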


## PyTorch limitations: in-place operations and aliasing

PyTorch will throw an error at runtime if you try to differentiate through an in-place operation that overwrites a value needed to compute the gradient.

__Task__: Run the following code to see this in action.

In [4]:
x = torch.tensor(1.0, requires_grad=True)
y = x.tanh()
y.add_(3) # inplace addition
y.backward()

RuntimeError: ignored

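When `y` is computed, `x` holds a particular value, and the derivative of `y` with respect to `x` depends on that value.
After `add_(3)`, however, the value of `y` has been overwritten in place, so `y.backward()` would compute an incorrect derivative with respect to `x`. To prevent this kind of mistake, PyTorch raises an error instead.
The underlying cause is that when `y = x.tanh()` is executed, PyTorch's autograd machinery saves references to the tensors it will need for the later backward computation.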
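
A minimal sketch of the usual workaround follows: replacing the in-place `add_` with an out-of-place `+` allocates a new tensor and leaves the values saved for the backward pass untouched. The flag `inplace_failed` is an illustrative name, and the comment about what `tanh` saves reflects my understanding of PyTorch's behaviour rather than anything stated in the exercise.

```python
import math
import torch

x = torch.tensor(1.0, requires_grad=True)

# In-place version: the output of tanh is overwritten before backward() runs,
# so autograd refuses to compute the gradient.
y = x.tanh()
y.add_(3)
try:
    y.backward()
    inplace_failed = False
except RuntimeError:
    inplace_failed = True

# Out-of-place version: y + 3 creates a new tensor, so nothing is overwritten.
y = x.tanh() + 3
y.backward()

expected = 1 - math.tanh(1.0) ** 2  # d/dx tanh(x) = 1 - tanh(x)^2
print(inplace_failed, x.grad.item(), expected)
```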

Aliasing is also something that can't be differentiated through and will result in a slightly more cryptic error.

__Task__: Run the following code to see this in action. If you don't understand what this code does, add some `print` statements to show the values of `x` and `y` at various points.

In [3]:
x = torch.tensor([1, 2, 3, 4], requires_grad=True, dtype=torch.float)
print(x)
y = x[:1]
print(y)
y.add_(3)
y.backward()

tensor([1., 2., 3., 4.], requires_grad=True)
tensor([1.], grad_fn=<SliceBackward0>)


RuntimeError: ignored
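
If the intent is to modify a slice without touching `x`, one possible fix (a sketch, not the only option) is to `clone()` the slice first so that `y` owns its own memory. The gradient still flows back to the element of `x` that the slice was taken from.

```python
import torch

x = torch.tensor([1, 2, 3, 4], requires_grad=True, dtype=torch.float)

y = x[:1].clone()  # copy the slice so y no longer aliases x's memory
y.add_(3)          # the in-place update now only touches the copy
y.backward()

print(x)       # x itself is unchanged
print(y)
print(x.grad)  # only the first element of x contributes to y
```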

## Dealing with multiple outputs

PyTorch can deal with the case where there are multiple output variables if we can formulate the expression in terms of tensor operations. Consider the example from the presentation:

$$\begin{cases}
     z = 2x + \sin x\\
     v = 4x + \cos x
\end{cases}$$

We could formulate this as:

$$
\begin{bmatrix}z \\ v\end{bmatrix} = \begin{bmatrix}2 \\ 4\end{bmatrix} \odot \bar{x} + \begin{bmatrix}1 \\ 0\end{bmatrix} \odot \sin\bar x + \begin{bmatrix}0 \\ 1\end{bmatrix} \odot \cos\bar x
$$

where 

$$
\bar x = \begin{bmatrix}x \\ x\end{bmatrix}
$$

and $\odot$ represents the Hadamard or element-wise product. This is demonstrated using PyTorch in the following code block.

__Task:__ run the code below.

In [12]:
x = torch.tensor([[1.0],[1.0]], requires_grad=True)

zv = ( torch.tensor([[2.0],[4.0]]) * x +
         torch.tensor([[1.0], [0.0]]) * torch.sin(x) +
         torch.tensor([[0.0], [1.0]]) * torch.cos(x) )
        
zv.backward(torch.tensor([[1.0],[1.0]])) # Note as we have "multiple outputs" we must pass in a tensor of weights of the correct size

print(x.grad)

tensor([[2.5403],
        [3.1585]])


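Note that these gradients depend on the point at which they are evaluated: here $\partial z/\partial x = 2 + \cos x$ and $\partial v/\partial x = 4 - \sin x$, which at $x = 1$ give the values 2.5403 and 3.1585 printed above. In general a gradient is itself a function of the inputs (and, in a neural network, of the activations of the hidden layers).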
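
The weight tensor passed to `backward` selects which combination of outputs is differentiated. The following sketch, which assumes a helper named `outputs` (hypothetical) and a single scalar `x` rather than the duplicated $\bar x$, uses `torch.autograd.grad` with one-hot weights to recover the derivative of each output separately.

```python
import math
import torch

def outputs(x):
    # Stack z and v into a single tensor of shape (2,)
    return torch.stack([2 * x + torch.sin(x), 4 * x + torch.cos(x)])

x = torch.tensor(1.0, requires_grad=True)
zv = outputs(x)

# Weights [1, 0] pick out z; weights [0, 1] pick out v.
# retain_graph=True keeps the graph alive for the second call.
(dz_dx,) = torch.autograd.grad(zv, x, grad_outputs=torch.tensor([1.0, 0.0]), retain_graph=True)
(dv_dx,) = torch.autograd.grad(zv, x, grad_outputs=torch.tensor([0.0, 1.0]))

print(dz_dx.item(), 2 + math.cos(1.0))  # dz/dx = 2 + cos(x)
print(dv_dx.item(), 4 - math.sin(1.0))  # dv/dx = 4 - sin(x)
```

With a single shared `x`, weights of `[1, 1]` would instead return the sum of the two derivatives, which is why the example above duplicates `x` to keep them apart.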

## Gradient descent & gradients with respect to a vector
Let's put everything together and use automatically computed gradients to find the minimum of a function by taking steps down the gradient from an initial position. Rather than explicitly creating each input variable as a scalar as in the previous examples, we'll use a vector instead (so our gradients will be with respect to each element of the vector).

__Task:__ work through the following example to see how taking gradients with respect to a vector works & how simple gradient descent optimisation can be implemented.

In [20]:
# This is our starting vector
initial=[[2.0], [1.0], [10.0]]

# This is the function we will optimise (feel free to work out the analytic minima!)
def function(x):
    return x[0]**2 + x[1]**2 + x[2]**2

x = torch.tensor(initial, requires_grad=True, dtype=torch.float)
for i in range(0,100):
    # manually dispose of the gradient (in reality it would be better to detach and zero it to reuse memory)
    x.grad = None
    # evaluate the function
    J = function(x) 
    # auto-compute the gradients at the previously evaluated point x
    J.backward()
    # compute the update
    x.data = x - x.grad*0.1 
    print(x.requires_grad)
    if i%10 == 0:
        print((x.grad_fn, function(x).item()))

True
(None, 67.19999694824219)
True
True
True
True
True
True
True
True
True
True
(None, 0.7747630476951599)
True
True
True
True
True
True
True
True
True
True
(None, 0.008932411670684814)
True
True
True
True
True
True
True
True
True
True
(None, 0.00010298370034433901)
True
True
True
True
True
True
True
True
True
True
(None, 1.1873213452417986e-06)
True
True
True
True
True
True
True
True
True
True
(None, 1.3688886468798955e-08)
True
True
True
True
True
True
True
True
True
True
(None, 1.5782215811999123e-10)
True
True
True
True
True
True
True
True
True
True
(None, 1.8195657654207498e-12)
True
True
True
True
True
True
True
True
True
True
(None, 2.0978168543115717e-14)
True
True
True
True
True
True
True
True
True
True
(None, 2.418617877589929e-16)
True
True
True
True
True
True
True
True
True
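
For comparison, here is a sketch of the same optimisation written with `torch.no_grad()` and an in-place update, a pattern commonly used for hand-written update steps. It is offered as an alternative under the same step size of 0.1, not as the form the exercise expects.

```python
import torch

def function(x):
    return x[0]**2 + x[1]**2 + x[2]**2

x = torch.tensor([[2.0], [1.0], [10.0]], requires_grad=True)
for i in range(100):
    J = function(x)
    J.backward()
    with torch.no_grad():
        x -= 0.1 * x.grad  # in-place step that autograd does not record
    x.grad.zero_()         # clear the accumulated gradient, reusing its memory

# x is still a leaf tensor that tracks gradients, and the function value is close to 0
print(x.is_leaf, x.requires_grad, function(x).item())
```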


In [10]:
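# show how requires_grad propagates through operations
# requires_grad_() changes the flag in place, but only on leaf tensors:
# calling it on a computed (non-leaf) tensor such as the new `a` raises a RuntimeError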
a = torch.randn((2, 2), requires_grad=True)
print(a)
print(a.requires_grad)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(False)
print(a.requires_grad)
b = (a * a).sum()
print(b.requires_grad)
print(b.grad_fn)

tensor([[ 0.3982,  1.1602],
        [-0.8490,  1.8391]], requires_grad=True)
True
True


RuntimeError: ignored

__Task__: Answer the following question in the box below: Why must the update in the gradient descent code above be assigned to `x.data` rather than `x`?

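Two points about `tensor.data`:

(1) `x.data` returns a tensor holding the same data as `x`. The new tensor shares its storage with the original, so changing one also changes the other, and the new tensor has `requires_grad = False`, i.e. it cannot be differentiated. (`detach()` behaves the same way in this respect.)

(2) The limitation of `tensor.data`. The documentation says that using `tensor.data` is unsafe, because changes made through `x.data` cannot be tracked by autograd. What does this mean? Modifying the detached tensor also modifies the original tensor, but autograd is unaware of the change. It carries on applying the usual differentiation rules and can return a completely wrong gradient without any warning. The risk is that a tensor modified anywhere in the program in this way silently corrupts the gradients computed later.

Two points about `tensor.detach()`:

(1) `x.detach()` returns a tensor holding the same data as `x`. The new tensor shares its storage with the original, so changing one also changes the other, and the new tensor has `requires_grad = False`, i.e. it cannot be differentiated. (`.data` behaves the same way in this respect.)

(2) The advantage of `tensor.detach()`. Modifying the detached tensor still modifies the original tensor, but if the original is needed for the backward pass, autograd detects that it has changed and raises an error instead of continuing. This avoids returning a gradient that is complete nonsense.

In common: both `tensor.data` and `tensor.detach()` separate a tensor from the graph, and both return a tensor that shares memory with the original, so modifying the result acts in place on the original.

Differences:

(1) `.data` is an attribute, whereas `.detach()` is a method;

(2) `.data` is unsafe, whereas `.detach()` is safe.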
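
The difference between the two can be seen in a small experiment. The sketch below uses `sigmoid` because its backward pass needs its own output. The tensors `a` and `b` and the flag `detach_raised` are illustrative names.

```python
import torch

# .detach(): the in-place change is noticed, so backward() raises an error
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
out.detach().zero_()  # overwrites the memory shared with out
try:
    out.sum().backward()
    detach_raised = False
except RuntimeError:
    detach_raised = True

# .data: the same change goes unnoticed and the gradient is silently wrong
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = b.sigmoid()
out.data.zero_()      # also overwrites the memory shared with out
out.sum().backward()  # no error is raised

# The true gradient sigmoid(b) * (1 - sigmoid(b)) is non-zero everywhere,
# but the gradient computed from the zeroed output is all zeros.
print(detach_raised, b.grad)
```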

## Differentiating through random operations

We'll end with an example that will be important later in the course: differentiating with respect to the parameters of a random number generator.

Assume that as some part of a differentiable program that we write we wish to incorporate a random element where we sample values, $z$ from a Normal distribution: $z \sim \mathcal{N}(\mu,\sigma^2)$. We want to learn the parameters of the generator $\mu$ and $\sigma^2$, but how can we do this? In a differentiable program setting we want to differentiate with respect to these parameters, but at first glance it isn't at all obvious what this means as the generator _just_ produces numbers which themselves don't have gradients.

The answer is often called the _reparameterisation trick_: ***Note that sampling a Normal distribution with a specific mean and variance is equivalent to drawing numbers from a standard Normal distribution and scaling and shifting them: $z \sim \mathcal{N}(\mu,\sigma^2) \equiv z \sim \mu + \sigma\mathcal{N}(0,1)\equiv z = \mu + \sigma \zeta\, \rm{where}\, \zeta\sim\mathcal{N}(0,1)$. With this reparameterisation the gradients with respect to the parameters are obvious: $\partial z/\partial\mu = 1$ and $\partial z/\partial\sigma = \zeta$.***

The following code block demonstrates this in practice; each of the gradients can be interpreted as how much an infinitesimal change in $\mu$ or $\sigma$ would change $z$ if we could repeat the sampling operation again with the same value of `torch.randn(1)` being produced:

In [21]:
mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

for i in range(10):
    mu.grad = None
    sigma.grad = None
    
    z = mu + sigma * torch.randn(1) # torch.randn(1) is a 1-D tensor holding a single standard Normal sample
    z.backward()
    print("z:", z.item(), "\tdzdmu:", mu.grad.item(), "\tdzdsigma:", sigma.grad.item())

z: 1.2051912546157837 	dzdmu: 1.0 	dzdsigma: 0.2051912248134613
z: 1.2875546216964722 	dzdmu: 1.0 	dzdsigma: 0.28755462169647217
z: -2.492189645767212 	dzdmu: 1.0 	dzdsigma: -3.492189645767212
z: -1.1215157508850098 	dzdmu: 1.0 	dzdsigma: -2.1215157508850098
z: 0.45651155710220337 	dzdmu: 1.0 	dzdsigma: -0.5434884428977966
z: 1.753296971321106 	dzdmu: 1.0 	dzdsigma: 0.753296971321106
z: 0.3965885639190674 	dzdmu: 1.0 	dzdsigma: -0.6034114360809326
z: -0.6901881694793701 	dzdmu: 1.0 	dzdsigma: -1.6901881694793701
z: 1.898858666419983 	dzdmu: 1.0 	dzdsigma: 0.8988586664199829
z: -0.9623163938522339 	dzdmu: 1.0 	dzdsigma: -1.9623163938522339
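
PyTorch's `torch.distributions` module packages the same idea: to my understanding, `Normal.rsample()` draws a reparameterised sample that gradients can flow through, whereas `Normal.sample()` returns a value cut off from the graph. The sketch below is an illustration of that correspondence rather than part of the exercise. The seed and the name `zeta` are arbitrary choices.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

# rsample() keeps the dependence on mu and sigma in the graph
z = Normal(mu, sigma).rsample()
z.backward()

# Recover the underlying standard Normal draw from z = mu + sigma * zeta
zeta = ((z - mu) / sigma).item()
print("dzdmu:", mu.grad.item(), "dzdsigma:", sigma.grad.item(), "zeta:", zeta)

# sample() is not differentiable with respect to the parameters
z_plain = Normal(mu, sigma).sample()
print(z_plain.requires_grad)
```

As in the hand-written version, `dzdmu` is 1 and `dzdsigma` matches the standard Normal draw `zeta`.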
