<br>

<div align=center><font color=maroon size=8><b>Autograd</b></font></div>

<br>

<font size=4><b>References:</b></font>
* `Tutorials >` Deep Learning with PyTorch: A 60 Minute Blitz > <a href="https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html" style="text-decoration:none;">A Gentle Introduction to torch.autograd</a>
* `Tutorials > Learn the Basics >` <a href="https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html" style="text-decoration:none;">Automatic Differentiation with torch.autograd</a> (Automatic Differentiation)
* 
* <a href="https://pytorch.org/docs/stable/index.html" style="text-decoration:none;">Docs > PyTorch documentation</a>
    * 
    * **Developer Notes**
        * `Docs > 2` <a href="https://pytorch.org/docs/stable/notes/autograd.html" style="text-decoration:none;">Autograd mechanics</a>
        * `Docs > 10` <a href="https://pytorch.org/docs/stable/notes/gradcheck.html" style="text-decoration:none;">Gradcheck mechanics</a>
    * 
    * `Docs >` <a href="https://pytorch.org/docs/stable/autograd.html" style="text-decoration:none;">Automatic differentiation package - torch.autograd</a>
    * `Docs >` Automatic differentiation package - torch.autograd > <a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html" style="text-decoration:none;">torch.autograd.grad</a>
    * `Docs >` Automatic differentiation package - torch.autograd > <a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html" style="text-decoration:none;">torch.autograd.backward</a>
    * 
    * `Tutorials >` <a href="https://pytorch.org/tutorials/intermediate/forward_ad_usage.html" style="text-decoration:none;">Forward-mode Automatic Differentiation (Beta)</a> (forward-mode AD tutorial)
    * `Docs >` <a href="https://pytorch.org/cppdocs/notes/inference_mode.html" style="text-decoration:none;">Inference Mode</a>

<br>
<br>

# Tutorials

<font color=gray size=3>Tutorials > Deep Learning with PyTorch: A 60 Minute Blitz > </font>

## <font style="font-size:120%;color:maroon;font-weight:bold">A Gentle Introduction to torch.autograd</font> <a href="https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html" style="text-decoration:none;"><font size=2>[link]</font></a>

See:
* the blue [link] above, or
* my own notes: `D:\KeepStudy\0_Coding\Pytorch\1 Tutorials\4 Learning PyTorch\4-1 Deep Learning with PyTorch - A 60 Minute Blitz .ipynb`

<br>
<br>
<br>

<font color=gray size=3>Tutorials > Learn the Basics</font>

## <font style="font-size:120%;color:maroon;font-weight:bold">Automatic Differentiation with torch.autograd</font> <a href="https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html" style="text-decoration:none;"><font size=2>[link]</font></a>

When training neural networks, the most frequently used algorithm is ***back propagation***. In this algorithm, parameters (model weights) are adjusted according to the **gradient** of the loss function with respect to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called <font color=blue size=3>**torch.autograd**</font>. <font color=maroon>It supports automatic computation of gradient for any computational graph.</font>

<br>

Consider the simplest one-layer neural network, with input `x`, parameters `w` and `b`, and some loss function. It can be defined in PyTorch in the following manner:

In [1]:
import torch

x = torch.ones(5)      # input tensor
y = torch.zeros(3)     # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w)+b

loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

<br>
<br>

### Tensors, Functions and Computational graph

This code defines the following <font size=4 color=blue>**computational graph**:</font>

<img src="../1 Tutorials/images/simple computational graph.png" width=600px>

In this network, `w` and `b` are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of loss function with respect to those variables. <font color=maroon>In order to do that, we set the ***requires_grad*** property of those tensors.</font>

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

You can set the value of **`requires_grad`** when creating a tensor, or later by using **`x.requires_grad_(True)`** method.

</div>
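
As a quick illustration of the note above, the sketch below sets the flag once at creation time and once afterwards in place. The tensor names `a` and `b` are illustrative only and are not part of the tutorial.

```python
import torch

# Option 1: request gradient tracking when the tensor is created.
a = torch.randn(3, requires_grad=True)

# Option 2: create the tensor first, then switch tracking on in place.
b = torch.randn(3)
print(b.requires_grad)                   # False
b.requires_grad_(True)
print(a.requires_grad, b.requires_grad)  # True True
```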

<br>

<font size=3><font color=maroon>A function that we apply to tensors to construct computational graph is in fact an object of class **`Function`**.</font> This object knows how to compute the function in the ***forward*** direction, and also how to compute its derivative during the ***backward propagation*** step. <br>
<font color=maroon>A reference to the backward propagation function is stored in ***`grad_fn`*** property of a tensor.</font> You can find more information of `Function` <a href="https://pytorch.org/docs/stable/autograd.html#function" style="text-decoration:none;">in the documentation</a>.</font>

In [2]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x000002158DC27970>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x000002158DC27610>


<br>
<br>

### Computing Gradients

To optimize weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to parameters, namely, we need <font size=4>$\frac{\partial loss}{\partial w}$</font> and <font size=4>$\frac{\partial loss}{\partial b}$</font>  <font color=maroon>under some fixed values of `x` and `y`.</font> 

To compute those derivatives, we call `loss.backward()`, and then retrieve the values from `w.grad` and `b.grad`:

In [3]:
loss.backward()

print(w.grad)
print(b.grad)

tensor([[0.1149, 0.0778, 0.0487],
        [0.1149, 0.0778, 0.0487],
        [0.1149, 0.0778, 0.0487],
        [0.1149, 0.0778, 0.0487],
        [0.1149, 0.0778, 0.0487]])
tensor([0.1149, 0.0778, 0.0487])


<div class="alert alert-block alert-info">

<font size=3 ><font color=red><b>NOTE: </b></font>

* We can only obtain the `grad` properties for the leaf nodes of the computational graph, which have `requires_grad` property set to `True`. For all other nodes in our graph, gradients will not be available.
    
    <br>
    
* We can only perform gradient calculations using `backward` once on a given graph, for performance reasons. If we need to do several `backward` calls on the same graph, we need to pass `retain_graph=True` to the `backward` call.
</font>
</div>
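
Both points of the note can be observed directly. The following is a minimal sketch on a toy graph (the tensors `w`, `z`, `loss` and `loss2` are made up for illustration): the intermediate tensor `z` ends up without a gradient, and a second `backward` call fails unless `retain_graph=True` was passed to the first one.

```python
import warnings
import torch

w = torch.randn(3, requires_grad=True)   # leaf tensor
z = w * 2                                # non-leaf (intermediate) tensor
loss = z.pow(2).sum()

loss.backward()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")      # reading .grad on a non-leaf emits a warning
    print(w.grad is not None, z.grad is None)   # True True

# A second backward pass on the same graph fails because the values saved
# for the backward computation were freed after the first pass.
second_call_failed = False
try:
    loss.backward()
except RuntimeError:
    second_call_failed = True
print(second_call_failed)                # True

# With retain_graph=True the graph is kept, so backward can be repeated.
loss2 = (w * 2).pow(2).sum()
loss2.backward(retain_graph=True)
loss2.backward()
```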

<br>
<br>

### Disabling Gradient Tracking

<font color=maroon>By default, all tensors with `requires_grad=True` are tracking their computational history and support gradient computation. </font>
    
However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. 

<br>

<font size=3 color=maroon>**①** We can stop tracking computations by surrounding our computation code with `torch.no_grad()` block:</font>

In [4]:
z = torch.matmul(x, w)+b
print(z.requires_grad)


with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


<br>

<font size=3 color=maroon>**②** Another way to achieve the same result is to use the `detach()` method on the tensor:</font>

In [5]:
z = torch.matmul(x, w)+b
print(z.requires_grad)


z_det = z.detach()
print(z_det.requires_grad)

True
False


<br>

<font size=3><font color=maroon>There are reasons you might want to disable gradient tracking:</font>
* To mark some parameters in your neural network as **frozen parameters**. This is a very common scenario for <a href="https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html" style="text-decoration:none;font-size:120%">finetuning a pretrained network</a>.
    
    
* To **speed up computations** when you are only doing forward pass, because computations on tensors that do not track gradients would be more efficient.</font>
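
The frozen-parameter case can be sketched as follows. This is only an illustration on a single `torch.nn.Linear` layer (the layer sizes are arbitrary), not the fine-tuning recipe from the linked tutorial.

```python
import torch

layer = torch.nn.Linear(4, 2)
layer.weight.requires_grad_(False)   # freeze the weight; the bias stays trainable

out = layer(torch.ones(1, 4)).sum()
out.backward()

print(layer.weight.grad)   # None: no gradient is computed for the frozen weight
print(layer.bias.grad)     # tensor([1., 1.])
```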

<br>
<br>

### More on Computational Graphs

Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a <font size=3 color=blue><b>directed acyclic graph `(`DAG`)`</b></font> consisting of 
<a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function" style="text-decoration:none;"><font size=3>Function</font></a> objects.

<font size=3>In this DAG, <font color=blue><b>leaves</b></font> are the `input tensors`, <font color=blue><b>roots</b></font> are the `output tensors`. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.</font>

In a <font size=3 color=maroon>**forward pass**</font>, autograd does two things simultaneously:

* run the `requested operation` to compute a resulting tensor, and
* maintain the operation’s `gradient function` in the DAG.


The <font size=3 color=maroon>**backward pass**</font> kicks off when `.backward()` is called on the DAG root. `autograd` then:

* computes the gradients from each `.grad_fn`,
* accumulates them in the respective tensor’s `.grad` attribute, and
* using the chain rule, propagates all the way to the leaf tensors.

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font><br><br>
<font size=3>
**DAGs are dynamic in PyTorch** An important thing to note is that the graph is recreated from scratch; after each `.backward()` call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
</font>
</div>
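
To make the dynamic-graph point concrete, here is a small sketch in which ordinary Python control flow decides which operations are recorded. The function `branchy` is hypothetical and exists only for this illustration.

```python
import torch

def branchy(x):
    # The branch taken at run time determines the graph that autograd records.
    if x.item() > 0:
        return x ** 2
    return -3 * x

x_pos = torch.tensor(2.0, requires_grad=True)
x_neg = torch.tensor(-1.0, requires_grad=True)

branchy(x_pos).backward()
branchy(x_neg).backward()

print(x_pos.grad)   # tensor(4.)   from d(x^2)/dx = 2x
print(x_neg.grad)   # tensor(-3.)  from d(-3x)/dx = -3
```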

<br>
<br>

### Optional Reading: Tensor Gradients and Jacobian Products

<font color=maroon>In many cases, we have a ***`scalar`***` loss function`, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute so-called ***Jacobian product***, and **not the actual gradient**.</font>

For a vector function <font size=4>$\vec{y}=f(\vec{x})$</font>, where <font size=4>$\vec{x}=\langle x_1,\dots,x_n\rangle$</font> and <font size=4>$\vec{y}=\langle y_1,\dots,y_m\rangle$</font>, a gradient of <font size=4>$\vec{y}$</font> with respect to <font size=4>$\vec{x}$</font> is given by **Jacobian matrix**:

<font size=4>
$$
J = \left(\frac{∂\mathcal{y}}{∂x_1} \cdots \frac{∂\mathcal{y}}{∂x_n} \right)
  = 
\left(
\begin{matrix}
\frac{∂y_1}{∂x_1}  & \cdots & \frac{∂y_1}{∂x_n}      \\
\vdots             & \ddots & \vdots \\
\frac{∂y_m}{∂x_1}  & \cdots & \frac{∂y_m}{∂x_n}      \\
\end{matrix}
\right)
$$
</font>

<font size=3 color=maroon>Instead of computing the Jacobian matrix itself, PyTorch allows you to compute ***Jacobian Product*** <font size=4>$ \ \ \ v^T\cdot J$</font> for a given input vector <font size=4>$v=(v_1 \dots v_m)$</font>. This is achieved by calling **`backward`** with <font size=4>$v$</font> as an argument. </font>
    
<font size=3>The size of <font size=4>$v$</font> should be the same as the size of the original tensor, with respect to which we want to compute the product:</font>

In [9]:
inp = torch.eye(5, requires_grad=True)   # x
out = (inp+1).pow(2)                     # y = (x+1)^2

print(inp)
print(out)

tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]], requires_grad=True)
tensor([[4., 1., 1., 1., 1.],
        [1., 4., 1., 1., 1.],
        [1., 1., 4., 1., 1.],
        [1., 1., 1., 4., 1.],
        [1., 1., 1., 1., 4.]], grad_fn=<PowBackward0>)


In [10]:
# ∂y/∂x = 2*(x+1)
out.backward(torch.ones_like(inp), retain_graph=True)
print(f"First call\n{inp.grad}")


# Second call: the same gradient is ADDED to inp.grad (accumulation, not a second derivative)
out.backward(torch.ones_like(inp), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")


out.backward(torch.ones_like(inp), retain_graph=True)
print(f"\nThird call\n{inp.grad}")


inp.grad.zero_()
out.backward(torch.ones_like(inp), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])

Third call
tensor([[12.,  6.,  6.,  6.,  6.],
        [ 6., 12.,  6.,  6.,  6.],
        [ 6.,  6., 12.,  6.,  6.],
        [ 6.,  6.,  6., 12.,  6.],
        [ 6.,  6.,  6.,  6., 12.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])


<font size=3 color=maroon>**Notice that**</font> when we call `backward` for the second time with the same argument, the value of the gradient is different. <font size=3 color=maroon>This happens because when doing `backward` propagation, PyTorch **accumulates the gradients**, i.e. the value of computed gradients is added to the `grad` property of all leaf nodes of computational graph. If you want to compute the proper gradients, you need to **zero out the** `grad` property before. In real-life training an ***optimizer*** helps us to do this.</font>

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

Previously we were calling `backward()` function without parameters. This is essentially equivalent to calling `backward(torch.tensor(1.0))`, which is a useful way to compute the gradients in case of a scalar-valued function, such as loss during neural network training.

</div>
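
The vector-Jacobian product described in this section can be checked against an explicitly built Jacobian. The sketch below assumes a small made-up function `f` from $R^3$ to $R^2$, and uses `torch.autograd.functional.jacobian` only to build the full matrix for comparison.

```python
import torch

def f(x):
    # Hypothetical vector function R^3 -> R^2, chosen only for illustration.
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])

y = f(x)
y.backward(v)                        # stores v^T J in x.grad

# Full 2x3 Jacobian, here [[2, 1, 0], [0, 1, 6]], built only to compare.
J = torch.autograd.functional.jacobian(f, x.detach())
print(x.grad)                        # values 2.0, 1.5, 3.0
print(v @ J)                         # the same values
```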

<br>
<br>

### Further Reading

* <a href="https://pytorch.org/docs/stable/notes/autograd.html" style="text-decoration:none;font-size:120%;color:maroon">Autograd Mechanics</a>

<br>
<br>
<br>

<font color=gray size=3>Docs > Notes9 </font>

# <font style="font-size:120%;color:maroon;font-weight:bold">Gradcheck mechanics</font> <a href="https://pytorch.org/docs/stable/notes/gradcheck.html" style="text-decoration:none;"><font size=2>[link]</font></a>

This note presents an overview of how the <a href="https://pytorch.org/docs/stable/generated/torch.autograd.gradcheck.html" style="text-decoration:none;"><font size=4>gradcheck()</font></a> and <a href="https://pytorch.org/docs/stable/generated/torch.autograd.gradgradcheck.html" style="text-decoration:none;"><font size=4>gradgradcheck()</font></a> functions work.

It will cover both **`forward and backward mode AD`** for both real and complex-valued functions as well as higher-order derivatives. This note also covers both the default behavior of gradcheck as well as the case where `fast_mode=True` argument is passed (referred to as fast gradcheck below).

<br>

## <font style="color:maroon;font-size:110%">Notations and background information</font>

Throughout this note, we will use the following convention:
* <font size=3>$x,y,a,b,v,u,ur$</font> and <font size=3>$ui$</font> are `real-valued vectors` and <font size=3>$z$</font> is a `complex-valued vector` that can be rewritten in terms of two real-valued vectors as <font size=3>$z=a+ib$</font>.


* <font size=3>$N$</font> and <font size=3>$M$</font> are two integers that we will use for the dimension of the input and output space respectively.


* <font size=3>$f: \ R^N → R^M$</font> is our basic real-to-real function such that <font size=3>$y=f(x)$</font>.


* <font size=3>$g: \ C^N → R^M$</font> is our basic complex-to-real function such that <font size=3>$y=g(z)$</font>.

<br>

<font color=magenta size=4>For the simple <b>real-to-real case</b></font>, we write as <font size=4 color=maroon><b>$J_f$</b></font> the Jacobian matrix associated with <font size=3><b>$f$</b></font> of size <font size=3 color=maroon><b>$M \times N$</b></font>. This matrix contains all the partial derivatives such that the entry at position <font size=3><b>$(i,j)$</b></font> contains <font size=5><b>$\frac{\partial y_i}{\partial x_j}$</b></font>. 

* **Backward mode AD** is then computing, for a given vector <font size=4 color=maroon><b>$v$</b></font> of size <font size=3 color=maroon><b>$M$</b></font>, the quantity <font size=4 color=maroon><b>$v^T J_f$</b></font>. 


* **Forward mode AD** on the other hand is computing, for a given vector <font size=4 color=maroon><b>$u$</b></font> of size <font size=3 color=maroon><b>$N$</b></font>, the quantity <font size=4 color=maroon><b>$J_f u$</b></font>.
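
Both quantities can be computed with the helpers in `torch.autograd.functional`. The sketch below uses a made-up `f` with $N=3$ and $M=2$. Note that `torch.autograd.functional.jvp` returns $J_f u$, but its documentation states that it is computed with the double-backwards trick rather than with true forward-mode AD, so it serves here only to show the quantity itself.

```python
import torch
from torch.autograd.functional import jacobian, jvp, vjp

def f(x):
    # Hypothetical f: R^3 -> R^2 (N = 3, M = 2).
    return torch.stack([x[0] * x[2], x[1] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, -1.0])        # size M, used by backward mode
u = torch.tensor([0.5, 0.0, 2.0])    # size N, used by forward mode

J_f = jacobian(f, x)                 # M x N matrix, built only for comparison
_, vT_J = vjp(f, x, v)               # v^T J_f, a vector of size N
_, J_u = jvp(f, x, u)                # J_f u, a vector of size M

print(torch.allclose(vT_J, v @ J_f), torch.allclose(J_u, J_f @ u))
```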

For functions that contain `complex values`, the story is a lot more complex. We only provide the gist here and the full description can be found at `Docs > Autograd mechanics > `<a href="https://pytorch.org/docs/stable/notes/autograd.html#complex-autograd-doc" style="text-decoration:none;"><font color=maroon>Autograd for Complex Numbers</font></a>.

<br>

<div class="alert alert-block alert-danger">

<font size=3><b>The above covers the real case; what follows covers the complex case.</b></font>

</div>

### <font style="color:blue;font-size:110%;font-weight:bold;">Wirtinger calculus</font>

The constraints to `satisfy complex differentiability` (<font size=3 color=blue>Cauchy-Riemann equations</font>) are <font color=red>too restrictive</font> for `all real-valued loss functions`, so we instead opted to use <font size=3 color=blue><b>Wirtinger calculus</b></font>. 

In a basic setting of `Wirtinger calculus`, the chain rule requires access to both the **`Wirtinger derivative`** (called <font size=3><b>$W$</b></font> below) and the **`Conjugate Wirtinger derivative`** (called <font size=3><b>$CW$</b></font> below). 

<br>

<font color=maroon size=3>Both <b>$W$</b> and <b>$CW$</b> <b>need to be propagated</b> because in general, despite their name, <b>one is not the complex conjugate of the other</b>.</font>

<br>

<font color=maroon size=4>To avoid having to propagate both values:</font>

* <font color=maroon size=3><b>For backward mode AD</b>, we always work under the assumption that</font> `the function whose derivative is being calculated is either a real-valued function or is part of a bigger real-valued function`. This assumption <font color=maroon><b>means that</b></font> all the intermediary gradients we compute during the backward pass are also associated with real-valued functions. <br><br>In practice, this assumption is not restrictive when doing optimization, as such problems require real-valued objectives (there is no natural ordering of the complex numbers).
<br>
<br>
Under this assumption, using <font size=3><b>$W$</b></font> and <font size=3><b>$CW$</b></font> definitions, we can show that <font size=3><b>$W = CW^*$</b></font> (we use <font size=3><b>$*$</b></font> to denote complex conjugation here) and so only one of the two values actually needs to be “backwarded through the graph” as the other one can easily be recovered.
<br>
<br>
<font color=maroon>To simplify internal computations, <b>PyTorch</b> uses <font size=3><b>$2∗CW$</b></font> as the value it backwards and returns when the user asks for gradients.</font>
<br>
<font size=3>Similarly to the real case, when the output is actually in $R^M$, backward mode AD does not compute $2∗CW$ <font color=maroon>but only $v^T (2 * CW)$ for a given vector $v∈R^M$.</font></font>

* <font size=3 color=maroon><b>For forward mode AD</b>, we use a similar logic,</font> `in this case, assuming that the function is part of a larger function whose input is in`***`R`***. 
<br>
<br>
Under this assumption, we can make a similar claim that every intermediary result corresponds to a function whose input is in <font size=3>$R$</font> and in this case, using <font size=3>$W$</font> and <font size=3>$CW$</font> definitions, we can show that <font size=3>$W=CW$</font> for the intermediary functions. 
<br>
<br>
To make sure the forward and backward mode compute the same quantities in the elementary case of a one dimensional function, the forward mode also computes <font size=3>$2∗CW$</font>. 
<br>
<font size=3>Similarly to the real case, when the input is actually in $R^N$, forward mode AD does not compute $2∗CW$ <font color=maroon>but only $(2∗CW)u$ for a given vector $u∈R^N$.</font></font>

<br>
<br>

## <font style="color:maroon;font-size:110%"><b>Default</b> backward mode gradcheck behavior</font>

<br>

### <font color=red>Real-to-real</font> functions

To test a function <font size=4>$f:R^N \to R^M, x\to y$</font>, we reconstruct the full Jacobian matrix <font size=4>$J_f$</font> of size <font size=3>$M×N$</font> in two ways: <font size=4 color=maroon><b>analytically</b> and <b>numerically</b>. </font>

* The ***`analytical version`***` uses our `**`backward mode AD`** 
* while the ***`numerical version`***` uses `**`finite difference`**. 

<font color=maroon size=3>The two reconstructed <b>Jacobian matrices</b> are then compared <i>elementwise</i> for equality.</font>

<br>

#### <font style="font-size:110%">Default real input `numerical` evaluation</font>

If we consider the elementary case of a one-dimensional function (N=M=1), then we can use the basic finite difference formula from <a href="https://en.wikipedia.org/wiki/Finite_difference" style="text-decoration:none;">the Wikipedia article</a>. We use the <font size=3 color=maroon><b>“central difference”</b> for better numerical properties</font>:

<br>

<font size=4>$$\frac{∂y}{∂x}≈\frac{f(x+eps)-f(x-eps)}{2*eps}$$</font>

* This formula easily generalizes for multiple outputs <font size=4>$(M \gt 1)$</font> by having <font size=5>$\frac{\partial y}{\partial x}$</font> be <font size=4 color=maroon>a <b>column vector</b> of size $M \times 1$ like $f(x + eps)$</font>. In that case, the above formula can be re-used as-is and approximates the full Jacobian matrix with only two evaluations of the user function (namely <font size=4>$f(x+eps) \ and \ f(x−eps)$</font>).


* It is more computationally expensive to handle the case with multiple inputs <font size=4>$(N \gt 1)$</font>. In this scenario, we loop over all the inputs one after the other and apply the <font size=4>$eps$</font> perturbation for each element of <font size=4>$x$</font> one after the other. This allows us to reconstruct the <font size=4 color=maroon>$J_f$ matrix <b>column by column</b>.</font>
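
The column-by-column reconstruction can be sketched as below. This is an illustration of the idea only, not the actual gradcheck code; `f` and `numerical_jacobian` are hypothetical names.

```python
import torch

def f(x):
    # Hypothetical f: R^3 -> R^2.
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

def numerical_jacobian(fn, x, eps=1e-6):
    # Perturb one input at a time; each central difference yields one column.
    columns = []
    for j in range(x.numel()):
        step = torch.zeros_like(x)
        step[j] = eps
        columns.append((fn(x + step) - fn(x - step)) / (2 * eps))
    return torch.stack(columns, dim=1)   # shape M x N

x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
J_num = numerical_jacobian(f, x)
print(J_num)   # approximately [[2, 1, 0], [0, 1, 6]]
```

Double precision is used because the division by `2 * eps` amplifies rounding errors considerably in single precision.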

<br>

#### <font style="font-size:110%">Default real input `analytical` evaluation</font>

For the analytical evaluation, we use the fact, as described above, that backward mode AD computes <font size=4 color=maroon>$v^TJ_f$</font>. 

* For functions with a single output, we simply use <font size=3>$v=1$</font> to `recover the full Jacobian matrix` with a single backward pass.


* For functions with more than one output, we `resort to a for-loop` which iterates over the outputs where <font color=maroon size=4>each $v$ is a <b>one-hot vector</b></font> corresponding to each output one after the other. This allows us to <font size=4 color=maroon>reconstruct the $J_f$ matrix <b>row by row</b>.</font>
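
The row-by-row reconstruction with one-hot vectors can be sketched in the same spirit. Again this is an illustration rather than the gradcheck implementation; `f` and `analytical_jacobian` are hypothetical names.

```python
import torch

def f(x):
    # Hypothetical f: R^3 -> R^2.
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

def analytical_jacobian(fn, x):
    x = x.detach().requires_grad_(True)
    y = fn(x)
    rows = []
    for i in range(y.numel()):
        v = torch.zeros_like(y)
        v[i] = 1.0                       # one-hot vector selecting output i
        # One backward pass per output; the result v^T J_f is row i of J_f.
        (row,) = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)
        rows.append(row)
    return torch.stack(rows, dim=0)      # shape M x N

x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
J_ana = analytical_jacobian(f, x)
print(J_ana)   # [[2, 1, 0], [0, 1, 6]]
```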

<br>

### <font color=red>Complex-to-real</font> functions

To test a function <font size=4 color=maroon>$g: \mathcal{C}^N \to \mathcal{R}^M, z \to y$</font> with <font size=4>$z=a+ib$</font>, we reconstruct the (complex-valued) matrix that contains <font size=4>$2∗CW$</font>.

<br>

#### <font style="font-size:110%">Default complex input `numerical` evaluation</font>

Consider the elementary case where <font size=3>$N=M=1$</font> first. We know from (chapter 3 of) this research paper that:

<font size=5 color=maroon>$$CW:=\frac{∂y}{∂z^*}=\frac{1}{2}(\frac{∂y}{∂a}+i\frac{∂y}{∂b})$$</font>

<font color=maroon size=3><b>Note that</b></font> <font size=5>$\frac{\partial y}{\partial a}$</font> and <font size=5>$\frac{\partial y}{\partial b}$</font>, in the above equation, are <font size=4>$\mathcal{R} \to \mathcal{R}$</font> derivatives. To evaluate these numerically, we use the method described above for the real-to-real case. <font size=3 color=maroon>This allows us to compute the $CW$ matrix and then multiply it by $2$.</font>

<br>

Note that the code, as of time of writing, computes this value in a slightly convoluted way:

```python
# Code from https://github.com/pytorch/pytorch/blob/58eb23378f2a376565a66ac32c93a316c45b6131/torch/autograd/gradcheck.py#L99-L105
# Notation changes in this code block:
# s here is y above
# x, y here are a, b above

ds_dx = compute_gradient(eps)
ds_dy = compute_gradient(eps * 1j)
# conjugate wirtinger derivative
conj_w_d = 0.5 * (ds_dx + ds_dy * 1j)
# wirtinger derivative
w_d = 0.5 * (ds_dx - ds_dy * 1j)
d[d_idx] = grad_out.conjugate() * conj_w_d + grad_out * w_d.conj()

# Since grad_out is always 1, and W and CW are complex conjugate of each other, the last line ends up computing exactly 
# `conj_w_d + w_d.conj() = conj_w_d + conj_w_d = 2 * conj_w_d`.
```

<br>

#### <font style="font-size:110%">Default complex input `analytical` evaluation</font>

Since backward mode AD computes exactly twice the <font size=4>$CW$</font> derivative already, we simply use the same trick as for the real-to-real case here and reconstruct the matrix row by row when there are multiple real outputs.
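
For the elementary case $N=M=1$, the numerical and analytical evaluations can be compared with a short sketch. It assumes the made-up function $g(z)=|z|^2=a^2+b^2$, for which $2*CW = 2a + 2ib = 2z$.

```python
import torch

def g(z):
    # Hypothetical complex-to-real function: y = |z|^2 = a^2 + b^2.
    return torch.real(z * torch.conj(z))

z0 = torch.tensor(1.5 - 0.5j, dtype=torch.complex128)
eps = 1e-6

# Real-to-real central differences along the real (a) and imaginary (b) axes.
dy_da = (g(z0 + eps) - g(z0 - eps)) / (2 * eps)
dy_db = (g(z0 + eps * 1j) - g(z0 - eps * 1j)) / (2 * eps)
two_cw_numerical = torch.complex(dy_da, dy_db)   # 2 * CW = dy/da + i * dy/db

# Backward mode AD on the same function.
z = z0.clone().requires_grad_(True)
g(z).backward()

print(two_cw_numerical, z.grad)   # both close to 2 * z0 = 3 - 1j
```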

<br>
<br>

## <font style="color:maroon;font-size:110%">Functions with <b>complex outputs</b></font>

In this case, the user-provided function does not follow the assumption from the autograd that the function we compute backward AD for is real-valued. This means that using autograd directly on this function is not well defined. 


To solve this, we will replace the test of the function <font size=4>$h: \mathcal{P}^N \to \mathcal{C}^M$</font> (where <font size=4>$\mathcal{P}$</font> can be either <font size=4>$\mathcal{R}$</font> or <font size=4>$\mathcal{C}$</font>), with two functions: <font size=4>$hr$</font> and <font size=4>$hi$</font> such that:

<font size=5 color=maroon>$$hr(q):=real(h(q))$$</font>
<br>
<font size=5 color=maroon>$$hi(q):=imag(h(q))$$</font>

where <font size=4>$q \in \mathcal{P}$</font>. We then do a basic gradcheck for both <font size=4>$hr$</font> and <font size=4>$hi$</font> using either the real-to-real or complex-to-real case described above, depending on <font size=4>$\mathcal{P}$</font>.

<font color=maroon size=3><b>Note that,</b></font> the code, as of time of writing, does not create these functions explicitly but performs the chain rule with the <font size=4 color=maroon>$real$</font> or <font size=4 color=maroon>$imag$</font> functions manually by passing the <font size=3 color=maroon>$\text{grad_out}$</font> arguments to the different functions. 

* When <font size=3>$\text{grad_out} = 1$</font>, then we are considering <font size=3>$hr$</font>. 


* When <font size=3>$\text{grad_out} = 1j$</font>, then we are considering <font size=3>$hi$</font>.
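
The split into $hr$ and $hi$ can also be written out explicitly. The sketch below assumes a made-up real-to-complex function `h` and runs the public `torch.autograd.gradcheck` on each real-valued helper. As the note above says, the real implementation does not create these helpers; they appear here only to illustrate the idea.

```python
import torch

def h(q):
    # Hypothetical real-to-complex function: h(q) = q^2 + 3i * q.
    return torch.complex(q ** 2, 3 * q)

def hr(q):
    return torch.real(h(q))    # real part: q^2

def hi(q):
    return torch.imag(h(q))    # imaginary part: 3q

q = torch.tensor([0.5, -1.0, 2.0], dtype=torch.float64, requires_grad=True)

# Each helper is real-valued, so the real-to-real check applies to both.
ok_real = torch.autograd.gradcheck(hr, (q,))
ok_imag = torch.autograd.gradcheck(hi, (q,))
print(ok_real, ok_imag)
```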

<br>
<br>

## <font style="color:maroon;font-size:110%"><b>Fast</b> backward mode gradcheck</font>

While the above formulation of gradcheck is great for both `correctness and debuggability`, it is **very slow** `because it reconstructs the full Jacobian matrices`. 

This section presents a way to perform gradcheck in a faster way without affecting its correctness. The debuggability can be recovered by adding special logic when we detect an error. In that case, we can run the default version that reconstructs the full matrix to give full details to the user.

The high level strategy here is to find a scalar quantity that can be computed efficiently by both the numerical and analytical methods and that represents the full matrix computed by the slow gradcheck well enough to ensure that it will catch any discrepancy in the Jacobians.

<br>

### <font style="color:red;font-size:110%"><b>Fast gradcheck</b> `for real-to-real functions`</font>

The scalar quantity that we want to compute here is <font size=4>$v^T J_f u$</font> for a given random vector <font size=4>$v \in \mathcal{R}^M$</font> and a random unit norm vector <font size=4>$u \in \mathcal{R}^N$</font>.

* `For the `**`numerical`**` evaluation`, we can efficiently compute <br><br><font size=4>$$J_fu≈\frac{f(x+u∗eps)−f(x−u∗eps)}{2∗eps}$$</font>
<br>
We then perform the dot product between this vector and <font size=4>$v$</font> to get the scalar value of interest.

* `For the `**`analytical`**` version`, we can use backward mode AD to compute <font size=4>$v^T J_f$</font> directly. We then perform the dot product with <font size=4>$u$</font> to get the expected value.
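
The two evaluations of $v^T J_f u$ can be sketched as follows for a made-up `f`. This illustrates the reduction only and is not the gradcheck code itself.

```python
import torch

torch.manual_seed(0)

def f(x):
    # Hypothetical f: R^3 -> R^2.
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
v = torch.randn(2, dtype=torch.float64)      # random vector in R^M
u = torch.randn(3, dtype=torch.float64)
u = u / u.norm()                             # random unit-norm vector in R^N
eps = 1e-6

# Numerical side: one central difference along u gives J_f u, then dot with v.
J_u = (f(x + u * eps) - f(x - u * eps)) / (2 * eps)
s_numerical = torch.dot(v, J_u)

# Analytical side: one backward pass gives v^T J_f, then dot with u.
x_req = x.clone().requires_grad_(True)
(vT_J,) = torch.autograd.grad(f(x_req), x_req, grad_outputs=v)
s_analytical = torch.dot(vT_J, u)

print(s_numerical.item(), s_analytical.item())   # the two scalars should agree closely
```

Only two function evaluations and one backward pass are needed, regardless of $N$ and $M$.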

<br>

### <font style="color:red;font-size:110%"><b>Fast gradcheck</b> `for complex-to-real  functions`</font>

Similar to the real-to-real case, we want to `perform a `**`reduction`**` of the full matrix`. But the <font size=3>$2∗CW$</font> matrix is complex-valued and so in this case, we will compare two complex scalars.


Due to some constraints on what we can compute efficiently in the numerical case and to keep the number of numerical evaluations to a minimum, we compute the following (albeit surprising) scalar value:

<font size=5 color=maroon>$$s:=2∗v^T (real(CW)ur+i∗imag(CW)ui)$$</font>

where <font size=3>$v \in \mathcal{R}^M$</font>, <font size=3>$ur \in \mathcal{R}^N$</font> and <font size=3>$ui \in \mathcal{R}^N$</font>.

<br>

#### <font style="font-size:110%">Fast complex input `numerical` evaluation</font>

We first consider how to compute <font size=4>$s$</font> with a numerical method. To do so, keeping in mind that we’re considering <font size=3>$g: \mathcal{C}^N \to \mathcal{R}^M, z \to y$</font> with <font size=3>$z=a+ib$</font>, and that <font size=3 color=red>$CW = \frac{1}{2} * (\frac{\partial y}{\partial a} + i \frac{\partial y}{\partial b})$</font>, we rewrite it as follows:

<font size=4 color=maroon>
\begin{equation}
\begin{split}
s &=2∗v^T (real(CW)ur+i∗imag(CW)ui) \\
  &=2∗v^T (\frac{1}{2}∗\frac{∂y}{∂a}ur+i∗\frac{1}{2}∗\frac{∂y}{∂b}ui) \\
  &=v^T(\frac{∂y}{∂a}ur+i∗\frac{∂y}{∂b}ui) \\
  &=v^T((\frac{∂y}{∂a}ur)+i∗(\frac{∂y}{∂b}ui))
\end{split}
\end{equation}
</font>

In this formula, we can see that <font size=4>$\frac{\partial y}{\partial a} ur$</font> and <font size=4>$\frac{\partial y}{\partial b} ui$</font> can be evaluated the same way as the fast version for the real-to-real case. Once these real-valued quantities have been computed, we can reconstruct the complex vector on the right side and do a dot product with the real-valued <font size=4>$v$</font> vector.

<br>

#### <font style="font-size:110%">Fast complex input `analytical` evaluation</font>

For the analytical case, things are simpler and we rewrite the formula as:

<font size=4 color=maroon>
\begin{equation}
\begin{split}
s &=2∗v^T (real(CW)ur+i∗imag(CW)ui) \\
  &=v^T real(2∗CW)ur+i∗v^T imag(2∗CW)ui \\
  &=real(v^T (2∗CW))ur+i∗imag(v^T (2∗CW))ui
\end{split}
\end{equation}
</font>

We can thus use the fact that the backward mode AD provides us with an efficient way to compute <font size=3>$v^T (2 * CW)$</font> and then perform a dot product of the real part with <font size=4>$ur$</font> and the imaginary part with <font size=4>$ui$</font> before reconstructing the final complex scalar <font size=4>$s$</font>.
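
The complex scalar $s$ can be computed both ways in a few lines. The sketch below assumes a made-up elementwise function `g` with $y = a^2 + b^2 + b$ and $N=M=3$. The numerical side uses two directional finite differences, and the analytical side uses one backward pass.

```python
import torch

torch.manual_seed(0)

def g(z):
    # Hypothetical g: C^3 -> R^3 with y = a^2 + b^2 + b, applied elementwise.
    return torch.real(z * torch.conj(z)) + torch.imag(z)

z0 = torch.randn(3, dtype=torch.complex128)
v = torch.randn(3, dtype=torch.float64)
ur = torch.randn(3, dtype=torch.float64)
ui = torch.randn(3, dtype=torch.float64)
eps = 1e-6

# Numerical: real-to-real directional differences along ur (real axis)
# and along ui (imaginary axis).
dy_da_ur = (g(z0 + ur * eps) - g(z0 - ur * eps)) / (2 * eps)
dy_db_ui = (g(z0 + 1j * ui * eps) - g(z0 - 1j * ui * eps)) / (2 * eps)
s_numerical = torch.complex(torch.dot(v, dy_da_ur), torch.dot(v, dy_db_ui))

# Analytical: one backward pass returns v^T (2 * CW).
z = z0.clone().requires_grad_(True)
(vT_2cw,) = torch.autograd.grad(g(z), z, grad_outputs=v)
s_analytical = torch.complex(torch.dot(vT_2cw.real, ur),
                             torch.dot(vT_2cw.imag, ui))

print(s_numerical, s_analytical)   # the two complex scalars should agree closely
```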

<br>

#### <font style="color:red;font-size:110%">Why not use a complex ***u***</font>

At this point, you might be wondering why we did not select a complex <font size=4>$u$</font> and just performed the reduction <font size=4>$2 * v^T CW u'$</font>. To dive into this, in this paragraph, we will use the complex version of <font size=4>$u$</font> noted <font size=4>$u' = ur' + i ui'$</font>. Using such complex <font size=4>$u'$</font>, the problem is that when doing the numerical evaluation, we would need to compute:

<font size=4>
\begin{equation}
\begin{split}
2∗CWu′ &= (\frac{∂y}{∂a}+i \frac{∂y}{∂b})(ur′+iui′) \\
       &= \frac{∂y}{∂a}ur′ +i\frac{∂y}{∂a}ui′ +i\frac{∂y}{∂b}ur′ − \frac{∂y}{∂b}ui′
\end{split}
\end{equation}
</font>

This would require four evaluations of real-to-real finite differences (twice as many as the approach proposed above). Since this approach does not have more degrees of freedom (same number of real-valued variables) and we want the fastest possible evaluation here, we use the other formulation above.

<br>

### Fast gradcheck for functions with complex outputs

Just like in the slow case, we consider two real-valued functions and use the appropriate rule from above for each function.

<br>
<br>

## Gradgradcheck implementation

PyTorch also provides a utility to verify second order gradients. The goal here is to make sure that the backward implementation is also properly differentiable and computes the right thing.

This feature is implemented by considering the function <font size=4 color=maroon>$F: x, v \to v^T J_f$</font> and using the gradcheck defined above on this function. <font size=3 color=maroon><b>Note that</b></font> <font size=4>$v$</font> in this case is just a random vector with the same type as <font size=4>$f(x)$</font>.

The fast version of gradgradcheck is implemented by using the fast version of gradcheck on that same function <font size=4>$F$</font>.
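
As a rough usage sketch (not part of the original note), both checks can be run in fast mode through the `fast_mode` flag of `torch.autograd.gradcheck` and `torch.autograd.gradgradcheck`. The function `f` and the input size are arbitrary choices for illustration; double precision is assumed because finite differences are too noisy in single precision.

```python
import torch
from torch.autograd import gradcheck, gradgradcheck


def f(x):
    # Smooth elementwise function, so first and second derivatives exist.
    return x.sin() * x


# Double precision keeps the finite-difference error within the default tolerances.
x = torch.randn(4, dtype=torch.double, requires_grad=True)

first_order_ok = gradcheck(f, (x,), fast_mode=True)       # checks J_f
second_order_ok = gradgradcheck(f, (x,), fast_mode=True)  # checks the backward of the backward
print(first_order_ok, second_order_ok)
```

Both calls raise an exception on a mismatch by default, so a `True` return value means the analytical and numerical results agreed within tolerance.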

<br>
<br>
<br>

<font color=gray size=3>Docs > Notes2 </font>

# <font style="font-size:120%;color:maroon;font-weight:bold">Autograd mechanics</font> <a href="https://pytorch.org/docs/stable/notes/autograd.html" style="text-decoration:none;"><font size=2>[link]</font></a>

<font size=3 color=maroon>This note will present an overview of how autograd works and records the operations. It’s not strictly necessary to understand all this, but we recommend getting familiar with it, as it will help you write more efficient, cleaner programs, and can aid you in debugging.</font>

<br>

## How autograd encodes the history

<font size=3 color=blue><b>Autograd</b></font> `is a reverse automatic differentiation system.` Conceptually, autograd keeps a graph recording all of the operations that created the data as you execute operations, giving you a directed acyclic graph whose ***leaves*** are the input tensors and ***roots*** are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

Internally, autograd represents this graph as a graph of `Function` objects (really expressions), which can be `apply()` ed to compute the result of evaluating the graph. When computing the forwards pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the `.grad_fn` attribute of each **`torch.Tensor`** is an entry point into this graph). <font color=maroon size=3>When the forwards pass is completed, we evaluate this graph in the backwards pass to compute the gradients.</font>

An important thing to note is that <font size=3 color=maroon><b>the graph is recreated from scratch at every iteration</b>, and this is exactly what allows for using arbitrary Python control flow statements, that can change the overall shape and size of the graph at every iteration. You don’t have to encode all possible paths before you launch the training - what you run is what you differentiate.</font>
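
A minimal sketch of "what you run is what you differentiate": the hypothetical helper `forward` below uses an ordinary Python loop, so the recorded graph has a different depth on each call, and the gradient follows whichever path was actually executed.

```python
import torch


def forward(x, n_steps):
    # Ordinary Python loop: the number of recorded multiplications
    # depends on a runtime value.
    y = x
    for _ in range(n_steps):
        y = y * 2
    return y.sum()


x = torch.ones(3, requires_grad=True)

forward(x, 1).backward()
grad_one_step = x.grad.clone()      # d/dx sum(2x) = 2

x.grad = None                       # reset the accumulated gradient
forward(x, 3).backward()
grad_three_steps = x.grad.clone()   # d/dx sum(8x) = 8

print(grad_one_step, grad_three_steps)
```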

<br>

### Saved tensors

Some operations need intermediary results to be saved during the forward pass in order to execute the backward pass. For example, the function <font size=3>$x↦x^2$</font> saves the input <font size=3>$x$</font> to compute the gradient.

When defining a custom Python <a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function" style="text-decoration:none;"><b>Function</b></a>, you can use `save_for_backward()` to save tensors during the forward pass and `saved_tensors` to retrieve them during the backward pass. See <a href="https://pytorch.org/docs/stable/notes/extending.html" style="text-decoration:none;"><b>Extending PyTorch</b></a> for more information.
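
As an illustration, here is a small sketch of a custom `Function` for $x↦x^2$. The class name `Square` is made up for this example; `save_for_backward()` and `saved_tensors` are the calls described above.

```python
import torch


class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # The input is needed later because d(x^2)/dx = 2x.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()
print(x.grad)  # 2 * x
```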

For operations that PyTorch defines (e.g. <a href="https://pytorch.org/docs/stable/generated/torch.pow.html#torch.pow" style="text-decoration:none;"><b>torch.pow()</b></a>), tensors are automatically saved as needed. You can explore (for educational or debugging purposes) which tensors are saved by a certain `grad_fn` by looking for its attributes starting with the prefix `_saved`.

In [1]:
import torch

In [2]:
x = torch.randn(5, requires_grad=True)
y = x.pow(2)

print(x.equal(y.grad_fn._saved_self))  # True
print(x is y.grad_fn._saved_self)      # True

True
True


In the previous code, **`y.grad_fn._saved_self`** refers to the same Tensor object as x. But that may not always be the case. For instance:

In [3]:
x = torch.randn(5, requires_grad=True)
y = x.exp()

print(y.equal(y.grad_fn._saved_result))  # True
print(y is y.grad_fn._saved_result)      # False

True
False


Under the hood, to prevent reference cycles, PyTorch has packed the tensor upon saving and unpacked it into a different tensor for reading. Here, the tensor you get from accessing `y.grad_fn._saved_result` is a different tensor object than y (but they still share the same storage).

Whether a tensor will be packed into a different tensor object depends on whether it is an output of its own ***grad_fn***, which is an implementation detail subject to change and that users should not rely on.

You can control how PyTorch does packing / unpacking with (section below) <a href="https://pytorch.org/docs/stable/notes/autograd.html#saved-tensors-hooks-doc" style="text-decoration:none;"><font size=3>Hooks for saved tensors</font></a>.

<br>
<br>

## Gradients for non-differentiable functions 

<a href="https://pytorch.org/docs/stable/notes/autograd.html#gradients-for-non-differentiable-functions" style="text-decoration:none">Omitted for now [link]</a>

<br>
<br>

## Locally disabling gradient computation

There are several mechanisms available from Python to locally disable gradient computation:

To disable gradients across entire blocks of code, there are `context managers like `<font size=4 color=maroon><b>no-grad mode</b></font>` and `<font size=4 color=maroon><b>inference mode</b></font>. For more fine-grained exclusion of subgraphs from gradient computation, there is setting the **`requires_grad`**` field` of a tensor.

Below, in addition to discussing the mechanisms above, we also describe <font size=4 color=maroon><b>evaluation mode</b></font> (`nn.Module.eval()`), <font size=4 color=maroon><b>a method that</b></font> is not actually used to disable gradient computation but, because of its name, is often mixed up with the three.

<br>

### Setting `requires_grad`

<font size=3>***requires_grad*** is a flag, defaulting to **false** unless wrapped in a `nn.Parameter`, that allows for fine-grained exclusion of subgraphs from gradient computation. It takes effect in both the forward and backward passes:

* During the forward pass, an operation is only recorded in the backward graph if at least one of its input tensors requires grad. 

    
* During the backward pass (`.backward()`), only leaf tensors with ***requires_grad=True*** will have gradients accumulated into their `.grad` fields.
</font>



<font size=3><font color=maroon><b>It is important to note that</b></font> even though every tensor has this flag, setting it only makes sense for <font color=maroon>leaf tensors</font> (tensors that do not have a ***grad_fn***, e.g., a **nn.Module**’s parameters). <font color=maroon>Non-leaf tensors</font> (tensors that do have ***grad_fn***) are tensors that have a backward graph associated with them. Thus their gradients will be needed as an intermediary result to compute the gradient for a leaf tensor that requires grad. <font color=maroon> From this definition, it is clear that all non-leaf tensors will automatically have **`requires_grad=True`**.</font>


<font color=maroon>Setting **requires_grad** should be the main way you control which parts of the model are part of the gradient computation,</font> for example, if you need to freeze parts of your pretrained model during model fine-tuning.

</font>
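
The leaf / non-leaf distinction can be inspected directly. The sketch below uses arbitrary small tensors; `is_leaf`, `grad_fn` and `requires_grad` are standard tensor attributes.

```python
import torch

a = torch.randn(3, requires_grad=True)  # leaf that requires grad
b = torch.randn(3)                      # leaf that does not require grad

c = a * b   # recorded: at least one input requires grad
d = b * 2   # not recorded: no input requires grad

print(c.is_leaf, c.requires_grad, c.grad_fn is not None)  # False True True
print(d.requires_grad, d.grad_fn)                         # False None

c.sum().backward()
# Only the leaf with requires_grad=True gets a .grad; here dc/da = b.
print(a.grad, b.grad)
```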

<font size=3>To freeze parts of your model, simply apply `.requires_grad_(False)` to the parameters that you don’t want updated. And as described above, since computations that use these parameters as inputs would not be recorded in the forward pass, they won’t have their `.grad` fields updated in the backward pass because they won’t be part of the backward graph in the first place, as desired.    </font>

<font size=3>Because this is such a common pattern, `requires_grad` can also be set at the module level with `nn.Module.requires_grad_()`. When applied to a module, `.requires_grad_()` takes effect on all of the module’s parameters (which have `requires_grad=True` by default).</font>
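
A short sketch of the freezing pattern, assuming a toy two-layer model (the architecture and sizes are placeholders, not taken from the note):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer at the module level.
model[0].requires_grad_(False)

out = model(torch.randn(3, 4))
out.sum().backward()

frozen_grads = [p.grad for p in model[0].parameters()]
trainable_grads = [p.grad for p in model[2].parameters()]
print(frozen_grads)                                # [None, None]
print([g is not None for g in trainable_grads])    # [True, True]
```

An optimizer built afterwards would typically be given only the parameters that still have `requires_grad=True`.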

<br>

### Grad Modes

Apart from setting ***requires_grad*** there are also three modes that can be selected from Python and that affect how computations in PyTorch are processed by autograd internally: 
* default mode (grad mode), 
* no-grad mode, 
* and inference mode, 

all of which can be toggled via <font color=maroon><b>context managers</b></font> and <font color=maroon><b>decorators</b></font>.

<br>

#### <font style="color:blue;font-size:120%;font-weight:bold">Default Mode (Grad Mode)</font>

The “default mode” is actually the mode we are implicitly in when no other modes like no-grad and inference mode are enabled. To be contrasted with “no-grad mode” the default mode is also sometimes called “grad mode”.

<font color=maroon><b>The most important thing</b></font> to know about the default mode is that <font size=3 color=maroon>it is the only mode in which **`requires_grad`** takes effect.</font> **`requires_grad`** is always overridden to be **`False`** in both of the other two modes.

<br>

#### <font style="color:blue;font-size:120%;font-weight:bold">No-grad Mode</font>

Computations in no-grad mode <font color=maroon>behave as if none of the inputs require grad.</font> In other words, computations in no-grad mode are <font color=maroon>never recorded in the backward graph even if there are inputs that have ***requires_grad=True***.</font>

Enable no-grad mode when you need to perform operations that should not be recorded by autograd, but you’d still like to use the outputs of these computations in grad mode later. This context manager makes it convenient to disable gradients for a block of code or function without having to temporarily set tensors to have *requires_grad=False*, and then back to *True*.

For example, no-grad mode might be useful when writing an optimizer: when performing the training update you’d like to update parameters in-place without the update being recorded by autograd. You also intend to use the updated parameters for computations in grad mode in the next forward pass.

The implementations in <a href="https://pytorch.org/docs/stable/nn.init.html#nn-init-doc" style="text-decoration:none;">torch.nn.init</a> also rely on no-grad mode when initializing the parameters, so as to avoid autograd tracking when updating the initialized parameters in-place.
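
The optimizer use case can be sketched as a hand-written gradient step. The values and the step size `0.1` are arbitrary; the point is that the in-place update inside `torch.no_grad()` is not recorded, and the parameter remains a leaf that requires grad afterwards.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)

loss = (w ** 2).sum()
loss.backward()                 # w.grad = 2 * w = [2, -4]

with torch.no_grad():
    w -= 0.1 * w.grad           # in-place update, invisible to autograd
w.grad = None                   # clear the gradient for the next iteration

print(w)                        # values are now [0.8, -1.6]
print(w.requires_grad, w.is_leaf, w.grad_fn)  # True True None
```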

<br>

#### <font style="color:blue;font-size:120%;font-weight:bold">Inference Mode</font>

Inference mode is the extreme version of no-grad mode. Just like in no-grad mode, computations in inference mode are not recorded in the backward graph, but enabling inference mode will allow PyTorch to speed up your model even more. This better runtime comes with a drawback: tensors created in inference mode will not be able to be used in computations to be recorded by autograd after exiting inference mode.

Enable inference mode when you are performing computations that don’t need to be recorded in the backward graph, AND you don’t plan on using the tensors created in inference mode in any computation that is to be recorded by autograd later.

<font color=maroon size=3><b>It is recommended that</b> you try out inference mode in the parts of your code that do not require autograd tracking (e.g., <b>data processing</b> and <b>model evaluation</b>).</font> If it works out of the box for your use case it’s a free performance win. If you run into errors after enabling inference mode, check that you are not using tensors created in inference mode in computations that are recorded by autograd after exiting inference mode. If you cannot avoid such use in your case, you can always switch back to no-grad mode.

For details on inference mode please see <a href="https://pytorch.org/cppdocs/notes/inference_mode.html" style="text-decoration:none;"><font size=4>Inference Mode</font></a>.

For implementation details of inference mode see <a href="https://github.com/pytorch/rfcs/pull/17" style="text-decoration:none;"><font size=4>RFC-0011-InferenceMode</font></a>.
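
A minimal sketch of the trade-off, using `torch.inference_mode()` and `Tensor.is_inference()`. The failing reuse is wrapped in `try/except` because the exact error message is an implementation detail; making a clone outside inference mode is one way to get a normal tensor back.

```python
import torch

w = torch.ones(3, requires_grad=True)

with torch.inference_mode():
    y = w * 2          # not recorded, and y becomes an "inference tensor"

print(y.requires_grad)   # False
print(y.is_inference())  # True

# Reusing y in a computation that autograd must record is expected to fail,
# because y would have to be saved for the backward pass.
try:
    (y * w).sum().backward()
    reuse_failed = False
except RuntimeError:
    reuse_failed = True

# A clone made outside inference mode is a normal tensor again.
y_normal = y.clone()
(grad_w,) = torch.autograd.grad((y_normal * w).sum(), w)
print(grad_w)            # equals y_normal
```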

<br>

#### <font style="color:blue;font-size:120%;font-weight:bold">Evaluation Mode (`nn.Module.eval()`)</font>

Evaluation mode is not actually a mechanism to locally disable gradient computation. It is included here anyway because it is sometimes confused to be such a mechanism.

Functionally, **`module.eval()`** (or equivalently **`module.train(False)`**) is completely orthogonal to no-grad mode and inference mode. How `model.eval()` affects your model depends entirely on the specific modules used in your model and whether they define any training-mode specific behavior.

<font size=3>You are responsible for calling `model.eval()` and `model.train()` if your model relies on modules such as <a href="https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout" style="text-decoration:none;"><font size=4>torch.nn.Dropout</font></a> and <a href="https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d" style="text-decoration:none;"><font size=4>torch.nn.BatchNorm2d</font></a> that may behave differently depending on training mode, for example, to avoid updating your BatchNorm running statistics on validation data.</font>

<font color=maroon><b>It is recommended that</b> you always use `model.train()` when training and `model.eval()` when evaluating your model (validation/testing)</font> even if you aren’t sure your model has training-mode specific behavior, because a module you are using might be updated to behave differently in training and eval modes.
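
To make the orthogonality concrete, the sketch below (toy model, arbitrary sizes) shows that `eval()` changes how `Dropout` behaves but leaves autograd recording switched on.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

model.eval()                       # dropout becomes the identity
out_a = model(x)
out_b = model(x)
print(torch.equal(out_a, out_b))   # True: no random masking in eval mode
print(out_a.requires_grad)         # True: the graph is still being recorded

model.train()                      # dropout is active again
print(model.training)              # True
```

Disabling gradient tracking during evaluation still requires wrapping the forward pass in no-grad or inference mode.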

<br>
<br>

## In-place operations with autograd

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.

There are two main reasons that limit the applicability of in-place operations:

* In-place operations can potentially overwrite values required to compute gradients.

* Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations require changing the creator of all inputs to the `Function` representing this operation. This can be tricky, especially if there are many Tensors that reference the same storage (e.g. created by indexing or transposing), and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other `Tensor`.

### In-place correctness checks

<font color=maroon>Every tensor keeps a version counter, that is incremented every time it is marked dirty in any operation.</font> When a Function saves any tensors for backward, a version counter of their containing Tensor is saved as well. Once you access `self.saved_tensors` it is checked, and if it is greater than the saved value an error is raised. This ensures that if you’re using in-place functions and not seeing any errors, you can be sure that the computed gradients are correct.
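
The check can be observed with a small sketch. `exp()` saves its result for the backward pass, so modifying that result in-place invalidates the saved value; `_version` is the (private) attribute exposing the version counter.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.exp()                 # ExpBackward saves y (its own result)

version_before = y._version
y.add_(1)                   # in-place op marks y dirty
version_after = y._version
print(version_before, version_after)

try:
    y.sum().backward()      # the saved result no longer matches its version
    error_raised = False
except RuntimeError:
    error_raised = True
print(error_raised)         # True
```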

<br>
<br>

## Multithreaded Autograd

The autograd engine is responsible for running all the backward operations necessary to compute the backward pass. This section will describe all the details that can help you make the best use of it in a `multithreaded environment` (this is relevant only for PyTorch 1.6+, as the behavior in previous versions was different).

Users can train their model with `multithreading code` (e.g. `Hogwild training`) without blocking on the concurrent backward computations. Example code could be:

```python
import threading

import torch


# Define a train function to be used in different threads
def train_fn():
    x = torch.ones(5, 5, requires_grad=True)
    # forward
    y = (x + 3) * (x + 4) * 0.5
    # backward
    y.sum().backward()
    # potential optimizer update


# Users write their own threading code to drive the train_fn
threads = []
for _ in range(10):
    p = threading.Thread(target=train_fn, args=())
    p.start()
    threads.append(p)

# Wait for every worker thread to finish
for p in threads:
    p.join()
```

<font size=4>Some behaviors that users should be aware of:</font>

### Concurrency on CPU

When you run `backward()` or `grad()` via the Python or C++ API in multiple threads on CPU, you can expect to see extra concurrency instead of all the backward calls being serialized in a specific order during execution (the behavior before PyTorch 1.6).

<br>

### Non-determinism

If you call `backward()` from multiple threads concurrently with shared inputs (i.e. Hogwild CPU training), non-determinism should be expected. Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same `.grad` attribute. This is technically not safe, and it might result in a race condition that makes the result invalid to use.

However, this is the expected pattern if you use multithreading to drive the whole training process with shared parameters: users of multithreading should have the threading model in mind and expect this to happen. The functional API <a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html#torch.autograd.grad" style="text-decoration:none;"><b>torch.autograd.grad()</b></a> can be used to calculate the gradients instead of `backward()` to avoid the non-determinism.
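
A single-threaded sketch of the functional API: `torch.autograd.grad()` returns the gradients to the caller instead of accumulating them into the shared `.grad` attribute. The tensor values are arbitrary.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()

# Gradients come back as a tuple, one entry per requested input.
(grad_w,) = torch.autograd.grad(loss, [w])

print(grad_w)   # 2 * w
print(w.grad)   # None: nothing was accumulated on the shared tensor
```

Each thread can then decide how and when to apply its own `grad_w`, for example under a lock.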

<br>

### Graph retaining

If part of the autograd graph is shared between threads, e.g. the first part of the forward pass runs in a single thread and the second part runs in multiple threads, then the first part of the graph is shared. In this case, different threads executing `grad()` or `backward()` on the same graph could have one thread destroy the graph on the fly, which would crash the other thread. Autograd instead raises an error to the user, similar to calling `backward()` twice without `retain_graph=True`, and lets the user know they should use `retain_graph=True`.
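
The error mentioned here is the same one that appears in single-threaded code, which makes it easy to reproduce. In this sketch `shared` stands in for the shared first part of the graph; the third backward pass is expected to fail because the previous pass freed the saved buffers.

```python
import torch

x = torch.ones(2, requires_grad=True)
shared = x.exp()          # stands in for the shared first part of the graph
loss = shared.sum()

loss.backward(retain_graph=True)   # saved buffers of the graph are kept
loss.backward()                    # works, then the buffers are freed

try:
    loss.backward()                # saved buffers are gone
    third_pass_failed = False
except RuntimeError:
    third_pass_failed = True

print(third_pass_failed)           # True
print(x.grad)                      # two accumulated passes: 2 * exp(1)
```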

<br>

### Thread Safety on Autograd Node

Since Autograd allows the caller thread to drive its backward execution for potential parallelism, it’s important that we ensure thread safety on CPU with parallel backwards that share part/whole of the GraphTask.

Custom Python `autograd.Function`s are automatically thread safe because of the GIL. For built-in C++ Autograd Nodes (e.g. AccumulateGrad, CopySlices) and custom `autograd::Function`s, the Autograd Engine uses thread mutex locking to ensure thread safety on autograd Nodes that might have state write/read.

<br>

### No thread safety on C++ hooks

Autograd relies on the user to write thread safe C++ hooks. If you want the hook to be correctly applied in a multithreaded environment, you will need to write proper thread locking code to ensure the hooks are thread safe.

<br>
<br>

## Autograd for Complex Numbers

The short version:

* When you use PyTorch to differentiate any function <font size=3><b>$f(z)$</b></font> with complex domain and/or codomain, the gradients are computed under the assumption that the function is a part of a larger real-valued loss function <font size=3><b>$g(input)=L$</b></font>. The gradient computed is <font size=5><b>$\frac{\partial L}{\partial z^*}$</b></font> (note the conjugation of z), the negative of which is precisely the direction of steepest descent used in Gradient Descent algorithm. Thus, all the existing optimizers work out of the box with complex parameters.


* This convention matches TensorFlow’s convention for complex differentiation, but is different from JAX (which computes <font size=5><b>$\frac{\partial L}{\partial z}$</b></font>).


* If you have a real-to-real function which internally uses complex operations, the convention here doesn’t matter: you will always get the same result that you would have gotten if it had been implemented with only real operations.

If you are curious about the mathematical details, or want to know how to define complex derivatives in PyTorch, read on.

<br>

### What are complex derivatives?

The mathematical definition of complex-differentiability takes the limit definition of a derivative and generalizes it to operate on complex numbers. Consider a function <font size=4><b>$f: ℂ → ℂ$</b></font>:

<font size=5>$$f(z=x+yj)=u(x,y)+v(x,y)j$$</font>

where <font size=4><b>$u$</b></font> and <font size=4><b>$v$</b></font> are two variable real valued functions.

Using the derivative definition, we can write:

<font size=5>$$f′(z)=\lim_{h→0,h∈C}\frac{f(z+h)−f(z)}{h}$$</font>

In order for this limit to exist, not only must <font size=4><b>$u$</b></font> and <font size=4><b>$v$</b></font> be real differentiable, but <font size=4><b>$f$</b></font> must also satisfy the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Riemann_equations" style="text-decoration:none;"><font size=3>Cauchy-Riemann equations</font></a>. In other words: the limit computed for real and imaginary steps (<font size=4><b>$h$</b></font>) must be equal. This is a more restrictive condition.

<font color=royalblue>The complex differentiable functions are commonly known as **holomorphic functions**.</font> They are well behaved, have all the nice properties that you’ve seen from real differentiable functions, but are practically of no use in the optimization world. For optimization problems, only real valued objective functions are used in the research community since complex numbers are not part of any ordered field and so having complex valued loss does not make much sense.

<font color=maroon>It also turns out that no interesting real-valued objective fulfills the Cauchy-Riemann equations. So the theory of holomorphic functions cannot be used for optimization and most people therefore use the **Wirtinger calculus**.</font>

<br>

### Wirtinger Calculus comes in picture …

<font color=maroon>So, we have this great theory of `complex differentiability` and `holomorphic functions`, and we can’t use any of it at all, because many of the commonly used functions are not holomorphic. 
    
What’s a poor mathematician to do? Well, Wirtinger observed that even if <font size=4><b>$f(z)$</b></font> isn’t holomorphic, one could rewrite it as a two variable function <font size=4><b>$f(z, z^*)$</b></font> which is always holomorphic.</font>

This is because the real and imaginary components of <font size=4><b>$z$</b></font> can be expressed in terms of <font size=4><b>$z$</b></font> and <font size=4><b>$z^*$</b></font> as:

<font size=4>$$Re(z)=\frac{z+z^*}{2}$$</font>


<font size=4>$$Im(z)=\frac{z-z^*}{2j}$$</font>

<font color=maroon size=3>Wirtinger calculus suggests studying <font size=4><b>$f(z, z^*)$</b></font> instead, which is guaranteed to be holomorphic if <font size=4><b>$f$</b></font> was real differentiable (another way to think of it is as a change of coordinate system, from <font size=4><b>$f(x, y)$</b></font> to <font size=4><b>$f(z, z^*)$</b></font>).</font> This function has partial derivatives <font size=5><b>$\frac{\partial }{\partial z}$</b></font> and <font size=5><b>$\frac{\partial}{\partial z^{*}}$</b></font>. We can use the **chain rule** to establish a relationship between these partial derivatives and the partial derivatives w.r.t. the real and imaginary components of <font size=4><b>$z$</b></font>.

<font size=4>
\begin{equation}
\begin{split}
\frac{∂}{∂x} &= \frac{∂z}{∂x}*\frac{∂}{∂z} + \frac{∂z^*}{∂x}*\frac{∂}{∂z^*} \\
             &= \frac{∂}{∂z} + \frac{∂}{∂z^*}
\end{split}
\end{equation}
</font>

<br>

<font size=4>
\begin{equation}
\begin{split}
\frac{∂}{∂y} &= \frac{∂z}{∂y}*\frac{∂}{∂z} + \frac{∂z^*}{∂y}*\frac{∂}{∂z^*} \\
             &= 1j*(\frac{∂}{∂z} - \frac{∂}{∂z^*})
\end{split}
\end{equation}
</font>

From the above equations, we get:

<font size=4>$$\frac{∂}{∂z}=1/2*(\frac{∂}{∂x}-1j*\frac{∂}{∂y})$$</font>


<font size=4>$$\frac{∂}{∂z^*}=1/2*(\frac{∂}{∂x}+1j*\frac{∂}{∂y})$$</font>

which is the classic definition of Wirtinger calculus that you would find on <a href="https://en.wikipedia.org/wiki/Wirtinger_derivatives" style="text-decoration:none;"><b>Wikipedia</b></a>.
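
These two definitions are easy to check numerically. The sketch below is plain Python: the helper `wirtinger` (a name invented for this example) estimates $\frac{∂f}{∂x}$ and $\frac{∂f}{∂y}$ with central differences and combines them exactly as in the formulas above.

```python
def wirtinger(f, z, h=1e-6):
    """Central-difference estimate of (df/dz, df/dz*) at the point z."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)            # step along the real axis
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # step along the imaginary axis
    d_dz = 0.5 * (df_dx - 1j * df_dy)
    d_dzconj = 0.5 * (df_dx + 1j * df_dy)
    return d_dz, d_dzconj


z0 = 1.0 + 2.0j

# Holomorphic example: f(z) = z^2, so df/dz = 2z and df/dz* = 0.
holo_dz, holo_dzconj = wirtinger(lambda z: z * z, z0)

# Non-holomorphic example: f(z) = z z* = |z|^2, so df/dz = z* and df/dz* = z.
abs2_dz, abs2_dzconj = wirtinger(lambda z: z * z.conjugate(), z0)

print(holo_dz, holo_dzconj)
print(abs2_dz, abs2_dzconj)
```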

<br>

There are a lot of beautiful consequences of this change.

* For one, the `Cauchy-Riemann equations` translate into simply saying that <font size=4>$\frac{\partial f}{\partial z^*} = 0$</font> (that is to say, the function <font size=4>$f$</font> can be written entirely in terms of <font size=4>$z$</font>, without making reference to <font size=4>$z^*$</font>).


* Another important (and somewhat counterintuitive) result, as we’ll see later, is that when we do optimization on a real-valued loss, the step we should take while making variable update is given by <font size=4>$\frac{\partial Loss}{\partial z^*}$</font> (not <font size=4>$\frac{\partial Loss}{\partial z}$</font>).

For more reading, check out: <a href="https://arxiv.org/pdf/0906.4835.pdf" style="text-decoration:none;"><b>https://arxiv.org/pdf/0906.4835.pdf</b></a>

<br>

### How is Wirtinger Calculus useful in optimization?

<font size=3 color=maroon>Researchers in audio and other fields, more commonly, use `gradient descent` to optimize real valued loss functions with complex variables.</font> Typically, these people treat the real and imaginary values as separate channels that can be updated. For a step size <font size=4>$α/2$</font> and loss <font size=4>$L$</font>, we can write the following equations in <font size=4>$ℝ^2$</font>:

<font size=4>$$x_{n+1}=x_n-\alpha/2*\frac{∂L}{∂x}$$</font>

<font size=4>$$y_{n+1}=y_n-\alpha/2*\frac{∂L}{∂y}$$</font>

How do these equations translate into complex space <font size=4>$ℂ$</font>?

<font size=4>
\begin{equation}
\begin{split}
z_{n+1}&=x_n-(\alpha/2)*\frac{∂L}{∂x}+1j*(y_n-(\alpha/2)*\frac{∂L}{∂y}) \\
       &=z_n-\alpha * 1/2 * (\frac{∂L}{∂x}+j\frac{∂L}{∂y}) \\
       &=z_n-\alpha * \frac{∂L}{∂z^*}
\end{split}
\end{equation}
    
</font>

<font size=3>Something very interesting has happened: <font color=maroon>Wirtinger calculus tells us that we can simplify the complex variable update formula above to only refer to the conjugate Wirtinger derivative <font size=5>$\frac{\partial L}{\partial z^*} $</font>, giving us exactly the step we take in optimization.</font></font>

<font size=3 color=maroon>Because the conjugate Wirtinger derivative gives us exactly the correct step for a real valued loss function, PyTorch gives you this derivative when you differentiate a function with a real valued loss.</font>
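
As an informal check of this equivalence, the sketch below runs plain gradient descent on the same quadratic loss twice: once on a complex parameter `z`, and once on a real pair `xy` holding its real and imaginary parts as separate channels. The target, starting point, step size and iteration count are arbitrary; the code only verifies that the two trajectories coincide.

```python
import torch

target = torch.tensor(3.0 - 1.0j)
lr = 0.1

# Same starting point, stored once as a complex number and once as (x, y).
z = torch.tensor(0.5 + 0.5j, requires_grad=True)
xy = torch.tensor([0.5, 0.5], requires_grad=True)

for _ in range(50):
    loss_complex = (z - target).abs() ** 2
    loss_real = (xy[0] - target.real) ** 2 + (xy[1] - target.imag) ** 2
    loss_complex.backward()
    loss_real.backward()
    with torch.no_grad():
        z -= lr * z.grad      # complex update using autograd's gradient
        xy -= lr * xy.grad    # independent updates of the two real channels
    z.grad = None
    xy.grad = None

print(z.detach(), xy.detach())  # both end up close to the target 3 - 1j
```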

<br>

### <font style="color:red;font-size:110%;">How does PyTorch compute the <b>conjugate Wirtinger derivative</b>?</font>

Typically, our derivative formulas take in grad_output as an input, representing the incoming Vector-Jacobian product that we’ve already computed, aka, <font size=5>$\frac{\partial L}{\partial s^*}$</font>, where <font size=4>$L$</font> is the loss of the entire computation (producing a real loss) and <font size=4>$s$</font> is the output of our function. The goal here is to compute <font size=5>$\frac{\partial L}{\partial z^*}$</font>, where <font size=4>$z$</font> is the input of the function. 
 


It turns out that in the case of real loss, we can get away with only calculating <font size=5>$\frac{\partial L}{\partial s^*}$</font>, even though the chain rule implies that we also need to have access to <font size=5>$\frac{\partial L}{\partial s}$</font>. If you want to skip this derivation, look at the last equation in this section and then skip to the next section.

<br>

Let’s continue working with <font size=4>$f: ℂ → ℂ$</font> defined as <font size=4>$f(z) = f(x+yj) = u(x, y) + v(x, y)j$</font>. As discussed above, autograd’s gradient convention is centered around optimization for real valued loss functions, so let’s assume <font size=4>$f$</font> is a part of larger real valued loss function <font size=4>$g$</font>. Using chain rule, we can write:

<font size=4>$$\frac{∂L}{∂z^*}=\frac{∂L}{∂u}*\frac{∂u}{∂z^*}+\frac{∂L}{∂v}*\frac{∂v}{∂z^*} \ \ \ \ \ \ \ \ \ (1)$$</font>

Now using Wirtinger derivative definition, we can write:

<font size=4>$$\frac{∂L}{∂s}=1/2*(\frac{∂L}{∂u}-\frac{∂L}{∂v}j)$$</font>

<font size=4>$$\frac{∂L}{∂s^*}=1/2*(\frac{∂L}{∂u}+\frac{∂L}{∂v}j)$$</font>

It should be noted here that since <font size=4>$u$</font> and <font size=4>$v$</font> are real functions, and <font size=4>$L$</font> is real by our assumption that <font size=4>$f$</font> is a part of a real valued function, we have:

<font size=4>$$(\frac{∂L}{∂s})^*=\frac{∂L}{∂s^*} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$$</font>

i.e., <font size=4>$\frac{\partial L}{\partial s} $</font> equals <font size=4 color=blue>$grad\_output^*$</font>.

Solving the above equations for <font size=4>$\frac{\partial L}{\partial u}$</font> and <font size=4>$\frac{\partial L}{\partial v}$</font>, we get:

<font size=4>$$\frac{∂L}{∂u}=\frac{∂L}{∂s}+\frac{∂L}{∂s^*}$$</font>
<font size=4>$$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$$</font>
<font size=4>$$\frac{∂L}{∂v}=1j*(\frac{∂L}{∂s}-\frac{∂L}{∂s^*})$$</font>

Substituting <font size=3 color=red>(3)</font> in <font size=3 color=red>(1)</font>, we get:

<font size=4>
\begin{equation}
\begin{split}
\frac{∂L}{∂z^*}&=(\frac{∂L}{∂s} + \frac{∂L}{∂s^*})*\frac{∂u}{∂z^*}+1j*(\frac{∂L}{∂s}-\frac{∂L}{∂s^*})*\frac{∂v}{∂z^*} \\ \\
         &=\frac{∂L}{∂s}*(\frac{∂u}{∂z^*}+\frac{∂v}{∂z^*}j)+\frac{∂L}{∂s^*}*(\frac{∂u}{∂z^*}-\frac{∂v}{∂z^*}j)  \\ \\
         &=\frac{∂L}{∂s}*\frac{∂(u+vj)}{∂z^*}+\frac{∂L}{∂s^*}*\frac{∂(u+vj)^*}{∂z^*} \\ \\
         &=\frac{∂L}{∂s}*\frac{∂s}{∂z^*}+\frac{∂L}{∂s^*}*\frac{∂s^*}{∂z^*}
\end{split}
\end{equation}
</font>

Using <font size=3 color=red>(2)</font>, we get:

<font size=4>$$\frac{∂L}{∂z^*}=(\frac{∂L}{∂s^*})^**\frac{∂s}{∂z^*}+\frac{∂L}{∂s^*}*(\frac{∂s}{∂z})^*$$</font>
<div class="alert alert-block alert-danger" style="font-size:150%">
$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =(𝑔𝑟𝑎𝑑\_𝑜𝑢𝑡𝑝𝑢𝑡)^**\frac{∂s}{∂z^*}+ 𝑔𝑟𝑎𝑑\_𝑜𝑢𝑡𝑝𝑢𝑡 *(\frac{∂s}{∂z})^*\ \ \ \ (4)$
</div>

<font color=red size=5>This last equation is the important one for writing your own gradients, as it decomposes our derivative formula into a simpler one that is easy to compute by hand.</font>

<br>

### <font style="color:red;font-size:110%">How can I write my own derivative formula for a complex function?</font>

The above boxed equation gives us the general formula for all derivatives on complex functions. However, we still need to compute <font size=5>$\frac{\partial s}{\partial z}$</font> and <font size=5>$\frac{\partial s}{\partial z^*}$</font>. There are two ways you could do this:

* The first way is to just use the definition of Wirtinger derivatives directly and calculate <font size=5>$\frac{\partial s}{\partial z}$</font> and <font size=5>$\frac{\partial s}{\partial z^*}$</font> by using <font size=5>$\frac{\partial s}{\partial x}$</font> and <font size=5>$\frac{\partial s}{\partial y}$</font> (which you can compute in the normal way).


* The second way is to use the change of variables trick and rewrite <font size=4>$f(z)$</font> as a two variable function <font size=4>$f(z, z^*)$</font>, and compute the conjugate Wirtinger derivatives by treating <font size=4>$z$</font> and <font size=4>$z^*$</font> as independent variables. This is often easier; for example, if the function in question is holomorphic, only <font size=4>$z$</font> will be used (and <font size=5>$\frac{\partial s}{\partial z^*}$</font> will be zero).

<br>

Let’s consider the function <font size=4>$f(z=x+yj)=c∗z=c∗(x+yj)$</font> as an example, where <font size=4>$c \in ℝ$</font>.

Using the first way to compute the Wirtinger derivatives, we have:

<font size=4>
\begin{equation}
\begin{split}
\frac{∂s}{∂z}
    &=1/2*(\frac{∂s}{∂x}-\frac{∂s}{∂y}j) \\
    &=1/2*(c-(c*1j)*1j) \\
    &=c
\end{split}
\end{equation}
</font>

<font size=4>
\begin{equation}
\begin{split}
\frac{∂s}{∂z^*}
    &=1/2*(\frac{∂s}{∂x}+\frac{∂s}{∂y}j) \\
    &=1/2*(c+(c*1j)*1j) \\
    &=0
\end{split}
\end{equation}
</font>

Using <font size=3 color=red>(4)</font>, and *`grad_output = 1.0`* (which is the default grad output value used when `backward()` is called on a scalar output in PyTorch), we get:

<font size=4>$$\frac{∂L}{∂z^*}=1*0+1*c=c$$</font>

Using the second way to compute Wirtinger derivatives, we directly get:

<br>

<font size=4>
\begin{equation}
\begin{split}
\frac{∂s}{∂z}=\frac{∂(c*z)}{∂z}=c
\end{split}
\end{equation}
</font>

<br>

<font size=4>
\begin{equation}
\begin{split}
\frac{∂s}{∂z^*}=\frac{∂(c*z)}{∂z^*}=0
\end{split}
\end{equation}
</font>

And using <font size=3 color=red>(4)</font> again, we get <font size=4>$\frac{\partial L}{\partial z^*} = c$</font>. As you can see, <font color=red size=5>the second way involves fewer calculations and is the quicker route when deriving gradients by hand.</font>
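The worked example above can be checked numerically. The following is a minimal sketch, assuming a recent PyTorch build with complex autograd support; the values of `c` and `z` are arbitrary choices made for illustration. Because the output is complex, `grad_output = 1.0` has to be passed explicitly rather than relying on the implicit default.

```python
import torch

c = 3.0                                            # real constant (arbitrary)
z = torch.tensor(1.0 + 2.0j, requires_grad=True)   # arbitrary complex input
s = c * z                                          # holomorphic: ds/dz = c, ds/dz* = 0

# Complex outputs have no implicit grad_output, so pass 1.0 explicitly.
grad_output = torch.tensor(1.0 + 0.0j)
s.backward(grad_output)

# Equation (4) evaluated by hand with the two Wirtinger derivatives.
ds_dz, ds_dzconj = complex(c), 0j
expected = grad_output.conj() * ds_dzconj + grad_output * ds_dz.conjugate()

print(z.grad, expected)
```

Both printed values should agree with the hand-derived result $c$.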

<br>

### What about cross-domain functions?

Some functions map from complex inputs to real outputs, or vice versa. These functions form a special case of <font size=3 color=red>(4)</font>, which we can derive using the chain rule:

* For <font size=4>$f:ℂ → ℝ$</font>, we get:

<div class="alert alert-block alert-danger"><font size=5>$$\frac{∂L}{∂z^*}=2*𝑔𝑟𝑎𝑑\_𝑜𝑢𝑡𝑝𝑢𝑡*\frac{∂s}{∂z^*}$$</font></div>

* For <font size=4>$f: ℝ → ℂ$</font>, we get:

<div class="alert alert-block alert-danger"><font size=5>$$\frac{∂L}{∂z^*}=2*Re(𝑔𝑟𝑎𝑑\_𝑜𝑢𝑡𝑝𝑢𝑡^**\frac{∂s}{∂z^*})$$</font></div>
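As a quick check of the $ℂ → ℝ$ case, here is a small sketch using $s = |z|^2 = z z^*$, for which $\frac{∂s}{∂z^*} = z$. The function and the input value are chosen for illustration only; they are not part of the original note.

```python
import torch

z = torch.tensor(3.0 + 4.0j, requires_grad=True)   # arbitrary complex input
s = z.abs() ** 2                                    # real output: s = z * z^*
s.backward()                                        # real scalar, grad_output defaults to 1.0

# C -> R formula: dL/dz* = 2 * grad_output * ds/dz*, with ds/dz* = z
expected = 2 * 1.0 * z.detach()

print(z.grad, expected)
```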

<br>
<br>

## Hooks for saved tensors

You can control <a href="https://pytorch.org/docs/stable/notes/autograd.html#saved-tensors-doc" style="text-decoration:none;">how saved tensors are packed / unpacked</a> by defining a pair of `pack_hook` / `unpack_hook` hooks. 

* The `pack_hook` function should take a tensor as its single argument but can return any python object (e.g. another tensor, a tuple, or even a string containing a filename). 


* The `unpack_hook` function takes as its single argument the output of pack_hook and should return a tensor to be used in the backward pass. The tensor returned by unpack_hook only needs to have the same content as the tensor passed as input to pack_hook. In particular, any autograd-related metadata can be ignored as they will be overwritten during unpacking.

An example of such pair is:

```python
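import os
import tempfile
import uuid

import torch

tmp_dir = tempfile.mkdtemp()  # directory that holds the temporary files


class SelfDeletingTempFile():
    def __init__(self):
        self.name = os.path.join(tmp_dir, str(uuid.uuid4()))

    def __del__(self):
        # Remove the file once this wrapper object is garbage-collected
        os.remove(self.name)


def pack_hook(tensor):
    temp_file = SelfDeletingTempFile()
    torch.save(tensor, temp_file.name)
    return temp_file


def unpack_hook(temp_file):
    return torch.load(temp_file.name)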
```

<font color=maroon>Notice that</font> the `unpack_hook` should not delete the temporary file because it might be called multiple times: the temporary file should be alive for as long as the returned ***`SelfDeletingTempFile`*** object is alive. In the above example, we prevent leaking the temporary file by deleting it when it is no longer needed (on deletion of the ***`SelfDeletingTempFile`*** object).

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

We guarantee that `pack_hook` will only be called once but `unpack_hook` can be called as many times as the backward pass requires it and we expect it to return the same data each time.

</div>

<div class="alert alert-block alert-danger">

<font size=3 color=red><b>WARNING: </b></font>

Performing in-place operations on the input of any of the functions is forbidden as they may lead to unexpected side-effects. PyTorch will throw an error if the input to a pack hook is modified in-place but does not catch the case where the input to an unpack hook is modified in-place.

</div>

<br>

### Registering hooks for a saved tensor

You can register a pair of hooks on a saved tensor by calling the `register_hooks()` method on a `SavedTensor` object. Those objects are exposed as attributes of a `grad_fn` and start with the `_raw_saved_` prefix.

```python
x = torch.randn(5, requires_grad=True)
y = x.pow(2)
y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)
```

* The `pack_hook` method is called as soon as the pair is registered. 

* The `unpack_hook` method is called each time the saved tensor needs to be accessed, either by means of `y.grad_fn._saved_self` or during the backward pass.
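The call timing described above can be observed with a pair of counting hooks. This is a minimal sketch: the `calls` dictionary and the identity hooks are hypothetical helpers introduced only for illustration, and the exact number of unpack calls may differ between PyTorch versions.

```python
import torch

calls = {"pack": 0, "unpack": 0}   # hypothetical counters for illustration

def pack_hook(t):
    calls["pack"] += 1
    return t                        # identity packing, nothing is moved

def unpack_hook(t):
    calls["unpack"] += 1
    return t

x = torch.randn(5, requires_grad=True)
y = x.pow(2)
y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)
packs_after_register = calls["pack"]     # pack_hook runs at registration

_ = y.grad_fn._saved_self                # explicit access triggers unpack_hook
y.sum().backward()                       # the backward pass unpacks as well

print(packs_after_register, calls)
```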

<div class="alert alert-block alert-danger">

<font size=3 color=red><b>WARNING: </b></font>

If you maintain a reference to a `SavedTensor` after the saved tensors have been released (i.e. after backward has been called), calling its `register_hooks()` is forbidden. PyTorch will throw an error most of the time but it may fail to do so in some cases and undefined behavior may arise.

</div>

<br>

### Registering default hooks for saved tensors

Alternatively, you can use the context-manager <a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.graph.saved_tensors_hooks" style="text-decoration:none;"><b>saved_tensors_hooks</b></a> to register a pair of hooks which will be applied to ***all*** saved tensors that are created in that context.

Example:

```python
# Only save on disk tensors that have size >= 1000
SAVE_ON_DISK_THRESHOLD = 1000

def pack_hook(x):
    if x.numel() < SAVE_ON_DISK_THRESHOLD:
        return x
    temp_file = SelfDeletingTempFile()
    torch.save(x, temp_file.name)
    return temp_file

def unpack_hook(tensor_or_sctf):
    if isinstance(tensor_or_sctf, torch.Tensor):
        return tensor_or_sctf
    return torch.load(tensor_or_sctf.name)

class Model(nn.Module):
    def forward(self, x):
        with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
            # ... compute output
            output = x
        return output

model = Model()
net = nn.DataParallel(model)
```

The hooks defined with this context manager are <font size=3 color=maroon><b>thread-local</b></font>. Hence, the following code will not produce the desired effects because the hooks <font size=3 color=maroon>do not go through DataParallel</font>.

```python
# Example of what NOT to do

net = nn.DataParallel(model)
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    output = net(input)
```

<br>

<font color=maroon><b>Note that</b></font> using those hooks disables all the optimization in place to reduce Tensor object creation. For example:

```python
with torch.autograd.graph.saved_tensors_hooks(lambda x: x, lambda x: x):
    x = torch.randn(5, requires_grad=True)
    y = x * x
```

Without the hooks, `x`, `y.grad_fn._saved_self` and `y.grad_fn._saved_other` all refer to the same tensor object. With the hooks, PyTorch will pack and unpack x into two new tensor objects that share the same storage with the original x (no copy performed).
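The storage-sharing claim above can be inspected directly. The sketch below assumes identity hooks as in the preceding snippet; it prints the object-identity comparisons rather than asserting them, since those are implementation details.

```python
import torch

x = torch.randn(5, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(lambda t: t, lambda t: t):
    y = x * x

saved_self = y.grad_fn._saved_self
saved_other = y.grad_fn._saved_other

# Object identity with x (expected to differ when hooks are active)
print(saved_self is x, saved_other is x)
# Same underlying storage, so no copy of the data was made
print(saved_self.data_ptr() == x.data_ptr())

y.sum().backward()   # gradients are unaffected by the identity hooks
```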

<br>
<br>

## Backward Hooks execution 

<a href="https://pytorch.org/docs/stable/notes/autograd.html#backward-hooks-execution" style="text-decoration:none">Omitted for now [link]</a>

<br>
<br>
<br>

<font color=gray size=3>Docs > Automatic differentiation package - torch.autograd ></font>

# <font style="font-size:120%;color:maroon"> torch.autograd.backward</font><a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html" style="text-decoration:none;"><font size=2>[link]</font></a>

<div class="alert alert-block alert-info">

<font size=4><a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html" style="text-decoration:none;">torch.autograd<b>.backward</b></a>(<font color=gray size=3><i>tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None</i></font>)</font>
</div>

<font size=4 color=red>Computes the sum of gradients of <b><i>given tensors</i></b> with respect to <b><i>graph leaves</i></b>.</font>

<font size=3>The graph is differentiated using the ***chain rule***. If any of **`tensors`** are non-scalar (i.e. their data has more than one element) and require gradient, then the ***Jacobian-vector product*** would be computed, in this case the function additionally requires specifying **`grad_tensors`**. It should be a sequence of matching length, that contains the “vector” in the *Jacobian-vector product*, usually the gradient of the differentiated function w.r.t. corresponding tensors (**`None`** is an acceptable value for all tensors that don’t need gradient tensors).</font>

<font size=3>This function accumulates gradients in the leaves - you might need to zero **`.grad`** attributes or set them to **`None`** before calling it. See `Docs > Automatic differentiation package - torch.autograd > `<a href="https://pytorch.org/docs/stable/autograd.html#default-grad-layouts" style="text-decoration:none;"><font color=maroon>Default gradient layouts</font></a> for details on the memory layout of accumulated gradients.</font>
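A short sketch of the two points above: a non-scalar output needs an explicit `grad_tensors` vector, and repeated calls accumulate into `.grad`. The tensor values are arbitrary and chosen only for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                                   # non-scalar output
v = torch.tensor([1.0, 0.1, 0.01])           # the "vector" in the product

torch.autograd.backward([y], grad_tensors=[v], retain_graph=True)
first = x.grad.clone()                       # v * dy/dx = v * 2x

torch.autograd.backward([y], grad_tensors=[v])
second = x.grad.clone()                      # accumulated on top of `first`

x.grad = None                                # reset before the next accumulation
print(first, second)
```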

<br>

**Parameters** `(see the `<a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html" style="text-decoration:none;"><font size=2>[link]</font></a>` for details)`

<br>

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

Using this method with `create_graph=True` will create a reference cycle between the parameter and its gradient which can cause a memory leak. We recommend using `autograd.grad` when creating the graph to avoid this. If you have to use this function, make sure to reset the `.grad` fields of your parameters to `None` after use to break the cycle and avoid the leak.

</div>

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

If you run any forward ops, create `grad_tensors`, and/or call `backward` in a user-specified CUDA stream context, see <a href="https://pytorch.org/docs/stable/notes/cuda.html#bwd-cuda-stream-semantics" style="text-decoration:none;"><font color=maroon>Stream semantics of backward passes</font></a>.

</div>

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

When `inputs` are provided and a given input is not a ***leaf***, the current implementation will call its `grad_fn` (even though it is not strictly needed to get these gradients). It is an implementation detail on which the user should not rely. See <a href="https://github.com/pytorch/pytorch/pull/60521#issuecomment-867061780" style="text-decoration:none;"><font color=maroon>https://github.com/pytorch/pytorch/pull/60521#issuecomment-867061780</font></a> for more details.

</div>

<br>

<br>
<br>
<br>

<font color=gray size=3>Docs > Automatic differentiation package - torch.autograd > </font>

# <font style="font-size:120%;color:maroon">torch.autograd.grad</font><a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html" style="text-decoration:none;"><font size=2>[link]</font></a>

<div class="alert alert-block alert-info">

<font size=4><a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html#torch.autograd.grad" style="text-decoration:none;">torch.autograd<b>.grad</b></a>(<font color=gray size=3><i>outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False, is_grads_batched=False</i></font>)</font>
</div>

<font size=4 color=red>Computes and returns the sum of gradients of <b><i>outputs</i></b> with respect to the <b><i>inputs</i></b>.</font>

<font size=3>`grad_outputs` should be a sequence of length matching `outputs` containing the “vector” in ***vector-Jacobian product***, usually the pre-computed gradients w.r.t. each of the outputs. If an output doesn’t `require_grad`, then the gradient can be `None`.</font>
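For comparison with `backward`, here is a minimal sketch of `torch.autograd.grad` with an explicit `grad_outputs` vector. It returns the gradients instead of accumulating them into `.grad`; the tensors are arbitrary illustrative values.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)
y = x * w                                    # non-scalar output
v = torch.ones_like(y)                       # the "vector" in the vector-Jacobian product

gx, gw = torch.autograd.grad(outputs=y, inputs=[x, w], grad_outputs=v)

print(gx, gw)
print(x.grad, w.grad)                        # .grad fields are left untouched
```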

<br>

**Parameters** `(see the `<a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html" style="text-decoration:none;"><font size=2>[link]</font></a>` for details)`

<br>

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

If you run any forward ops, create `grad_outputs`, and/or call `grad` in a user-specified CUDA stream context, see <a href="https://pytorch.org/docs/stable/notes/cuda.html#bwd-cuda-stream-semantics" style="text-decoration:none;"><font color=maroon>Stream semantics of backward passes</font></a>.

</div>

<div class="alert alert-block alert-info">

<font size=3 color=red><b>NOTE: </b></font>

`only_inputs` argument is deprecated and is ignored now (defaults to `True`). To accumulate gradient for other parts of the graph, please use <a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html" style="text-decoration:none;"><font color=maroon>torch.autograd.backward</font></a>.

</div>

<br>

<br>
<br>
<br>

<font color=gray size=3>Docs > </font>

# <font style="font-size:120%;color:maroon">Automatic differentiation package - <b>torch.autograd</b></font> <a href="https://pytorch.org/docs/stable/autograd.html" style="text-decoration:none;"><font size=2>[link]</font></a>

<br>

<font size=4 color=maroon><b>torch.autograd</b></font> provides classes and functions implementing automatic differentiation of arbitrary scalar valued functions. It requires minimal changes to the existing code - you only need to declare `Tensor`s for which gradients should be computed with the `requires_grad=True` keyword. As of now, we only support autograd for ***floating point Tensor types*** (half, float, double and bfloat16) and ***complex Tensor types*** (cfloat, cdouble).

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html#torch.autograd.backward" style="text-decoration:none;"><font color=maroon size=4><b>backward</b> (torch.autograd.backward)</font></a><br>
Computes the sum of gradients of given tensors `with respect to `**`graph leaves`**. (see also section 3 of this notebook)

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html#torch.autograd.grad" style="text-decoration:none;"><font color=maroon size=4><b>grad</b> (torch.autograd.grad)</font></a><br>
Computes and returns the sum of gradients of outputs `with respect to `**`the inputs`**.

<br>
<br>

## <font style="color:red;font-size:110%">Forward-mode Automatic Differentiation</font>

**WARNING:** (omitted)

Please see the <a href="https://pytorch.org/tutorials/intermediate/forward_ad_usage.html" style="text-decoration:none;"><font size=4>forward-mode AD tutorial</font></a> for detailed steps on how to use this API.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.forward_ad.dual_level.html#torch.autograd.forward_ad.dual_level" style="text-decoration:none;"><font color=maroon size=4>forward_ad.dual_level</font></a><br>
Context-manager that enables forward AD.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.forward_ad.make_dual.html#torch.autograd.forward_ad.make_dual" style="text-decoration:none;"><font color=maroon size=4>forward_ad.make_dual</font></a><br>
Associates a tensor value with a forward gradient, the tangent, to create a “dual tensor”, which is used to compute forward AD gradients.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.forward_ad.unpack_dual.html#torch.autograd.forward_ad.unpack_dual" style="text-decoration:none;"><font color=maroon size=4>forward_ad.unpack_dual</font></a><br>
Unpacks a “dual tensor” to get both its Tensor value and its forward AD gradient.
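The three entries above fit together as in the following sketch, which assumes a PyTorch version that ships `torch.autograd.forward_ad`. The primal and tangent values are arbitrary.

```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.tensor([1.0, 2.0, 3.0])
tangent = torch.tensor([1.0, 0.0, 0.0])      # direction of the derivative

with fwAD.dual_level():
    dual_in = fwAD.make_dual(primal, tangent)
    dual_out = dual_in.sin() * 2             # ordinary ops propagate the tangent
    out, jvp = fwAD.unpack_dual(dual_out)    # value and Jacobian-vector product

print(out, jvp)
```

Here `jvp` is the Jacobian of `2 * sin(x)` applied to `tangent`, computed in a single forward pass.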

<br>
<br>

## Functional higher level API

**WARNING:** (omitted)

This section contains the higher level API for the autograd that `builds on the basic API above` and allows you to compute ***jacobians***, ***hessians***, etc.

This API works with **`user-provided functions`**` that take only Tensors as input and return only Tensors.` <font color=maroon>If your function takes other arguments that are not Tensors or Tensors that don’t have `requires_grad` set, you can use a lambda to capture them.</font> 

For example, for a function `f` that takes three inputs, a Tensor for which we want the jacobian, another tensor that should be considered constant and a boolean flag as `f(input, constant, flag=flag)` you can use it as `functional.jacobian(lambda x: f(x, constant, flag=flag), input)`.
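The lambda-capture pattern described above might look like this in practice. The body of `f` is a hypothetical stand-in invented for illustration; only the calling convention follows the text.

```python
import torch
from torch.autograd import functional

def f(inp, constant, flag=True):
    # Hypothetical function: only `inp` should be differentiated
    out = inp * constant
    return out ** 2 if flag else out

inp = torch.tensor([1.0, 2.0, 3.0])
constant = torch.tensor([2.0, 2.0, 2.0])
flag = True

# Capture the non-differentiated arguments in a lambda
jac = functional.jacobian(lambda x: f(x, constant, flag=flag), inp)

# hessian expects a scalar-valued function, so reduce with sum()
hess = functional.hessian(lambda x: f(x, constant, flag=flag).sum(), inp)

print(jac)
print(hess)
```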

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.functional.jacobian.html#torch.autograd.functional.jacobian" style="text-decoration:none;"><font color=maroon size=4>functional.jacobian</font></a><br>
Function that computes the Jacobian of a given function.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.functional.hessian.html#torch.autograd.functional.hessian" style="text-decoration:none;"><font color=maroon size=4>functional.hessian</font></a><br>
Function that computes the Hessian of a given scalar function.

* (others omitted)

<br>
<br>

## <font style="color:red;font-size:110%">Locally disabling gradient computation</font>

See Docs > Autograd mechanics > <a href="https://pytorch.org/docs/stable/notes/autograd.html#locally-disable-grad-doc" style="text-decoration:none;">Locally disabling gradient computation</a> for more information on the differences between `no-grad` and `inference mode` as well as `other related mechanisms` that may be confused with the two.

Also see <a href="https://pytorch.org/docs/stable/torch.html#torch-rst-local-disable-grad" style="text-decoration:none">Locally disabling gradient computation</a> for a list of functions that can be used to locally disable gradients.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.no_grad.html#torch.autograd.no_grad" style="text-decoration:none;"><font color=maroon size=4>no_grad</font></a><br>
Context-manager that disables gradient calculation.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.enable_grad.html#torch.autograd.enable_grad" style="text-decoration:none;"><font color=maroon size=4>enable_grad</font></a><br>
Context-manager that enables gradient calculation.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.set_grad_enabled.html#torch.autograd.set_grad_enabled" style="text-decoration:none;"><font color=maroon size=4>set_grad_enabled</font></a><br>
Context-manager that sets gradient calculation to on or off.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.inference_mode.html#torch.autograd.inference_mode" style="text-decoration:none;"><font color=maroon size=4>inference_mode</font></a><br>
Context-manager that enables or disables inference mode.
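A compact sketch of how the four context managers listed above behave, assuming a PyTorch version that provides `torch.inference_mode`:

```python
import torch

x = torch.ones(2, requires_grad=True)

with torch.no_grad():
    a = x * 2                      # not tracked by autograd
    with torch.enable_grad():
        b = x * 2                  # tracking re-enabled inside no_grad

with torch.set_grad_enabled(False):
    c = x * 2                      # same effect as no_grad here

with torch.inference_mode():
    d = x * 2                      # inference tensor, never tracked

print(a.requires_grad, b.requires_grad, c.requires_grad, d.requires_grad)
print(d.is_inference())
```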

<br>
<br>

## Default gradient layouts

When a non-sparse param receives a non-sparse gradient during <a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html#torch.autograd.backward" style="text-decoration:none;">torch.autograd.backward()</a> or <a href="https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html#torch.Tensor.backward" style="text-decoration:none;">torch.Tensor.backward()</a>, `param.grad` is accumulated as follows:

* <font size=3><b>If `param.grad` is initially `None`:</b></font>
    1. If `param`’s memory is non-overlapping and dense, `.grad` is created with strides matching `param` (thus matching `param`’s layout).
<br>
<br>
    2. Otherwise, `.grad` is created with rowmajor-contiguous strides.

* <font size=3><b>If `param` already has a non-sparse `.grad` attribute:</b></font>
    3. If `create_graph=False`, `backward()` accumulates into `.grad` in-place, which preserves its strides.
<br>
<br>
    4. If `create_graph=True`, `backward()` replaces `.grad` with a new tensor `.grad + new grad`, which attempts (but does not guarantee) matching the preexisting `.grad’s` strides.

<font color=maroon size=3>The default behavior (letting `.grads` be `None` before the first `backward()`, such that their layout is created according to 1 or 2, and retained over time according to 3 or 4) is recommended for best performance. Calls to `model.zero_grad()` or `optimizer.zero_grad()` will not affect `.grad` layouts.

In fact, resetting all `.grads` to `None` before each accumulation phase, such that they’re recreated according to 1 or 2 every time, is a valid alternative to `model.zero_grad()` or `optimizer.zero_grad()` that may improve performance for some networks.</font>
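The reset-to-`None` pattern mentioned above can be sketched as a small training loop. The model, data, and optimizer below are placeholders chosen for illustration and are not part of the original note.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data, target = torch.randn(8, 4), torch.randn(8, 1)

for _ in range(3):
    # Reset gradients to None instead of zeroing them in place
    for param in model.parameters():
        param.grad = None
    loss = nn.functional.mse_loss(model(data), target)
    loss.backward()                              # .grad is recreated per rule 1 or 2
    optimizer.step()
```

Calling `optimizer.zero_grad(set_to_none=True)` is another way to get the same reset.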

<br>

### Manual gradient layouts

<font color=maroon size=3>If you need manual control over `.grad`’s strides, assign `param.grad = a zeroed tensor with desired strides` before the first `backward()`, and never reset it to `None`. <b>3</b> guarantees your layout is preserved as long as `create_graph=False`. <b>4</b> indicates your layout is likely preserved even if `create_graph=True`.</font>
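A minimal sketch of the manual-layout recipe, using a transposed zero tensor as the "desired strides" (an arbitrary choice for illustration) and the default `create_graph=False`:

```python
import torch

param = torch.randn(2, 3, requires_grad=True)    # row-major strides (3, 1)

# Zeroed gradient with column-major strides (1, 2), assigned before backward
param.grad = torch.zeros(3, 2).t()

for _ in range(2):
    (param * 2).sum().backward()                 # accumulates in place (rule 3)

print(param.stride(), param.grad.stride())
```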

<br>
<br>

## In-place operations on Tensors

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.

### In-place correctness checks

All `Tensor`s keep track of in-place operations applied to them, and if the implementation detects that a tensor was saved for backward in one of the functions, but it was modified in-place afterwards, an error will be raised once backward pass is started. This ensures that if you’re using in-place functions and not seeing any errors, you can be sure that the computed gradients are correct.
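The check described above can be triggered deliberately. In this sketch `exp` saves its output for the backward pass, so modifying that output in place makes the later `backward()` call fail:

```python
import torch

a = torch.randn(3, requires_grad=True)
b = a.exp()          # exp saves its result for the backward pass
b.add_(1)            # in-place change invalidates the saved tensor

try:
    b.sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("autograd detected the in-place modification:", type(err).__name__)
```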

<br>
<br>

## Variable (deprecated)

<div class="alert alert-block alert-danger">

<font size=3 color=red><b>WARNING: </b></font>

The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with `requires_grad` set to `True`. 
<br>
<br>
Below please find a quick guide on what has changed:

</div>

<br>
<br>

##  <font style="color:red;font-weight:bold">Tensor autograd functions</font>

* <font color=maroon size=4>torch.Tensor.grad</font><br>
This attribute is `None` by default and becomes a Tensor the first time a call to <a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html#torch.autograd.backward" style="text-decoration:none;"><font color=maroon>backward()</font></a> computes gradients for `self`.

* <font color=maroon size=4>torch.Tensor.requires_grad</font><br>
Is `True` if gradients need to be computed for this Tensor, `False` otherwise.

* <font color=maroon size=4>torch.Tensor.is_leaf</font><br>
<font size=3>All Tensors that have `requires_grad` which is `False` will be **leaf Tensors** by convention.</font>

* <font color=maroon size=4>torch.Tensor.backward</font>([gradient, …])<br>
Computes the gradient of current tensor w.r.t. graph leaves.

* <font color=maroon size=4>torch.Tensor.detach</font><br>
Returns a new Tensor, detached from the current graph.

* <font color=maroon size=4>torch.Tensor.detach_</font><br>
Detaches the Tensor from the graph that created it, making it a leaf.

* <font color=maroon size=4>torch.Tensor.register_hook</font>(hook)<br>
Registers a backward hook.

* <font color=maroon size=4>torch.Tensor.retain_grad</font>()<br>
Enables this Tensor to have its <a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html#torch.autograd.grad" style="text-decoration:none;"><font color=maroon>grad</font></a> populated during <a href="https://pytorch.org/docs/stable/generated/torch.autograd.backward.html#torch.autograd.backward" style="text-decoration:none;"><font color=maroon>backward()</font></a>.
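Several of the attributes and methods listed above can be seen together in one short sketch. The `seen` list is a hypothetical helper used only to record what the hook receives.

```python
import torch

x = torch.ones(3, requires_grad=True)    # leaf created by the user
y = x * 2                                # non-leaf result of an operation
y.retain_grad()                          # keep y.grad after backward

seen = []                                # hypothetical log of hook inputs
handle = y.register_hook(lambda g: seen.append(g.clone()))

y.sum().backward()
handle.remove()                          # unregister the hook

d = y.detach()                           # shares data, cut off from the graph

print(x.is_leaf, y.is_leaf, d.is_leaf)
print(x.grad, y.grad, d.requires_grad)
```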


<br>
<br>

##  <font style="color:red;font-weight:bold">Function</font>

<div class="alert alert-block alert-info">

<font size=4 color=black>`CLASS` torch.autograd.Function(*args, **kwargs)</font>
</div>

<font color=magenta>Base class to create custom *`autograd.Function`*</font>

To create a custom *`autograd.Function`*, subclass this class and implement the <a href="https://pytorch.org/docs/stable/generated/torch.autograd.Function.forward.html#torch.autograd.Function.forward" style="text-decoration:none;"><font color=maroon>forward()</font></a> and <a href="https://pytorch.org/docs/stable/generated/torch.autograd.Function.backward.html#torch.autograd.Function.backward" style="text-decoration:none;"><font color=maroon>backward()</font></a> static methods. Then, to use your custom op in the forward pass, call the class method `apply`. Do not call `forward()` directly.

To ensure correctness and best performance, make sure you are calling the correct methods on `ctx` and validating your backward function using <a href="https://pytorch.org/docs/stable/generated/torch.autograd.gradcheck.html#torch.autograd.gradcheck" style="text-decoration:none;"><font color=maroon>torch.autograd.gradcheck()</font></a>.

See Docs > Extending PyTorch > <a href="https://pytorch.org/docs/stable/notes/extending.html#extending-autograd" style="text-decoration:none;"><font color=maroon>Extending torch.autograd</font></a> for more details on how to use this class.

Examples:

```python
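import torch
from torch.autograd import Function


class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        # Save the output: the derivative of exp(i) is exp(i) itself
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result


# Use it by calling the apply method:
input = torch.randn(3, requires_grad=True)
output = Exp.apply(input)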
```

<br>

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.Function.forward.html#torch.autograd.Function.forward" style="text-decoration:none;"><font color=maroon size=4>Function.forward</font></a><br>
Performs the operation.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.Function.backward.html#torch.autograd.Function.backward" style="text-decoration:none;"><font color=maroon size=4>Function.backward</font></a><br>
Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the `vjp` function).

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.Function.jvp.html#torch.autograd.Function.jvp" style="text-decoration:none;"><font color=maroon size=4>Function.jvp</font></a><br>
Defines a formula for differentiating the operation with forward mode automatic differentiation.

<br>
<br>

## Context method mixins

When creating a new <a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function" style="text-decoration:none;">Function</a>, the following methods are available to *`ctx`*.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.mark_dirty.html#torch.autograd.function.FunctionCtx.mark_dirty" style="text-decoration:none;"><font color=maroon size=4>function.FunctionCtx.mark_dirty</font></a><br>
Marks given tensors as modified in an in-place operation.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.mark_non_differentiable.html#torch.autograd.function.FunctionCtx.mark_non_differentiable" style="text-decoration:none;"><font color=maroon size=4>function.FunctionCtx.mark_non_differentiable</font></a><br>
Marks outputs as non-differentiable.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.save_for_backward.html#torch.autograd.function.FunctionCtx.save_for_backward" style="text-decoration:none;"><font color=maroon size=4>function.FunctionCtx.save_for_backward</font></a><br>
Saves given tensors for a future call to <a href="https://pytorch.org/docs/stable/generated/torch.autograd.Function.backward.html#torch.autograd.Function.backward" style="text-decoration:none;">backward()</a>.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.set_materialize_grads.html#torch.autograd.function.FunctionCtx.set_materialize_grads" style="text-decoration:none;"><font color=maroon size=4>function.FunctionCtx.set_materialize_grads</font></a><br>
Sets whether to materialize output grad tensors.
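As an illustration of two of these `ctx` methods, the sketch below marks an integer index output as non-differentiable and saves tensors for the backward pass. The class name `SortWithIndices` and the input values are hypothetical.

```python
import torch
from torch.autograd import Function

class SortWithIndices(Function):          # hypothetical example class
    @staticmethod
    def forward(ctx, x):
        values, idx = x.sort()
        ctx.mark_non_differentiable(idx)  # integer indices carry no gradient
        ctx.save_for_backward(x, idx)
        return values, idx

    @staticmethod
    def backward(ctx, g_values, g_idx):   # g_idx must still be accepted
        x, idx = ctx.saved_tensors
        grad_input = torch.zeros_like(x)
        # Route each output gradient back to the position it was sorted from
        grad_input.index_add_(0, idx, g_values)
        return grad_input

x = torch.tensor([3.0, 1.0, 2.0], requires_grad=True)
values, idx = SortWithIndices.apply(x)
weights = torch.tensor([1.0, 2.0, 3.0])
(values * weights).sum().backward()

print(idx.requires_grad, x.grad)
```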

<br>
<br>

## Numerical gradient checking

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.gradcheck.html#torch.autograd.gradcheck" style="text-decoration:none;"><font color=maroon size=4>gradcheck</font></a><br>
Check gradients computed via small finite differences against analytical gradients w.r.t. tensors in `inputs` that are of floating point or complex type and with `requires_grad=True`.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.gradgradcheck.html#torch.autograd.gradgradcheck" style="text-decoration:none;"><font color=maroon size=4>gradgradcheck</font></a><br>
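Check gradients of gradients computed via small finite differences against analytical gradients w.r.t. tensors in `inputs` and `grad_outputs` that are of floating point or complex type and with `requires_grad=True`.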
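A small sketch of both checkers applied to a custom `Function`. Double-precision inputs are used because finite differences are too noisy in single precision; the `Exp` class mirrors the example from the Function section, and the tolerances are illustrative choices.

```python
import torch
from torch.autograd import Function, gradcheck, gradgradcheck

class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Double precision keeps the finite-difference error small
x = torch.randn(4, dtype=torch.double, requires_grad=True)

ok_first = gradcheck(Exp.apply, (x,), eps=1e-6, atol=1e-4)
ok_second = gradgradcheck(Exp.apply, (x,), eps=1e-6, atol=1e-4)

print(ok_first, ok_second)
```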

<br>
<br>

## Profiler

Autograd includes a profiler that lets you inspect the cost of different operators inside your model - both on the CPU and GPU. There are two modes implemented at the moment:
* **CPU-only** using `profile`. 
* and **nvprof** based (registers both CPU and GPU activity) using `emit_nvtx`.

<br>

### `torch.autograd.profiler`**.profile()**

<div class="alert alert-block alert-info">

<font size=4 color=black>`CLASS` <a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.profile" style="text-decoration:none;">torch.autograd.profiler<b>.profile</b></a>(<font color=gray size=3>enabled=True, *, use_cuda=False, record_shapes=False, with_flops=False, profile_memory=False, with_stack=False, with_modules=False, use_kineto=False, use_cpu=True</font>)</font>
</div>

Context manager that manages autograd profiler state and holds a summary of results. Under the hood it just records events of functions being executed in C++ and exposes those events to Python. You can wrap any code into it and it will only report runtime of PyTorch functions. 

Note: profiler is thread local and is automatically propagated into the async tasks.

**Parameters:** (see the blue link above for details)

Example

```python
import torch

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):  # any normal python code, really!
        y = x ** 2
        y.backward()
# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                              aten::pow        32.32%       6.578ms        43.74%       8.903ms      44.515us           200  
                                              aten::mul        14.31%       2.913ms        22.91%       4.664ms      23.320us           200  
                                            aten::copy_         9.25%       1.883ms         9.25%       1.883ms       9.415us           200  
                                           PowBackward0         6.62%       1.348ms        44.17%       8.991ms      89.910us           100  
```

<br>

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.profiler.profile.export_chrome_trace.html#torch.autograd.profiler.profile.export_chrome_trace" style="text-decoration:none;"><font color=maroon size=4>profiler.profile.export_chrome_trace</font></a><br>
Exports an EventList as a Chrome tracing tools file.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.profiler.profile.key_averages.html#torch.autograd.profiler.profile.key_averages" style="text-decoration:none;"><font color=maroon size=4>profiler.profile.key_averages</font></a><br>
Averages all function events over their keys.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.profiler.profile.self_cpu_time_total.html#torch.autograd.profiler.profile.self_cpu_time_total" style="text-decoration:none;"><font color=maroon size=4>profiler.profile.self_cpu_time_total</font></a><br>
Returns total time spent on CPU obtained as a sum of all self times across all the events.

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.profiler.profile.total_average.html#torch.autograd.profiler.profile.total_average" style="text-decoration:none;"><font color=maroon size=4>profiler.profile.total_average</font></a><br>
Averages all events.
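As a rough sketch of how the four members listed above fit together, the snippet below profiles a small CPU-only loop and calls each of them. The file name `trace.json` and the temporary directory are arbitrary choices for illustration, not part of the API.

```python
import os
import tempfile

import torch

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(10):
        y = x ** 2
        y.backward()

# key_averages() groups the recorded events by key (here: the op name)
averages = prof.key_averages()
print(averages.table(sort_by="self_cpu_time_total", row_limit=5))

# total_average() collapses every recorded event into one aggregate entry
print(prof.total_average())

# self_cpu_time_total is a property: the sum of self CPU times over all events
print(prof.self_cpu_time_total)

# export_chrome_trace() writes a file that Chrome's tracing viewer can load
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
```

The exact rows and timings depend on the machine and the PyTorch build, so no output is shown here.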

<br>

### `torch.autograd.profiler`**.emit_nvtx()**

<div class="alert alert-block alert-info">

<font size=4 color=black>`CLASS` <a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.emit_nvtx" style="text-decoration:none;">torch.autograd.profiler<b>.emit_nvtx</b></a>(<font color=gray size=3>enabled=True, record_shapes=False</font>)</font>
</div>

Context manager that makes every autograd operation emit an NVTX range.

It is useful when running the program under `nvprof`.

<font color=gray>锐平: nvprof (the NVIDIA command-line profiler) and the NVIDIA Visual Profiler are tools for analyzing the performance of CUDA programs.</font>

Unfortunately, there’s no way to force nvprof to flush the data it collected to disk, so for CUDA profiling one has to use this context manager to annotate nvprof traces and wait for the process to exit before inspecting them. Then, either `NVIDIA Visual Profiler (nvvp)` can be used to visualize the timeline, or <a href="https://pytorch.org/docs/stable/generated/torch.autograd.profiler.load_nvprof.html#torch.autograd.profiler.load_nvprof" style="text-decoration:none;"><font color=maroon>torch.autograd.profiler.load_nvprof()</font></a> can load the results for inspection e.g. in Python REPL.

**Parameters:** (see the blue link above for details)

Example

```python
with torch.cuda.profiler.profile():
    model(x)      # Warmup CUDA memory allocator and profiler
    with torch.autograd.profiler.emit_nvtx():
        model(x)
```

<br>

#### <font style="font-size:120%">Forward-backward correlation</font>

When viewing a profile created using `emit_nvtx` in the Nvidia Visual Profiler, correlating each backward-pass op with the corresponding forward-pass op can be difficult. To ease this task, `emit_nvtx` appends sequence number information to the ranges it generates.

During the forward pass, each function range is decorated with `seq=<N>`. `seq` is a running counter, incremented each time a new backward Function object is created and stashed for backward. Thus, the `seq=<N>` annotation associated with each forward function range tells you that if a backward Function object is created by this forward function, the backward object will receive sequence number N. During the backward pass, the top-level range wrapping each C++ backward Function’s `apply()` call is decorated with `stashed seq=<M>`. `M` is the sequence number that the backward object was created with. By comparing `stashed seq` numbers in backward with `seq` numbers in forward, you can track down which forward op created each backward Function.

Any functions executed during the backward pass are also decorated with `seq=<N>`. During default backward (with `create_graph=False`) this information is irrelevant, and in fact, `N` may simply be 0 for all such functions. Only the top-level ranges associated with backward Function objects’ `apply()` methods are useful, as a way to correlate these Function objects with the earlier forward pass.

#### <font style="font-size:120%">Double-backward</font>

If, on the other hand, a backward pass with `create_graph=True` is underway (in other words, if you are setting up for a ***double-backward***), each function’s execution during backward is given a nonzero, useful `seq=<N>`. Those functions may themselves create Function objects to be executed later during double-backward, just as the original functions in the forward pass did. <font color=maroon>The relationship between *backward* and *double-backward* is conceptually the same as the relationship between forward and backward:</font> The functions still emit current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and during the eventual double-backward, the Function objects’ `apply()` ranges are still tagged with `stashed seq` numbers, which can be compared to seq numbers from the backward pass.
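To make the term concrete, here is a minimal CPU-only sketch of a backward pass with `create_graph=True` followed by a double-backward. It does not involve NVTX or `emit_nvtx`; it only illustrates the two passes that the sequence numbers described above would correlate.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# First backward pass, recorded in the graph so it can itself be differentiated
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Double-backward: differentiate the result of the first backward pass
dy_dx.backward()

print(dy_dx.item())   # value of 3 * x**2 at x = 3
print(x.grad.item())  # value of 6 * x at x = 3
```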

<br>

* <a href="https://pytorch.org/docs/stable/generated/torch.autograd.profiler.load_nvprof.html#torch.autograd.profiler.load_nvprof" style="text-decoration:none;"><font color=maroon size=4>profiler.load_nvprof</font></a><br>
Opens an nvprof trace file and parses autograd annotations.

<br>
<br>

## Anomaly

<div class="alert alert-block alert-info">

<font size=4 color=black>`CLASS` <a href="https://pytorch.org/docs/stable/autograd.html#anomaly-detection" style="text-decoration:none;">torch.autograd<b>.detect_anomaly</b></a></font>
</div>

Context-manager that enables anomaly detection for the autograd engine.

This does two things:

* Running the forward pass with detection enabled will allow the backward pass to print the traceback of the forward operation that created the failing backward function.


* Any backward computation that generates a “nan” value will raise an error.

<div class="alert alert-block alert-danger">

<font size=3 color=red><b>WARNING: </b></font>

<font color=black>This mode should be enabled `only for debugging` as the different tests will slow down your program execution.</font>
    
</div>

Example

In [None]:
# Try running the code below; it raises an error
import torch
from torch import autograd

class MyFunc(autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        return inp.clone()
    
    @staticmethod
    def backward(ctx, gO):
        # Error during the backward pass
        raise RuntimeError("Some error in backward")
        return gO.clone()
    

def run_fn(a):
    out = MyFunc.apply(a)
    return out.sum()
inp = torch.rand(10, 10, requires_grad=True)
out = run_fn(inp)
out.backward()

In [2]:
out

tensor(48.1482, grad_fn=<SumBackward0>)

In [None]:
# Try running the code below; it raises an error (now with the forward traceback)
with autograd.detect_anomaly():
    inp = torch.rand(10, 10, requires_grad=True)
    out = run_fn(inp)
    out.backward()
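The cells above exercise the first behaviour (the forward traceback). The second behaviour, raising on “nan”, can be sketched as follows. The choice of `sqrt` at zero is only an illustrative way to produce a nan gradient; any backward function that returns nan should trigger the same check.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

caught = None
try:
    with torch.autograd.detect_anomaly():
        # The backward of sqrt divides by 2 * sqrt(x); at x = 0 with an
        # incoming gradient of 0 this evaluates 0 / 0, which is nan
        y = (torch.sqrt(x) * 0.0).sum()
        y.backward()
except RuntimeError as err:
    caught = err

print(type(caught).__name__)
print(caught)
```

Without the context manager, the same code would finish without an exception and leave a nan in `x.grad`.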

<br>

<div class="alert alert-block alert-info">

<font size=4 color=black>`CLASS` <a href="https://pytorch.org/docs/stable/autograd.html#anomaly-detection" style="text-decoration:none;">torch.autograd<b>.set_detect_anomaly</b></a>(<font color=gray size=3>mode</font>)</font>
</div>

Context-manager that sets the anomaly detection for the autograd engine on or off.

`set_detect_anomaly` will enable or disable the autograd anomaly detection based on its argument mode. It can be used as a context-manager or as a function.

See `detect_anomaly` above for details of the anomaly detection behaviour.

**Parameters:**

* **mode** (bool) – Flag whether to enable anomaly detection (`True`), or disable (`False`).
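A small sketch of the two usage styles mentioned above, as a function and as a context manager. It relies on `torch.is_anomaly_enabled()` to read back the current setting; the variable names are illustrative only.

```python
import torch

initial = torch.is_anomaly_enabled()

# Used as a function: the setting stays in effect until it is changed again
torch.autograd.set_detect_anomaly(True)
after_call = torch.is_anomaly_enabled()

# Used as a context manager: the previous setting is restored on exit
with torch.autograd.set_detect_anomaly(False):
    inside_block = torch.is_anomaly_enabled()
after_block = torch.is_anomaly_enabled()

# Put back whatever setting this snippet started with
torch.autograd.set_detect_anomaly(initial)

print(after_call, inside_block, after_block)
```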

<br>
<br>

## Saved tensors default hooks

Some operations need intermediary results to be saved during the forward pass in order to execute the backward pass. You can define how these saved tensors should be packed / unpacked using hooks. A common application is to trade compute for memory by saving those intermediary results to disk or to CPU instead of leaving them on the GPU. This is especially useful if you notice your model fits on GPU during evaluation, but not training. Also see `Docs > Autograd mechanics > `<a href="https://pytorch.org/docs/stable/notes/autograd.html#saved-tensors-hooks-doc" style="text-decoration:none;">Hooks for saved tensors</a>.

<div class="alert alert-block alert-info">

<font size=4 color=black>`CLASS` <a href="https://pytorch.org/docs/stable/autograd.html#saved-tensors-default-hooks" style="text-decoration:none;">torch.autograd.graph<b>.saved_tensors_hooks</b></a>(<font color=gray size=3>pack_hook, unpack_hook </font>)</font>
</div>

**Context-manager** that sets a pair of `pack / unpack hooks` for saved tensors.

Use this context-manager to define how intermediary results of an operation should be packed before saving, and unpacked on retrieval.

In that context, the `pack_hook` function will be called every time an operation saves a tensor for backward (<font color=maroon>this includes intermediary results saved using `save_for_backward()` but also those recorded by a PyTorch-defined operation</font>). <font color=maroon size=4>The output of `pack_hook` is then stored in the <b>computation graph</b> instead of the original tensor.</font>

The `unpack_hook` is called when the saved tensor needs to be accessed, namely when executing <a href="https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html#torch.Tensor.backward" style="text-decoration:none;"><font color=maroon>torch.Tensor.backward()</font></a> or <a href="https://pytorch.org/docs/stable/generated/torch.autograd.grad.html#torch.autograd.grad" style="text-decoration:none;"><font color=maroon>torch.autograd.grad()</font></a>. It takes as argument the packed object returned by `pack_hook` and should return a tensor which has the same content as the original tensor (passed as input to the corresponding `pack_hook`).

The hooks should have the following signatures:

    pack_hook(tensor: Tensor) -> Any

    unpack_hook(Any) -> Tensor

where the return value of `pack_hook` is a valid input to `unpack_hook`.

In general, you want `unpack_hook(pack_hook(t))` to be equal to `t` in terms of value, size, dtype and device.

Example:

In [5]:
def pack_hook(x):
    print("Packing", x)
    return x

def unpack_hook(x):
    print("Unpacking", x)
    return x

a = torch.ones(5, requires_grad=True)
b = torch.ones(5, requires_grad=True) * 2

with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = a * b

Packing tensor([1., 1., 1., 1., 1.], requires_grad=True)
Packing tensor([2., 2., 2., 2., 2.], grad_fn=<MulBackward0>)


In [6]:
y.sum().backward()

Unpacking tensor([1., 1., 1., 1., 1.], requires_grad=True)
Unpacking tensor([2., 2., 2., 2., 2.], grad_fn=<MulBackward0>)


<div class="alert alert-block alert-danger">

<font size=3 color=red><b>WARNING: </b></font>

<font color=black>Performing an in-place operation on the input to either hook may lead to undefined behavior.</font>

</div>

<div class="alert alert-block alert-danger">

<font size=3 color=red><b>WARNING: </b></font>

<font color=black>Only one pair of hooks is allowed at a time. When recursively nesting this context-manager, only the inner-most pair of hooks will be applied.</font>

</div>
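The identity hooks in the example above only print. As a sketch of the "trade compute for memory" use case, the pair below stores saved tensors on the CPU and moves them back on demand. The hook names and the `calls` counter are hypothetical helpers for illustration; on a CPU-only machine `.to("cpu")` makes no copy, so the snippet still runs but saves nothing.

```python
import torch

calls = {"pack": 0, "unpack": 0}

def pack_to_cpu(t):
    # Remember the original device and store the tensor on the CPU
    calls["pack"] += 1
    return t.device, t.to("cpu")

def unpack_from_cpu(packed):
    # Move the stored tensor back to where the original one lived
    calls["unpack"] += 1
    device, t = packed
    return t.to(device)

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(5, requires_grad=True, device=device)
b = torch.randn(5, requires_grad=True, device=device)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (a * b).sum()

y.backward()

# d(sum(a * b)) / da = b and d(sum(a * b)) / db = a
print(torch.allclose(a.grad, b.detach()), torch.allclose(b.grad, a.detach()))
print(calls)
```

The round trip satisfies the requirement stated earlier: the unpacked tensor has the same value, size, dtype and device as the one that was packed.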

<br>

<div class="alert alert-block alert-info">

<font size=4 color=black>`CLASS` <a href="https://pytorch.org/docs/stable/autograd.html#saved-tensors-default-hooks" style="text-decoration:none;">torch.autograd.graph<b>.save_on_cpu</b></a>(<font color=gray size=3>pin_memory=False</font>)</font>
</div>

Context-manager under which tensors saved by the forward pass will be stored on cpu, then retrieved for backward.

When performing operations within this context manager, intermediary results saved in the graph during the forward pass will be moved to CPU, then copied back to the original device when needed for the backward pass. If the graph was already on CPU, no tensor copy is performed.

<font color=maroon size=4>Use this context-manager to trade compute for GPU memory usage (e.g. when your model doesn’t fit in GPU memory during training).</font>

**Parameters**
* **pin_memory** (bool) – If `True` tensors will be saved to CPU pinned memory during packing and copied to GPU asynchronously during unpacking. Defaults to `False`. Also see Docs > CUDA semantics > <a href="https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-pinning" style="text-decoration:none;"><font color=maroon>Use pinned memory buffers</font></a>.

Example:

In [7]:
a = torch.randn(5, requires_grad=True, device="cuda")
b = torch.randn(5, requires_grad=True, device="cuda")
c = torch.randn(5, requires_grad=True, device="cuda")

def f(a, b, c):
    prod_1 = a * b            # a and b are saved on GPU
    with torch.autograd.graph.save_on_cpu():
        prod_2 = prod_1 * c   # prod_1 and c are saved on CPU
    y = prod_2 * a            # prod_2 and a are saved on GPU
    return y


y = f(a, b, c)
del a, b, c                   # for illustration only

# the content of a, b, and prod_2 are still alive on GPU
# the content of prod_1 and c only live on CPU


y.sum().backward()  # all CPU tensors are moved back to GPU, for backward
# all intermediary tensors are released (deleted) after the call to backward

In [8]:
y

tensor([ 1.0208,  0.0058, -0.0550, -0.1145,  1.6716], device='cuda:0',
       grad_fn=<MulBackward0>)

In [10]:
a

NameError: name 'a' is not defined

In [11]:
b

NameError: name 'b' is not defined

In [12]:
c

NameError: name 'c' is not defined

<br>
<br>
<br>

<font color=gray size=3>Tutorials > </font>

# <font style="font-size:120%;color:maroon;font-weight:bold">Forward-mode Automatic Differentiation (Beta)</font> <a href="https://pytorch.org/tutorials/intermediate/forward_ad_usage.html" style="text-decoration:none;"><font size=2>[link]</font></a>

This tutorial demonstrates how to use <font color=maroon><b>forward-mode AD</b></font> to compute <font color=maroon>directional derivatives</font> (or equivalently, <font color=maroon>Jacobian-vector products</font>).

The tutorial below uses some APIs only available in versions >= 1.11 (or nightly builds).

<font color=maroon><b>Also note that</b></font> forward-mode AD is currently in beta. The API is subject to change and operator coverage is still incomplete.

<br>

## Basic Usage

Unlike reverse-mode AD, forward-mode AD computes gradients eagerly alongside the forward pass. We can use forward-mode AD to compute a directional derivative by performing the forward pass as before, except we first associate our input with another tensor representing the direction of the directional derivative (or equivalently, the `v` in a Jacobian-vector product). <font color=maroon>When an input, which we call `“primal”`, is associated with a `“direction” tensor`, which we call `“tangent”`, the resultant new tensor object is called a **`“dual tensor”`** for its connection to <a href="https://en.wikipedia.org/wiki/Dual_number" style="text-decoration:none;"><b>dual numbers</b></a></font>[0].

<font size=4 color=maroon>As the forward pass is performed, if any input tensors are dual tensors, extra computation is performed to propagate this <b>“sensitivity”</b> of the function.</font>

In [2]:
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(10, 10)
tangent = torch.randn(10, 10)

def fn(x, y):
    return x ** 2 + y ** 2


'''All forward AD computation must be performed in the context of a ``dual_level`` context. '''
# All dual tensors created in such a context will have their tangents destroyed upon exit. 
# This is to ensure that if the output or intermediate results of this computation are reused
# in a future forward AD computation, their tangents (which are associated
# with this computation) won't be confused with tangents from the later computation.
with fwAD.dual_level():
    # To create a dual tensor we associate a tensor, which we call the ``primal`` 
    # with another tensor of the same size, which we call the ``tangent``.
    #
    # If the layout of the tangent is different from that of the primal,
    # The values of the tangent are copied into a new tensor with the same
    # metadata as the primal. Otherwise, the tangent itself is used as-is.
    ''' It is also important to note that the dual tensor created by ``make_dual`` 
        is a view of the primal.
    '''
    dual_input = fwAD.make_dual(primal, tangent)
    assert fwAD.unpack_dual(dual_input).tangent is tangent

    
    # To demonstrate the case where the copy of the tangent happens,
    # we pass in a tangent with a layout different from that of the primal
    dual_input_alt = fwAD.make_dual(primal, tangent.T)
    assert fwAD.unpack_dual(dual_input_alt).tangent is not tangent

    
    # Tensors that do not have an associated tangent are automatically
    # considered to have a zero-filled tangent of the same shape.
    plain_tensor = torch.randn(10, 10)
    dual_output = fn(dual_input, plain_tensor)

    # Unpacking the dual returns a namedtuple with ``primal`` and ``tangent`` as attributes
    jvp = fwAD.unpack_dual(dual_output).tangent

assert fwAD.unpack_dual(dual_output).tangent is None

AttributeError: 'tuple' object has no attribute 'tangent'
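The following condensed sketch checks the Jacobian-vector product against the closed form for this particular `fn`. It assumes PyTorch >= 1.11, where `unpack_dual` returns a namedtuple with `primal` and `tangent` attributes, as the comment in the cell above describes.

```python
import torch
import torch.autograd.forward_ad as fwAD

def fn(x, y):
    return x ** 2 + y ** 2

primal = torch.randn(10, 10)
tangent = torch.randn(10, 10)
plain_tensor = torch.randn(10, 10)

with fwAD.dual_level():
    dual_input = fwAD.make_dual(primal, tangent)
    dual_output = fn(dual_input, plain_tensor)
    # Read the tangent while still inside the dual level
    jvp = fwAD.unpack_dual(dual_output).tangent

# Only x carries a tangent, so the directional derivative is 2 * x * tangent
expected = 2 * primal * tangent
print(torch.allclose(jvp, expected))

# After the context exits, the tangent attached to the output is gone
tangent_after_exit = fwAD.unpack_dual(dual_output).tangent
print(tangent_after_exit is None)
```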

<br>
<br>

## Usage with Modules

To use **`nn.Module`** with `forward AD`, replace the parameters of your model with `dual tensors` before performing the `forward pass`. At the time of writing, it is not possible to create dual tensor `nn.Parameter`s. As a workaround, one must register the dual tensor as a non-parameter attribute of the module.

```python
import torch.nn as nn

model = nn.Linear(5, 5)
input = torch.randn(16, 5)

params = {name: p for name, p in model.named_parameters()}
tangents = {name: torch.rand_like(p) for name, p in params.items()}

with fwAD.dual_level():
    for name, p in params.items():
        delattr(model, name)
        setattr(model, name, fwAD.make_dual(p, tangents[name]))

    out = model(input)
    jvp = fwAD.unpack_dual(out).tangent
```
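As a sanity check on the attribute-swapping workaround, the result for `nn.Linear` can be compared with the closed-form directional derivative. This is a sketch: the name `inputs` and the tolerance are choices made here, not part of the tutorial.

```python
import torch
import torch.nn as nn
import torch.autograd.forward_ad as fwAD

model = nn.Linear(5, 5)
inputs = torch.randn(16, 5)

params = {name: p for name, p in model.named_parameters()}
tangents = {name: torch.rand_like(p) for name, p in params.items()}

with fwAD.dual_level():
    for name, p in params.items():
        # Swap each nn.Parameter for a plain dual-tensor attribute
        delattr(model, name)
        setattr(model, name, fwAD.make_dual(p, tangents[name]))
    out = model(inputs)
    jvp = fwAD.unpack_dual(out).tangent

# For out = x @ W.T + b, perturbing W by tW and b by tb
# changes out by x @ tW.T + tb
expected = inputs @ tangents["weight"].T + tangents["bias"]
print(torch.allclose(jvp, expected, atol=1e-5))
```

Note that after this loop the module no longer has registered parameters, which is why the stateless variant in the next section starts from a fresh module.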

<br>
<br>

## Using Modules stateless API (experimental)

Another way to use **`nn.Module`** with forward AD is to utilize the stateless API. 

NB: At the time of writing the stateless API is still experimental and may be subject to change.

```python
from torch.nn.utils._stateless import functional_call

# We need a fresh module because the functional call requires
# the model to have parameters registered.
model = nn.Linear(5, 5)

dual_params = {}
with fwAD.dual_level():
    for name, p in params.items():
        # Using the same ``tangents`` from the above section
        dual_params[name] = fwAD.make_dual(p, tangents[name])
    out = functional_call(model, dual_params, input)
    jvp2 = fwAD.unpack_dual(out).tangent

# Check our results
assert torch.allclose(jvp, jvp2)
```
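In more recent releases the same idea is exposed publicly as `torch.func.functional_call`. The sketch below assumes PyTorch >= 2.0 and is not taken from the original notes; if your installation is older, the private import shown above is the one that applies.

```python
import torch
import torch.nn as nn
import torch.autograd.forward_ad as fwAD
from torch.func import functional_call  # assumption: PyTorch >= 2.0

model = nn.Linear(5, 5)
inputs = torch.randn(16, 5)
params = dict(model.named_parameters())
tangents = {name: torch.rand_like(p) for name, p in params.items()}

with fwAD.dual_level():
    dual_params = {
        name: fwAD.make_dual(p, tangents[name]) for name, p in params.items()
    }
    # The dual tensors stand in for the parameters only during this call
    out = functional_call(model, dual_params, (inputs,))
    jvp2 = fwAD.unpack_dual(out).tangent

expected = inputs @ tangents["weight"].T + tangents["bias"]
print(torch.allclose(jvp2, expected, atol=1e-5))

# The module itself is untouched and still owns real parameters
print(isinstance(model.weight, nn.Parameter))
```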

<br>
<br>

## Custom autograd Function

Custom Functions also support `forward-mode AD`. <font size=3 color=maroon>To create a custom Function supporting forward-mode AD, register the `jvp()` static method. It is possible, but not mandatory, for custom Functions to support both forward and backward AD.</font> See the documentation: <a href="https://pytorch.org/docs/master/notes/extending.html#forward-mode-ad" style="text-decoration:none;"><b>Docs > Extending PyTorch</b></a> `(Docs > PyTorch documentation > Notes: Extending PyTorch)` for more information.

```python
class Fn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, foo):
        result = torch.exp(foo)
        # Tensors stored in ctx can be used in the subsequent forward grad
        # computation.
        ctx.result = result
        return result

    @staticmethod
    def jvp(ctx, gI):
        gO = gI * ctx.result
        # If the tensor stored in ctx will not also be used in the backward pass,
        # one can manually free it using ``del``
        del ctx.result
        return gO

fn = Fn.apply

primal = torch.randn(10, 10, dtype=torch.double, requires_grad=True)
tangent = torch.randn(10, 10)

with fwAD.dual_level():
    dual_input = fwAD.make_dual(primal, tangent)
    dual_output = fn(dual_input)
    jvp = fwAD.unpack_dual(dual_output).tangent

# It is important to use ``autograd.gradcheck`` to verify that your
# custom autograd Function computes the gradients correctly. By default,
# gradcheck only checks the backward-mode (reverse-mode) AD gradients. Specify
# ``check_forward_ad=True`` to also check forward grads. If you did not
# implement the backward formula for your function, you can also tell gradcheck
# to skip the tests that require backward-mode AD by specifying
# ``check_backward_ad=False``, ``check_undefined_grad=False``, and
# ``check_batched_grad=False``.
torch.autograd.gradcheck(Fn.apply, (primal,), check_forward_ad=True,
                         check_backward_ad=False, check_undefined_grad=False,
                         check_batched_grad=False)
```
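The example above implements only `jvp()`. For the case where a Function supports both modes, a possible sketch is shown below. `Square` is a hypothetical Function written for this note; it uses `ctx.save_for_backward` and `ctx.save_for_forward` so that the input is available to `backward()` and `jvp()` respectively.

```python
import torch
import torch.autograd.forward_ad as fwAD

class Square(torch.autograd.Function):
    """Hypothetical example Function computing x * x."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # made available to backward()
        ctx.save_for_forward(x)   # made available to jvp()
        return x * x

    @staticmethod
    def jvp(ctx, x_t):
        (x,) = ctx.saved_tensors
        return 2 * x * x_t

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.randn(4, dtype=torch.double, requires_grad=True)
t = torch.randn(4, dtype=torch.double)

# Forward mode: Jacobian-vector product
with fwAD.dual_level():
    out = Square.apply(fwAD.make_dual(x, t))
    jvp = fwAD.unpack_dual(out).tangent

# Reverse mode: gradient of the summed output
(grad,) = torch.autograd.grad(Square.apply(x).sum(), x)

# With both formulas implemented, gradcheck can verify both modes at once
ok = torch.autograd.gradcheck(Square.apply, (x,), check_forward_ad=True)
print(ok)
```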

<br>
<br>

[0] <a href="https://en.wikipedia.org/wiki/Dual_number" style="text-decoration:none;">https://en.wikipedia.org/wiki/Dual_number</a>

<br>
<br>
<br>

<font color=gray size=3>Docs > </font>

# <font style="font-size:120%;color:maroon;font-weight:bold">Inference Mode</font> <a href="https://pytorch.org/cppdocs/notes/inference_mode.html" style="text-decoration:none;"><font size=2>[link]</font></a>

`c10::InferenceMode` is a new RAII guard analogous to **`NoGradMode`** to be used when you are certain your operations will have no interactions with <font color=maroon>autograd (e.g. model training)</font>. Compared to `NoGradMode`, code run under this mode gets better performance by disabling autograd-related work like view tracking and version counter bumps. However, tensors created inside `c10::InferenceMode` also have more limitations when interacting with the autograd system.<br><br>

**`InferenceMode`** can be enabled for a given block of code. Inside `InferenceMode` all newly allocated (non-view) tensors are marked as <font size=3 color=blue>inference tensors</font>. Inference tensors:

* do not have a <font size=3 color=maroon>version counter</font> so an error will be raised if you try to read their version (e.g., because you saved this tensor for backward).


* are <font size=3 color=maroon>immutable</font> outside `InferenceMode`. So an error will be raised if you try to: 
    * \- mutate their data outside InferenceMode. 
    * \- mutate them into `requires_grad=True` outside InferenceMode. 
<br>
To work around this, you can make a clone outside `InferenceMode` to get a normal tensor before mutating.<br><br>

<font color=maroon>A `non-view tensor` is an inference tensor if and only if it was allocated inside `InferenceMode`. <br>A `view tensor` is an inference tensor if and only if the tensor it is a view of is an inference tensor.</font><br><br>

<font size=3><font color=maroon>Inside an `InferenceMode block`, we make the following performance guarantees:</font>

* Like `NoGradMode`, all operations do not record `grad_fn` even if their inputs have `requires_grad=True`. This applies to both inference tensors and normal tensors.

    
* View operations on inference tensors do not do view tracking. View and non-view inference tensors are indistinguishable.

    
* Inplace operations on inference tensors are guaranteed not to do a version bump.

For more implementation details of `InferenceMode` please see the <a href="https://github.com/pytorch/rfcs/pull/17" style="text-decoration:none;"><font color=maroon size=2>RFC-0011-InferenceMode</font></a>.</font>

<br>
<br>

## Migration guide from `AutoNonVariableTypeMode`

In production use of PyTorch for inference workloads, we have seen a proliferation of uses of the C++ guard `AutoNonVariableTypeMode` (now `AutoDispatchBelowADInplaceOrView`), which disables autograd, view tracking and version counter bumps. 

Unfortunately, the current colloquial usage of this guard for inference workloads is unsafe: it’s possible to use `AutoNonVariableTypeMode` to bypass PyTorch’s safety checks and get silently wrong results. For example, PyTorch throws an error when tensors saved for backward are subsequently mutated, but a mutation that happens inside `AutoNonVariableTypeMode` silently bypasses the check and returns wrong gradients to users.

When current users of `AutoNonVariableTypeMode` think about migrating, the following steps might help them decide on the best alternative: (remainder omitted)

<br>

<br>
<br>
<br>