In [1]:
%matplotlib inline

# Automatic Differentiation with ``torch.autograd`` [link](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html)

In [2]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # requires_grad=True: a tensor whose gradient we will compute
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)



In [3]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7f5beeae3670>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7f5beeae3880>
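As a small illustration of what `grad_fn` tells us, the sketch below (reusing the same shapes as above, with freshly sampled random values) checks which tensors are leaves of the graph. Tensors created directly by the user are leaves and have no `grad_fn`; tensors produced by an operation on something that requires gradients carry one.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

# w was created by the user: it is a leaf and has no grad_fn
print(w.is_leaf, w.grad_fn)
# z is the result of an operation: not a leaf, and it has a grad_fn
print(z.is_leaf, z.grad_fn)
# x does not require gradients; by convention it is also a leaf
print(x.is_leaf, x.requires_grad)
```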


## Computing Gradients


In [4]:
loss.backward()  # compute the gradients of loss with respect to w and b
print(w.grad)  # w.grad = d(loss) / d(w)
print(b.grad)

tensor([[0.0261, 0.0560, 0.0428],
        [0.0261, 0.0560, 0.0428],
        [0.0261, 0.0560, 0.0428],
        [0.0261, 0.0560, 0.0428],
        [0.0261, 0.0560, 0.0428]])
tensor([0.0261, 0.0560, 0.0428])
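All rows of `w.grad` are identical because every entry of `x` is 1. To make this concrete, here is a hedged sketch that recomputes the gradients by hand, assuming the default `reduction="mean"` of `binary_cross_entropy_with_logits`, for which d(loss)/dz = (sigmoid(z) - y) / N. The seed and the names `dz`, `manual_w_grad`, and `manual_b_grad` are only for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Chain rule by hand: d(loss)/dz = (sigmoid(z) - y) / N for the mean reduction
dz = (torch.sigmoid(z.detach()) - y) / y.numel()
manual_w_grad = torch.outer(x, dz)  # d(loss)/dw[i, j] = x[i] * dz[j]
manual_b_grad = dz                  # d(loss)/db[j] = dz[j]

print(torch.allclose(w.grad, manual_w_grad))
print(torch.allclose(b.grad, manual_b_grad))
```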


## Doing optimization by hand

In [9]:
# This does not work: an in-place update on a leaf tensor that requires grad raises an error
learning_rate = 1e-4
w -= learning_rate * w.grad
b -= learning_rate * b.grad

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
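One common way around this error is to perform the update inside `torch.no_grad()`, so autograd does not track the in-place operation. The following is a minimal sketch of a single hand-written gradient-descent step under that assumption; `w_before` and `w_grad` are helper names introduced here only to check the result.

```python
import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

learning_rate = 1e-4
w_before = w.detach().clone()
w_grad = w.grad.clone()

# Inside no_grad() the in-place update is not recorded by autograd
with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad

# Gradients accumulate across backward() calls, so reset them before the next step
w.grad.zero_()
b.grad.zero_()

print(w.requires_grad, w.is_leaf)
```

After the update, `w` and `b` are still leaf tensors that require gradients, so the next forward pass builds a fresh graph on top of them.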

## Disabling Gradient Tracking


In [5]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


Another way to achieve the same result is to use the ``detach()`` method
on the tensor:




In [6]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False
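A detail worth keeping in mind: `detach()` does not copy the data. The sketch below checks that the detached tensor has no `grad_fn` and that it shares the same underlying memory as the original (compared here via `data_ptr()`).

```python
import torch

w = torch.randn(5, 3, requires_grad=True)
z = torch.matmul(torch.ones(5), w)
z_det = z.detach()

# The detached tensor is cut off from the graph
print(z_det.requires_grad, z_det.grad_fn)
# It still points at the same memory as z, so no copy was made
print(z_det.data_ptr() == z.data_ptr())
```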


There are reasons you might want to disable gradient tracking:
  - To mark some parameters in your neural network as **frozen parameters**.
  - To **speed up computations** when you only need the forward pass, because computations on tensors that do
    not track gradients are more efficient.
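As an illustration of the first point, here is a small sketch that freezes one parameter of a layer with `requires_grad_(False)`. The choice of `torch.nn.Linear(5, 3)` and of freezing only the weight is an assumption made for the example.

```python
import torch

layer = torch.nn.Linear(5, 3)
layer.weight.requires_grad_(False)  # freeze the weight, keep the bias trainable

out = layer(torch.ones(5)).sum()
out.backward()

print(layer.weight.grad)  # frozen: no gradient is computed, stays None
print(layer.bias.grad)    # trainable: d(sum)/d(bias) is a vector of ones
```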



## More on Computational Graphs
Conceptually, autograd keeps a record of data (tensors) and all executed
operations (along with the resulting new tensors) in a dynamic directed acyclic
graph (dynamic DAG) consisting of
[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
objects. In this DAG, leaves are the input tensors and roots are the output
tensors. By tracing this graph from roots to leaves, you can
automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

- run the requested operation to compute a resulting tensor
- maintain the operation’s *gradient function* in the DAG.

The backward pass kicks off when ``.backward()`` is called on the DAG
root. ``autograd`` then:

- computes the gradients from each ``.grad_fn``,
- accumulates them in the respective tensor’s ``.grad`` attribute
- using the chain rule, propagates all the way to the leaf tensors.

<div class="alert alert-info"><h4>Note</h4><p>**DAGs are dynamic in PyTorch**
  An important thing to note is that the graph is recreated from scratch; after each
  ``.backward()`` call, autograd starts populating a new graph. This is
  exactly what allows you to use control flow statements in your model;
  you can change the shape, size and operations at every iteration if
  needed.</p></div>
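To see the dynamic graph in action, consider the toy sketch below. The function `forward` is hypothetical and not part of PyTorch: it doubles its intermediate result a data-dependent number of times, so each call records a different graph and `backward()` follows whichever path was actually taken.

```python
import torch

def forward(x, w):
    # Hypothetical toy model: the loop length depends on the data
    h = x * w
    steps = 0
    while h.norm() < 10 and steps < 20:
        h = h * 2
        steps += 1
    return h.sum(), steps

w = torch.tensor([1.0, 2.0], requires_grad=True)
results = []
for x in (torch.tensor([1.0, 1.0]), torch.tensor([3.0, 3.0])):
    w.grad = None               # start each iteration without accumulated gradients
    out, steps = forward(x, w)  # a new graph is recorded on every call
    out.backward()
    results.append((steps, w.grad.clone()))
    print(steps, w.grad)
```

The first input takes three doublings and the second only one, so the gradients are `x * 2**3 = [8, 8]` and `x * 2**1 = [6, 6]` respectively.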



## Optional Reading: Tensor Gradients and Jacobian Products


In [7]:
# Note: we are not optimizing here, so inp is never updated (it stays the same)
inp = torch.eye(4, 5, requires_grad=True)  # 4x5 matrix with ones on the main diagonal
out = (inp+1).pow(2).t()
v = 3*torch.ones_like(out)

# create_graph=True also records a graph of the backward pass and keeps the
# forward graph alive, so backward() can be called on out again
out.backward(gradient=v, create_graph=True)
print(f"First call\n{inp.grad}")
print(f'input: \n{inp}')

out.backward(gradient=v, create_graph=True)
print(f"\nSecond call\n{inp.grad}")

inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[12.,  6.,  6.,  6.,  6.],
        [ 6., 12.,  6.,  6.,  6.],
        [ 6.,  6., 12.,  6.,  6.],
        [ 6.,  6.,  6., 12.,  6.]], grad_fn=<CopyBackwards>)
input: 
tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.]], requires_grad=True)

Second call
tensor([[24., 12., 12., 12., 12.],
        [12., 24., 12., 12., 12.],
        [12., 12., 24., 12., 12.],
        [12., 12., 12., 24., 12.]], grad_fn=<AddBackward0>)

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]], grad_fn=<ZeroBackward0>)




## Explanation of the Above: `torch.Tensor.backward()` [link](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html)
* `x` (`inp` in the code): a matrix of shape [4, 5] with ones on the main diagonal
* $y = {((x+1)^2})^T$, of shape [5, 4] (`out` in the code)
```python
v = 3 * torch.ones_like(out)
out.backward(gradient=v, create_graph=True)  # backward() returns None and fills inp.grad
```
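$
\begin{align}
 \mathrm{grad}(x) = v^T \odot \frac{\partial y}{\partial x}, \qquad \frac{\partial y}{\partial x} = 2(x+1)
\end{align}
$

 * Note: $\odot$ is the element-wise product. Because $y$ is an element-wise function of $x$ (followed by a transpose), the vector-Jacobian product reduces to it. With every entry of $v$ equal to 3, as in the cell above, this gives $3 \cdot 2 \cdot 2 = 12$ on the diagonal and $3 \cdot 2 \cdot 1 = 6$ elsewhere, matching the first printed gradient.
 * `create_graph=True`: builds a graph of the derivative computation, which allows higher-order derivatives. It also keeps the forward graph, so `backward()` can be called again.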
 -------------------
 
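 * Calling `backward()` a second time:
 ```python
out.backward(gradient=v, create_graph=True)
 ```
 
 $
\begin{align}
 \mathrm{grad}(x) = \mathrm{grad}(x)_{previous} + v^T \odot \frac{\partial y}{\partial x}
\end{align}
$

 * Gradients are accumulated (added) in `.grad`, which is why the values double from 12 and 6 to 24 and 12. Call `inp.grad.zero_()` first to get a fresh gradient.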
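The two formulas above can be checked numerically. The sketch below is an illustration rather than a replay of the original cell: it uses `retain_graph=True` instead of `create_graph=True`, which is enough to call `backward()` twice when no higher-order derivatives are needed. The names `expected`, `first`, and `second` are introduced only for this check.

```python
import torch

inp = torch.eye(4, 5, requires_grad=True)
out = (inp + 1).pow(2).t()
v = 3 * torch.ones_like(out)

# Element-wise formula: grad(x) = v^T * 2 * (x + 1)
expected = v.t() * 2 * (inp.detach() + 1)

out.backward(gradient=v, retain_graph=True)  # keep the graph for a second call
first = inp.grad.clone()

out.backward(gradient=v)                     # accumulates into inp.grad
second = inp.grad.clone()

print(torch.equal(first, expected))
print(torch.equal(second, 2 * expected))
```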

<div class="alert alert-info"><h4>Note</h4><p>Previously we were calling ``backward()`` function without
          parameters. This is essentially equivalent to calling
          ``backward(torch.tensor(1.0))``, which is a useful way to compute the
          gradients in case of a scalar-valued function, such as loss during
          neural network training.</p></div>
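The equivalence described in the note can be verified on a small scalar-valued function. In this sketch the function `(w ** 2).sum()` is an arbitrary choice for illustration; its gradient is `2 * w`.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# backward() without arguments on a scalar
loss = (w ** 2).sum()
loss.backward()
grad_implicit = w.grad.clone()

# backward() with an explicit gradient of 1.0
w.grad = None
loss = (w ** 2).sum()
loss.backward(torch.tensor(1.0))
grad_explicit = w.grad.clone()

print(torch.equal(grad_implicit, grad_explicit))
```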




--------------




### Further Reading
- [Autograd Mechanics](https://pytorch.org/docs/stable/notes/autograd.html)

