## Automatic Differentiation with `torch.autograd`
- *Backpropagation* is the most frequently used algorithm for training Neural Networks.
- Parameters (model weights) are adjusted according to the gradient of the Loss Function with respect to each parameter.
- To calculate those gradients, PyTorch has a built-in automatic differentiation engine called `torch.autograd`.  
It supports automatic calculation of gradients for any calculation graph.
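
To make "adjusted according to the gradient" concrete, here is a minimal sketch of a single gradient descent step on a toy one-parameter model. The values and the `learning_rate` name are assumptions chosen for easy arithmetic, not part of the notebook; `backward()` and `torch.no_grad()` are covered in the sections that follow.

```python
import torch

# Toy one-parameter model (hypothetical values, chosen for easy arithmetic)
w = torch.tensor(1.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(6.0)
learning_rate = 0.1   # assumed step size

loss_before = (w * x - y) ** 2
loss_before.backward()            # fills w.grad with d(loss)/dw

with torch.no_grad():             # the update itself must not be tracked
    w -= learning_rate * w.grad   # step against the gradient
w.grad.zero_()                    # clear the gradient before the next step

loss_after = (w * x - y) ** 2
print(loss_before.item(), loss_after.item())
```

Here d(loss)/dw = 2 * (w*x - y) * x = -16, so the step moves `w` from 1.0 to 2.6 and the loss drops.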

In [1]:
import torch

# Input Tensor(x) and Expected Output(y)
x = torch.ones(5)
y = torch.zeros(3)

# Parameters
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Single-layered Neural Network and Loss Function
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

- In this Neural Network, `w` and `b` are the parameters to be optimized.
- Therefore, you must be able to calculate the gradients of the Loss Function with respect to those parameters. To enable this, set the `requires_grad` attribute of those Tensors to `True`.

In [2]:
# Print gradient functions for Neural Network and Loss
print(f"Gradient Function for z = {z.grad_fn}")
print(f"Gradient Function for loss = {loss.grad_fn}")

Gradient Function for z = <AddBackward0 object at 0x000002C007D34F10>
Gradient Function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x000002C0084E64D0>
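
As a small illustration of `requires_grad` and `grad_fn`, the sketch below switches tracking on after a Tensor has been created, using the in-place `requires_grad_()` method, and compares a user-created (leaf) Tensor with a computed one. The variable names mirror the cell above but the block is independent of it.

```python
import torch

w = torch.randn(5, 3)           # created without gradient tracking
w.requires_grad_(True)          # switch tracking on in place, after creation
x = torch.ones(5)
z = torch.matmul(x, w)

# A Tensor created directly by the user is a leaf and has no grad_fn
print(w.is_leaf, w.grad_fn)
# A Tensor produced by a tracked operation carries the function that built it
print(z.is_leaf, z.grad_fn is not None)
```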


### Calculating Gradient
- To optimize the parameters of a Neural Network, you must calculate the derivatives of the Loss Function with respect to those parameters.
- To do so, call `loss.backward()` and then read the values from `w.grad` and `b.grad`.

In [3]:
loss.backward()

print(w.grad)
print(b.grad)

tensor([[0.0034, 0.3219, 0.3229],
        [0.0034, 0.3219, 0.3229],
        [0.0034, 0.3219, 0.3229],
        [0.0034, 0.3219, 0.3229],
        [0.0034, 0.3219, 0.3229]])
tensor([0.0034, 0.3219, 0.3229])
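
One way to sanity-check these values is to compare them with a hand-derived gradient. The sketch below assumes the default mean reduction of `binary_cross_entropy_with_logits`, for which d(loss)/dz = (sigmoid(z) - y) / N, where N is the number of elements. The seed and the `manual_*` names are illustrative additions.

```python
import torch

torch.manual_seed(0)
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Hand-derived gradient for the default mean reduction:
# d(loss)/dz = (sigmoid(z) - y) / number_of_elements
with torch.no_grad():
    dz = (torch.sigmoid(z) - y) / y.numel()
    manual_b_grad = dz                   # dz/db is the identity
    manual_w_grad = torch.outer(x, dz)   # dz_j/dw_ij = x_i

print(torch.allclose(w.grad, manual_w_grad))
print(torch.allclose(b.grad, manual_b_grad))
```

Because `x` is all ones, every row of `w.grad` equals `b.grad`, which is why the rows in the output above are identical.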


### How to Stop Tracking the Gradient Change
- By default, all Tensors with `requires_grad=True` track their calculation history and support the calculation of gradients.  
But if you only need the forward pass, this tracking is unnecessary.
- You can stop tracking by wrapping the calculation code in a `torch.no_grad()` block.

In [4]:
z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)

True
False


- Calling the `detach()` method on a Tensor has the same effect.

In [5]:
z = torch.matmul(x, w) + b
z_det = z.detach()

print(z_det.requires_grad)

False
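
A short sketch of what `detach()` leaves behind: the original Tensor keeps tracking, while the detached one has no link into the graph and, as far as the underlying storage is concerned, is not a copy.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
z = torch.matmul(x, w)

z_det = z.detach()

print(z.requires_grad, z_det.requires_grad)   # original still tracks; the detached one does not
print(z_det.grad_fn)                          # no link back into the graph
print(z_det.data_ptr() == z.data_ptr())       # same underlying memory, no copy
```

Since the memory is shared, in-place changes to `z_det` are also visible through `z`.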


#### Here's why you might want to stop tracking gradients:
- To mark some of the parameters in a Neural Network as *frozen parameters*.
- To speed up calculation when only the forward pass is performed, since calculations on Tensors that do not track gradients are more efficient.
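
The first point can be sketched as follows. The two-layer `model` is a hypothetical example, not something defined in this notebook; the idea is only that parameters with `requires_grad` set to `False` receive no gradient.

```python
import torch

torch.manual_seed(0)
# Hypothetical two-layer model, used only for illustration
model = torch.nn.Sequential(
    torch.nn.Linear(5, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 3),
)

# Freeze the first layer: autograd will not compute gradients for it
for param in model[0].parameters():
    param.requires_grad_(False)

loss = model(torch.ones(1, 5)).sum()
loss.backward()

print(model[0].weight.grad)              # frozen: nothing was accumulated
print(model[2].weight.grad is not None)  # still trainable
```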

### Additional information about Calculation Graph
- Conceptually, `autograd` keeps a record of data (Tensors) and all executed operations (along with the resulting new Tensors)  
in a *Directed Acyclic Graph (DAG)* composed of `Function` objects.
- The leaves of the DAG are the Input Tensors, and the roots are the Output Tensors.
- Tracing this graph from the roots to the leaves lets you calculate gradients automatically according to the *Chain Rule*.
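
The graph can be inspected by following `grad_fn.next_functions` from the root towards the leaves. This is a rough sketch: `walk` is a helper written for this example, and the exact node names it prints may differ between PyTorch versions.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)   # leaf (input)
c = a * 3 + 1                               # root (output)

def walk(node, depth=0, names=None):
    """Collect node type names depth-first, starting from a grad_fn."""
    if names is None:
        names = []
    names.append("  " * depth + type(node).__name__)
    for parent, _ in node.next_functions:
        if parent is not None:   # None marks inputs that need no gradient
            walk(parent, depth + 1, names)
    return names

names = walk(c.grad_fn)
print("\n".join(names))
```

The last node reached is the one that accumulates the gradient into the leaf Tensor `a`.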

<br/>

- In the forward pass, `autograd` performs the following two tasks at the same time.
> - Calculates the Output Tensor by performing the requested operation.
> - Maintains the operation's gradient function in the DAG.

<br/>

- Backpropagation starts when the `.backward()` method is called on the root of the DAG.
- At this point, `autograd`:
> - Calculates the gradients from each `grad_fn`,
> - Accumulates them in the `.grad` attribute of the respective Tensor,
> - And propagates them all the way to the leaf Tensors using the Chain Rule.
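
A worked instance of the Chain Rule step, with values picked so the arithmetic is easy to follow by hand (the names `u` and `manual` are illustrative):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
u = 3 * a + 1        # inner function: u = 3a + 1, du/da = 3
y = u ** 2           # outer function: y = u^2,    dy/du = 2u

y.backward()

# Chain rule by hand: dy/da = dy/du * du/da = 2u * 3
manual = (2 * u.detach()) * 3
print(a.grad, manual)
```

With `a = 2`, `u = 7`, so both values are 2 * 7 * 3 = 42.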

### Optional Reading: Tensor Gradients and Jacobian Products

In [6]:
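# `inp` avoids shadowing Python's built-in input()
inp = torch.eye(4, 5, requires_grad=True)
out = (inp + 1).pow(2).t()

# `out` is not a scalar, so backward() needs a Tensor of the same shape as `out`
out.backward(torch.ones_like(out), retain_graph=True)

print(f"First Call:\n{inp.grad}")

# Gradients accumulate: a second call adds to the values already stored in .grad
out.backward(torch.ones_like(out), retain_graph=True)

print(f"Second Call:\n{inp.grad}")

# Reset the accumulated gradients before calling backward() again
inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)

print(f"Call after zeroing gradients:\n{inp.grad}")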

First Call:
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])
Second Call:
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])
Call after zeroing gradients:
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])
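
In the output above, the diagonal entries are 4 because the derivative of (x + 1)^2 is 2(x + 1) and the identity matrix holds 1 there, while the remaining entries are 2 because it holds 0. The second call doubles every value because gradients accumulate in `.grad` until they are zeroed.

For a non-scalar output, the Tensor passed to `backward()` plays the role of a vector v, and the result stored in `.grad` is the product of v with the Jacobian rather than the Jacobian itself. The sketch below checks this on a small example against `torch.autograd.functional.jacobian`; the values of `x` and `v` are arbitrary choices.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                       # Jacobian of y w.r.t. x is diag(2 * x)

v = torch.tensor([1.0, 0.1, 0.01])
y.backward(v)                    # vector-Jacobian product; J is never built

print(x.grad)                    # equals v * 2 * x

# Build the full Jacobian explicitly, only to compare
J = torch.autograd.functional.jacobian(lambda t: t ** 2, x.detach())
print(torch.allclose(x.grad, v @ J))
```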
