## TLDR:
- Tensors (e.g. parameters) can be differentiated if they are created with `requires_grad=True` or marked with `.requires_grad_()`
- Expressions form graphs, which:
  - Handle forward passes, calculating the values of each node
  - Handle backward passes, calculating the derivatives of each node
- Gradients can be generated using `.backward()` and are accumulated in `.grad`
- Gradients can be disabled to freeze parameters or improve performance if the backward pass isn't needed 

## A Glance at Automatic Differentiation

In [13]:
import torch

# Parameters can be represented using tensors. A tensor can
# be differentiated if it is created with `requires_grad=True`
# in the constructor (or if `.requires_grad_()` is called on
# it after creation)
w = torch.randn(5, 3, requires_grad=True) # Weights
b = torch.randn(3)                        # Bias
y = torch.zeros(3)                        # y-actual
x = torch.ones(5)
b.requires_grad_()

# Furthermore, when put inside an expression, the parameters
# and operators are pieced together into a "computational
# graph":
# - Each operation applied to the tensors is an object of
#   class `Function`, which knows:
#   - How to compute the value of the function in the
#     *forward* direction (`.forward()`; do not call
#     directly!)
#   - How to compute the derivative of the function in the
#     *backward* propagation step (`.backward()`)
# - A reference to the backward function is stored in the
#   `.grad_fn` property of each resulting tensor
y_pred = torch.matmul(x, w) + b           # y-predicted
loss = torch.nn.functional.binary_cross_entropy_with_logits(y_pred, y)
print(f"Gradient function of `y_pred`: {y_pred.grad_fn}")
print(f"Gradient function of `loss`: {loss.grad_fn}")
print()

Gradient function of `y_pred`: <AddBackward0 object at 0x00000223F9B13E80>
Gradient function of `loss`: <BinaryCrossEntropyWithLogitsBackward0 object at 0x00000223954417B0>



Parameters can be linked together and form a computational graph, which handles:
- Evaluating the expression in the forward direction: computing the values of each node
- Evaluating the expression in the backward direction: computing the derivatives of each node
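To make the two passes concrete, here is a minimal pure-Python sketch of a scalar computational graph. The `Value` class and its fields are hypothetical names made up for illustration; this is not how PyTorch implements its graph, only a toy version of the same idea (forward values stored on nodes, derivatives pushed backward with the chain rule and accumulated on the leaves).

```python
class Value:
    """A scalar node in a tiny computational graph (hypothetical helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                # forward value of this node
        self.grad = 0.0                 # accumulated derivative, like `.grad`
        self.parents = parents          # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Post-order walk: every node is listed after all of its parents
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)

        # Walk from the root towards the leaves, applying the chain rule
        upstream = {id(self): 1.0}  # d(root)/d(root) = 1
        for node in reversed(order):
            g = upstream.get(id(node), 0.0)
            if not node.parents:
                node.grad += g  # only leaves accumulate a gradient
            for parent, local in zip(node.parents, node.local_grads):
                upstream[id(parent)] = upstream.get(id(parent), 0.0) + local * g


# Forward pass: the value of each node is computed as the graph is built
w, x, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b
print(y.data)

# Backward pass: derivatives of `y` with respect to each leaf
y.backward()
print(w.grad, x.grad, b.grad)
```

Here `y = w * x + b`, so the derivative with respect to `w` is the value of `x`, the derivative with respect to `x` is the value of `w`, and the derivative with respect to `b` is 1.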

## Using Automatic Differentiation

In [14]:
# To compute and use derivatives:
# - First, `.backward()` can be called on the root, which
#   backpropagates through all nodes in its graph, computing
#   the gradients for each one along the way
# - Then, the gradients are accumulated in the `.grad`
#   property of each leaf tensor that has
#   `requires_grad=True` (here, `w` and `b`)
#   - If `.backward()` is called multiple times on the same
#     graph (which needs `retain_graph=True` on every call
#     but the last), the `.grad` property will contain the
#     sum of the gradients
#   - Gradients can be set to zero by calling `.grad.zero_()`
loss.backward(retain_graph=True)
print(f"Accumulated gradient of `w` after 1 call:\n{w.grad}")
print(f"Accumulated gradient of `b` after 1 call:\n{b.grad}")
print()

loss.backward()
print(f"Accumulated gradient of `w` after 2 calls:\n{w.grad}")
print(f"Accumulated gradient of `b` after 2 calls:\n{b.grad}")
print()

w.grad.zero_()
b.grad.zero_()
print(f"Accumulated gradient of `w` after zeroing:\n{w.grad}")
print(f"Accumulated gradient of `b` after zeroing:\n{b.grad}")
print()

Accumulated gradient of `w` after 1 call:
tensor([[0.2526, 0.1327, 0.0793],
        [0.2526, 0.1327, 0.0793],
        [0.2526, 0.1327, 0.0793],
        [0.2526, 0.1327, 0.0793],
        [0.2526, 0.1327, 0.0793]])
Accumulated gradient of `b` after 1 call:
tensor([0.2526, 0.1327, 0.0793])

Accumulated gradient of `w` after 2 calls:
tensor([[0.5052, 0.2654, 0.1586],
        [0.5052, 0.2654, 0.1586],
        [0.5052, 0.2654, 0.1586],
        [0.5052, 0.2654, 0.1586],
        [0.5052, 0.2654, 0.1586]])
Accumulated gradient of `b` after 2 calls:
tensor([0.5052, 0.2654, 0.1586])

Accumulated gradient of `w` after zeroing:
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
Accumulated gradient of `b` after zeroing:
tensor([0., 0., 0.])



Calling `.backward()` on a root differentiates it with respect to every parameter contributing to it and adds the resulting gradients to each parameter's `.grad` property
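As a sanity check, the gradients from `.backward()` can be compared against a hand-derived formula. The sketch below assumes the default `reduction="mean"` of `binary_cross_entropy_with_logits`, for which the derivative of the loss with respect to each logit is `(sigmoid(z) - y) / N`. The seed and the names `z`, `dz`, `expected_w` and `expected_b` are my own choices, not part of the original notebook.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b  # logits
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Hand-derived gradients (assuming the default mean reduction)
with torch.no_grad():
    dz = (torch.sigmoid(z) - y) / y.numel()  # d(loss)/d(z_j)
    expected_b = dz                          # d(z_j)/d(b_j) = 1
    expected_w = torch.outer(x, dz)          # d(z_j)/d(w_ij) = x_i

print(torch.allclose(b.grad, expected_b))
print(torch.allclose(w.grad, expected_w))
```

Because `x` is all ones, every row of `expected_w` equals `dz`, which matches the printed output above, where each row of `w.grad` is identical to `b.grad`.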

## Disabling Automatic Differentiation

In [15]:
# Gradients can be permanently or temporarily disabled. This
# could be done to:
# - Mark some parameters as "frozen parameters"
# - Speed up computations when only the forward pass is needed,
#   since extra work is required to track these gradients
y_pred = torch.matmul(x, w) + b
print(f"`.requires_grad` before disabling gradients: {y_pred.requires_grad}")
print()

# Temporarily disabling gradient tracking inside a
# `torch.no_grad()` block
with torch.no_grad():
    y_pred = torch.matmul(x, w) + b
print(f"`.requires_grad` after disabling gradients with `.no_grad()`: {y_pred.requires_grad}")
print()

# Disabling gradients by getting a new tensor, detached from
# the graph, using `.detach()`
y_pred = torch.matmul(x, w) + b
y_pred = y_pred.detach()
print(f"`.requires_grad` after disabling gradients with `.detach()`: {y_pred.requires_grad}")
print()

`.requires_grad` before disabling gradients: True

`.requires_grad` after disabling gradients with `.no_grad()`: False

`.requires_grad` after disabling gradients with `.detach()`: False



Gradients can be disabled to freeze parameters or improve performance if only the forward pass is needed
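The cell above only inspects `.requires_grad` on the output. The following sketch shows what freezing looks like for a parameter, assuming the same toy setup as before: `w` is frozen with `.requires_grad_(False)` while `b` stays trainable. The variable name `out` is my own.

```python
import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Freeze `w` in place: autograd no longer tracks it
w.requires_grad_(False)

y_pred = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(y_pred, y)
loss.backward()

print(w.grad)              # the frozen tensor received no gradient
print(b.grad is not None)  # `b` is still trainable

# Forward-only computation: no graph is recorded inside the block
with torch.no_grad():
    out = torch.matmul(x, w) + b
print(out.requires_grad, out.grad_fn)
```

Freezing with `.requires_grad_(False)` lasts until the flag is switched back on, whereas `torch.no_grad()` only affects operations executed inside the `with` block.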

## Further reading
There is an optional reading section on tensor gradients and Jacobian products

Since I don't understand it yet, I'll leave it to my future self to take notes on it in my stead