# 2.5. Automatic Differentiation

Reference
* [PyTorch Autograd Explained - In-depth Tutorial](https://www.youtube.com/watch?v=MswxJw-8PvE): a detailed walkthrough of how the computation graph is built and how backpropagation runs over it

`automatic differentiation`, `computational graph`

Here, `backpropagate`
simply means to trace through the computational graph, filling in the partial derivatives with respect
to each parameter.
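
To make "tracing through the computational graph" concrete, the following is a minimal sketch that walks the graph recorded for a small expression by following `grad_fn.next_functions`. The helper name `walk` is hypothetical and only for illustration; the exact node names printed can differ between PyTorch versions.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)

def walk(fn, depth=0):
    """Collect (depth, node name) pairs by following next_functions.

    `walk` is an illustrative helper, not part of PyTorch.
    """
    if fn is None:  # constants (such as the scalar 2) have no graph node
        return []
    nodes = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        nodes.extend(walk(next_fn, depth + 1))
    return nodes

nodes = walk(y.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The leaf tensor `x` shows up as `AccumulateGrad` nodes, which is where `backward()` deposits the result into `x.grad`.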

In [1]:
import torch

x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

In [2]:
x.requires_grad_(True) # Same as `x = torch.arange(4.0, requires_grad=True)`
x.grad # The default value is None

In [3]:
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

In [4]:
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

In [5]:
x.grad == 4 * x

tensor([True, True, True, True])
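
The check above matches the derivative worked out by hand. As a short derivation, with $x_i$ denoting the components of `x`:

$$
y = 2\,\mathbf{x}^\top \mathbf{x} = 2 \sum_{i} x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4 x_i,
\qquad
\nabla_{\mathbf{x}}\, y = 4\mathbf{x}.
$$

For $\mathbf{x} = (0, 1, 2, 3)$ this gives $y = 28$ and $\nabla_{\mathbf{x}}\, y = (0, 4, 8, 12)$, in agreement with the outputs of cells 3 and 4.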

In [6]:
# By default, PyTorch accumulates gradients, so we need to clear the previous
# values before computing a new gradient
x.grad.zero_()
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])

Non-Scalar Variables

Technically, when y is not a scalar, the most natural interpretation of the differentiation of a vector
y with respect to a vector x is a matrix. For higher-order and higher-dimensional y and x, the
differentiation result could be a high-order tensor.


However, while these more exotic objects do show up in advanced machine learning (including in
deep learning), more often when we are calling backward on a vector, we are trying to calculate
the derivatives of the loss functions for each constituent of a batch of training examples. Here, our
intent is not to calculate the differentiation matrix but rather the sum of the partial derivatives
computed individually for each example in the batch.
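
As a rough illustration of the difference between the full differentiation matrix and the summed partial derivatives, the sketch below computes both for `y = x * x`. It assumes `torch.autograd.functional.jacobian` is available in the installed PyTorch version; the function name `f` is only for illustration.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.arange(4.0, requires_grad=True)

def f(v):
    # Element-wise square, so y_i depends only on x_i
    return v * v

# Full Jacobian: J[i, j] = d y_i / d x_j (a 4x4 diagonal matrix here)
J = jacobian(f, x.detach())

# backward with a vector of ones computes ones^T @ J,
# which is the gradient of y.sum()
y = f(x)
y.backward(torch.ones_like(y))

print(J)
print(x.grad)
```

Summing the rows of `J` gives the same vector as `x.grad`, which is why passing a gradient of ones is equivalent to calling `y.sum().backward()`.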

In [7]:
# Invoking `backward` on a non-scalar requires passing in a `gradient` argument
# which specifies the gradient of the differentiated function w.r.t `self`.
# In our case, we simply want to sum the partial derivatives, so passing
# in a gradient of ones is appropriate
x.grad.zero_()
y = x * x
# The line below is equivalent to y.backward(torch.ones(len(x)))
y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])

`detach()`

Sometimes, we wish to move some calculations outside of the recorded computational graph. For
example, say that y was calculated as a function of x, and that subsequently z was calculated as a
function of both y and x. Now, imagine that we wanted to calculate the gradient of z with respect
to x, but wanted for some reason to treat y as a constant, and only take into account the role that
x played after y was calculated.


Here, we can detach y to return a new variable u that has the same value as y but discards any
information about how y was computed in the computational graph. In other words, the gradient
will not flow backwards through u to x. Thus, the following backpropagation function computes
the partial derivative of z = u * x with respect to x while treating u as a constant, instead of the
partial derivative of z = x * x * x with respect to x.

In [8]:
x.grad.zero_()
z = x * x * x
z.sum().backward()
x.grad

tensor([ 0.,  3., 12., 27.])

In [9]:
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x
z.sum().backward()
x.grad

tensor([0., 1., 4., 9.])

In [10]:
u

tensor([0., 1., 4., 9.])

---

https://tutorials.pytorch.kr/beginner/blitz/autograd_tutorial.html#computational-graph

In [11]:
import torch, torchvision
model = torchvision.models.resnet18(pretrained=True)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

In [12]:
prediction = model(data) # forward pass

In [13]:
loss = (prediction - labels).sum()
loss.backward() # backward pass

In [14]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

In [15]:
optim.step() # gradient descent step

---

In [16]:
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

In [17]:
Q = 3*a**3 - b**2

In [18]:
Q

tensor([-12.,  65.], grad_fn=<SubBackward0>)

In [19]:
external_grad = torch.tensor([1., 1.])
# external gradient = gradient of the function that reduces Q to a scalar
# Q in R^2 -> l(Q) in R; we pass the gradient of l with respect to Q
# Here l(Q) = Q.sum() = sum_i q_i
# so grad l = (1, 1)

In [20]:
Q.backward(gradient=external_grad) 
# Equivalent: Q.sum().backward()

In [21]:
a.grad

tensor([36., 81.])

In [22]:
b.grad

tensor([-12.,  -8.])

In [23]:
9*a**2

tensor([36., 81.], grad_fn=<MulBackward0>)

In [24]:
-2*b

tensor([-12.,  -8.], grad_fn=<MulBackward0>)
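
The agreement between `a.grad`, `b.grad` and the two expressions above can be derived by hand. In the sketch below, $J_a$ and $J_b$ (the Jacobians of `Q` with respect to `a` and `b`), $v$ (the external gradient) and $\delta_{ij}$ (the Kronecker delta) are notation introduced here for illustration:

$$
Q_i = 3a_i^3 - b_i^2,
\qquad
\frac{\partial Q_i}{\partial a_j} = 9a_i^2\,\delta_{ij},
\qquad
\frac{\partial Q_i}{\partial b_j} = -2b_i\,\delta_{ij}
$$

$$
(J_a^\top v)_j = \sum_i v_i \frac{\partial Q_i}{\partial a_j} = 9a_j^2,
\qquad
(J_b^\top v)_j = \sum_i v_i \frac{\partial Q_i}{\partial b_j} = -2b_j
\qquad \text{for } v = (1, 1).
$$

With $a = (2, 3)$ and $b = (6, 4)$ this gives $(36, 81)$ and $(-12, -8)$, matching cells 21 to 24.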

---

https://tutorials.pytorch.kr/beginner/basics/autogradqs_tutorial.html

In [27]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output

w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w)+b # z = xW + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

![](https://tutorials.pytorch.kr/_images/comp-graph.png)

In [28]:
print('Gradient function for z =', z.grad_fn)
print('Gradient function for loss =', loss.grad_fn)

Gradient function for z = <AddBackward0 object at 0x0000021AFAD48D30>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward object at 0x0000021AFAD48EB0>


In [31]:
# w, b -> z(w, b) = xW + b -> Loss(z, y) = Loss(w, b, y)
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.3192, 0.0129, 0.1704],
        [0.3192, 0.0129, 0.1704],
        [0.3192, 0.0129, 0.1704],
        [0.3192, 0.0129, 0.1704],
        [0.3192, 0.0129, 0.1704]])
tensor([0.3192, 0.0129, 0.1704])


In [32]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad(): # forward pass only, so no gradient tracking is needed
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


In [33]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False


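When to disable gradient tracking
* To mark some parameters of a neural network as frozen parameters. This is a very common scenario when fine-tuning a pretrained network.
* To speed up computation when only the forward pass is needed, because operations on tensors that do not track gradients are more efficient.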
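
As one possible illustration of the frozen-parameter case, the sketch below freezes the first layer of a small model with `requires_grad_(False)`. The two-layer model and its layer sizes are hypothetical and chosen only to keep the example short.

```python
import torch
from torch import nn

# Hypothetical two-layer model; the sizes are illustrative only
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze the first layer so autograd stops tracking its parameters
for p in model[0].parameters():
    p.requires_grad_(False)

x = torch.ones(1, 5)
loss = model(x).sum()
loss.backward()

frozen_grads = [p.grad for p in model[0].parameters()]
trainable_grads = [p.grad for p in model[2].parameters()]

print(frozen_grads)                          # no gradients were computed
print([g.shape for g in trainable_grads])    # gradients for the last layer
```

Only the parameters of the last layer receive gradients, so an optimizer built over `model.parameters()` would leave the frozen layer unchanged.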

In [37]:
inp = torch.eye(5, requires_grad=True)
out = (inp+1).pow(2)

In [40]:
inp

tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]], requires_grad=True)

In [41]:
out

tensor([[4., 1., 1., 1., 1.],
        [1., 4., 1., 1., 1.],
        [1., 1., 4., 1., 1.],
        [1., 1., 1., 4., 1.],
        [1., 1., 1., 1., 4.]], grad_fn=<PowBackward0>)

In [38]:
out.backward(torch.ones_like(inp), retain_graph=True)
print("First call\n", inp.grad)
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nSecond call\n", inp.grad) # gradients are accumulated

First call
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

Second call
 tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])


In [39]:
inp.grad.zero_() # to avoid accumulation, reset the gradient before calling backward
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nCall after zeroing gradients\n", inp.grad)


Call after zeroing gradients
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])
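
In a training loop the same reset is usually done through the optimizer rather than on each tensor. The following is a minimal sketch, assuming a single parameter tensor `w` and a toy loss (both hypothetical), that calls `zero_grad()` before every backward pass:

```python
import torch

# Hypothetical parameter and toy loss, for illustration only
w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for step in range(2):
    opt.zero_grad()          # clear gradients left over from the previous step
    loss = (w ** 2).sum()    # d loss / d w = 2 * w
    loss.backward()
    opt.step()               # w <- w - lr * w.grad
    print(step, w.detach(), w.grad)
```

Without the `zero_grad()` call, the gradient from step 0 would be added to the gradient from step 1 and the second update would be larger than intended.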
