### **Gradient Descent**
$$
\textbf{w} = \textbf{w} - \eta\nabla\mathscr{L}_{\textbf{w}} \\[10pt]
b = b - \eta\nabla\mathscr{L}_{b}
$$
$\eta$ is the learning rate and $\nabla\mathscr{L}_{\textbf{w}}$ is the gradient of the loss $\mathscr{L}$ with respect to $\textbf{w}$; the same holds for the bias $b$.

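- Gradient descent works hand in hand with backpropagation: backpropagation computes the gradients, and gradient descent uses them to update the parameters
  - In most cases it is not plain gradient descent but a variant of it
- In plain (full-batch) gradient descent, the weights are updated once per epoch, after the gradient has been computed over the entire training set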
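
The update rule above can be sketched as a short training loop. This is only a minimal illustration: the toy data `X`, `Y`, the learning rate `eta = 0.5`, and the 20 epochs are assumptions, not values from these notes, and the gradients are obtained with PyTorch's automatic differentiation.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy data: 8 examples with 1 feature, label is 1 when x > 0
X = torch.linspace(-2.0, 2.0, 8)
Y = (X > 0).float()

w = torch.tensor([0.0], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
eta = 0.5  # learning rate (assumed value)

losses = []
for epoch in range(20):
    z = w * X + b                                    # net input for ALL examples
    loss = F.binary_cross_entropy_with_logits(z, Y)  # mean loss over the dataset
    grad_w, grad_b = torch.autograd.grad(loss, [w, b])
    with torch.no_grad():                            # update outside the graph
        w -= eta * grad_w                            # w = w - eta * dL/dw
        b -= eta * grad_b                            # b = b - eta * dL/db
    losses.append(loss.item())                       # exactly one update per epoch

print(losses[0], losses[-1])
```

Because every epoch uses the whole dataset, the loop body runs once per epoch, which is the "once per epoch" behaviour described in the list.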

### **Stochastic Gradient Descent**
- The weights are updated after every single training example, so each epoch performs as many updates as there are training examples
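
As a rough sketch of this per-example behaviour, the loop below updates the parameters once for each training example. The toy data, the learning rate `eta = 0.1`, the 3 epochs, and the random visiting order are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy data: 8 examples with 1 feature, label is 1 when x > 0
X = torch.linspace(-2.0, 2.0, 8)
Y = (X > 0).float()

w = torch.tensor([0.0], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
eta = 0.1  # learning rate (assumed value)

num_updates = 0
for epoch in range(3):
    order = torch.randperm(len(X))       # visit the examples in random order
    for i in order:
        z = w * X[i] + b                 # net input for ONE example, shape [1]
        loss = F.binary_cross_entropy_with_logits(z, Y[i].reshape(1))
        grad_w, grad_b = torch.autograd.grad(loss, [w, b])
        with torch.no_grad():
            w -= eta * grad_w
            b -= eta * grad_b
        num_updates += 1                 # one update per training example

print(num_updates)  # 3 epochs * 8 examples
```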

### **Minibatch Gradient Descent**
- Updates happen in batches: not as frequent as in SGD, nor as sparse as in gradient descent
- Less noisy loss compared to SGD, because each gradient is averaged over several examples
- Faster than gradient descent in practice, because there is more than one update per epoch
- Better GPU utilization than SGD, because a whole batch is processed at once with vectorized linear algebra (matrix multiplication)
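
The three variants differ only in the batch size, which the sketch below tries to make visible by counting updates per epoch. The helper `updates_per_epoch`, the toy data, and the learning rate are hypothetical names and values chosen for illustration; `DataLoader` and `TensorDataset` are the standard PyTorch utilities.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: 8 examples with 1 feature, label is 1 when x > 0
X = torch.linspace(-2.0, 2.0, 8)
Y = (X > 0).float()

def updates_per_epoch(batch_size):
    """Run one epoch and return how many weight updates it performed."""
    w = torch.tensor([0.0], requires_grad=True)
    b = torch.tensor([0.0], requires_grad=True)
    eta = 0.1  # learning rate (assumed value)
    loader = DataLoader(TensorDataset(X, Y), batch_size=batch_size, shuffle=True)
    count = 0
    for xb, yb in loader:
        z = w * xb + b  # one vectorized operation for the whole batch
        loss = F.binary_cross_entropy_with_logits(z, yb)
        grad_w, grad_b = torch.autograd.grad(loss, [w, b])
        with torch.no_grad():
            w -= eta * grad_w
            b -= eta * grad_b
        count += 1
    return count

# batch_size=1 behaves like SGD, batch_size=8 like full-batch gradient descent
counts = {bs: updates_per_epoch(bs) for bs in (1, 4, 8)}
print(counts)
```

With 8 examples, a batch size of 4 sits between the two extremes: fewer updates than SGD, more than full-batch gradient descent.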

### **Automatic differentiation in PyTorch**

In [5]:
import torch
import torch.nn.functional as F

In [2]:
# Model parameters
w = torch.tensor([0.23], requires_grad = True)
b = torch.tensor([0.1], requires_grad = True)

# Inputs and targets
x = torch.tensor([1.23])
y = torch.tensor([1.])

In [3]:
# Weighted sum
z = w * x + b
z

tensor([0.3829], grad_fn=<AddBackward0>)

In [4]:
# Activation function
a = torch.sigmoid(z)
a

tensor([0.5946], grad_fn=<SigmoidBackward0>)

In [6]:
# Loss function in Logistic Regression (negative log likelihood/binary crossentropy)
loss = F.binary_cross_entropy(a, y)
loss

tensor(0.5199, grad_fn=<BinaryCrossEntropyBackward0>)

**It's good practice to NOT apply binary cross-entropy to the activation output in PyTorch, and instead use binary cross-entropy with logits**
- What are logits?
  - They are the net inputs of the logistic regression model
  - In simple words, feed in the weighted sum `z` and not the output of the activation function `a`
  - The loss function applies the sigmoid internally
  - This is for computational efficiency and numerical stability; for moderate inputs such as the one here, the result is the same
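
The numerical stability point can be illustrated with an extreme input. In this sketch the logit `30.0` and the target `0.0` are hypothetical values chosen so that the sigmoid saturates: in float32, `torch.sigmoid(30.0)` should round to `1.0`, so the plain version has to evaluate `log(1 - a) = log(0)`, while the logits version works on `z` directly and should return a value close to 30.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([30.0])  # a large logit (hypothetical value)
y = torch.tensor([0.0])   # a target that disagrees with the logit

a = torch.sigmoid(z)                               # saturates in float32
naive = F.binary_cross_entropy(a, y)               # needs log(1 - a)
stable = F.binary_cross_entropy_with_logits(z, y)  # works on z directly

print(a.item(), naive.item(), stable.item())
```

For moderate inputs, such as the `z` used in this notebook, both functions agree to the printed precision.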

In [8]:
loss = F.binary_cross_entropy_with_logits(z, y)
loss

tensor(0.5199, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)

#### **Now, to use the automatic differentiation of PyTorch**

In [9]:
from torch.autograd import grad

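We use `retain_graph = True` to keep the computation graph (which backpropagation traverses) in memory. Otherwise PyTorch frees the graph after the first gradient computation, and a second `grad` or `backward` call on the same `loss` typically fails with a `RuntimeError`.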
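
A small sketch of what happens without `retain_graph`, using the same parameter and input values as above. The flag `second_call_failed` is a hypothetical name introduced only for this illustration.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

w = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
x = torch.tensor([1.23])
y = torch.tensor([1.])

loss = F.binary_cross_entropy_with_logits(w * x + b, y)

grad(loss, w)          # first call works, then the graph's buffers are freed
try:
    grad(loss, b)      # second call on the same, already freed graph
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

print(second_call_failed)
```

An alternative that avoids retaining the graph is to request both gradients in a single call, `grad(loss, [w, b])`.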

$$
\frac{\partial \mathscr{L}}{\partial w}
$$

In [10]:
grad_L_w = grad(loss, w, retain_graph = True)
grad_L_w

(tensor([-0.4987]),)

$$
\frac{\partial \mathscr{L}}{\partial b}
$$

In [11]:
grad_L_b = grad(loss, b, retain_graph = True)
grad_L_b

(tensor([-0.4054]),)
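
As a sanity check, these values can be compared with the closed-form gradients. For a sigmoid activation combined with binary cross-entropy, the chain rule gives $\frac{\partial \mathscr{L}}{\partial z} = a - y$, hence $\frac{\partial \mathscr{L}}{\partial w} = (a - y)\,x$ and $\frac{\partial \mathscr{L}}{\partial b} = a - y$. The names `auto_*` and `manual_*` below are introduced only for this sketch.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

w = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
x = torch.tensor([1.23])
y = torch.tensor([1.])

z = w * x + b
loss = F.binary_cross_entropy_with_logits(z, y)

# Gradients from automatic differentiation (both in one call)
auto_w, auto_b = grad(loss, [w, b])

# Closed-form gradients from the chain rule
with torch.no_grad():
    a = torch.sigmoid(z)
    manual_w = (a - y) * x
    manual_b = a - y

print(auto_w, manual_w)
print(auto_b, manual_b)
```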

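**Or, more conveniently, we can call `loss.backward()`, which computes the gradients for all tensors created with `requires_grad = True` and stores them in their `.grad` attribute**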

In [12]:
loss.backward()

In [13]:
w.grad, b.grad

(tensor([-0.4987]), tensor([-0.4054]))
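
With the gradients stored in `.grad`, one gradient descent step from the first section could look like the sketch below. The learning rate `eta = 0.1` is an assumed value. The gradients are reset afterwards because `backward()` adds to `.grad` rather than overwriting it.

```python
import torch
import torch.nn.functional as F

w = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
x = torch.tensor([1.23])
y = torch.tensor([1.])
eta = 0.1  # learning rate (assumed value)

loss_before = F.binary_cross_entropy_with_logits(w * x + b, y)
loss_before.backward()      # fills w.grad and b.grad

with torch.no_grad():       # the update itself must not be tracked
    w -= eta * w.grad       # w = w - eta * dL/dw
    b -= eta * b.grad       # b = b - eta * dL/db

w.grad.zero_()              # reset, otherwise the next backward() accumulates
b.grad.zero_()

loss_after = F.binary_cross_entropy_with_logits(w * x + b, y)
print(loss_before.item(), loss_after.item())
```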