# **Understanding PyTorch Autograd: A Practical Guide to Gradient Calculations**

---
## **Overview**
>This notebook explores PyTorch's autograd engine, covering automatic differentiation, gradient calculation, and backpropagation. The examples walk through manual gradient derivation, tracking gradients with `requires_grad`, and several ways to clear or disable gradients. This foundation helps in understanding how deep learning models are optimized.

---
## **Importing PyTorch**
>Import PyTorch, which provides the autograd engine used throughout this notebook.

In [1]:
import torch

---
## **Custom Binary Cross-Entropy Implementation**
>This cell defines a binary cross-entropy loss that measures the error between predicted probabilities and the actual labels in a binary classification task. `torch.clamp` bounds the predictions within [`epsilon`, `1-epsilon`] for numerical stability, so that `log(0)` (which evaluates to `-inf` and can turn the loss into `inf` or `NaN`) is never computed. The formula is the negative log-likelihood of the correct class under the predicted probability.

In [29]:
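def binary_cross_entropy(prediction, target):
    # 1e-7 rather than 1e-8: in float32, 1 - 1e-8 rounds to exactly 1.0,
    # which would leave the upper bound of the clamp ineffective.
    epsilon = 1e-7
    prediction = torch.clamp(prediction, epsilon, 1 - epsilon)

    # Element-wise negative log-likelihood of the target under the prediction.
    return -target * torch.log(prediction) - (1 - target) * torch.log(1 - prediction)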
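
As a sanity check, the custom loss can be compared against PyTorch's built-in `torch.nn.functional.binary_cross_entropy` with `reduction="none"`. The sketch below restates the loss under the hypothetical name `bce_sketch` so that it runs on its own. The sample probabilities and labels are made up for illustration, and the `epsilon` default is an assumption that does not affect these inputs.

```python
import torch
import torch.nn.functional as F

def bce_sketch(prediction, target, epsilon=1e-7):
    # Hypothetical stand-alone copy of the custom loss, for comparison only.
    prediction = torch.clamp(prediction, epsilon, 1 - epsilon)
    return -target * torch.log(prediction) - (1 - target) * torch.log(1 - prediction)

# Made-up probabilities and labels, kept away from 0 and 1.
probs = torch.tensor([0.9, 0.2, 0.6])
labels = torch.tensor([1.0, 0.0, 1.0])

custom = bce_sketch(probs, labels)
builtin = F.binary_cross_entropy(probs, labels, reduction="none")

print(custom)
print(builtin)
print(torch.allclose(custom, builtin))
```

For inputs like these, the two results should agree element-wise. They can differ at the extremes, where each implementation guards against `log(0)` in its own way.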

---
## **Manual Gradient Computation**
>This initializes the input (`x`), target (`y`), weight (`w`), and bias (`b`). Using the linear model equation `z = wx + b`, the weighted sum is computed. The result is passed through the sigmoid function to produce a probability (`prediction`). Finally, the binary cross-entropy loss is calculated using the custom function.

In [40]:
x = torch.tensor(3.73)
y = torch.tensor(1.0)

w = torch.tensor(1.0)
b = torch.tensor(0.0)

z = w*x + b
prediction = torch.sigmoid(z)
loss = binary_cross_entropy(prediction, y)

print(f"Loss: {loss}")
print(f"Prediction: {prediction}")

Loss: 0.02370944619178772
Prediction: 0.976569414138794


---
>The gradients are calculated manually using the chain rule, one factor at a time:
  - `dloss_dpred`: Partial derivative of the loss w.r.t. the prediction.
  - `dpred_dz`: Derivative of the sigmoid function w.r.t. its input `z`.
  - `dz_dw` and `dz_db`: Derivatives of the linear model w.r.t. the weight and bias, respectively.<br>
The final gradients for the weight (`dloss_dw`) and bias (`dloss_db`) are obtained by multiplying these factors together.

In [42]:
dloss_dpred = (prediction-y) / (prediction * (1-prediction))
dpred_dz = prediction * (1-prediction)

dz_dw = x
dz_db = 1

dloss_dw = dloss_dpred * dpred_dz * dz_dw
dloss_db = dloss_dpred * dpred_dz * dz_db

print(f"Gradient of loss with respect to weight: {dloss_dw}")
print(f"Gradient of loss with respect to bias: {dloss_db}")

Gradient of loss with respect to weight: -0.08739608526229858
Gradient of loss with respect to bias: -0.023430585861206055
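
Written out, the chain-rule product computed above collapses to a compact closed form, because the sigmoid derivative cancels the denominator of the loss derivative. The symbols $p$ (for `prediction`) and $L$ (for the loss) are introduced here for brevity:

$$
\frac{\partial L}{\partial p} = \frac{p - y}{p(1-p)}, \qquad
\frac{\partial p}{\partial z} = p(1-p), \qquad
\frac{\partial z}{\partial w} = x, \qquad
\frac{\partial z}{\partial b} = 1
$$

$$
\frac{\partial L}{\partial w} = (p - y)\,x, \qquad
\frac{\partial L}{\partial b} = p - y
$$

With $p \approx 0.97657$, $y = 1$, and $x = 3.73$, this gives $\partial L/\partial b \approx -0.02343$ and $\partial L/\partial w \approx -0.08740$, in line with the printed values.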


---
## **Automatic Differentiation with `requires_grad`**
>This cell uses PyTorch’s autograd by setting `requires_grad=True` for `w2` and `b2`, which enables automatic tracking of operations on these tensors. The same forward computation (`z2`, `y_pred`, `loss2`) is performed, and the gradients are computed automatically by calling `loss2.backward()`. The gradients are stored in the `.grad` attribute of `w2` and `b2`, and they match the manually derived values from the previous section.

In [31]:
x2 = torch.tensor(3.73)
y2 = torch.tensor(1.0)

w2 = torch.tensor(1.0, requires_grad=True)
b2 = torch.tensor(0.0, requires_grad=True)

z2 = w2*x2 + b2
y_pred = torch.sigmoid(z2)
loss2 = binary_cross_entropy(y_pred, y2)

print(f"Loss: {loss2}")
print(f"Prediction: {y_pred}")

loss2.backward()

print(f"\n\nGradient of loss with respect to weight: {w2.grad}")
print(f"Gradient of loss with respect to bias: {b2.grad}")

Loss: 0.02370944619178772
Prediction: 0.976569414138794


Gradient of loss with respect to weight: -0.08739608526229858
Gradient of loss with respect to bias: -0.023430585861206055
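
One way to confirm that autograd and the hand derivation agree is to compare `.grad` against the closed-form expressions `(p - y) * x` and `p - y`. The following is a minimal sketch with fresh variable names (`p`, `expected_dw`, `expected_db` are introduced here) and an unclamped loss, which is assumed to be safe for this particular input.

```python
import torch

x = torch.tensor(3.73)
y = torch.tensor(1.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

p = torch.sigmoid(w * x + b)
# Plain binary cross-entropy; no clamping needed for this well-behaved input.
loss = -y * torch.log(p) - (1 - y) * torch.log(1 - p)
loss.backward()

# Closed-form gradients, computed without tracking.
with torch.no_grad():
    expected_dw = (p - y) * x
    expected_db = p - y

print(torch.allclose(w.grad, expected_dw, atol=1e-6))
print(torch.allclose(b.grad, expected_db, atol=1e-6))
```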


---
## **Gradient Calculation for a Vector Tensor**
>This example computes the gradient for a vector tensor `x3`. The operation squares each element of `x3` and takes the mean of the results (`y3`). When `y3.backward()` is called, the gradient of `y3` w.r.t. each element of `x3` is computed and stored in `x3.grad`. Since `y3` is the mean of three squared elements, each gradient entry equals `2 * x3[i] / 3`. This shows how autograd handles a scalar output that depends on a multi-element tensor.

In [32]:
x3 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y3 = (x3**2).mean()

print(f"Value of y: {y3}")

y3.backward()

print(f"\n\nGradients of y: {x3.grad}")

Value of y: 4.666666507720947


Gradients of y: tensor([0.6667, 1.3333, 2.0000])
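
The printed gradient can be checked against the analytic result: for the mean of `n` squared elements, each partial derivative is `2 * x[i] / n`. A short sketch of that check, where the name `analytic` is introduced here:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).mean()
y.backward()

# Analytic gradient of mean(x**2): 2 * x / n, with n the number of elements.
analytic = 2 * x.detach() / x.numel()

print(x.grad)
print(analytic)
print(torch.allclose(x.grad, analytic))
```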


---
## **Clearing Gradients**
>This example shows that gradients persist in `.grad` after calling `.backward()`. The gradient of `y4 = x4**2` is computed, and `x4.grad` stores the result. Each later call to `.backward()` adds to the stored value rather than overwriting it, so the `zero_()` method is used to reset the gradient. This matters in iterative training loops, where gradients from previous iterations would otherwise accumulate.

In [37]:
x4 = torch.tensor(2.0, requires_grad=True)
y4 = x4**2

print(f"Value of y: {y4}")

y4.backward()

print(f"\n\nGradients of y: {x4.grad}")
x4.grad.zero_()

Value of y: 4.0


Gradients of y: 4.0


tensor(0.)
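
To see why clearing matters, the sketch below runs two backward passes without zeroing in between, then resets the gradient. The names `first`, `accumulated`, and `cleared` are introduced here for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

(x ** 2).backward()
first = x.grad.item()        # gradient from the first backward pass

(x ** 2).backward()          # second pass on a fresh graph, no zeroing in between
accumulated = x.grad.item()

x.grad.zero_()               # reset before the next iteration
cleared = x.grad.item()

print(first, accumulated, cleared)  # 4.0 8.0 0.0
```

The second pass adds another `4.0` to the stored gradient, which is the accumulation that `zero_()` prevents.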

---
## **Disabling Gradient Tracking**
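>This demonstrates two methods to disable gradient tracking:
  - `requires_grad_(False)`: Turns off gradient tracking on `x5` in place and returns the same tensor, so `x6` is another name for `x5` rather than a copy. Tracking can be turned back on with `requires_grad_(True)`.
  - `detach()`: Returns a new tensor `x7` that shares data with `x5` but is cut off from the computation graph. Operations on `x7` are not tracked and do not contribute to the gradients of `x5`, though in-place changes to `x7` also change the values in `x5`.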

In [46]:
x5 = torch.tensor(2.0, requires_grad=True)
y5 = x5**3

print(f"Value of y: {y5}")

y5.backward()

print(f"Gradients of y: {x5.grad}")

x6 = x5.requires_grad_(False)
y6 = x6**4

print(f"\nValue of y (requires_grad_): {y6}")

x7 = x5.detach()
y7 = x7**5

print(f"Value of y (detach): {y7}")

Value of y: 8.0
Gradients of y: 12.0

Value of y (requires_grad_): 16.0
Value of y (detach): 32.0
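
The difference between the two methods is easier to see when each is applied to its own tensor. The sketch below uses fresh names (`a`, `b`, `c`, `d`, introduced here) to show that `requires_grad_` returns the same object, while `detach` returns a new object that shares storage with the original.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = a.requires_grad_(False)   # in-place: returns the same tensor object
print(b is a)                 # True
print(a.requires_grad)        # False

c = torch.tensor(2.0, requires_grad=True)
d = c.detach()                # new tensor object, same underlying storage
print(d is c)                 # False
print(c.requires_grad, d.requires_grad)  # True False

d.add_(1.0)                   # in-place change through the detached tensor
print(c.item())               # 3.0
```

Because `d` shares memory with `c`, calling `d.clone()` first is a common choice when an independent copy is needed.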


---
## **Using `no_grad` for Inference**
>This example shows the use of `torch.no_grad()` to temporarily disable gradient tracking within its scope. This is useful for inference, where gradients are not needed, reducing memory usage and computation overhead.

In [45]:
x8 = torch.tensor(2.0, requires_grad=True)

with torch.no_grad():
  y8 = x8**6

print(f"Value of y (no_grad): {y8}")

Value of y (no_grad): 64.0
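
The effect of the context manager can be confirmed by inspecting the result tensors. In this sketch, `inside` and `outside` are names introduced here to contrast the same operation with and without tracking.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

with torch.no_grad():
    inside = x ** 6            # computed without building a graph

outside = x ** 6               # tracked as usual

print(inside.requires_grad, inside.grad_fn)   # False None
print(outside.requires_grad)                  # True
print(outside.grad_fn is not None)            # True
```

The input `x` keeps `requires_grad=True` throughout; only results created inside the `with` block are untracked.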
