# PyTorch (Autograd Practice)

## The Computation Graph (The "Brain" of Autograd)
- The core concept behind Autograd is the dynamic computation graph (DCG).
- When you perform an operation on a PyTorch tensor, Autograd implicitly builds a graph where:
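  - Conceptually, nodes represent the Tensors.
  - Edges represent the functions or operations applied to those Tensors (e.g., addition, multiplication, or $x^2$). Internally, PyTorch records each operation as a function object that is exposed through the result's grad_fn attribute.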
  - This graph is built dynamically, meaning it is rebuilt for **every single forward pass**.
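
A minimal sketch of what "dynamic" means in practice. The helper `forward` and the tensor names are hypothetical, chosen only for illustration: ordinary Python control flow decides which operation is recorded, so two forward passes can produce two different graphs.

```python
import torch

def forward(x):
    # Plain Python branching: the recorded graph depends on the input value.
    if x.item() > 0:
        return x ** 2      # records a power operation
    return x * 3           # records a multiplication

x_pos = torch.tensor(2.0, requires_grad=True)
x_neg = torch.tensor(-2.0, requires_grad=True)

out_pos = forward(x_pos)
out_neg = forward(x_neg)

# Each result remembers the operation that produced it.
print(type(out_pos.grad_fn).__name__)
print(type(out_neg.grad_fn).__name__)

out_pos.backward()
out_neg.backward()
print(x_pos.grad)  # derivative of x^2 at x = 2
print(x_neg.grad)  # derivative of 3x
```

The two `grad_fn` names differ because each call to `forward` built its own graph from scratch.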

In [1]:
import torch

## Tensors and requires_grad
- The **requires_grad** flag dictates whether Autograd should track operations on a tensor.
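- Leaf Nodes (Inputs/Parameters): Model parameters (W, b) need requires_grad=True. Input data (X) normally does not, unless we explicitly want the gradient with respect to the input, as in the scalar X example later in these notes.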
- Intermediate/Output Nodes (y): These tensors automatically **inherit** the requires_grad=True flag if any tensor used to create them had requires_grad=True.
- We can change this flag in-place using .requires_grad_():

``` Python
W = torch.randn(5, 5) # requires_grad=False by default
W.requires_grad_(True) # Now tracking operations on W
# Note: the trailing underscore '_' means the operation is applied
# IN-PLACE (it modifies the same tensor instead of creating another tensor)

# OR
W = torch.randn(5, 5, requires_grad=True)
```
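
The inheritance rule described above can be checked directly. This is a small sketch with made-up tensor names (`untracked`, `tracked`); it only relies on the `requires_grad` attribute and `requires_grad_()`.

```python
import torch

W = torch.randn(5, 5)            # not tracked by default
X = torch.randn(5, 5)            # plain input data, not tracked

untracked = W @ X                # no input requires gradients
print(untracked.requires_grad)   # False

W.requires_grad_(True)           # switch tracking on in place
tracked = W @ X                  # one input now requires gradients
print(tracked.requires_grad)     # True
print(W.is_leaf, tracked.is_leaf)
```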

## Gradient Accumulation
- Gradients **accumulate** in the .grad attribute. Forgetting this is one of the most common pitfalls in PyTorch.
- Every time we call .backward(), the new gradients are **added** to any existing values in the .grad attribute.
  - Why? In scenarios such as Recurrent Neural Networks (RNNs) or accumulating gradients over several small batches to simulate one large batch, this accumulation is desirable.
  - The Fix: We must **manually zero out the gradients** before running a new backward pass, usually with **optimizer.zero_grad()**.

``` Python
# Before starting a new batch or epoch
# optimizer.zero_grad()
X.grad.zero_() # Manually zeroing the gradient for the next calculation IN-PLACE
```
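
To see the accumulation happen, here is a short sketch that calls backward twice on the same leaf tensor before zeroing. The variable names `first`, `second`, and `third` are illustrative only.

```python
import torch

X = torch.tensor(3.0, requires_grad=True)

(X ** 2).backward()
first = X.grad.clone()      # 2 * 3 = 6

(X ** 2).backward()         # second backward pass without zeroing
second = X.grad.clone()     # 6 + 6 = 12, the new gradient was added

X.grad.zero_()              # reset in place
(X ** 2).backward()
third = X.grad.clone()      # back to 6

print(first, second, third)
```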

## Gradient Calculation and Scope
- In the example below, y.backward() computes the gradient of the final scalar value (y) with respect to its dependency (X).
- In PyTorch, calling tensor.backward() computes the gradient of that tensor with respect to every leaf node that requires gradients (here, X) and stores the result in each leaf's .grad attribute. Called without arguments, it requires the tensor to be a scalar.
- Our Code: Since $y = X^2$, calling y.backward() computes $\frac{d y}{d X} = 2X$. With $X=3.0$, $X.\text{grad}$ becomes $2 \times 3.0 = 6.0$.
- Neural Networks: In machine learning, $y$ is usually a loss function ($L$). When we call $L.\text{backward()}$, we are computing $\frac{\partial L}{\partial \text{Weights}}$, which the optimization step then uses to update the weights.
### The grad_fn Attribute
- The grad_fn attribute is crucial and its existence indicates that Autograd is tracking the computation for this tensor.
- If a tensor is created as a result of an operation, it will have a grad_fn property, which references the function that created it (e.g., <PowBackward0>).
- If a tensor is a leaf node (like our initial X with requires_grad=True), its grad_fn is None. It is a starting point of the graph.

In [4]:
X = torch.tensor(3.0, requires_grad=True)
print(X)
y = X**2
print(y)
y.backward()
print(X.grad)

tensor(3., requires_grad=True)
tensor(9., grad_fn=<PowBackward0>)
tensor(6.)
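
The leaf versus non-leaf distinction can also be inspected directly. This is a minimal sketch using the `is_leaf` and `grad_fn` attributes, plus `grad_fn.next_functions`, which links a recorded operation back towards the leaves.

```python
import torch

X = torch.tensor(3.0, requires_grad=True)
y = X ** 2

print(X.is_leaf, X.grad_fn)   # leaf: no function created it
print(y.is_leaf, y.grad_fn)   # non-leaf: created by the power operation

# The recorded operation points back to the node that accumulates X.grad
print(y.grad_fn.next_functions)
```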


## Disabling Autograd Tracking
- When performing operations that should not be tracked by Autograd (e.g., during model evaluation, **testing**, or weight updates), use the torch.no_grad() context manager.
  - This reduces memory consumption because the DCG is not built and no intermediate results are stored for a backward pass.
  - It can also speed up forward passes where gradients aren't needed, since no bookkeeping for the backward pass is recorded.
### Using `requires_grad_(False)` functionality
``` Python
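z = w*x + b               # tracked: w and b still require gradients
model.eval()              # evaluation mode for layers such as dropout and batch norm
# During Testing
w.requires_grad_(False)
b.requires_grad_(False)
z = w*x + b
# 'z' now has requires_grad=False (assuming x does not require gradients),
# so calling z.backward() would raise a RuntimeError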
```
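
A runnable version of the same idea, sketched with the scalar values used elsewhere in these notes. The flag `backward_failed` is a hypothetical helper variable added only to make the outcome visible.

```python
import torch

x = torch.tensor(6.7)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Freeze the parameters in place
w.requires_grad_(False)
b.requires_grad_(False)

z = w * x + b
print(z.requires_grad)   # False
print(z.grad_fn)         # None, nothing was recorded

try:
    z.backward()
    backward_failed = False
except RuntimeError as err:
    backward_failed = True
    print("backward() raised:", err)
```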
### Using `.detach()` functionality
- This method returns a new tensor that shares the same data as the tensor it is called on BUT has requires_grad set to False and is cut off from the computation graph. The original tensor is left unchanged.
``` Python
z = w*x + b
model.eval()
# During Testing
w1 = w.detach()
b1 = b.detach()
z1 = w1*x + b1
# 'z1' has requires_grad=False (assuming x does not require gradients),
# so calling z1.backward() would raise a RuntimeError
```
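
A self-contained sketch of `.detach()` with the same scalar setup. The comparison of `data_ptr()` values is included to illustrate that the detached tensor shares storage with the original rather than copying it.

```python
import torch

x = torch.tensor(6.7)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

w1 = w.detach()
b1 = b.detach()
z1 = w1 * x + b1

print(w1.requires_grad, z1.requires_grad)  # detached copies are not tracked
print(w.requires_grad)                      # the original keeps its flag

# Same underlying memory: detach() does not copy the data
print(w1.data_ptr() == w.data_ptr())
```

Because the storage is shared, an in-place change to `w1` would also change `w`. Use `w.detach().clone()` when an independent copy is needed.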
### The `with torch.no_grad():` Context Manager
``` Python
# During model evaluation or validation
model.eval()
with torch.no_grad():
    predictions = model(input_data)
    # The 'predictions' tensor will have requires_grad=False
```
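
The snippet above assumes a `model` and `input_data` already exist. A runnable sketch follows, using a small `torch.nn.Linear` layer and a random batch as hypothetical stand-ins.

```python
import torch

model = torch.nn.Linear(3, 1)       # hypothetical stand-in model
input_data = torch.randn(4, 3)      # hypothetical batch of 4 samples

model.eval()
with torch.no_grad():
    predictions = model(input_data)

print(predictions.requires_grad)    # False inside the no_grad block
print(model.weight.requires_grad)   # parameters keep their own flag

tracked = model(input_data)         # outside the block, tracking resumes
print(tracked.requires_grad)
```

Unlike `requires_grad_(False)`, the context manager leaves the parameters' flags untouched, so nothing has to be switched back on before training continues.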


### Major Example

In [5]:
# Major example: a simple perceptron (single-neuron logistic regression)

# Inputs
x = torch.tensor(6.7) # Input feature - let's say the IQ score of a student (6.7)
y = torch.tensor(0.0) # True label - let's say the student did not get the job

# INITIALIZATION OF WEIGHT AND BIAS (ORIGINALLY WILL BE RANDOMLY INITIALIZED)
w = torch.tensor(1.0)
b = torch.tensor(0.0)

# ALL ABOVE ARE ACTUALLY SCALARS

In [7]:
# Binary Cross-Entropy Loss Calculator for Scalars (Manually)
def binary_cross_entropy_loss(pred,target):
  epsilon = 1e-8
  # epsilon is a small hand-picked value used to clamp the predictions
  # to the range [epsilon, 1 - epsilon] ([min, max]). This keeps the
  # prediction away from exactly 0 or 1, so the undefined log(0) case
  # cannot occur in the cross-entropy formula, which takes logs of
  # both pred and 1 - pred.
  # Note: in float32, 1 - 1e-8 rounds to 1.0, so guarding the upper
  # bound as well needs a larger epsilon such as 1e-7.
  pred = torch.clamp(pred,epsilon,1-epsilon)
  return -(target * torch.log(pred) + (1-target) * (torch.log(1-pred)))

In [9]:
# Sample forward pass (manual)
z = w*x + b               # Weighted Sum Calculation
y_pred = torch.sigmoid(z) # Probability Mapping
# Computing the binary cross-entropy loss with our manual function
loss = binary_cross_entropy_loss(y_pred,y)
print(f"Z = {z}, y_pred = {y_pred}, loss = {loss}")

Z = 6.699999809265137, y_pred = 0.998770534992218, loss = 6.701176166534424
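
As a cross-check on the hand-written loss, the sketch below compares the same formula against PyTorch's built-in `torch.nn.functional.binary_cross_entropy`. The names `manual` and `builtin` are illustrative, and the formula is inlined so the block runs on its own.

```python
import torch
import torch.nn.functional as F

x = torch.tensor(6.7)
y = torch.tensor(0.0)
w = torch.tensor(1.0)
b = torch.tensor(0.0)

y_pred = torch.sigmoid(w * x + b)

# Same formula as the manual loss function, written inline
manual = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred))
builtin = F.binary_cross_entropy(y_pred, y)

print(manual.item(), builtin.item())
print(torch.allclose(manual, builtin, atol=1e-5))
```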


In [14]:
# Partial derivatives (gradients) calculated manually. The partial-derivative
# symbol (∂) is written as 'd' in the variable names below, so every 'd' here
# stands for a partial derivative, not an ordinary one.
# 1. dL/d(y_pred): Change in Loss with respect to the prediction (y_pred)
dloss_dy_pred = (y_pred - y)/(y_pred * (1-y_pred))

# 2. d(y_pred)/dz: Change in prediction (y_pred) with respect to z (sigmoid mapping)
dy_pred_dz = y_pred * (1-y_pred)

# 3. dz/dw and dz/db: Change in z with respect to weight and bias (our leaf nodes in computation graph)
dz_dw = x
dz_db = 1   # 1 means the bias contributes directly to z and doesn't depend on any other variable

dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

print(f"Manual Gradient of Loss w.r.t weight (w): {dL_dw}")
print(f"Manual Gradient of Loss w.r.t bias (b): {dL_db}")

Manual Gradient of Loss w.r.t weight (w): 6.691762447357178
Manual Gradient of Loss w.r.t bias (b): 0.998770534992218
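
The chain-rule product computed above simplifies considerably, which gives a handy sanity check for the printed values. A short derivation, writing $\hat{y}$ for y_pred (a symbol introduced here only for brevity):

$$
\begin{aligned}
L &= -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right], \qquad \hat{y} = \sigma(z), \qquad z = wx + b \\
\frac{\partial L}{\partial \hat{y}} &= \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})}, \qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1 - \hat{y}), \qquad
\frac{\partial z}{\partial w} = x, \qquad
\frac{\partial z}{\partial b} = 1 \\
\frac{\partial L}{\partial w} &= \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \cdot \hat{y}\,(1 - \hat{y}) \cdot x = (\hat{y} - y)\,x \\
\frac{\partial L}{\partial b} &= \hat{y} - y
\end{aligned}
$$

With $y = 0$, $x = 6.7$ and $\hat{y} \approx 0.99877$, this gives $\frac{\partial L}{\partial w} \approx 0.99877 \times 6.7 \approx 6.6918$ and $\frac{\partial L}{\partial b} \approx 0.9988$, matching the output above.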


In [15]:
# Now, using autograd, we can do the same within a few lines of code
x = torch.tensor(6.7)
y = torch.tensor(0.0)
# Note: we are not enabling requires_grad for x (feature) and y (label)
# because we do not need gradients with respect to the data (and normally
# never will). Features and labels are given data that we cannot, and do
# not want to, manipulate. What we want to adjust are our own initial
# guesses, the weight and the bias, so that they work best on the data.
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

In [24]:
# Forward pass again, this time with autograd tracking w and b
z = w*x + b               # Weighted Sum Calculation
y_pred = torch.sigmoid(z) # Probability Mapping
# Computing the binary cross-entropy loss with our manual function
loss = binary_cross_entropy_loss(y_pred,y)
print(f"Z = {z}, y_pred = {y_pred}, loss = {loss}")

Z = 6.699999809265137, y_pred = 0.998770534992218, loss = 6.701176166534424


In [25]:
# Now doing backward propagation and checking the results
loss.backward()
print(w.grad)
print(b.grad)

tensor(6.6918)
tensor(0.9988)
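
Rather than comparing the printed numbers by eye, the agreement between autograd and the manual chain rule can be checked programmatically. This sketch recomputes everything so it runs on its own. `manual_dw` and `manual_db` are illustrative names for the closed-form gradients $(y_{pred} - y)\,x$ and $(y_{pred} - y)$.

```python
import torch

x = torch.tensor(6.7)
y = torch.tensor(0.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

y_pred = torch.sigmoid(w * x + b)
loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred))
loss.backward()

# Closed-form gradients from the chain rule, computed without tracking
with torch.no_grad():
    manual_dw = (y_pred - y) * x
    manual_db = y_pred - y

print(torch.allclose(w.grad, manual_dw, atol=1e-4))
print(torch.allclose(b.grad, manual_db, atol=1e-4))
```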


In [26]:
# Now Clearing the Gradients
w.grad.zero_()
b.grad.zero_()
print(w.grad)
print(b.grad)

tensor(0.)
tensor(0.)
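
Putting the pieces together, here is a minimal sketch of a few plain gradient-descent steps on this one-sample problem. The learning rate `lr`, the step count, and the `losses` list are assumptions made for illustration and are not part of the notes above. The update runs under `torch.no_grad()` so it is not recorded in the graph, and the gradients are zeroed after every step.

```python
import torch

x = torch.tensor(6.7)
y = torch.tensor(0.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
lr = 0.1   # hypothetical learning rate

losses = []
for step in range(5):
    # Forward pass
    y_pred = torch.sigmoid(w * x + b)
    loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred))

    # Backward pass fills w.grad and b.grad
    loss.backward()

    # Parameter update, excluded from autograd tracking
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad

    # Clear the gradients so the next backward pass starts from zero
    w.grad.zero_()
    b.grad.zero_()
    losses.append(loss.item())

print(losses)
```

The recorded losses should shrink from step to step as `w` and `b` move towards values that predict the label 0.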
