
# 🔥 Autograd in PyTorch (Automatic Differentiation)

---

## 🔹 What is Autograd?

Autograd is PyTorch's automatic differentiation engine: it records the operations applied to tensors and automatically calculates derivatives (gradients) of a result with respect to those tensors.

In simple words:
> Autograd = Automatic derivative calculator

---

## 🔹 Why Do We Need Autograd?

Deep Learning = Optimization problem  
Optimization needs → Gradients  
Gradients = Derivatives  

Neural network training uses:
- Backpropagation
- Gradient Descent

Without derivatives → No weight update → No learning

---

## 🔹 Basic Example

Let:

y = x²  

Derivative:

dy/dx = 2x  

If:
x = 2 → dy/dx = 4  
x = 3 → dy/dx = 6  

The slope at any point follows directly by substituting x into 2x.

---

## 🔹 Now Increase Complexity

Suppose:

y = x²  
z = sin(y)  

Now we need:

dz/dx  

z depends on x only through y, so a single basic rule is not enough.  
We must use the **Chain Rule**.

---

## 🔹 Chain Rule

If:

z = sin(y)  
y = x²  

Then:

dz/dx = (dz/dy) × (dy/dx)

We know:

dz/dy = cos(y)  
dy/dx = 2x  

So:

dz/dx = cos(y) × 2x  

Since y = x²:

dz/dx = 2x cos(x²)
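
As a quick sanity check on this result, the sketch below compares the closed-form derivative with a central finite-difference estimate. The helper names (`z`, `dz_dx_analytic`, `dz_dx_numeric`) and the step size `h` are illustrative choices, not part of the original notebook.

```python
import math

def z(x):
    """z = sin(y) with y = x**2."""
    return math.sin(x ** 2)

def dz_dx_analytic(x):
    """Chain-rule result: dz/dx = 2x * cos(x**2)."""
    return 2 * x * math.cos(x ** 2)

def dz_dx_numeric(x, h=1e-6):
    """Central finite-difference estimate of dz/dx (h is an assumed step size)."""
    return (z(x + h) - z(x - h)) / (2 * h)

x = 2.0
analytic = dz_dx_analytic(x)
numeric = dz_dx_numeric(x)
print(analytic, numeric)
```

The two numbers should agree to several decimal places, which suggests the chain-rule derivation is correct.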

---

## 🔹 Why This Matters in Deep Learning

Neural networks are full of nested functions.

Example:

u = e^z  
z = sin(y)  
y = x²  

The derivative is now:

du/dx = (du/dz) × (dz/dy) × (dy/dx)

As functions become more nested:
- Manual differentiation becomes difficult
- The chance of human error increases

---

## 🔹 Where is Autograd Used?

- Backpropagation
- Loss gradient calculation
- Weight updates
- Training deep neural networks

Every layer depends on the previous layer → chain rule everywhere

---

## 🔹 How Autograd Helps?

PyTorch automatically:

1. Tracks operations
2. Builds computation graph
3. Applies chain rule
4. Calculates gradients
5. Stores gradients in `.grad`

No need to manually calculate derivatives.
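
The five steps above can be sketched on the nested example u = e^z, z = sin(y), y = x². This is a minimal illustration, assuming `torch` is installed; the input value 1.5 and the variable name `manual` are arbitrary choices for the demonstration.

```python
import math
import torch

# Leaf tensor: autograd tracks every operation applied to it
x = torch.tensor(1.5, requires_grad=True)

# Forward pass: each operation adds a node to the computation graph
y = x ** 2
z = torch.sin(y)
u = torch.exp(z)

# Backward pass: the chain rule is applied node by node
u.backward()

# Manual chain rule for comparison: du/dx = e^z * cos(y) * 2x
manual = math.exp(math.sin(1.5 ** 2)) * math.cos(1.5 ** 2) * 2 * 1.5

print(x.grad.item(), manual)
```

The value stored in `x.grad` should match the hand-derived product up to floating-point precision.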

---

# 🔥 Final Understanding

- Neural networks = Nested mathematical functions
- Training = Derivative calculation
- Derivatives use Chain Rule
- Manual differentiation is complex
- Autograd does it automatically

Autograd is the heart of Deep Learning training.


In [1]:
def dy_dx(x):
    # derivative of y = x**2
    return 2 * x

In [2]:
dy_dx(3)

6

In [3]:
import math
def dz_dx(x):
    # derivative of z = sin(x**2) via the chain rule
    return 2 * x * math.cos(x**2)

In [4]:
dz_dx(2)

-2.6145744834544478

**The same x² calculation using autograd**

In [5]:
import torch

In [6]:
x = torch.tensor(3.0, requires_grad=True)
# the tensor holds a single value, so it is a scalar
# requires_grad=True: we set this explicitly because it is False by default
# whenever we want derivatives with respect to a tensor, it must have requires_grad=True


In [7]:
y = x**2

In [8]:
x

tensor(3., requires_grad=True)

In [9]:
y

tensor(9., grad_fn=<PowBackward0>)

In [10]:
y.backward()
# the backward pass computes dy/dx, i.e. the derivative we want

In [11]:
x.grad

tensor(6.)

# 🔥 Autograd in PyTorch (Core Idea)

---

## 🔹 Neural Network = Nested Function

Example:

y = x²  
z = sin(y)  
u = e^z  

A neural network is structured in exactly the same way:

Output = f3(f2(f1(x)))

👉 Every layer is a function  
👉 Deep network = Highly nested function  

---

## 🔹 Forward Computation

In the forward pass:

1. Linear transformation → z = wx + b  
2. Activation → sigmoid(z)  
3. Loss calculation  

Mathematically:

Loss = L( sigmoid(wx + b) )

The whole expression is one **nested function**.

---

## 🔹 Problem: Derivatives

For training we need:

dL/dw  
dL/db  

If the network becomes deep:

Layer1 → Layer2 → Layer3 → ... → Output  

then we have to compute:

dL/dw = (chain rule applied many times)

Manual differentiation is:
- Very lengthy
- Error-prone
- Practically impossible for deep networks

---

## 🔹 Solution → AUTOGRAD

Autograd automatically:

- Tracks operations
- Builds computation graph
- Applies chain rule
- Calculates gradients

No manual derivative required.

---

# 🔹 requires_grad = True

When we write:

```
x = torch.tensor(3.0, requires_grad=True)
```

It tells PyTorch:

👉 "Compute the gradient for this tensor"

Then PyTorch:

1. Tracks all operations on x  
2. Creates a computation graph  
3. Calculates dy/dx automatically when `.backward()` is called  

The gradient is stored in:

```
x.grad
```

---

## 🔹 How It Works (Simple Flow)

Forward pass:
x → square → sin → loss  

Backward pass:
loss.backward()

PyTorch automatically computes:

dloss/dx  

Using chain rule internally.

---

# 🔥 THE WHY (Very Important)

A neural network is a nested function.  
As the network gets deeper, computing its derivatives by hand becomes practically impossible.  

That is why we use Autograd, which automatically applies the chain rule to calculate the gradients.

Without Autograd → No Backpropagation → No Training.


In [12]:
import math

def dz_dx(x):
    return 2 * x * math.cos(x**2)

In [13]:
dz_dx(4)

-7.661275842587077

**What we just computed by hand (dz/dx for z = sin(y)) we will now compute with PyTorch's autograd.**

In [14]:
x = torch.tensor(4.0, requires_grad=True)

In [15]:
y = x ** 2
# forward pass: x -> square -> y -> sin -> z
# going forward we get y = x**2, so y = 16

In [16]:
z = torch.sin(y)

In [17]:
x

tensor(4., requires_grad=True)

In [18]:
y

tensor(16., grad_fn=<PowBackward0>)

In [19]:
z

tensor(-0.2879, grad_fn=<SinBackward0>)

**That is the mathematical intuition: values are computed in the forward pass and derivatives in the backward pass. The practical point is simple: call `.backward()` on whichever result you want to differentiate.**

In [20]:
z.backward()

In [21]:
x.grad
# read .grad on the tensor whose derivative we want to inspect

tensor(-7.6613)

In [22]:
y.grad
# y is not a leaf tensor, so PyTorch warns:
# "The .grad attribute of a Tensor that is not a leaf Tensor is being accessed."
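
If the gradient of an intermediate tensor such as `y` is actually wanted, one option is `retain_grad()`, which asks autograd to keep it. The snippet below is a minimal sketch of that approach, not something the original notebook does; it reuses the same x = 4 example.

```python
import math
import torch

x = torch.tensor(4.0, requires_grad=True)
y = x ** 2
y.retain_grad()   # ask autograd to keep dz/dy for this non-leaf tensor
z = torch.sin(y)
z.backward()

print(y.grad)  # dz/dy = cos(16)
print(x.grad)  # dz/dx = 2 * 4 * cos(16)
```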


In [23]:
import torch

# Inputs
x = torch.tensor(6.7)  # Input feature
y = torch.tensor(0.0)  # True label (binary)

w = torch.tensor(1.0)  # Weight
b = torch.tensor(0.0)  # Bias

## 🔹 Binary Cross Entropy (BCE) Loss

For binary classification, the loss function is:

L = - [ y_target * log(y_pred) + (1 - y_target) * log(1 - y_pred) ]

Where:

- y_target → Actual label (0 or 1)
- y_pred → Model prediction (after sigmoid, between 0 and 1)

👉 If the prediction is close to the label → loss is small  
👉 If the prediction is far from the label → loss is large  

Used when:
- The output layer uses **Sigmoid**
- The problem is binary classification (0/1)
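
To make the "small loss when right, large loss when wrong" behaviour concrete, here is a plain-Python sketch of the same formula. The function name `bce` and the probabilities 0.9 and 0.1 are illustrative assumptions.

```python
import math

def bce(y_target, y_pred):
    """Binary cross-entropy for one example; y_pred must lie strictly in (0, 1)."""
    return -(y_target * math.log(y_pred) + (1 - y_target) * math.log(1 - y_pred))

# True label is 1 in both cases
confident_right = bce(1, 0.9)   # prediction close to the label
confident_wrong = bce(1, 0.1)   # prediction far from the label

print(confident_right, confident_wrong)
```

The second value should come out much larger than the first, because the loss grows as the predicted probability moves away from the true label.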


In [24]:
# Binary Cross-Entropy Loss for scalar
def binary_cross_entropy_loss(prediction, target):
    epsilon = 1e-8  # To prevent log(0)
    prediction = torch.clamp(prediction, epsilon, 1 - epsilon)
    return -(target * torch.log(prediction) + (1 - target) * torch.log(1 - prediction))

In [25]:
# Forward pass
z = w * x + b  # Weighted sum (linear part)
y_pred = torch.sigmoid(z)  # Predicted probability, i.e. ŷ

# Compute binary cross-entropy loss
loss = binary_cross_entropy_loss(y_pred, y)

In [26]:
loss

tensor(6.7012)

In [27]:
# Derivatives:
# 1. dL/d(y_pred): Loss with respect to the prediction (y_pred)
dloss_dy_pred = (y_pred - y)/(y_pred*(1-y_pred))

# 2. dy_pred/dz: Prediction (y_pred) with respect to z (sigmoid derivative)
dy_pred_dz = y_pred * (1 - y_pred)

# 3. dz/dw and dz/db: z with respect to w and b
dz_dw = x  # dz/dw = x
dz_db = 1  # dz/db = 1 (bias contributes directly to z)

dL_dw = dloss_dy_pred * dy_pred_dz * dz_dw
dL_db = dloss_dy_pred * dy_pred_dz * dz_db

In [28]:
print(f"Manual Gradient of loss w.r.t weight (dw): {dL_dw}")
print(f"Manual Gradient of loss w.r.t bias (db): {dL_db}")

Manual Gradient of loss w.r.t weight (dw): 6.691762447357178
Manual Gradient of loss w.r.t bias (db): 0.998770534992218


**Using autograd to do the same thing easily**

In [29]:
x = torch.tensor(6.7)
y = torch.tensor(0.0)

In [30]:
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

In [31]:
w

tensor(1., requires_grad=True)

In [32]:
b

tensor(0., requires_grad=True)

## 🔹 What is z?

z = w*x + b  

Where:

- x → input
- w → weight
- b → bias
- z → output before activation


In [33]:
z = w*x + b  # the linear formula defined above
z

tensor(6.7000, grad_fn=<AddBackward0>)

In [34]:
y_pred = torch.sigmoid(z)
y_pred

tensor(0.9988, grad_fn=<SigmoidBackward0>)

In [35]:
loss = binary_cross_entropy_loss(y_pred, y)
loss

tensor(6.7012, grad_fn=<NegBackward0>)

In [36]:
loss.backward()  # run the backward pass to compute the derivatives

In [37]:
print(w.grad)
print(b.grad)

tensor(6.6918)
tensor(0.9988)


In [38]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # now passing a vector input

In [39]:
x

tensor([1., 2., 3.], requires_grad=True)

In [40]:
y = (x**2).mean()
y

tensor(4.6667, grad_fn=<MeanBackward0>)

In [41]:
y.backward()

In [42]:
x.grad  # now we get three gradients, one per element

tensor([0.6667, 1.3333, 2.0000])
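
These three values follow from y = (x₁² + x₂² + x₃²) / 3, whose partial derivative with respect to each element is 2·xᵢ / 3. A plain-Python check of that formula (the names `xs` and `expected` are illustrative):

```python
xs = [1.0, 2.0, 3.0]
n = len(xs)

# d/dx_i of mean(x**2) is 2 * x_i / n
expected = [2 * xi / n for xi in xs]
print([round(g, 4) for g in expected])  # [0.6667, 1.3333, 2.0]
```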

# 🔥 Clearing Gradients in PyTorch

---

## 🔹 Problem: Gradient Accumulation

In PyTorch:

When we call:

loss.backward()

Gradients are **added (accumulated)** to existing gradients.

They are NOT automatically replaced.

---

## 🔹 What Happens If We Don't Clear Gradients?

If we run backward multiple times:

1st backward → grad = g1  
2nd backward → grad = g1 + g2  
3rd backward → grad = g1 + g2 + g3  

Gradients keep accumulating.

This leads to:
- Wrong weight updates
- Incorrect training
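
The accumulation can be observed directly on a scalar. The sketch below reuses y = x² at x = 2 (so each backward pass contributes a gradient of 4); the variable names `first`, `second`, and `after_reset` are only for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass: dy/dx = 2x = 4
y = x ** 2
y.backward()
first = x.grad.item()

# Second backward pass on a fresh forward computation, without clearing:
# the new gradient (4) is added to the stored one
y = x ** 2
y.backward()
second = x.grad.item()

# Clear in place, then run once more
x.grad.zero_()
y = x ** 2
y.backward()
after_reset = x.grad.item()

print(first, second, after_reset)  # 4.0 8.0 4.0
```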

---

## 🔹 Why Does PyTorch Accumulate Gradients?

Because sometimes we intentionally:

- Accumulate gradients over multiple batches
- Use small batch sizes
- Simulate large batch training

So PyTorch does not reset gradients automatically.

---

## 🔹 Solution: Clear Gradients Before Backward

Before every backward pass:

```
optimizer.zero_grad()
```

This sets gradients to zero.

Correct training flow:

1. optimizer.zero_grad()
2. forward pass
3. loss calculation
4. loss.backward()
5. optimizer.step()
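
The five-step flow above might look like this for the scalar example used earlier in the notebook. This is a sketch under assumptions: the optimizer (`torch.optim.SGD`), the learning rate of 0.1, and the five iterations are arbitrary choices for illustration, and the loss is written inline without the clamp used in `binary_cross_entropy_loss`.

```python
import torch

# Same scalar setup as earlier in the notebook
x = torch.tensor(6.7)
y = torch.tensor(0.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

optimizer = torch.optim.SGD([w, b], lr=0.1)  # lr is an assumed value

losses = []
for step in range(5):
    optimizer.zero_grad()                     # 1. clear old gradients
    y_pred = torch.sigmoid(w * x + b)         # 2. forward pass
    loss = -(y * torch.log(y_pred)
             + (1 - y) * torch.log(1 - y_pred))  # 3. BCE loss
    loss.backward()                           # 4. compute gradients
    optimizer.step()                          # 5. update w and b
    losses.append(loss.item())

print(losses)
```

With these settings the recorded loss should fall from one step to the next, since each update moves `w` and `b` against their gradients.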

---

# 🔥 Important Understanding

If we run backward multiple times without clearing the gradients,  
the new gradients keep getting added to the old ones.  

That is why the gradients must be set to zero  
before every training step.


In [43]:
# clearing grad

x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [44]:
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [45]:
y.backward()

In [46]:
x.grad

tensor(4.)

In [47]:
x.grad.zero_()

tensor(0.)

# 🔥 Disable Gradient Tracking

## 🔹 Concept

During training:
- Gradients are needed
- Backward pass ON

During prediction:
- Gradients are not needed
- Backward pass OFF

---

## 🔹 How to Turn It Off?

```
with torch.no_grad():
    output = model(x)
```

---

## 🔹 Why Turn It Off?

- Saves memory
- The model runs faster
- No computation graph is built

---

In short:
Training → gradient ON  
Prediction → gradient OFF
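
A runnable sketch of the `torch.no_grad()` pattern shown above, using a scalar tensor in place of a model (there is no `model` object in this notebook, so `x ** 2` stands in for the forward computation):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Inside the context manager, operations are not recorded
with torch.no_grad():
    y = x ** 2

print(y.requires_grad)  # False
print(y.grad_fn)        # None

# Outside the block, tracking is active again
y2 = x ** 2
print(y2.requires_grad)  # True
```

Unlike `requires_grad_(False)`, this leaves `x` itself unchanged: tracking is only suspended for code inside the `with` block.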


In [48]:
# disable gradient tracking
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [49]:
y = x ** 2
y

tensor(4., grad_fn=<PowBackward0>)

In [50]:
y.backward()

In [51]:
x.grad

tensor(4.)

In [52]:
# option 1 - requires_grad_(False)
# option 2 - detach()
# option 3 - torch.no_grad()

In [53]:
x.requires_grad_(False)

tensor(2.)

In [54]:
x

tensor(2.)

In [55]:
y = x ** 2

In [56]:
y

tensor(4.)

In [57]:
y.backward()  # y above has no grad_fn, so calling backward() on it fails

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

In [58]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

## 🔹 detach()

`detach()` returns a tensor that shares the same data but is cut off from the computation graph, so operations on it are not tracked for gradients.


In [59]:
z = x.detach()
z


tensor(2.)

In [60]:
y = x ** 2

In [61]:
y

tensor(4., grad_fn=<PowBackward0>)

In [62]:
y1 = z ** 2
y1

tensor(4.)

In [63]:
y.backward()

In [64]:
y1.backward()

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

In [65]:
x = torch.tensor(2.0, requires_grad=True)
x

tensor(2., requires_grad=True)

In [66]:
y = x ** 2

In [67]:
y

tensor(4., grad_fn=<PowBackward0>)

In [68]:
y.backward()