#### Gradients are essential for model optimization. PyTorch provides the autograd package, which computes them for us.

In [1]:
import torch

In [2]:
x = torch.randn(3, requires_grad=True)
x

tensor([-0.0197,  0.9841, -0.2759], requires_grad=True)

In [3]:
y = x+2

In [4]:
y

tensor([1.9803, 2.9841, 1.7241], grad_fn=<AddBackward0>)

<ol>
    <li>Our inputs are "x" and the constant "2", our operation is "+", and our output is "y".</li>
    <li>With the technique called backpropagation, we can calculate the gradients.</li>
    <li>First, we do a forward pass: we apply the "+" operation and calculate the output "y".</li>
    <li>Because we set requires_grad=True on "x", PyTorch automatically creates and stores a backward function for this operation.</li>
    <li>This function is used during backpropagation to compute the gradients.</li>
    <li>So "y" has an attribute "grad_fn" that points to this gradient function; in this case it is called "AddBackward0".</li>
    <li>This function calculates the gradient of "y" with respect to "x" (dy/dx).</li>
    <li>This is why "grad_fn" appears when we print "y".</li>
</ol>
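The points above can be checked directly on the tensors. This is a minimal sketch that reuses the names `x` and `y` from the cells above on a fresh random tensor; `is_leaf` is an extra attribute not discussed in the notes.

```python
import torch

# Leaf tensor created by the user; autograd tracks operations on it.
x = torch.randn(3, requires_grad=True)

# Forward pass: the "+" operation records a backward node on y.
y = x + 2

print(x.is_leaf, x.grad_fn)        # a leaf tensor has no grad_fn
print(y.requires_grad)             # inherited from x
print(type(y.grad_fn).__name__)    # name of the recorded backward function
```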

<h3>Backpropagation</h3>
<p>Backpropagation is the algorithm for calculating the gradients of the cost function with respect to the weights. It propagates the error backward through the network, applying the chain rule layer by layer, to obtain the gradient of the cost function for each weight.</p>
<p>These gradients are then used to adjust the weights of the neural net so that the error (i.e. the loss) measured in the previous iteration is reduced.</p>
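To make the idea concrete, here is a small sketch that compares the gradient from autograd with the chain rule worked out by hand, and then takes one gradient-descent step. The one-parameter model, the values, and the learning rate `lr` are assumptions chosen for illustration.

```python
import torch

# Hypothetical one-parameter model: prediction = w * x, loss = (prediction - target)^2
x = torch.tensor(2.0)
target = torch.tensor(10.0)
w = torch.tensor(1.0, requires_grad=True)

loss = (w * x - target) ** 2   # forward pass
loss.backward()                # backward pass fills w.grad

# Chain rule by hand: dloss/dw = 2 * (w*x - target) * x
manual_grad = 2 * (w.detach() * x - target) * x
print(w.grad, manual_grad)

# One gradient-descent step using the computed gradient (lr is an assumed value).
lr = 0.01
with torch.no_grad():
    w -= lr * w.grad
print(w)
```

The update is wrapped in `torch.no_grad()` so that the step itself is not recorded in the graph.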

In [5]:
z = y*y*2
z

tensor([ 7.8434, 17.8091,  5.9450], grad_fn=<MulBackward0>)

In [6]:
z_mean = z.mean()
z_mean

tensor(10.5325, grad_fn=<MeanBackward0>)

In [7]:
# Calculate the gradient of z_mean with respect to x, i.e. d(z_mean)/dx.
# retain_graph=True keeps the computational graph so that .backward() can be called on it again later.
z_mean.backward(retain_graph=True)
x.grad

tensor([2.6404, 3.9787, 2.2988])

<ol>
    <li>In the background, backward() computes a vector-Jacobian product to get the gradients.</li>
    <li>We have the Jacobian matrix J of partial derivatives (J[i][j] = dz_i/dx_j), and we multiply it by a gradient vector v, giving v^T · J, which holds the final gradients we are interested in. For a scalar output such as z_mean, v is implicitly 1.</li>
    <li>This is an application of the chain rule.</li>
</ol>
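The vector-Jacobian product can be verified on a small input. In this sketch the helper `f` and the fixed input values are assumptions; `torch.autograd.functional.jacobian` is used only to build the full matrix for comparison, since `backward()` itself never materialises J.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Same computation as in the cells above: z = 2 * (x + 2)^2
    y = x + 2
    return y * y * 2

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.001])

# Full Jacobian J[i, j] = dz_i / dx_j (diagonal here, because f is elementwise).
J = jacobian(f, x.detach())

# Autograd computes v^T J directly.
z = f(x)
z.backward(v)

print(J)
print(v @ J)
print(x.grad)
```

Calling `z.mean().backward()` corresponds to the same product with every entry of `v` equal to 1/3.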

#### Note: backward() raises an error if no tensor in the graph has requires_grad=True, so keep it set to True on the tensors whose gradients you need.
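A short sketch of that failure, assuming a tensor `a` created without `requires_grad` (the names `a`, `out`, and `raised` are illustrative):

```python
import torch

a = torch.ones(3)        # requires_grad defaults to False
out = (a * 2).sum()      # nothing is tracked, so out has no grad_fn

try:
    out.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("backward() failed:", err)
```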

In [8]:
z

tensor([ 7.8434, 17.8091,  5.9450], grad_fn=<MulBackward0>)

In [9]:
# z.backward() on its own raises an error, because the gradient argument can be
# created implicitly only for scalar outputs. z is a vector, so we pass a vector v.
v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float32)

# In the background this computes a vector-Jacobian product.
# The call works only because z_mean.backward() above used retain_graph=True.
# Without it, z.backward(v) raises:
"""
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already 
been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). 
Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved 
tensors after calling backward.
"""
z.backward(v)

<h3>Here's the breakdown of the issue:</h3>

<p>Calling z_mean.backward() calculates the gradients for the computation that resulted in z_mean.</p>
<p>By default, PyTorch releases the intermediate values saved during the forward pass as soon as the first backward pass has finished.</p>
<p>z belongs to the same graph as z_mean, so z.backward(v) needs those intermediate values again. If they have already been freed, the call fails with the error "Trying to backward through the graph a second time".</p>
<b>Solution: pass retain_graph=True to the first backward() call.</b>
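The behaviour can be reproduced in a few lines. This sketch assumes fixed input values instead of random ones; the names `first` and `second_pass_failed` are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x + 2
z = y * y * 2
z_mean = z.mean()

# The first backward pass keeps the saved intermediate values alive.
z_mean.backward(retain_graph=True)
first = x.grad.clone()

# A second pass through the same graph now works.
# x.grad accumulates: the new gradient is added to the old one.
z.backward(torch.ones_like(z))
print(first)
print(x.grad)

# The pass above did not retain the graph, so a further pass fails.
try:
    z.backward(torch.ones_like(z))
    second_pass_failed = False
except RuntimeError:
    second_pass_failed = True
print(second_pass_failed)
```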

<h3>How to prevent PyTorch from tracking the history and creating the grad_fn attribute</h3>

Sometimes, for example when we update the weights inside the training loop, the operation should not be part of the gradient computation.

There are 3 options to do so:
1. x.requires_grad_(False) <-- in-place function
2. x.detach() <-- creates a new tensor that doesn't require the gradient
3. with torch.no_grad(): <-- operations inside the block are not tracked

In [10]:
x

tensor([-0.0197,  0.9841, -0.2759], requires_grad=True)

In [11]:
#x.requires_grad_(False)
# Output from a separate run (x is random, so the values differ): tensor([-0.9846, -0.4701, -0.7996])

In [12]:
x.detach()

tensor([-0.0197,  0.9841, -0.2759])

In [13]:
y

tensor([1.9803, 2.9841, 1.7241], grad_fn=<AddBackward0>)

In [14]:
with torch.no_grad():
    y = x+2
    print(y)

tensor([1.9803, 2.9841, 1.7241])
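The motivating case, a weight update that must stay out of the graph, might look like the sketch below. The tensor `w`, the dummy loss, and the learning rate `lr` are assumptions for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)
loss = (w * 3).sum()
loss.backward()

lr = 0.1  # assumed learning rate
with torch.no_grad():
    w -= lr * w.grad     # in-place update that is not recorded in the graph

print(w)                 # still a leaf tensor that requires grad
print(w.grad_fn)         # None: the update added no history

d = w.detach()           # shares storage with w but is cut off from the graph
print(d.requires_grad)
```

Because `detach()` shares storage, an in-place change to `d` would also change `w`.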


Whenever we call backward(), the gradient for a tensor is accumulated into its .grad attribute, so the values from successive calls are summed.

In [15]:
weights = torch.ones(4, requires_grad=True)

for epoch in range(5):
    model_output = (weights*3).sum()  # dummy operation that simulates some model output

    # Calculate the gradient
    model_output.backward()

    print(weights.grad)

    # Without the zero_() call below, every backward call would add to .grad.
    # For 5 iterations the printed values would then be:
    #   tensor([3., 3., 3., 3.])
    #   tensor([6., 6., 6., 6.])
    #   tensor([9., 9., 9., 9.])
    #   tensor([12., 12., 12., 12.])
    #   tensor([15., 15., 15., 15.])

    # To handle this, we must reset the gradients after each iteration
    weights.grad.zero_()
    

tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])


In [16]:
weights = torch.ones(4, requires_grad=True)

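optimizer = torch.optim.SGD([weights], lr=0.01)
optimizer.step()       # updates the weights from .grad (no effect here, since backward() has not been called yet)
optimizer.zero_grad()  # resets the gradients, like weights.grad.zero_() above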
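In a full loop the three calls usually appear in the order backward, step, zero_grad. A minimal sketch, reusing the dummy model output from the accumulation example and assuming 3 iterations:

```python
import torch

weights = torch.ones(4, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.01)

for epoch in range(3):
    model_output = (weights * 3).sum()  # same dummy "model" as above
    model_output.backward()             # gradient is 3 for every weight
    optimizer.step()                    # weights <- weights - lr * grad
    optimizer.zero_grad()               # reset before the next iteration

print(weights)
```

Each step lowers every weight by 0.01 * 3 = 0.03, so after three iterations the weights are close to 0.91.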

### Notes

#### requires_grad

In PyTorch, the requires_grad attribute is a flag that is used to indicate whether or not gradients should be computed for a tensor during backpropagation. 

1. When set to True for a tensor, PyTorch tracks the operations performed on that tensor during the forward pass of your computational graph.

2. This allows PyTorch to calculate the gradients of the loss function with respect to that tensor during the backward pass.

3. Gradients are essential for training neural networks, as they indicate how adjustments to the tensor's values can improve the model's performance.

4. PyTorch dynamically builds a computational graph during the forward pass, tracking operations involving tensors with requires_grad=True. 
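The sketch below shows how the flag propagates through operations. The tensor names `a`, `b`, `c`, and `d` are illustrative.

```python
import torch

a = torch.ones(2, requires_grad=True)
b = torch.ones(2)                  # requires_grad defaults to False

c = a + b                          # tracked, because one input requires grad
d = b * 2                          # not tracked

print(c.requires_grad, c.grad_fn)
print(d.requires_grad, d.grad_fn)

c.sum().backward()
print(a.grad)                      # populated by the backward pass
print(b.grad)                      # None: no gradient was requested for b
```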

### Here's a simple PyTorch example demonstrating backpropagation:

In [17]:
# Define input data and ground truth
# (requires_grad=True on x is optional here; only w and b need gradients for training)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y_true = torch.tensor([2.0], dtype=torch.float32)

# Initialize model parameters (weights and biases)
w = torch.randn(3, requires_grad=True)  # Weights
b = torch.randn(1, requires_grad=True)  # Bias

# Define model prediction and loss function
def model(x):
    return torch.sum(w * x) + b  # Linear regression model

def loss_fn(y_pred, y_true):
    return torch.mean((y_pred - y_true) ** 2)  # Mean squared error loss

# Forward pass
y_pred = model(x)

# Compute the loss
loss = loss_fn(y_pred, y_true)

# Backward pass (compute gradients)
loss.backward()

# Print gradients
print("Gradients w.r.t. weights (w):", w.grad)
print("Gradient w.r.t. bias (b):", b.grad)

Gradients w.r.t. weights (w): tensor([ -7.4605, -14.9209, -22.3814])
Gradient w.r.t. bias (b): tensor([-7.4605])
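These gradients can be checked against the closed form for a squared error: dL/dw = 2 * (y_pred - y_true) * x and dL/db = 2 * (y_pred - y_true). This agrees with the printed output, where the weight gradients are the bias gradient scaled by 1, 2, and 3. The sketch below repeats the computation with an arbitrary seed; the names `residual`, `expected_w`, and `expected_b` are illustrative.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

x = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor([2.0])
w = torch.randn(3, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Same model and loss as in the cell above
y_pred = torch.sum(w * x) + b
loss = torch.mean((y_pred - y_true) ** 2)
loss.backward()

# Closed-form gradients of the squared error
residual = (y_pred - y_true).detach()
expected_w = 2 * residual * x
expected_b = 2 * residual

print(torch.allclose(w.grad, expected_w))
print(torch.allclose(b.grad, expected_b))
```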
