## Notes

### Chain Rule

1. x : input
2. z : output
3. y : output of a(x) and input to b(y)
4. a(x) and b(y) are functions/operations
5. |   | : Nodes

x --> | a(x) | --> y --> | b(y) | --> z

Now we want to minimize z, so we need to calculate the derivative dz/dx.

According to the chain rule: dz/dx = dz/dy . dy/dx
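
As a quick sanity check, the chain rule can be compared against a numerical derivative. The sketch below assumes two made-up functions, a(x) = 3x + 1 and b(y) = y^2; they are not from these notes and are chosen only for illustration.

```python
# Hypothetical example functions (assumptions, not from the notes)
def a(x):
    return 3.0 * x + 1.0   # first node: y = a(x)

def b(y):
    return y ** 2          # second node: z = b(y)

def da_dx(x):
    return 3.0             # local gradient dy/dx

def db_dy(y):
    return 2.0 * y         # local gradient dz/dy

x = 2.0
y = a(x)                   # forward through the first node
z = b(y)                   # forward through the second node

# Chain rule: dz/dx = dz/dy . dy/dx
dz_dx = db_dy(y) * da_dx(x)

# Numerical check with a central finite difference
eps = 1e-6
dz_dx_numeric = (b(a(x + eps)) - b(a(x - eps))) / (2 * eps)

print(dz_dx, dz_dx_numeric)
```

Both numbers should agree up to floating-point error.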



### Computational Graph

For every operation we perform on tensors that require gradients, PyTorch builds a computational graph for us.
At each node we apply one operation/function to some inputs and get an output.

Here, x and y are the inputs; we multiply them and get z as the output.

<p>x ---↘............................>( dz/dx = d(x.y)/dx = y) [Local gradients]</p>
<p>...........| f = x.y | ---> z</p>
<p>y ---↗............................>( dz/dy = d(x.y)/dy = x) [Local gradients]</p>

At these nodes we can calculate local gradients, which we later combine with the chain rule to get the final gradient.
We need local gradients because the graph usually contains more operations, and at the very end we calculate a loss function that we want to minimize, so we have to calculate the gradient of this loss w.r.t. x.

So, d(loss)/dx = d(loss)/dz . dz/dx
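
The relation above can be checked with autograd. This is a minimal sketch: the input values and the loss `z ** 2` are assumptions chosen only for illustration, and `retain_grad()` is used so that d(loss)/dz stays available on the intermediate tensor `z`.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = x * y            # the node f = x.y
z.retain_grad()      # keep d(loss)/dz on this non-leaf tensor for inspection
loss = z ** 2        # hypothetical loss, not from the notes

loss.backward()

# Local gradients of the multiply node: dz/dx = y and dz/dy = x.
# Chain rule: d(loss)/dx = d(loss)/dz . dz/dx
print(z.grad)                          # d(loss)/dz = 2.z
print(x.grad, z.grad * y.detach())     # both are d(loss)/dx
print(y.grad, z.grad * x.detach())     # both are d(loss)/dy
```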


### How to solve derivatives like d(x.y)/dy above

d(x.y)/dy

= x.dy/dy + y.dx/dy (product rule)

(Since x is treated as a constant with respect to y, dx/dy = 0, and dy/dy = 1)

= x.1 + y.0

= x

By the same argument with the roles of x and y swapped, d(x.y)/dx = y.

### The whole concept now contains 3 steps:
1. Forward pass : Compute Loss
2. Compute local gradients
3. Backward pass : Compute d(loss)/d(weights) using the chain rule (i.e. the gradient of the loss w.r.t. our weights or parameters)

### Linear Regression in terms of NN
1. ŷ = w.x
2. loss function (squared error) = (ŷ - y)^2 = (w.x - y)^2
3. ŷ : predicted value
4. y : actual value

<p>x ---↘</p>
<p>______| * | ---> ŷ ---> | - (minus) | --s--> | ^2 | --> Loss</p>
<p>w ---↗________y ---↗</p>

We want to minimize the loss, so we need the derivative of the loss w.r.t. our weights.

So, we apply a forward pass and get the loss.

Then we calculate the local gradients (dŷ/dx, dŷ/dw, ds/dŷ, d(loss)/ds) at each node.

And then we do a backward pass: d(loss)/ds -> ds/dŷ -> dŷ/dw (for parameter w). Starting from d(loss)/ds, we use the chain rule to calculate d(loss)/dŷ and then the final d(loss)/dw.

Notice we don't need the gradients w.r.t. x and y because these are fixed data, not parameters we update.
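
The three steps can be written out by hand in plain Python, without autograd. This is only a sketch; the sample values for `x`, `y` and `w` are assumptions chosen for illustration.

```python
# Sample values (assumed for illustration)
x, y, w = 1.0, 2.0, 1.0

# 1. Forward pass: compute the loss
y_hat = w * x          # multiply node
s = y_hat - y          # minus node
loss = s ** 2          # square node

# 2. Local gradients at each node
dloss_ds = 2.0 * s     # d(s^2)/ds
ds_dyhat = 1.0         # d(ŷ - y)/dŷ
dyhat_dw = x           # d(w.x)/dw

# 3. Backward pass: chain the local gradients from the loss back to w
dloss_dyhat = dloss_ds * ds_dyhat
dloss_dw = dloss_dyhat * dyhat_dw

print(loss, dloss_dw)
```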

### Implementation

In [1]:
import torch

In [2]:
x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)

In [3]:
# Forward pass and compute the loss
y_hat = w*x
loss = (y_hat-y)**2
loss

tensor(1., grad_fn=<PowBackward0>)

In [4]:
# Backward pass
loss.backward()
w.grad

tensor(-2.)

In [5]:
# Our next steps would be to update the weights and run a couple of iterations
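
One possible way to carry out those next steps is a plain gradient-descent loop. This is a sketch, not the notes' own implementation: the learning rate and the number of iterations are assumed values.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)

learning_rate = 0.1   # assumed value
n_iters = 20          # assumed value

for _ in range(n_iters):
    # Forward pass and loss
    y_hat = w * x
    loss = (y_hat - y) ** 2

    # Backward pass: fills w.grad with d(loss)/dw
    loss.backward()

    # Update the weight without recording the update in the graph
    with torch.no_grad():
        w -= learning_rate * w.grad

    # Gradients accumulate across backward() calls, so reset them
    w.grad.zero_()

print(w.item(), loss.item())
```

With these settings `w` should move from 1 toward 2, the value at which ŷ = w.x matches y.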

### Let's find out how w.grad is -2

1. x = 1
2. y = 2
3. w = 1
4. ŷ = w.x = 1
5. s = ŷ - y = 1 - 2 = -1
6. loss = s^2 = 1

#### Equation 1 - solving for d(loss)/dŷ
d(loss)/dŷ

= d(loss)/ds . ds/dŷ

= d(s^2)/ds . d(ŷ - y)/dŷ

= 2.s . 1 (with s = -1)

= -2

#### Equation 2 - solving for d(loss)/dw
d(loss)/dw

= d(loss)/dŷ . dŷ/dw

= -2 . d(w.x)/dw

= -2.x

= -2

#### Hence w.grad is -2, matching the output of the code above
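
The hand derivation can also be cross-checked numerically. The sketch below combines the two derivation steps into the closed form d(loss)/dw = 2.(w.x - y).x and compares it with a central finite difference; the helper names `loss_fn` and `analytic_grad` are hypothetical.

```python
def loss_fn(w, x=1.0, y=2.0):
    # Squared error for ŷ = w.x
    return (w * x - y) ** 2

def analytic_grad(w, x=1.0, y=2.0):
    # d(loss)/dw = d(loss)/dŷ . dŷ/dw = 2.(w.x - y) . x
    return 2.0 * (w * x - y) * x

w = 1.0
eps = 1e-6

# Central finite difference approximation of d(loss)/dw
numeric_grad = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

print(analytic_grad(w), numeric_grad)
```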