As we’ll explain in detail soon, to learn how each input, weight, and bias affects the neuron’s output and, ultimately, the loss, we need to calculate
the derivative of each operation performed during the forward pass, both in the neuron and across the whole model. 
To do that, we’ll use the chain rule.

The partial derivative measures how much impact a single input has on a function’s output. 
The method for calculating a partial derivative is the same as for the derivatives explained in the previous chapter; 
we simply repeat the process for each of the independent inputs, treating the other inputs as constants.
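
As a minimal sketch of this idea, the snippet below estimates each partial derivative numerically with a central difference: one input is nudged while the others stay fixed. The function `f`, the helper `partial_derivative`, and the step size `h` are illustrative assumptions and do not come from the text.

```python
def f(x, y):
    # Hypothetical example function: f(x, y) = 2x + 3y^2
    return 2 * x + 3 * y ** 2


def partial_derivative(func, args, index, h=1e-6):
    """Central-difference estimate of the partial derivative of func
    with respect to args[index]; all other arguments are held constant."""
    up = list(args)
    down = list(args)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)


point = (1.0, 2.0)
df_dx = partial_derivative(f, point, 0)  # analytic value: 2
df_dy = partial_derivative(f, point, 1)  # analytic value: 6 * y = 12
print(df_dx, df_dy)
```

The same helper is called once per input, which mirrors the "repeat the process for each independent input" description.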

The gradient in machine learning is a vector that points in the direction of the steepest increase of a function. 
It is often used to optimize the parameters of a neural network by following the negative gradient towards the minimum of the loss function. 
The gradient is calculated by taking the partial derivatives of the function with respect to each parameter. 
The learning rate does not change the gradient itself; it determines the size of the steps taken along the negative gradient. 
Some common optimization methods that use the gradient are Stochastic Gradient Descent, AdaGrad, RMSProp, and Adam.
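
To make the role of the gradient and the learning rate concrete, here is a small sketch of plain gradient descent on a made-up loss. The loss function, its `TARGET` minimum, the starting point, and the learning rate value are all assumptions chosen for illustration; this is not one of the named optimizers.

```python
TARGET = [3.0, -1.0]  # hypothetical location of the minimum


def loss(params):
    # Simple bowl-shaped loss: sum of squared distances to TARGET
    return sum((p - t) ** 2 for p, t in zip(params, TARGET))


def loss_gradient(params):
    # Vector of partial derivatives: d loss / d p_i = 2 * (p_i - t_i)
    return [2 * (p - t) for p, t in zip(params, TARGET)]


params = [0.0, 0.0]
learning_rate = 0.1  # scales the step size; the gradient itself is unchanged

for _ in range(100):
    grad = loss_gradient(params)
    # Step against the gradient, in the direction of steepest decrease
    params = [p - learning_rate * g for p, g in zip(params, grad)]

print(params, loss(params))
```

With these assumed values, the parameters move steadily toward `TARGET` and the loss shrinks toward zero.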

# Partial Derivative
![partial_derivative.png](attachment:88dd9c55-0369-4ea4-ac41-6bac0ce61e43.png)

# Partial Derivative of Sum
![pd_sum.png](attachment:4fa7a6e7-945c-4cd9-9048-487af9a7151d.png)

# Partial Derivative Sum Example 1
![pd_sum_ex.png](attachment:dda0142f-8e71-4221-8fea-c8760d1c477a.png)

# Partial Derivative Sum Example 2
![pd_sum_ex1.png](attachment:6f86b2ef-d7e2-4406-9b8b-8cd7dd9483f3.png)

# Partial Derivative of Multiplication
![pd_mul.png](attachment:4ad6a282-5b4c-4eec-89f6-c70d9e358872.png)

# Partial Derivative of Add and Multiplication Example
![pd_mul_x.png](attachment:421d1e33-b12f-4d3a-baf8-0e59013467c8.png)
![pd_mul_y.png](attachment:32dc67ee-1420-4407-b073-9a0ae654da23.png)
![pd_mul_z.png](attachment:950a4291-a424-435f-8dcc-b9275a0ebd96.png)
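
The sum and multiplication rules can be checked against a numerical estimate. The sketch below uses a hypothetical function of three variables (it may differ from the one in the images), writes out its partial derivatives by hand, and compares them with central differences.

```python
def f(x, y, z):
    # Hypothetical function mixing sums and products
    return 3 * x ** 3 * z - y ** 2 + 5 * z + 2 * y * z


def analytic_partials(x, y, z):
    # Each term is differentiated with the other variables treated as constants
    df_dx = 9 * x ** 2 * z         # only 3x^3 z depends on x
    df_dy = -2 * y + 2 * z         # from -y^2 and 2yz
    df_dz = 3 * x ** 3 + 5 + 2 * y  # from 3x^3 z, 5z and 2yz
    return df_dx, df_dy, df_dz


def numeric_partials(x, y, z, h=1e-6):
    # Central differences, one variable nudged at a time
    df_dx = (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)
    df_dy = (f(x, y + h, z) - f(x, y - h, z)) / (2 * h)
    df_dz = (f(x, y, z + h) - f(x, y, z - h)) / (2 * h)
    return df_dx, df_dy, df_dz


analytic = analytic_partials(1.0, 2.0, 3.0)
numeric = numeric_partials(1.0, 2.0, 3.0)
print(analytic)
print(numeric)
```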

We’re learning about partial derivatives because we’ll soon be calculating them for
multivariate functions, one example of which is the neuron. From the code perspective,
the forward method of the Dense layer class receives a single variable: the input array,
containing either a batch of samples or the outputs from the previous layer. From the math
perspective, each value in this array is a separate input, so there are as many inputs as
there are values in the array. For example, if we pass a vector of 4 values to the neuron,
it’s a single variable in the code but 4 separate inputs in the equation. This forms a
function that takes multiple inputs. To learn how much each input affects the function’s
output, we’ll need to calculate the partial derivative of this function with respect to
each of its inputs, which we’ll explain in detail in the next chapter.
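
The "one variable in code, four inputs in the math" point can be sketched as follows. The weights, bias, and input values are made up, and the neuron here is just a weighted sum plus bias with no activation function. Because that function is linear, nudging a single input changes the output in proportion to the matching weight.

```python
inputs = [1.0, -2.0, 3.0, 0.5]   # one list in code, four separate inputs in the math
weights = [0.2, 0.8, -0.5, 1.0]  # hypothetical weights
bias = 2.0                        # hypothetical bias


def neuron(x):
    # Weighted sum of the inputs plus the bias
    return sum(xi * wi for xi, wi in zip(x, weights)) + bias


base = neuron(inputs)
delta = 0.01
changes = []
for i in range(len(inputs)):
    nudged = list(inputs)
    nudged[i] += delta  # change only input i
    # Rate of change of the output with respect to input i
    changes.append((neuron(nudged) - base) / delta)

print(changes)  # each entry is close to the corresponding weight
```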

# The Partial Derivative of Max
![pd_max_variable.png](attachment:3001781f-dbcf-4fc0-8f93-48115efb60e1.png)

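Here $\frac{\partial}{\partial x}\max(x, y) = 1$ when $x > y$, because then $\max(x, y) = x$;
otherwise $\frac{\partial}{\partial x}\max(x, y) = 0$ when $y > x$, because then $\max(x, y) = y$, which does not depend on $x$. At $x = y$ the derivative is not defined.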
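
In code, this derivative reduces to a comparison. The sketch below is an assumption-laden illustration: the function names are made up, and the tie `x == y` is mapped to 0 purely as a convention chosen here. Setting `y` to 0 gives the derivative of `max(x, 0)`, which is the ReLU function.

```python
def d_max_dx(x, y):
    # Partial derivative of max(x, y) with respect to x:
    # 1 when x is the larger argument, otherwise 0.
    # The tie x == y returns 0 here by convention (an assumption).
    return 1.0 if x > y else 0.0


def d_relu(x):
    # ReLU is max(x, 0), so its derivative is the special case y = 0
    return d_max_dx(x, 0.0)


print(d_max_dx(3.0, 1.0), d_max_dx(1.0, 3.0))
print(d_relu(2.0), d_relu(-2.0))
```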

# The Gradient
As we mentioned at the beginning of this chapter, the gradient is a vector composed of all of the
partial derivatives of a function, calculated with respect to each input variable.
Let’s return to one of the partial derivatives of the sum operation that we calculated earlier:
![pd_gradient_example.png](attachment:9b1ece22-f1c8-4768-b450-602aa9aba119.png)

If we calculate all of the partial derivatives, we can form a gradient of the function. Using
different notations, it looks as follows:

![pd_gradient_formula.png](attachment:cbca7df9-644a-40d6-baef-9b8fed08043f.png)

That’s all we have to know about the gradient: it’s a vector of all of the possible partial
derivatives of the function, and we denote it with ∇, the nabla symbol, which looks like an
inverted delta.
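
A gradient can therefore be built by collecting one partial derivative per input. The helper below is a hypothetical sketch that does this numerically; `numerical_gradient`, `add`, and `multiply` are names introduced only for this illustration.

```python
def numerical_gradient(func, point, h=1e-6):
    # Collect the partial derivative with respect to every input into one list
    gradient = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        gradient.append((func(*up) - func(*down)) / (2 * h))
    return gradient


def add(x, y):
    return x + y


def multiply(x, y):
    return x * y


grad_add = numerical_gradient(add, [3.0, 4.0])       # analytic gradient: [1, 1]
grad_mul = numerical_gradient(multiply, [3.0, 4.0])  # analytic gradient: [y, x] = [4, 3]
print(grad_add, grad_mul)
```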

We’ll be using derivatives of single-parameter functions and gradients of multivariate functions,
together with the chain rule, to perform the backward pass, the part of model training that
computes the gradients that gradient descent relies on.

# The Chain Rule
During the forward pass, we’re passing the data through the neurons, then through the activation
function, then through the neurons in the next layer, then through another activation function, and
so on. We’re calling a function with an input parameter, taking an output, and using that output as
an input to another function. For this simple example, let’s take 2 functions, f and g:

![cr_1.png](attachment:ba3a7040-9073-4801-aa4a-85baf6b732df.png)

x is the input data, z is the output of the function f and also the input to the function g, and y is the
output of the function g. We could write the same calculation as:

![cr2.png](attachment:d5c8f6c6-0c17-40fe-8b64-e2be0ace6d6f.png)

In this form, we do not use the intermediate z variable, showing that function g takes the output of
function f directly as an input. This does not differ much from the above 2 equations but shows an
important property of functions chained this way — since x is an input to the function f and then
the output of the function f is an input to the function g, the output of the function g is influenced
by x in some way, so there must exist a derivative which can inform us of this influence.
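
The two ways of writing the chain, and the influence of x on the final output, can be sketched in a few lines. The bodies of `f` and `g` are arbitrary choices made for this illustration.

```python
def f(x):
    # Hypothetical inner function
    return 2 * x + 1


def g(z):
    # Hypothetical outer function
    return z ** 2


x = 1.5
z = f(x)            # intermediate value: output of f, input to g
y = g(z)            # two-step form
y_direct = g(f(x))  # nested form without the intermediate variable

# Nudging x changes the output of g, even though g never sees x directly
y_nudged = g(f(x + 0.001))
print(y, y_direct, y_nudged - y)
```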

The forward pass through our model is a chain of functions similar to these examples. We
pass in samples, and the data flows through all of the layers and activation functions to form an
output. Let’s bring back the equation and the code of the example model from chapter 1:


![loss_func.png](attachment:049e8409-df40-4124-b29d-f90d1b1ea948.png)

![neural_loss_code.png](attachment:ade5f0d1-67f0-4c01-b226-8a57b957eb78.png)

If you look closely, you’ll see that we are presenting the loss as a big function, or a chain of
functions, of multiple inputs — input data, weights, and biases. We are passing input data to the
first layer where we also have that layer’s weights and biases, then the outputs flow through the
ReLU activation function, and another layer, which brings more weights and biases, and another
ReLU activation, up to the end — the output layer and softmax activation. The model output,
along with the targets, is passed to the loss function, which returns the model’s error. We can look
at the loss function not only as a function that takes the model’s output and targets as parameters
to produce the error, but also as a function that takes targets, samples, and all of the weights and
biases as inputs if we chain all of the functions performed during the forward pass as we’ve just
shown in the images. To reduce the loss, we need to learn how each weight and bias impacts it.


How do we do that for a chain of functions? By using the chain rule. This rule says that the derivative
of a chain of functions is the product of the derivatives of all of the functions in the chain, each evaluated at its own input, for example:


# Normal Derivative using Chain Rule 
![normal_derivative_chainrule.png](attachment:971180af-1120-4ef1-9a29-bb735115512c.png)

# Partial Derivative using Chain Rule
![pd_using_chainrule.png](attachment:c56192ea-8abc-4f68-8527-0e44bd4dbf42.png)
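
As a hedged illustration of the chain rule with partial derivatives, the sketch below treats a single neuron followed by ReLU as a chain of two functions, and multiplies the derivative of the outer function by the partial derivatives of the inner one. The input, weight, and bias values and all function names are assumptions made for this example.

```python
def relu(z):
    return max(z, 0.0)


def forward(inputs, weights, bias):
    # Inner function: weighted sum plus bias; outer function: ReLU
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)


def input_partials_chain_rule(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    dy_dz = 1.0 if z > 0 else 0.0  # derivative of ReLU evaluated at z
    # dz/dx_i is w_i, so dy/dx_i = dy/dz * dz/dx_i
    return [dy_dz * w for w in weights]


def input_partials_numeric(inputs, weights, bias, h=1e-6):
    # Central-difference check of the same partial derivatives
    partials = []
    for i in range(len(inputs)):
        up = list(inputs)
        down = list(inputs)
        up[i] += h
        down[i] -= h
        diff = forward(up, weights, bias) - forward(down, weights, bias)
        partials.append(diff / (2 * h))
    return partials


inputs = [1.0, -2.0, 3.0]
weights = [-3.0, -1.0, 2.0]
bias = 1.0

chain = input_partials_chain_rule(inputs, weights, bias)
numeric = input_partials_numeric(inputs, weights, bias)
print(chain)
print(numeric)
```

When the weighted sum is negative, the ReLU factor is 0 and every partial derivative with respect to the inputs becomes 0 as well.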

# Normal Derivative Chain Rule example
Let $h(x) = 3(2x^2)^5$ and substitute $y = g(x) = 2x^2$,
so that $h(x) = 3y^5 = f(g(x))$ with $f(y) = 3y^5$.
Applying the chain rule:


![cr_sol.png](attachment:a1237fec-0b34-4018-a4d1-a35e1cc06930.png)

![cr_ans.png](attachment:f5e75b10-24f1-4c98-8a0f-a33f7888ee30.png)
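
The worked example can be sanity-checked numerically. This sketch assumes the function from the example, h(x) = 3(2x^2)^5, computes its derivative as the product of the outer derivative 15y^4 and the inner derivative 4x, and compares it with the expanded form 960x^9 (since h(x) = 96x^10) and with a central-difference estimate. The function names and the test point are choices made for this illustration.

```python
import math


def h(x):
    return 3 * (2 * x ** 2) ** 5


def dh_chain_rule(x):
    y = 2 * x ** 2        # inner function g(x)
    df_dy = 15 * y ** 4   # derivative of the outer function f(y) = 3y^5
    dy_dx = 4 * x         # derivative of the inner function g(x) = 2x^2
    return df_dy * dy_dx


def dh_expanded(x):
    # h(x) = 96 x^10, so h'(x) = 960 x^9
    return 960 * x ** 9


def dh_numeric(x, step=1e-6):
    # Central-difference estimate of h'(x)
    return (h(x + step) - h(x - step)) / (2 * step)


x = 1.5
print(dh_chain_rule(x), dh_expanded(x), dh_numeric(x))
```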