# Chapter 8: Gradients, Partial Derivatives, and the Chain Rule

Our neural network consists of neurons. Each neuron takes multiple inputs, multiplies each input by its corresponding weight, sums the products, and adds the bias.
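As a minimal sketch of that computation, the snippet below evaluates a single neuron in plain Python. The input, weight, and bias values are made up purely for illustration.

```python
# Hypothetical values chosen only for illustration.
inputs = [1.0, -2.0, 3.0]
weights = [-3.0, -1.0, 2.0]
bias = 1.0

# Multiply each input by its weight, sum the products, then add the bias.
neuron_output = sum(x * w for x, w in zip(inputs, weights)) + bias

print(neuron_output)  # 6.0
```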

To learn the impact of each input, weight, and bias on the neuron output and on the loss function, we need to calculate the derivative of each operation performed during the forward pass, both in the neuron and in the whole model.

To do that we need to use the chain rule.

## 8.1. The Partial Derivative

The partial derivative measures how much impact a single input has on a function’s output.

Euler’s notation $∂$ is used instead of Leibniz’s notation $d$.

$$
f(x,y,z) \quad \to \quad \frac{∂}{∂x} f(x,y,z), \frac{∂}{∂y} f(x,y,z), \frac{∂}{∂z} f(x,y,z)
$$

The gradient is a vector with one entry per input, where each entry is the partial derivative of the function with respect to that input.

## 8.2. The Partial Derivative of a Sum

Calculating the partial derivative with respect to a given input means differentiating as we would for a function of one input, while treating all other inputs as constants.

$$
\begin{align}
f(x,y) = x + y \quad \to \quad & \frac{∂}{∂x} f(x,y) = \frac{∂}{∂x} \big[ x+y \big] = \frac{∂}{∂x} x + \frac{∂}{∂x} y = 1 + 0 = 1 \\
& \frac{∂}{∂y} f(x,y) = \frac{∂}{∂y} \big[ x+y \big] = \frac{∂}{∂y} x + \frac{∂}{∂y} y = 0 + 1 = 1
\end{align}
$$

$$
\begin{align}
f(x,y) = 2x + 3y^2 \quad \to \quad \frac{∂}{∂x} f(x,y) & = \frac{∂}{∂x} \big[ 2x + 3y^2 \big] = \frac{∂}{∂x} 2x + \frac{∂}{∂x} 3y^2 \\
& = 2 \cdot \frac{∂}{∂x} x + 3 \cdot \frac{∂}{∂x} y^2 = 2 \cdot 1 + 3 \cdot 0 = 2 \\
\frac{∂}{∂y} f(x,y) & = \frac{∂}{∂y} \big[ 2x + 3y^2 \big] = \frac{∂}{∂y} 2x + \frac{∂}{∂y} 3y^2 \\
& =  2 \cdot \frac{∂}{∂y} x + 3 \cdot \frac{∂}{∂y} y^2 = 2 \cdot 0 + 3 \cdot 2 y^1 = 6y
\end{align}
$$

$$
\begin{align}
f(x,y)  = 3x^3 - y^2 + 5x + 2 \quad & \to \\
\frac{∂}{∂x} f(x,y) & = \frac{∂}{∂x} \big[ 3x^3 - y^2 + 5x + 2 \big] = \frac{∂}{∂x} 3x^3 - \frac{∂}{∂x} y^2 + \frac{∂}{∂x} 5x + \frac{∂}{∂x} 2 \\
& = 3 \cdot \frac{∂}{∂x} x^3 - \frac{∂}{∂x} y^2 + 5 \cdot \frac{∂}{∂x} x + \frac{∂}{∂x} 2 = 3 \cdot 3x^2 - 0 + 5 \cdot 1 + 0 = 9x^2 +5\\
\frac{∂}{∂y} f(x,y) & = \frac{∂}{∂y} \big[ 3x^3 - y^2 + 5x + 2 \big] = \frac{∂}{∂y} 3x^3 - \frac{∂}{∂y} y^2 + \frac{∂}{∂y} 5x + \frac{∂}{∂y} 2  \\
& = 3 \cdot \frac{∂}{∂y} x^3 - \frac{∂}{∂y} y^2 + 5 \cdot \frac{∂}{∂y} x + \frac{∂}{∂y} 2 = 3 \cdot 0 - 2 y^1 + 5 \cdot 0 + 0 = -2y
\end{align}
$$
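One way to sanity-check results like these is to compare them with a numerical estimate. The sketch below uses a central finite difference on the last example; the helper name `numerical_partial`, the step size, and the evaluation point are assumptions made for this illustration.

```python
def numerical_partial(f, point, index, h=1e-6):
    """Estimate the partial derivative of f at `point` with respect to
    the argument at position `index`, using a central difference."""
    up = list(point)
    down = list(point)
    up[index] += h    # nudge only the chosen input up...
    down[index] -= h  # ...and down; all other inputs stay constant
    return (f(*up) - f(*down)) / (2 * h)


def f(x, y):
    return 3 * x**3 - y**2 + 5 * x + 2


x, y = 2.0, 3.0  # arbitrary evaluation point

df_dx = numerical_partial(f, (x, y), 0)
df_dy = numerical_partial(f, (x, y), 1)

# Compare the estimates with the analytic results 9x^2 + 5 and -2y.
print(df_dx, 9 * x**2 + 5)
print(df_dy, -2 * y)
```

The estimates should agree with the analytic values up to a small numerical error.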


## 8.3. The Partial Derivative of Multiplication

As before, we treat the other independent variables as constants, which lets us move them outside of the derivative.

$$
\begin{align}
f(x,y) = x \cdot y \quad \to \quad & \frac{∂}{∂x} f(x,y) = \frac{∂}{∂x} \big[ x \cdot y \big] = y \frac{∂}{∂x} x  = y \cdot 1 = y \\
& \frac{∂}{∂y} f(x,y) = \frac{∂}{∂y} \big[ x \cdot y \big] = x \frac{∂}{∂y} y = x \cdot 1 = x
\end{align}
$$
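This is the rule that applies to every weighted input of a neuron. As a small sketch, assume a hypothetical input `x` and weight `w`; the values are arbitrary.

```python
x = 3.0   # hypothetical input value
w = -2.0  # hypothetical weight value

# Treating the other factor as a constant, each partial derivative
# of the product x * w is simply the other factor.
dproduct_dx = w
dproduct_dw = x

# Sanity check of the derivative with respect to x
# using a small central finite difference.
h = 1e-6
estimate = ((x + h) * w - (x - h) * w) / (2 * h)

print(dproduct_dx, estimate)
```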

Let’s introduce a third input variable and add multiplication of variables for another example:

$$
\begin{align}
f(x,y, z)  = 3x^3z - y^2 + 5z + 2yz \quad & \to \\
\frac{∂}{∂x} f(x,y, z) & = \frac{∂}{∂x} \big[ 3x^3z - y^2 + 5z + 2yz \big] \\
& = \frac{∂}{∂x} 3x^3z - \frac{∂}{∂x} y^2 + \frac{∂}{∂x} 5z  + \frac{∂}{∂x} 2yz \\
& = 3z \cdot \frac{∂}{∂x} x^3 - \frac{∂}{∂x} y^2 + 5 \cdot \frac{∂}{∂x} z  + 2 \cdot \frac{∂}{∂x} yz \\
& = 3z \cdot 3x^2 - 0 + 5 \cdot 0 + 2 \cdot 0 = 9x^2 z
\end{align}
$$

$$
\begin{align}
f(x,y, z)  = 3x^3z - y^2 + 5z + 2yz \quad & \to \\
\frac{∂}{∂y} f(x,y, z) & = \frac{∂}{∂y} \big[ 3x^3z - y^2 + 5z + 2yz \big] \\
& = \frac{∂}{∂y} 3x^3z - \frac{∂}{∂y} y^2 + \frac{∂}{∂y} 5z  + \frac{∂}{∂y} 2yz \\
& = 3 \cdot \frac{∂}{∂y} x^3z - \frac{∂}{∂y} y^2 + 5 \cdot \frac{∂}{∂y}z  + 2z \cdot \frac{∂}{∂y} y \\
& = 3 \cdot 0 - 2y + 5 \cdot 0 + 2z \cdot 1 = -2y + 2z
\end{align}
$$

$$
\begin{align}
f(x,y, z)  = 3x^3z - y^2 + 5z + 2yz \quad & \to \\
\frac{∂}{∂z} f(x,y, z) & = \frac{∂}{∂z} \big[ 3x^3z - y^2 + 5z + 2yz \big] \\
& = \frac{∂}{∂z} 3x^3z - \frac{∂}{∂z} y^2 + \frac{∂}{∂z} 5z  + \frac{∂}{∂z} 2yz \\
& = 3x^3 \cdot \frac{∂}{∂z} z - \frac{∂}{∂z} y^2 + 5 \cdot \frac{∂}{∂z} z  + 2y \cdot \frac{∂}{∂z} z  \\
& = 3x^3 \cdot 1 - 0 + 5 \cdot 1 + 2y \cdot 1 = 3x^3 + 5 + 2y
\end{align}
$$

## 8.4. The Partial Derivative of Max

The max function returns the greatest input.

$$
f(x,y) = \max(x,y) \quad \to \quad \frac{∂}{∂x} f(x,y) = \frac{∂}{∂x} \max(x,y) = 1(x > y)
$$

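Here $1(x > y)$ equals 1 when the condition holds and 0 otherwise. If $x$ is greater than $y$, the derivative of $f(x,y)$ with respect to $x$ equals 1.

If $y$ is greater than $x$, the derivative of $f(x,y)$ with respect to $x$ equals 0, because the output is then $y$, which we treat as a constant.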
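The two cases can be sketched in a few lines of Python. The function name `max_partials` is hypothetical, and the derivative with respect to $y$ is assumed to follow the same rule by symmetry. When the inputs are equal, the function is not differentiable; this sketch simply returns 0 for both partials in that case.

```python
def max_partials(x, y):
    """Return the partial derivatives of max(x, y) with respect to
    x and y, using the conventions 1(x > y) and 1(y > x)."""
    d_dx = 1.0 if x > y else 0.0
    d_dy = 1.0 if y > x else 0.0
    return d_dx, d_dy


print(max_partials(3.0, 1.0))  # (1.0, 0.0)
print(max_partials(1.0, 3.0))  # (0.0, 1.0)
```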

The ReLU activation function is a special case of max: it clips negative inputs to 0 and passes positive inputs through unchanged.

$$
f(x) = \max(x,0) \quad \to \quad \frac{d}{dx} f(x) = \frac{d}{dx} \max(x,0) = 1(x > 0)
$$

We used $d$ instead of $∂$ because the function takes a single parameter, so we calculate the ordinary (non-partial) derivative.
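A minimal sketch of ReLU and its derivative in plain Python follows. The function names are chosen for this illustration. At exactly $x = 0$ the derivative is not defined; returning 0 there follows the $1(x > 0)$ convention used above.

```python
def relu(x):
    """ReLU activation: max(x, 0)."""
    return max(x, 0.0)


def relu_derivative(x):
    """Derivative of ReLU: 1 if x > 0, otherwise 0 (including x == 0)."""
    return 1.0 if x > 0 else 0.0


for value in (-2.0, 0.0, 3.5):
    print(value, relu(value), relu_derivative(value))
```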


## 8.5. The Gradient

The gradient is a vector composed of all of the partial derivatives of a function, calculated with respect to each input variable.

$$
\begin{align}
f(x, y, z)  = 3x^3z - y^2 + 5z + 2yz \quad & \to \\
\frac{∂}{∂x} f(x,y, z) & = 9x^2z \\
\frac{∂}{∂y} f(x,y, z) & = -2y + 2z  \\
\frac{∂}{∂z} f(x,y, z) & = 3x^3 + 5 + 2y
\end{align}
$$

The gradient of a function is denoted with the nabla symbol $∇$, which looks like an inverted delta.

$$
\nabla f(x, y, z) = \begin{bmatrix}
\frac{∂}{∂x} f(x,y, z)  \\
\frac{∂}{∂y} f(x,y, z) \\
\frac{∂}{∂z} f(x,y, z)
\end{bmatrix} =  \begin{bmatrix}
\frac{∂}{∂x}  \\
\frac{∂}{∂y} \\
\frac{∂}{∂z}
\end{bmatrix} f(x, y, z) = \begin{bmatrix}
9x^2z  \\
-2y + 2z \\
3x^3 + 5 + 2y
\end{bmatrix}
$$
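To make the gradient concrete, the sketch below evaluates the three partial derivatives above at one point and compares them with a finite-difference estimate. The helper `numerical_gradient`, the step size, and the evaluation point are assumptions made for this illustration.

```python
def f(x, y, z):
    return 3 * x**3 * z - y**2 + 5 * z + 2 * y * z


def gradient_f(x, y, z):
    """Analytic gradient of f, one partial derivative per input."""
    return [
        9 * x**2 * z,        # with respect to x
        -2 * y + 2 * z,      # with respect to y
        3 * x**3 + 5 + 2 * y,  # with respect to z
    ]


def numerical_gradient(func, point, h=1e-6):
    """Central-difference estimate of the gradient of func at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grad.append((func(*up) - func(*down)) / (2 * h))
    return grad


point = (1.0, 2.0, 3.0)  # arbitrary evaluation point

analytic = gradient_f(*point)
numeric = numerical_gradient(f, point)

print(analytic)  # [27.0, 2.0, 12.0]
print(numeric)   # approximately the same values
```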

As part of model training, we will perform gradient descent, using the chain rule to calculate the gradients in the backward pass.


## 8.6. The Chain Rule

The forward pass through two consecutive neurons can be described as

$$
z = f(x)\\
y = g(z)
$$

or

$$
y = g\big(  f(x) \big)
$$

The output of $g$ is influenced by $x$ through $f$, so we can ask how much $y$ changes when $x$ changes. That rate of change is the derivative of $y$ with respect to $x$.

The loss function takes output, targets, samples, weights, and biases as input parameters.

<center><img src='./image/1-11-loss-function.png' style='width: 60%'/></center>

<center><img src='./image/1-12-loss-function-code.png' style='width: 60%'/><font color='gray'><i>Code for a forward pass of an example neural network model.</i></font></center>

To reduce the loss, we need to learn how each weight and bias impacts it.

The chain rule turns out to be the most important rule for finding the impact of a single input on the output of a chain of functions.

The chain rule says that the derivative of a chain of functions is the product of the derivatives of all of the functions in that chain.

$$
\frac{d}{dx} f \big( g(x) \big) = \frac{d f \big( g(x) \big)}{dg(x)} \cdot \frac{dg(x)}{dx} = f' \big( g(x) \big) \cdot g'(x)
$$

$$
\frac{∂}{∂x} f \Big( g \big( y, h(x,z)  \big) \Big) = \frac{∂ f \Big( g \big( y, h(x,z)  \big) \Big)}{∂ g \big( y, h(x,z)  \big) } \cdot \frac{∂ g \big( y, h(x,z)  \big)}{∂ h(x,z) } \cdot \frac{∂ h(x,z)}{∂x}
$$
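As a sketch of how this multi-variable form plays out, assume one hypothetical choice of functions that mirrors a tiny neuron: $h(x,z) = x \cdot z$, $g(y,u) = y + u$, and $f(v) = \max(v, 0)$. These definitions are not from the text; they only reuse the rules for multiplication, sum, and max derived earlier.

```python
def h(x, z):
    return x * z          # multiplication


def g(y, u):
    return y + u          # sum


def f(v):
    return max(v, 0.0)    # max with 0 (ReLU)


def forward(x, y, z):
    return f(g(y, h(x, z)))


def d_forward_dx(x, y, z):
    """Partial derivative of f(g(y, h(x, z))) with respect to x,
    built as a product of the derivatives along the chain."""
    u = h(x, z)
    v = g(y, u)
    df_dg = 1.0 if v > 0 else 0.0  # derivative of max(v, 0)
    dg_dh = 1.0                    # partial of y + u with respect to u
    dh_dx = z                      # partial of x * z with respect to x
    return df_dg * dg_dh * dh_dx


x, y, z = 2.0, 1.0, 3.0  # arbitrary evaluation point

print(forward(x, y, z))       # 7.0
print(d_forward_dx(x, y, z))  # 3.0
```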

Example of applying the chain rule, with $g(x) = 2x^2$ as the inner function and $f(u) = 3u^5$ as the outer function:

$$
h(x) = f \big( g(x) \big) = 3(2x^2)^5 \quad \to \quad f' \big( g(x)  \big) = 3 \cdot 5 (2 x^2)^{5-1} = 15 (2x^2)^4\\
\begin{align}
\to \quad h'(x) & = f' \big( g(x) \big) \cdot g'(x) = 15(2x^2)^4 \cdot \frac{d}{dx} 2x^2 = 15(2x^2)^4 \cdot 2 \cdot \frac{d}{dx} x^2 \\
& = 15(2x^2)^4 \cdot 2 \cdot 2x^1 = 15(2x^2)^4 \cdot 4x
\end{align}
$$

In practice, we usually apply the rule in a single step and then simplify:

$$
f(x) = 3 \big(  2x^2 \big)^5 \quad \to \quad f'(x) = 15(2x^2)^4 \cdot 4x\\
f'(x) = 15 \cdot 2^4 \cdot 4 \cdot x^9 = 960 x^9
$$
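The worked example can be checked numerically. The sketch below compares the chain-rule form, the simplified form $960x^9$, and a finite-difference estimate at an arbitrarily chosen point; the function names and step size are assumptions for this illustration.

```python
def g(x):
    return 2 * x**2       # inner function


def f(u):
    return 3 * u**5       # outer function


def h(x):
    return f(g(x))        # the chain 3 * (2x^2)^5


def h_prime_chain(x):
    # f'(g(x)) * g'(x) = 15 * (2x^2)^4 * 4x
    return 15 * g(x)**4 * 4 * x


def h_prime_simplified(x):
    return 960 * x**9


x = 1.5      # arbitrary evaluation point
step = 1e-6  # finite-difference step

numeric = (h(x + step) - h(x - step)) / (2 * step)

print(h_prime_chain(x))       # 36905.625
print(h_prime_simplified(x))  # 36905.625
print(numeric)                # approximately the same value
```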


## 8.7. Summary

The partial derivative of the sum with respect to any input equals 1:

$$
\begin{align}
f(x, y)  = x+y \quad \to \quad & \frac{∂}{∂x} f(x, y) = 1\\
& \frac{∂}{∂y} f(x, y) = 1
\end{align}
$$

The partial derivative of the multiplication operation with 2 inputs, with respect to any input, equals the other input:

$$
\begin{align}
f(x, y)  = x \cdot y \quad \to \quad & \frac{∂}{∂x} f(x, y) = y \\
& \frac{∂}{∂y} f(x, y) = x
\end{align}
$$

The partial derivative of the max function of 2 variables with respect to either of them is 1 if that variable is the larger one and 0 otherwise:

$$
f(x, y) = \max(x,y) \quad \to \quad  \frac{∂}{∂x} f(x, y) =  1(x > y)
$$

The derivative of the max function of a single variable and 0 equals 1 if the variable is greater than 0 and 0 otherwise:

$$
f(x) = \max(x,0) \quad \to \quad  \frac{d}{dx} f(x) =  1(x > 0)
$$

The derivative of chained functions equals the product of the derivatives of the successive functions in the chain:

$$
\frac{d}{dx} f \big( g(x) \big) = \frac{d}{dg(x)} f \big( g(x) \big) \cdot \frac{d}{dx} g(x) = f' \big( g(x) \big) \cdot g'(x)
$$

The same applies to the partial derivatives.

$$
\frac{∂}{∂x} f\Big(  g \big( y, h(x,z) \big) \Big) = \frac{∂ f \Big( g \big( y, h(x,z) \big) \Big)}{∂ g \big( y, h(x,z) \big)} \cdot \frac{∂ g \big( y, h(x,z) \big)}{∂ h(x,z)} \cdot \frac{∂ h(x,z)}{∂x}
$$

The gradient is a vector of all of the partial derivatives of a function. An example for a three-input function:

$$
\nabla f(x,y,z) = 
\begin{bmatrix}
\frac{∂}{∂x} f(x,y,z) \\
\frac{∂}{∂y} f(x,y,z) \\
\frac{∂}{∂z} f(x,y,z)
\end{bmatrix} = 
\begin{bmatrix}
\frac{∂}{∂x} \\
\frac{∂}{∂y} \\
\frac{∂}{∂z}
\end{bmatrix} f(x,y,z)
$$

