# Gradient

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/PilotLeoYan/inside-deep-learning/blob/main/0-premilinaries/gradient.ipynb">
    <img src="../images/colab_logo.png" width="32">Open in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://nbviewer.org/github/PilotLeoYan/inside-deep-learning/blob/main/0-premilinaries/gradient.ipynb">
    <img src="../images/jupyter_logo.png" width="32">Open in Jupyter NBViewer</a>
  </td>
</table>

In machine learning, we often need to differentiate functions of several variables.
These derivatives are collected into a single object called the **gradient**.
Our first use of gradients will be to adjust the parameters of machine learning models during gradient descent.

🛑 It is assumed that the reader is already familiar with differentiating functions of a single variable.

$$
\nabla_{\mathbf{x}}f = \text{grad} f = \frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}}
$$

**Purpose of this notebook**:

This notebook aims to:
1. Present the three types of layout conventions.
2. Progress from the easiest level to the most difficult.
3. Solve the examples at each level.
4. Use the autograd module at each level.

In [1]:
from autograd import jacobian, numpy as np

from platform import python_version
python_version()

'3.13.5'

In [2]:
# our error function
def mape(a: np.ndarray, b: np.ndarray) -> float:
    """
    Mean Absolute Percentage Error between `a` (reference) and `b`,
    returned as a fraction (not multiplied by 100).
    """
    return np.mean(np.abs((a - b) / a)).item()

# Level 1 - vector

There are different layout conventions (numerator layout, denominator layout and mixture layout). 
Let's review each layout and then compare them.

For a function $f: \mathbb{R}^{n} \to \mathbb{R}$, 
$\mathbf{x} \mapsto f(\mathbf{x})$,
$\mathbf{x} \in \mathbb{R}^{n}$ we define the *gradient* of $f$ as
$$
\nabla_{\mathbf{x}}f = \text{grad} f = \frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}}
$$

First, we need to determine the shape of the gradient, which depends on the layout.
We will therefore examine the shape under each layout,
using the same example throughout.

Example 1:
For $\mathbf{x} \in \mathbb{R}^{4}$, we have
$$
f(\mathbf{x}) = 3 \mathbf{x}^{\top} \mathbf{x}
\in \mathbb{R}
$$
then, let's compute its gradient.

## numerator layout

The gradient in *numerator layout* is
$$
\frac{\mathrm{d} {\color{Cyan} f}}
{\mathrm{d} {\color{Magenta} \mathbf{x}}} = 
\begin{bmatrix}
    \frac{\partial f(\mathbf{x})}{\partial x_{1}} &
    \frac{\partial f(\mathbf{x})}{\partial x_{2}} &
    \cdots &
    \frac{\partial f(\mathbf{x})}{\partial x_{n}} 
\end{bmatrix} \in 
\mathbb{R}^{{\color{Cyan} 1} \times {\color{Magenta} n}}
$$
where the shape of the gradient is 
the size of the output of $f$ $\times$ the size of $\mathbf{x}$.

### example 1.1

For Example 1, we first determine the shape of the gradient:
$$
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}}
\in \mathbb{R}^{1 \times 4}
$$

Next, we write out the row vector of partial derivatives:
$$
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}} = 
\begin{bmatrix}
    \frac{\partial f(\mathbf{x})}{\partial x_{1}} &
    \frac{\partial f(\mathbf{x})}{\partial x_{2}} &
    \frac{\partial f(\mathbf{x})}{\partial x_{3}} &
    \frac{\partial f(\mathbf{x})}{\partial x_{4}}
\end{bmatrix}
$$

To find the partial derivatives, we first expand $f$:
$$
\begin{align*}
f(\mathbf{x}) &= 3 \mathbf{x}^{\top} \mathbf{x} \\
&= 3 \left( 
    x_{1}^{2} + x_{2}^{2} + x_{3}^{2} + x_{4}^{2}
\right)
\end{align*}
$$

Therefore, the partial derivatives are
$$
\begin{align*}
\frac{\partial f(\mathbf{x})}{\partial x_{1}} &= 6x_{1} \\
\frac{\partial f(\mathbf{x})}{\partial x_{2}} &= 6x_{2} \\
\frac{\partial f(\mathbf{x})}{\partial x_{3}} &= 6x_{3} \\
\frac{\partial f(\mathbf{x})}{\partial x_{4}} &= 6x_{4}
\end{align*}
$$
or, more compactly, for a general index $i$
$$
\frac{\partial f(\mathbf{x})}{\partial x_{i}} = 6x_{i}
$$

**Note**: It is often easier to state the derivative for a general index $i$ than to write out every partial derivative.
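As a short worked derivation of that general form, the sum can be differentiated term by term. The summation index $j$ is introduced here only for illustration and does not appear in the text above; it relies on the fact that $\partial x_{j} / \partial x_{i}$ is $1$ when $j = i$ and $0$ otherwise.

$$
\begin{align*}
\frac{\partial f(\mathbf{x})}{\partial x_{i}} 
&= \frac{\partial}{\partial x_{i}} \left( 3 \sum_{j=1}^{4} x_{j}^{2} \right) \\
&= 3 \sum_{j=1}^{4} 2 x_{j} \frac{\partial x_{j}}{\partial x_{i}} \\
&= 6 x_{i}
\end{align*}
$$

Only the term with $j = i$ survives, which is why a single formula covers all four partial derivatives.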

Finally, we can compute the gradient of $f$
$$
\begin{align*}
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}} &=
6 \begin{bmatrix}
    x_{1} & x_{2} &
    x_{3} & x_{4}
\end{bmatrix} \\
&= 6 \mathbf{x}^{\top}
\end{align*}
$$
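The result $6\mathbf{x}^{\top}$ can be sanity-checked numerically. The following is a minimal sketch, not the notebook's own method: it uses plain NumPy and a central finite difference instead of autograd, and the helper `numerical_gradient` and the sample values of `x` are hypothetical names and data chosen for illustration.

```python
import numpy as np


def f(x: np.ndarray) -> float:
    """Example 1: f(x) = 3 x^T x for a 1-D array x."""
    return 3.0 * float(x @ x)


def numerical_gradient(func, x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Central-difference estimate of each partial derivative of func at x."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps  # perturb only the i-th coordinate
        grad[i] = (func(x + step) - func(x - step)) / (2.0 * eps)
    return grad


x = np.array([1.0, -2.0, 0.5, 3.0])  # arbitrary sample point

# Numerator layout: (size of f) x (size of x) = 1 x 4, a row vector.
numerator_grad = numerical_gradient(f, x).reshape(1, -1)
analytic_grad = 6.0 * x.reshape(1, -1)  # 6 x^T written as a row

print(numerator_grad.shape)
print(np.allclose(numerator_grad, analytic_grad, atol=1e-5))
```

The explicit `reshape(1, -1)` is what makes the array follow the numerator layout; a bare 1-D array carries no row or column information.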

## denominator layout

The gradient in *denominator layout* is
$$
\frac{\mathrm{d} {\color{Cyan} f}}
{\mathrm{d} {\color{Magenta} \mathbf{x}}} = 
\begin{bmatrix}
    \frac{\partial f(\mathbf{x})}{\partial x_{1}} &
    \frac{\partial f(\mathbf{x})}{\partial x_{2}} &
    \cdots &
    \frac{\partial f(\mathbf{x})}{\partial x_{n}} 
\end{bmatrix}^{\top} \in 
\mathbb{R}^{{\color{Magenta} n} \times {\color{Cyan} 1}}
$$
where the shape of the gradient is 
the size of $\mathbf{x}$ $\times$ the size of the output of $f$.

**Note**: You can see that the denominator layout is the transpose of the numerator layout.

### example 1.2

For Example 1, we first determine the shape of the gradient:
$$
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}}
\in \mathbb{R}^{4 \times 1}
$$

Next, we write out the column vector of partial derivatives:
$$
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}} = 
\begin{bmatrix}
    \frac{\partial f(\mathbf{x})}{\partial x_{1}} \\
    \frac{\partial f(\mathbf{x})}{\partial x_{2}} \\
    \frac{\partial f(\mathbf{x})}{\partial x_{3}} \\
    \frac{\partial f(\mathbf{x})}{\partial x_{4}}
\end{bmatrix}
$$

Next, let's calculate the partial derivatives
$$
\frac{\partial f(\mathbf{x})}{\partial x_{i}} = 6x_{i}
$$

Finally, we can compute the gradient of $f$
$$
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}} =
6 \mathbf{x} \in \mathbb{R}^{4 \times 1}
$$

**Remark**: in denominator layout, the gradient is a column vector.
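To make the shape difference concrete, here is a minimal NumPy sketch that builds the gradient of Example 1 in both layouts and checks that one is the transpose of the other. The sample values of `x` and the variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5, 3.0])  # arbitrary sample point in R^4

# Denominator layout: (size of x) x (size of f) = 4 x 1, a column vector.
denominator_grad = 6.0 * x.reshape(-1, 1)

# Numerator layout: (size of f) x (size of x) = 1 x 4, a row vector.
numerator_grad = 6.0 * x.reshape(1, -1)

print(denominator_grad.shape)
print(numerator_grad.shape)

# The two layouts hold the same numbers and differ only by a transpose.
print(np.array_equal(denominator_grad, numerator_grad.T))
```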

## mixture layout

Mixture layout does not distinguish between column vectors and row vectors: 
it drops the size-1 axes, which are often unnecessary.

The gradient in *mixture layout* is
$$
\frac{\mathrm{d} f}
{\mathrm{d} {\color{Magenta} \mathbf{x}}} = 
\begin{bmatrix}
    \frac{\partial f(\mathbf{x})}{\partial x_{1}} &
    \frac{\partial f(\mathbf{x})}{\partial x_{2}} &
    \cdots &
    \frac{\partial f(\mathbf{x})}{\partial x_{n}} 
\end{bmatrix} \in 
\mathbb{R}^{{\color{Magenta} n}}
$$
where the dimensionality/size of the gradient is the size of $\mathbf{x}$.

### example 1.3

For Example 1, we first determine the shape of the gradient:
$$
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}}
\in \mathbb{R}^{4}
$$

We can compute the gradient of $f$
$$
\frac{\mathrm{d} f}{\mathrm{d} \mathbf{x}} =
6 \mathbf{x}
$$

In [3]:
example_1_x = np.random.randn(4)

def example_1_f(x: np.ndarray):
    return 3 * (x.T @ x)

example_1_f(example_1_x).size

1

In [4]:
# let's compute the gradient using autograd.jacobian
grad_1 = jacobian(example_1_f, 0)(example_1_x)
grad_1

array([  1.34851952,  -3.06929364, -13.585319  ,   9.49406441])

In [5]:
# let's calculate the gradient ourselves
our_grad_1 = 6 * example_1_x
our_grad_1

array([  1.34851952,  -3.06929364, -13.585319  ,   9.49406441])

In [6]:
# let's compare both solutions
mape(grad_1, our_grad_1)

0.0

# Level 2 - matrix

For a function $\mathbf{f}: \mathbb{R}^{n} \to \mathbb{R}^{m}$, 
$\mathbf{x} \mapsto \mathbf{f}(\mathbf{x})$,
$\mathbf{x} \in \mathbb{R}^{n}$ we define the *Jacobian* of $\mathbf{f}$ as
$$
\mathbf{J} = 
\nabla_{\mathbf{x}}\mathbf{f} = \text{grad} \mathbf{f} = 
\frac{\mathrm{d} \mathbf{f}}{\mathrm{d} \mathbf{x}}
$$

Next, we will see how the Jacobian is defined in numerator, denominator, and mixture layout.

Example 2:
For $\mathbf{x} \in \mathbb{R}^{4}$, we have
$$
\mathbf{f}(\mathbf{x}) = 
7 \mathbf{A} \mathbf{x}
\in \mathbb{R}^{3}
$$
where $\mathbf{A} \in \mathbb{R}^{3 \times 4}$.

Let's compute its Jacobian.

## numerator layout

The Jacobian of a function $\mathbf{f}: \mathbb{R}^{n} \to \mathbb{R}^{m}$
in *numerator layout* is
$$
\begin{align*}
\mathbf{J} &= 
\frac{\mathrm{d} {\color{Cyan} \mathbf{f}}}
{\mathrm{d} {\color{Magenta} \mathbf{x}}} = 
\begin{bmatrix}
    \frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_{1}}
    & \cdots &
    \frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_{n}} \\
\end{bmatrix} \\ &=
\begin{bmatrix}
    \frac{\partial f_{1}(\mathbf{x})}{\partial \mathbf{x}}
    \\ \vdots \\
    \frac{\partial f_{m}(\mathbf{x})}{\partial \mathbf{x}} \\
\end{bmatrix} \\ &=
\begin{bmatrix}
    \frac{\partial f_{1}(\mathbf{x})}{\partial x_{1}}
    & \cdots &
    \frac{\partial f_{1}(\mathbf{x})}{\partial x_{n}}
    \\ \vdots & \ddots & \vdots \\
    \frac{\partial f_{m}(\mathbf{x})}{\partial x_{1}}
    & \cdots & 
    \frac{\partial f_{m}(\mathbf{x})}{\partial x_{n}}
\end{bmatrix}
\in 
\mathbb{R}^{{\color{Cyan} m} \times {\color{Magenta} n}}
\end{align*}
$$


**Note**: if $m = 1$, then we have $\mathbf{J} \in \mathbb{R}^{1 \times n}$,
as in Level 1.

**Remark**: 
$$
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_{i}} =
\begin{bmatrix}
    \frac{\partial f_{1}(\mathbf{x})}{\partial x_{i}}
    \\ \vdots \\
    \frac{\partial f_{m}(\mathbf{x})}{\partial x_{i}}
\end{bmatrix}
\in \mathbb{R}^{m \times 1}
$$
and
$$
\frac{\partial f_{i}(\mathbf{x})}{\partial \mathbf{x}} =
\begin{bmatrix}
    \frac{\partial f_{i}(\mathbf{x})}{\partial x_{1}}
    & \cdots &
    \frac{\partial f_{i}(\mathbf{x})}{\partial x_{n}}
\end{bmatrix}
\in \mathbb{R}^{1 \times n}
$$

### example 2.1

For Example 2, we first determine the shape of the Jacobian:
$$
\frac{\mathrm{d} \mathbf{f}}{\mathrm{d} \mathbf{x}}
\in \mathbb{R}^{3 \times 4}
$$
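The $3 \times 4$ shape can be checked numerically. The sketch below is an illustration under stated assumptions rather than the notebook's own solution: it uses plain NumPy with central finite differences, and `numerical_jacobian`, the random `A`, and the random `x` are hypothetical. It also compares against $7\mathbf{A}$, which follows from $f_{i}(\mathbf{x}) = 7 \sum_{j} A_{ij} x_{j}$ and hence $\partial f_{i} / \partial x_{j} = 7 A_{ij}$; this step is not worked out in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))  # hypothetical values for A in R^{3x4}
x = rng.standard_normal(4)       # hypothetical sample point in R^4


def f(x: np.ndarray) -> np.ndarray:
    """Example 2: f(x) = 7 A x, mapping R^4 to R^3."""
    return 7.0 * (A @ x)


def numerical_jacobian(func, x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Numerator layout: row i is output f_i, column j is input x_j."""
    m = func(x).size
    jac = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps  # perturb only the j-th input
        # Column j holds the derivative of every output with respect to x_j.
        jac[:, j] = (func(x + step) - func(x - step)) / (2.0 * eps)
    return jac


J = numerical_jacobian(f, x)

print(J.shape)
print(np.allclose(J, 7.0 * A, atol=1e-5))
```

Filling the Jacobian one column at a time mirrors the first form in the definition, where each column is $\partial \mathbf{f}(\mathbf{x}) / \partial x_{j} \in \mathbb{R}^{m \times 1}$.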

# 🖋️ TODO

+ compute for $x\in\mathbb{R}^{m}$ and $f\in\mathbb{R}^{n}$
+ chain rule 
+ compute for $x\in\mathbb{R}^{m \times n}$ and $f\in\mathbb{R}$
+ compute for $x\in\mathbb{R}^{m \times n}$ and $f\in\mathbb{R}^{p}$
+ compute for $x\in\mathbb{R}^{m \times n}$ and $f\in\mathbb{R}^{p \times q}$
+ einstein summation