**Partial Derivatives and Gradients**  
-------------------------------------

Thus far, we have been differentiating functions of just one variable. In deep learning, we also need to work with functions of *many* variables. We briefly introduce notions of the derivative that apply to such *multivariate* functions.

Let $y = f(x_1, x_2, \dots, x_n)$ be a function of $n$ variables. The *partial derivative* of $y$ with respect to its $i^{\text{th}}$ parameter $x_i$ is

$$
\frac{\partial y}{\partial x_i}
\;=\;
\lim_{h \to 0}
\frac{f(x_1,\dots,x_{i-1},x_i + h,\,x_{i+1},\dots,x_n) \;-\; f(x_1,\dots,x_i,\dots,x_n)}{h}.
\tag{2.4.6}
$$

To calculate $\frac{\partial y}{\partial x_i}$, we treat $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_{n}$ as constants and differentiate $y$ with respect to $x_i$ alone. The following notational conventions for partial derivatives are all common and all mean the same thing:

$$
\frac{\partial y}{\partial x_i}
=
\frac{\partial f}{\partial x_i}
=
\partial_{x_i} f
=
\partial_i f
=
f_{x_i}
=
f_i
=
D_i f
=
D_{x_i} f.
\tag{2.4.7}
$$
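The limit in (2.4.6) can be approximated numerically by choosing a small but finite $h$. The following is a minimal sketch, not a production routine: the helper name `partial_derivative`, the step size `h`, and the example function `f` are all assumptions made for illustration.

```python
def partial_derivative(f, x, i, h=1e-6):
    """Approximate the partial derivative of f at x with respect to x[i].

    Mirrors the difference quotient in the limit definition: only
    coordinate i is perturbed, every other coordinate is held constant.
    """
    x_shifted = list(x)
    x_shifted[i] += h
    return (f(x_shifted) - f(x)) / h


def f(x):
    # Hypothetical example function: f(x1, x2) = x1^2 * x2 + 3 * x2
    return x[0] ** 2 * x[1] + 3 * x[1]


point = [2.0, 5.0]
df_dx1 = partial_derivative(f, point, 0)  # analytic value: 2 * x1 * x2 = 20
df_dx2 = partial_derivative(f, point, 1)  # analytic value: x1^2 + 3 = 7
print(df_dx1, df_dx2)
```

Because $h$ is finite, the results agree with the analytic values only approximately; shrinking $h$ reduces the truncation error until floating-point rounding starts to dominate.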

We can concatenate the partial derivatives of a multivariate function with respect to all its variables to obtain a vector that is called the *gradient* of the function. Suppose that the input of function $f:\mathbb R^n\to\mathbb R$ is an $n$-dimensional vector $\mathbf x=[x_1,\dots,x_n]^T$ and the output is a scalar. The gradient of the function $f$ with respect to $\mathbf x$ is a vector of $n$ partial derivatives:

$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)\\
\vdots\\
\partial_{x_n}f(\mathbf x)
\end{bmatrix}.
\tag{2.4.8}
$$
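As a rough illustration of (2.4.8), the gradient can be assembled by stacking one numerical partial derivative per coordinate. This sketch uses central differences; the helper `numerical_gradient` and the example function `f` are hypothetical names chosen here, not part of the original text.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x, one partial derivative per coordinate."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        # Central difference for the i-th partial derivative.
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    # Hypothetical example: f(x) = x1^2 + 3*x1*x2 + x3^3
    return x[0] ** 2 + 3 * x[0] * x[1] + x[2] ** 3


x = [1.0, 2.0, 3.0]
grad = numerical_gradient(f, x)
# Analytic gradient for comparison: [2*x1 + 3*x2, 3*x1, 3*x3^2]
analytic = [2 * x[0] + 3 * x[1], 3 * x[0], 3 * x[2] ** 2]
print(grad, analytic)
```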

When there is no ambiguity, $\nabla_{\mathbf x} f(\mathbf x)$ is typically replaced by $\nabla f(\mathbf x)$. The following rules come in handy for differentiating multivariate functions:

- For all $A\in\mathbb R^{m\times n}$ we have $\nabla_{\mathbf x}(A\mathbf x)=A^T$, and for all $A\in\mathbb R^{n\times m}$ we have $\nabla_{\mathbf x}(\mathbf x^T A)=A$.
- For square matrices $A\in\mathbb R^{n\times n}$ we have $\nabla_{\mathbf x}(\mathbf x^T A\mathbf x)=(A+A^T)\mathbf x$, and in particular $\nabla_{\mathbf x}\|\mathbf x\|^2=\nabla_{\mathbf x}(\mathbf x^T\mathbf x)=2\mathbf x$.

Similarly, for any matrix $X$, we have

$$
\nabla_X\|X\|_F^2=2X.
$$
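The quadratic-form and squared-norm rules above can be sanity-checked numerically. The sketch below compares a finite-difference gradient against $(A+A^T)\mathbf x$ and $2\mathbf x$ for one arbitrary choice of $A$ and $\mathbf x$; the helper names and the specific values are assumptions for illustration, and a single check like this is evidence, not a proof.

```python
def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def quadratic_form(A, x):
    """Compute x^T A x for a square matrix A given as a list of rows."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


A = [[1.0, 2.0], [3.0, 4.0]]  # arbitrary non-symmetric example matrix
x = [1.0, -1.0]

# Rule: gradient of x^T A x is (A + A^T) x.
numeric_quad = numerical_gradient(lambda v: quadratic_form(A, v), x)
analytic_quad = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]

# Rule: gradient of ||x||^2 is 2x.
numeric_norm = numerical_gradient(lambda v: sum(t * t for t in v), x)
analytic_norm = [2 * t for t in x]

print(numeric_quad, analytic_quad)
print(numeric_norm, analytic_norm)
```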

## **4. Chain Rule**

### **4.1 Introduction to the Chain Rule**

The Chain Rule expresses the derivative of a composite function in terms of the derivatives of the functions it is built from.  
In deep learning, a model is a deeply nested composition of layers, and the Chain Rule is what lets us compute the gradients needed for optimization.

There are two main cases:

- **Single-variable functions**: $y = f(g(x))$
- **Multivariable functions**: $y = f(\mathbf{u})$, where $\mathbf{u} = g(\mathbf{x})$

### **4.2 Chain Rule for Single-Variable Functions**

**Formula:** If $y = f(u)$ and $u = g(x)$, where $g$ is differentiable at $x$ and $f$ is differentiable at $u = g(x)$, then:

$$
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
$$


**Explanation:**

- $\frac{dy}{dx}$: Rate of change of $y$ with respect to $x$
- $\frac{dy}{du}$: Rate of change of $y$ with respect to $u$
- $\frac{du}{dx}$: Rate of change of $u$ with respect to $x$

The Chain Rule "chains" these derivatives together.

**Example: Find the derivative of the function $y = \sin(x^2)$.**

**Step 1: Identify the composition**

Let
$$
u = x^2, \quad \text{then} \quad y = \sin(u).
$$

**Step 2: Apply the Chain Rule**

By the Chain Rule:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

**Step 3: Compute each part**

- $\frac{dy}{du} = \cos(u)$  
- $\frac{du}{dx} = 2x$

**Step 4: Substitute back**

$$\frac{dy}{dx} = \cos(u) \cdot 2x = 2x\cos(x^2)$$
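The result of this example can be checked against a finite-difference estimate. This is a small sketch using only the standard library; the function names and the evaluation point `x0` are arbitrary choices made here.

```python
import math


def y(x):
    return math.sin(x ** 2)


def dy_dx_chain(x):
    """Derivative of sin(x^2) assembled from the two chain-rule factors."""
    u = x ** 2
    dy_du = math.cos(u)  # derivative of sin(u) with respect to u
    du_dx = 2 * x        # derivative of x^2 with respect to x
    return dy_du * du_dx


def dy_dx_numeric(x, h=1e-6):
    """Central-difference estimate of dy/dx, used only as a cross-check."""
    return (y(x + h) - y(x - h)) / (2 * h)


x0 = 1.5  # arbitrary evaluation point
chain_value = dy_dx_chain(x0)
numeric_value = dy_dx_numeric(x0)
print(chain_value, numeric_value)
```

The two printed values should agree to several decimal places, with the small gap coming from the finite step size and floating-point rounding.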

### **4.3 Chain Rule for Multivariable Functions**

**Formula:**  
If $y = f(\mathbf{u})$, where $\mathbf{u} = (u_1, u_2, \dots, u_m)$, and each $u_i = g_i(\mathbf{x})$, with $\mathbf{x} = (x_1, x_2, \dots, x_n)$, then:

$$
\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i}
$$

**In vector form:**

$$
\nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y
$$

Where:  
- $\mathbf{A} \in \mathbb{R}^{n \times m}$ is the matrix whose entry in row $i$ and column $j$ is the partial derivative $\frac{\partial u_j}{\partial x_i}$.  
- $\nabla_{\mathbf{x}} y$: Gradient of $y$ with respect to $\mathbf{x}$.  
- $\nabla_{\mathbf{u}} y$: Gradient of $y$ with respect to $\mathbf{u}$.

**Explanation:**

- This is the generalized Chain Rule for multivariable functions.  
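- The matrix $\mathbf{A}$ is the transpose of the **Jacobian** of $\mathbf{u}$ with respect to $\mathbf{x}$ (the Jacobian itself has entries $\frac{\partial u_i}{\partial x_j}$ and shape $m \times n$); it records how each $u_j$ responds to each $x_i$.  
- The result is a matrix-vector product, which is common in deep learning for gradient computation.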

**Example:**

Suppose:

- $y = u_1^2 + u_2^2$  
- $u_1 = x_1 + x_2$, $u_2 = x_1 - x_2$

Compute $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$.

**Step 1: Compute the partial derivatives**

- $\frac{\partial y}{\partial u_1} = 2u_1$  
- $\frac{\partial y}{\partial u_2} = 2u_2$  
- $\frac{\partial u_1}{\partial x_1} = 1$, $\frac{\partial u_1}{\partial x_2} = 1$  
- $\frac{\partial u_2}{\partial x_1} = 1$, $\frac{\partial u_2}{\partial x_2} = -1$

**Step 2: Apply the Chain Rule**

$$
\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial u_1} \cdot \frac{\partial u_1}{\partial x_1} + \frac{\partial y}{\partial u_2} \cdot \frac{\partial u_2}{\partial x_1} = 2u_1 \cdot 1 + 2u_2 \cdot 1 = 2u_1 + 2u_2
$$

$$
\frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial u_1} \cdot \frac{\partial u_1}{\partial x_2} + \frac{\partial y}{\partial u_2} \cdot \frac{\partial u_2}{\partial x_2} = 2u_1 \cdot 1 + 2u_2 \cdot (-1) = 2u_1 - 2u_2
$$

**Step 3: Substitute $u_1 = x_1 + x_2$, $u_2 = x_1 - x_2$**

$$
\frac{\partial y}{\partial x_1} = 2(x_1 + x_2) + 2(x_1 - x_2) = 4x_1
$$

$$
\frac{\partial y}{\partial x_2} = 2(x_1 + x_2) - 2(x_1 - x_2) = 4x_2
$$


**Vector Form:**

- Gradient: $\nabla_{\mathbf{x}} y = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 4x_1 \\ 4x_2 \end{bmatrix}$  
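- Matrix $\mathbf{A}$, with the entry in row $i$ and column $j$ equal to $\frac{\partial u_j}{\partial x_i}$: $\mathbf{A} = \begin{bmatrix} \frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} \\ \frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$ (symmetric in this example, so it coincides with its transpose)  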
- Gradient w.r.t. $\mathbf{u}$: $\nabla_{\mathbf{u}} y = \begin{bmatrix} 2u_1 \\ 2u_2 \end{bmatrix}$

**Verification:**

$$
\nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 2u_1 \\ 2u_2 \end{bmatrix} = \begin{bmatrix} 2u_1 + 2u_2 \\ 2u_1 - 2u_2 \end{bmatrix}
$$

Substitute $u_1 = x_1 + x_2$, $u_2 = x_1 - x_2$ to get:

$$
\nabla_{\mathbf{x}} y = \begin{bmatrix} 4x_1 \\ 4x_2 \end{bmatrix}
$$

This matches our earlier result.
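The same example can be reproduced in a few lines of code. The sketch below forms $\mathbf{A}\nabla_{\mathbf{u}} y$ explicitly and compares it with both the closed form $[4x_1, 4x_2]^T$ and a finite-difference estimate; the function names and the test point are assumptions chosen for illustration.

```python
def u_of_x(x):
    x1, x2 = x
    return [x1 + x2, x1 - x2]


def y_of_u(u):
    return u[0] ** 2 + u[1] ** 2


def grad_y_wrt_x(x):
    """Chain rule in vector form: grad_x y = A @ grad_u y."""
    u = u_of_x(x)
    grad_u = [2 * u[0], 2 * u[1]]
    # A[i][j] = d u_j / d x_i, constant here because u is linear in x.
    A = [[1.0, 1.0],
         [1.0, -1.0]]
    return [sum(A[i][j] * grad_u[j] for j in range(2)) for i in range(2)]


def numerical_gradient(f, x, h=1e-6):
    """Central-difference gradient, used only as a cross-check."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


x = [3.0, -2.0]  # arbitrary test point
chain = grad_y_wrt_x(x)
closed_form = [4 * x[0], 4 * x[1]]
numeric = numerical_gradient(lambda v: y_of_u(u_of_x(v)), x)
print(chain, closed_form, numeric)
```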

