## Partial Derivatives and Gradients

Thus far, we have been differentiating functions of just one variable. In deep learning, we also need to work with functions of *many* variables. We briefly introduce notions of the derivative that apply to such *multivariate* functions.

Let $y = f(x_1, x_2, \dots, x_n)$ be a function of $n$ variables. The *partial derivative* of $y$ with respect to its $i^{\text{th}}$ parameter $x_i$ is
$$
\frac{\partial y}{\partial x_i}
\;=\;
\lim_{h \to 0}
\frac{f(x_1,\dots,x_{i-1},x_i + h,\,x_{i+1},\dots,x_n) \;-\; f(x_1,\dots,x_i,\dots,x_n)}{h}.
\tag{2.4.6}
$$
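
The limit in (2.4.6) can be approximated numerically by picking a small but nonzero $h$. The sketch below is only an illustration: the helper `partial_derivative`, the example function `f`, and the step size `h=1e-6` are all assumptions made here, not part of the text.

```python
def partial_derivative(f, x, i, h=1e-6):
    """Approximate the partial derivative of f at x along coordinate i.

    Note: i is a 0-based Python index, so i=0 corresponds to x_1.
    """
    x_shifted = list(x)
    x_shifted[i] += h  # perturb only coordinate i, keep the others fixed
    return (f(x_shifted) - f(x)) / h


def f(x):
    # Hypothetical example: f(x_1, x_2) = x_1^2 * x_2 + 3 * x_2
    return x[0] ** 2 * x[1] + 3 * x[1]


point = [2.0, 3.0]
df_dx1 = partial_derivative(f, point, 0)  # analytic value: 2 * x_1 * x_2 = 12
df_dx2 = partial_derivative(f, point, 1)  # analytic value: x_1^2 + 3 = 7
print(df_dx1, df_dx2)
```

Because $h$ is finite, the printed values agree with the analytic ones only approximately; shrinking $h$ too far eventually makes floating-point rounding dominate.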

To calculate $\frac{\partial y}{\partial x_i}$, we treat the other variables $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_{n}$ as constants and differentiate $y$ with respect to $x_i$. The following notational conventions for partial derivatives are all common and all mean the same thing:
$$
\frac{\partial y}{\partial x_i}
=
\frac{\partial f}{\partial x_i}
=
\partial_{x_i} f
=
\partial_i f
=
f_{x_i}
=
f_i
=
D_i f
=
D_{x_i} f.
\tag{2.4.7}
$$
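
As a small worked example (the function is chosen here purely for illustration), take $f(x_1, x_2) = x_1^2 x_2 + 3x_2$. Holding $x_2$ constant and differentiating with respect to $x_1$, and then holding $x_1$ constant and differentiating with respect to $x_2$, gives
$$
\frac{\partial f}{\partial x_1} = 2x_1x_2,
\qquad
\frac{\partial f}{\partial x_2} = x_1^2 + 3.
$$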
We can concatenate the partial derivatives of a multivariate function with respect to all its variables to obtain a vector that is called the *gradient* of the function. Suppose that the input of the function $f:\mathbb R^n\to\mathbb R$ is an $n$-dimensional vector $\mathbf x=[x_1,\dots,x_n]^T$ and the output is a scalar. The gradient of $f$ with respect to $\mathbf x$ is a vector of $n$ partial derivatives:
$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)\\
\vdots\\
\partial_{x_n}f(\mathbf x)
\end{bmatrix}.
\tag{2.4.8}
$$

When there is no ambiguity, $\nabla_{\mathbf x} f(\mathbf x)$ is typically replaced by $\nabla f(\mathbf x)$.
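
To make the definition of the gradient concrete, here is a minimal sketch that stacks numerically approximated partial derivatives into a list. The helper `numerical_gradient`, the example function `f`, and the use of central differences with `h=1e-5` are assumptions made for illustration; in practice a deep learning framework would compute gradients by automatic differentiation instead.

```python
def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar-valued f at the point x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        # Central difference approximation of the i-th partial derivative.
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    # Hypothetical example: f(x) = x_1^2 + 3 * x_1 * x_2 + x_3^3
    return x[0] ** 2 + 3 * x[0] * x[1] + x[2] ** 3


point = [1.0, 2.0, 3.0]
grad = numerical_gradient(f, point)
# Analytic gradient: [2*x_1 + 3*x_2, 3*x_1, 3*x_3^2] = [8, 3, 27]
print(grad)
```

The result has one entry per input coordinate, matching the $n$ rows of (2.4.8).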

The following rules come in handy for differentiating multivariate functions:

- Rule 1: For all $\mathbf A\in\mathbb R^{m\times n}$ we have $\nabla_{\mathbf x}(\mathbf A\mathbf x)=\mathbf A^T$, and for all $\mathbf A\in\mathbb R^{n\times m}$ we have $\nabla_{\mathbf x}(\mathbf x^T \mathbf A)=\mathbf A$.
To see the first identity, write
$$
\mathbf A
=
\begin{bmatrix}
A_{11} & A_{12} & \dots & A_{1n} \\
A_{21} & A_{22} & \dots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \dots & A_{mn}
\end{bmatrix},
\quad
\mathbf x
=
\begin{bmatrix}
x_{1} \\
x_{2} \\
\vdots \\
x_{n}
\end{bmatrix}.
$$
Then
$$
\mathbf A\mathbf x
=
\begin{bmatrix}
A_{11}x_{1} + A_{12}x_{2} + \dots + A_{1n}x_{n} \\
A_{21}x_{1} + A_{22}x_{2} + \dots + A_{2n}x_{n} \\
\vdots \\
A_{m1}x_{1} + A_{m2}x_{2} + \dots + A_{mn}x_{n}
\end{bmatrix}.
$$
Set $\mathbf f(\mathbf x) = \mathbf A\mathbf x$ with components $f_1, \dots, f_m$, so that $\partial_{x_j} f_i(\mathbf x) = A_{ij}$. Taking the gradient of a vector-valued function to be the transpose of its matrix of partial derivatives, we get
$$
\nabla_{\mathbf x}\mathbf f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f_{1}(\mathbf x) & \partial_{x_2}f_{1}(\mathbf x) & \dots & \partial_{x_n}f_{1}(\mathbf x) \\
\partial_{x_1}f_{2}(\mathbf x) & \partial_{x_2}f_{2}(\mathbf x) & \dots & \partial_{x_n}f_{2}(\mathbf x) \\
\vdots & \vdots & \ddots & \vdots \\
\partial_{x_1}f_{m}(\mathbf x) & \partial_{x_2}f_{m}(\mathbf x) & \dots & \partial_{x_n}f_{m}(\mathbf x)
\end{bmatrix}^{T}
=
\begin{bmatrix}
A_{11} & A_{12} & \dots & A_{1n} \\
A_{21} & A_{22} & \dots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \dots & A_{mn}
\end{bmatrix}^{T}
= \mathbf A^{T}.
$$
So for all $\mathbf A\in\mathbb R^{m\times n}$, we have $$\nabla_{\mathbf x}(\mathbf A\mathbf x)=\mathbf A^T.$$
Similarly, for all $\mathbf A\in\mathbb R^{n\times m}$, we have $$\nabla_{\mathbf x}(\mathbf x^T \mathbf A)=\mathbf A.$$
- Rule 2: For square matrices $\mathbf A\in\mathbb R^{n\times n}$ we have $\nabla_{\mathbf x}(\mathbf x^T \mathbf A\mathbf x)=(\mathbf A+\mathbf A^T)\mathbf x$; in particular, $\nabla_{\mathbf x}\|\mathbf x\|^2=\nabla_{\mathbf x}(\mathbf x^T\mathbf x)=2\mathbf x$.
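
Both rules can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: the helpers `matvec`, `transpose`, and `gradient`, the matrices `A` and `B`, and the point `x` are all made up here, and the gradient of a vector-valued function is taken to be the transposed matrix of partial derivatives, as in the derivation of Rule 1.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def gradient(f, x, h=1e-5):
    """Approximate the gradient of a vector-valued f at x.

    Row j holds the partial derivatives of every output with respect
    to x_j, so the result is the transposed matrix of partials.
    """
    rows = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        rows.append([(p - m) / (2 * h) for p, m in zip(f_plus, f_minus)])
    return rows


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                    # m = 2, n = 3
B = [[2.0, 1.0, 0.0], [0.0, 3.0, -1.0], [4.0, 0.0, 1.0]]  # n x n
x = [0.5, -1.0, 2.0]

# Rule 1: the gradient of A x should match A^T (shape n x m).
rule1_numeric = gradient(lambda v: matvec(A, v), x)
rule1_exact = transpose(A)

# Rule 2: the gradient of x^T B x should match (B + B^T) x.
# The scalar output is wrapped in a one-element list to reuse `gradient`.
quad = lambda v: [sum(v_i * w_i for v_i, w_i in zip(v, matvec(B, v)))]
rule2_numeric = [row[0] for row in gradient(quad, x)]
rule2_exact = [p + q for p, q in zip(matvec(B, x), matvec(transpose(B), x))]

print(rule1_numeric, rule1_exact)
print(rule2_numeric, rule2_exact)
```

A check at a single point does not prove either identity, but a mismatch would quickly expose an indexing or transposition mistake.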

Similarly, for any matrix $\mathbf X$, we have $\nabla_{\mathbf X}\|\mathbf X\|_F^2=2\mathbf X$.
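
One way to see the Frobenius-norm identity is to expand the norm entrywise. This assumes the gradient with respect to a matrix is arranged in the same shape as the matrix, with entry $(i, j)$ holding the partial derivative with respect to that matrix's $(i, j)$ entry $X_{ij}$:
$$
\|X\|_F^2 = \sum_{i}\sum_{j} X_{ij}^2
\quad\Longrightarrow\quad
\frac{\partial \|X\|_F^2}{\partial X_{ij}} = 2X_{ij}.
$$
Collecting these entries into a matrix gives twice the original matrix, in the same way that $\nabla_{\mathbf x}\|\mathbf x\|^2 = 2\mathbf x$ for vectors.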