## Partial Derivatives and Gradients

Thus far, we have been differentiating functions of just one variable. In deep learning, we also need to work with functions of *many* variables. We briefly introduce notions of the derivative that apply to such *multivariate* functions.

Let $y = f(x_1, x_2, \dots, x_n)$ be a function of $n$ variables. The *partial derivative* of $y$ with respect to its $i^\mathrm{th}$ parameter $x_i$ is
$$
\frac{\partial y}{\partial x_i}
\;=\;
\lim_{h \to 0}
\frac{f(x_1,\dots,x_{i-1},x_i + h,\,x_{i+1},\dots,x_n) \;-\; f(x_1,\dots,x_i,\dots,x_n)}{h}.
\tag{2.4.6}
$$
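The limit in (2.4.6) can be approximated numerically by picking a small but finite step $h$. The sketch below does this with a forward difference in plain Python; the helper name `partial_derivative`, the step size `1e-6`, and the example function are assumptions made for illustration and do not come from the text.

```python
import math

def partial_derivative(f, x, i, h=1e-6):
    """Approximate the partial derivative of f at x w.r.t. x[i] (0-based i).

    Mirrors the difference quotient in (2.4.6), with a small finite h
    standing in for the limit h -> 0.
    """
    x_step = list(x)
    x_step[i] += h          # perturb only the i-th coordinate
    return (f(x_step) - f(x)) / h

# Hypothetical example function: f(x1, x2) = x1^2 * x2 + sin(x2)
def f(x):
    return x[0] ** 2 * x[1] + math.sin(x[1])

point = [1.5, 0.5]
approx_d1 = partial_derivative(f, point, 0)        # analytic: 2 * x1 * x2
approx_d2 = partial_derivative(f, point, 1)        # analytic: x1^2 + cos(x2)
exact_d1 = 2 * point[0] * point[1]
exact_d2 = point[0] ** 2 + math.cos(point[1])
print(approx_d1, exact_d1)
print(approx_d2, exact_d2)
```

The approximation and the analytic value should agree to several decimal places; the remaining gap comes from using a finite $h$ rather than a true limit.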

To calculate $\frac{\partial y}{\partial x_i}$, we treat the other variables $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_{n}$ as constants and differentiate $y$ with respect to $x_i$. The following notational conventions for partial derivatives are all common and all mean the same thing:
$$
\frac{\partial y}{\partial x_i}
=
\frac{\partial f}{\partial x_i}
=
\partial_{x_i} f
=
\partial_i f
=
f_{x_i}
=
f_i
=
D_i f
=
D_{x_i} f.
\tag{2.4.7}
$$
We can concatenate the partial derivatives of a multivariate function with respect to all its variables to obtain a vector called the *gradient* of the function. Suppose that the input of the function $f:\mathbb R^n\to\mathbb R$ is an $n$-dimensional vector $\mathbf x=[x_1,\dots,x_n]^T$ and the output is a scalar. The gradient of $f$ with respect to $\mathbf x$ is a vector of $n$ partial derivatives:
$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)\\
\vdots\\
\partial_{x_n}f(\mathbf x)
\end{bmatrix}.
\tag{2.4.8}
$$

When there is no ambiguity, $\nabla_{\mathbf x} f(\mathbf x)$ is typically written as $\nabla f(\mathbf x)$.
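As a small worked illustration of (2.4.8), consider a two-variable function chosen here only as an example (it does not appear in the original text). Differentiating with respect to each variable in turn, while holding the other fixed, gives

$$
f(\mathbf x) = x_1^2 x_2 + x_2^3,
\qquad
\nabla f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1} f(\mathbf x)\\
\partial_{x_2} f(\mathbf x)
\end{bmatrix}
=
\begin{bmatrix}
2x_1x_2\\
x_1^2 + 3x_2^2
\end{bmatrix},
\qquad
\nabla f\!\left(\begin{bmatrix}1\\2\end{bmatrix}\right)
=
\begin{bmatrix}
4\\
13
\end{bmatrix}.
$$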

The following rules come in handy for differentiating multivariate functions:

- Rule 1: For all $\mathbf A\in\mathbb R^{m\times n}$ we have $\nabla_{\mathbf x}(\mathbf A\mathbf x)=\mathbf A^T$, and for all $\mathbf A\in\mathbb R^{n\times m}$ we have $\nabla_{\mathbf x}(\mathbf x^T \mathbf A)=\mathbf A$.

To see this, write
$$
\mathbf A
=
\begin{bmatrix}
A_{11} & A_{12} & \dots & A_{1n} \\
A_{21} & A_{22} & \dots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \dots & A_{mn}
\end{bmatrix},
\quad
\mathbf x
=
\begin{bmatrix}
x_{1} \\
x_{2} \\
\vdots \\
x_{n}
\end{bmatrix}.
$$
Then
$$
\mathbf A\mathbf x
=
\begin{bmatrix}
A_{11}x_{1} + A_{12}x_{2} + \dots + A_{1n}x_{n}\\
A_{21}x_{1} + A_{22}x_{2} + \dots + A_{2n}x_{n}\\
\vdots\\
A_{m1}x_{1} + A_{m2}x_{2} + \dots + A_{mn}x_{n}
\end{bmatrix}.
$$
Set $\mathbf f(\mathbf x) = \mathbf A\mathbf x$ with components $f_1, \dots, f_m$, so that $\partial_{x_j} f_i(\mathbf x) = A_{ij}$. Then
$$
\nabla_{\mathbf x}\mathbf f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f_{1}(\mathbf x) & \partial_{x_2}f_{1}(\mathbf x) & \dots & \partial_{x_n}f_{1}(\mathbf x)\\
\partial_{x_1}f_{2}(\mathbf x) & \partial_{x_2}f_{2}(\mathbf x) & \dots & \partial_{x_n}f_{2}(\mathbf x)\\
\vdots & \vdots & \ddots & \vdots\\
\partial_{x_1}f_{m}(\mathbf x) & \partial_{x_2}f_{m}(\mathbf x) & \dots & \partial_{x_n}f_{m}(\mathbf x)
\end{bmatrix}^{T}
=
\begin{bmatrix}
A_{11} & A_{12} & \dots & A_{1n} \\
A_{21} & A_{22} & \dots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m1} & A_{m2} & \dots & A_{mn}
\end{bmatrix}^{T}
= \mathbf A^{T}.
$$
So, for all $\mathbf A\in\mathbb R^{m\times n}$, we have $$\nabla_{\mathbf x}(\mathbf A\mathbf x)=\mathbf A^T.$$
Similarly, for all $\mathbf A\in\mathbb R^{n\times m}$, we have $$\nabla_{\mathbf x}(\mathbf x^T \mathbf A)=\mathbf A.$$
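Rule 1 can be sanity-checked numerically. The sketch below builds a forward-difference Jacobian of $\mathbf x \mapsto \mathbf A\mathbf x$ in plain Python and transposes it, following the convention used above that the gradient of a vector-valued function is the transposed Jacobian. The helper names (`jacobian`, `matvec`, `transpose`) and the $2\times 3$ matrix are hypothetical choices for illustration.

```python
def jacobian(f, x, h=1e-6):
    """Forward-difference Jacobian: entry [i][j] approximates d f_i / d x_j."""
    fx = f(x)
    rows = [[0.0] * len(x) for _ in fx]
    for j in range(len(x)):
        x_step = list(x)
        x_step[j] += h                      # perturb one input coordinate
        f_step = f(x_step)
        for i in range(len(fx)):
            rows[i][j] = (f_step[i] - fx[i]) / h
    return rows

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                       # hypothetical 2 x 3 matrix
x = [0.5, -1.0, 2.0]

grad = transpose(jacobian(lambda v: matvec(A, v), x))   # shape 3 x 2
A_T = transpose(A)
for grad_row, a_row in zip(grad, A_T):
    print(grad_row, a_row)
```

Because $\mathbf A\mathbf x$ is linear in $\mathbf x$, the finite-difference result matches $\mathbf A^T$ up to floating-point rounding, independent of the point $\mathbf x$.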
- Rule 2: For all square matrices $\mathbf A\in\mathbb R^{n\times n}$ we have $\nabla_{\mathbf x}(\mathbf x^T \mathbf A\mathbf x)=(\mathbf A+\mathbf A^T)\mathbf x$, and in particular $\nabla_{\mathbf x}\|\mathbf x\|^2=2\mathbf x$.

Let $f(\mathbf x) = \mathbf x^T \mathbf A\mathbf x = \sum_{i=1}^n \sum_{j=1}^n x_i A_{ij} x_j$, then
$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)\\
\vdots\\
\partial_{x_n}f(\mathbf x)
\end{bmatrix}
=
\begin{bmatrix}
\sum_{j=1}^{n} \frac{\partial x_{1}A_{1j}x_{j}}{\partial x_1} + \sum_{i=1}^{n} \frac{\partial x_{i}A_{i1}x_{1}}{\partial x_1}\\
\sum_{j=1}^{n} \frac{\partial x_{2}A_{2j}x_{j}}{\partial x_2} + \sum_{i=1}^{n} \frac{\partial x_{i}A_{i2}x_{2}}{\partial x_2}\\
\vdots\\
\sum_{j=1}^{n} \frac{\partial x_{n}A_{nj}x_{j}}{\partial x_n} + \sum_{i=1}^{n} \frac{\partial x_{i}A_{in}x_{n}}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\sum_{j=1}^{n} A_{1j}x_{j} + \sum_{i=1}^{n} x_{i}A_{i1}\\
\sum_{j=1}^{n} A_{2j}x_{j} + \sum_{i=1}^{n} x_{i}A_{i2}\\
\vdots\\
\sum_{j=1}^{n} A_{nj}x_{j} + \sum_{i=1}^{n} x_{i}A_{in}\\
\end{bmatrix}
=
\begin{bmatrix}
\sum_{j=1}^{n} A_{1j}x_{j}\\
\sum_{j=1}^{n} A_{2j}x_{j}\\
\vdots\\
\sum_{j=1}^{n} A_{nj}x_{j}
\end{bmatrix}
+
\begin{bmatrix}
\sum_{i=1}^{n} x_{i}A_{i1}\\
\sum_{i=1}^{n} x_{i}A_{i2}\\
\vdots\\
\sum_{i=1}^{n} x_{i}A_{in}
\end{bmatrix}
=
\mathbf A \mathbf x + \mathbf A^{T} \mathbf x
=
(\mathbf A + \mathbf A^{T}) \mathbf x.
$$
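The same kind of numerical check works for Rule 2. The sketch below compares a central-difference gradient of $\mathbf x^T\mathbf A\mathbf x$ with $(\mathbf A+\mathbf A^T)\mathbf x$ for a deliberately non-symmetric matrix, so that the $\mathbf A^T$ term actually matters. The function names and the sample values are assumptions for illustration.

```python
def quadratic_form(A, x):
    """Compute x^T A x as the double sum over i and j."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

A = [[1.0, 2.0],
     [-3.0, 4.0]]                            # hypothetical non-symmetric matrix
x = [0.5, -2.0]
n = len(x)

numeric = numerical_gradient(lambda v: quadratic_form(A, v), x)
# Rule 2: (A + A^T) x, written out entry by entry
analytic = [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]
print(numeric)
print(analytic)
```

For this example $(\mathbf A+\mathbf A^T)\mathbf x = [3, -16.5]^T$, whereas $2\mathbf A\mathbf x$ would give a different vector, which is why the symmetrized form is needed when $\mathbf A \neq \mathbf A^T$.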

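The special case follows by taking $\mathbf A = \mathbf I$, the identity matrix: $\|\mathbf x\|^2 = \mathbf x^{T}\mathbf I\mathbf x$, so Rule 2 gives $\nabla_{\mathbf x}\|\mathbf x\|^2 = (\mathbf I + \mathbf I^{T})\mathbf x = 2\mathbf x$.

Similarly, for any matrix $\mathbf X$, the squared Frobenius norm is $\|\mathbf X\|_F^2 = \sum_{i}\sum_{j} X_{ij}^2 = \operatorname{tr}(\mathbf X^{T}\mathbf X)$, which is the squared Euclidean norm of the vector obtained by stacking all entries of $\mathbf X$.

Applying the result above to each entry, we have $\frac{\partial}{\partial X_{ij}}\|\mathbf X\|_F^2 = 2X_{ij}$.

So, $\nabla_{\mathbf X}\|\mathbf X\|_F^2=2\mathbf X.$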

## Exercise
7.

We have $f(\mathbf x) = 3x_{1}^2 + 5e^{x_2}$, where $\mathbf x = (x_1, x_2)^{T}$.

Then
- $\partial_{x_1}f(\mathbf x) = 6x_1$
- $\partial_{x_2}f(\mathbf x) = 5e^{x_2}$

According to (2.4.8),
$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)
\end{bmatrix}
=
\begin{bmatrix}
6x_1 \\
5e^{x_2}
\end{bmatrix}.
$$
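The result can be checked against a finite-difference approximation. The following is a minimal sketch in plain Python; the helper `numerical_gradient`, the step size, and the evaluation point are assumptions chosen for illustration.

```python
import math

def f(x):
    """f(x) = 3 * x1^2 + 5 * exp(x2), with x given as [x1, x2]."""
    return 3 * x[0] ** 2 + 5 * math.exp(x[1])

def analytic_gradient(x):
    """Gradient derived above: [6 * x1, 5 * exp(x2)]."""
    return [6 * x[0], 5 * math.exp(x[1])]

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

point = [2.0, -1.0]                          # arbitrary test point
numeric = numerical_gradient(f, point)
analytic = analytic_gradient(point)
print(numeric)
print(analytic)
```

The two printed vectors should agree to several decimal places at any test point, since $f$ is smooth everywhere.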

8.

We have $f(\mathbf x) = \|\mathbf x\|_{2} = \sqrt {x_1^2 + x_2^2 + \dots + x_n^2}$.

If $\mathbf x \neq \mathbf 0$, the chain rule gives, for each $i$,

$$
\partial_{x_i}f(\mathbf x)
=
\frac{2x_i}{2\sqrt {x_1^2 + x_2^2 + \dots + x_n^2}}
=
\frac{x_i}{\|\mathbf x\|_2},
$$

so

$$
\nabla_{\mathbf x}f(\mathbf x)
=
\begin{bmatrix}
\partial_{x_1}f(\mathbf x)\\
\partial_{x_2}f(\mathbf x)\\
\vdots\\
\partial_{x_n}f(\mathbf x)
\end{bmatrix}
=
\frac{1}{\sqrt {x_1^2 + x_2^2 + \dots + x_n^2}}
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}
= \frac{\mathbf x}{\|\mathbf x\|_2}.
$$

If $\mathbf x = \mathbf 0$, the difference quotient for the $i^\mathrm{th}$ partial derivative is

$$
\frac{f(0,\dots,0,h,0,\dots,0)-f(\mathbf 0)}{h}
=
\frac{\|(0,\dots,0,h,0,\dots,0)\|_2-0}{h}
=
\frac{\sqrt {0^2+\dots+0^2+h^2+0^2+\dots+0^2}}{h}
=
\frac{\sqrt {h^2}}{h}
=
\frac{|h|}{h}.
$$

Because the one-sided limits differ,

$$
\lim_{h \to 0^+} \frac{|h|}{h}
=
1,
\qquad
\lim_{h \to 0^-} \frac{|h|}{h}
=
-1,
$$

the limit as $h \to 0$ does not exist. Hence none of the partial derivatives exist at $\mathbf x = \mathbf 0$, and $f$ is not differentiable there.
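The behavior at the origin can also be observed numerically. The sketch below evaluates the difference quotient of $\|\cdot\|_2$ at $\mathbf 0$ for shrinking positive and negative steps; the helper names, the dimension `n=3`, and the step sizes are assumptions chosen for illustration.

```python
import math

def norm2(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))

def difference_quotient_at_zero(i, h, n=3):
    """(f(0 + h * e_i) - f(0)) / h for f = ||.||_2 in n dimensions."""
    x = [0.0] * n
    x[i] = h                                  # step of size h along axis i
    return (norm2(x) - norm2([0.0] * n)) / h

steps = (1e-1, 1e-3, 1e-6)
right = [difference_quotient_at_zero(0, h) for h in steps]    # h -> 0 from above
left = [difference_quotient_at_zero(0, -h) for h in steps]    # h -> 0 from below
print(right)
print(left)
```

The quotients stay near $1$ for positive steps and near $-1$ for negative steps no matter how small the step is, which is the numerical signature of the two one-sided limits disagreeing.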