In [1]:
import numpy as np
import matplotlib.pyplot as plt

<br>

# Differentials, Gradients, Level Curves, Hessian, Jacobians
---

All examples below are given for a function of 2 variables. They extend easily to any number of variables.

<br>

### Partial derivatives

We define the partial derivative of a function $f(x,y)$ with respect to $x$ as the limit of the rate of change of $f$ when we modify $x$ by a small amount $h$ while keeping $y$ fixed:

&emsp; $\displaystyle \frac{\partial f}{\partial x} = \underset{h \rightarrow 0}{\lim} \frac{f(x+h,y) - f(x,y)}{h}$
&emsp; and
&emsp; $\displaystyle \frac{\partial f}{\partial y} = \underset{h \rightarrow 0}{\lim} \frac{f(x,y+h) - f(x,y)}{h}$

Intuitively, we can think of it as "small change of $f$ over small change of $x$", but this intuition becomes confusing as soon as we ask how small the change has to be. It helps to keep in mind that the formal definition is a limit.
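The limit can be approximated numerically by picking a small but finite $h$. The sketch below uses a central difference (a symmetric variant of the one-sided quotient in the definition, chosen here because it is more accurate) on an example function that is an assumption made for illustration. The helper names `partial_x` and `partial_y` are hypothetical.

```python
import numpy as np

def f(x, y):
    # Example function (an assumption for illustration): f(x, y) = x^2 * y + sin(y)
    return x**2 * y + np.sin(y)

def partial_x(f, x, y, h=1e-6):
    # Central difference approximation of df/dx, with y held fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # Central difference approximation of df/dy, with x held fixed
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x0, y0 = 1.5, 0.7
# Numerical estimate next to the analytic derivative
print(partial_x(f, x0, y0), 2 * x0 * y0)
print(partial_y(f, x0, y0), x0**2 + np.cos(y0))
```

Making `h` too small eventually hurts the estimate because of floating-point cancellation, which is one practical face of the "how small is small" question.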

<br>

### Differentials

Given a function $f(x,y)$, we define the differential of $f$ as:

&emsp; $\boxed{df = \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy}$
&emsp; or best seen as a function:
&emsp; $\displaystyle df(x,y,dx,dy) = \frac{\partial f}{\partial x}(x,y)  dx + \frac{\partial f}{\partial y}(x,y) dy$

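When we fix $x$ and $y$, the partial derivatives are constants and $df$ is linear in $(dx, dy)$: it has the form $df = a \, dx + b \, dy$, which is the equation of a plane in the coordinates $(dx, dy, df)$. Indeed, this function describes the **plane tangent to the graph of $f$ at the point $(x,y)$**. To see it, we fix a point $(x_0, y_0)$ and rewrite this expression to make $f(x_0,y_0)$ appear: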

&emsp; $\displaystyle f(x_0+dx,y_0+dy)-f(x_0,y_0) \simeq \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy$
&emsp; or, with $x = x_0 + dx$ and $y = y_0 + dy$,
&emsp; $\displaystyle f(x,y) \simeq f(x_0,y_0) + \frac{\partial f}{\partial x} (x-x_0) + \frac{\partial f}{\partial y} (y-y_0)$

This form is the order 1 Taylor expansion of $f$ at the point $(x_0, y_0)$, with the partial derivatives evaluated at $(x_0, y_0)$. In the coordinates $(x, y, z)$, the equation of the tangent plane is therefore:

&emsp; $\displaystyle z = f(x_0,y_0) + \frac{\partial f}{\partial x} (x-x_0) + \frac{\partial f}{\partial y} (y-y_0)$
&emsp; whose normal is:
&emsp; $\displaystyle \big (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, -1 \big )$
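A quick numerical check of the tangent-plane reading: the first-order expression should match $f$ up to an error that shrinks quadratically with the step. This is a minimal sketch on an example function chosen for illustration. The names `grad_f` and `tangent_plane` are hypothetical.

```python
import numpy as np

def f(x, y):
    # Example function (an assumption for illustration)
    return x**2 + 3 * x * y

def grad_f(x, y):
    # Analytic partial derivatives of the example function
    return np.array([2 * x + 3 * y, 3 * x])

def tangent_plane(x, y, x0, y0):
    # First-order approximation of f around (x0, y0)
    fx, fy = grad_f(x0, y0)
    return f(x0, y0) + fx * (x - x0) + fy * (y - y0)

x0, y0 = 1.0, 2.0
steps = (1e-1, 1e-2, 1e-3)
errors = [abs(f(x0 + s, y0 + s) - tangent_plane(x0 + s, y0 + s, x0, y0))
          for s in steps]
for s, err in zip(steps, errors):
    # Dividing the step by 10 divides the error by about 100
    print(s, err)
```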

<br>

### Gradient

Given a function $f(x,y)$, we define the gradient of $f$ as:

&emsp; $\boxed{ \nabla f = \big ( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \big ) }$
&emsp; which is a vector holding the **partial derivatives of $f$** with respect to each of its input variables

The gradient has several interesting properties, which offer different points of view on it:

1. It is normal to the line $df = 0$ defined by the differential, that is the set of displacements $(dx, dy)$ that leave $f$ unchanged: it is **orthogonal to the level curves of $f$**
2. We can see that $df = \nabla f . (dx, dy)$: the change of $f$ along a vector $u$ is $\nabla f . u$
3. It defines the **direction along which the value of $f$ changes the fastest** (the steepest ascent)

This last property can be seen from property (2): the dot product between two vectors $u$ and $v$ is equal to $\Vert u \Vert \Vert v \Vert \cos \theta$. For a unit vector $u$, the change $\nabla f . u = \Vert \nabla f \Vert \cos \theta$ is therefore maximized when $\theta = 0$, that is when $u$ and $\nabla f$ are collinear and point in the same direction.
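The steepest-ascent property can be checked by brute force: sample unit vectors around the circle, compute $\nabla f . u$ for each, and compare the best direction to the normalized gradient. This is an illustrative sketch. The example function and the sampling resolution are assumptions.

```python
import numpy as np

def grad_f(x, y):
    # Gradient of the example function f(x, y) = x^2 + 3xy (an assumption)
    return np.array([2 * x + 3 * y, 3 * x])

g = grad_f(1.0, 2.0)

# Unit vectors u(theta) sampled every 0.1 degree around the circle
thetas = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
directions = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)

# The rate of change of f along each unit vector u is grad_f . u
slopes = directions @ g
best = directions[np.argmax(slopes)]

print(best, g / np.linalg.norm(g))       # best direction vs normalized gradient
print(slopes.max(), np.linalg.norm(g))   # best slope vs norm of the gradient
```

The two printed pairs agree up to the sampling resolution, and the maximal slope is the norm of the gradient.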

<br>

### Jacobian

Say the function $f$ has $N$ inputs and $M$ outputs. We can see it as a collection of functions $f_1$ to $f_M$, each of which produces one component of the output (the value along one dimension). The Jacobian is defined as the matrix where **each row contains the gradient of one of the components**:

&emsp; $\displaystyle J = \begin{pmatrix} \nabla f_1 \\ \vdots \\ \nabla f_M \end{pmatrix}$
&emsp; or 
&emsp; $\displaystyle J = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \dots & \frac{\partial f_M}{\partial x_N} \end{pmatrix}$

To compute the differentials along each output dimension, similarly to what we did for the gradient, we multiply the Jacobian matrix with a vector $dx = (dx_1, \dots, dx_N)^T$ containing the differentials of the inputs:

&emsp; $\displaystyle \begin{pmatrix} df_1 \\ \vdots \\ df_M \end{pmatrix} = J \begin{pmatrix} dx_1 \\ \vdots \\ dx_N \end{pmatrix} = \begin{pmatrix} \nabla f_1 \\ \vdots \\ \nabla f_M \end{pmatrix} \begin{pmatrix} dx_1 \\ \vdots \\ dx_N \end{pmatrix} = \begin{pmatrix} \nabla f_1 . dx \\ \vdots \\ \nabla f_M . dx \end{pmatrix}$

Note that the Jacobian has properties that are similar to those of the gradient: the level curves are replaced by the **null space (or kernel)** of the Jacobian, defined as the set of all vectors $x$ such that $J x = 0$. This equation is similar to the **orthogonality property** of the gradient.
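As an illustration, the Jacobian can be estimated column by column with finite differences and then used to predict the change of the outputs for a small change of the inputs. The sketch below assumes a polar-to-Cartesian map as the example. `numerical_jacobian` is a hypothetical helper, not a NumPy function.

```python
import numpy as np

def F(p):
    # Example map (an assumption): polar -> Cartesian, N = 2 inputs, M = 2 outputs
    r, theta = p
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def numerical_jacobian(F, p, h=1e-6):
    # Central differences: column k holds the derivative of F along input k
    p = np.asarray(p, dtype=float)
    columns = []
    for k in range(p.size):
        e = np.zeros_like(p)
        e[k] = h
        columns.append((F(p + e) - F(p - e)) / (2 * h))
    return np.stack(columns, axis=1)   # shape (M, N): row i is the gradient of F_i

p0 = np.array([2.0, np.pi / 6])
J = numerical_jacobian(F, p0)
J_exact = np.array([[np.cos(p0[1]), -p0[0] * np.sin(p0[1])],
                    [np.sin(p0[1]),  p0[0] * np.cos(p0[1])]])

dp = np.array([1e-3, -2e-3])
print(J)
print(F(p0 + dp) - F(p0), J @ dp)      # actual change vs first-order prediction
```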

<br>

### Hessian matrix

The Hessian matrix is the Jacobian of the gradient of $f$. Indeed, the gradient can be seen as a function with multiple components: the partial derivatives of $f$ along each of its input dimensions. The Hessian matrix is therefore equal to:

&emsp; $\displaystyle H = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_N \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_N \partial x_N} \end{pmatrix}$
&emsp; which is a symmetric matrix when the second partial derivatives are continuous, since then $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$

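The "definite-ness" of the Hessian matrix $H$ lets us identify whether a critical point (a point where $\nabla f = 0$) is:

* a local minimum: $H$ is positive definite, i.e. $x^T H x > 0, \forall x \ne 0$
* a local maximum: $H$ is negative definite, i.e. $x^T H x < 0, \forall x \ne 0$
* a saddle point: $H$ is indefinite, i.e. $x^T H x$ takes both positive and negative values

If $H$ is only semi-definite ($x^T H x$ keeps one sign but vanishes for some $x \ne 0$), the second-order test is inconclusive.

Since the Hessian is symmetric, it has real eigenvalues, and we can evaluate its "definite-ness" by looking at the sign of these eigenvalues. If they are all positive, the matrix is positive definite. If they are all negative, the matrix is negative definite. If some are positive and some are negative, the matrix is indefinite.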
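The eigenvalue test can be sketched with `np.linalg.eigvalsh`, which is intended for symmetric matrices and returns real eigenvalues. The function name `classify_critical_point` and the tolerance are assumptions made for this example. Any case with a zero eigenvalue and no mixed signs is reported as inconclusive, since the second-order test alone cannot decide it.

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    # Eigenvalues of a symmetric matrix are real; eigvalsh returns them sorted
    eigenvalues = np.linalg.eigvalsh(H)
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive"

# Hessians at the origin of a few simple functions
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))    # x^2 + y^2
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -2.0]])))  # -x^2 - y^2
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # x^2 - y^2
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 0.0]])))    # x^2 + y^4
```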

<br>

### Chain rule

If we consider a function $f(x,y)$, where $x(u,v)$ and $y(u,v)$ are themselves functions, we can compute the variations of $f$ with respect to $u$ and $v$ by using the chain rule:

&emsp; $\displaystyle \frac{\partial f}{\partial u} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial u}$
&emsp; and
&emsp; $\displaystyle \frac{\partial f}{\partial v} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial v} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial v}$

The chain rule follows from the definition of differentials: we replace $dx$ and $dy$ by their own differentials and collect the terms of the component we are interested in, for instance the factor of $du$ if we want the partial derivative with respect to $u$:

&emsp; $\displaystyle df = \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy$
&emsp; with 
&emsp; $\displaystyle dx = \frac{\partial x}{\partial u} du + \frac{\partial x}{\partial v} dv$
&emsp; and 
&emsp; $\displaystyle dy = \frac{\partial y}{\partial u} du + \frac{\partial y}{\partial v} dv$

The chain rule can be written more generically as:

&emsp; $\boxed{\frac{\partial f}{\partial y_j} = \sum_i \frac{\partial f}{\partial x_i} \frac{\partial x_i}{\partial y_j}}$
&emsp; where $x_i$ are the input variables of $f$ that depend on $y_j$

We see from the indices that it resembles matrix multiplication. It can indeed be rewritten as a **product between the gradient of $f$ with respect to $x$ and the Jacobian of $x$ with respect to $y$**:

&emsp; $\displaystyle \nabla_{y} f = \begin{pmatrix} \frac{\partial f}{\partial x_1} & \dots & \frac{\partial f}{\partial x_M} \end{pmatrix} \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \dots & \frac{\partial x_1}{\partial y_N} \\ \vdots & & \vdots \\ \frac{\partial x_M}{\partial y_1} & \dots & \frac{\partial x_M}{\partial y_N} \end{pmatrix}$
&emsp; $\implies$
&emsp; $\boxed{\nabla_{y} f = (\nabla_x f)^T J_y(x)}$
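The matrix form can be verified numerically by comparing the product "gradient times Jacobian" with a finite-difference gradient of the composed function. This is a minimal sketch. The inner map, the outer function, and all helper names are assumptions chosen for illustration.

```python
import numpy as np

def x_of_y(y):
    # Inner map (an assumption): x1 = u * v, x2 = u + v, with y = (u, v)
    u, v = y
    return np.array([u * v, u + v])

def f(x):
    # Outer function (an assumption): f(x1, x2) = x1^2 + sin(x2)
    return x[0]**2 + np.sin(x[1])

def grad_f_x(x):
    # Gradient of f with respect to x
    return np.array([2 * x[0], np.cos(x[1])])

def jacobian_x_y(y):
    # Jacobian of x with respect to y: row i holds the gradient of x_i
    u, v = y
    return np.array([[v, u],
                     [1.0, 1.0]])

y0 = np.array([0.5, -1.2])
chain = grad_f_x(x_of_y(y0)) @ jacobian_x_y(y0)

# Central finite differences on the composition f(x(y))
h = 1e-6
fd = np.array([
    (f(x_of_y(y0 + h * e)) - f(x_of_y(y0 - h * e))) / (2 * h)
    for e in np.eye(2)
])
print(chain, fd)
```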

<br>

### Taylor expansion

The Taylor expansion of a **single variable function** $f(x)$ in the neighborhood of $a$ is:

&emsp; $\displaystyle f(x) = \sum_{n=0}^{\infty} {\frac{f^{(n)}(a)}{n!}}(x-a)^n$
&emsp; where
&emsp; $f^{(n)}$ is the $n^{th}$ derivative of $f$

**Proof:** Assume $f$ can be written as a power series around $a$ with unknown coefficients $b_n$, and recursively match each order at $x = a$: first the value of the function, then the value of the derivative, then the value of the second derivative, and so on. The factorials naturally appear as the result of differentiating a polynomial:

&emsp; $\displaystyle f^{(0)}(x) = \sum_{n=0}^{\infty} b_n (x-a)^n$
&emsp; $\implies$
&emsp; $f(a) = b_0$

&emsp; $\displaystyle f^{(1)}(x) = \sum_{n=1}^{\infty} n b_n (x-a)^{n-1}$
&emsp; $\implies$
&emsp; $f^{(1)}(a) = 1 \times b_1$

&emsp; $\displaystyle f^{(2)}(x) = \sum_{n=2}^{\infty} n (n-1) b_n (x-a)^{n-2}$
&emsp; $\implies$
&emsp; $f^{(2)}(a) = 2 \times 1 \times b_2$

&emsp; $\displaystyle f^{(3)}(x) = \sum_{n=3}^{\infty} n (n-1) (n-2) b_n (x-a)^{n-3}$
&emsp; $\implies$
&emsp; $f^{(3)}(a) = 3 \times 2 \times 1 \times b_3$

&emsp; $\dots$

In general, $f^{(n)}(a) = n! \, b_n$, which gives $\displaystyle b_n = \frac{f^{(n)}(a)}{n!}$.
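To see the series converge in practice, here is a small sketch for the exponential function, chosen because all of its derivatives at $a$ are equal to $e^a$. The function name `taylor_exp` is hypothetical.

```python
import math

def taylor_exp(x, a=0.0, order=5):
    # All derivatives of exp at a equal exp(a), so the coefficient of
    # (x - a)^n is exp(a) / n!
    return sum(math.exp(a) / math.factorial(n) * (x - a)**n
               for n in range(order + 1))

for order in (1, 2, 4, 8):
    # The truncated series gets closer to exp(0.5) as the order grows
    print(order, taylor_exp(0.5, order=order), math.exp(0.5))
```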

<br>

### Taylor expansion (multivariate)

The Taylor expansion of a **multi variable function** $f(x)$, with $x = (x_1, \dots, x_n)$, in the neighborhood of $a$ is:

&emsp; $\displaystyle f(x) = \sum_{|\alpha| = 0}^{\infty} {\frac {D^\alpha f(a)}{\alpha!}}(x-a)^{\alpha}$
&emsp; where
&emsp; $|\alpha| = \sum_i \alpha_i$
&emsp; ,
&emsp; $\alpha! = \prod_i \alpha_i!$
&emsp; ,
&emsp; $(x-a)^{\alpha} = \prod_i (x_i - a_i)^{\alpha_i}$
&emsp; and
&emsp; $D^{\alpha} f = \frac {\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}}$

The Taylor expansion at order 2 is better known in terms of the gradient and the Hessian matrix, both evaluated at $a$:

&emsp; $\displaystyle f(x) \simeq f(a) + (x - a)^T \nabla f(a) + \frac{1}{2} (x - a)^T H(a) (x-a)$

This formula also helps to understand why we need to check the "definite-ness" of the Hessian in order to check if a critical point is a local minimum, a local maximum, or a saddle point. Indeed, if the gradient is zero, the sign of the quadratic term in the Hessian determines whether $f(x)$ increases or decreases as $x$ moves away from $a$ in a given direction.
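The second-order expansion can be compared against the first-order one numerically. The sketch below uses the usual factor $\frac{1}{2}$ in front of the quadratic term and an example function with hand-written gradient and Hessian. All names are assumptions made for illustration.

```python
import numpy as np

def f(p):
    # Example function (an assumption): f(x, y) = exp(x) * cos(y)
    return np.exp(p[0]) * np.cos(p[1])

def grad(p):
    return np.array([np.exp(p[0]) * np.cos(p[1]),
                     -np.exp(p[0]) * np.sin(p[1])])

def hessian(p):
    ex, c, s = np.exp(p[0]), np.cos(p[1]), np.sin(p[1])
    return np.array([[ex * c, -ex * s],
                     [-ex * s, -ex * c]])

def taylor1(p, a):
    # Order 1: value plus gradient term
    return f(a) + (p - a) @ grad(a)

def taylor2(p, a):
    # Order 2: adds one half of the quadratic form defined by the Hessian
    d = p - a
    return taylor1(p, a) + 0.5 * d @ hessian(a) @ d

a = np.array([0.2, 0.3])
errors = []
for scale in (1e-1, 1e-2):
    p = a + scale * np.array([1.0, -1.0])
    errors.append((abs(f(p) - taylor1(p, a)), abs(f(p) - taylor2(p, a))))
    print(scale, errors[-1])   # (order 1 error, order 2 error)
```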

<br>

# Optimization with equality constraints
---

<br>

### Optimization with one constraint

Consider minimizing / maximizing the function $f(x)$ subject to the constraint $g(x) = 0$. The critical points $x_0$ we are looking for are the ones where $g(x_0) = 0$ and $\nabla f(x_0) = \lambda \nabla g(x_0)$ for some scalar $\lambda$ (assuming $\nabla g(x_0) \ne 0$).

> **Proof:** Consider the level curves $f(x) = c$. At a critical point $x_0$, the level curve of $f$ going through $x_0$ must be tangent to the curve $g(x) = 0$, or otherwise we could move along the curve $g(x) = 0$ to find a better value for $f$. The normal to a level curve is the gradient of the function, and two tangent curves share the same normal direction, so $\nabla f$ must be collinear with $\nabla g$.

We can therefore create a function $\mathcal{L}$, called the Lagrangian, that combines $f$ and $g$ such that its critical points encapsulate the optimization objective:

&emsp; $\mathcal{L}(x,\lambda) = f(x) - \lambda g(x)$
&emsp; ,
&emsp; $\nabla_x \mathcal{L} = 0 \implies \nabla f(x) = \lambda \nabla g(x)$
&emsp; and 
&emsp; $\displaystyle \frac{\partial \mathcal{L}}{\partial \lambda} = 0 \implies g(x) = 0$

Searching for the critical points of the Lagrangian gives us potential solutions to our problem (we still have to evaluate $f$ at these points to check whether they are valid maximums or minimums).
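When $f$ is quadratic and $g$ is linear, the critical points of the Lagrangian are the solution of a linear system. The sketch below works through one such case, minimizing $x^2 + y^2$ subject to $x + y - 1 = 0$. The problem itself is an assumption chosen for illustration.

```python
import numpy as np

# Lagrangian: L(x, y, lam) = x^2 + y^2 - lam * (x + y - 1)
# Setting its partial derivatives to zero gives a linear system:
#   dL/dx   = 2x - lam       = 0
#   dL/dy   = 2y - lam       = 0
#   dL/dlam = -(x + y - 1)   = 0
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)

# At the solution, the gradient of f is a multiple of the gradient of g
grad_f = np.array([2 * x, 2 * y])
grad_g = np.array([1.0, 1.0])
print(np.allclose(grad_f, lam * grad_g))
```

The solution is the point of the line $x + y = 1$ closest to the origin, which matches the geometric picture of a level circle of $f$ tangent to the constraint.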

<br>

### Optimization with multiple constraints

Consider minimizing / maximizing the function $f(x)$ subject to the constraints $g_1(x) = 0, \dots, g_n(x) = 0$. The critical points $x_0$ we are looking for are the ones where all constraints hold and $\nabla f(x_0) = \lambda_1 \nabla g_1(x_0) + \dots + \lambda_n \nabla g_n(x_0)$ for some scalars $\lambda_1, \dots, \lambda_n$.

As above, we can build a Lagrangian and look for its critical points in order to solve the constrained problem:

&emsp; $\mathcal{L}(x,\lambda) = f(x) - \sum_i \lambda_i g_i(x)$
&emsp; with
&emsp; $\lambda = (\lambda_1, \dots, \lambda_n)$

**Proof:** Consider the set of points satisfying $G(x) = 0$, where $G(x) = (g_1(x), \dots, g_n(x))$. Assuming the $g_i$ are smooth and their gradients are linearly independent at $x_0$, this set is a manifold around $x_0$. Again, if $f$ changes to first order when we move away from $x_0$ along this manifold, then $x_0$ cannot be a critical point, since we could move along the manifold to find a better value.

The normal directions to the manifold at the point $x_0$ are the rows of the Jacobian $J(G)$ of $G$. Any vector $x$ tangent to the manifold, that is orthogonal to every row of $J(G)$, must be such that the change of $f$ along $x$ is zero:

&emsp; $J(G) = \begin{pmatrix} \nabla g_1 \\ \vdots \\ \nabla g_n \end{pmatrix}$
&emsp;
&emsp; $J(G) \, x = 0 \implies \nabla f^T x = 0$

This means that the kernel (null space) of $J(G)$ is contained in the kernel of $\nabla f^T$. Taking orthogonal complements, the space spanned by $\nabla f$ is contained in the row space of $J(G)$. This means that $\nabla f$ is expressible as a linear combination of the rows of $J(G)$: $\nabla f = \sum_i \lambda_i \nabla g_i$.
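The row-space argument can be illustrated on a small problem with two linear constraints in 3D, where the critical point again comes from a linear system. The problem (minimize $x^2 + y^2 + z^2$ subject to $x + y + z = 1$ and $x = y$) and the chosen tangent vector are assumptions made for this example.

```python
import numpy as np

# L = x^2 + y^2 + z^2 - lam1 * (x + y + z - 1) - lam2 * (x - y)
# Stationarity in (x, y, z) plus the two constraints, for the
# unknowns (x, y, z, lam1, lam2):
A = np.array([
    [2.0,  0.0, 0.0, -1.0, -1.0],
    [0.0,  2.0, 0.0, -1.0,  1.0],
    [0.0,  0.0, 2.0, -1.0,  0.0],
    [1.0,  1.0, 1.0,  0.0,  0.0],
    [1.0, -1.0, 0.0,  0.0,  0.0],
])
b = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
solution = np.linalg.solve(A, b)
point, lam = solution[:3], solution[3:]

# Jacobian of G = (g1, g2): each row is the gradient of one constraint
J = np.array([[1.0,  1.0, 1.0],
              [1.0, -1.0, 0.0]])
grad_f = 2 * point

# A vector in the kernel of J (tangent to both constraints)
tangent = np.array([1.0, 1.0, -2.0])

print(point, lam)
print(J @ tangent, grad_f @ tangent)   # both vanish
print(np.allclose(grad_f, lam @ J))    # grad f is a combination of the rows of J
```

In this example the second multiplier comes out as zero, because the first constraint alone already forces $x = y$ at the optimum.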

<br>

### Example: deriving properties through Lagrangian

Lagrangians and Lagrange multipliers ($\lambda_i$) can be used to solve optimization problems in closed form. They can also be used to derive some interesting relations between quantities.

For instance, say we have a covariance matrix $S$ and we are looking for the directions, that is the unit vectors $x$ (the constraint is on the norm), such that the quantity $x^T S x$ (the variance in that direction) is maximized. We can build a Lagrangian for this:

&emsp; $\mathcal{L}(x,\lambda) = x^T S x - \lambda ( x^T x - 1 )$
&emsp; and
&emsp; $\nabla_x \mathcal{L} = 2 S x - 2 \lambda x = 0$
&emsp; $\implies$
&emsp; $S x = \lambda x$

This shows that the vectors satisfying this condition are eigenvectors of the covariance matrix. At such a point, $x^T S x = \lambda x^T x = \lambda$, so the variance is maximized by the eigenvector with the largest eigenvalue.
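This result can be checked numerically: the top eigenvector returned by `np.linalg.eigh` should reach a variance that no random unit vector exceeds. The synthetic data, the mixing matrix, and the random seed below are assumptions used only to build a covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data (an assumption), used to build a covariance matrix
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
data = rng.normal(size=(500, 3)) @ mixing
S = np.cov(data, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(S)
v_top = eigenvectors[:, -1]

# Variance x^T S x along many random unit vectors
candidates = rng.normal(size=(10000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
variances = np.einsum("ij,jk,ik->i", candidates, S, candidates)

print(v_top @ S @ v_top, eigenvalues[-1])          # variance along v_top
print(variances.max() <= eigenvalues[-1] + 1e-9)   # no candidate does better
```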

<br>

# Optimization with inequality constraints
---

With inequality constraints, things are no longer symmetric: minimizing is not the same as maximizing anymore.

<br>

### Maximization with one inequality constraint

Consider **maximizing** the function $f(x)$ subject to the constraint $g(x) \ge 0$. The critical points $x_0$ are such that:

* $\nabla f(x_0) = - \lambda \nabla g(x_0)$ with $\lambda \ge 0$
* $\lambda g(x_0) = 0$: $\lambda$ can be non-zero only if the constraint is active, i.e. $g(x_0) = 0$

We can therefore create a function $\mathcal{L}$, called the Lagrangian, that combines $f$ and $g$ such that its critical points encapsulate the optimization objective:

&emsp; $\boxed{\mathcal{L}(x,\lambda) = f(x) + \lambda g(x)}$ with $\lambda \ge 0$

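**Proof:** For $g(x_0) = 0$, consider the level curves $f(x) = c$. At a critical point $x_0$, the level curve of $f$ going through $x_0$ must be tangent to the curve $g(x) = 0$, and $\nabla f$ must also point toward the forbidden region $g(x) < 0$, that is toward decreasing $g$: otherwise we could increase $f$ by moving into the allowed region. It means that $\nabla f$ must be collinear with $\nabla g$ and point in the opposite direction. For $g(x_0) > 0$, the constraint is inactive around $x_0$, so $x_0$ is an unconstrained critical point: $\nabla f(x_0) = 0$ and $\lambda = 0$.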

<br>

### Minimization with one inequality constraint

Consider **minimizing** the function $f(x)$ subject to the constraint $g(x) \ge 0$. The critical points $x_0$ are such that:

* $\nabla f(x_0) = \lambda \nabla g(x_0)$ with $\lambda \ge 0$
* $\lambda g(x_0) = 0$: $\lambda$ can be non-zero only if the constraint is active, i.e. $g(x_0) = 0$

We can therefore create a function $\mathcal{L}$, called the Lagrangian, that combines $f$ and $g$ such that its critical points encapsulate the optimization objective:

&emsp; $\boxed{\mathcal{L}(x,\lambda) = f(x) - \lambda g(x)}$ with $\lambda \ge 0$
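A one-dimensional example makes the two cases (active and inactive constraint) concrete. The sketch below minimizes $f(x) = (x - 2)^2$ subject to $g(x) = \text{upper} - x \ge 0$. The problem and the helper name `minimize_shifted_square` are assumptions made for illustration. A grid search is included only as a sanity check.

```python
import numpy as np

def f(x):
    return (x - 2.0)**2

def minimize_shifted_square(upper):
    # Minimize f(x) = (x - 2)^2 subject to g(x) = upper - x >= 0
    x_free = 2.0                        # unconstrained minimizer, where f'(x) = 0
    if upper - x_free >= 0:
        return x_free, 0.0              # constraint inactive: lambda = 0
    x0 = upper                          # constraint active: g(x0) = 0
    grad_f, grad_g = 2 * (x0 - 2.0), -1.0
    lam = grad_f / grad_g               # from grad_f = lambda * grad_g
    return x0, lam

for upper in (1.0, 3.0):
    x0, lam = minimize_shifted_square(upper)
    # Brute-force check over the feasible region x <= upper
    grid = np.linspace(-5.0, upper, 100001)
    x_grid = grid[np.argmin(f(grid))]
    print(upper, x0, lam, x_grid)
```

With `upper = 1.0` the constraint is active and the multiplier is positive. With `upper = 3.0` the unconstrained minimum is feasible and the multiplier is zero.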

<br>

### General case (multiple constraints of different types)

To **maximize** the function $f(x)$ subject to the constraints $g_i(x) = 0$ and $h_j(x) \ge 0$, find the critical points of the Lagrangian:

&emsp; $\mathcal{L}(x,\lambda,\mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$
&emsp; with
&emsp; $\lambda_i \in \mathbb{R}$
&emsp; and
&emsp; $\mu_j \ge 0$

To **minimize** the function $f(x)$ subject to the constraints $g_i(x) = 0$ and $h_j(x) \ge 0$, find the critical points of the Lagrangian:

&emsp; $\mathcal{L}(x,\lambda,\mu) = f(x) - \sum_i \lambda_i g_i(x) - \sum_j \mu_j h_j(x)$
&emsp; with
&emsp; $\lambda_i \in \mathbb{R}$
&emsp; and
&emsp; $\mu_j \ge 0$

**The simple mnemonic is + for maximization and - for minimization**, with the multipliers of the inequality constraints kept non-negative.