## Mathematical Optimization Series

# 1. Mathematical optimization algorithms

The tools of mathematical optimization are designed to find the smallest value of a function, attained at a *global minimum* (a single lowest point) or at several *global minima* (multiple points sharing that lowest value). Note that in machine learning / deep learning applications such functions typically have *finite global minima*, i.e., their lowest value is not negative infinity.

In the next several posts we introduce widely used mathematical optimization algorithms for finding global minima of a function $g\left(\mathbf{w}\right)$. The minimization of a function $g$ is formally written as

\begin{equation}
\underset{\mathbf{w}}{\mbox{minimize}}\,\,\,\,g\left(\mathbf{w}\right)
\end{equation}

which is simply shorthand for saying "minimize $g$ over all input values $\mathbf{w}$".

The algorithms we discuss operate by sequentially decreasing the value of $g$, halting only when a stationary point is found, i.e., a point satisfying the first order condition (discussed in the previous post). Thus one can also consider these techniques as numerical methods for solving the first order system of equations $\nabla g\left(\mathbf{w}\right)=\mathbf{0}_{N\times1}$ which, as we discussed previously, one can rarely solve by hand.

## 1.1  The big picture

All numerical schemes for minimizing a general function $g$ work as follows:

<div style="background-color:rgba(0, 0, 0, 0.0470588); text-align:center; vertical-align: middle; padding:40px 0;">
 <p>
 
1)  Start the minimization process at some *initial
point* $\mathbf{w}^{0}$.

<br /><br />

2)  Take iterative steps denoted by $\mathbf{w}^{1},\,\mathbf{w}^{2},\,\ldots$,
going "downhill" towards a stationary point of $g$.

<br /><br />

3)  Repeat step 2) until the sequence of points converges to a stationary point of
$g$.


 </p>
 </div>
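
As a minimal sketch of the three steps above, the Python snippet below runs one such iterative scheme on a one-dimensional function. The text has not yet specified how each downhill step is taken, so the update rule used here (a fixed-length step against the derivative, i.e., basic gradient descent) is an assumption, as are the test function `g` and the names `minimize_sketch`, `step`, `max_its`, and `eps`.

```python
import numpy as np

def g(w):
    # Hypothetical non-convex test function (not taken from the text).
    return np.sin(3 * w) + 0.3 * w ** 2

def grad_g(w):
    # Derivative of g, computed by hand.
    return 3 * np.cos(3 * w) + 0.6 * w

def minimize_sketch(grad, w0, step=0.05, max_its=1000, eps=1e-6):
    # Step 1: start at the initial point w0.
    w = float(w0)
    history = [w]
    for _ in range(max_its):
        d = grad(w)
        # Step 3: stop once we are (numerically) at a stationary point.
        if abs(d) < eps:
            break
        # Step 2: take a step "downhill" (assumed rule: fixed-step gradient descent).
        w = w - step * d
        history.append(w)
    return w, history

w_final, history = minimize_sketch(grad_g, w0=1.0)
print("final point:", w_final)
print("g at final point:", g(w_final))
print("|g'| at final point:", abs(grad_g(w_final)))
```

The returned `history` holds the sequence $w^{0},\,w^{1},\,w^{2},\,\ldots$, so one can check that the value of `g` decreases along it.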

This idea is illustrated in the figure below for the minimization of a non-convex function. Notice that since this function has three stationary points, the one we reach by traveling downhill depends entirely on where we begin the optimization process. Ideally we would like to find the global minimum, or the lowest of the function's minima, which for a general non-convex function requires that we run the procedure several times with different initializations (or starting points).

<p>
<img src= '../../mlrefined_images/math_optimization_images/Fig_2_5_1.png' width="75%" height="100%"/>
</p>
<p>
<img src= '../../mlrefined_images/math_optimization_images/Fig_2_5_2.png' width="75%" height="100%"/>
</p>

**Figure 1:** The stationary point of a generic function found via an iterative method of mathematical optimization is dependent on the choice of initial point $w^{0}$. In the top panel our initialization leads us to find the global minimum, while in the bottom panel the two different initializations lead to a saddle point on the left, and a non-global minimum on the right.
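
The dependence on the initial point can be sketched numerically by running the same procedure from several starting points and keeping the lowest result. The snippet below is an illustration under assumptions: the function `g` is a hypothetical one-dimensional non-convex example (it is not the function plotted in Figure 1 and has no saddle point), the descent rule is fixed-step gradient descent, and the names `descend`, `starts`, and `finals` are made up for this sketch.

```python
import numpy as np

def g(w):
    # Hypothetical non-convex function with several local minima.
    return np.sin(3 * w) + 0.3 * w ** 2

def grad_g(w):
    return 3 * np.cos(3 * w) + 0.6 * w

def descend(w0, step=0.05, max_its=2000, eps=1e-8):
    # Fixed-step gradient descent from the initial point w0 (assumed rule).
    w = float(w0)
    for _ in range(max_its):
        d = grad_g(w)
        if abs(d) < eps:
            break
        w -= step * d
    return w

# Run the same procedure from several different initializations.
starts = np.linspace(-4.0, 4.0, 9)
finals = [descend(w0) for w0 in starts]

for w0, w in zip(starts, finals):
    print(f"w0 = {w0:+.1f}  ->  w = {w:+.4f},  g(w) = {g(w):+.4f}")

# Keep the stationary point with the lowest function value.
best = min(finals, key=g)
print("lowest value found at w =", best)
```

Different starting points end at different stationary points, and only some of them reach the lowest one. Note that taking the best of several runs gives no guarantee of finding the global minimum in general; it only improves the odds.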


The numerical methods discussed here halt at a stationary point $\mathbf{w}$, that is, a point where $\nabla g\left(\mathbf{w}\right)=\mathbf{0}_{N\times1}$, which as we have previously seen may or may not constitute a minimum of $g$ if $g$ is non-convex. However, this issue does not at all preclude the use of non-convex cost functions in machine learning (or other scientific disciplines); it is simply worth being aware of.

## 1.2  Stopping conditions

One of several stopping conditions may be selected to halt numerical algorithms that seek stationary points of a given function
$g$. The two most commonly used stopping criteria are:

 <div style="background-color:rgba(0, 0, 0, 0.0470588); text-align:center; vertical-align: middle; padding:40px 0;">
 <p>
 
1)  When a pre-specified number of iterations has been completed.

<br /><br />

2)  When the gradient is small enough, i.e., $\left\Vert \nabla g\left(\mathbf{w}^{k}\right)\right\Vert _{2}<\epsilon$
for some small $\epsilon>0$.

 </p>
 </div>

Perhaps the most naive stopping condition for a numerical algorithm is to halt the procedure after a pre-defined number of iterations. Note that this extremely simple condition does not provide any convergence guarantee, and hence is typically used in practice in conjunction with other stopping criteria, as a necessary cap on the number of iterations when convergence is slow. The second condition directly translates our desire to find a stationary point, at which the gradient is by definition zero. One could also stop the procedure when continuing it no longer considerably decreases the objective function, or no longer considerably changes the point $\mathbf{w}^{k}$ itself, from one iteration to the next.
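
The stopping criteria above can be combined in a single loop, as in the following sketch. Everything specific here is an assumption made for illustration: the update rule (fixed-step gradient descent), the convex test function $g\left(\mathbf{w}\right)=\mathbf{w}^{T}\mathbf{w}$, and the names `run_with_stopping`, `step`, `max_its`, `eps`, and `tol`. The "small change" test is switched off by default (`tol=0.0`) so that only the two criteria listed in the box are active unless requested.

```python
import numpy as np

def run_with_stopping(g, grad, w0, step=0.1, max_its=500, eps=1e-6, tol=0.0):
    w = np.asarray(w0, dtype=float)
    for k in range(max_its):
        d = grad(w)
        # Criterion 2: the gradient norm is small enough.
        if np.linalg.norm(d) < eps:
            return w, k, "small gradient"
        w_new = w - step * d  # assumed update: fixed-step gradient descent
        # Optional criterion: objective or point barely changes (off when tol == 0).
        if abs(g(w) - g(w_new)) < tol or np.linalg.norm(w_new - w) < tol:
            return w_new, k + 1, "small change"
        w = w_new
    # Criterion 1: the pre-specified number of iterations is used up.
    return w, max_its, "iteration cap"

def g(w):
    # Hypothetical convex test function g(w) = w^T w.
    return float(np.dot(w, w))

def grad(w):
    return 2 * w

w_a, its_a, reason_a = run_with_stopping(g, grad, w0=[3.0, -4.0])
w_b, its_b, reason_b = run_with_stopping(g, grad, w0=[3.0, -4.0], max_its=5)
w_c, its_c, reason_c = run_with_stopping(g, grad, w0=[3.0, -4.0], tol=1e-8)

for its, reason, w in [(its_a, reason_a, w_a), (its_b, reason_b, w_b), (its_c, reason_c, w_c)]:
    print(f"{reason:>14} after {its:3d} iterations, gradient norm = {np.linalg.norm(grad(w)):.2e}")
```

In the second run the iteration cap fires while the gradient is still far from zero, which illustrates why the cap alone says nothing about convergence.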