## Mathematical Optimization Series

# Part 2: Naive search methods

In this post we describe two rather naive approaches to finding the global minimum of a function: random evaluation and random local search.  Both approaches are quite simple; one might think of them as lazy first attempts at mathematical optimization.  While neither approach is often used in machine learning, the latter approach, random local search, lets us discuss (in a rather simple context) several key concepts that all further, more professional and efficient algorithms will employ.  These include the notion of an iterative algorithm for minimizing a generic function, the notion of a step length, the value of randomness in avoiding saddle points, cost-function plots, and more.

In [1]:
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
from mlrefined_libraries import basics_library as baslib
from mlrefined_libraries import math_optimization_library as optlib
import autograd.numpy as np
from matplotlib import gridspec
import math
%matplotlib notebook

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

What we can introduce here

- the notion of a step length, both fixed and diminishing (not sure about adjustable here; besides, that's a more advanced topic than I imagine covering at the start of the text)
- the notion of moving towards a minimum in the input plane, with 1-d and 2-d input examples
- the notion of visualizing higher dimensional paths via cost function decrease
- the notion of random movements being helpful in overcoming saddle points
- the notion of local search

# RANDOM EVALUATION

## Experiment 1: low dimension - grid and random eval

Both an even grid and random sampling work perfectly well in low dimensions; here we illustrate with one-dimensional input.

In [5]:
# define function
func = lambda w: np.dot(w.T,w) + 0.2
num_samples = 5
view = [20,140]

# plot 2d and 3d version, with even grid and randomly selected points
optlib.random_method_experiments.double_plot(func,num_samples,view = view)
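
The helper `double_plot` comes from the post's custom library and its internals are not shown here. As a rough, self-contained sketch of the idea behind the experiment (not the library's implementation), the snippet below evaluates the same quadratic at `num_samples` evenly spaced points and at `num_samples` uniformly random points, then keeps the lowest evaluation from each set. The sampling range $[-1,1]$, the seed, and the helper name `best_of` are assumptions made for illustration.

```python
import numpy as np

# same quadratic as in the cell above, applied to a 1-d array input
def func(w):
    return np.dot(w, w) + 0.2

num_samples = 5
rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# candidate inputs: an even grid and uniform random draws on [-1, 1] (assumed range)
grid_pts = np.linspace(-1, 1, num_samples).reshape(-1, 1)
rand_pts = rng.uniform(-1, 1, size=(num_samples, 1))

def best_of(points):
    # evaluate every candidate and keep the one with the lowest function value
    vals = np.array([func(w) for w in points])
    ind = np.argmin(vals)
    return points[ind], vals[ind]

w_grid, g_grid = best_of(grid_pts)
w_rand, g_rand = best_of(rand_pts)
print("best grid point:  ", w_grid, "value:", g_grid)
print("best random point:", w_rand, "value:", g_rand)
```

With an odd number of grid points the grid happens to contain the minimizer $w = 0$ exactly; the random draws only land near it by chance.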


## Experiment 2: simple quadratic as the input dimension increases

Take 1000 samples at each input dimension from 1 to 100, just for show, and evaluate each sample with a correspondingly simple symmetric quadratic function.  We plot the mean and standard deviation of the evaluations at each dimension.  Of course the mean increases, and we can even calculate the average evaluation as the dimension increases, which we do (likely as a footnote calculation) after the experiment.

In [13]:
# run experiment for global random evaluation
optlib.random_method_experiments.random_eval_experiment()
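
The body of `random_eval_experiment` is not shown, so the following is only a sketch of what such an experiment might look like. It assumes the quadratic is $g(w) = w^Tw$ and that each coordinate is drawn uniformly from $[-1,1]$. Under those assumptions the average evaluation can be computed by hand: a uniform variable on $[a,b]$ has variance $\frac{(b-a)^2}{12}$, so each coordinate has mean 0 and $E[w_i^2] = \frac{1}{3}$, giving $E[g(w)] = \frac{D}{3}$ in $D$ dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
num_samples = 1000
dims = np.arange(1, 101)         # input dimensions 1 through 100

means, stds = [], []
for D in dims:
    # draw num_samples points with each coordinate uniform on [-1, 1]
    W = rng.uniform(-1, 1, size=(num_samples, D))
    vals = np.sum(W**2, axis=1)  # g(w) = w^T w for every sample (assumed form)
    means.append(vals.mean())
    stds.append(vals.std())
means = np.array(means)
stds = np.array(stds)

# hand-computed average evaluation under the assumptions above
expected = dims / 3.0
for D in (1, 10, 100):
    print(D, means[D - 1], expected[D - 1], stds[D - 1])
```

The empirical mean should track $\frac{D}{3}$ closely, which is one way to see that the typical random evaluation moves further from the minimum value of 0 as the dimension grows.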


Each axis of the input is sampled according to a uniform distribution on the interval $[-1,1]$.  Thus the *average value along each input dimension* is equal to 0 (as the average of a uniform distribution on the interval $[a,b]$ is given as $\frac{1}{2}(a+b)$).

The problem is that the probability that all input elements are small in magnitude (close to zero or equal to zero) *simultaneously* gets exponentially smaller as we go up in dimension.  For example

- in one dimension, the probability of selecting a value $v$ on the interval $[-0.1,0.1]$ is, by definition, $p(|v| \leq 0.1) = \frac{0.2}{2} = 0.1$.  Since each dimension is drawn independently, in $D$ dimensions the probability of drawing every element $v_i$ so that $|v_i| \leq 0.1$ is $p(|v_i| \leq 0.1,\,\, i = 1,...,D) = (0.1)^D$.

Thus as our dimension increases, the probability of randomly accessing points close to the true global minimum at the origin diminishes exponentially.  In order to keep up with this, the number of samples would have to increase exponentially with dimension as well, which is computationally infeasible.
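
The $(0.1)^D$ figure can be checked numerically. The sketch below is a small Monte Carlo estimate for $D = 1, 2, 3$; the sample count and seed are arbitrary choices, and the estimate is only practical for small $D$, which is the point of the argument.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
num_samples = 200_000
radius = 0.1

estimates = {}
for D in (1, 2, 3):
    # each coordinate is uniform on [-1, 1], drawn independently
    W = rng.uniform(-1, 1, size=(num_samples, D))
    # a sample counts only if every coordinate is within 0.1 of zero
    inside = np.all(np.abs(W) <= radius, axis=1)
    estimates[D] = inside.mean()
    print(D, estimates[D], radius**D)

# average number of draws needed per point inside the box, from the formula
for D in (1, 10, 100):
    print(D, 10.0**D)
```

By the formula, one point inside the box costs on average $10^D$ draws, so even $D = 10$ already calls for about ten billion samples.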

# RANDOM LOCAL SEARCH

In [16]:
# run experiment for random local search
optlib.random_method_experiments.random_local_experiment()


# Random local search trail 

In [135]:
# define function and the initial point for the search
func = lambda w: np.dot(w.T,w) + 2
pt = [2,2]

# visualize random local search on the function's surface
view = [40,50]
optlib.random_local_search.visualize3d(func=func,view = view,pt = pt,wmax=2)
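
The internals of `optlib.random_local_search` are not shown in this post. As a minimal sketch of one common form of random local search (an assumption, not necessarily what the library does): at each iteration sample a batch of random unit-length directions, evaluate the function one step of length `alpha` away along each of them, and move to the best candidate only if it lowers the function value. The function name `random_local_search` and its parameters are hypothetical.

```python
import numpy as np

def random_local_search(func, w0, alpha=1.0, max_its=10, num_directions=1000, seed=0):
    """Sketch of random local search with a fixed step length alpha."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    weight_history = [w.copy()]
    cost_history = [func(w)]
    for _ in range(max_its):
        # sample random directions and normalize each one to unit length
        directions = rng.standard_normal((num_directions, w.size))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        # candidate points exactly one step of length alpha from the current point
        candidates = w + alpha * directions
        vals = np.array([func(c) for c in candidates])
        best = np.argmin(vals)
        # move only if the best candidate lowers the function value
        if vals[best] < cost_history[-1]:
            w = candidates[best]
        weight_history.append(w.copy())
        cost_history.append(func(w))
    return np.array(weight_history), np.array(cost_history)

# same quadratic and initial point as in the cell above
func = lambda w: np.dot(w.T, w) + 2
weights, costs = random_local_search(func, w0=[2, 2], alpha=1.0, max_its=5)
print(weights[-1], costs[-1])
```

Recording `cost_history` at every step gives a cost-function plot that works in any input dimension. With a fixed step length the search stalls once every candidate one full step away is worse than the current point, which is one motivation for a diminishing step length.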


In [38]:
# define function and the initial point for the search
func = lambda w: np.tanh(4*w[0] + 4*w[1]) + max(4*0.1*w[0]**2,1) + 1
pt = [1,1]

# visualize random local search on the function's surface
view = [20,-60]
optlib.random_local_search.visualize3d(func=func,view = view,pt = pt,wmax=2,max_its = 5,num_contours = 10)


In [9]:
# define function and the initial point for the search
func = lambda w: (1 - w[0])**2 + 100*(w[1] - w[0]**2)**2
pt = [-1.9, 1]

# visualize random local search on the function's surface
view = [40,-50]
optlib.random_local_search.visualize3d(func=func,view = view,pt = pt,wmax=3,num_contours = 50)
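
The function above has a long, narrow, curved valley, which makes the choice of step length matter. The sketch below compares a fixed step length with a diminishing one, $\alpha_k = \frac{1}{k}$, on this function from the same initial point. It is a standalone illustration under assumptions: the helper `local_search`, the schedule, and the iteration and direction counts are hypothetical and not taken from the library.

```python
import numpy as np

def rosenbrock(w):
    # same function as in the cell above
    return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2

def local_search(func, w0, step, max_its=50, num_directions=1000, seed=0):
    """Random local search where step(k) gives the step length at iteration k."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    costs = [func(w)]
    for k in range(1, max_its + 1):
        # random unit-length directions
        d = rng.standard_normal((num_directions, w.size))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        # candidates one step of length step(k) from the current point
        candidates = w + step(k) * d
        vals = np.array([func(c) for c in candidates])
        best = np.argmin(vals)
        # accept the best candidate only if it decreases the function
        if vals[best] < costs[-1]:
            w = candidates[best]
        costs.append(func(w))
    return w, np.array(costs)

w_fixed, costs_fixed = local_search(rosenbrock, [-1.9, 1], step=lambda k: 1.0)
w_dimin, costs_dimin = local_search(rosenbrock, [-1.9, 1], step=lambda k: 1.0 / k)
print("fixed step:      ", w_fixed, costs_fixed[-1])
print("diminishing step:", w_dimin, costs_dimin[-1])
```

Both runs use the same seed, so they see the same random directions and differ only in step length. Printing or plotting the two cost histories side by side shows how each schedule behaves once the search reaches the valley floor.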


# Summary

While it is a good first algorithm to discuss due to its ease of implementation and minimal use of the cost function (we need only be able to evaluate the function itself), random search is typically ineffective for modern machine learning applications.

This is first because the cost functions we deal with in machine learning have known algebraic forms, allowing us to leverage their derivatives to quickly and efficiently determine directions that decrease the function's value (i.e., we need not search around randomly for such directions).  We will see this in significant detail with the other algorithms we discuss, gradient descent and Newton's method.  Moreover, many modern machine learning cost functions have input dimension $N$ on the order of thousands to hundreds of millions.  In such contexts, randomly seeking out a direction that substantially decreases a function's value becomes wildly inefficient, requiring exponentially more sampling.
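
To make the last point concrete, here is a small sketch, under assumptions, that estimates how often a single random step decreases the simple quadratic $g(w) = w^Tw$ as the dimension grows. The setup is hypothetical: we start at a point of unit length, use a fixed step length `alpha = 1`, and draw directions uniformly from the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
num_directions = 100_000
alpha = 1.0                      # fixed step length (assumed)

def descent_fraction(D):
    # start at a point of unit length; by symmetry the choice of axis does not matter
    w = np.zeros(D)
    w[0] = 1.0
    # random unit-length directions in D dimensions
    d = rng.standard_normal((num_directions, D))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    # evaluate g(w) = w^T w at every candidate point
    candidates = w + alpha * d
    new_vals = np.sum(candidates**2, axis=1)
    # fraction of random steps that lower the function value
    return np.mean(new_vals < np.dot(w, w))

fractions = {D: descent_fraction(D) for D in (1, 2, 5, 10, 25, 50)}
for D, frac in fractions.items():
    print(D, frac)
```

In this setup half of the steps help in one dimension and about a third in two. The fraction falls off quickly after that, so many more random directions must be tried per step to find one that helps.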