## The curse of dimensionality and the method of naive evaluation

We live in a world with three spatial dimensions, and as such we have evolved a good intuition for three-dimensional problems. Things get considerably easier when we drop down to two dimensions, and easier still in one-dimensional space: a simple line. This trend is not good news, because in machine learning we usually have to go the other way and study problems in dimensions much larger than three, sometimes in the hundreds of millions or even billions. The notion that the analysis of problems becomes rapidly harder as we increase dimensionality is generally referred to as the curse of dimensionality, and it has undesirable consequences in many contexts within machine learning. For instance, the curse of dimensionality renders the naive evaluation method of finding a function's minimum useless, as we discuss below.

Recall that with naive evaluation, we take sample points from the input space of the problem, evaluate the function only at these samples, and keep the one giving the smallest function evaluation. Sampling the input space can be done randomly or in a uniform fashion. Let's first consider the latter case in one dimension where, as shown in the figure below (left panel), consecutive samples are separated by a distance $d$. Here $d$ is chosen so that we end up with 3 samples per unit interval. Keeping $d$ fixed and moving from one dimension to two (middle panel) and three (right panel) dimensions, we sample along each dimension with the same frequency as we did in one dimension, resulting in a total of $3^2=9$ samples per unit area and $3^3=27$ samples per unit volume, respectively. In general we need $3^N$ samples per unit hypercube in $N$ dimensions, and this exponential growth in the number of samples/evaluations is hugely problematic when dealing with high-dimensional spaces.

<p><img src="curse_1.png" width="70%" height="auto"></p>
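To make the growth concrete, here is a minimal sketch of naive evaluation on a uniform grid. The function name `naive_grid_minimize`, the search box $[-1, 1]^N$, and the choice of 3 grid values per axis are assumptions made for illustration; the test function is the simple quadratic $g(\mathbf{w})=\mathbf{w}^T\mathbf{w}$.

```python
import itertools


def naive_grid_minimize(g, n_dims, samples_per_axis=3, lo=-1.0, hi=1.0):
    """Evaluate g on a uniform grid over [lo, hi]^n_dims and keep the best point.

    Returns (best_point, best_value, number_of_evaluations).
    The box [lo, hi] and samples_per_axis are illustrative assumptions.
    """
    step = (hi - lo) / (samples_per_axis - 1)
    axis = [lo + i * step for i in range(samples_per_axis)]

    best_w, best_val, n_evals = None, float("inf"), 0
    # itertools.product enumerates every grid point: samples_per_axis ** n_dims of them
    for w in itertools.product(axis, repeat=n_dims):
        val = g(w)
        n_evals += 1
        if val < best_val:
            best_w, best_val = w, val
    return best_w, best_val, n_evals


def quadratic(w):
    """g(w) = w^T w."""
    return sum(wi * wi for wi in w)


for n in (1, 2, 3, 10):
    _, best_val, n_evals = naive_grid_minimize(quadratic, n)
    print(f"N={n:2d}  evaluations={n_evals:6d}  best value={best_val:.3f}")
```

With 3 values per axis the evaluation count is $3^N$: 3, 9, and 27 for the first three cases, and already 59049 at $N=10$, even though the grid is extremely coarse.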

Note that this issue remains regardless of whether we take samples regularly (as we did above) or randomly. To see this, let us fix the total number of samples at some value, say 10, and draw them randomly as shown in the figure below. Once again, due to the curse of dimensionality, the average number of samples per unit hypercube drops exponentially as the dimensionality increases, leaving many regions of the space without a single sample.

<p><img src="curse_2.png" width="70%" height="auto"></p>
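The same effect can be checked numerically. The sketch below assumes a domain $[0, 3]^N$ divided into $3^N$ unit cells (the side length of 3, the seed, and the name `random_sample_coverage` are illustrative choices, not taken from the figure) and reports the average number of samples per cell along with the fraction of cells left empty.

```python
import random


def random_sample_coverage(n_dims, n_samples=10, side=3, seed=0):
    """Drop n_samples uniform random points into [0, side]^n_dims.

    Returns (average samples per unit cell, fraction of unit cells left empty).
    """
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_samples):
        point = [rng.uniform(0, side) for _ in range(n_dims)]
        # Index of the unit cell containing the point; min() guards the upper edge
        cell = tuple(min(int(x), side - 1) for x in point)
        occupied.add(cell)

    n_cells = side ** n_dims
    return n_samples / n_cells, 1 - len(occupied) / n_cells


for n in (1, 2, 3, 6):
    density, empty = random_sample_coverage(n)
    print(f"N={n}  samples per cell={density:.4f}  empty cells={empty:.1%}")
```

Since 10 samples can occupy at most 10 cells, the empty fraction is at least $1 - 10/3^N$, which approaches 1 quickly as $N$ grows.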

## The curse of dimensionality and random selection of descent directions 

The curse of dimensionality also poses a major obstacle to random direction selection, particularly in high-dimensional spaces. We illustrate this through an example where we aim to find a descent direction for the simple quadratic function

$$g\left(\mathbf{w}\right)=\mathbf{w}^{T}\mathbf{I}_{N\times N}\mathbf{w}$$
starting at the point 

$$\mathbf{w}^{0}=\left[\begin{array}{c}
1\\
0\\
0\\
\vdots\\
0
\end{array}\right]_{ N\times1}$$


When $N=1$, this reduces to finding a descent direction at random for the function $g(w)=w^2$ starting at $w^0=1$, as shown in the figure below. 

<p><img src="hypersphere_1d.png" width="50%" height="auto"></p>

Here, starting at $w^0=1$, there are only 2 unit directions we can move in: (i) the negative direction toward the origin, shown in yellow, which is a descent direction (as it takes us to the minimum of our quadratic function), or (ii) the positive direction away from the origin, shown in blue, which is an ascent direction (as the function evaluation increases at its endpoint). So in this case, if we choose our direction randomly, we have a $\frac{1}{2}=50\%$ descent probability. Not too bad!

Let's see what happens when $N=2$. As you can see in the left panel of the figure below, starting at $\mathbf{w}^{0}=\left[\begin{array}{cc}1 & 0\end{array}\right]^{T}$ (shown by a blue circle) we now have infinitely many unit directions to choose from, of which only those whose endpoints lie inside the unit circle centered at the origin are descent directions. Therefore, if we were to choose a unit direction at random, the descent probability would be the length of the yellow arc in the figure divided by the circumference of the unit circle centered at $\mathbf{w}^{0}$.

$$\text{descent probability}=\frac{\text{length of yellow arc}}{\text{length of unit circle}}$$

For clarity, the two-dimensional input space is redrawn in the right panel of the figure below.

<p><img src="hypersphere_2d.png" width="100%" height="auto"></p>

Notice that the black circle shown in the right panel, centered at the midpoint between $\mathbf{w}^{0}$ and the origin, completely encompasses the yellow arc, and hence one-half of its length is greater than the length of the yellow arc. In other words, the length of the yellow arc is upper-bounded by the length of the black semicircle that lies inside the unit circle centered at the origin, and we have

$$\text{descent probability}<\frac{1}{2}\cdot\frac{\text{length of black circle}}{\text{length of unit circle}}$$

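Both the numerator and the denominator are now easy to compute. The black circle passes through the two points where the unit circles centered at the origin and at $\mathbf{w}^{0}$ intersect, so its radius is one leg of a right triangle with hypotenuse $1$ and other leg $\frac{1}{2}$. The Pythagorean theorem then gives the radius as $\sqrt{1-\frac{1}{4}}=\frac{\sqrt{3}}{2}$.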

$$\text{descent probability}<\frac{1}{2}\cdot\frac{2\pi\left(\frac{\sqrt{3}}{2}\right)}{2\pi\left(1\right)}=\frac{\sqrt{3}}{4}\approx 0.433$$

Therefore in two dimensions the chance of randomly selecting a descent direction is at most 43%, down about 7 percentage points from its value in one dimension. This rather slight decrease may not seem like a deal breaker at first, until we realize that this upper bound on the descent probability shrinks at an exponential rate as we increase $N$. In higher dimensions we can use the same geometric argument to find an upper bound on the descent probability, only this time we are dealing with hyperspheres instead of circles. More specifically, in $N$ dimensions we can write

$$\text{descent probability}<\frac{1}{2}\cdot\frac{\text{surface area of encompassing hypersphere of radius } \frac{\sqrt{3}}{2}}{\text{surface area of unit hypersphere}}=\frac{1}{2}\cdot\left(\frac{\sqrt{3}}{2}\right)^{N-1}$$

For instance, when $N=30$ the descent probability falls below 1%. 
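As a sanity check on this bound, we can estimate the descent probability by simulation. The sketch below is a Monte Carlo estimate under the same setup as above: unit-length directions drawn uniformly at random (by normalizing a Gaussian vector), starting point $\mathbf{w}^0 = [1, 0, \ldots, 0]^T$, and a direction counted as descent when its endpoint lies strictly inside the unit hypersphere centered at the origin. The function names, trial count, and seed are illustrative assumptions.

```python
import math
import random


def descent_probability_estimate(n_dims, n_trials=20000, seed=0):
    """Monte Carlo estimate of the chance that a random unit direction d
    satisfies g(w0 + d) < g(w0) for g(w) = w^T w and w0 = (1, 0, ..., 0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # A normalized Gaussian vector is uniformly distributed on the unit hypersphere
        d = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        norm = math.sqrt(sum(x * x for x in d))
        d = [x / norm for x in d]
        # g evaluated at the endpoint w0 + d; note that g(w0) = 1
        endpoint_value = (1.0 + d[0]) ** 2 + sum(x * x for x in d[1:])
        if endpoint_value < 1.0:
            hits += 1
    return hits / n_trials


def descent_probability_bound(n_dims):
    """Upper bound from the text: (1/2) * (sqrt(3)/2)^(N-1)."""
    return 0.5 * (math.sqrt(3) / 2) ** (n_dims - 1)


for n in (1, 2, 3, 10, 30):
    est = descent_probability_estimate(n)
    bound = descent_probability_bound(n)
    print(f"N={n:2d}  estimate={est:.4f}  bound={bound:.4f}")
```

In two dimensions the estimate should land near $\frac{1}{3}$ (the yellow arc spans 120 degrees of the circle), comfortably below the bound of roughly 0.433. For $N=1$ the bound equals the true probability of $\frac{1}{2}$, so sampling noise can push the estimate slightly above or below it.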