# Breaking it Down: Gradient Descent

In this post, we will: 
- Define gradient descent
- Explore the calculus behind gradient descent for a univariate function
- Visualize gradient descent of multivariate functions
- Contextualize gradient descent within machine and deep learning

**Outline**
1. [What is Gradient Descent?](#1-what-is-gradient-descent)
2. [Breaking Down Gradient Descent](#2-breaking-down-gradient-descent)
    1. [Computing the gradient](#21-computing-the-gradient)
    2. [Descending the gradient](#22-descending-the-gradient)
3. [Descending multivariate functions](#3-descending-the-gradient-of-multivariate-functions)
4. [Conclusion: Contextualizing Gradient Descent](#4-conclusion-contextualizing-gradient-descent)
5. [Resources](#5-resources)

### 1. What is Gradient Descent?
Gradient descent is an optimization algorithm that is used to improve the performance of deep/machine learning models. Over a repeated series of training steps, gradient descent serves to identify optimal parameters to minimize the cost of a model. In the next section, we're going to step down from this satellite-view description and look more closely at what gradient descent actually is.

<p align="center">
<video controls src="media/himmelblau_path7.mov" width="60%">
</p>

### 2. Breaking Down Gradient Descent
To gain an intuitive understanding of gradient descent, let's first ignore machine and deep learning. To start, let's instead work with a simple function:

$$ f(x) = x^2 $$

The goal in gradient descent is to find a *minimum* of a function, i.e., an input that produces the lowest possible output value. In other words, given some function $f(x)$, how can we find the value of $x$ that makes the output of $f(x)$ as small as possible? For our simple example function, the obvious answer is $x = 0$, where $f(x) = 0$.

The important part of this problem is: if we initialize $x$ to some random number, say $x = 1.8$, is there some way to automatically update $x$ so that it eventually produces the *minimal* output of the function? This is essentially the goal of gradient descent in machine/deep learning. We want to *automatically* find the best weights in the model that will produce the *minimal* output from our cost function.

We can automatically find (or come close to) these minima with gradient descent in a two-step process.

1. First, we need to find the *gradient* (or *slope*) of the function at the point where our input parameter $x$ sits. 
2. Then, we need to *update* our input parameter $x$ by telling it to take a step *down* the gradient.

This process is then repeated over and over until the output of our function stabilizes at a minimum or the gradient falls below a defined tolerance level.

<p align="center">
<video controls src="media/x**2_descent.mov" width="60%">
</p>

##### 2.1. Computing the gradient
To find the slope (or *gradient*) of the function $f(x)$ at any value of $x$, we can differentiate\* the function. Differentiating our example function is straightforward with the power rule, $\frac{d}{dx}x^n = nx^{n-1}$, which gives us $f'(x) = 2x$.

Using our starting point $x = 1.8$, we find our starting gradient to be $dx = 3.6$.

Below, let's write a simple function in Python that computes this gradient for us.

###### \*I'd strongly recommend checking out [3Blue1Brown's video][3b1b] to intuitively understand differentiation. The differentiation of this sample function from first principles can be seen [here][socratic].

[3b1b]: https://www.youtube.com/watch?v=9vKqVkMQHKk&list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr&index=2&t=2s
[socratic]: https://socratic.org/questions/how-you-you-find-the-derivative-f-x-x-2-using-first-principles

```python
def compute_gradient(x: float) -> float:
    """Compute the gradient of f(x) = x**2 at the input x.

    Args:
        x (float): The point at which to evaluate the gradient.

    Returns:
        float: The gradient dx = 2 * x.
    """
    dx = 2 * x
    return dx


x = 1.8
dx = compute_gradient(x)
print(f"Gradient at x = 1.8: dx = {dx}")
```

```
Gradient at x = 1.8: dx = 3.6
```
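
As a quick sanity check on the power-rule result, the analytic gradient can be compared against a numerical estimate. The sketch below uses a central finite difference; the helper name `numerical_gradient` and the step size `h` are illustrative choices and are not part of the original notebook.

```python
def f(x: float) -> float:
    """The example function f(x) = x**2."""
    return x**2


def numerical_gradient(x: float, h: float = 1e-6) -> float:
    """Estimate f'(x) with a central finite difference (hypothetical helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.8
analytic = 2 * x  # power rule: f'(x) = 2x
numeric = numerical_gradient(x)
print(f"analytic: {analytic:.4f}, numeric: {numeric:.4f}")
```

The two values should agree to several decimal places, which is a cheap way to catch a mistake in a hand-derived derivative.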


##### 2.2. Descending the gradient
Once we find the gradient of the starting point, we want to update our input variable in such a way that will move it *down* this gradient. Again, we want to move *down* the gradient so that the output of our function will be minimized.

To do this, we can simply subtract the gradient from the input variable. However, for this function, subtracting the entire gradient ($2x$) from $x$ turns $x$ into $-x$, so the point would bounce back and forth between *1.8* and *-1.8* forever and never approach *0*.
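
To see this bouncing concretely, here is a minimal sketch that subtracts the full, unscaled gradient of $f(x) = x^2$ for a few steps. The variable name `history` is just an illustrative choice.

```python
x = 1.8
history = [x]
for _ in range(4):
    dx = 2 * x  # gradient of f(x) = x**2
    x = x - dx  # subtract the full, unscaled gradient
    history.append(x)

print(history)  # [1.8, -1.8, 1.8, -1.8, 1.8]
```

Each update replaces $x$ with $x - 2x = -x$, so the point only flips sign and never gets closer to the minimum.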

Instead, we can define a *learning_rate = 0.1*. We'll use this learning rate to scale the gradient prior to subtracting it from our input variable. Large learning rates produce large jumps along the function, and small learning rates lead to small steps along the function.

Lastly, we'll eventually have to stop the gradient descent, otherwise it would continue endlessly as it approaches 0. For this example, we'll simply stop the descent once the magnitude of the gradient, $|dx|$, falls below $0.01$.


```python
def descend_gradient(x: float, learning_rate: float = 0.1) -> float:
    """Take one step down the gradient of f(x) = x**2 from the point x.

    Args:
        x (float): The current value of the input variable.
        learning_rate (float): The rate by which the input variable is updated.
            Defaults to 0.1.

    Returns:
        float: The updated value of x.
    """
    dx = compute_gradient(x)
    x -= dx * learning_rate  # step the input variable 'down' the gradient
    return x


x = 1.8
iterations = 0
tolerance = 0.01
# Compare the gradient's magnitude so the loop also works for negative x
while abs(compute_gradient(x)) > tolerance:
    x = descend_gradient(x)
    iterations += 1

print(f"Function minimum found in {iterations} iterations. X = {x:0.2f}")
```

```
Function minimum found in 27 iterations. X = 0.00
```
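
The run above used a learning rate of 0.1. To get a feel for how that choice affects the descent, the sketch below counts the steps needed for a few other learning rates on the same function. The helper `count_iterations`, the particular rates, and the `max_iterations` safety cap are assumptions made for illustration.

```python
def count_iterations(
    learning_rate: float,
    x: float = 1.8,
    tolerance: float = 0.01,
    max_iterations: int = 10_000,
) -> int:
    """Count descent steps on f(x) = x**2 until |f'(x)| <= tolerance."""
    iterations = 0
    while abs(2 * x) > tolerance and iterations < max_iterations:
        x -= 2 * x * learning_rate  # scaled step down the gradient
        iterations += 1
    return iterations


results = {lr: count_iterations(lr) for lr in (0.01, 0.1, 0.5, 0.9)}
for lr, n in results.items():
    print(f"learning_rate = {lr}: {n} iterations")
```

For this particular function, a learning rate of 0.5 lands exactly on the minimum in one step, because $x - 0.5 \cdot 2x = 0$. That is a quirk of $f(x) = x^2$ rather than a general rule. Very small rates need many more steps, and rates of 1.0 or larger never converge here.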


As seen in the video above, our starting value of $x = 1.8$ was automatically updated to $x \approx 0.0$ through the iterative process of gradient descent.

### 3. Descending the gradient of multivariate functions
Hopefully this univariate example provided some foundational insight into what gradient descent actually does. Let's look at it in the context of some multivariate functions now. 

Let's first visualize gradient descent of [Himmelblau's function][himmelblau].
$$ f_{Himmelblau}(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$$

There will be a few key differences for the descent of multivariate functions.

First, we will need to compute *partial* derivatives in order to update each variable. In the Himmelblau function above, the gradient with respect to $x$ depends on $y$ (the variables are summed inside squared terms, so differentiating requires the chain rule). This means that the partial derivative with respect to $x$ will contain $y$, and vice versa.
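
For reference, applying the chain rule to each squared term gives the two partial derivatives below. This is a worked derivation added for illustration and is not taken from the original notebook.

$$ \frac{\partial f}{\partial x} = 4x(x^2 + y - 11) + 2(x + y^2 - 7) $$

$$ \frac{\partial f}{\partial y} = 2(x^2 + y - 11) + 4y(x + y^2 - 7) $$

Each expression contains both variables, which is why neither variable can be updated without knowing the current value of the other.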

Second, you may have noticed that our first simple function had only one minimum. In reality, there will be many unknown local minima in our models. This means that the starting point of our variables and the behavior of our descent function will determine which minimum the variables end in.

To visualize the descent of this landscape, we're going to initialize our starting point as $x = 0.0$ and $y = -0.5$. We can then watch the descent of each variable in its own dimension, sliced by the position of the opposite variable. These 2D slices are overlaid above an $x, y$ plot of the function as the point descends to a minimum.
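
For readers who want numbers alongside the animation, here is a minimal sketch of the same descent in plain Python. It starts from $x = 0.0$, $y = -0.5$ and uses the partial derivatives of Himmelblau's function obtained with the chain rule. The learning rate of 0.005 and the fixed budget of 2000 steps are assumptions chosen for this sketch and are not necessarily the settings used to render the visualization.

```python
from typing import Tuple


def himmelblau(x: float, y: float) -> float:
    """Himmelblau's function."""
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def himmelblau_gradient(x: float, y: float) -> Tuple[float, float]:
    """Partial derivatives of Himmelblau's function with respect to x and y."""
    a = x**2 + y - 11
    b = x + y**2 - 7
    dx = 4 * x * a + 2 * b
    dy = 2 * a + 4 * y * b
    return dx, dy


x, y = 0.0, -0.5
learning_rate = 0.005  # assumed value, small enough to avoid overshooting
for _ in range(2000):  # assumed fixed step budget
    dx, dy = himmelblau_gradient(x, y)
    x -= learning_rate * dx
    y -= learning_rate * dy

print(f"x = {x:.3f}, y = {y:.3f}, f(x, y) = {himmelblau(x, y):.6f}")
```

Both variables are updated in every step, and each update uses the current value of the other variable through the shared terms `a` and `b`.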

https://gfycat.com/ifr/MellowLegitimateEasteuropeanshepherd

[himmelblau]:https://en.wikipedia.org/wiki/Himmelblau%27s_function



Now let's visualize the descent of the same point in 3D using my [grad-descent-visualizer][gdv] package created with the help of [PyVista][pyvista].

[gdv]: https://github.com/JacobBumgarner/grad-descent-visualizer
[pyvista]: https://github.com/pyvista/pyvista

<p align="center">
<video controls src="media/himmelblau_path6x.mov" width="60%">
</p>

From this visualization, we can see that the Himmelblau function has four global minima, and that the minimum a point lands in depends on its starting position.
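
The dependence on the starting position can also be checked numerically. The sketch below runs the same plain gradient descent from four hypothetical starting points, one in each quadrant. The starting points, the learning rate, and the step count are all assumptions made for illustration.

```python
from typing import Tuple


def himmelblau(x: float, y: float) -> float:
    """Himmelblau's function."""
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def himmelblau_gradient(x: float, y: float) -> Tuple[float, float]:
    """Partial derivatives of Himmelblau's function with respect to x and y."""
    a = x**2 + y - 11
    b = x + y**2 - 7
    return 4 * x * a + 2 * b, 2 * a + 4 * y * b


def descend(
    x: float, y: float, learning_rate: float = 0.005, steps: int = 2000
) -> Tuple[float, float]:
    """Run a fixed number of gradient descent steps from (x, y)."""
    for _ in range(steps):
        dx, dy = himmelblau_gradient(x, y)
        x -= learning_rate * dx
        y -= learning_rate * dy
    return x, y


# Hypothetical starting points, one per quadrant
starts = [(2.0, 2.0), (-2.0, 3.0), (-3.0, -3.0), (3.0, -1.0)]
endpoints = [descend(x0, y0) for x0, y0 in starts]
for start, (x_end, y_end) in zip(starts, endpoints):
    print(f"start {start} -> end ({x_end:.3f}, {y_end:.3f})")
```

With these settings, each start should settle into a different one of the four minima, even though the function and the update rule are identical in every run.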

Now let's visualize the descent of some more functions! We'll place a grid of points across each of these functions and watch how the points move as they descend whatever gradient they are sitting on.

The [Sphere Function][sphere].

<p align="center">
<video controls src="media/sphere.mov" width="60%">
</p>

The [Griewank Function][griewank].
<p align="center">
<video controls src="media/griewank_functionx.mov" width="60%">
</p>

The [Six-Hump Camel Function][six-hump-camel]. Notice the many local minima of the function.
<p align="center">
<video controls src="media/six_camel_path2x.mov" width="60%">
</p>

And lastly, the [Easom Function][easom]. Notice how many points sit still because they are initialized on a flat region where the gradient is essentially zero.
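
To put a number on "flat", the sketch below evaluates the gradient of the Easom function, $f(x, y) = -\cos(x)\cos(y)\exp(-((x - \pi)^2 + (y - \pi)^2))$, at one point far from the minimum at $(\pi, \pi)$ and one point near it. The two sample points are arbitrary choices for illustration, and the partial derivatives were worked out by hand for this sketch.

```python
import math
from typing import Tuple


def easom(x: float, y: float) -> float:
    """The Easom function, with its minimum of -1 at (pi, pi)."""
    envelope = math.exp(-((x - math.pi) ** 2 + (y - math.pi) ** 2))
    return -math.cos(x) * math.cos(y) * envelope


def easom_gradient(x: float, y: float) -> Tuple[float, float]:
    """Partial derivatives of the Easom function (product and chain rules)."""
    envelope = math.exp(-((x - math.pi) ** 2 + (y - math.pi) ** 2))
    dx = envelope * math.cos(y) * (math.sin(x) + 2 * (x - math.pi) * math.cos(x))
    dy = envelope * math.cos(x) * (math.sin(y) + 2 * (y - math.pi) * math.cos(y))
    return dx, dy


far = easom_gradient(-5.0, -5.0)  # far from the minimum
near = easom_gradient(2.5, 2.5)   # close to the minimum
print(f"gradient magnitude far away: {math.hypot(*far):.3e}")
print(f"gradient magnitude nearby:   {math.hypot(*near):.3e}")
```

Far from $(\pi, \pi)$, the exponential term makes the gradient vanishingly small, so a step scaled by any reasonable learning rate barely moves the point.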
<p align="center">
<video controls src="media/easomx.mov" width="60%">
</p>




[sphere]: https://www.sfu.ca/~ssurjano/spheref.html
[griewank]: https://www.sfu.ca/~ssurjano/griewank.html
[six-hump-camel]: https://www.sfu.ca/~ssurjano/camel6.html
[easom]: https://www.sfu.ca/~ssurjano/easom.html


### 4. Conclusion: Contextualizing Gradient Descent
So far we've worked through gradient descent with a univariate function and have visualized the descent of various multivariate functions. In reality, modern deep learning models have ***vastly*** more parameters than what we've worked through here. For example, BLOOM, a large language model from the Hugging Face-led BigScience project, has *176 billion* parameters.

This number of parameters is definitely intimidating, and the functions are certainly more complex than what we've examined in this post.

It's important to realize that the *foundations* of what we've learned still apply. During each iteration of training, the gradient of the cost function with respect to each parameter is calculated. This gradient is then scaled by the learning rate and subtracted from the parameter so that the parameter 'steps down' its gradient, pushing the model toward a minimal output from its cost function.
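
As a rough illustration of how the same update scales to many parameters, the sketch below applies it to 1,000 parameters at once. It uses the sphere function, $f(w) = \sum_i w_i^2$, as a stand-in cost. A real model would obtain its gradients from its own cost function (typically via backpropagation), so the names `sphere_gradient` and `update`, the parameter count, and the learning rate are assumptions made for this toy example.

```python
import random
from typing import List


def sphere_gradient(params: List[float]) -> List[float]:
    """Gradient of the stand-in cost f(w) = sum(w_i**2): one entry per parameter."""
    return [2 * p for p in params]


def update(
    params: List[float], grads: List[float], learning_rate: float = 0.1
) -> List[float]:
    """Step every parameter down its own gradient, scaled by the learning rate."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


random.seed(0)
params = [random.uniform(-2, 2) for _ in range(1000)]  # random initialization
for _ in range(100):  # training iterations
    grads = sphere_gradient(params)
    params = update(params, grads)

cost = sum(p**2 for p in params)
print(f"cost after training: {cost:.3e}")
```

The update rule is the same one used for the single variable $x$ earlier; only the number of parameters has changed.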

Thanks for reading!

### 5. Resources
- [3Blue1Brown](https://www.youtube.com/c/3blue1brown)  
    - [Gradient Descent](https://www.youtube.com/watch?v=IHZwWFHWa-w)
    - [Derivatives](https://www.youtube.com/watch?v=9vKqVkMQHKk&t=10s)
- [Simon Fraser University: Test Functions for Optimization](https://www.sfu.ca/~ssurjano/optimization.html)
- [PyVista](https://docs.pyvista.org)
- [Michael Nielsen's Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap1.html)