# Gradient Descent Method



You know that Ordinary Least Squares (OLS) tries to **find the optimal parameters that minimize the cost function**. You might have realized that OLS gives you the normal equation, a closed-form solution for the optimal parameter values. If you plug the values of $X$ and $y$ into the normal equation and perform a finite number of operations, you get the optimal parameters **in a single iteration**. This makes it a non-iterative approach to optimization. However, you have also learned that this non-iterative method cannot be used when
  * $X^TX$ is not invertible.
  * the number of features is larger than the number of samples.
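
The second case can be checked on a tiny example. The sketch below assumes a made-up design matrix with one sample and two features; the helper names `gram_matrix` and `det2` are hypothetical and exist only for this illustration. It shows that $X^TX$ has a zero determinant, so it cannot be inverted.

```python
def gram_matrix(X):
    """Return X^T X for a matrix given as a list of rows."""
    n_features = len(X[0])
    return [
        [sum(row[i] * row[j] for row in X) for j in range(n_features)]
        for i in range(n_features)
    ]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Assumed toy data: 1 sample, 2 features (more features than samples).
X = [[1.0, 2.0]]
G = gram_matrix(X)

print("X^T X =", G)
print("determinant =", det2(G))  # a zero determinant means no inverse exists
```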

Moreover, computing the inverse $(X^TX)^{-1}$ is computationally expensive. With standard algorithms, the cost of this operation grows roughly cubically with the number of features. So for a large dataset with a large number of features, OLS becomes computationally expensive and can be infeasible.

In this section you will learn about **Gradient Descent**, an iterative approach to find the optimal parameters.

## Iterative Approach for Optimization

Before diving deep into Gradient Descent, let's understand what an iterative approach to optimization actually means. Unlike OLS, an iterative approach finds the optimal parameters over a number of iterations. It starts from a random initial guess and produces a sequence of approximations to the optimal parameters. Each approximation in the sequence is computed by applying an *update rule* to the approximation from the previous step. Gradient descent is a first-order iterative optimization approach *(first-order because it uses only the first derivative to find the optimal parameters)*. It is one of the most popular algorithms in machine learning and deep learning.

## Gradient Descent

Let's start with an example. Suppose there is a valley between two tall hills and you are stuck at the top of one of them. You are trying to get to the valley at the base of the hill, but it's all foggy and you can't see a thing. What would you do?

Well, one way is to begin by feeling the ground around you and taking steps down in the steepest direction. You take large steps when the slope is steep and small steps when the slope is gradual. If you continue this long enough, you will reach a point where it is no longer possible to take a step downwards. Using this approach you are likely to reach the base of the hill.





In this analogy, the hills represent the function you are trying to minimize. The valley at the base of the hill represents the minimum point of the function. To minimize the function, we need to take steps proportional to the slope in the steepest downhill direction. For this, the gradient of the function can be used. The gradient of a function at a point points in the direction of steepest ascent at that point. If we take the negative of the gradient, it points in the direction of steepest descent. The magnitude of the gradient at a point gives the slope of the function in that steepest direction. So the gradient descent algorithm takes steps proportional to the magnitude of the gradient, in the direction opposite to the gradient, to find the minimum point of the function.
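
The claim that the negative gradient is the steepest downhill direction can be checked numerically. The sketch below is only an illustration: the bowl-shaped function `surface`, its gradient `surface_gradient`, and the chosen point are all assumptions made for this example. It takes a tiny step in many directions around a point and compares the best one with the direction of the negative gradient.

```python
import math

def surface(x, y):
    # Assumed bowl-shaped function, used only for this illustration.
    return x**2 + 2 * y**2

def surface_gradient(x, y):
    # Analytic gradient of surface: (d/dx, d/dy).
    return (2 * x, 4 * y)

point = (1.0, 1.0)
gx, gy = surface_gradient(*point)
step = 1e-3  # small step so the local slope dominates

# Take a small step in unit directions all around the point
# and record how much the function value changes.
changes = {}
for degrees in range(0, 360, 5):
    angle = math.radians(degrees)
    new_x = point[0] + step * math.cos(angle)
    new_y = point[1] + step * math.sin(angle)
    changes[degrees] = surface(new_x, new_y) - surface(*point)

best = min(changes, key=changes.get)  # direction with the largest decrease

# Direction of the negative gradient, as an angle in [0, 360).
neg_grad_angle = math.degrees(math.atan2(-gy, -gx)) % 360

print("steepest sampled direction (deg):", best)
print("negative gradient direction (deg):", round(neg_grad_angle, 1))
```

The two angles agree up to the 5 degree sampling resolution used in the loop.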

This is probably the most popular analogy for understanding the concept of gradient descent. Now let's look at a more formal explanation of gradient descent and the steps involved in it, written as an algorithm.

### Gradient Descent Algorithm


Let's start with how gradient descent minimizes a function in general. Then you can move into the gradient descent for the cost function of linear regression. Suppose you have a multi-variable real-valued function $f:\mathbb{R}^{n}\rightarrow \mathbb{R}$ for which you want to find the input $\mathbf{x}$ that produces the smallest possible output $f(\mathbf{x})$. The gradient descent algorithm works in the following steps:

Step 1: Initialize the value of ${x}$ randomly

Step 2: Calculate the gradient of $f({x})$ with respect to ${x}$, i.e. $\frac{\partial\ f({x})}{\partial\ {x}}$

Step 3: Update ${x}$ as:

$${x} := {x} - \alpha \frac{\partial\ f({x})}{\partial\ {x}}$$

where $\alpha$ is the **learning rate** and '$:=$' is the assignment operator

Step 4: Repeat steps 2 and 3 until the value of $f({x})$ converges to the minimum value.


As $x$ approaches the minimum point of the function, the gradient approaches zero, so the updates change $x$ less and less. At the minimum point of the function, $\frac{\partial\ f({x})}{\partial\ {x}}\approx 0$, and the solution converges to that point after a certain number of iterations.

The function `gradient_descent` finds the optimal value of $x$ that minimizes the function $f(x)$ using gradient descent. It updates the value of $x$ repeatedly using the function's gradient until the gradient becomes very close to zero or the maximum number of iterations is reached. It finally returns the optimal value of $x$, `x_opt` and the number of iterations required to find that value.

In [1]:
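def gradient_descent(gradient, x_init, alpha=0.01, max_iters=10000, precision=1e-8):
  """Minimize a single-variable function, given a function that returns its gradient."""
  x = x_init
  iteration = 0

  # Stop when the gradient is close to zero or the iteration budget is used up.
  while abs(gradient(x)) > precision and iteration < max_iters:
    x = x - alpha * gradient(x)  # update rule from Step 3
    iteration += 1
  x_opt = x

  return x_opt, iteration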
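
The function above relies on `abs`, so it only handles a single variable. For the general case $f:\mathbb{R}^{n}\rightarrow \mathbb{R}$ described in the steps, one possible sketch is shown below. The name `gradient_descent_nd`, the list-based representation of $x$, and the test function are assumptions made for this illustration, not part of the original material.

```python
import math

def gradient_descent_nd(gradient, x_init, alpha=0.1, max_iters=10000, precision=1e-8):
    """Minimize a function of several variables, given its gradient.

    `gradient` takes a list of floats and returns the list of partial derivatives.
    """
    x = list(x_init)
    for iteration in range(max_iters):
        grad = gradient(x)
        # Stop once the length of the gradient vector is close to zero.
        if math.sqrt(sum(g * g for g in grad)) <= precision:
            return x, iteration
        # Update rule: move every coordinate against its partial derivative.
        x = [xi - alpha * gi for xi, gi in zip(x, grad)]
    return x, max_iters

# Assumed test function: f(x) = (x1 - 1)^2 + (x2 + 2)^2, with its minimum at (1, -2).
def gradient_example(x):
    return [2 * (x[0] - 1), 2 * (x[1] + 2)]

x_min, iters = gradient_descent_nd(gradient_example, [5.0, 5.0])
print("minimum found at:", x_min)
print("no. of steps:", iters)
```

The only real change from the scalar version is the stopping test, which uses the Euclidean norm of the gradient in place of its absolute value.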

### Example:

Let's start with a simple example of finding the minimum value of the function $f(x) = x^2 + 3x -5$.

The gradient of the function $f(x)$ is $\frac{df(x)}{dx} = 2x+3$

Let's create a function for $f(x)$ and its gradient $\frac{df(x)}{dx}$:

In [2]:
def f(x):
  return x**2 + 3 * x - 5

def gradient_f(x):
  return 2*x + 3
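
Before using a hand-derived gradient, it can be worth checking it against a numerical estimate. The sketch below repeats `f` and `gradient_f` so that it runs on its own, and compares the analytic gradient with a central finite difference. The helper `numerical_gradient` and the step size `h` are assumptions made for this check.

```python
def f(x):
    return x**2 + 3 * x - 5

def gradient_f(x):
    return 2 * x + 3

def numerical_gradient(func, x, h=1e-6):
    # Central difference approximation: (f(x + h) - f(x - h)) / (2h)
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (-3.0, -1.5, 0.0, 2.4):
    analytic = gradient_f(x)
    numeric = numerical_gradient(f, x)
    print(f"x = {x:5.2f}  analytic = {analytic:.6f}  numeric = {numeric:.6f}")
```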

Now let's use Gradient Descent to find the optimal value of $x$ that minimizes the function $f(x)$. Let's choose a **random** value, say 2.4, as the initial value of $x$, *i.e.* `x_init = 2.4`. The next thing you need to choose is the learning rate. Here the learning rate stays constant throughout the training process, and it generally takes small values between 0 and 1. Let's set the learning rate to 0.25, *i.e.* `alpha=0.25`, for now. You will learn more about the role of the learning rate in the coming sections.


In [3]:
x_init = 2.4
alpha = 0.25

x_optimal, steps = gradient_descent(gradient_f, x_init, alpha)
print("optimal x:", x_optimal)
print("min f(x):", f(x_optimal))
print("no. of steps:", steps)

optimal x: -1.4999999963678419
min f(x): -7.25
no. of steps: 30
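
To see what happens inside the loop, the sketch below traces the first few updates for the same starting point and learning rate. It repeats `f` and `gradient_f` so that it runs on its own; the `history` list and the choice of five updates are made only for this illustration.

```python
def f(x):
    return x**2 + 3 * x - 5

def gradient_f(x):
    return 2 * x + 3

x = 2.4       # same starting point as in the run above
alpha = 0.25  # same learning rate as in the run above

history = [(x, f(x))]
for _ in range(5):
    x = x - alpha * gradient_f(x)  # one gradient descent update
    history.append((x, f(x)))

for step, (x_k, f_k) in enumerate(history):
    print(f"step {step}: x = {x_k:.4f}, f(x) = {f_k:.4f}")
```

Each update moves $x$ closer to $-1.5$ and lowers $f(x)$, with smaller moves as the gradient shrinks.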


If you visualize how gradient descent works, it looks like the [animation shown](https://drive.google.com/uc?export=view&id=1rmR1S8nIG7cxbOKUxX6GkbYRXIpwuuTT). As you can see, each iteration decreases the value of the function $f(x)$ and brings the value of $x$ closer to the optimal value. Near the minimum point of the function, the gradient becomes close to zero and the value of $x$ doesn't change by much. After some iterations, the value of $x$ finally converges to the optimal value.

<figure align="center">

<img src="https://i.postimg.cc/63cyY9Gb/Gradient-Descent.gif" height="450px">
<figcaption>Figure 1: Gradient Descent</figcaption>
</figure>




*Note: This animation and the ones that follow are only representations. They do not show all the iterations, just some selected ones, to give you an idea of how gradient descent actually works.*

## Role of Learning Rate

The learning rate $\alpha$ is a constant term in the gradient descent algorithm. It determines the size of the steps taken while moving towards the minimum of the function.

### Too Small Learning Rate

If the value of the learning rate is too small, then the algorithm requires more iterations to converge and could get stuck in an undesirable local minimum. Let's change the learning rate for the above example from 0.25 to 0.01.

In [None]:
alpha = 0.01

x_optimal, steps = gradient_descent(gradient_f, x_init, alpha)
print("optimal x:", x_optimal)
print("min f(x):", f(x_optimal))
print("no. of steps:", steps)

optimal x: -1.499999995053417
min f(x): -7.25
no. of steps: 1014


The [animation shown](https://drive.google.com/uc?export=view&id=1qdz8VK74BMYarGRUETfaFFBq7PEhmBRw) represents gradient descent with a learning rate of 0.01.

<figure align="center">
       <img src="https://i.postimg.cc/63cyY9Gb/Gradient-Descent.gif" height="450px">

<figcaption>Figure 2: Gradient Descent with too small learning rate</figcaption>
</figure>



As you can see, the number of steps required increases significantly, from 30 to 1014, when you decrease the learning rate from 0.25 to 0.01.
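
For this particular quadratic, the step counts can be worked out by hand. Substituting the gradient $2x+3$ into the update rule, and writing $x^{(k)}$ for the value of $x$ after $k$ updates (a notation introduced only for this derivation), gives

$$x^{(k+1)} + \tfrac{3}{2} = x^{(k)} - \alpha\left(2x^{(k)} + 3\right) + \tfrac{3}{2} = (1 - 2\alpha)\left(x^{(k)} + \tfrac{3}{2}\right)$$

So each update multiplies the distance to the minimum at $x=-1.5$, and therefore the gradient, by $(1-2\alpha)$. That factor is $0.5$ for $\alpha=0.25$ and $0.98$ for $\alpha=0.01$. Starting from a gradient of $2(2.4)+3=7.8$ and stopping once it is at most $10^{-8}$, the number of updates is the smallest $k$ with

$$7.8\,\lvert 1-2\alpha\rvert^{k} \le 10^{-8}$$

which gives $k=30$ for $\alpha=0.25$ and $k=1014$ for $\alpha=0.01$, matching the outputs above.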

### Too Large Learning Rate

If the value of the learning rate is too large, then the algorithm may overshoot the minimum point of the function. Due to this overshooting, the algorithm may take even more steps to converge or may not converge at all.

Let's set the learning rate to 0.95 in the above example.

In [None]:
alpha = 0.95
x_optimal, steps = gradient_descent(gradient_f, x_init, alpha)
print("optimal x:", x_optimal)
print("min f(x):", f(x_optimal))
print("no. of steps:", steps)

optimal x: -1.5000000046596569
min f(x): -7.25
no. of steps: 195


The [animation shown](https://drive.google.com/uc?export=view&id=1gIYcR1dmvhHdUrmFrS1O3-DUWnBb-UfR) represents gradient descent with a learning rate of 0.95.

<figure align="center">
       <img src="https://i.postimg.cc/g2cdHvKp/Gradient-Descent-High-LR.gif" height="450px">
       <figcaption>Figure 3: Gradient Descent with a too large learning rate</figcaption>
</figure>


As you can see, the algorithm keeps overshooting the minimum point and jumping from one side to the other. It does converge after some iterations here, but this may not always be the case.
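
The sketch below shows two such cases for the same $f(x)$: a learning rate of 1.0, where $x$ bounces between the same two values forever, and 1.1, where every update lands farther from the minimum. The helper `run` is a hypothetical copy of the `gradient_descent` loop with a small iteration cap, and both learning rates are chosen only for this demonstration.

```python
def gradient_f(x):
    return 2 * x + 3

def run(alpha, x_init=2.4, max_iters=100, precision=1e-8):
    # Same loop as gradient_descent, capped at 100 iterations for the demo.
    x, iteration = x_init, 0
    while abs(gradient_f(x)) > precision and iteration < max_iters:
        x = x - alpha * gradient_f(x)
        iteration += 1
    return x, iteration

x_osc, steps_osc = run(alpha=1.0)  # oscillates around the minimum
x_div, steps_div = run(alpha=1.1)  # moves away from the minimum

print("alpha = 1.0 -> x:", x_osc, "steps:", steps_osc)
print("alpha = 1.1 -> x:", x_div, "steps:", steps_div)
```

Both runs use up the full iteration budget without the gradient ever getting close to zero.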

The learning rate is generally constant throughout the training process but there are other techniques called adaptive learning rates that vary the learning rate while training. You can learn about the adaptive learning rates in greater detail in the deep learning courses.  


## Non-convex functions

So far, you have only minimized a convex function using gradient descent. In a convex function, every local minimum is also a global minimum, so there is no other minimum for the algorithm to get stuck in, as shown in Figure 4 below.

<figure align="center">
       <img src="https://i.postimg.cc/fLvQjgpM/convex-function.png" height="350px">
<figcaption>Figure 4: Convex function</figcaption>
</figure>


But what if the function is not convex? In such a case, there can be **multiple minimum points** in the function, as shown in Figure 5 below. When we run gradient descent on such a non-convex function, **it might get stuck at a local minimum and never reach the global minimum**. This is one of the biggest disadvantages of the gradient descent algorithm.

<figure align="center">
<img src="https://i.postimg.cc/mkvrKnLK/non-convex-function.png" height="350px">

<figcaption>Figure 5: Non-convex function</figcaption>
</figure>
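
This behaviour is easy to reproduce. The sketch below uses an assumed non-convex function, $g(x) = x^4 - 3x^2 + x$, which has two minima. The helper `descend` is a hypothetical copy of the `gradient_descent` loop, and the two starting points are chosen only to land in different valleys.

```python
def g(x):
    # Assumed non-convex function with two minima, used only for this example.
    return x**4 - 3 * x**2 + x

def gradient_g(x):
    return 4 * x**3 - 6 * x + 1

def descend(gradient, x_init, alpha=0.01, max_iters=10000, precision=1e-8):
    # Same loop as gradient_descent, repeated so the block runs on its own.
    x, iteration = x_init, 0
    while abs(gradient(x)) > precision and iteration < max_iters:
        x = x - alpha * gradient(x)
        iteration += 1
    return x, iteration

x_right, _ = descend(gradient_g, x_init=2.0)
x_left, _ = descend(gradient_g, x_init=-2.0)

print("start at  2.0 -> x =", round(x_right, 4), " g(x) =", round(g(x_right), 4))
print("start at -2.0 -> x =", round(x_left, 4), " g(x) =", round(g(x_left), 4))
```

The run that starts on the right settles in the shallower valley and never finds the lower minimum on the left, so the result depends on the initial value of $x$.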


