# Gradient descent



## What is Gradient Descent?

Gradient descent is an iterative optimization algorithm for minimizing a differentiable function. Given a function $ f(x) $, it starts from an initial guess $ x_0 $ for the minimum and refines that guess step by step. At each iteration it computes the gradient $ \nabla f(x) $ at the current guess; this vector points in the direction of steepest ascent. To decrease the function, the algorithm steps in the opposite direction, updating the current guess according to the formula:

$$
x_{\text{new}} = x_{\text{old}} - \alpha \nabla f(x_{\text{old}})
$$

Here, $ \alpha $ is the learning rate, a hyperparameter that controls the step size. The process repeats until a stopping condition is met, typically when the change in $ f(x) $ between two iterations falls below a predefined tolerance.
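
As a minimal illustration of the update rule, the sketch below applies it to the one-dimensional function $ f(x) = (x - 3)^2 $, whose gradient $ 2(x - 3) $ is written by hand. The function, the starting point, and the learning rate are assumptions chosen for illustration only.

```python
def f_prime(x):
    """Gradient of the illustrative function f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


x = 0.0      # initial guess x_0 (assumed)
alpha = 0.1  # learning rate (assumed)

for step in range(100):
    # x_new = x_old - alpha * grad f(x_old)
    x = x - alpha * f_prime(x)

print(x)  # close to the minimizer x = 3
```

Each step here shrinks the distance to the minimizer by a constant factor of $ 1 - 2\alpha = 0.8 $, which is why a fixed number of steps is enough in this simple case.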



## Uses of Gradient Descent

1. **Machine Learning and Deep Learning**: Used extensively for optimizing loss functions in various machine learning algorithms, especially in neural networks.
   
2. **Optimization Problems**: In operations research, for solving linear and non-linear optimization problems.
  
3. **Natural Language Processing**: In algorithms for text analysis and sentiment classification.
  
4. **Computer Vision**: For tasks like image recognition and object detection.

5. **Control Systems**: For optimizing control functions.

6. **Finance**: In portfolio optimization and algorithmic trading strategies.

## Advantages of Gradient Descent

1. **Simplicity**: The algorithm is simple to understand and easy to implement.

2. **Efficiency**: Each iteration needs only first-order (gradient) information, so for large-scale problems it can be cheaper per step than second-order methods such as Newton's method, which also require the Hessian.

3. **Flexibility**: Can be adapted to many types of objective functions, including non-differentiable ones, where subgradients take the place of gradients.

4. **Scalability**: Variants like stochastic gradient descent can handle very large datasets and high dimensions.
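
The scalability point can be illustrated with a small mini-batch stochastic gradient descent sketch. It fits a line $ y = wx + b $ to synthetic data, estimating the gradient from a random batch at each step instead of the full dataset. The data, the helper name `sgd_linear_fit`, and all hyperparameters are assumptions made for this example.

```python
import random

random.seed(0)

# Synthetic data for y = 2x + 1 plus small noise (illustrative assumption)
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.01)) for x in xs]


def sgd_linear_fit(data, lr=0.1, batch_size=32, epochs=50):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        random.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Gradient of the batch mean squared error w.r.t. w and b
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


w, b = sgd_linear_fit(data)
print(w, b)  # should land near the true values 2 and 1
```

Every update touches only `batch_size` samples, so the cost of a step does not grow with the size of the dataset.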

## Drawbacks of Gradient Descent

1. **Local Minima**: For non-convex functions, gradient descent can get stuck in local minima and fail to find the global minimum.

2. **Sensitivity to Learning Rate**: The learning rate $ \alpha $ needs to be chosen carefully. A rate that is too large can overshoot the minimum or even diverge, while a rate that is too small makes convergence slow and can trigger a tolerance-based stopping test before the minimum is reached.

3. **Computational Cost**: Calculating the gradient can be computationally expensive for complex functions.

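4. **Numerical Errors**: In very flat regions the gradient can become so small that updates stall, and in very steep regions it can become large enough to cause unstable steps or overflow, so floating-point errors can become significant.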

5. **Initialization Dependent**: The initial point $ x_0 $ can have a significant impact on the convergence of the algorithm, especially for non-convex functions.
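
Two of the drawbacks above, sensitivity to the learning rate and dependence on the initial point, can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions: the helper `descend`, the test functions, and every numeric setting are chosen for illustration.

```python
def descend(grad, x0, lr, steps):
    """Run a fixed number of plain gradient descent steps in one dimension."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


# 1) Learning-rate sensitivity on f(x) = x^2 (gradient 2x, minimum at 0)
def quad_grad(x):
    return 2.0 * x


x_good = descend(quad_grad, 1.0, lr=0.1, steps=50)      # shrinks by 0.8 per step
x_slow = descend(quad_grad, 1.0, lr=0.001, steps=50)    # barely moves
x_diverged = descend(quad_grad, 1.0, lr=1.1, steps=50)  # |x| grows by 1.2 per step


# 2) Initialization dependence on the non-convex f(x) = x^4 - 3x^2 + x
def f(x):
    return x ** 4 - 3.0 * x ** 2 + x


def f_grad(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0


x_from_right = descend(f_grad, 2.0, lr=0.01, steps=500)   # ends in the local minimum
x_from_left = descend(f_grad, -2.0, lr=0.01, steps=500)   # ends in the lower minimum

print(x_good, x_slow, x_diverged)
print(x_from_right, f(x_from_right))
print(x_from_left, f(x_from_left))
```

With the same learning rate and step count, the two starting points of the second experiment settle in different minima, and only the run started at $ x_0 = -2 $ reaches the lower function value.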

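## Stopping Conditions

Common stopping conditions include:

- `maxit`: stop after a predetermined maximum number of iterations.
- `abstol`: stop when the function value gets "close enough" to zero. This is only meaningful when the minimum value is known to be zero, for example for non-negative functions.
- `reltol`: stop when the improvement between successive iterations drops below a threshold.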
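
The three conditions can be combined in a single check, as in the sketch below. The function `should_stop`, its default thresholds, and the relative form of the `reltol` test are assumptions for illustration; the demonstration later in this document uses a plain absolute difference between losses instead.

```python
def should_stop(iteration, old_value, new_value,
                maxit=1000, abstol=1e-12, reltol=1e-8):
    """Return the name of the first stopping condition that fires, or None."""
    if iteration >= maxit:
        return "maxit"
    if abs(new_value) < abstol:
        return "abstol"
    # Improvement measured relative to the size of the current value
    if abs(old_value - new_value) < reltol * (abs(new_value) + reltol):
        return "reltol"
    return None


# Usage on f(x) = x^2 with a hand-written gradient 2x (illustrative)
x, alpha = 1.0, 0.1
value = x * x
iteration = 0
reason = None
while reason is None:
    iteration += 1
    x = x - alpha * 2.0 * x
    new_value = x * x
    reason = should_stop(iteration, value, new_value)
    value = new_value

print(iteration, reason)
```

Keeping `maxit` as the first check guarantees termination even when the iteration diverges and neither tolerance is ever satisfied.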

## Python demonstration

```python
import numpy as np
from autograd import grad

# Objective function f(x, y) = x^2 + 3y^2, minimized at the origin
objective_function = lambda x: (x[0] ** 2) + (3 * (x[1] ** 2))

# Initial point
P0 = np.array([2.0, 1.0])

# Learning rate
lr = 0.2

# Stopping tolerance
tol = 1e-6
```

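```python
def gradient_descent(function, initial_point, learning_rate, tolerance, max_iter=10_000):
    """Minimize `function` by gradient descent, starting from `initial_point`.

    Stops when the loss changes by less than `tolerance` between two
    iterations, or after `max_iter` iterations as a safeguard against
    divergence.
    """
    point = initial_point
    old_loss = function(point)
    gradient_function = grad(function)  # autograd builds the gradient function

    for nb_iter in range(1, max_iter + 1):
        gradient = np.array(gradient_function(point))

        # Update rule: step against the gradient
        point = point - (learning_rate * gradient)

        # Compute new loss
        new_loss = function(point)

        # Check for convergence
        if abs(new_loss - old_loss) < tolerance:
            print("Number of iterations:", nb_iter)
            break

        old_loss = new_loss
    else:
        print("Stopped after max_iter iterations without converging")

    return point
```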


```python
# Run of the algorithm
res = gradient_descent(objective_function, P0, lr, tol)
print("The minimum occurs at:", res)
objective_function(res)  # the notebook displays the value of this last expression
```

```text
Number of iterations: 16
The minimum occurs at: [5.64221981e-04 6.55360000e-12]

3.183464443978558e-07
```
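
The reported count of 16 iterations can be cross-checked by hand. For this objective the gradient is $ (2x, 6y) $, so with $ \alpha = 0.2 $ each step multiplies $ x $ by $ 0.6 $ and $ y $ by $ -0.2 $. The sketch below is an independent check in plain Python, without autograd: it replays those closed-form iterates from the starting point $ (2, 1) $ and applies the same stopping test. The helper name `loss_at` is an assumption of this example.

```python
# Closed-form iterates for f(x, y) = x^2 + 3y^2 with learning rate 0.2:
#   x_k = 2 * 0.6**k,  y_k = 1 * (-0.2)**k
def loss_at(k):
    x_k = 2.0 * 0.6 ** k
    y_k = (-0.2) ** k
    return x_k ** 2 + 3.0 * y_k ** 2


tol = 1e-6
k = 1
# Same test as in gradient_descent: stop once the loss changes by less than tol
while abs(loss_at(k) - loss_at(k - 1)) >= tol:
    k += 1

print(k, loss_at(k))
```

The loss is dominated by the $ x $ term, which decays like $ 4 \cdot 0.36^k $, so the stopping test first succeeds at $ k = 16 $, in agreement with the output above.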

```python
def is_close(a, b, tol=1e-6):
    # Print the absolute difference so it shows up in the test output
    print(np.abs(a - b))
    return np.all(np.abs(a - b) < tol)

# Test 1: Objective function f(x, y) = x^2 + 3y^2
objective_function_2d = lambda x: (x[0] ** 2) + (3 * (x[1] ** 2))
P0_2d = np.array([2.0, 1.0])
lr = 0.2
tol = 1e-6
res_2d = gradient_descent(objective_function_2d, P0_2d, lr, tol)
if is_close(objective_function_2d(res_2d), 0.0):
    print("Test 1 Passed!")
else:
    print("Test 1 Failed!")

# Test 2: Objective function f(x, y, z) = x^2 + y^2 + z^2
objective_function_3d = lambda x: x[0] ** 2 + x[1] ** 2 + x[2] ** 2
P0_3d = np.array([2.0, 1.0, 3.0])
res_3d = gradient_descent(objective_function_3d, P0_3d, lr, tol)
if is_close(objective_function_3d(res_3d), 0.0):
    print("Test 2 Passed!")
else:
    print("Test 2 Failed!")

# Test 3: Objective function f(x, y) = (x - 1)^2 + (y + 2)^2
objective_function_shifted_2d = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
P0_shifted_2d = np.array([2.0, 1.0])
res_shifted_2d = gradient_descent(objective_function_shifted_2d, P0_shifted_2d, lr, tol)
if is_close(objective_function_shifted_2d(res_shifted_2d), 0.0):
    print("Test 3 Passed!")
else:
    print("Test 3 Failed!")
```

```text
Number of iterations: 16
3.183464443978558e-07
Test 1 Passed!
Number of iterations: 17
4.0111651994129803e-07
Test 2 Passed!
Number of iterations: 17
2.8651179995799273e-07
Test 3 Passed!
```
