# Gradient Descent

Gradient descent is an iterative optimization algorithm for finding a local minimum of a differentiable function. In machine learning and deep learning it is commonly used to minimize a loss function by repeatedly updating the model's parameters.

## History

Gradient descent was first proposed by Cauchy in 1847. It has since been widely used in optimization problems, particularly in the field of machine learning, where it has become one of the most popular algorithms for training models.

## Mathematical Equations

Given a differentiable function `f(w)`, where `w` is a vector of parameters, gradient descent searches for a minimum of `f`. The gradient ∇f(w) points in the direction of steepest increase of the function's value, so each step moves in the opposite direction. The update rule for gradient descent is:

w(t+1) = w(t) - η * ∇f(w(t))

where:
- w(t): The parameter vector at iteration t.
- η: The learning rate, which controls the step size in each iteration.
- ∇f(w(t)): The gradient of the function evaluated at w(t).

The learning rate η is a hyperparameter that needs to be tuned. If it is too small, the algorithm will converge slowly, while if it is too large, it may overshoot the minimum and diverge.
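
As a small worked illustration of the update rule and the effect of η, the sketch below runs gradient descent on the one-dimensional function f(w) = w², whose gradient is 2w. The function, the starting point, and the helper name `run_gradient_descent` are assumptions chosen for illustration. For this function each update multiplies w by (1 - 2η), so the iterates shrink toward 0 when 0 < η < 1 and grow without bound when η > 1.

```python
def run_gradient_descent(eta, w0=1.0, n_steps=50):
    """Apply w <- w - eta * grad f(w) for f(w) = w**2 (illustrative example)."""
    w = w0
    for _ in range(n_steps):
        grad = 2.0 * w        # gradient of f(w) = w**2
        w = w - eta * grad    # gradient descent update rule
    return w


w_small = run_gradient_descent(eta=0.001)  # too small: barely moves from w0
w_good = run_gradient_descent(eta=0.1)     # moderate: ends close to the minimum at 0
w_large = run_gradient_descent(eta=1.1)    # too large: magnitude grows every step

print(f"eta=0.001 -> w = {w_small}")
print(f"eta=0.1   -> w = {w_good}")
print(f"eta=1.1   -> w = {w_large}")
```

The thresholds above are specific to f(w) = w². For other functions the usable range of η depends on the curvature of the function.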

## Learning Algorithm

The learning algorithm for gradient descent consists of iteratively updating the model's parameters by following the negative gradient of the loss function. The main steps are:

1. Initialize the model's parameters randomly or using a predefined method.
2. Calculate the gradient of the loss function with respect to each parameter.
3. Update the parameters using the gradient descent update rule.
4. Repeat steps 2 and 3 until a stopping criterion is met (e.g., maximum number of iterations, minimum change in the loss function, or minimum change in the parameters).
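
The four steps above can be sketched as a generic loop. This is a minimal illustration rather than a reference implementation: the function name `gradient_descent`, its parameters, and the quadratic test objective are all assumptions made for this example. The stopping criterion shown is a minimum change in the loss, combined with a cap on the number of iterations.

```python
import numpy as np


def gradient_descent(grad_fn, loss_fn, w_init, learning_rate=0.1,
                     max_iterations=1000, tol=1e-8):
    """Illustrative loop following steps 1 to 4; names are hypothetical."""
    w = np.asarray(w_init, dtype=float)      # step 1: initial parameters
    prev_loss = loss_fn(w)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        grad = grad_fn(w)                    # step 2: gradient at the current w
        w = w - learning_rate * grad         # step 3: update rule
        loss = loss_fn(w)
        if abs(prev_loss - loss) < tol:      # step 4: loss has stopped changing
            break
        prev_loss = loss
    return w, iteration


# Example objective (assumed for illustration): f(w) = ||w - target||^2
target = np.array([3.0, -2.0])


def loss_fn(w):
    return float(np.sum((w - target) ** 2))


def grad_fn(w):
    return 2.0 * (w - target)


w_final, n_iter = gradient_descent(grad_fn, loss_fn, w_init=np.zeros(2))
print(f"w = {w_final}, iterations used = {n_iter}")
```

Passing the gradient and loss as functions keeps the loop independent of any particular model. A check on the change in the parameters could be added in the same place as the loss check.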

## Pros and Cons

**Pros:**
- Simple and easy to understand.
- Can be applied to a wide range of problems.
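- Each iteration needs only first-order gradients, so it scales to problems with many parameters; stochastic and mini-batch variants (Bottou, 2010) also scale to large datasets.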
- Can be adapted to use adaptive learning rates or momentum.

**Cons:**
- Sensitive to the choice of learning rate and other hyperparameters.
- Can get stuck in local minima for non-convex functions.
- May be slow to converge for ill-conditioned problems, where the curvature of the function differs greatly between directions.
- Requires the calculation of gradients, which can be computationally expensive for complex models.
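
To illustrate two of the points above, slow convergence on ill-conditioned problems and the option of adding momentum, the sketch below compares plain gradient descent with a classical (heavy-ball) momentum variant on a quadratic whose curvature is 100 times larger in one direction than the other. The objective, the hyperparameter values, and the function names are assumptions picked for illustration, not recommended settings.

```python
import numpy as np

# Illustrative ill-conditioned quadratic: f(w) = 0.5 * (1 * w[0]**2 + 100 * w[1]**2)
curvatures = np.array([1.0, 100.0])


def loss(w):
    return 0.5 * float(np.sum(curvatures * w ** 2))


def grad(w):
    return curvatures * w


def plain_gd(w0, learning_rate, n_steps):
    w = w0.copy()
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w


def momentum_gd(w0, learning_rate, beta, n_steps):
    w = w0.copy()
    velocity = np.zeros_like(w)
    for _ in range(n_steps):
        # Velocity accumulates a decaying sum of past negative gradients
        velocity = beta * velocity - learning_rate * grad(w)
        w = w + velocity
    return w


w0 = np.array([1.0, 1.0])
# The learning rate must stay below 2 / 100 for plain gradient descent to be
# stable in the steep direction, which makes progress slow in the flat one.
w_plain = plain_gd(w0, learning_rate=0.015, n_steps=200)
w_momentum = momentum_gd(w0, learning_rate=0.015, beta=0.9, n_steps=200)

print(f"plain GD loss:    {loss(w_plain):.3e}")
print(f"momentum GD loss: {loss(w_momentum):.3e}")
```

With the same learning rate and number of steps, the momentum variant reaches a lower loss on this particular problem because it keeps moving along the flat direction instead of taking uniformly small steps.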

## Suitable Tasks and Datasets

Gradient descent can be applied to a wide range of optimization tasks in machine learning and deep learning, including:

- Linear regression
- Logistic regression
- Neural networks
- Support vector machines (with the appropriate kernel and loss function)

It is particularly useful for problems with large datasets or many parameters, where methods that need second-order information, such as Newton's method, may be too computationally expensive.
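
As one example from the list above, the following sketch fits a logistic regression model with gradient descent using only NumPy. The synthetic data, the learning rate, the iteration count, and all variable names are assumptions made for illustration. The gradient used is that of the mean log loss (binary cross-entropy).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (assumed for illustration)
n_samples = 200
X = rng.normal(size=(n_samples, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w + 0.5 > 0).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Add a column of ones so the first weight acts as the bias term
Xb = np.hstack((np.ones((n_samples, 1)), X))
weights = np.zeros(Xb.shape[1])
learning_rate = 0.5

for _ in range(500):
    probs = sigmoid(Xb @ weights)
    gradient = Xb.T @ (probs - y) / n_samples  # gradient of the mean log loss
    weights -= learning_rate * gradient        # gradient descent update

predictions = sigmoid(Xb @ weights) > 0.5
accuracy = float(np.mean(predictions == (y == 1.0)))
print(f"Training accuracy: {accuracy:.2f}")
```

Only the gradient expression differs from the linear regression case; the update step itself is unchanged.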

## References

1. Cauchy, A. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus, 25, 536-538.
2. Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (pp. 177-186). Springer.
3. Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.


```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

class GradientDescent:
    def __init__(self, learning_rate=0.01, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations

    def fit(self, X, y):
        n_samples, n_features = X.shape

        # Add a column of ones to X for the bias term
        X = np.hstack((np.ones((n_samples, 1)), X))

        # Initialize weights
        self.weights = np.random.rand(n_features + 1)

        # Perform gradient descent
        for _ in range(self.max_iterations):
            gradient = (2 / n_samples) * X.T.dot(X.dot(self.weights) - y)
            self.weights -= self.learning_rate * gradient

    def predict(self, X):
        n_samples = X.shape[0]
        X = np.hstack((np.ones((n_samples, 1)), X))
        return X.dot(self.weights)

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model using gradient descent
model = GradientDescent(learning_rate=0.1, max_iterations=1000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.legend()
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Gradient Descent')
plt.show()
```


The code above defines a simple `GradientDescent` class that fits a linear regression model by minimizing the mean squared error on the training data. The learning rate and maximum number of iterations are hyperparameters that can be adjusted; this example uses a learning rate of 0.1 and 1000 iterations. The model is evaluated on a test set using mean squared error, and the actual and predicted values are shown in a scatter plot.

The learning rate is an important hyperparameter that affects the convergence of gradient descent. If it is too small, the algorithm converges slowly and may not reach the minimum within the allowed number of iterations. If it is too large, the updates may overshoot the minimum and diverge. The mean squared error loss used here is convex, so local minima are not a concern in this example. Experimenting with several learning rates helps find a value that works well for a given problem.