<a href="https://colab.research.google.com/github/Samarth745/ML-algo-from-scratch/blob/main/Gradient_Descent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mathematical Derivation of Gradient Descent

### Objective

Given a function $ f(\mathbf{w}) $, where $ \mathbf{w} $ represents the parameters (or weights) of the model, the goal of gradient descent is to minimize this function with respect to $ \mathbf{w} $. The objective function could be, for instance, a loss function such as the Mean Squared Error (MSE) in linear regression.


### Gradient
The gradient of the function $ f(\mathbf{w}) $ is a vector of partial derivatives with respect to each parameter

$ w_i $. It tells us the direction of the steepest ascent of the function.

$
\nabla f(\mathbf{w}) = \left[ \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \dots, \frac{\partial f}{\partial w_n} \right]
$

Since we want to minimize $ f(\mathbf{w}) $, we take steps in the opposite direction of the gradient.

### Gradient Descent Update Rule

At each iteration $ t $, we update the weights $ \mathbf{w} $ using the following rule:

\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \nabla f(\mathbf{w}^{(t)})
\]

Where:
- $ \alpha $ is the learning rate, which controls the step size.
- $ \nabla f(\mathbf{w}^{(t)}) $ is the gradient of the function evaluated at the current parameter values.

### Intuition

- The term $ \nabla f(\mathbf{w}^{(t)}) $ gives the direction and magnitude of the steepest ascent of the function at $ \mathbf{w}^{(t)} $.
- The negative sign ensures that we move in the direction of steepest descent (i.e., toward minimizing $ f $).
- The learning rate $ \alpha $ controls how large the update step is. A small $ \alpha $ leads to small updates (slow convergence), and a large $ \alpha $ may lead to overshooting the minimum.

### Example: Gradient Descent for Mean Squared Error

Suppose $ f(\mathbf{w}) $ is the Mean Squared Error (MSE) for a linear regression model, defined as:

\[
f(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - \mathbf{x}_i^\top \mathbf{w} \right)^2
\]

where:
- $ m $ is the number of training samples.
- $ y_i $ is the true value for sample $ i $.
- $ \mathbf{x}_i $ is the feature vector for sample $ i $.

The gradient of the MSE with respect to the weights $ \mathbf{w} $ is:

\[
\nabla f(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \mathbf{x}_i^\top \mathbf{w} \right) \mathbf{x}_i
\]

Thus, the update rule for the weights becomes:

\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \frac{\alpha}{m} \sum_{i=1}^{m} \left( y_i - \mathbf{x}_i^\top \mathbf{w}^{(t)} \right) \mathbf{x}_i
\]

### Convergence

By repeating the update rule iteratively, the weights $ \mathbf{w} $ converge to the values that minimize the objective function $ f(\mathbf{w}) $, assuming an appropriate choice of the learning rate $ \alpha $.

### Stopping Criteria

The algorithm stops when:
- The change in $ f(\mathbf{w}) $ between iterations is smaller than a predefined threshold.
- The maximum number of iterations is reached.



In [217]:
# data manipulation
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt

# dataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [218]:
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=7)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [225]:
class Batch_Gradient_Descent:
  def __init__(self, epochs):
    self.epochs = epochs

  def fit(self, X, y):
    num_of_rows, num_of_columns = X.shape ## Getting total rows and columns
    constant = np.random.rand()
    weights = np.random.rand(num_of_columns)
    alpha = 0.05
    for i in range(self.epochs):
      y_pred = np.dot(X, weights) + constant ## Calculate the prediction with current weights
      residual = y - y_pred ## Calculate the residual

      ## calculating the gradient
      weight_gradient = (-1/num_of_rows)*np.dot(X.T, residual) ## Derivating wrt weight vectors
      constant_gradient = -residual.mean() ## Derivating with respect to constant_vector

      ## Update the weights
      weights = weights - alpha*weight_gradient
      constant = constant - alpha*constant_gradient
    self.weights = weights
    self.constant = constant

  def predict(self, X):
    return np.dot(X, self.weights) + self.constant

In [220]:
class Stochastic_Gradient_Descent:
    def __init__(self, epochs):
        self.epochs = epochs

    def fit(self, X_, y_):
        num_of_rows, num_of_columns = X_.shape  # Getting total rows and columns
        constant = np.random.rand()
        weights = np.random.rand(num_of_columns)
        alpha = 0.09
        thetas = []

        for i in range(self.epochs):
            index = np.random.randint(num_of_rows)  # Sample a random row, not column
            X = X_[index]
            y = y_[index]
            y_pred = np.dot(X, weights) + constant  # Calculate the prediction with current weights
            residual = y - y_pred  # Calculate the residual

            # Calculating the gradient
            weight_gradient = -residual * X  # Derivative wrt weight vectors
            constant_gradient = -residual  # Derivative with respect to constant

            # Update the weights
            weights = weights - alpha * weight_gradient
            constant = constant - alpha * constant_gradient

            # Optionally store the weight values
            thetas.append(weights.copy())

        self.weights = weights
        self.constant = constant

    def predict(self, X):
        return np.dot(X, self.weights) + self.constant

In [221]:
class MiniBatch_Greadient_Descent:
  def __init__(self, epochs, batch_size):
    self.epochs = epochs
    self.batch_size = batch_size

  def fit(self, X_, y_):
    num_of_rows, num_of_columns = X_.shape ## Getting total rows and columns
    constant = np.random.rand()
    weights = np.random.rand(num_of_columns)
    alpha = 0.05
    for i in range(0,self.epochs):
      random_indices = np.random.choice(num_of_rows,
                                        size=self.batch_size,
                                        replace=False)
      X = X_[random_indices]
      y=y_[random_indices]
      y_pred = np.dot(X, weights) + constant ## Calculate the prediction with current weights
      residual = y - y_pred ## Calculate the residual

      ## valculating the gradient
      weight_gradient = (-1/num_of_rows)*np.dot(X.T, residual) ## Derivating wrt weight vectors
      constant_gradient = -residual.mean() ## Derivating with respect to constant_vector

      ## Update the weights
      weights = weights - alpha*weight_gradient
      constant = constant - alpha*constant_gradient
    self.weights = weights
    self.constant = constant

  def predict(self, X):
    return np.dot(X, self.weights) + self.constant

In [226]:
bgd = Batch_Gradient_Descent(epochs=100)
bgd.fit(X_train, y_train.values)
bgd.predict(X_test)

array([1.47266656, 2.22112071, 2.35372253, ..., 1.60359203, 2.07906991,
       4.13403838])

In [223]:
sgd = Stochastic_Gradient_Descent(epochs=100)
sgd.fit(X_train, y_train.values)
sgd.predict(X_test)

array([1.10022346, 2.50300633, 2.93163382, ..., 2.07060425, 2.56679668,
       4.58261316])

In [224]:
mgd = MiniBatch_Greadient_Descent(epochs=100, batch_size=20)
mgd.fit(X_train, y_train.values)
mgd.predict(X_test)

array([1.85326551, 1.82140566, 3.21939156, ..., 3.19734511, 3.2503211 ,
       3.48901472])

In [229]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)

array([1.65084222, 2.47797721, 2.42550694, ..., 1.86032202, 2.39125575,
       3.84598092])

As we can see the values from gradient descent are very similar to Original Values from multiple Linear regression