## **Imports**

In [None]:
# Importing Modules
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

## **Understanding Batch Gradient Descent**


Batch Gradient Descent is an optimization algorithm used to train machine learning models, especially linear regression. Think of it like finding the lowest point in a valley by taking steps in the steepest downward direction.

### **Core Concept**

*   **Goal:** To find the best values for the model's coefficients ($\theta$) and intercept ($b$) that minimize the difference between the predicted values ($\hat{y}$) and the actual values ($y$). This difference is measured by a cost function, typically the Mean Squared Error (MSE).

### **The Mechanism**

1.  **Initialization:**
    *   Start with initial guesses for the coefficients ($\theta$) and intercept ($b$). These are usually random values or zeros.

2.  **Iterative Optimization (Epochs):**
    *   Repeat steps 3 to 6 for a fixed number of epochs (iterations) or until convergence:

3.  **Calculate Predicted Values:**
    *   For each data point in the training set, calculate the predicted value $\hat{y}$ using the current coefficients and intercept:
        $$\hat{y} = X\theta + b$$
        where $X$ is the training data matrix and $\theta$ are the coefficients.

4.  **Calculate the Cost (MSE):**
    *   Calculate the Mean Squared Error (MSE) as the cost function:
        $$J(\theta, b) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$
        where $m$ is the number of training examples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value for the $i$-th example.

5.  **Calculate the Gradient:**
    *   Calculate the partial derivative of the cost function with respect to the intercept ($b$) and each coefficient ($\theta_j$). This gives us the gradient, which points in the direction of the steepest increase in the cost.
    *   **Gradient with respect to the intercept:**
        $$\frac{\partial J}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)$$
    *   **Gradient with respect to the coefficients:**
        $$\frac{\partial J}{\partial \theta_j} = -\frac{2}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)x_{ij}$$
        where $x_{ij}$ is the value of the $j$-th feature for the $i$-th example.

6.  **Update Coefficients and Intercept:**
    *   Update the coefficients and intercept by subtracting a small portion of the gradient from their current values. The size of this portion is determined by the learning rate ($\alpha$).
    *   **Update intercept:**
        $$b = b - \alpha \frac{\partial J}{\partial b}$$
    *   **Update coefficients:**
        $$\theta_j = \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$
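
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a definitive implementation: the toy data, the true parameters (`true_theta`, `true_b`), the learning rate `alpha = 0.1`, and the epoch count are all assumptions chosen for the example and are not part of the diabetes experiment in this notebook.

```python
import numpy as np

# Hypothetical toy data (assumption): y = 3*x1 - 2*x2 + 5 with no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([3.0, -2.0])
true_b = 5.0
y = X @ true_theta + true_b

m = X.shape[0]                 # number of training examples
theta = np.zeros(X.shape[1])   # Step 1: initialize coefficients to zeros
b = 0.0                        # Step 1: initialize intercept to zero
alpha = 0.1                    # learning rate (assumed value)

for epoch in range(500):       # Step 2: repeat for a fixed number of epochs
    y_hat = X @ theta + b                      # Step 3: predictions
    residual = y - y_hat
    cost = np.mean(residual ** 2)              # Step 4: MSE
    grad_b = -2.0 / m * residual.sum()         # Step 5: dJ/db
    grad_theta = -2.0 / m * (X.T @ residual)   # Step 5: dJ/dtheta_j for every j
    b = b - alpha * grad_b                     # Step 6: update intercept
    theta = theta - alpha * grad_theta         # Step 6: update coefficients

print(theta, b, cost)
```

Both gradients are computed from the same `y_hat` before either parameter is changed, so the intercept and the coefficients are updated simultaneously, as the formulas require.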

### **Key Components**

*   **Batch:** Uses the *entire* training dataset to calculate the gradient in each iteration.
*   **Learning Rate ($\alpha$):** Controls the step size during parameter updates. A smaller learning rate leads to slower but potentially more precise convergence, while a larger learning rate can be faster but risks overshooting or divergence.
*   **Epochs:** The number of full passes through the entire training dataset. In batch gradient descent, each epoch performs exactly one parameter update.
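
The effect of the learning rate can be seen on a deliberately simple problem. The sketch below minimizes the one-dimensional cost $J(\theta) = \theta^2$, whose gradient is $2\theta$; the function `run_gd` and the three learning rates are hypothetical choices made only for this illustration.

```python
def run_gd(alpha, steps=20, theta0=1.0):
    """Run gradient descent on J(theta) = theta**2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # same update rule as for b and theta_j
    return theta

small = run_gd(alpha=0.01)  # small steps: still far from the minimum at 0
good = run_gd(alpha=0.4)    # moderate steps: very close to 0 after 20 updates
large = run_gd(alpha=1.1)   # steps too large: each update overshoots and |theta| grows

print(small, good, large)
```

Each update multiplies $\theta$ by $(1 - 2\alpha)$, so the iterates shrink toward the minimum only when $|1 - 2\alpha| < 1$. For this particular cost that means $0 < \alpha < 1$; the safe range for other cost functions depends on their curvature.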

### **Pros of Batch Gradient Descent:**

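*   Converges stably to the global minimum of the cost function when the cost is convex (as MSE is for linear regression) and the learning rate is small enough.
*   The gradient is exact for the training set because it is computed from every example, so the updates carry no sampling noise.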

### **Cons of Batch Gradient Descent:**

*   Can be computationally expensive and slow for very large datasets because it needs to process all data points in each iteration.
*   Requires significant memory to load the entire dataset.

In summary, Batch Gradient Descent iteratively adjusts the model's parameters by considering the error across the entire dataset in each step, guided by the gradient and the learning rate, to find the best fit for the data.

## **Loading & Splitting the Dataset**

In [None]:
# Loading Datasets
X, y = load_diabetes(return_X_y=True)

In [None]:
# Printing shape of X and y
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [None]:
# Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

## **Linear Regression for Reference**

In [None]:
# Fitting a Linear Regression Model
reg = LinearRegression()
reg.fit(X_train, y_train)

In [None]:
# Printing coefficients and intercept for reference
print(reg.coef_)
print(reg.intercept_)

[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]
151.88331005254167


In [None]:
# Predicting and printing the R2 score
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.4399338661568968
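
For context, scikit-learn's `LinearRegression` fits by ordinary least squares (a direct solve) rather than by gradient descent, which is why it serves as a reference here. The sketch below shows the same idea with `np.linalg.lstsq` on hypothetical toy data; the data and the variable names (`X_aug`, `coef`, `intercept`) are assumptions for illustration, not the library's internal code.

```python
import numpy as np

# Hypothetical toy data (assumption): known coefficients and intercept, no noise
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + 4.0

# Append a column of ones so the intercept is estimated together with the coefficients
X_aug = np.column_stack([X, np.ones(X.shape[0])])

# Solve the least-squares problem min ||X_aug @ w - y||^2 directly, with no iterations
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
coef, intercept = solution[:-1], solution[-1]

print(coef, intercept)
```

A direct solve needs no learning rate or epoch count, so its result is a convenient target against which to judge how far gradient descent has converged.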

## **Batch Gradient Descent**

In [None]:
class GDRegressor:

    def __init__(self, learning_rate=0.01, epochs=100):
        self.coef_ = None  # Coefficients (weights) of the linear model
        self.intercept_ = None  # Intercept (bias) of the linear model
        self.lr = learning_rate  # Learning rate
        self.epochs = epochs  # Number of epochs

    def fit(self, X_train, y_train):
        # Initialize coefficients and intercept
        self.intercept_ = 0  # Initialize intercept to 0
        self.coef_ = np.ones(X_train.shape[1])  # Initialize coefficients to ones

        # Perform gradient descent for the specified number of epochs
        for i in range(self.epochs):
            # Calculate predicted values (y_hat)
            y_hat = np.dot(X_train, self.coef_) + self.intercept_

            # Calculate the derivative of the intercept (intercept_der)
            intercept_der = -2 * np.mean(y_train - y_hat)

            # Update the intercept using the learning rate and derivative
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)

            # Calculate the derivative of the coefficients (coef_der)
            # coef_der = -(2/m) * sum((y_train - y_hat) * X_train), where m is the number of training rows
            coef_der = -2 * np.dot((y_train - y_hat), X_train) / X_train.shape[0]

            # Update the coefficients using the learning rate and derivative
            self.coef_ = self.coef_ - (self.lr * coef_der)

        # Print the final intercept and coefficients (optional, for inspection)
        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        # Calculate and return the predicted values using the learned coefficients and intercept
        return np.dot(X_test, self.coef_) + self.intercept_

In [None]:
# Creating instance of our custom implementation
gdr = GDRegressor(epochs=300, learning_rate=0.5)

In [None]:
# Training the instance
gdr.fit(X_train, y_train)

152.08008430351475 [  59.80588881  -52.38015781  326.53219776  233.40121201   25.8621717
  -14.93092905 -165.87822008  130.79629083  298.79220041  128.87138755]


In [None]:
# Predicting
y_pred = gdr.predict(X_test)

In [None]:
# Printing the R2 score
r2_score(y_test, y_pred)

0.42529917444899457

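1.  Our custom model does not outperform scikit-learn's `LinearRegression` in this notebook: its R2 score of about 0.425 is slightly below the reference score of about 0.440. Note that `LinearRegression` solves ordinary least squares directly rather than by gradient descent, and the coefficients learned above still differ noticeably from the reference coefficients, which suggests that 300 epochs were not enough for full convergence.
2.  Both scores come from a single train/test split, so they may change (and could dip) under cross-validation.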
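
One way to check how much the score depends on the split is k-fold cross-validation. The sketch below is a NumPy-only illustration under stated assumptions: `fit_gd`, `r2`, and `k_fold_r2` are hypothetical helpers written for this example, and the data is synthetic rather than the diabetes dataset, so the printed numbers say nothing about the scores above.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, epochs=500):
    """Batch gradient descent for linear regression (same update rule as GDRegressor)."""
    theta, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        residual = y - (X @ theta + b)
        b = b - lr * (-2.0 * residual.mean())
        theta = theta - lr * (-2.0 * (X.T @ residual) / X.shape[0])
    return theta, b

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def k_fold_r2(X, y, k=5, seed=0):
    """Return one R2 score per fold, training on the other k - 1 folds."""
    indices = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta, b = fit_gd(X[train_idx], y[train_idx])
        scores.append(r2(y[test_idx], X[test_idx] @ theta + b))
    return np.array(scores)

# Synthetic data (assumption): linear signal plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.5, size=200)

scores = k_fold_r2(X, y)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a rough sense of how much a single-split R2 can move up or down purely because of which rows landed in the test set.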