### Batch Gradient Descent (BGD), often referred to simply as Gradient Descent, is an optimization algorithm used in machine learning and deep learning to minimize a model's cost (loss) function during training. At each iteration it computes the gradient of the loss over the entire training dataset and updates the model's parameters in the opposite direction, with the aim of finding the set of parameters that minimizes the error between the model's predictions and the actual target values.


## Advantages
1. Stable Convergence: With a suitably small learning rate, BGD converges to the global minimum when the cost function is convex. For non-convex cost functions it may settle in a local minimum instead.

2. Precise Gradient Estimate: Using the entire dataset provides an accurate estimate of the gradient, reducing the chances of noisy updates.

3. It is a simple and easy to understand algorithm.

4. It can be used to minimize any differentiable cost function, although convergence to the global minimum is only guaranteed when the function is convex.

5. It can be used to train models with a large number of parameters.
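The "precise gradient estimate" point can be illustrated with a small sketch. It uses synthetic data and a hypothetical helper `mse_gradient` (neither is part of the notebook) to compare the full-batch gradient with gradients computed on random mini-batches of 32 samples. The data, the batch size, and the starting point `w` are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic regression data (an assumption, not the diabetes dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

def mse_gradient(X, y, w):
    """Gradient of the mean squared error with respect to w (no intercept)."""
    residual = y - X @ w
    return -2.0 * X.T @ residual / len(y)

w = np.zeros(3)

# Full-batch gradient: uses every sample, so it is the same on every call
full_grad = mse_gradient(X, y, w)

# Mini-batch gradients: each one uses a different random subset of 32 samples
batch_grads = np.array([
    mse_gradient(X[idx], y[idx], w)
    for idx in (rng.choice(len(y), size=32, replace=False) for _ in range(200))
])

print("full-batch gradient:          ", full_grad)
print("mean of mini-batch gradients: ", batch_grads.mean(axis=0))
print("std of mini-batch gradients:  ", batch_grads.std(axis=0))
```

The mini-batch gradients should scatter around the full-batch gradient. That scatter is the noise BGD avoids, at the price of touching every sample for each update.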

## Disadvantages

1. Computationally Intensive: BGD requires processing the entire training dataset at each iteration, making it computationally expensive, especially for large datasets.

2. Memory Usage: It may not be feasible to load the entire dataset into memory, leading to memory constraints for very large datasets.

3. Slower Convergence: Because the parameters are updated only once per pass over the data, BGD often converges more slowly than other variants of gradient descent, especially for large datasets.

4. It can be sensitive to the choice of the learning rate.

### Batch Gradient Descent (BGD) is applied across various machine learning domains, including linear regression, logistic regression, and neural networks. It is a reliable optimization method, particularly when the dataset is of manageable size. Typical applications of BGD:

1. Linear Regression: BGD is frequently employed to train linear regression models, where its stability and well-defined convergence are advantageous.

2. Deep Learning: BGD can be used to optimize neural network parameters when the dataset comfortably fits into memory, although mini-batch variants are more common in practice.

3. Convex Optimization: BGD shines in problems featuring convex cost functions. Its suitability for such scenarios makes it a preferred optimization technique.

To summarize, Batch Gradient Descent is a gradient-based optimization method that iteratively adjusts model parameters using gradients computed from the entire training dataset. Its strength lies in delivering robust convergence, but it comes with computational and speed trade-offs, particularly when dealing with extensive datasets. Its applicability spans linear regression, neural network training, and convex optimization problems.
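For linear regression with a mean squared error loss, the iterative update described above can be written out as follows. The symbols are assumptions introduced here for illustration: $n$ is the number of training samples, $\eta$ the learning rate, $\beta_0$ the intercept, and $\boldsymbol{\beta}$ the coefficient vector.

```latex
\begin{aligned}
\hat{y}_i &= \beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta}, \qquad
L(\beta_0, \boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\frac{\partial L}{\partial \beta_0} &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) \\
\nabla_{\boldsymbol{\beta}} L &= -\frac{2}{n} X^\top (\mathbf{y} - \hat{\mathbf{y}}) \\
\beta_0 &\leftarrow \beta_0 - \eta \, \frac{\partial L}{\partial \beta_0}, \qquad
\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \eta \, \nabla_{\boldsymbol{\beta}} L
\end{aligned}
```

Every sum runs over all $n$ training samples, which is what makes the method "batch": one parameter update costs one full pass over the data.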

Additional considerations include fine-tuning the learning rate, a crucial hyperparameter influencing the algorithm's convergence speed. Smaller learning rates lead to slower but more stable convergence, while larger ones can risk divergence. The choice of the number of epochs, representing passes through the dataset, impacts both model accuracy and training time. Furthermore, BGD can be combined with various optimization techniques like momentum and adaptive learning rates to enhance its convergence and accuracy in practice.
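The effect of the learning rate can be seen on a toy problem. The sketch below minimizes $f(w) = w^2$ (gradient $2w$) with a hypothetical helper `run_gd`. The function, the starting point, and the three learning rates are assumptions picked only to show the three regimes.

```python
def run_gd(lr, steps=50, w0=5.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # gradient descent update
    return w

small = run_gd(lr=0.01)      # stable but slow: still far from 0 after 50 steps
moderate = run_gd(lr=0.1)    # stable and fast: very close to 0
too_large = run_gd(lr=1.1)   # each step overshoots more than the last: diverges

print("lr=0.01 ->", small)
print("lr=0.1  ->", moderate)
print("lr=1.1  ->", too_large)
```

On this function each update multiplies `w` by `1 - 2 * lr`, so the iterates shrink only when that factor has magnitude below 1. Real loss surfaces are not this simple, but the same trade-off between speed and stability applies.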

#### Fetching the reference coef and intercept values for this dataset using the LinearRegression model from sklearn

In [3]:
from sklearn.datasets import load_diabetes

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [4]:
X,y = load_diabetes(return_X_y=True)

In [5]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [7]:
reg = LinearRegression()
reg.fit(X_train, y_train)

In [8]:
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.4399338661568968

In [10]:
print(reg.coef_) # 10 columns 10 coefficients 
print(reg.intercept_) # and one intercept

[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]
151.88331005254167


# Making my own class

In [24]:
class GDRegressor:
    
    def __init__(self,learning_rate=0.01,epochs=100):
        
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        
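    def fit(self,X_train,y_train):
        # start from a zero intercept and all-ones coefficients
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # predictions for the whole training set with the current parameters
            y_hat = np.dot(X_train,self.coef_) + self.intercept_

            # gradient of the MSE loss w.r.t. the intercept, then its update
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)

            # gradient of the MSE loss w.r.t. all coefficients at once (vectorized)
            coef_der = -2 * np.dot((y_train - y_hat),X_train)/X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)

        # show the learned intercept and coefficients
        print(self.intercept_,self.coef_)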
    
    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_

In [30]:
gdr = GDRegressor(epochs=1000, learning_rate=0.5) # making object of the class

In [31]:
gdr.fit(X_train, y_train) # intercept and coef terms

152.01351687661833 [  14.38990585 -173.7235727   491.54898524  323.91524824  -39.32648042
 -116.01061213 -194.04077415  103.38135565  451.63448787   97.57218278]


In [32]:
y_pred = gdr.predict(X_test)

In [33]:
r2_score(y_test, y_pred)

0.4534503034722803
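The R2 score is close to the one from `LinearRegression`, yet several coefficients printed by `GDRegressor` differ noticeably from `reg.coef_`, which may indicate that the run has not fully converged. One way to check is to record the loss at every epoch. The sketch below repeats the same update rule in a hypothetical function `fit_with_history` and runs it on synthetic data. The function name, the data, and the hyperparameters are assumptions, not part of the notebook.

```python
import numpy as np

def fit_with_history(X, y, learning_rate=0.1, epochs=200):
    """Batch gradient descent for linear regression that also records the MSE per epoch."""
    n_samples, n_features = X.shape
    coef = np.ones(n_features)
    intercept = 0.0
    history = []
    for _ in range(epochs):
        residual = y - (X @ coef + intercept)
        history.append(np.mean(residual ** 2))         # MSE before this update
        intercept_der = -2.0 * np.mean(residual)
        coef_der = -2.0 * (residual @ X) / n_samples
        intercept -= learning_rate * intercept_der
        coef -= learning_rate * coef_der
    return coef, intercept, history

# Synthetic data with known parameters (coefficients 3 and -2, intercept 1)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=200)

coef, intercept, history = fit_with_history(X_demo, y_demo)
print("first / last loss:", history[0], history[-1])
print("coef:", coef, "intercept:", intercept)
```

If the recorded loss is still dropping in the last epochs, more epochs or a different learning rate are worth trying. If it has flattened, the parameters are close to what this update rule can reach.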

# Plain ("vanilla") gradient descent is batch gradient descent