# Mini-Batch Gradient Descent (MBGD)
Mini-Batch Gradient Descent (MBGD) is a variation of the gradient descent optimization algorithm that strikes a balance between the computational efficiency of Stochastic Gradient Descent (SGD) and the stability of Batch Gradient Descent (BGD). It is a popular choice in machine learning and deep learning because it combines the benefits of both methods.

Stochastic gradient descent (SGD) is an algorithm for optimizing an objective function. It is a type of gradient descent, which means that it follows the negative gradient of the objective function toward a minimum. However, SGD does not use the entire dataset to calculate the gradient at each step. In its strict form it uses a single data point; when it uses a small subset of data points, called a mini-batch, it becomes the mini-batch variant described here. Each step is therefore much cheaper than a step of traditional (batch) gradient descent, which can be computationally expensive for large datasets.

Here is an analogy to help you understand SGD. Imagine you are lost in a forest and you want to find the shortest path to the exit. You could stop and survey the whole forest before every step, but that would take a very long time. A quicker way is to follow a trail. The trail leads you in roughly the right direction, even if it is not the shortest path. SGD is like following a trail in the forest: the route to the minimum is not the most direct one, but each step is cheap, so you usually get there sooner.

## Objective:
Like other gradient-based algorithms, MBGD aims to update a model's parameters to minimize a cost function. However, it does so by considering small, random subsets of the training data at each iteration, rather than the entire dataset.
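
As a concrete instance, assume a linear model trained with mean squared error, which is the setup used by the code in this notebook. The symbols $w$, $b$, $\eta$ (learning rate), and $B_t$ (the mini-batch drawn at step $t$) are notation introduced here for illustration only. Under those assumptions, one mini-batch step can be written as:

$$
\begin{aligned}
\hat{y}_i &= w^\top x_i + b \\
J_{B_t}(w, b) &= \frac{1}{|B_t|} \sum_{i \in B_t} \left( y_i - \hat{y}_i \right)^2 \\
w &\leftarrow w - \eta \, \nabla_w J_{B_t}, \qquad \nabla_w J_{B_t} = -\frac{2}{|B_t|} \sum_{i \in B_t} \left( y_i - \hat{y}_i \right) x_i \\
b &\leftarrow b - \eta \, \frac{\partial J_{B_t}}{\partial b}, \qquad \frac{\partial J_{B_t}}{\partial b} = -\frac{2}{|B_t|} \sum_{i \in B_t} \left( y_i - \hat{y}_i \right)
\end{aligned}
$$

Setting $|B_t|$ to the full dataset size recovers BGD, and setting $|B_t| = 1$ recovers strict SGD.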

## Explanation:

1. Data Batches: In MBGD, you divide your entire training dataset into smaller, equally-sized chunks called "mini-batches." These mini-batches typically contain a few dozen to a few hundred data points.

2. Gradient Calculation: At each iteration, MBGD computes the gradient of the cost function using one of these mini-batches. This means it updates the model's parameters based on a subset of the data, not the entire dataset. One full pass over all the mini-batches is called an epoch, so an epoch consists of many iterations.

3. Parameter Update: MBGD then adjusts the model's parameters using the computed gradient. The update is done similarly to BGD, but it's based on the mini-batch gradient rather than the full dataset gradient.
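
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a linear model with mean squared error, and the function name `minibatch_gd`, its default arguments, and the synthetic data are all made up for the example.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=16, lr=0.05, epochs=200, seed=0):
    """Fit y ~ X @ w + b by mini-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Step 1: shuffle once per epoch, then cut the indices into mini-batches
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # Step 2: gradient of the MSE on this mini-batch only
            error = y[idx] - (X[idx] @ w + b)
            grad_w = -2.0 * (X[idx].T @ error) / len(idx)
            grad_b = -2.0 * error.mean()
            # Step 3: move the parameters against the gradient
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data (illustrative only): y = 3*x0 - 2*x1 + 1
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([3.0, -2.0]) + 1.0
w_demo, b_demo = minibatch_gd(X_demo, y_demo)
print(w_demo, b_demo)
```

Shuffling once per epoch and slicing the permutation means every row is used exactly once per pass; the last mini-batch is simply smaller when the dataset size is not a multiple of `batch_size`.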

## Advantages
1. Faster Convergence: MBGD often converges faster than BGD because it updates the parameters more frequently. This speedup can be particularly significant for large datasets.

2. Efficient Memory Usage: Unlike BGD, which requires storing the entire dataset in memory, MBGD only needs to load one mini-batch at a time, making it memory-efficient.

3. Better Generalization: Mini-batches introduce a level of randomness into the optimization process, which can help the model generalize better and avoid getting stuck in local minima.
4. Easy to Implement: The algorithm is relatively simple to implement.
5. Broad Applicability: It can be used to train a wide variety of machine learning models.
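
To make "updates the parameters more frequently" concrete, the small sketch below counts how many parameter updates one epoch contains for each variant. The helper `updates_per_epoch` and the dataset size are hypothetical, chosen only for illustration.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one full pass over the data."""
    return math.ceil(n_samples / batch_size)

n_samples = 10_000  # hypothetical dataset size
for name, batch_size in [("BGD", n_samples), ("MBGD", 100), ("SGD", 1)]:
    print(name, updates_per_epoch(n_samples, batch_size))
# BGD 1
# MBGD 100
# SGD 10000
```

For the same amount of data read, MBGD with a batch size of 100 gets 100 chances to correct the parameters where BGD gets one.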

## Disadvantages

1. Learning Rate Tuning: Choosing an appropriate learning rate for MBGD can be trickier than for BGD. It may require some trial and error to find the right learning rate.

2. Less Stable Convergence: While MBGD is faster, it can have more oscillations in the optimization path compared to BGD due to the randomness introduced by mini-batches.

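3. Less Precise Final Solution: With a constant learning rate, MBGD keeps fluctuating around the minimum instead of settling on it exactly, so the final parameters can be slightly less accurate than those found by BGD.
4. No Guarantee Against Local Minima: The noise helps MBGD escape shallow local minima, but it is not a guarantee; on non-convex problems it can still settle in a poor local minimum.
5. More Hyperparameters: The batch size has to be tuned in addition to the learning rate and the number of epochs.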
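
The oscillation mentioned above comes from the fact that each mini-batch gives only a noisy estimate of the full gradient. The sketch below measures that noise for two batch sizes on synthetic data. The data, the fixed parameter value, and the helper names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only)
X_demo = rng.normal(size=(1000, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=1000)
w = 0.0  # evaluate the gradient at a fixed, arbitrary parameter value

def minibatch_gradient(batch_size):
    """MSE gradient with respect to w, estimated from one random mini-batch."""
    idx = rng.choice(len(X_demo), size=batch_size, replace=False)
    error = y_demo[idx] - w * X_demo[idx, 0]
    return -2.0 * np.mean(error * X_demo[idx, 0])

def gradient_spread(batch_size, trials=2000):
    """Standard deviation of the mini-batch gradient over repeated draws."""
    return np.std([minibatch_gradient(batch_size) for _ in range(trials)])

spread_small = gradient_spread(4)
spread_large = gradient_spread(256)
print(spread_small, spread_large)
```

The smaller batch should show a noticeably larger spread, which is the trade-off behind choosing a batch size: small batches are cheap but noisy, large batches are smoother but cost more per update.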

## Common Applications:

1. Deep Learning: MBGD is widely used for training deep neural networks, where large datasets are common, and computational efficiency is crucial.

2. Natural Language Processing: It's employed in tasks like text classification and language modeling where datasets can be large.

3. Computer Vision: MBGD is applied to tasks such as image classification and object detection.
4. Linear regression
5. Logistic regression
6. Support vector machines
7. Speech recognition

In essence, Mini-Batch Gradient Descent combines the advantages of both Batch and Stochastic Gradient Descent. It's a versatile optimization method used in various machine learning applications, offering faster convergence and efficient memory usage while maintaining good generalization performance.

In [2]:
from sklearn.datasets import load_diabetes

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [3]:
X,y = load_diabetes(return_X_y=True)

In [7]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=2)

In [9]:
reg = LinearRegression()

In [10]:
reg.fit(X_train, y_train)

In [11]:
print(reg.coef_)
print(reg.intercept_)

[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]
151.88331005254167


In [12]:
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.4399338661568968

# Making my own class

In [17]:
import random

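class MBGDRegressor:
    """Linear regression trained with mini-batch gradient descent on squared error."""

    def __init__(self, batch_size, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        # Start from a zero intercept and all-ones coefficients
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        n_batches = X_train.shape[0] // self.batch_size

        for i in range(self.epochs):
            for j in range(n_batches):
                # Draw a fresh random mini-batch (no repeats within the batch).
                # Batches are drawn independently, so one epoch is not an exact
                # pass over the data: some rows repeat and others are skipped.
                idx = random.sample(range(X_train.shape[0]), self.batch_size)

                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                # Intercept gradient: mean over the mini-batch
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # Coefficient gradient: summed (not averaged) over the mini-batch,
                # so the coefficients effectively use a step of lr * batch_size
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_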

In [18]:
# Batch size is about 1/50 of the training rows
mbr = MBGDRegressor(batch_size=X_train.shape[0] // 50, learning_rate=0.01, epochs=100)

In [19]:
mbr.fit(X_train,y_train)

149.05038634411366 [  35.90625821 -135.96077789  442.42252856  294.45932664  -21.73197737
  -88.18605825 -185.84746272  106.59877843  397.96723993  115.92261399]


In [20]:
y_pred = mbr.predict(X_test)

In [21]:
r2_score(y_test,y_pred)

0.45088457983308416
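
One detail of `MBGDRegressor.fit` is worth checking numerically: the intercept gradient is averaged with `np.mean`, while the coefficient gradient is a plain sum over the mini-batch. The sketch below, on a made-up mini-batch, shows that the summed gradient is exactly `batch_size` times the averaged one. All variable names and values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 7
X_batch = rng.normal(size=(batch_size, 3))   # hypothetical mini-batch
y_batch = rng.normal(size=batch_size)
coef = np.ones(3)
intercept = 0.0

residual = y_batch - (X_batch @ coef + intercept)

# Summed gradient, as written in MBGDRegressor.fit
coef_der_sum = -2 * np.dot(residual, X_batch)
# Averaged gradient, consistent with how the intercept gradient uses np.mean
coef_der_mean = -2 * np.dot(residual, X_batch) / batch_size

print(np.allclose(coef_der_sum, batch_size * coef_der_mean))  # True
```

In effect the coefficients are trained with a step size of `learning_rate * batch_size`. Switching to the averaged form would change that effective step size, so the learning rate would likely need retuning; the score reported above was produced with the summed form.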

# Using sklearn
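`SGDRegressor` has no batch-size parameter. A mini-batch style of training can be approximated by calling `partial_fit` repeatedly, each time on a random subset of the training data. Note that `partial_fit` still updates the weights one sample at a time within the subset it receives, so this is not identical to the single per-batch update used in the class above.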

In [22]:
from sklearn.linear_model import SGDRegressor

In [23]:
sgd = SGDRegressor(learning_rate='constant',eta0=0.1)

In [24]:
batch_size = 35

# 100 partial_fit calls, each on a fresh random subset of 35 training rows
for i in range(100):
    
    idx = random.sample(range(X_train.shape[0]),batch_size)
    sgd.partial_fit(X_train[idx],y_train[idx])

In [25]:
sgd.coef_

array([  58.9614017 ,  -70.02031461,  346.88463031,  250.06957973,
          1.13474738,  -40.4056732 , -185.27398386,  123.86660869,
        310.15432892,  120.53006075])

In [26]:
sgd.intercept_

array([161.80410321])

In [27]:
y_pred = sgd.predict(X_test)

In [28]:
r2_score(y_test,y_pred)

0.4201702669239188
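
The loop above draws 100 independent random subsets, so some rows are seen several times and others never. A possible alternative, sketched below on synthetic data, is to shuffle once per epoch and feed consecutive slices to `partial_fit`, so that every row is used exactly once per pass. The data, `eta0=0.01`, the batch size, and the epoch count are assumptions chosen for the example, not tuned values for the diabetes dataset.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = 3*x0 - 2*x1 + 1 plus small noise
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=200)

model = SGDRegressor(learning_rate='constant', eta0=0.01, random_state=0)
batch_size = 20
n_epochs = 30

for epoch in range(n_epochs):
    # Shuffle once per epoch so every row is used exactly once per pass
    order = rng.permutation(len(X_demo))
    for start in range(0, len(X_demo), batch_size):
        idx = order[start:start + batch_size]
        model.partial_fit(X_demo[idx], y_demo[idx])

print(model.coef_, model.intercept_)
print(model.score(X_demo, y_demo))
```

Because `partial_fit` processes the rows of each slice one by one, the batch size here mainly controls how much data is handed over per call, which matters most when the dataset is streamed or does not fit in memory.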