# Mini-Batch Gradient Descent

## Mini-Batch: Batch Size vs Number of Updates

Let:

- $n$ = total number of samples  
- $B$ = batch size (number of samples per batch)

Then:

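$$
\text{Number of updates per epoch} = \left\lceil \frac{n}{B} \right\rceil
$$

When $B$ divides $n$ evenly this is exactly $n/B$. Otherwise the last batch holds fewer than $B$ samples but still produces an update.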

---

## Important Distinction

- **Batch size (B)** → number of rows in one batch  
- **Number of batches** → how many groups the dataset is divided into  

$$
\text{Number of batches} = \left\lceil \frac{n}{B} \right\rceil
$$

(rounded up, because a final partial batch still counts as a batch).

Each batch produces one parameter update.

---

## Worked Examples

### Case 1: 1000 rows and 100 batches

If there are 100 batches:

$$
\text{Batch size} = \frac{1000}{100} = 10
$$

So:
- Batch size = 10  
- Updates per epoch = 100  

---

### Case 2: 1000 rows and 10 batches

If there are 10 batches:

$$
\text{Batch size} = \frac{1000}{10} = 100
$$

So:
- Batch size = 100  
- Updates per epoch = 10  

---

## Summary Table

| Total Samples ($n$) | Batch Size ($B$) | Updates per Epoch |
|---------------------|------------------|-------------------|
| 1000 | 1000 | 1 (Batch GD) |
| 1000 | 1 | 1000 (SGD) |
| 1000 | 100 | 10 |
| 1000 | 10 | 100 |
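
The table rows can be reproduced with a small helper. This is a minimal sketch: the function name `updates_per_epoch` is made up for illustration, and it rounds up so that a final partial batch is counted when $n$ is not a multiple of $B$.

```python
import math

def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of mini-batches (and parameter updates) in one epoch."""
    # The last batch may be smaller than batch_size, so round up.
    return math.ceil(n_samples / batch_size)

# The four rows of the summary table (n = 1000)
for batch_size in (1000, 1, 100, 10):
    print(batch_size, updates_per_epoch(1000, batch_size))

# A case where n is not a multiple of B: 353 rows with batch size 7
# gives 50 full batches plus one batch of 3 rows.
print(updates_per_epoch(353, 7))
```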

---

## Intuition

- Larger batch size → fewer updates per epoch → each gradient averages over more samples, so the updates are smoother  
- Smaller batch size → more updates per epoch → each gradient is a noisier estimate of the full-data gradient  
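
One way to see the "smooth vs noisy" claim is to measure how much the mini-batch gradient varies from batch to batch. The sketch below uses synthetic one-feature data (an assumption for illustration, not the diabetes dataset) and evaluates the MSE gradient at $w = 0$ for many random batches of each size. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data (illustrative only): y = 3x + noise
n = 1000
X = rng.normal(size=(n, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

w = np.zeros(1)  # evaluate every gradient at the same point, w = 0

def batch_gradient(idx):
    """MSE gradient with respect to w on the rows in idx."""
    errors = y[idx] - X[idx] @ w
    return (-2 / len(idx)) * (X[idx].T @ errors)

def gradient_spread(batch_size, n_draws=500):
    """Standard deviation of the gradient over randomly drawn batches."""
    grads = [
        batch_gradient(rng.choice(n, size=batch_size, replace=False))[0]
        for _ in range(n_draws)
    ]
    return float(np.std(grads))

spread = {B: gradient_spread(B) for B in (1, 10, 100)}
for B, s in spread.items():
    print(f"B={B:>3}  gradient std = {s:.3f}")
```

The spread should shrink as the batch size grows, which is the sense in which larger batches give smoother updates.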


# From Scratch

In [1]:
from sklearn.datasets import load_diabetes

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [2]:
X,y = load_diabetes(return_X_y=True)

In [3]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [4]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [5]:
lr = LinearRegression()
lr.fit(X_train,y_train)

LinearRegression()


In [6]:
print(lr.coef_)
print(lr.intercept_)

[  37.90402135 -241.96436231  542.42875852  347.70384391 -931.48884588
  518.06227698  163.41998299  275.31790158  736.1988589    48.67065743]
151.34560453985995


In [7]:
y_pred = lr.predict(X_test)
r2_score(y_test,y_pred)

0.4526027629719197

In [8]:
X_train.shape

(353, 10)

In [12]:
import random


### 1. Initialization

The constructor initializes:

- `batch_size` → number of samples per mini-batch  
- `learning_rate (η)` → step size  
- `epochs` → number of full passes over dataset  

These are stored as class attributes.

---

### 2. Extract Dataset Dimensions

```python
n_samples, n_features = X_train.shape
```

### 3. Initialize Parameters

```python
self.intercept_ = 0
self.coef_ = np.zeros(n_features)
```

Mathematically:

$$
\mathbf{w} = \mathbf{0}, \quad b = 0
$$

### 4. Loop Over Epochs
```python
for epoch in range(self.epochs):
    ...  # steps 5 to 12 run inside this loop
```

One epoch means one complete pass over the entire dataset.


### 5. Shuffle the Dataset
```python
indices = np.random.permutation(n_samples)
X_shuffled = X_train[indices]
y_shuffled = y_train[indices]
```
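Shuffling at the start of every epoch:
- changes which samples share a batch from one epoch to the next
- removes any bias from the order in which the rows are stored (for example, rows sorted by target)
- generally improves convergence, because updates do not follow the same fixed sequence every epoch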
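
The important detail is that `X` and `y` are indexed with the same permutation, so every row stays paired with its own target. A minimal sketch on a toy array (the data and the seeded `default_rng` generator are assumptions for illustration; the notebook itself uses the unseeded `np.random.permutation`):

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.arange(10).reshape(5, 2)   # rows: [0 1], [2 3], [4 5], ...
y = X[:, 0] * 10                  # each target is derived from its own row

indices = rng.permutation(len(X))
X_shuffled = X[indices]
y_shuffled = y[indices]

# Every row still sits next to its own target after shuffling
assert np.array_equal(y_shuffled, X_shuffled[:, 0] * 10)
```

Shuffling `X` and `y` with two separate permutations would break this pairing and silently corrupt training.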

### 6. Divide Data into Mini-Batches

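```python
for start in range(0, n_samples, self.batch_size):
    end = start + self.batch_size
```

This splits the dataset into chunks of size `batch_size`. The last chunk is smaller when `batch_size` does not divide `n_samples`.

Number of updates per epoch: $\lceil n / B \rceil$, where $n$ is the total number of samples and $B$ is the batch size.
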


### 7. Extract the Current Batch
```python
X_batch = X_shuffled[start:end]
y_batch = y_shuffled[start:end]
```

This selects the current mini-batch $(X_B, y_B)$, where `end = start + self.batch_size`.
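
The slicing works even when `end` runs past the last row, because NumPy clips slices at the end of the array. A small sketch with made-up sizes (10 rows, batch size 4) shows the resulting batch lengths:

```python
import numpy as np

n_samples, batch_size = 10, 4       # illustrative sizes, not from the dataset
X = np.arange(n_samples * 2).reshape(n_samples, 2)

batch_sizes = []
for start in range(0, n_samples, batch_size):
    end = start + batch_size        # may exceed n_samples on the last batch
    X_batch = X[start:end]          # NumPy slicing clips at the array end
    batch_sizes.append(len(X_batch))

print(batch_sizes)  # [4, 4, 2]
```

This is why the gradient code divides by the actual batch length rather than by `batch_size`.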


### 8. Compute Predictions

```python 
y_hat = np.dot(X_batch, self.coef_) + self.intercept_
```


Mathematically:

$$
\hat{y}_i = \mathbf{w}^T \mathbf{x}_i + b
$$

Predictions are computed for all samples in the mini-batch.
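
To connect the vectorized code with the per-sample formula, the sketch below computes the predictions both ways on a tiny made-up batch. The values of `coef` and `intercept` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_batch = rng.normal(size=(4, 3))   # 4 samples, 3 features (made-up data)
coef = np.array([0.5, -1.0, 2.0])   # w
intercept = 0.25                    # b

# Vectorized: one matrix-vector product for the whole batch
y_hat = np.dot(X_batch, coef) + intercept

# Per-sample: w^T x_i + b, one row at a time
y_hat_loop = np.array([coef @ x_i + intercept for x_i in X_batch])

assert np.allclose(y_hat, y_hat_loop)
```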


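### 9. Compute Errors

The error (residual) of each sample in the batch is the true target minus the prediction, computed in the class as `errors = y_batch - y_hat`:

$$
e_i = y_i - \hat{y}_i
$$
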
### 10. Compute Gradient for Intercept

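```python
B = len(X_batch)  # actual batch size (the last batch may be smaller)
intercept_der = (-2 / B) * np.sum(errors)
```

With the mini-batch mean squared error $J_B = \frac{1}{B} \sum_{i \in B} \left( y_i - \hat{y}_i \right)^2$, the derivative with respect to $b$ is:
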
$$
\frac{\partial J_B}{\partial b}
=
-\frac{2}{B}
\sum_{i \in B}
\left( y_i - \hat{y}_i \right)
$$


### 11. Compute Gradient for Weights
```python
coef_der = (-2 / B) * np.dot(X_batch.T, errors)
```

$$
\frac{\partial J_B}{\partial \mathbf{w}}
=
-\frac{2}{B}
\sum_{i \in B}
\mathbf{x}_i \left( y_i - \hat{y}_i \right)
$$
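
A quick way to gain confidence in both gradient formulas is to compare them with central finite differences of the batch loss. The sketch below does this on random data; the batch, the parameter values, and the helper `batch_mse` are all assumptions introduced for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
X_batch = rng.normal(size=(8, 3))   # made-up batch: 8 samples, 3 features
y_batch = rng.normal(size=8)
w = rng.normal(size=3)
b = 0.5

def batch_mse(w, b):
    """J_B: mean squared error on the mini-batch."""
    return np.mean((y_batch - (X_batch @ w + b)) ** 2)

# Analytic gradients, written exactly as in steps 10 and 11
errors = y_batch - (X_batch @ w + b)
B = len(X_batch)
intercept_der = (-2 / B) * np.sum(errors)
coef_der = (-2 / B) * np.dot(X_batch.T, errors)

# Numerical gradients via central finite differences
eps = 1e-6
num_intercept_der = (batch_mse(w, b + eps) - batch_mse(w, b - eps)) / (2 * eps)
num_coef_der = np.array([
    (batch_mse(w + eps * e_j, b) - batch_mse(w - eps * e_j, b)) / (2 * eps)
    for e_j in np.eye(3)            # perturb one weight at a time
])

assert np.isclose(intercept_der, num_intercept_der, atol=1e-6)
assert np.allclose(coef_der, num_coef_der, atol=1e-6)
```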

### 12. Update Parameters
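```python
self.intercept_ -= self.lr * intercept_der
self.coef_ -= self.lr * coef_der
```

Mathematically, with learning rate $\eta$:

$$
b \leftarrow b - \eta \frac{\partial J_B}{\partial b},
\qquad
\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial J_B}{\partial \mathbf{w}}
$$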


In [16]:
## MBGD class from scratch

In [17]:
class MBGDRegressor:    
    def __init__(self, batch_size, learning_rate=0.01, epochs=100):       
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        
    def fit(self, X_train, y_train):
        
        n_samples, n_features = X_train.shape
        
        self.intercept_ = 0
        self.coef_ = np.zeros(n_features)
        
        for epoch in range(self.epochs):
            
            # Shuffle once per epoch
            indices = np.random.permutation(n_samples)
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]
            
            for start in range(0, n_samples, self.batch_size):
                
                # Slice the current mini-batch; the last one may be smaller
                end = start + self.batch_size
                X_batch = X_shuffled[start:end]
                y_batch = y_shuffled[start:end]
                
                # Predictions and residuals for this batch
                y_hat = np.dot(X_batch, self.coef_) + self.intercept_
                errors = y_batch - y_hat
                B = len(X_batch)
                
                # Gradients of the batch MSE w.r.t. intercept and weights
                intercept_der = (-2 / B) * np.sum(errors)
                coef_der = (-2 / B) * np.dot(X_batch.T, errors)
                
                # Gradient descent step
                self.intercept_ -= self.lr * intercept_der
                self.coef_ -= self.lr * coef_der
        
        return self
    
    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

In [42]:
# 353 training rows // 50 gives a batch size of 7
mbr = MBGDRegressor(batch_size=X_train.shape[0] // 50, learning_rate=0.01, epochs=2000)

In [43]:
mbr.fit(X_train , y_train)

<__main__.MBGDRegressor at 0x216a981f8d0>

In [46]:
y_pred1 = mbr.predict(X_test)

In [47]:
r2_score(y_test, y_pred1)

0.45708620170332803

In [48]:
y_pred = lr.predict(X_test)
r2_score(y_test,y_pred)

0.4526027629719197

# Using sklearn

In [56]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

# Load dataset
X, y = load_diabetes(return_X_y=True)

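# Scale features (important for SGD)
# Note: fitting the scaler before the split lets test-set statistics leak
# into training; fitting it on X_train only is the stricter approach.
scaler = StandardScaler()
X = scaler.fit_transform(X)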

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create model
sgd = SGDRegressor(
    max_iter=1000,
    learning_rate="constant",
    eta0=0.01,
    random_state=42
)

# Fit
sgd.fit(X_train, y_train)

SGDRegressor(eta0=0.01, learning_rate='constant', random_state=42)


In [59]:
sgd = SGDRegressor(
    learning_rate="constant",
    eta0=0.01,
    random_state=42
)

In [60]:
batch_size = 32
n_samples = X_train.shape[0]

In [61]:
for epoch in range(100):
    
    indices = np.random.permutation(n_samples)
    X_shuffled = X_train[indices]
    y_shuffled = y_train[indices]
    
    for start in range(0, n_samples, batch_size):
        
        end = start + batch_size
        
        X_batch = X_shuffled[start:end]
        y_batch = y_shuffled[start:end]
        
        sgd.partial_fit(X_batch, y_batch)

In [62]:
y_pred2 = sgd.predict(X_test)

In [63]:
r2_score(y_test , y_pred2)

0.4628373645026933
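
One caveat about the loop above: according to the scikit-learn documentation, `SGDRegressor` estimates the gradient one sample at a time and updates the model as it goes, and this should also hold inside `partial_fit`. Passing 32-row batches therefore controls how the data is streamed, but it is probably not the same as the single averaged-gradient update used by the from-scratch class. The sketch below contrasts the two update patterns in plain NumPy. It makes several simplifying assumptions: synthetic data, no intercept, no regularization, and the same loss scaling as the from-scratch class rather than sklearn's.

```python
import numpy as np

rng = np.random.default_rng(0)
X_batch = rng.normal(size=(32, 3))   # made-up batch of 32 samples
y_batch = X_batch @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=32)
lr = 0.01

# (a) Mini-batch step: a single update from the gradient averaged over the batch
w_avg = np.zeros(3)
errors = y_batch - X_batch @ w_avg
w_avg -= lr * (-2 / len(X_batch)) * (X_batch.T @ errors)

# (b) Per-sample pass: 32 updates, each from one sample's gradient
w_seq = np.zeros(3)
for x_i, y_i in zip(X_batch, y_batch):
    error_i = y_i - x_i @ w_seq
    w_seq -= lr * (-2) * x_i * error_i

print("averaged step:  ", w_avg)
print("per-sample pass:", w_seq)
```

With the same learning rate, the per-sample pass moves the weights much further than the single averaged step, so the two approaches need different learning rates to behave comparably.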