# 📉 Stochastic Gradient Descent (SGD)  
**Topic:** Why Full Batch Gradient Descent becomes slow & how SGD helps

---

## 1️⃣ Batch Gradient Descent (BGD) Basics

- **Batch Gradient Descent (BGD)** uses **all training examples** to compute gradients.
- Parameter update happens **once per epoch**.

### 🔁 Steps per Epoch
1. Predict output for **all rows**
2. Compute loss for **all rows**
3. Compute gradients using **entire dataset**
4. Update parameters **one time**
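
The four steps above can be sketched for linear regression with a mean squared error loss. This is only a minimal illustration, not code from these notes: the function name `batch_gd_epoch`, the synthetic data, and the learning rate are assumptions made for the example.

```python
import numpy as np

def batch_gd_epoch(X, y, coef, intercept, lr=0.1):
    """Run one epoch of batch gradient descent for linear regression (MSE loss)."""
    n = X.shape[0]
    y_hat = X @ coef + intercept              # 1. predict for all rows
    error = y - y_hat
    loss = np.mean(error ** 2)                # 2. loss over all rows
    grad_coef = -2.0 / n * (X.T @ error)      # 3. gradients from the entire dataset
    grad_intercept = -2.0 * np.mean(error)
    coef = coef - lr * grad_coef              # 4. a single parameter update
    intercept = intercept - lr * grad_intercept
    return coef, intercept, loss

# Hypothetical synthetic data: y = 3*x0 - 2*x1 + 5 (no noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 5.0

coef, intercept = np.zeros(2), 0.0
losses = []
for _ in range(200):                          # 200 epochs = 200 updates
    coef, intercept, loss = batch_gd_epoch(X, y, coef, intercept)
    losses.append(loss)

print(coef, intercept)
```

Note that the number of parameter updates equals the number of epochs, regardless of how many rows the dataset has.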

---

## 2️⃣ Small Dataset Example

### 🔹 Assumptions

- Number of rows:
$$
n = 1000
$$
- Number of features:
$$
d = 6
$$
- Number of epochs:
$$
E = 50
$$

### 🔹 Computation

- Gradients are computed on all rows in every epoch
- Total per-row gradient evaluations:

$$
\text{Total derivatives} \approx E \times n = 50 \times 1000 = 50{,}000
$$

### 🔹 Interpretation

- $50{,}000$ gradient evaluations are **manageable**
- Training is **fast enough**
- BGD works well for **small datasets**

---

## 3️⃣ Large Dataset Example

### 🔹 Assumptions

- Number of rows:
$$
n = 10^5
$$
- Number of features:
$$
d = 10^2
$$
- Number of epochs:
$$
E = 10^3
$$

### 🔹 Computation

- Each epoch processes all $10^5$ rows
- Total per-row gradient evaluations:

$$
\text{Total derivatives} \approx n \times E = 10^5 \times 10^3 = 10^8
$$

- Each row's gradient involves all $d = 10^2$ features, so counting feature-wise operations (before any overhead):

$$
\approx 10^8 \times 10^2 = 10^{10} \text{ operations}
$$

### 🔹 Interpretation

- Entire dataset is processed **every epoch**
- Computation becomes **very heavy**
- Training becomes **very slow / impractical**

---

## 4️⃣ Visual Comparison

| Aspect                | Small Dataset                  | Large Dataset                         |
|-----------------------|--------------------------------|--------------------------------------|
| Rows                  | $1000$                         | $10^5$                               |
| Features              | $6$                            | $10^2$                               |
| Epochs                | $50$                           | $10^3$                               |
| Gradient calculations | $\approx 50{,}000$             | $\approx 10^8$ to $10^{10}$          |
| Speed (BGD)           | Fast                           | Very Slow                            |
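
The counts in the table can be reproduced with a few lines of arithmetic. The helper `gradient_work` is a hypothetical name, and the feature-level count assumes one derivative term per feature per row, ignoring the intercept and any overhead.

```python
def gradient_work(n_rows, n_features, epochs):
    """Return (row-level gradient evaluations, per-feature derivative terms)."""
    row_level = epochs * n_rows
    feature_level = row_level * n_features
    return row_level, feature_level

small = gradient_work(n_rows=1000, n_features=6, epochs=50)
large = gradient_work(n_rows=10**5, n_features=10**2, epochs=10**3)

print(small)  # (50000, 300000)
print(large)  # (100000000, 10000000000)
```

The large case needs 2,000 times more row-level evaluations than the small one, and the gap widens further once the feature count is included.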

---

## 5️⃣ Stochastic Gradient Descent (SGD)

### 🔹 Meaning of *Stochastic*
- **Stochastic** = **random**, determined by chance
- Rows are selected in a **random order**

---

## 6️⃣ Key Idea of SGD (from handwritten notes)

### ✅ How SGD Works

- Instead of using **all rows**, SGD:
  - Picks **one random row**
  - Or a **small mini-batch**
- Updates parameters **immediately**

$$
\text{Update frequency} = \text{per row}
$$
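
To make the per-row update concrete, assume a linear model $\hat{y}_i = w^\top x_i + b$ with squared error on a single row $i$ and a learning rate $\eta$ (these symbols are introduced here for illustration). The loss on that row and its derivatives are:

$$
L_i = (y_i - \hat{y}_i)^2, \qquad
\frac{\partial L_i}{\partial b} = -2\,(y_i - \hat{y}_i), \qquad
\frac{\partial L_i}{\partial w} = -2\,(y_i - \hat{y}_i)\,x_i
$$

and one SGD step is:

$$
b \leftarrow b - \eta\,\frac{\partial L_i}{\partial b}, \qquad
w \leftarrow w - \eta\,\frac{\partial L_i}{\partial w}
$$

These are the same two derivatives computed as `intercept_der` and `coef_der` in the `SGDRegression` class later in these notes.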

---

### 🧠 Flow of SGD

- Random row selected  
- Gradient calculated on that row  
- **Row-wise update** of the parameters  
- Repeated until $n$ rows have been processed  
- One full pass = **1 epoch**
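
The flow above can be sketched as a single function. This is a minimal sketch assuming linear regression with squared error; `sgd_epoch` and the synthetic data are hypothetical. It visits every row exactly once per epoch in a shuffled order, whereas the `SGDRegression` class later in these notes samples row indices with replacement. Both variants are common.

```python
import numpy as np

def sgd_epoch(X, y, coef, intercept, lr, rng):
    """One SGD epoch: visit every row once, in random order, updating after each row."""
    for i in rng.permutation(X.shape[0]):          # random row order
        error = y[i] - (X[i] @ coef + intercept)   # prediction error on this row only
        coef = coef + lr * 2.0 * error * X[i]      # row-wise update (minus lr * gradient)
        intercept = intercept + lr * 2.0 * error
    return coef, intercept

# Hypothetical synthetic data: y = 3*x0 - 2*x1 + 5 (no noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 5.0

coef, intercept = np.zeros(2), 0.0
for _ in range(20):                                # 20 epochs = 20 * 200 updates
    coef, intercept = sgd_epoch(X, y, coef, intercept, lr=0.01, rng=rng)

print(coef, intercept)
```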

---

## 7️⃣ Important Points (Handwritten Notes Explained)

- ⚡ **Faster per update** than Batch GD, since each update uses one row instead of all $n$
- 🎲 Rows are chosen by **random sampling**
- 🔄 Data order is **random**
- 📉 Loss does **not decrease smoothly**
- ❌ Parameters keep **fluctuating near the minimum** instead of settling on one steady answer
- ✅ But it usually reaches a **near-optimal solution faster**
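
The point about the loss not decreasing smoothly can be checked numerically. The sketch below is an illustration on hypothetical noisy synthetic data, with an assumed learning rate: it records the full-dataset mean squared error after every single-row update and counts how often that loss goes up.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
# Hypothetical target with added noise, so no parameter setting fits every row exactly
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=1.0, size=300)

coef, intercept, lr = np.zeros(3), 0.0, 0.02
losses = []
for _ in range(1000):
    i = rng.integers(0, X.shape[0])                 # pick one random row
    error = y[i] - (X[i] @ coef + intercept)
    coef += lr * 2.0 * error * X[i]                 # update from that row only
    intercept += lr * 2.0 * error
    losses.append(np.mean((y - (X @ coef + intercept)) ** 2))  # loss on ALL rows

losses = np.array(losses)
increases = int(np.sum(np.diff(losses) > 0))
print(f"first loss: {losses[0]:.2f}, last loss: {losses[-1]:.2f}")
print(f"updates that increased the full-data loss: {increases} of {len(losses) - 1}")
```

An update that helps one row can hurt the fit on the others, which is why individual steps may raise the overall loss even though the trend is downward.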

---

## 8️⃣ Batch vs Stochastic Gradient Descent

| Feature | Batch Gradient Descent | Stochastic Gradient Descent |
|-------|------------------------|-----------------------------|
| Data used | All rows | Single random row |
| Update | Once per epoch | After every row |
| Speed | Slow for large data | Fast per update |
| Stability | Smooth convergence | Noisy convergence |
| Practical use | Small datasets | Large datasets |

---

## 9️⃣ Final Insight

> **Batch Gradient Descent**  
> ✔ Accurate (uses the exact gradient)  
> ❌ Very slow on large data  

> **Stochastic Gradient Descent**  
> ✔ Fast and scalable  
> ❌ Noisy updates, though practical  

---

### 📌 One-Line Summary

$$
\text{SGD trades stability for speed, making it ideal for large datasets}
$$

In [105]:
import time

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load the diabetes regression dataset as feature matrix x and target y
x,y = load_diabetes(return_X_y=True)

In [94]:
print(x.shape)
print(y.shape)

(442, 10)
(442,)


In [95]:
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=0.8,random_state=3)

In [96]:
reg = LinearRegression()
reg.fit(x_train,y_train)

In [97]:
reg.coef_

array([  -1.13744712, -212.8867836 ,  540.45536994,  345.20621542,
       -938.23814645,  516.62060367,  172.85885498,  267.87535242,
        732.63230159,   70.07849485])

In [98]:
reg.intercept_

np.float64(153.13441535285003)

In [99]:
y_pred = reg.predict(x_test)
r2_score(y_test,y_pred)

0.4161792211496941

In [100]:
class SGDRegression:
    """Linear regression trained with per-row stochastic gradient descent."""

    def __init__(self,learning_rate = 0.01,epochs = 100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self,x_train,y_train):
        # Start from intercept 0 and all coefficients 1
        self.intercept_ =  0
        self.coef_ = np.ones(x_train.shape[1])

        for i in range(self.epochs):
            # One epoch = n updates, each on a row sampled at random (with replacement)
            for j in range(x_train.shape[0]):
                index = np.random.randint(0,x_train.shape[0])
                y_hat = np.dot(x_train[index],self.coef_) + self.intercept_

                # Derivative of the squared error (y - y_hat)^2 w.r.t. the intercept
                intercept_der = -2 * (y_train[index] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # Derivative of the squared error w.r.t. the coefficients
                coef_der  = -2 * np.dot((y_train[index] - y_hat),x_train[index])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_,self.coef_)

    def predict(self,x_test):
        return np.dot(x_test,self.coef_) + self.intercept_

In [101]:
x_train.shape

(353, 10)

In [111]:
sgd = SGDRegression(learning_rate=0.01,epochs=40)

start = time.time()
sgd.fit(x_train,y_train)

print('Time taken is ',time.time()- start)

141.95570577090814 [  43.57365245  -42.76713967  325.23843933  227.37899541    5.67490445
  -18.71288369 -187.23162281  153.33445179  267.94578639  155.66674071]
Time taken is  0.43929362297058105


In [103]:
y_pred = sgd.predict(x_test)

In [104]:
r2_score(y_test,y_pred)

0.35715706403605363

## Time Comparison

- Suppose the number of epochs is fixed at $E = 100$.
- Which one is faster: batch GD (100 updates) or stochastic GD ($100 \times n$ updates)?
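
One way to explore this question is a small experiment. The sketch below is an illustration under stated assumptions, not code from these notes: `run_batch`, `run_sgd`, the synthetic data, and the learning rates are all hypothetical. It counts parameter updates and measures wall-clock time for the same number of epochs. Timings depend on the machine, so none are quoted here.

```python
import time
import numpy as np

def run_batch(X, y, epochs, lr=0.1):
    """Batch GD: one vectorized update per epoch. Returns the number of updates."""
    coef, intercept, updates = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        error = y - (X @ coef + intercept)
        coef += lr * 2.0 * (X.T @ error) / X.shape[0]
        intercept += lr * 2.0 * error.mean()
        updates += 1
    return updates

def run_sgd(X, y, epochs, lr=0.01, seed=0):
    """SGD: one update per row, n updates per epoch. Returns the number of updates."""
    rng = np.random.default_rng(seed)
    coef, intercept, updates = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            error = y[i] - (X[i] @ coef + intercept)
            coef += lr * 2.0 * error * X[i]
            intercept += lr * 2.0 * error
            updates += 1
    return updates

# Hypothetical synthetic data with n = 1000 rows and 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 2.0
epochs = 100

start = time.perf_counter()
batch_updates = run_batch(X, y, epochs)
batch_seconds = time.perf_counter() - start

start = time.perf_counter()
sgd_updates = run_sgd(X, y, epochs)
sgd_seconds = time.perf_counter() - start

print(f"batch GD: {batch_updates} updates in {batch_seconds:.4f} s")
print(f"SGD:      {sgd_updates} updates in {sgd_seconds:.4f} s")
```

Both methods touch the same number of rows, but batch GD handles each epoch as one vectorized update while SGD performs $n$ separate updates. In a NumPy implementation like this one, batch GD therefore usually finishes a fixed number of epochs sooner. SGD's appeal is that it often gets close to the minimum in far fewer epochs.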

# 📉 When to Use Stochastic Gradient Descent (SGD)

---

## ✅ Use **Stochastic Gradient Descent** When

---

## 1️⃣ Dataset is Very Large

- Number of samples is very high, for example:
  $$
  n \ge 10^5
  $$
- Batch Gradient Descent becomes computationally expensive
- SGD updates parameters using **one data point at a time**

👉 **Best suited for large-scale datasets**

---

## 2️⃣ Limited Memory Availability

- Batch GD needs the gradient over the **entire dataset** for every update, which usually means holding all of it in memory
- SGD processes **one sample or a small mini-batch** at a time

👉 Useful for **memory-constrained systems**
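
The mini-batch idea can be sketched as a generator that hands out a few row indices at a time, so each update only needs those rows. This is a hypothetical illustration: `iter_minibatches` is an invented helper, and the in-memory array `X` stands in for data that would really be read from disk in chunks.

```python
import numpy as np

def iter_minibatches(n_rows, batch_size, rng):
    """Yield arrays of row indices covering every row once, in random order."""
    order = rng.permutation(n_rows)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                     # stand-in for data stored on disk
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 3.0      # hypothetical noise-free target

coef, intercept, lr = np.zeros(4), 0.0, 0.05
for epoch in range(30):
    for idx in iter_minibatches(X.shape[0], batch_size=32, rng=rng):
        X_b, y_b = X[idx], y[idx]                  # only this mini-batch is needed
        error = y_b - (X_b @ coef + intercept)
        coef += lr * 2.0 * (X_b.T @ error) / len(idx)
        intercept += lr * 2.0 * error.mean()

print(coef, intercept)
```

With a batch size of 32, each update works on 32 rows instead of all 1000, at the cost of a noisier gradient estimate.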

---

## 3️⃣ Faster Training is Required

- Parameter updates occur **after every sample**
- Model starts learning immediately
- Provides faster intermediate results

👉 Ideal for **rapid experimentation**

---

## 4️⃣ Online / Streaming Data

- Data arrives **continuously over time**
- Entire dataset is not available at once

👉 SGD supports **online learning**
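
Online learning can be sketched with a model that is updated chunk by chunk as data arrives. The class `OnlineLinearRegressor` and the generator `stream` are hypothetical names written for this example. The method name `partial_fit` is borrowed from scikit-learn, whose `SGDRegressor` offers a method of that name for incremental training, but the implementation here is a plain NumPy sketch.

```python
import numpy as np

class OnlineLinearRegressor:
    """Hypothetical minimal online learner: linear model trained by per-row SGD."""

    def __init__(self, n_features, lr=0.01):
        self.coef_ = np.zeros(n_features)
        self.intercept_ = 0.0
        self.lr = lr

    def partial_fit(self, X_chunk, y_chunk):
        """Update the model with newly arrived rows; earlier rows are not revisited."""
        for x_i, y_i in zip(X_chunk, y_chunk):
            error = y_i - (x_i @ self.coef_ + self.intercept_)
            self.coef_ += self.lr * 2.0 * error * x_i
            self.intercept_ += self.lr * 2.0 * error
        return self

    def predict(self, X):
        return X @ self.coef_ + self.intercept_

def stream(n_chunks, chunk_size, rng):
    """Simulate data arriving over time from y = 2*x0 - 3*x1 + 1."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 2))
        yield X, X @ np.array([2.0, -3.0]) + 1.0

rng = np.random.default_rng(1)
model = OnlineLinearRegressor(n_features=2)
for X_chunk, y_chunk in stream(n_chunks=100, chunk_size=20, rng=rng):
    model.partial_fit(X_chunk, y_chunk)            # each chunk is seen once, then discarded

print(model.coef_, model.intercept_)
```

The full dataset never exists in one place here: the model only ever sees the current chunk, which is what makes per-row updates a natural fit for streams.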

---

## 5️⃣ Complex Loss Surfaces

- Noisy gradient updates can help:
  - Escape shallow **local minima**
  - Move away from **saddle points**

👉 Commonly used in **deep learning**

---

## 6️⃣ Better Generalization is Desired

- Noise in SGD can act as a form of implicit regularization
- This often leads to better **test performance**

---

## ❌ When NOT to Use SGD

| Scenario | Reason |
|--------|--------|
| Small datasets | Gradient noise dominates, while batch GD is already cheap |
| Need exact convergence | SGD oscillates near the minimum |
| Deterministic results required | Random sampling makes results vary between runs |
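
When repeatable results matter but SGD is still preferred, a common mitigation is to fix the random seed. The sketch below uses a hypothetical `sgd_fit` function on synthetic data to show that the same seed reproduces the same parameters, while a different seed does not. scikit-learn's `SGDRegressor` exposes a `random_state` parameter for the same purpose.

```python
import numpy as np

def sgd_fit(X, y, epochs=20, lr=0.01, seed=None):
    """Per-row SGD for linear regression; `seed` controls the row sampling."""
    rng = np.random.default_rng(seed)
    coef, intercept = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs * X.shape[0]):
        i = rng.integers(0, X.shape[0])            # sample a row with replacement
        error = y[i] - (X[i] @ coef + intercept)
        coef = coef + lr * 2.0 * error * X[i]
        intercept = intercept + lr * 2.0 * error
    return coef, intercept

# Hypothetical noisy synthetic data
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + data_rng.normal(scale=0.5, size=100)

coef_a, b_a = sgd_fit(X, y, seed=123)
coef_b, b_b = sgd_fit(X, y, seed=123)   # same seed: identical sampling order
coef_c, b_c = sgd_fit(X, y, seed=456)   # different seed: different sampling order

print("same seed, same result:     ", np.array_equal(coef_a, coef_b))
print("different seed, same result:", np.array_equal(coef_a, coef_c))
```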

---

## ⚖️ Batch GD vs SGD

| Property | Batch Gradient Descent | Stochastic Gradient Descent |
|--------|-----------------------|-----------------------------|
| Data used per update | All samples | Single sample |
| Training speed | Slow on large data | Fast per update |
| Memory usage | High | Low |
| Stability | Smooth convergence | Noisy updates |
| Large datasets | ❌ | ✅ |

---

## 🧠 One-Line Rule (Exam Tip)

> **Use Stochastic Gradient Descent when the dataset is large, memory is limited, or fast or online learning is required.**


In [112]:
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(max_iter=100,learning_rate='constant',eta0=0.01)
sgd.fit(x_train,y_train)
y_pred = sgd.predict(x_test)
r2_score(y_test,y_pred)



0.3857240729894066