## 9. Batch vs. Online Learning: A Comprehensive Comparison

### Head-to-Head Comparison

| Aspect | Batch Learning | Online Learning |
|--------|---|---|
| **Data Processing** | Entire dataset at once | One sample or mini-batch at a time |
| **Training Phases** | Single large training phase | Continuous incremental updates |
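| **Memory Usage** | High (full dataset typically held in memory) | Low (only the current sample or mini-batch in memory) |
| **Convergence Speed** | Slower per update (each step uses the full dataset), but smooth | Fast, cheap updates; noisier path to convergence |
| **Adaptation to New Data** | Requires retraining, often from scratch | Incremental update as data arrives |
| **Real-time Capability** | Can serve in real time, but cannot learn in real time | Learns and serves in real time |
| **Computational Cost** | High, concentrated in each training run | Spread over time |
| **Model Stability** | Very stable between retrains | More sensitive to noise and outliers |
| **Data Redundancy** | Reprocesses redundant samples on every pass | Each sample is processed once, so redundancy costs little |
| **Implementation** | Simpler | More complex (drift handling, monitoring, delayed labels) |
| **Suitable Scale** | Datasets that fit in memory or on a cluster (rule of thumb: up to ~100 GB) | Larger-than-memory data (100 GB+) or unbounded streams |
| **Tools** | scikit-learn, Spark MLlib, TensorFlow | River, scikit-learn (`partial_fit`) |
| **Latency** | Long gap between model updates | Low; updates in near real time |
| **Hyperparameter Tuning** | Easier (offline cross-validation) | Harder (must be tuned on a moving stream) |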

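The first rows of the table (data processing and training phases) can be made concrete with a small sketch. It fits the same 1-D linear model twice: once with batch gradient descent, where every update averages over the whole dataset, and once with online SGD, where each sample triggers one update and is then discarded. The synthetic data, learning rates, and function names are assumptions chosen for illustration.

```python
import random

def make_data(n, true_w=3.0, noise=0.1, seed=0):
    """Synthetic 1-D regression data: y = true_w * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, noise)) for x in xs]

def batch_fit(data, lr=0.5, epochs=50):
    """Batch gradient descent: every update averages over the full dataset."""
    w = 0.0
    for _ in range(epochs):
        grad = sum((w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def online_fit(stream, lr=0.1):
    """Online SGD: one update per sample, each sample seen exactly once."""
    w = 0.0
    for x, y in stream:
        w -= lr * (w * x - y) * x
    return w

data = make_data(2000)
w_batch = batch_fit(data)          # needs the whole list in memory
w_online = online_fit(iter(data))  # only needs one (x, y) pair at a time
print(f"batch w = {w_batch:.3f}, online w = {w_online:.3f}")
```

Both estimates should land close to the true slope of 3.0; the online one is typically a little noisier because each step follows a single sample.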
### Computational Complexity Comparison


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Illustrative cost model: the constants below are made up for plotting,
# not measured benchmarks.
dataset_sizes = np.array([1000, 10000, 100000, 1000000])

# Batch: one pass over the data is O(n); the per-sample constant is assumed
# to be higher because training usually makes several passes (epochs).
batch_time = dataset_sizes * 0.001

# Online: O(1) per sample, so the total is also O(n) but with a smaller
# constant, and the work is spread over time as samples arrive.
online_time_per_sample = 0.00001
online_time_total = dataset_sizes * online_time_per_sample

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(dataset_sizes, batch_time, marker='o', label='Batch', linewidth=2)
plt.plot(dataset_sizes, online_time_total, marker='s', label='Online (total)', linewidth=2)
plt.xlabel('Dataset Size (samples)')
plt.ylabel('Training Time (seconds)')
plt.title('Training Time: Batch vs Online')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.grid(True, alpha=0.3)

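# Memory usage comparison (illustrative; assumes 100 float64 features per sample)
plt.subplot(1, 2, 2)
n_features = 100
bytes_per_value = 8  # float64
# Batch holds the full dataset: n_samples * n_features * 8 bytes, converted to GB
batch_memory = dataset_sizes * n_features * bytes_per_value / 1e9
# Online holds one fixed-size mini-batch, regardless of dataset size
online_batch_size = 1000
online_memory = np.full(dataset_sizes.shape,
                        online_batch_size * n_features * bytes_per_value / 1e9)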

plt.plot(dataset_sizes, batch_memory, marker='o', label='Batch', linewidth=2)
plt.plot(dataset_sizes, online_memory, marker='s', label='Online', linewidth=2)
plt.xlabel('Dataset Size (samples)')
plt.ylabel('Peak Memory (GB)')
plt.title('Memory Usage: Batch vs Online')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

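A back-of-the-envelope check of the memory comparison: a dense numeric dataset needs roughly `n_samples * n_features * bytes_per_value` bytes, while an online learner only needs room for one mini-batch. The sketch below assumes dense float64 values; `n_features = 100` and the mini-batch size of 1,000 are assumptions picked for illustration.

```python
def dense_memory_gb(n_samples, n_features, bytes_per_value=8):
    """Memory needed to hold a dense numeric matrix, in GB (1 GB = 1e9 bytes)."""
    return n_samples * n_features * bytes_per_value / 1e9

n_features = 100         # assumed feature count
mini_batch_size = 1_000  # assumed online mini-batch size

for n_samples in (1_000, 10_000, 100_000, 1_000_000):
    batch_gb = dense_memory_gb(n_samples, n_features)
    # The online learner never holds more than one mini-batch
    online_gb = dense_memory_gb(min(n_samples, mini_batch_size), n_features)
    print(f"{n_samples:>9,} samples: batch {batch_gb:.4f} GB, online {online_gb:.4f} GB")
```

Batch memory grows linearly with the dataset, while the online figure stays flat once the dataset exceeds one mini-batch.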

### Production Systems Comparison

#### Batch Learning in Production
```
[Database] → [Batch Job Scheduled Daily]
              ↓ (1-2 hour process)
           [Model Trained]
              ↓
           [Save Model]
              ↓
           [Deploy to Production]
              ↓
       [Serve predictions all day]
           (with same model)
              ↓
           [Next day, repeat]
```

Example: a Netflix-style recommendation system retrained daily (an illustrative sketch, not Netflix's actual pipeline)


In [None]:
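# Batch ML pipeline (Airflow 2.x style)
from datetime import datetime, timedelta

from airflow import DAG
# `airflow.operators.python_operator` is the deprecated Airflow 1.x import path
from airflow.operators.python import PythonOperator
from sklearn.ensemble import GradientBoostingRegressor

default_args = {
    'owner': 'data-team',
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('daily_recommendation_model', default_args=default_args,
          schedule_interval='0 2 * * *',  # Run at 2 AM daily (newer Airflow versions use `schedule=`)
          catchup=False)  # Do not backfill every day since start_date

def load_training_data():
    # Hypothetical helper (not an Airflow or scikit-learn API):
    # should return (X_train, y_train) built from the extracted features.
    raise NotImplementedError

# Task 1: Extract features from database
def extract_data():
    # Extract user-movie ratings, content features, etc.
    pass

# Task 2: Train model on all accumulated data
def train_model():
    X_train, y_train = load_training_data()
    model = GradientBoostingRegressor(n_estimators=100)
    model.fit(X_train, y_train)  # Can take 1-2 hours on millions of ratings
    # Persist the fitted model here so the deploy task can pick it up

# Task 3: Evaluate and validate
def validate_model():
    # A/B test, quality checks
    pass

# Task 4: Deploy to production
def deploy_model():
    # Update serving infrastructure
    pass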

t1 = PythonOperator(task_id='extract_data', python_callable=extract_data, dag=dag)
t2 = PythonOperator(task_id='train_model', python_callable=train_model, dag=dag)
t3 = PythonOperator(task_id='validate', python_callable=validate_model, dag=dag)
t4 = PythonOperator(task_id='deploy', python_callable=deploy_model, dag=dag)

t1 >> t2 >> t3 >> t4

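The validation step between training and deployment is often a simple gate: the freshly trained model is promoted only if it does not perform worse than the model currently serving. The sketch below shows one way such a check could look. The metric (mean absolute error), the tolerance, and the toy numbers are all assumptions, and `should_deploy` is a hypothetical helper rather than part of Airflow or scikit-learn.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def should_deploy(candidate_mae, production_mae, tolerance=0.01):
    """Promote the candidate only if it is not worse than production by more than `tolerance`."""
    return candidate_mae <= production_mae + tolerance

# Toy hold-out ratings and predictions from the two models (made-up numbers)
y_holdout = [4.0, 3.0, 5.0, 2.0, 4.0]
production_pred = [3.5, 3.5, 4.0, 2.5, 4.5]
candidate_pred = [3.8, 3.1, 4.6, 2.2, 4.1]

production_mae = mean_absolute_error(y_holdout, production_pred)
candidate_mae = mean_absolute_error(y_holdout, candidate_pred)

if should_deploy(candidate_mae, production_mae):
    print(f"Deploy: candidate MAE {candidate_mae:.2f} vs production {production_mae:.2f}")
else:
    print("Keep the current production model")
```

In a pipeline like the one above, a failed gate would typically raise an exception inside the validation task so that the deploy task never runs.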

#### Online Learning in Production
```
[Streaming Data Source (Kafka)]
        ↓ (Continuous)
[Online ML Model]
     (Continuously Updates)
        ↓
[Real-time Predictions]
```

Example: Fraud Detection System


In [None]:
# Online ML Pipeline
from river import linear_model, preprocessing, metrics
from kafka import KafkaConsumer
import json

# Initialize online model
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Connect to transaction stream
consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

n_labeled = 0  # Number of transactions with a confirmed label seen so far

print("Listening to transaction stream...")

for message in consumer:
    transaction = message.value

    # Extract features (StandardScaler needs numeric values, so
    # 'category' is assumed to be a numeric code here)
    features = {
        'amount': transaction['amount'],
        'merchant_category': transaction['category'],
        'time_of_day': transaction['hour'],
        'user_history': transaction['user_avg_spend']
    }

    # Predict before learning: predict_proba_one returns a dict of class probabilities
    fraud_prob = model.predict_proba_one(features).get(True, 0.0)

    if fraud_prob > 0.8:  # High fraud probability
        print(f"⚠️  FRAUD ALERT: {transaction['id']}")
        # Block transaction

    # Learn from ground truth when available (in practice, labels
    # usually arrive later than the transaction itself)
    actual_fraud = transaction.get('confirmed_fraud')
    if actual_fraud is not None:
        actual_fraud = bool(actual_fraud)
        metric.update(actual_fraud, fraud_prob > 0.5)  # Score the earlier prediction
        model.learn_one(features, actual_fraud)
        n_labeled += 1

        # Log metric every 1000 labeled transactions
        if n_labeled % 1000 == 0:
            print(f"Model accuracy: {metric.get():.4f} ({n_labeled} samples)")

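The predict-then-learn pattern in this loop can be tried without Kafka or River. The sketch below replaces the transaction stream with a synthetic generator and the River pipeline with a minimal hand-written logistic regression, then measures accuracy by scoring each prediction before the model sees the label. The class `OnlineLogisticRegression`, the generator `simulated_stream`, the fraud rule, and the learning rate are all assumptions made for illustration.

```python
import math
import random

class OnlineLogisticRegression:
    """Minimal logistic regression trained by per-sample SGD (stand-in for a River model)."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.weights = {}
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        error = self.predict_proba_one(x) - float(y)  # gradient of the log loss w.r.t. z
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error

def simulated_stream(n, seed=42):
    """Hypothetical transactions: risk grows with the scaled amount and at night."""
    rng = random.Random(seed)
    for _ in range(n):
        x = {'amount': rng.uniform(0.0, 1.0), 'night': float(rng.random() < 0.3)}
        yield x, (x['amount'] + 0.5 * x['night'] > 0.8)

model = OnlineLogisticRegression()
correct = total = 0
for x, y in simulated_stream(5000):
    p = model.predict_proba_one(x)  # 1. predict with the current model
    correct += int((p > 0.5) == y)  # 2. score that prediction
    total += 1
    model.learn_one(x, y)           # 3. only then learn from the label

accuracy = correct / total
print(f"Predict-then-learn accuracy over {total} samples: {accuracy:.3f}")
```

Because every prediction is made before the corresponding label is used, the reported accuracy reflects how the model would have performed live, including its early mistakes.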

### Application Suitability


In [None]:
# Batch Learning Applications
batch_applications = {
    'recommendation_systems': 'Large catalogs (Netflix- or Amazon-style) - periodic, e.g. daily, retraining',
    'fraud_detection': "Daily batch training on previous days' data",
    'price_prediction': 'Real estate, stock prices - weekly retraining',
    'demand_forecasting': 'Retail, airlines - daily/weekly models',
    'medical_imaging': 'CT scans, X-rays - trained on available cases'
}

# Online Learning Applications
online_applications = {
    'real_time_fraud_detection': 'Credit card - detect in milliseconds',
    'streaming_analytics': 'Social media trends - continuous learning',
    'adaptive_systems': 'Spam filters - learn new patterns immediately',
    'autonomous_vehicles': 'Driving systems - adapt to changing road conditions',
    'IoT_monitoring': 'Sensor data - continuous anomaly detection',
    'recommendation': 'YouTube-style feeds - update user preferences per click'
}

print("When to use Batch Learning:")
for app, desc in batch_applications.items():
    print(f"  • {app}: {desc}")

print("\nWhen to use Online Learning:")
for app, desc in online_applications.items():
    print(f"  • {app}: {desc}")

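Many of the online use cases listed here share one property: the data distribution changes over time (concept drift). A minimal sketch of why that matters, using a synthetic signal whose level jumps halfway through: a "batch" estimate fitted once and frozen keeps predicting the old level, while an exponentially weighted online estimate catches up. The signal, the smoothing factor `alpha`, and the error metric are assumptions for illustration.

```python
import random

rng = random.Random(1)
# Synthetic signal: the true level jumps from 10 to 20 halfway through (concept drift)
stream = ([rng.gauss(10.0, 1.0) for _ in range(500)]
          + [rng.gauss(20.0, 1.0) for _ in range(500)])

# "Batch" model: mean fitted once on the first 500 points, then frozen
batch_estimate = sum(stream[:500]) / 500

# "Online" model: exponentially weighted mean, updated after every observation
alpha = 0.05  # assumed smoothing factor; larger adapts faster but is noisier
online_estimate = stream[0]
batch_err = online_err = 0.0
for i, value in enumerate(stream):
    if i >= 500:  # score both models only after the drift
        batch_err += abs(value - batch_estimate)
        online_err += abs(value - online_estimate)
    online_estimate += alpha * (value - online_estimate)  # learn after predicting

batch_mae = batch_err / 500
online_mae = online_err / 500
print(f"post-drift MAE: batch {batch_mae:.2f}, online {online_mae:.2f}")
```

The frozen estimate stays off by roughly the size of the jump until the next retrain, whereas the online estimate recovers within a few dozen samples.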

### Tools and Frameworks


In [None]:
# Batch Learning Tools
batch_tools = """
1. Scikit-learn
   - Mature, stable, comprehensive algorithms
   - Good for datasets that fit in memory
   
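2. Apache Spark MLlib
   - Distributed batch learning on clusters
   - Handles terabyte-scale datasets
   - In PySpark, prefer the DataFrame-based pyspark.ml API;
     the RDD-based pyspark.mllib is in maintenance mode
   
3. TensorFlow/PyTorch
   - Deep learning frameworks
   - Train with mini-batch SGD, usually for several epochs over a
     fixed dataset (incremental updates are possible but not the default)
   - Requires significant computational resources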
   
4. XGBoost, LightGBM
   - Gradient boosting frameworks
   - Fast batch training
   - Excellent for tabular data
   
5. H2O
   - Distributed ML platform
   - Auto ML capabilities
   - Can handle large datasets
"""

# Online Learning Tools
online_tools = """
1. River
   - Modern Python library for online ML
   - Supports classification, regression, clustering
   - Stream learning and drift detection
   - pip install river
   
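2. Scikit-learn partial_fit
   - Only some estimators support partial_fit (e.g. SGDClassifier,
     SGDRegressor, MultinomialNB, MiniBatchKMeans)
   - Processes data in mini-batches
   - Incremental rather than strictly per-sample online learning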
   
3. Kafka + Spark Streaming
   - Real-time data pipelines
   - Online updates with distributed computing
   - Complex but powerful
   
4. Apache Flink
   - Stream processing framework
   - Real-time ML model updates
   - Complex distributed system
   
5. Custom implementations
   - Often needed for specific requirements
   - Build on SGDClassifier / SGDRegressor with partial_fit
   - Or implement the online update rule from scratch
"""

print(batch_tools)
print(online_tools)

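As a concrete instance of the scikit-learn `partial_fit` route, the sketch below feeds an `SGDClassifier` one mini-batch at a time instead of calling `fit` on the full array. It assumes NumPy and scikit-learn are installed; the synthetic data, batch size of 100, and train/test split are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable toy labels

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first partial_fit call

batch_size = 100
for start in range(0, 1500, batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=classes)  # incremental update

acc = clf.score(X[1500:], y[1500:])  # evaluate on samples never used for training
print(f"Hold-out accuracy after 15 mini-batches: {acc:.3f}")
```

In a real stream the loop would read each mini-batch from disk or a message queue, so the full dataset never has to be in memory at once.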

---
