# Chapter 41: Batch Prediction Systems

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the architecture of a batch prediction system and when to use it
- Design and implement a robust batch processing pipeline for daily NEPSE predictions
- Schedule batch jobs using cron, Apache Airflow, and cloud schedulers
- Build scalable data preparation pipelines that handle multiple stocks efficiently
- Compute features at scale using pandas, Dask, or Spark
- Implement batch inference for hundreds of models (one per stock) efficiently
- Store prediction results in databases, data lakes, or files for downstream consumption
- Set up notification systems (email, Slack) to alert on job completion or failures
- Monitor batch jobs with logging and metrics
- Handle errors gracefully with retries and fallback mechanisms
- Optimize performance for large‑scale batch processing
- Integrate batch predictions into trading or reporting workflows

---

## **41.1 Introduction to Batch Prediction Systems**

Batch prediction systems generate forecasts on a regular schedule (e.g., daily, hourly) for a large set of inputs (e.g., all NEPSE stocks). Unlike real‑time prediction services (Chapter 40), which respond to individual requests, batch systems process data in bulk, store the results, and make them available for later querying.

For the NEPSE prediction system, a batch approach is natural: after the market closes each day, we can fetch the day's data, compute features, run predictions for all stocks, and store the next‑day return forecasts in a database. Traders can then query these predictions the next morning.

**Advantages of batch prediction:**
- Efficient for large numbers of predictions (e.g., hundreds of stocks).
- Easier to manage and monitor (scheduled jobs).
- Can leverage big data tools (Spark, Dask) for scalability.
- Results are persistent and auditable.

**Disadvantages:**
- Not real‑time; predictions are only as fresh as the last batch.
- Requires infrastructure to schedule and run jobs reliably.

---

## **41.2 Batch Processing Architecture**

A typical batch prediction pipeline consists of the following stages:

1. **Data Ingestion:** Fetch raw data from sources (CSV files, databases, APIs).
2. **Data Validation:** Check data quality and completeness.
3. **Feature Engineering:** Compute features for each stock using historical data.
4. **Model Loading:** Load pre‑trained models (one per stock or a single global model).
5. **Inference:** Generate predictions for the target period.
6. **Result Storage:** Write predictions to a database, data lake, or file.
7. **Notification:** Alert on success/failure.
8. **Monitoring:** Track job duration, data volumes, and prediction quality.

These stages are often implemented as a **DAG** (Directed Acyclic Graph) using workflow orchestrators like Apache Airflow.
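
The stage ordering above can be expressed as a small dependency graph even before an orchestrator is introduced. The following is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+); the stage names and the `run_stages` helper are illustrative assumptions, not part of Airflow or any other framework.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (hypothetical names).
PIPELINE_DAG = {
    "ingest": set(),
    "validate": {"ingest"},
    "features": {"validate"},
    "load_models": set(),
    "inference": {"features", "load_models"},
    "store": {"inference"},
    "notify": {"store"},
}

def run_stages(dag, stage_funcs):
    """Run each stage once, only after all of its dependencies have run."""
    executed = []
    for stage in TopologicalSorter(dag).static_order():
        stage_funcs[stage]()          # call the stage implementation
        executed.append(stage)
    return executed

# Placeholder implementations that do nothing; real stages would do the work.
stage_funcs = {name: (lambda: None) for name in PIPELINE_DAG}
order = run_stages(PIPELINE_DAG, stage_funcs)
```

An orchestrator adds scheduling, retries, and a UI on top of this idea, but the dependency structure is the same.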

---

## **41.3 Data Preparation Pipelines**

Data preparation for batch prediction must handle multiple stocks and ensure that features are computed correctly without look‑ahead bias.

### **41.3.1 Loading Raw Data**

Assume we receive a daily CSV file with all stocks' OHLCV data. We'll load it into a pandas DataFrame.

```python
import pandas as pd
from datetime import datetime

def load_raw_data(date):
    """
    Load raw NEPSE data for a given date.
    For simplicity, we assume files are named like 'nepse_YYYYMMDD.csv'
    """
    filename = f"data/raw/nepse_{date.strftime('%Y%m%d')}.csv"
    df = pd.read_csv(filename)
    df['Date'] = pd.to_datetime(df['Date'])
    return df
```
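
The validation stage from Section 41.2 can run directly on the loaded DataFrame. The sketch below assumes the file has the columns `Symbol`, `Date`, `Open`, `High`, `Low`, `Close`, and `Volume`; the column list, the `validate_raw_data` name, and the sample tickers are assumptions for illustration, so adjust them to the actual file layout.

```python
import pandas as pd

REQUIRED_COLUMNS = ["Symbol", "Date", "Open", "High", "Low", "Close", "Volume"]

def validate_raw_data(df, min_symbols=1):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    if df["Symbol"].nunique() < min_symbols:
        problems.append(f"only {df['Symbol'].nunique()} symbols, expected >= {min_symbols}")
    if df.duplicated(subset=["Symbol", "Date"]).any():
        problems.append("duplicate (Symbol, Date) rows")
    if (df[["Open", "High", "Low", "Close"]] <= 0).any().any():
        problems.append("non-positive prices")
    if (df["High"] < df["Low"]).any():
        problems.append("High below Low")
    return problems

# Sample rows with made-up values
good = pd.DataFrame({
    "Symbol": ["NABIL", "NICA"], "Date": pd.to_datetime(["2025-01-02"] * 2),
    "Open": [500.0, 400.0], "High": [510.0, 405.0], "Low": [495.0, 398.0],
    "Close": [505.0, 402.0], "Volume": [1000, 2000],
})
bad = good.assign(Low=[520.0, 398.0])  # first row now has Low above High
```

Returning a list of problems rather than raising on the first one lets the job report everything wrong with a file in a single alert.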

### **41.3.2 Feature Computation at Scale**

If we have many stocks and need to compute rolling features (e.g., 20‑day moving average), we must do this per stock. Using pandas `groupby` is efficient for moderate data (e.g., a few thousand stocks × a few years of daily history). For larger datasets, we might use Dask or Spark.

```python
def compute_features(df):
    """
    Compute features for all stocks.
    Expects the columns Symbol, Date, and Close; sorting is handled here.
    """
    # Sort by Symbol and Date so shifts and rolling windows follow time order
    df = df.sort_values(['Symbol', 'Date'])
    
    # Compute returns
    df['Return'] = df.groupby('Symbol')['Close'].pct_change() * 100
    
    # Lag features
    for lag in [1, 2, 3, 5]:
        df[f'Return_Lag{lag}'] = df.groupby('Symbol')['Return'].shift(lag)
    
    # Rolling statistics (20-day)
    df['MA_20'] = df.groupby('Symbol')['Close'].transform(lambda x: x.rolling(20, min_periods=1).mean())
    df['Volatility_20'] = df.groupby('Symbol')['Return'].transform(lambda x: x.rolling(20, min_periods=1).std())
    
    # RSI (simplified)
    def rsi(series, period=14):
        delta = series.diff()
        gain = delta.where(delta > 0, 0)
        loss = -delta.where(delta < 0, 0)
        avg_gain = gain.rolling(period).mean()
        avg_loss = loss.rolling(period).mean()
        rs = avg_gain / avg_loss
        rsi = 100 - (100 / (1 + rs))
        return rsi
    df['RSI'] = df.groupby('Symbol')['Close'].transform(lambda x: rsi(x))
    
    # Drop rows with NaN (first few rows of each stock)
    df = df.dropna()
    return df
```

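**Explanation:**  
We use `groupby` and `transform` to apply rolling functions per stock. With `min_periods=1`, `MA_20` and `Volatility_20` are computed from partial windows at the start of each stock's history, so those early values rest on fewer than 20 observations. The final `dropna()` removes only the rows where some feature is undefined, mainly the warm‑up rows of the 14‑period RSI and the lagged returns; it does not remove rows whose 20‑day window is still incomplete. Set `min_periods=20` if only full windows should be used. In a production batch job, we would compute features incrementally (e.g., only for the new day) rather than recomputing everything.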
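
The effect of `min_periods` on the first rows of a rolling window is easy to check on a toy series. The prices below are made up, and a 3-row window stands in for the 20-day window to keep the example short.

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0])

# min_periods=1: a value is emitted from the first row, averaging whatever is available.
partial = prices.rolling(3, min_periods=1).mean()

# Default (min_periods equals the window size): rows without a full window are NaN.
strict = prices.rolling(3).mean()
```

Here `partial` has no missing values, while `strict` is NaN for the first two rows and matches `partial` from the third row onward.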

### **41.3.3 Incremental Feature Computation**

For daily updates, we can store intermediate states (e.g., last 20 days of returns) and update rolling features incrementally. This avoids reprocessing all historical data each day.

```python
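def update_features(existing_features, new_data):
    """
    Update the feature set with the new day's data.
    This is a simplified example; a real implementation would use a feature store.

    Caveat: compute_features() drops each stock's warm-up rows, so feeding its
    output back in day after day slowly shortens the history. Keeping the raw
    price history and passing that in avoids the problem.
    """
    # Combine existing and new data; drop repeated (Symbol, Date) rows from reruns
    combined = pd.concat([existing_features, new_data])
    combined = combined.drop_duplicates(subset=['Symbol', 'Date'], keep='last')
    # For simplicity, recompute all features (compute_features sorts internally)
    return compute_features(combined)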
```
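
One way to make the update truly incremental is to store a short trailing window of raw prices per stock and recompute features only on that window. The sketch below illustrates the idea with a single moving-average feature; `append_and_trim`, `latest_moving_average`, the 3-row window, and the sample values are all assumptions for illustration.

```python
import pandas as pd

WINDOW = 3  # short window for illustration; the chapter's features use 20

def append_and_trim(history, new_data, keep=WINDOW):
    """Append the new day's rows and keep only the last `keep` rows per symbol."""
    combined = (pd.concat([history, new_data], ignore_index=True)
                  .drop_duplicates(subset=["Symbol", "Date"], keep="last")
                  .sort_values(["Symbol", "Date"]))
    return combined.groupby("Symbol").tail(keep).reset_index(drop=True)

def latest_moving_average(history):
    """Moving average of Close over the trimmed window, one row per symbol."""
    latest = history.groupby("Symbol").agg(Date=("Date", "max"), MA=("Close", "mean"))
    return latest.reset_index()

# Stored raw history (made-up values) and the new day's row
history = pd.DataFrame({
    "Symbol": ["NABIL"] * 3,
    "Date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-05"]),
    "Close": [500.0, 510.0, 520.0],
})
new_day = pd.DataFrame({
    "Symbol": ["NABIL"],
    "Date": pd.to_datetime(["2025-01-06"]),
    "Close": [530.0],
})

history = append_and_trim(history, new_day)
features = latest_moving_average(history)
```

The stored state stays at a fixed size per stock, so the daily cost no longer grows with the length of the full history.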

---

## **41.4 Scheduling Systems**

Batch jobs need to run at specific times. We'll explore several scheduling options.

### **41.4.1 Cron (Simple)**

On a Unix server, you can use cron to run a Python script daily.

```bash
# Edit crontab: crontab -e
# Run at 6 PM every day
0 18 * * * cd /path/to/project && python run_batch_prediction.py >> logs/batch.log 2>&1
```

**Pros:** Simple, no extra dependencies.  
**Cons:** No monitoring, no retries, no dependency management.
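
Because cron itself does no error handling, the script it calls should at least accept a run date and return a meaningful exit code. The following is a rough sketch of what `run_batch_prediction.py` might contain; the `--date` flag and the placeholder `run_batch_pipeline` function are assumptions, not an existing interface.

```python
import argparse
import logging
from datetime import date, datetime

logger = logging.getLogger("batch")

def run_batch_pipeline(run_date):
    """Placeholder for the real pipeline; replace with the actual stages."""
    logger.info("Running pipeline for %s", run_date)

def main(argv=None):
    parser = argparse.ArgumentParser(description="Daily NEPSE batch prediction")
    parser.add_argument("--date", default=None,
                        help="Run date as YYYY-MM-DD (default: today)")
    args = parser.parse_args(argv)
    if args.date:
        run_date = datetime.strptime(args.date, "%Y-%m-%d").date()
    else:
        run_date = date.today()
    try:
        run_batch_pipeline(run_date)
    except Exception:
        logger.exception("Batch run failed")
        return 1   # non-zero exit code marks the run as failed
    return 0

exit_code = main(["--date", "2025-01-02"])
```

In the actual script, the last line would be replaced by the usual `if __name__ == "__main__":` guard calling `sys.exit(main())`, so that cron and any wrapper scripts can see the exit code. The explicit `--date` flag also makes it possible to rerun a missed day by hand.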

### **41.4.2 Apache Airflow (Enterprise)**

Airflow is a workflow orchestrator that allows you to define DAGs in Python, with built‑in monitoring, retries, and alerting.

```python
# dags/batch_prediction_dag.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
from batch_prediction import run_batch_pipeline

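default_args = {
    'owner': 'data_science',
    'depends_on_past': False,
    'start_date': datetime(2025, 1, 1),
    # Failure emails also need an 'email' recipient list here and SMTP settings in Airflow
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'nepse_batch_prediction',
    default_args=default_args,
    description='Daily NEPSE prediction batch job',
    # Daily at 18:00 in the scheduler's timezone (UTC by default).
    # Airflow 2.4+ prefers the equivalent `schedule` argument.
    schedule_interval='0 18 * * *',
    catchup=False,
)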

run_pipeline = PythonOperator(
    task_id='run_prediction_pipeline',
    python_callable=run_batch_pipeline,
    dag=dag,
)
```

**Pros:** Robust scheduling, retries, monitoring, UI, dependency management.  
**Cons:** Requires setup (database, web server).

### **41.4.3 Cloud Schedulers**

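- **AWS EventBridge** (formerly CloudWatch Events) can trigger a Lambda function on a schedule.
- **Google Cloud Scheduler** can trigger a Cloud Function or start a job on Compute Engine.
- **Azure Logic Apps** or a timer‑triggered Azure Function offers similar scheduled triggers.

These managed options suit lightweight jobs. Check the platform's execution time limit before moving a long pipeline onto them; AWS Lambda, for example, caps a single invocation at 15 minutes.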
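
As an illustration of the AWS option, the sketch below shows the shape of a Lambda handler invoked by a scheduled rule. It assumes the scheduled event carries its trigger time as a UTC ISO-8601 string in the `time` field, and `run_batch_pipeline` is a placeholder. Schedules on EventBridge rules are expressed in UTC, so 18:00 Nepal time (UTC+5:45) would be written as 12:15 UTC.

```python
from datetime import datetime

def run_batch_pipeline(run_date):
    """Placeholder for the real pipeline entry point."""
    return {"run_date": run_date.isoformat(), "status": "ok"}

def lambda_handler(event, context):
    """Entry point invoked by the scheduled rule."""
    # The trigger time arrives as a string such as "2025-01-01T12:15:00Z".
    triggered_at = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    return run_batch_pipeline(triggered_at.date())

# Local check with a hand-written event
result = lambda_handler({"time": "2025-01-01T12:15:00Z"}, None)
```

Deriving the run date from the event rather than from the wall clock keeps reruns and delayed invocations tied to the day they were scheduled for.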

---

## **41.5 Feature Computation at Scale**

If you have many stocks and long history, pandas may become slow or memory‑intensive. Consider:

- **Dask:** Parallelizes pandas operations across cores or clusters.
- **PySpark:** For very large datasets (billions of rows).
- **Feature Store:** A centralized system that stores pre‑computed features (see Chapter 63).

### **41.5.1 Using Dask for Parallel Feature Engineering**

```python
import dask.dataframe as dd

# Read data with Dask (lazy); each daily file becomes at least one partition
ddf = dd.read_csv('data/raw/nepse_*.csv', parse_dates=['Date'])

# Rolling features need each stock's full history in one partition.
# Shuffling on Symbol moves all rows of a symbol into the same partition.
ddf = ddf.shuffle(on='Symbol')

def compute_features_on_partition(partition):
    # pandas function applied to each partition (a group of whole symbols)
    return compute_features(partition)

# Apply per partition
ddf = ddf.map_partitions(compute_features_on_partition)

# Compute result (triggers execution)
df_result = ddf.compute()
```

**Explanation:**  
Dask splits data into partitions and processes them in parallel. Because the input is one file per day, the default partitions are split by date, and a per‑stock rolling window would see only a single row per stock in each partition. Shuffling on `Symbol` first places every stock's full history in one partition, so the pandas `compute_features` function produces the same result as on the full DataFrame. The shuffle moves data between workers, so it is usually the main cost of this approach.

---

## **41.6 Batch Inference**

After features are ready, we need to generate predictions. If we have a single model that works for all stocks, we simply call `model.predict(X)`. If we have one model per stock, we need to load each model and predict for that stock's data.

### **41.6.1 Single Model for All Stocks**

```python
def predict_all_stocks(features_df, model, feature_cols):
    X = features_df[feature_cols]
    predictions = model.predict(X)
    features_df['Prediction'] = predictions
    return features_df[['Symbol', 'Date', 'Prediction']]
```

### **41.6.2 One Model per Stock**

```python
import logging
import os

import joblib
import pandas as pd

logger = logging.getLogger(__name__)

def predict_per_stock(features_df, model_dir, feature_cols):
    results = []
    # groupby yields each stock's rows once, avoiding a full scan per symbol
    for symbol, symbol_data in features_df.groupby('Symbol'):
        model_path = os.path.join(model_dir, f"{symbol}", "model.joblib")
        if not os.path.exists(model_path):
            logger.warning(f"Model for {symbol} not found, skipping.")
            continue
        model = joblib.load(model_path)
        preds = model.predict(symbol_data[feature_cols])
        out = symbol_data[['Symbol', 'Date']].copy()
        out['Prediction'] = preds
        results.append(out)
    if not results:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame(columns=['Symbol', 'Date', 'Prediction'])
    return pd.concat(results)
```

**Performance consideration:** Loading hundreds of models one by one can be slow. Consider:
- Caching models in memory (e.g., using a dictionary) if the batch job runs repeatedly.
- Using a model serving layer (e.g., MLflow) to load models on demand.
- Parallelizing the prediction loop (see below).
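
The in-memory caching option can be as small as a dictionary keyed by model path. The sketch below takes the loader as a parameter (in the pipeline this would be `joblib.load`); the `ModelCache` class and the fake loader are illustrative assumptions used to show that each path is loaded only once.

```python
class ModelCache:
    """Load each model at most once per process and reuse it afterwards."""

    def __init__(self, loader):
        self._loader = loader      # e.g. joblib.load
        self._models = {}

    def get(self, path):
        if path not in self._models:
            self._models[path] = self._loader(path)
        return self._models[path]

# Demonstration with a fake loader that records how often it is called.
calls = []

def fake_loader(path):
    calls.append(path)
    return f"model@{path}"

cache = ModelCache(fake_loader)
first = cache.get("models/NABIL/model.joblib")
second = cache.get("models/NABIL/model.joblib")
```

A cache like this only helps when the same process serves several batches; a job that starts a fresh process every day still loads each model once per run.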

### **41.6.3 Parallelizing Predictions**

You can use `concurrent.futures` to predict for multiple stocks in parallel.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def predict_stock(symbol, data, model_dir, feature_cols):
    model_path = os.path.join(model_dir, f"{symbol}", "model.joblib")
    if not os.path.exists(model_path):
        return None
    model = joblib.load(model_path)
    X = data[feature_cols]
    preds = model.predict(X)
    data = data.copy()
    data['Prediction'] = preds
    return data[['Symbol', 'Date', 'Prediction']]

def predict_parallel(features_df, model_dir, feature_cols, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for symbol, group in features_df.groupby('Symbol'):
            futures.append(executor.submit(predict_stock, symbol, group, model_dir, feature_cols))
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                results.append(result)
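    # pd.concat raises on an empty list, e.g. when no model files were found
    if not results:
        return pd.DataFrame(columns=['Symbol', 'Date', 'Prediction'])
    return pd.concat(results)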
```

**Explanation:**  
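This uses a thread pool to load models and predict concurrently. Threads help here because reading model files from disk is I/O‑bound, and many NumPy‑backed `predict` implementations release the GIL while computing. If prediction is dominated by pure‑Python code, threads will not run in parallel; use `ProcessPoolExecutor` instead, at the cost of pickling each stock's data to the worker processes. Note that `as_completed` yields results in completion order, so sort the combined frame if row order matters.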

---

## **41.7 Result Storage**

Predictions must be stored for downstream use. Options:

- **CSV files:** Simple, but not queryable.
- **Database (PostgreSQL, MySQL):** Structured, queryable.
- **Data warehouse (Redshift, BigQuery):** For large‑scale analytics.
- **Feature store:** For low‑latency access.

### **41.7.1 Storing in PostgreSQL**

```python
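import os

from sqlalchemy import create_engine  # postgresql:// URLs use the psycopg2 driver

def store_predictions(df, db_url=None):
    # Read the connection string from the environment instead of hard-coding
    # credentials; NEPSE_DB_URL is a placeholder variable name.
    engine = create_engine(db_url or os.environ['NEPSE_DB_URL'])
    # 'append' adds rows on every run, so rerunning a date creates duplicates
    df.to_sql('predictions', engine, if_exists='append', index=False)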
```
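
Appending rows means that rerunning the job for the same date stores each prediction twice. A common remedy is an upsert keyed on symbol and date. The sketch below demonstrates the pattern with the standard-library `sqlite3` module so it runs without a database server; PostgreSQL accepts the same `ON CONFLICT ... DO UPDATE` clause, although the parameter placeholder differs (`%s` with psycopg2). The table layout and values are assumptions.

```python
import sqlite3

def upsert_predictions(conn, rows):
    """Insert predictions, overwriting any existing row for the same (symbol, date)."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS predictions (
            symbol TEXT NOT NULL,
            date TEXT NOT NULL,
            prediction REAL NOT NULL,
            PRIMARY KEY (symbol, date)
        )""")
    conn.executemany("""
        INSERT INTO predictions (symbol, date, prediction) VALUES (?, ?, ?)
        ON CONFLICT (symbol, date) DO UPDATE SET prediction = excluded.prediction
        """, rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
upsert_predictions(conn, [("NABIL", "2025-01-02", 0.4), ("NICA", "2025-01-02", -0.1)])
upsert_predictions(conn, [("NABIL", "2025-01-02", 0.6)])   # rerun with a revised value
stored = conn.execute(
    "SELECT symbol, prediction FROM predictions ORDER BY symbol"
).fetchall()
```

After the second call the table still holds one row per symbol and date, with the revised value for the rerun symbol.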

### **41.7.2 Storing in Parquet Files (Data Lake)**

```python
def store_predictions_parquet(df, date):
    filename = f"data/predictions/{date.strftime('%Y%m%d')}_predictions.parquet"
    df.to_parquet(filename, index=False)
```

This is suitable for later analysis with Spark or Dask.

### **41.7.3 Storing in a Feature Store**

For real‑time access, you might push predictions to a Redis cache or a feature store (see Chapter 63).

---

## **41.8 Notification Systems**

After the batch job completes (or fails), notify stakeholders.

### **41.8.1 Email Notifications**

```python
import smtplib
from email.mime.text import MIMEText

def send_email(subject, body, to_emails):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'batch@nepse-predictor.com'
    msg['To'] = ', '.join(to_emails)
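    # Placeholder host and credentials: read real values from configuration or
    # environment variables rather than hard-coding them.
    with smtplib.SMTP('smtp.gmail.com', 587) as server:
        server.starttls()
        server.login('user', 'password')
        server.send_message(msg)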
```

### **41.8.2 Slack Notifications**

```python
import requests

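def send_slack_message(message, webhook_url):
    payload = {'text': message}
    # Time out rather than hang the batch job, and surface HTTP errors
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()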
```
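
A notification is most useful when it summarizes the run instead of only saying that it finished. The helper below is a small sketch that builds such a message; the function name, the status labels, and the message layout are assumptions.

```python
def build_summary_message(run_date, n_predicted, failed_symbols, duration_seconds):
    """Format a short job summary suitable for Slack or email."""
    if n_predicted == 0:
        status = "FAILED"
    elif failed_symbols:
        status = "PARTIAL"
    else:
        status = "OK"
    lines = [
        f"NEPSE batch prediction {run_date}: {status}",
        f"Predicted: {n_predicted} stocks in {duration_seconds:.0f}s",
    ]
    if failed_symbols:
        lines.append("Failed: " + ", ".join(sorted(failed_symbols)))
    return "\n".join(lines)

message = build_summary_message("2025-01-02", 2, ["NICA"], 12.4)
```

The resulting string can be passed unchanged to either the email or the Slack function.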

### **41.8.3 Integration with Airflow**

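Airflow sends failure emails automatically once SMTP is configured and the task's `email` and `email_on_failure` arguments are set. Slack notifications are available through the `apache-airflow-providers-slack` package, or by calling a function like `send_slack_message` from an `on_failure_callback`.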

---

## **41.9 Monitoring and Alerting**

Monitor the health of batch jobs:

- **Job duration:** Alert if it takes too long.
- **Data volume:** Alert if number of stocks or rows is abnormal.
- **Model performance:** Compare predictions to actuals when they arrive (next day) and alert if error exceeds threshold.
- **Infrastructure metrics:** CPU, memory, disk usage.

Use logging and metrics (e.g., Prometheus, CloudWatch) to track these.
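
Two of these checks, data volume and model performance, reduce to a few lines of arithmetic. The sketch below uses plain dictionaries mapping symbol to return in percent; the function names, the 20% tolerance, the alert threshold, and the sample numbers are assumptions.

```python
def check_row_count(n_rows, expected, tolerance=0.2):
    """True when the row count is within +/- tolerance of the expected count."""
    return abs(n_rows - expected) <= tolerance * expected

def mean_absolute_error(predicted, actual):
    """MAE over the symbols present in both dicts (symbol -> return in percent)."""
    common = predicted.keys() & actual.keys()
    if not common:
        raise ValueError("no overlapping symbols")
    return sum(abs(predicted[s] - actual[s]) for s in common) / len(common)

# Yesterday's predictions compared with today's realized returns (made-up values)
predicted = {"NABIL": 0.5, "NICA": -0.2}
actual = {"NABIL": 1.0, "NICA": 0.3}

mae = mean_absolute_error(predicted, actual)
needs_alert = mae > 0.4   # hypothetical threshold
```

In practice the threshold would come from the error distribution observed during backtesting rather than a fixed constant.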

### **41.9.1 Logging**

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def run_pipeline():
    logger.info("Starting batch prediction pipeline")
    try:
        # ... steps
        logger.info("Pipeline completed successfully")
    except Exception as e:
        logger.error(f"Pipeline failed: {e}", exc_info=True)
        raise
```
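
Job duration is easier to monitor when each stage is timed separately. One possible approach is a context manager that logs and records the duration of whatever runs inside it; `timed_stage` and the `stage_durations` dictionary are illustrative names.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("batch")
stage_durations = {}

@contextmanager
def timed_stage(name):
    """Log the start, any failure, and the duration of a pipeline stage."""
    start = time.perf_counter()
    logger.info("Stage %s started", name)
    try:
        yield
    except Exception:
        logger.exception("Stage %s failed", name)
        raise
    finally:
        # Recorded for both successful and failed stages
        stage_durations[name] = time.perf_counter() - start
        logger.info("Stage %s took %.3fs", name, stage_durations[name])

with timed_stage("features"):
    total = sum(range(1000))   # stand-in for real work
```

The recorded durations can then be pushed as metrics or included in the completion notification.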

### **41.9.2 Metrics**

You can log custom metrics to a time‑series database.

```python
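import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# A dedicated registry holds only this job's metrics
registry = CollectorRegistry()
job_duration = Gauge('batch_job_duration_seconds', 'Duration of batch job', registry=registry)
# A batch process starts fresh each run, so a Counter would always push 1;
# a last-success timestamp is more useful for alerting on missed runs.
last_success = Gauge('batch_job_last_success_unixtime', 'Time of last successful run', registry=registry)

start = time.time()
# ... run job
duration = time.time() - start
job_duration.set(duration)
last_success.set_to_current_time()
# Assumes a Pushgateway is listening on localhost:9091
push_to_gateway('localhost:9091', job='batch_prediction', registry=registry)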
```

---

## **41.10 Error Handling**

Batch jobs should handle errors gracefully.

### **41.10.1 Retries**

If a step fails (e.g., data download), retry a few times with exponential backoff.

```python
import time
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def download_data(date):
    # may raise exception
    pass
```
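
To make the backoff behaviour concrete, here is a hand-rolled equivalent written with the standard library only. It is a sketch of the general technique, not a reproduction of tenacity's exact timing; `retry_with_backoff` and the injectable `sleep` parameter are assumptions that make the delays observable.

```python
import time

def retry_with_backoff(func, attempts=3, base_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n (capped) and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the last error
            sleep(min(base_delay * 2 ** attempt, max_delay))

# Demonstration: fail twice, then succeed; record delays instead of sleeping.
state = {"calls": 0}

def flaky_download():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "data"

delays = []
result = retry_with_backoff(flaky_download, sleep=delays.append)
```

The waits double after each failure (1 second, then 2 seconds here), which gives a struggling data source time to recover instead of being hit again immediately.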

### **41.10.2 Fallback Models**

If the main model for a stock is missing or fails, fall back to a simpler model (e.g., historical mean).

```python
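import joblib
import numpy as np

def predict_with_fallback(symbol, data, feature_cols):
    try:
        model = joblib.load(f"models/{symbol}/model.joblib")
        return model.predict(data[feature_cols])
    except FileNotFoundError:
        # fallback: mean of the last 20 returns, repeated so the result has
        # one value per row, like the model's output
        fallback = data['Return'].tail(20).mean()
        return np.full(len(data), fallback)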
```

### **41.10.3 Partial Success**

If some stocks fail, still record successes and log failures. Continue the pipeline.

```python
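def predict_with_partial_success(features_df, model_dir, feature_cols, send_alert=None):
    """Predict per stock, recording failures instead of aborting the batch."""
    results = []
    failed_stocks = []
    for symbol, group in features_df.groupby('Symbol'):
        try:
            # predict_stock is defined in Section 41.6.3; it returns None
            # when no model file exists for the symbol
            pred = predict_stock(symbol, group, model_dir, feature_cols)
            if pred is None:
                failed_stocks.append(symbol)
            else:
                results.append(pred)
        except Exception as e:
            failed_stocks.append(symbol)
            logger.error(f"Failed for {symbol}: {e}")

    # send_alert is any callable that delivers a message (Slack, email, ...)
    if failed_stocks and send_alert is not None:
        send_alert(f"Failed for stocks: {failed_stocks}")
    return results, failed_stocks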
```

---

## **41.11 Performance Optimization**

- **Use vectorized operations** (pandas) instead of loops.
- **Parallelize** where possible (per stock or per partition).
- **Cache intermediate results** (e.g., pre‑computed features) to avoid recomputation.
- **Use efficient file formats** like Parquet (columnar, compressed) instead of CSV.
- **Consider incremental processing** for daily updates to avoid full history recomputation.
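
As a small illustration of the first point, the two snippets below compute the same per-stock daily returns, once with an explicit Python loop over symbols and once with a single grouped call. The data is made up; on a frame with hundreds of symbols the grouped form avoids the per-symbol Python overhead and the final concatenation.

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A", "A", "A", "B", "B"],
    "Close": [100.0, 110.0, 99.0, 50.0, 55.0],
})

# Loop version: one Python-level iteration per symbol, then stitched back together.
parts = []
for _, group in df.groupby("Symbol"):
    parts.append(group["Close"].pct_change() * 100)
looped = pd.concat(parts).sort_index()

# Vectorized version: a single grouped call handles every symbol.
vectorized = df.groupby("Symbol")["Close"].pct_change() * 100
```

Both results carry a NaN on each symbol's first row, since there is no previous close to compare against.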

### **41.11.1 Incremental Processing Example**

```python
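def incremental_update(last_date, new_date):
    """
    Outline of a daily incremental run. load_data_since, load_last_features,
    and predict are placeholders for project-specific helpers.
    """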
    # Load only new data since last_date
    new_data = load_data_since(last_date)
    # Load last known features (e.g., from a feature store)
    last_features = load_last_features()
    # Update features incrementally
    updated_features = update_features(last_features, new_data)
    # Predict only on new_data's target dates
    predictions = predict(updated_features[updated_features['Date'] == new_date])
    return predictions
```

---

## **41.12 Integration with Downstream Systems**

Predictions are only useful if they reach traders or automated systems.

- **Database:** Traders query via dashboards.
- **Message queue:** Push predictions to Kafka for real‑time consumption.
- **File share:** Deliver CSV to a shared folder.

Choose based on your architecture.

---

## **41.13 Chapter Summary**

In this chapter, we designed a robust batch prediction system for the NEPSE dataset.

- **Architecture:** Data ingestion → feature engineering → inference → storage → notification.
- **Scheduling:** Cron for simple jobs, Airflow for enterprise workflows.
- **Feature computation:** Handled per stock using pandas; scaled with Dask if needed.
- **Inference:** Single or per‑stock models; parallelized for efficiency.
- **Result storage:** Database, Parquet, or feature store.
- **Notifications:** Email, Slack on job completion/failure.
- **Monitoring:** Logging, metrics, alerts.
- **Error handling:** Retries, fallbacks, partial success.
- **Performance:** Incremental updates, parallelization, efficient formats.

### **Practical Takeaways for the NEPSE System:**

- Use Airflow to schedule daily predictions at market close.
- Compute features incrementally to save time.
- Store predictions in a PostgreSQL database for easy querying.
- Set up Slack alerts for job failures.
- Monitor prediction accuracy by comparing with actual returns the next day.

In the next chapter, **Chapter 42: Real‑Time Prediction Systems**, we will explore how to build low‑latency streaming prediction services using technologies like Kafka and Flink.

---

**End of Chapter 41**

