# **Chapter 9: Data Pipelines and Automation**

---

## **Learning Objectives**

By the end of this chapter, you will be able to:
- Design and implement batch processing pipelines for time-series data
- Build streaming pipelines for real-time data ingestion
- Orchestrate complex workflows using Apache Airflow
- Implement data quality gates and validation checks
- Monitor pipeline health and performance
- Handle errors gracefully with retry logic and recovery mechanisms
- Optimize pipelines for cost and scalability

---

## **Prerequisites**

- Completed Chapter 8: Data Storage and Management
- Understanding of Python programming and decorators
- Basic knowledge of Docker (helpful but not required)
- Familiarity with cron jobs or task scheduling concepts

---

## **9.1 Pipeline Architecture Patterns**

Data pipelines move data from source to destination while transforming it along the way. Understanding architectural patterns helps you build robust, maintainable systems.

```python
"""
Pipeline Architecture Patterns for Time-Series Data

This module demonstrates common architectural patterns used in building
data pipelines for the NEPSE stock prediction system.

Patterns covered:
1. ETL (Extract, Transform, Load)
2. ELT (Extract, Load, Transform)
3. Lambda Architecture (Batch + Speed layers)
4. Kappa Architecture (Streaming only; listed for comparison, not implemented below)
5. Medallion Architecture (Bronze, Silver, Gold)
"""

from abc import ABC, abstractmethod
from typing import Dict, List, Any, Optional, Callable
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
import pandas as pd
import logging

# Configure logging for pipelines
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class PipelineStage(Enum):
    """Enumeration of pipeline stages."""
    EXTRACT = "extract"
    TRANSFORM = "transform"
    LOAD = "load"
    VALIDATE = "validate"


@dataclass
class PipelineContext:
    """
    Context object passed between pipeline stages.
    
    This dataclass maintains state throughout the pipeline execution,
    allowing stages to communicate and share metadata.
    """
    execution_id: str
    start_time: datetime
    data: Optional[Any] = None
    metadata: Optional[Dict[str, Any]] = None
    metrics: Optional[Dict[str, float]] = None
    
    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}
        if self.metrics is None:
            self.metrics = {}


class Pipeline(ABC):
    """
    Abstract base class for data pipelines.
    
    Defines the interface that all pipeline implementations must follow.
    Uses the Template Method pattern to define the execution flow while
    allowing subclasses to customize specific steps.
    """
    
    def __init__(self, name: str):
        self.name = name
        self.logger = logging.getLogger(f"{__name__}.{name}")
    
    @abstractmethod
    def extract(self, context: PipelineContext) -> PipelineContext:
        """Extract data from source."""
        pass
    
    @abstractmethod
    def transform(self, context: PipelineContext) -> PipelineContext:
        """Transform data."""
        pass
    
    @abstractmethod
    def load(self, context: PipelineContext) -> PipelineContext:
        """Load data to destination."""
        pass
    
    def validate(self, context: PipelineContext) -> PipelineContext:
        """
        Optional validation step.
        
        Default implementation checks for empty data.
        Subclasses can override for specific validation logic.
        """
        if context.data is None or (isinstance(context.data, pd.DataFrame) and context.data.empty):
            raise ValueError("Pipeline validation failed: No data to process")
        
        self.logger.info(f"Validation passed: {len(context.data)} records")
        return context
    
    def run(self, **kwargs) -> PipelineContext:
        """
        Execute the pipeline.
        
        This template method defines the execution order and handles
        error propagation and metrics collection.
        """
        execution_id = f"{self.name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        context = PipelineContext(
            execution_id=execution_id,
            start_time=datetime.now()
        )
        
        self.logger.info(f"Starting pipeline {self.name} [ID: {execution_id}]")
        
        try:
            # Execute stages in order
            context = self.extract(context)
            context = self.transform(context)
            context = self.validate(context)
            context = self.load(context)
            
            # Calculate duration
            duration = (datetime.now() - context.start_time).total_seconds()
            context.metrics['total_duration_seconds'] = duration
            
            self.logger.info(f"Pipeline completed successfully in {duration:.2f}s")
            return context
            
        except Exception as e:
            self.logger.error(f"Pipeline failed: {str(e)}")
            raise


class ETLPipeline(Pipeline):
    """
    Traditional ETL (Extract, Transform, Load) Pipeline.
    
    Best for: Complex transformations that should happen before loading,
    data cleansing, aggregations, feature engineering.
    
    Use case: NEPSE daily data processing - extract from API,
    calculate technical indicators (transform), load to database.
    """
    
    def __init__(self, 
                 extractor: Callable,
                 transformer: Callable,
                 loader: Callable):
        super().__init__("ETL_Pipeline")
        self.extractor = extractor
        self.transformer = transformer
        self.loader = loader
    
    def extract(self, context: PipelineContext) -> PipelineContext:
        """Extract raw data."""
        self.logger.info("Extracting data...")
        context.data = self.extractor()
        context.metadata['extracted_at'] = datetime.now()
        return context
    
    def transform(self, context: PipelineContext) -> PipelineContext:
        """Transform data before loading."""
        self.logger.info("Transforming data...")
        context.data = self.transformer(context.data)
        context.metadata['transformed_at'] = datetime.now()
        return context
    
    def load(self, context: PipelineContext) -> PipelineContext:
        """Load transformed data."""
        self.logger.info("Loading data...")
        self.loader(context.data)
        context.metadata['loaded_at'] = datetime.now()
        return context


class ELTPipeline(Pipeline):
    """
    ELT (Extract, Load, Transform) Pipeline.
    
    Best for: When destination is a powerful data warehouse (BigQuery, Snowflake),
    raw data preservation is important, transformations are SQL-based.
    
    Use case: Load raw NEPSE data to data warehouse, use SQL views for transformations.
    """
    
    def __init__(self,
                 extractor: Callable,
                 loader: Callable,
                 transform_query: str):
        super().__init__("ELT_Pipeline")
        self.extractor = extractor
        self.loader = loader
        self.transform_query = transform_query
    
    def extract(self, context: PipelineContext) -> PipelineContext:
        self.logger.info("Extracting data...")
        context.data = self.extractor()
        return context
    
    def transform(self, context: PipelineContext) -> PipelineContext:
        """
        In ELT, transformation happens after load.
        
        We might do minimal cleaning here, but heavy lifting is in DB.
        """
        self.logger.info("Minimal pre-load transformation...")
        # Just basic type conversion or column renaming
        if isinstance(context.data, pd.DataFrame):
            context.data.columns = [c.lower().replace(' ', '_') for c in context.data.columns]
        return context
    
    def load(self, context: PipelineContext) -> PipelineContext:
        self.logger.info("Loading raw data...")
        self.loader(context.data, raw=True)
        
        # Execute transformation in database
        self.logger.info("Executing DB transformations...")
        # Here you would execute self.transform_query against the DB
        return context


class LambdaArchitecture:
    """
    Lambda Architecture combines batch and real-time (speed) layers.
    
    Structure:
    - Batch Layer: Process all historical data (high latency, high accuracy)
    - Speed Layer: Process recent data in real-time (low latency, approximate)
    - Serving Layer: Merges batch and speed views
    
    Use case: NEPSE prediction system with daily batch training (batch layer)
    and real-time price alerts (speed layer).
    """
    
    def __init__(self):
        self.batch_pipeline = None
        self.speed_pipeline = None
        self.logger = logging.getLogger("LambdaArchitecture")
    
    def set_batch_pipeline(self, pipeline: Pipeline):
        """Set the batch processing pipeline."""
        self.batch_pipeline = pipeline
    
    def set_speed_pipeline(self, pipeline: Pipeline):
        """Set the speed (real-time) pipeline."""
        self.speed_pipeline = pipeline
    
    def run_batch(self, **kwargs) -> PipelineContext:
        """Execute batch layer (process all historical data)."""
        self.logger.info("Running batch layer...")
        if not self.batch_pipeline:
            raise ValueError("Batch pipeline not configured")
        return self.batch_pipeline.run(**kwargs)
    
    def run_speed(self, data: Any) -> PipelineContext:
        """Execute speed layer (process recent data)."""
        self.logger.info("Running speed layer...")
        if not self.speed_pipeline:
            raise ValueError("Speed pipeline not configured")
        
        # Speed layer typically processes a small batch of recent data
        context = PipelineContext(
            execution_id=f"speed_{datetime.now().strftime('%H%M%S')}",
            start_time=datetime.now(),
            data=data
        )
        
        # Skip extract, go straight to transform and load
        context = self.speed_pipeline.transform(context)
        context = self.speed_pipeline.load(context)
        return context


class MedallionArchitecture:
    """
    Medallion Architecture organizes data in three layers:
    
    Bronze: Raw data as-ingested (immutable, append-only)
    Silver: Cleaned, conformed data (deduplicated, schema enforced)
    Gold: Aggregated, business-level data (features, aggregations)
    
    Use case: NEPSE data lakehouse:
    - Bronze: Raw CSV files from NEPSE API
    - Silver: Cleaned data with proper types, no duplicates
    - Gold: Technical indicators, features for ML models
    """
    
    def __init__(self, base_path: str = "./medallion"):
        self.base_path = base_path
        self.layers = ['bronze', 'silver', 'gold']
        self.logger = logging.getLogger("MedallionArchitecture")
        
        # Ensure directories exist
        import os
        for layer in self.layers:
            os.makedirs(f"{base_path}/{layer}", exist_ok=True)
    
    def write_bronze(self, data: pd.DataFrame, filename: str):
        """
        Write raw data to Bronze layer.
        
        Bronze characteristics:
        - Immutable (never update, only append)
        - Schema-on-read (flexible)
        - Raw format (keep original files)
        """
        path = f"{self.base_path}/bronze/{filename}"
        data.to_parquet(path, index=False)
        self.logger.info(f"Wrote {len(data)} records to Bronze: {filename}")
    
    def write_silver(self, data: pd.DataFrame, filename: str):
        """
        Write cleaned data to Silver layer.
        
        Silver characteristics:
        - Deduplicated
        - Schema enforced
        - Basic cleaning applied
        """
        path = f"{self.base_path}/silver/{filename}"
        
        # Deduplication
        if 'symbol' in data.columns and 'date' in data.columns:
            data = data.drop_duplicates(subset=['symbol', 'date'], keep='last')
        
        # Schema enforcement (assign returns a new frame, so the caller's
        # DataFrame is never mutated)
        if 'volume' in data.columns:
            data = data.assign(volume=pd.to_numeric(data['volume'], errors='coerce'))
        
        data.to_parquet(path, index=False)
        self.logger.info(f"Wrote {len(data)} records to Silver: {filename}")
    
    def write_gold(self, data: pd.DataFrame, filename: str):
        """
        Write business-level data to Gold layer.
        
        Gold characteristics:
        - Aggregated
        - Feature-engineered
        - Optimized for querying
        """
        path = f"{self.base_path}/gold/{filename}"
        data.to_parquet(path, index=False)
        self.logger.info(f"Wrote {len(data)} records to Gold: {filename}")
    
    def read_layer(self, layer: str, filename: str) -> pd.DataFrame:
        """Read data from specified layer."""
        path = f"{self.base_path}/{layer}/{filename}"
        return pd.read_parquet(path)


def demonstrate_architecture_patterns():
    """
    Demonstrate different pipeline architecture patterns.
    """
    print("=" * 70)
    print("Pipeline Architecture Patterns")
    print("=" * 70)
    
    # Sample data for demonstration
    sample_data = pd.DataFrame({
        'symbol': ['NABIL', 'NICA', 'SCBL'],
        'date': ['2024-01-15', '2024-01-15', '2024-01-15'],
        'close': [865.0, 790.0, 530.0],
        'volume': [125000, 98000, 76000]
    })
    
    print("\n1. ETL Pattern (Extract -> Transform -> Load)")
    print("-" * 50)
    
    # Define ETL functions
    def extract():
        print("  Extracting from NEPSE API...")
        return sample_data.copy()
    
    def transform(df):
        print("  Calculating technical indicators...")
        df['sma_5'] = df['close'].rolling(window=5, min_periods=1).mean()
        return df
    
    def load(df):
        print(f"  Loading {len(df)} records to database...")
    
    etl = ETLPipeline(extract, transform, load)
    etl.run()
    
    print("\n2. Medallion Architecture (Bronze -> Silver -> Gold)")
    print("-" * 50)
    medallion = MedallionArchitecture("./nepse_medallion")
    
    # Bronze: Raw data
    medallion.write_bronze(sample_data, "raw_20240115.parquet")
    
    # Silver: Cleaned
    cleaned_data = sample_data.copy()
    cleaned_data['date'] = pd.to_datetime(cleaned_data['date'])
    medallion.write_silver(cleaned_data, "cleaned_20240115.parquet")
    
    # Gold: Features
    features = cleaned_data.copy()
    features['price_momentum'] = features['close'].pct_change()
    medallion.write_gold(features, "features_20240115.parquet")
    
    print("\n3. Lambda Architecture (Batch + Speed layers)")
    print("-" * 50)
    lambda_arch = LambdaArchitecture()
    lambda_arch.set_batch_pipeline(etl)
    print("  Batch layer configured for daily processing")
    print("  Speed layer would handle real-time ticks")
    
    return etl, medallion, lambda_arch


if __name__ == "__main__":
    demonstrate_architecture_patterns()
```
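
`ELTPipeline.load` above leaves the in-database transformation as a comment. The following is a minimal sketch of that step, assuming SQLite stands in for the warehouse. The table name `raw_stock_prices`, the view name `stock_prices_clean`, the `TRANSFORM_QUERY` text, and the helper functions are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical raw rows as an extractor might deliver them: everything is
# text, and one row arrives twice.
RAW_ROWS = [
    ("NABIL", "2024-01-15", "865.0", "125000"),
    ("NICA", "2024-01-15", "790.0", "98000"),
    ("NABIL", "2024-01-15", "865.0", "125000"),  # duplicate delivery
]

# Hypothetical transform query: a typed, deduplicated view over the raw table.
TRANSFORM_QUERY = """
CREATE VIEW IF NOT EXISTS stock_prices_clean AS
SELECT DISTINCT
    symbol,
    trade_date,
    CAST(close AS REAL) AS close,
    CAST(volume AS INTEGER) AS volume
FROM raw_stock_prices
"""


def load_raw(conn, rows):
    """Load step: store the rows exactly as received."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_stock_prices "
        "(symbol TEXT, trade_date TEXT, close TEXT, volume TEXT)"
    )
    conn.executemany("INSERT INTO raw_stock_prices VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_transform(conn, query):
    """Transform step: executed inside the database, after the load."""
    conn.execute(query)
    conn.commit()


def run_elt(rows):
    """Load raw rows, transform in the database, return the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    try:
        load_raw(conn, rows)
        run_transform(conn, TRANSFORM_QUERY)
        return conn.execute(
            "SELECT symbol, trade_date, close, volume "
            "FROM stock_prices_clean ORDER BY symbol"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_elt(RAW_ROWS):
        print(row)
```

Because the raw table is never modified, the view can be redefined later without re-extracting anything, which is the main practical benefit of transforming after the load.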

**Detailed Explanation:**

1. **Pipeline Context**: The `PipelineContext` dataclass acts as a carrier for data and metadata between stages. It maintains execution ID, timing metrics, and allows stages to communicate state. This pattern avoids global variables and makes pipelines testable.

2. **ETL vs ELT**: 
   - **ETL** transforms data before loading, suitable when the destination has limited processing power or when you need to protect the database from dirty data.
   - **ELT** loads raw data first then transforms in the database, leveraging the DB's processing power and preserving raw data for debugging.

3. **Lambda Architecture**: Combines batch (accuracy) and speed (latency) layers. For NEPSE, the batch layer retrains models daily on full history, while the speed layer provides real-time alerts on price movements without waiting for the daily batch.

4. **Medallion Architecture**: Organizes data quality into three zones:
   - **Bronze**: Immutable raw data (source of truth)
   - **Silver**: Cleaned, deduplicated, schema-validated
   - **Gold**: Business-ready features and aggregations
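
The `LambdaArchitecture` class describes a serving layer but does not implement one. A minimal sketch of the merge it would perform is shown here, assuming each layer exposes a "view" that maps a symbol to a timestamped price. The `View` alias and the `merge_views` function are hypothetical names, not part of the module above.

```python
from datetime import datetime
from typing import Dict, Tuple

# A view maps symbol -> (as-of timestamp, price). Hypothetical structure.
View = Dict[str, Tuple[datetime, float]]


def merge_views(batch_view: View, speed_view: View) -> View:
    """Return the freshest value per symbol.

    Start from the batch view (complete but stale) and let the speed view
    override a symbol only when its timestamp is newer.
    """
    merged = dict(batch_view)  # shallow copy; the inputs are not mutated
    for symbol, (ts, price) in speed_view.items():
        if symbol not in merged or ts > merged[symbol][0]:
            merged[symbol] = (ts, price)
    return merged


if __name__ == "__main__":
    batch_view = {
        "NABIL": (datetime(2024, 1, 15, 15, 0), 865.0),
        "NICA": (datetime(2024, 1, 15, 15, 0), 790.0),
    }
    speed_view = {
        "NABIL": (datetime(2024, 1, 16, 11, 30), 871.5),  # newer tick wins
        "NICA": (datetime(2024, 1, 14, 11, 30), 780.0),   # older than batch, ignored
        "SCBL": (datetime(2024, 1, 16, 11, 31), 532.0),   # only the speed layer has it
    }
    for symbol, (ts, price) in sorted(merge_views(batch_view, speed_view).items()):
        print(symbol, ts.isoformat(), price)
```

Comparing timestamps rather than always preferring the speed layer means a late or replayed tick cannot overwrite a more recent batch result.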

---

## **9.2 Batch Processing Pipelines**

Batch processing handles data in discrete chunks, typically scheduled to run at intervals (hourly, daily). This is the most common pattern for financial data like NEPSE.

```python
"""
Batch Processing Pipelines for NEPSE Data

Batch processing is suitable for:
- Daily stock data ingestion
- End-of-day model training
- Historical backtesting
- Nightly report generation

Components:
1. Scheduler (cron, Airflow)
2. Data extraction (API, files)
3. Transformation (cleaning, feature engineering)
4. Loading (database, file storage)
5. Notification (success/failure alerts)
"""

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Callable
import logging
import json
from pathlib import Path
import sqlite3

logger = logging.getLogger(__name__)


class BatchPipeline:
    """
    Production-ready batch pipeline for NEPSE daily data processing.
    
    Features:
    - Idempotency (running twice doesn't duplicate data)
    - Checkpointing (run status is recorded so failed runs can be retried)
    - Data validation
    - Error handling (failures are logged and recorded)
    - Audit logging
    """
    
    def __init__(self, 
                 pipeline_id: str,
                 db_connection: str = "./nepse_pipeline.db"):
        self.pipeline_id = pipeline_id
        self.db_connection = db_connection
        self._init_checkpoint_db()
        
        # Processing statistics
        self.stats = {
            'records_extracted': 0,
            'records_transformed': 0,
            'records_loaded': 0,
            'errors': []
        }
    
    def _init_checkpoint_db(self):
        """Initialize SQLite database for checkpointing."""
        conn = sqlite3.connect(self.db_connection)
        cursor = conn.cursor()
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS pipeline_runs (
                run_id TEXT PRIMARY KEY,
                pipeline_id TEXT,
                status TEXT,
                start_time TIMESTAMP,
                end_time TIMESTAMP,
                records_processed INTEGER,
                checkpoint_data TEXT,
                error_message TEXT
            )
        ''')
        conn.commit()
        conn.close()
    
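    def _save_checkpoint(self, run_id: str, status: str, checkpoint_data: Optional[Dict] = None):
        """Save pipeline state for fault tolerance."""
        # Store timestamps as ISO strings instead of relying on sqlite3's
        # default datetime adapter.
        now = datetime.now().isoformat()
        error_message = (checkpoint_data or {}).get('error')
        
        conn = sqlite3.connect(self.db_connection)
        cursor = conn.cursor()
        
        # Keep the start_time written by the 'running' checkpoint; INSERT OR
        # REPLACE would otherwise overwrite it on completion or failure.
        start_time = now
        if status != 'running':
            cursor.execute('SELECT start_time FROM pipeline_runs WHERE run_id = ?', (run_id,))
            row = cursor.fetchone()
            if row and row[0]:
                start_time = row[0]
        
        cursor.execute('''
            INSERT OR REPLACE INTO pipeline_runs 
            (run_id, pipeline_id, status, start_time, end_time,
             records_processed, checkpoint_data, error_message)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            run_id,
            self.pipeline_id,
            status,
            start_time,
            now if status in ['completed', 'failed'] else None,
            self.stats['records_loaded'],
            json.dumps(checkpoint_data) if checkpoint_data else None,
            error_message
        ))
        conn.commit()
        conn.close()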
    
    def extract_daily_data(self, trade_date: datetime) -> pd.DataFrame:
        """
        Extract NEPSE data for a specific trading date.
        
        In production, this would call the NEPSE API or scrape the website.
        For demonstration, we generate synthetic data.
        """
        logger.info(f"Extracting data for {trade_date.date()}")
        
        symbols = ['NABIL', 'NICA', 'SCBL', 'ADBL', 'EBL', 'GBIME', 'HBL']
        data = []
        
        for symbol in symbols:
            # Simulate API call
            base_price = np.random.uniform(200, 1000)
            data.append({
                'symbol': symbol,
                'trade_date': trade_date.strftime('%Y-%m-%d'),
                'open': round(base_price * np.random.uniform(0.98, 1.02), 2),
                'high': round(base_price * np.random.uniform(1.01, 1.05), 2),
                'low': round(base_price * np.random.uniform(0.95, 0.99), 2),
                'close': round(base_price * np.random.uniform(0.98, 1.02), 2),
                'volume': int(np.random.uniform(10000, 500000)),
                'turnover': round(np.random.uniform(1000000, 50000000), 2)
            })
        
        df = pd.DataFrame(data)
        self.stats['records_extracted'] = len(df)
        logger.info(f"Extracted {len(df)} records")
        return df
    
    def validate_raw_data(self, df: pd.DataFrame) -> bool:
        """
        Validate raw data before processing.
        
        Checks:
        - No missing symbols
        - Price ranges are reasonable (0 < price < 100000)
        - Volume is non-negative
        - No duplicate symbols for same date
        """
        validation_rules = [
            (df['symbol'].notna().all(), "Missing symbols detected"),
            ((df['close'] > 0).all(), "Non-positive prices detected"),
            ((df['close'] < 100000).all(), "Suspiciously high prices detected"),
            ((df['volume'] >= 0).all(), "Negative volume detected"),
            # Each (trade_date, symbol) pair must appear at most once. This yields
            # a plain bool; a pandas Series here would raise in the `if` below.
            (not df.duplicated(subset=['trade_date', 'symbol']).any(), "Duplicate symbols in date")
        ]
        
        for condition, message in validation_rules:
            if not condition:
                logger.error(f"Validation failed: {message}")
                return False
        
        logger.info("Data validation passed")
        return True
    
    def transform_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Transform raw data into analysis-ready format.
        
        Transformations:
        1. Data type conversion
        2. Calculate daily returns
        3. Calculate technical indicators (SMA, RSI)
        4. Add metadata columns
        """
        logger.info("Transforming data...")
        df = df.copy()
        
        # Ensure correct types
        df['trade_date'] = pd.to_datetime(df['trade_date'])
        df['close'] = pd.to_numeric(df['close'], errors='coerce')
        df['volume'] = pd.to_numeric(df['volume'], errors='coerce')
        
        # Sort by symbol and date for calculations
        df = df.sort_values(['symbol', 'trade_date'])
        
        # Calculate daily returns (would need historical data in practice)
        df['daily_return'] = df.groupby('symbol')['close'].pct_change()
        
        # Simple Moving Average (5-day)
        df['sma_5'] = df.groupby('symbol')['close'].transform(
            lambda x: x.rolling(window=5, min_periods=1).mean()
        )
        
        # Price range percentage
        df['range_pct'] = ((df['high'] - df['low']) / df['low']) * 100
        
        # Add processing metadata
        df['processed_at'] = datetime.now()
        df['pipeline_version'] = '1.0.0'
        
        self.stats['records_transformed'] = len(df)
        logger.info(f"Transformation complete: {len(df)} records")
        return df
    
    def load_to_warehouse(self, df: pd.DataFrame, 
                         connection_string: str = "sqlite:///nepse_warehouse.db"):
        """
        Load transformed data to data warehouse.
        
        Implements upsert logic (delete existing rows, then insert) to ensure idempotency.
        """
        logger.info("Loading to warehouse...")
        
        # In production, use SQLAlchemy or dedicated connector
        # Here we use SQLite for demonstration
        conn = sqlite3.connect(connection_string.replace('sqlite:///', ''))
        
        # Create table if not exists
        df.head(0).to_sql('stock_prices', conn, if_exists='append', index=False)
        
        # Upsert logic: Delete existing records for these symbols/dates
        # then insert new ones (ensures idempotency)
        symbols = df['symbol'].unique().tolist()
        dates = df['trade_date'].dt.strftime('%Y-%m-%d').unique().tolist()
        
        placeholders_sym = ','.join(['?' for _ in symbols])
        placeholders_date = ','.join(['?' for _ in dates])
        
        cursor = conn.cursor()
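        # pandas writes datetime columns to SQLite as full timestamp strings
        # ('YYYY-MM-DD HH:MM:SS'), so compare on date(trade_date) to match the
        # 'YYYY-MM-DD' parameters; otherwise nothing is deleted on a re-run.
        cursor.execute(f'''
            DELETE FROM stock_prices 
            WHERE symbol IN ({placeholders_sym}) 
            AND date(trade_date) IN ({placeholders_date})
        ''', symbols + dates)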
        
        # Insert new data
        df.to_sql('stock_prices', conn, if_exists='append', index=False)
        
        conn.commit()
        conn.close()
        
        self.stats['records_loaded'] = len(df)
        logger.info(f"Loaded {len(df)} records to warehouse")
    
    def run(self, trade_date: Optional[datetime] = None) -> Dict:
        """
        Execute the full batch pipeline.
        
        Args:
            trade_date: Date to process (default: yesterday)
        
        Returns:
            Dictionary with execution statistics
        """
        if trade_date is None:
            trade_date = datetime.now() - timedelta(days=1)
        
        run_id = f"{self.pipeline_id}_{trade_date.strftime('%Y%m%d')}"
        
        try:
            # Check if already processed (idempotency)
            conn = sqlite3.connect(self.db_connection)
            cursor = conn.cursor()
            cursor.execute('''
                SELECT status FROM pipeline_runs 
                WHERE run_id = ? AND status = 'completed'
            ''', (run_id,))
            
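            already_completed = cursor.fetchone() is not None
            conn.close()  # close before any early return
            
            if already_completed:
                logger.info(f"Pipeline already completed for {run_id}, skipping")
                # Include 'stats' so callers can treat every result uniformly
                return {'status': 'skipped', 'run_id': run_id, 'stats': self.stats}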
            
            # Save checkpoint: started
            self._save_checkpoint(run_id, 'running')
            
            # Execute pipeline stages
            raw_data = self.extract_daily_data(trade_date)
            
            if not self.validate_raw_data(raw_data):
                raise ValueError("Data validation failed")
            
            transformed_data = self.transform_data(raw_data)
            self.load_to_warehouse(transformed_data)
            
            # Save checkpoint: completed
            self._save_checkpoint(run_id, 'completed', 
                                {'records': len(transformed_data)})
            
            return {
                'status': 'success',
                'run_id': run_id,
                'stats': self.stats,
                'trade_date': trade_date.strftime('%Y-%m-%d')
            }
            
        except Exception as e:
            logger.error(f"Pipeline failed: {str(e)}")
            self._save_checkpoint(run_id, 'failed', {'error': str(e)})
            raise


class BatchScheduler:
    """
    Simple scheduler for batch pipelines.
    
    In production, use Apache Airflow, Prefect, or cron.
    This demonstrates the concepts.
    """
    
    def __init__(self):
        self.jobs: Dict[str, Dict] = {}
    
    def add_job(self, 
                job_id: str, 
                pipeline: BatchPipeline,
                schedule: str,  # 'daily', 'hourly', or cron-like
                start_date: datetime):
        """
        Add a job to the scheduler.
        
        Args:
            job_id: Unique identifier
            pipeline: Pipeline instance to run
            schedule: Frequency ('daily', 'hourly')
            start_date: When to start
        """
        self.jobs[job_id] = {
            'pipeline': pipeline,
            'schedule': schedule,
            'start_date': start_date,
            'last_run': None,
            'next_run': start_date
        }
        logger.info(f"Scheduled job {job_id} with schedule {schedule}")
    
    def should_run(self, job_id: str) -> bool:
        """Check if job should run now."""
        job = self.jobs.get(job_id)
        if not job:
            return False
        
        return datetime.now() >= job['next_run']
    
    def execute_job(self, job_id: str):
        """Execute a scheduled job."""
        job = self.jobs[job_id]
        
        logger.info(f"Executing scheduled job: {job_id}")
        result = job['pipeline'].run()
        
        # Update schedule
        job['last_run'] = datetime.now()
        
        if job['schedule'] == 'daily':
            job['next_run'] = job['last_run'] + timedelta(days=1)
        elif job['schedule'] == 'hourly':
            job['next_run'] = job['last_run'] + timedelta(hours=1)
        
        return result


def demonstrate_batch_pipeline():
    """
    Demonstrate batch processing pipeline.
    """
    print("=" * 70)
    print("Batch Processing Pipeline for NEPSE")
    print("=" * 70)
    
    # Initialize pipeline
    pipeline = BatchPipeline(
        pipeline_id="nepse_daily_ingestion",
        db_connection="./pipeline_meta.db"
    )
    
    # Run for specific date
    result = pipeline.run(trade_date=datetime(2024, 1, 15))
    
    print(f"\nPipeline Result: {result['status']}")
    print(f"Records processed: {result['stats']['records_loaded']}")
    
    # Demonstrate scheduler
    print("\nScheduler Example:")
    scheduler = BatchScheduler()
    scheduler.add_job(
        job_id="daily_nepse",
        pipeline=pipeline,
        schedule='daily',
        start_date=datetime.now()
    )
    
    print(f"Job scheduled. Next run: {scheduler.jobs['daily_nepse']['next_run']}")
    
    return pipeline, scheduler


if __name__ == "__main__":
    demonstrate_batch_pipeline()
```

**Detailed Explanation:**

1. **Idempotency**: The pipeline uses "upsert" logic (delete then insert) to ensure running twice doesn't create duplicates. This is critical for reliable scheduling.

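2. **Checkpointing**: The SQLite checkpoint database records the status of every run. If a job fails, the recorded status and error tell you which run to retry. As written, a retry re-executes the whole pipeline for that date rather than resuming from the last successful stage, which is safe because the load is idempotent.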

3. **Validation Gates**: Before transformation, data is validated for:
   - Completeness (no missing symbols)
   - Business rules (positive prices, reasonable ranges)
   - Uniqueness constraints (no duplicate symbols per date)

4. **Error Handling**: Try-except blocks catch errors, log them, and save failure status to the checkpoint database for debugging.
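
The `BatchPipeline` above logs and records failures but does not retry them. One common way to add retries for transient errors, such as a timed-out API call, is a decorator with exponential backoff. The sketch below is a minimal version under that assumption; the `retry` decorator, its parameters, and `flaky_extract` are hypothetical names, not part of the pipeline code.

```python
import functools
import logging
import time
from typing import Callable, Tuple, Type

logger = logging.getLogger(__name__)


def retry(max_attempts: int = 3,
          base_delay: float = 1.0,
          backoff: float = 2.0,
          exceptions: Tuple[Type[BaseException], ...] = (Exception,),
          sleep: Callable[[float], None] = time.sleep):
    """Retry the wrapped function, multiplying the wait by `backoff` each time.

    `sleep` is injectable so tests can run without actually waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == max_attempts:
                        logger.error("%s failed after %d attempts: %s",
                                     func.__name__, attempt, exc)
                        raise  # give up and let the caller record the failure
                    logger.warning("%s attempt %d failed (%s); retrying in %.2fs",
                                   func.__name__, attempt, exc, delay)
                    sleep(delay)
                    delay *= backoff
        return wrapper
    return decorator


if __name__ == "__main__":
    calls = {"count": 0}

    # Hypothetical extractor that fails twice before succeeding.
    @retry(max_attempts=3, base_delay=0.01, exceptions=(ConnectionError,))
    def flaky_extract():
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated timeout")
        return ["NABIL", "NICA"]

    print(flaky_extract(), "after", calls["count"], "attempts")
```

Restricting `exceptions` to transient error types matters: retrying a validation failure would only repeat the same bad result, whereas a network timeout may well succeed on the next attempt.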

---

## **9.3 Stream Processing Pipelines**

Stream processing handles data in real-time as it arrives, enabling immediate reactions to market events.

```python
"""
Stream Processing Pipelines for Real-Time NEPSE Data

Stream processing is used for:
- Real-time price alerts
- Live trading signals
- Immediate anomaly detection
- Real-time dashboards

Tools: Apache Kafka, Apache Flink, Redis Streams, AWS Kinesis
"""

import asyncio
import json
from datetime import datetime, timedelta
from typing import Dict, List, Callable, Optional
from dataclasses import dataclass, asdict
import threading
import queue
import time


@dataclass
class StockTick:
    """
    Represents a single stock tick (price update).
    
    In real NEPSE streaming, this would come from WebSocket
    or streaming API.
    """
    symbol: str
    timestamp: datetime
    price: float
    volume: int
    tick_type: str = 'trade'  # 'trade', 'bid', 'ask'
    
    def to_json(self) -> str:
        return json.dumps({
            'symbol': self.symbol,
            'timestamp': self.timestamp.isoformat(),
            'price': self.price,
            'volume': self.volume,
            'tick_type': self.tick_type
        })


class StreamProcessor:
    """
    Real-time stream processor for NEPSE ticks.
    
    Architecture:
    - Ingest: Receive ticks from source (simulated here)
    - Process: Apply transformations/filters in real-time
    - Sink: Output to alerts, database, or dashboard
    """
    
    def __init__(self, window_size_seconds: int = 60):
        self.window_size = window_size_seconds
        self.buffer: Dict[str, List[StockTick]] = {}
        self.callbacks: List[Callable] = []
        self.running = False
        self.stats = {
            'ticks_processed': 0,
            'alerts_triggered': 0
        }
    
    def register_callback(self, callback: Callable[[StockTick], None]):
        """
        Register a callback function to process each tick.
        
        Callbacks can be:
        - Alert generators
        - Database writers
        - Feature calculators
        """
        self.callbacks.append(callback)
    
    def ingest_tick(self, tick: StockTick):
        """
        Ingest a single tick into the stream.
        
        In production, this would be called by Kafka consumer
        or WebSocket handler.
        """
        # Add to buffer for windowed calculations
        if tick.symbol not in self.buffer:
            self.buffer[tick.symbol] = []
        
        self.buffer[tick.symbol].append(tick)
        
        # Clean old ticks outside window
        cutoff = datetime.now() - timedelta(seconds=self.window_size)
        self.buffer[tick.symbol] = [
            t for t in self.buffer[tick.symbol] 
            if t.timestamp > cutoff
        ]
        
        # Process tick through all callbacks
        for callback in self.callbacks:
            try:
                callback(tick)
            except Exception as e:
                print(f"Callback error: {e}")
        
        self.stats['ticks_processed'] += 1
    
    def get_moving_average(self, symbol: str, seconds: int = 300) -> Optional[float]:
        """
        Calculate moving average for a symbol over last N seconds.
        
        This is a stateful operation on the stream buffer.
        """
        if symbol not in self.buffer:
            return None
        
        cutoff = datetime.now() - timedelta(seconds=seconds)
        recent_ticks = [t for t in self.buffer[symbol] if t.timestamp > cutoff]
        
        if not recent_ticks:
            return None
        
        prices = [t.price for t in recent_ticks]
        return sum(prices) / len(prices)
    
    def detect_anomaly(self, tick: StockTick) -> bool:
        """
        Detect price anomalies (sudden spikes/drops).
        
        Returns True if price change > 5% from moving average.
        """
        avg = self.get_moving_average(tick.symbol, seconds=300)
        if avg is None:
            return False
        
        change_pct = abs(tick.price - avg) / avg
        return change_pct > 0.05  # 5% threshold


class AlertManager:
    """
    Manages real-time alerts based on stream processing.
    """
    
    def __init__(self):
        self.alerts: List[Dict] = []
        self.cooldowns: Dict[str, datetime] = {}
    
    def check_price_alert(self, tick: StockTick):
        """
        Check if tick triggers any alerts.
        
        Implements cooldown to prevent spam (max 1 alert per 5 min per symbol).
        """
        # Check cooldown
        last_alert = self.cooldowns.get(tick.symbol)
        # total_seconds() covers the full elapsed time; .seconds wraps around after one day
        if last_alert and (datetime.now() - last_alert).total_seconds() < 300:
            return
        
        # Check conditions
        if tick.price > 1000:  # Price threshold
            self._trigger_alert(tick, f"Price above 1000: {tick.price}")
        
        if tick.volume > 1000000:  # Volume spike
            self._trigger_alert(tick, f"Volume spike: {tick.volume}")
    
    def _trigger_alert(self, tick: StockTick, message: str):
        """Record and send alert."""
        alert = {
            'timestamp': datetime.now(),
            'symbol': tick.symbol,
            'message': message,
            'severity': 'high'
        }
        self.alerts.append(alert)
        self.cooldowns[tick.symbol] = datetime.now()
        
        # In production: send email, SMS, Slack notification
        print(f"🚨 ALERT [{tick.symbol}]: {message}")


class KafkaSimulator:
    """
    Simulates Apache Kafka for demonstration purposes.
    
    In production, use kafka-python library:
    from kafka import KafkaConsumer, KafkaProducer
    """
    
    def __init__(self):
        self.topics: Dict[str, queue.Queue] = {}
        self.consumers: List[threading.Thread] = []
    
    def create_topic(self, topic_name: str):
        """Create a topic (message queue)."""
        self.topics[topic_name] = queue.Queue()
    
    def produce(self, topic: str, message: str):
        """Produce message to topic."""
        if topic in self.topics:
            self.topics[topic].put(message)
    
    def consume(self, topic: str, processor: Callable[[str], None]):
        """
        Consume messages from topic in background thread.
        """
        def consumer_loop():
            while True:
                try:
                    message = self.topics[topic].get(timeout=1)
                    processor(message)
                except queue.Empty:
                    continue
                except Exception as e:
                    print(f"Consumer error: {e}")
        
        thread = threading.Thread(target=consumer_loop, daemon=True)
        thread.start()
        self.consumers.append(thread)


def demonstrate_stream_processing():
    """
    Demonstrate stream processing concepts.
    """
    print("=" * 70)
    print("Stream Processing Pipeline (Real-Time)")
    print("=" * 70)
    
    # Initialize components (the buffer must span the 5-minute moving-average window)
    processor = StreamProcessor(window_size_seconds=300)
    alert_manager = AlertManager()
    
    # Register callbacks
    processor.register_callback(alert_manager.check_price_alert)
    processor.register_callback(
        lambda tick: print(f"Processed: {tick.symbol} @ {tick.price}")
    )
    
    # Simulate incoming ticks
    print("\nSimulating real-time ticks...")
    symbols = ['NABIL', 'NICA']
    
    for i in range(20):
        symbol = symbols[i % 2]
        
        # Simulate a price spike for NABIL when i == 10 (the 11th tick)
        if i == 10 and symbol == 'NABIL':
            price = 1050  # Anomaly
        else:
            price = 850 + (i % 10)
        
        tick = StockTick(
            symbol=symbol,
            timestamp=datetime.now(),
            price=price,
            volume=50000 + (i * 1000)
        )
        
        processor.ingest_tick(tick)
        
        if processor.detect_anomaly(tick):
            print(f"⚠️  Anomaly detected in {symbol}!")
        
        time.sleep(0.1)  # Simulate real-time delay
    
    print(f"\nStream Stats:")
    print(f"  Ticks processed: {processor.stats['ticks_processed']}")
    print(f"  Alerts triggered: {len(alert_manager.alerts)}")
    
    return processor, alert_manager


if __name__ == "__main__":
    demonstrate_stream_processing()
```

**Detailed Explanation:**

1. **Stream Buffer**: Maintains a sliding window of recent ticks per symbol in memory. This enables stateful operations like moving averages without querying a database.

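2. **Callback Pattern**: Each tick is passed to every registered callback (alert checker, feature calculator, database writer), and a failure in one callback does not block the others. This is closer to the Observer (publish-subscribe) pattern than to Chain of Responsibility, because every handler sees every tick.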

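3. **Anomaly Detection**: Real-time detection compares the current price against a moving average of recent ticks (up to 5 minutes, and never longer than the buffer window). If the deviation exceeds 5%, the tick is flagged as an anomaly.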

4. **Cooldown Mechanism**: Prevents alert spam by limiting alerts to one per 5 minutes per symbol, even if conditions remain triggered.
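
The sliding-window buffer described above can also be built on `collections.deque`, which removes expired ticks from the left in constant time instead of rebuilding a list on every tick. The sketch below is an alternative to the chapter's list-based buffer, not a copy of it. The names `Tick` and `SlidingWindow` are hypothetical, and it assumes ticks arrive in timestamp order and uses the tick's own timestamp (event time) as the clock.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Deque, Optional


@dataclass
class Tick:
    timestamp: datetime
    price: float


class SlidingWindow:
    """Time-based sliding window over the ticks of a single symbol."""

    def __init__(self, window_seconds: int = 300):
        self.window = timedelta(seconds=window_seconds)
        self.ticks: Deque[Tick] = deque()

    def add(self, tick: Tick) -> None:
        """Append a tick and evict everything older than the window."""
        self.ticks.append(tick)
        cutoff = tick.timestamp - self.window
        # Ticks are assumed to arrive in time order, so expired ones sit on the left.
        while self.ticks and self.ticks[0].timestamp <= cutoff:
            self.ticks.popleft()

    def average(self) -> Optional[float]:
        """Mean price of the ticks currently inside the window."""
        if not self.ticks:
            return None
        return sum(t.price for t in self.ticks) / len(self.ticks)

    def is_anomaly(self, price: float, threshold: float = 0.05) -> bool:
        """True if `price` deviates from the window average by more than `threshold`."""
        avg = self.average()
        if avg is None or avg == 0:
            return False
        return abs(price - avg) / avg > threshold


window = SlidingWindow(window_seconds=300)
start = datetime(2024, 1, 15, 11, 0, 0)
for i, price in enumerate([850.0, 852.0, 854.0]):
    window.add(Tick(start + timedelta(seconds=10 * i), price))

print(window.average())           # 852.0
print(window.is_anomaly(1050.0))  # True
```

Calling `is_anomaly` before `add` keeps a spike from diluting the baseline it is compared against, which is one reasonable design choice among several.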

---

## **9.4 Pipeline Orchestration**

Orchestration tools manage complex workflows with dependencies, scheduling, and monitoring.

### **9.4.1 Apache Airflow**

Apache Airflow is one of the most widely used workflow orchestrators. Workflows are defined in Python as DAGs (Directed Acyclic Graphs).

```python
"""
Apache Airflow Integration for NEPSE Pipelines

Airflow concepts:
- DAG: Directed Acyclic Graph (the workflow definition)
- Task: A unit of work (Python function, SQL query, etc.)
- Operator: Defines what a task does (PythonOperator, BashOperator, etc.)
- Sensor: Waits for external event (file, API, etc.)
- XCom: Cross-communication (share data between tasks)
"""

from datetime import datetime, timedelta
from typing import Dict, List, Any

# Note: In production, these imports come from airflow
# from airflow import DAG
# from airflow.operators.python import PythonOperator
# from airflow.sensors.filesystem import FileSensor


class MockDAG:
    """
    Mock DAG class to demonstrate Airflow concepts without installation.
    
    In production:
    from airflow import DAG
    
    dag = DAG(
        'nepse_daily_pipeline',
        default_args=default_args,
        description='NEPSE daily data pipeline',
        schedule_interval=timedelta(days=1),
        start_date=datetime(2024, 1, 1),
        catchup=False
    )
    """
    
    def __init__(self, dag_id: str, schedule_interval: timedelta, start_date: datetime):
        self.dag_id = dag_id
        self.schedule_interval = schedule_interval
        self.start_date = start_date
        self.tasks: Dict[str, Any] = {}
        self.dependencies: Dict[str, List[str]] = {}
    
    def add_task(self, task_id: str, python_callable: callable, **kwargs):
        """Add a task to the DAG."""
        self.tasks[task_id] = {
            'callable': python_callable,
            'kwargs': kwargs
        }
        return self
    
    def set_dependency(self, upstream: str, downstream: str):
        """Set task dependency: upstream >> downstream"""
        if downstream not in self.dependencies:
            self.dependencies[downstream] = []
        self.dependencies[downstream].append(upstream)
    
    def run(self):
        """Execute the DAG (simplified simulation)."""
        print(f"Running DAG: {self.dag_id}")
        
        # Topological sort to determine execution order
        executed = set()
        pending = set(self.tasks.keys())
        
        while pending:
            ready = {
                task for task in pending 
                if all(dep in executed for dep in self.dependencies.get(task, []))
            }
            
            if not ready:
                raise ValueError("Circular dependency detected")
            
            for task_id in ready:
                print(f"Executing task: {task_id}")
                task = self.tasks[task_id]
                task['callable'](**task['kwargs'])
                executed.add(task_id)
                pending.remove(task_id)
        
        print("DAG completed successfully")


def create_nepse_dag():
    """
    Create a production-grade NEPSE pipeline DAG.
    
    This demonstrates the structure of a real Airflow DAG.
    """
    # Default arguments for all tasks
    default_args = {
        'owner': 'nepse-data-team',
        'depends_on_past': False,  # Don't wait for previous run
        'email': ['alerts@nepse-system.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
    }
    
    # Create DAG
    dag = MockDAG(
        dag_id='nepse_daily_etl',
        schedule_interval=timedelta(days=1),  # Daily at midnight
        start_date=datetime(2024, 1, 1)
    )
    
    # Task 1: Check if source data is available (Sensor)
    def check_source_data():
        """Wait for NEPSE API to publish daily data."""
        print("Checking NEPSE API for daily data...")
        # In production: Check if file exists or API endpoint returns 200
        return True
    
    # Task 2: Extract raw data
    def extract_task():
        """Extract data from NEPSE API."""
        print("Extracting from NEPSE API...")
        # Return data via XCom in real Airflow
        return {'date': '2024-01-15', 'records': 200}
    
    # Task 3: Validate data quality
    def validate_task():
        """Run data quality checks."""
        print("Validating data quality...")
        checks = {
            'row_count': 200,
            'null_percentage': 0.0,
            'price_range_valid': True
        }
        
        if checks['null_percentage'] > 0.05:
            raise ValueError("Too many null values!")
        
        return checks
    
    # Task 4: Transform data
    def transform_task():
        """Calculate features and indicators."""
        print("Transforming data...")
        # Read from XCom, process, write back
        return {'features_calculated': 15}
    
    # Task 5: Load to warehouse
    def load_task():
        """Load to data warehouse."""
        print("Loading to warehouse...")
        return {'records_loaded': 200}
    
    # Task 6: Generate report
    def report_task():
        """Send success notification."""
        print("Sending success email...")
    
    # Add tasks to DAG
    dag.add_task('check_source', check_source_data)
    dag.add_task('extract', extract_task)
    dag.add_task('validate', validate_task)
    dag.add_task('transform', transform_task)
    dag.add_task('load', load_task)
    dag.add_task('report', report_task)
    
    # Define dependencies (execution order)
    # check_source >> extract >> validate >> transform >> load >> report
    dag.set_dependency('check_source', 'extract')
    dag.set_dependency('extract', 'validate')
    dag.set_dependency('validate', 'transform')
    dag.set_dependency('transform', 'load')
    dag.set_dependency('load', 'report')
    
    return dag


# Production Airflow DAG file (would be saved as nepse_dag.py):
"""
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'nepse_daily_pipeline',
    default_args=default_args,
    description='NEPSE daily ETL pipeline',
    # Airflow 2.4+ prefers the `schedule` argument; `schedule_interval` is the older name
    schedule_interval='0 18 * * 1-5',  # 6 PM, Mon-Fri (after market close)
    catchup=False
)

# Extract task
def extract_nepse_data(**context):
    from nepse_api import get_daily_data
    data = get_daily_data()
    # Push to XCom for downstream tasks
    context['task_instance'].xcom_push(key='raw_data', value=data)

extract = PythonOperator(
    task_id='extract',
    python_callable=extract_nepse_data,
    dag=dag,
)

# Transform task
transform = PythonOperator(
    task_id='transform',
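    # Pull with the same key that the extract task pushed; the default key is 'return_value'
    python_callable=lambda **c: transform_data(
        c['task_instance'].xcom_pull(task_ids='extract', key='raw_data')
    ),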
    dag=dag,
)

# Load task using PostgresOperator
load = PostgresOperator(
    task_id='load',
    postgres_conn_id='nepse_warehouse',
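    # Triple single quotes: this sample sits inside a triple-double-quoted string.
    # The %(name)s placeholders are filled from the operator's `parameters` (not shown).
    sql='''
        INSERT INTO stock_prices (date, symbol, close, volume)
        VALUES (%(date)s, %(symbol)s, %(close)s, %(volume)s)
        ON CONFLICT (date, symbol) DO UPDATE SET
            close = EXCLUDED.close,
            volume = EXCLUDED.volume;
    ''',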
    dag=dag,
)

# Set dependencies
extract >> transform >> load
"""


def demonstrate_airflow():
    """
    Demonstrate Airflow DAG concepts.
    """
    print("=" * 70)
    print("Apache Airflow Orchestration")
    print("=" * 70)
    
    dag = create_nepse_dag()
    dag.run()
    
    print("\nKey Airflow Features for NEPSE:")
    print("  - Schedule: Daily at 6 PM (after market close)")
    print("  - Retries: 3 attempts with 5-min delays")
    print("  - Dependencies: Linear pipeline with validation gates")
    print("  - XCom: Pass data between tasks")
    print("  - Sensors: Wait for API availability")


if __name__ == "__main__":
    demonstrate_airflow()
```

**Detailed Explanation:**

1. **DAG Structure**: The workflow is a Directed Acyclic Graph - tasks flow in one direction with no loops. This ensures predictable execution.

2. **Operators**: 
   - **PythonOperator**: Run Python functions
   - **PostgresOperator**: Execute SQL
   - **BashOperator**: Run shell commands
   - **Sensor**: Wait for external condition (file, API)

3. **XCom (Cross-Communication)**: Tasks pass data via XCom. Task A pushes data, Task B pulls it. For large data, use intermediate storage (S3, database) instead.

4. **Scheduling**: Cron expressions (`0 18 * * 1-5`) define when to run: 6 PM, Monday through Friday, after NEPSE market close.
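
To make the cron expression concrete, the sketch below computes the next run time for a "weekdays at 18:00" schedule using only the standard library. It is a hand-rolled illustration of what `0 18 * * 1-5` means, not Airflow's scheduler. The function name `next_weekday_run` is hypothetical and it works on naive datetimes. Airflow evaluates cron schedules in the DAG's timezone, which is UTC unless configured otherwise, so a NEPSE deployment would likely need an explicit Nepal timezone.

```python
from datetime import datetime, time, timedelta


def next_weekday_run(after: datetime, run_at: time = time(18, 0)) -> datetime:
    """Return the first Monday-Friday occurrence of `run_at` strictly after `after`."""
    candidate = datetime.combine(after.date(), run_at)
    if candidate <= after:
        candidate += timedelta(days=1)
    # datetime.weekday(): Monday == 0 ... Sunday == 6, so 5 and 6 are the weekend.
    # (Cron's day-of-week field 1-5 denotes the same Monday-Friday range.)
    while candidate.weekday() > 4:
        candidate += timedelta(days=1)
    return candidate


# Friday evening after the 18:00 slot: the next run falls on Monday.
print(next_weekday_run(datetime(2024, 1, 12, 19, 0)))  # 2024-01-15 18:00:00
```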

### **9.4.2 Prefect and Dagster (Modern Alternatives)**

```python
"""
Modern Orchestration Tools: Prefect and Dagster

Prefect: Simpler than Airflow, better error handling, dynamic workflows
Dagster: Asset-centric, emphasizes data quality and testing
"""

# Prefect Example (conceptual):
"""
from datetime import datetime
from prefect import flow, task
from prefect.tasks import task_input_hash
import requests

@task(cache_key_fn=task_input_hash)  # Cache results
def fetch_nepse_data(date: str):
    response = requests.get(f"https://nepse-api.com/prices?date={date}")
    response.raise_for_status()  # Fail the task on HTTP errors
    return response.json()

@task
def calculate_features(data: dict):
    # Feature engineering (placeholder: returns the input unchanged)
    processed_data = data
    return processed_data

@task
def save_to_db(data: dict):
    # Database insertion
    pass

@flow(name="nepse_daily_flow")
def nepse_pipeline(date: str = None):
    if date is None:
        date = datetime.now().strftime("%Y-%m-%d")
    
    raw_data = fetch_nepse_data(date)
    features = calculate_features(raw_data)
    save_to_db(features)

# Run: nepse_pipeline("2024-01-15")
"""

# Dagster Example (conceptual):
"""
from dagster import asset, Definitions
import pandas as pd

@asset  # An asset is a data object (table, file, ML model)
def nepse_raw_data():
    # Asset 1: Raw data from API
    return fetch_from_api()

@asset
def nepse_cleaned(nepse_raw_data):  # Dependency injection
    # Asset 2: Cleaned data (depends on raw)
    return clean_data(nepse_raw_data)

@asset
def nepse_features(nepse_cleaned):
    # Asset 3: Feature engineered data
    return calculate_features(nepse_cleaned)

defs = Definitions(assets=[nepse_raw_data, nepse_cleaned, nepse_features])
"""
```

---

## **9.5 Data Quality Gates**

Data quality gates prevent bad data from polluting downstream systems.

```python
"""
Data Quality Gates and Validation

Implement checks at multiple stages:
1. Ingestion: Schema validation, completeness
2. Transformation: Business rule validation
3. Loading: Referential integrity, uniqueness
"""

import pandas as pd
from typing import List, Dict, Callable, Any
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class ValidationResult:
    """Result of a data quality check."""
    check_name: str
    passed: bool
    details: Dict[str, Any]
    severity: str  # 'error', 'warning'


class DataQualitySuite:
    """
    Comprehensive data quality validation for NEPSE data.
    
    Implements the Great Expectations pattern (without the library
    for simplicity, but consider using great_expectations in production).
    """
    
    def __init__(self):
        self.checks: List[Callable] = []
        self.results: List[ValidationResult] = []
    
    def add_check(self, check_fn: Callable):
        """Add a validation check."""
        self.checks.append(check_fn)
    
    def validate(self, df: pd.DataFrame) -> bool:
        """
        Run all validation checks.
        
        Returns:
            True if all critical checks pass, False otherwise
        """
        self.results = []
        
        for check in self.checks:
            try:
                result = check(df)
                self.results.append(result)
                
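                if not result.passed and result.severity == 'error':
                    logger.error(f"Validation failed: {result.check_name}")
                elif not result.passed:
                    # Warnings are logged but do not block the pipeline
                    logger.warning(f"Validation warning: {result.check_name}")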
                    
            except Exception as e:
                self.results.append(ValidationResult(
                    check_name=check.__name__,
                    passed=False,
                    details={'error': str(e)},
                    severity='error'
                ))
        
        # Return True only if no errors
        critical_failures = [r for r in self.results 
                           if not r.passed and r.severity == 'error']
        
        return len(critical_failures) == 0
    
    def get_report(self) -> Dict:
        """Generate validation report."""
        return {
            'total_checks': len(self.results),
            'passed': sum(1 for r in self.results if r.passed),
            'failed': sum(1 for r in self.results if not r.passed),
            'details': [
                {
                    'check': r.check_name,
                    'status': 'PASS' if r.passed else 'FAIL',
                    'severity': r.severity,
                    'details': r.details
                }
                for r in self.results
            ]
        }


# Pre-built validation checks for NEPSE data

def check_no_missing_symbols(df: pd.DataFrame) -> ValidationResult:
    """Ensure all records have stock symbols."""
    missing = df['symbol'].isna().sum()
    return ValidationResult(
        check_name='no_missing_symbols',
        passed=missing == 0,
        details={'missing_count': missing},
        severity='error'
    )


def check_price_positive(df: pd.DataFrame) -> ValidationResult:
    """Ensure all prices are positive."""
    invalid = (df['close'] <= 0).sum()
    return ValidationResult(
        check_name='price_positive',
        passed=invalid == 0,
        details={'invalid_prices': invalid},
        severity='error'
    )


def check_volume_reasonable(df: pd.DataFrame) -> ValidationResult:
    """Check for suspicious volume (0 or extremely high)."""
    zero_volume = int((df['volume'] == 0).sum())
    high_volume = int((df['volume'] > 10_000_000).sum())
    
    return ValidationResult(
        check_name='volume_reasonable',
        passed=zero_volume == 0 and high_volume == 0,
        details={
            'zero_volume_rows': zero_volume,
            'suspiciously_high': high_volume
        },
        # Zero volume blocks the pipeline; very high volume alone is only a warning
        severity='error' if zero_volume > 0 else 'warning'
    )


def check_no_duplicates(df: pd.DataFrame) -> ValidationResult:
    """Check for duplicate symbol-date combinations."""
    dups = df.duplicated(subset=['symbol', 'date']).sum()
    return ValidationResult(
        check_name='no_duplicates',
        passed=dups == 0,
        details={'duplicate_rows': dups},
        severity='error'
    )


def check_price_range_consistency(df: pd.DataFrame) -> ValidationResult:
    """Ensure high >= low, close within range."""
    invalid_high_low = (df['high'] < df['low']).sum()
    invalid_close = ((df['close'] > df['high']) | (df['close'] < df['low'])).sum()
    
    return ValidationResult(
        check_name='price_range_consistency',
        passed=invalid_high_low == 0 and invalid_close == 0,
        details={
            'invalid_high_low': invalid_high_low,
            'close_out_of_range': invalid_close
        },
        severity='error'
    )


def demonstrate_data_quality():
    """
    Demonstrate data quality gates.
    """
    print("=" * 70)
    print("Data Quality Gates")
    print("=" * 70)
    
    # Create test data with some issues
    test_data = pd.DataFrame({
        'symbol': ['NABIL', 'NICA', 'SCBL', None, 'NABIL'],  # Duplicate NABIL, None symbol
        'date': ['2024-01-15'] * 5,
        'open': [850.0, 780.0, 520.0, 340.0, 860.0],
        'high': [870.0, 795.0, 535.0, 350.0, 875.0],
        'low': [845.0, 775.0, 515.0, 335.0, 840.0],
        'close': [865.0, 790.0, 530.0, -10.0, 865.0],  # Negative close
        'volume': [125000, 0, 76000, 145000, 130000]  # Zero volume for NICA
    })
    
    print("\nTest Data Issues:")
    print("  - Row 3: Missing symbol, negative price")
    print("  - Row 4: Duplicate symbol (NABIL)")
    print("  - Row 1: Zero volume")
    
    # Setup validation suite
    suite = DataQualitySuite()
    suite.add_check(check_no_missing_symbols)
    suite.add_check(check_price_positive)
    suite.add_check(check_volume_reasonable)
    suite.add_check(check_no_duplicates)
    suite.add_check(check_price_range_consistency)
    
    # Run validation
    is_valid = suite.validate(test_data)
    report = suite.get_report()
    
    print(f"\nValidation Result: {'PASS' if is_valid else 'FAIL'}")
    print(f"Checks: {report['passed']}/{report['total_checks']} passed")
    
    print("\nDetailed Report:")
    for detail in report['details']:
        status_icon = "✓" if detail['status'] == 'PASS' else "✗"
        print(f"  {status_icon} {detail['check']} ({detail['severity']})")
        if detail['details']:
            print(f"      Details: {detail['details']}")
    
    return suite


if __name__ == "__main__":
    demonstrate_data_quality()
```

**Detailed Explanation:**

1. **Validation Suite**: Composable pattern where checks can be added/removed. Each check returns a `ValidationResult` with severity levels.

2. **Severity Levels**:
   - **Error**: Stop the pipeline (bad data)
   - **Warning**: Log but continue (suspicious but possibly valid data)

3. **Business Rules**: Checks enforce domain knowledge:
   - Prices must be positive (stocks can't have negative price)
   - High must be >= Low (definition of high/low)
   - No duplicates (one record per symbol per day)
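
The two severity levels only matter if the pipeline acts on them. The sketch below shows one possible way to turn check results into a gate that raises on failed error-severity checks and merely logs failed warnings. `CheckOutcome`, `QualityGateError`, and `enforce_gate` are hypothetical names introduced for this example. The first is a trimmed stand-in for the chapter's `ValidationResult`, so the snippet runs without pandas.

```python
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger("quality_gate")


class QualityGateError(Exception):
    """Raised when at least one error-severity check fails."""


@dataclass
class CheckOutcome:
    name: str
    passed: bool
    severity: str  # 'error' or 'warning'


def enforce_gate(outcomes: List[CheckOutcome]) -> List[str]:
    """Raise on failed 'error' checks; log and return the names of failed warnings."""
    errors = [o.name for o in outcomes if not o.passed and o.severity == "error"]
    warnings = [o.name for o in outcomes if not o.passed and o.severity == "warning"]

    for name in warnings:
        logger.warning("Quality warning (continuing): %s", name)

    if errors:
        raise QualityGateError(f"Blocking checks failed: {', '.join(errors)}")
    return warnings
```

Placed between the transform and load steps, a gate like this keeps bad rows out of the warehouse while still letting suspicious but possibly valid data through with a log entry.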

---

## **9.6 Pipeline Monitoring**

Monitoring tracks pipeline health, performance, and data quality over time.

```python
"""
Pipeline Monitoring and Observability

Key metrics to track:
1. Operational: Runtime, success/failure rates, latency
2. Data Quality: Row counts, null rates, distribution drift
3. Business: Records processed, SLAs met
"""

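import time
from datetime import datetime
from typing import Dict, List, Any
from dataclasses import dataclass, field
import json

import numpy as np   # used to generate the demo data
import pandas as pd  # used by DataDriftMonitor and the demo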


@dataclass
class PipelineMetrics:
    """Metrics collected during pipeline execution."""
    pipeline_id: str
    run_id: str
    start_time: datetime
    end_time: datetime = None
    status: str = "running"  # running, success, failed
    records_processed: int = 0
    duration_seconds: float = 0.0
    custom_metrics: Dict[str, Any] = field(default_factory=dict)


class PipelineMonitor:
    """
    Monitor for tracking pipeline performance and health.
    
    In production, send metrics to:
    - Prometheus (metrics)
    - Grafana (visualization)
    - ELK Stack (logs)
    - PagerDuty (alerts)
    """
    
    def __init__(self, pipeline_id: str):
        self.pipeline_id = pipeline_id
        self.metrics_history: List[PipelineMetrics] = []
        self.current_run: PipelineMetrics = None
    
    def start_run(self) -> str:
        """Start monitoring a new pipeline run."""
        run_id = f"{self.pipeline_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        self.current_run = PipelineMetrics(
            pipeline_id=self.pipeline_id,
            run_id=run_id,
            start_time=datetime.now()
        )
        return run_id
    
    def record_metric(self, name: str, value: Any):
        """Record a custom metric."""
        if self.current_run:
            self.current_run.custom_metrics[name] = value
    
    def end_run(self, status: str = "success"):
        """End current run and save metrics."""
        if not self.current_run:
            return
        
        self.current_run.end_time = datetime.now()
        self.current_run.status = status
        self.current_run.duration_seconds = (
            self.current_run.end_time - self.current_run.start_time
        ).total_seconds()
        
        self.metrics_history.append(self.current_run)
        
        # Alert if failed
        if status == "failed":
            self._send_alert(self.current_run)
    
    def _send_alert(self, metrics: PipelineMetrics):
        """Send failure alert (simulated)."""
        print(f"🚨 ALERT: Pipeline {metrics.pipeline_id} failed!")
        print(f"   Run ID: {metrics.run_id}")
        print(f"   Duration: {metrics.duration_seconds}s")
    
    def get_stats(self) -> Dict:
        """Calculate statistics over run history."""
        if not self.metrics_history:
            return {}
        
        runs = self.metrics_history
        success_runs = [r for r in runs if r.status == "success"]
        
        return {
            'total_runs': len(runs),
            'success_rate': len(success_runs) / len(runs),
            'avg_duration': sum(r.duration_seconds for r in runs) / len(runs),
            'avg_records': sum(r.records_processed for r in runs) / len(runs),
            'last_run_status': runs[-1].status if runs else None
        }


class DataDriftMonitor:
    """
    Monitor for data drift in time-series.
    
    Detects when incoming data distribution changes significantly
    from historical patterns (indicates data quality issues or
    fundamental market changes).
    """
    
    def __init__(self, reference_data: pd.DataFrame):
        self.reference_stats = self._calculate_stats(reference_data)
    
    def _calculate_stats(self, df: pd.DataFrame) -> Dict:
        """Calculate statistical profile of reference data."""
        return {
            'close_mean': df['close'].mean(),
            'close_std': df['close'].std(),
            'volume_mean': df['volume'].mean(),
            'volume_std': df['volume'].std()
        }
    
    def check_drift(self, new_data: pd.DataFrame) -> Dict[str, bool]:
        """
        Check if new data has drifted from reference.
        
        Uses z-score to detect significant deviations.
        """
        drift_detected = {}
        
        # Check price drift
        new_mean = new_data['close'].mean()
        z_score = abs(new_mean - self.reference_stats['close_mean']) / self.reference_stats['close_std']
        drift_detected['price_drift'] = z_score > 3  # 3 sigma rule
        
        # Check volume drift
        new_vol_mean = new_data['volume'].mean()
        vol_z = abs(new_vol_mean - self.reference_stats['volume_mean']) / self.reference_stats['volume_std']
        drift_detected['volume_drift'] = vol_z > 3
        
        return drift_detected


def demonstrate_monitoring():
    """
    Demonstrate pipeline monitoring.
    """
    print("=" * 70)
    print("Pipeline Monitoring")
    print("=" * 70)
    
    # Simulate pipeline runs
    monitor = PipelineMonitor("nepse_daily_etl")
    
    for i in range(5):
        run_id = monitor.start_run()
        
        # Simulate work
        time.sleep(0.1)
        
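        monitor.record_metric("rows_extracted", 200 + i * 10)
        monitor.record_metric("null_percentage", 0.02)
        # Populate the field that get_stats() averages as 'avg_records'
        monitor.current_run.records_processed = 200 + i * 10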
        
        # Simulate occasional failure
        status = "success" if i != 3 else "failed"
        monitor.end_run(status)
        
        print(f"Run {i+1}: {status}")
    
    stats = monitor.get_stats()
    print(f"\nPipeline Statistics:")
    print(f"  Total runs: {stats['total_runs']}")
    print(f"  Success rate: {stats['success_rate']:.1%}")
    print(f"  Avg duration: {stats['avg_duration']:.2f}s")
    
    # Data drift example
    print("\nData Drift Detection:")
    reference = pd.DataFrame({
        'close': np.random.normal(500, 20, 1000),
        'volume': np.random.normal(100000, 10000, 1000)
    })
    
    drift_monitor = DataDriftMonitor(reference)
    
    # Normal new data
    normal_data = pd.DataFrame({
        'close': np.random.normal(505, 22, 100),
        'volume': np.random.normal(101000, 11000, 100)
    })
    
    # Drifted data (market crash simulation)
    drifted_data = pd.DataFrame({
        'close': np.random.normal(300, 15, 100),  # Major drop
        'volume': np.random.normal(500000, 50000, 100)  # Volume spike
    })
    
    print(f"  Normal data drift: {drift_monitor.check_drift(normal_data)}")
    print(f"  Crashed market drift: {drift_monitor.check_drift(drifted_data)}")
    
    return monitor


if __name__ == "__main__":
    demonstrate_monitoring()
```
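
The drift check in `DataDriftMonitor` reduces to one formula: the absolute difference between the new batch mean and the reference mean, divided by the reference standard deviation, compared against 3. The sketch below reproduces that arithmetic on small hand-picked numbers so the result can be verified by hand. The function name `mean_shift_z` and the sample values are made up for illustration. `statistics.stdev` computes the sample standard deviation, which matches the default of pandas `.std()`.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift_z(reference: Sequence[float], new: Sequence[float]) -> float:
    """Distance between the two means, in units of the reference std deviation."""
    ref_std = stdev(reference)  # sample standard deviation (n - 1 in the denominator)
    if ref_std == 0:
        raise ValueError("Reference data has zero variance; z-score is undefined")
    return abs(mean(new) - mean(reference)) / ref_std


reference_close = [490.0, 500.0, 510.0, 495.0, 505.0]  # mean 500, std about 7.9
normal_batch = [502.0, 498.0, 506.0]                    # mean 502
crash_batch = [300.0, 310.0, 290.0]                     # mean 300

print(mean_shift_z(reference_close, normal_batch) > 3)  # False
print(mean_shift_z(reference_close, crash_batch) > 3)   # True
```

Dividing by the reference standard deviation rather than the standard error of the batch mean makes the check conservative: only large shifts in the mean are flagged, regardless of batch size.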

---

## **9.7 Error Handling and Recovery**

Robust pipelines handle failures gracefully and recover automatically.

```python
"""
Error Handling and Recovery Strategies

Strategies:
1. Retry with exponential backoff
2. Dead letter queues (save failed records)
3. Circuit breaker (stop trying if failing repeatedly)
4. Checkpoint/resume
"""

import time
import random
from functools import wraps
from typing import Callable, Any, Optional
import logging

logger = logging.getLogger(__name__)


class RetryWithBackoff:
    """
    Decorator for retrying failed operations.
    
    Implements exponential backoff with jitter: wait base_delay * 2^attempt
    seconds, plus up to 1 second of random jitter, between retries.
    """
    
    def __init__(self, max_retries: int = 3, base_delay: float = 1.0):
        self.max_retries = max_retries
        self.base_delay = base_delay
    
    def __call__(self, func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(self.max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == self.max_retries - 1:
                        raise
                    
                    delay = self.base_delay * (2 ** attempt) + random.uniform(0, 1)
                    logger.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s...")
                    time.sleep(delay)
            
        return wrapper


class CircuitBreaker:
    """
    Circuit breaker pattern: Stop calling failing service.
    
    States:
    - CLOSED: Normal operation
    - OPEN: Failing fast (service down)
    - HALF_OPEN: Testing if service recovered
    """
    
    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
    
    def call(self, func: Callable, *args, **kwargs):
        """Call function with circuit breaker protection."""
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"
                logger.info("Circuit breaker entering HALF_OPEN state")
            else:
                raise RuntimeError("Circuit breaker is OPEN; failing fast")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise
    
    def _on_success(self):
        """Reset on success."""
        self.failure_count = 0
        self.state = "CLOSED"
    
    def _on_failure(self):
        """Track failures."""
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            logger.error("Circuit breaker OPENED due to repeated failures")


class DeadLetterQueue:
    """
    Save failed records for later inspection/reprocessing.
    """
    
    def __init__(self, queue_path: str = "./dlq.json"):
        self.queue_path = queue_path
        self.failed_records: List[Dict] = []
    
    def add(self, record: Any, error: str, context: Optional[Dict] = None):
        """Add failed record to DLQ."""
        self.failed_records.append({
            'timestamp': datetime.now().isoformat(),
            'record': record,
            'error': error,
            'context': context or {}
        })
    
    def save(self):
        """Persist DLQ to disk."""
        import json
        with open(self.queue_path, 'w') as f:
            json.dump(self.failed_records, f, indent=2, default=str)
    
    def reprocess(self, processor: Callable) -> int:
        """
        Attempt to reprocess failed records.
        
        Returns:
            Number successfully reprocessed
        """
        success_count = 0
        still_failed = []
        
        for item in self.failed_records:
            try:
                processor(item['record'])
                success_count += 1
            except Exception as e:
                item['retry_error'] = str(e)
                still_failed.append(item)
        
        self.failed_records = still_failed
        return success_count


# Example usage with NEPSE API

@RetryWithBackoff(max_retries=3, base_delay=2.0)
def fetch_nepse_api(date: str) -> Dict:
    """
    Fetch data from NEPSE API with retry logic.
    
    Simulates occasional failures.
    """
    if random.random() < 0.3:  # 30% failure rate simulation
        raise ConnectionError("NEPSE API timeout")
    
    return {"date": date, "prices": [{"symbol": "NABIL", "close": 865.0}]}


def demonstrate_error_handling():
    """
    Demonstrate error handling patterns.
    """
    print("=" * 70)
    print("Error Handling and Recovery")
    print("=" * 70)
    
    # Test retry logic
    print("\n1. Retry with Exponential Backoff")
    print("-" * 40)
    
    try:
        result = fetch_nepse_api("2024-01-15")
        print(f"Success: {result}")
    except Exception as e:
        print(f"Failed after retries: {e}")
    
    # Test circuit breaker
    print("\n2. Circuit Breaker Pattern")
    print("-" * 40)
    
    cb = CircuitBreaker(failure_threshold=3, recovery_timeout=5)
    
    def unreliable_function():
        if random.random() < 0.7:
            raise Exception("Service unavailable")
        return "Success"
    
    for i in range(10):
        try:
            result = cb.call(unreliable_function)
            print(f"Call {i+1}: {result}")
        except Exception as e:
            print(f"Call {i+1}: {str(e)[:50]}")
        
        time.sleep(0.5)
    
    # Test DLQ
    print("\n3. Dead Letter Queue")
    print("-" * 40)
    
    dlq = DeadLetterQueue()
    
    # Simulate processing with some failures
    records = [
        {"symbol": "NABIL", "price": 865},
        {"symbol": "BAD_DATA", "price": "invalid"},
        {"symbol": "NICA", "price": 790}
    ]
    
    def process_record(record):
        if not isinstance(record['price'], (int, float)):
            raise ValueError("Invalid price type")
        print(f"  Processed {record['symbol']}")
    
    for record in records:
        try:
            process_record(record)
        except Exception as e:
            dlq.add(record, str(e))
            print(f"  Failed {record['symbol']}: {e}")
    
    print(f"\nDLQ has {len(dlq.failed_records)} failed records")
    print("Attempting reprocess...")
    fixed_count = dlq.reprocess(process_record)
    print(f"Reprocessed {fixed_count} records")


if __name__ == "__main__":
    demonstrate_error_handling()
```
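
Checkpoint/resume, the fourth strategy listed above, can be sketched as follows. This is a minimal illustration that assumes a local JSON file is an acceptable checkpoint store; `FileCheckpoint`, `run_with_checkpoint`, and `demo_checkpoint` are hypothetical names introduced here, not part of the chapter's pipeline classes.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable, List


class FileCheckpoint:
    """Hypothetical helper: remembers which work items have completed."""

    def __init__(self, path: str):
        self.path = path
        self.done: List[str] = []
        if os.path.exists(path):
            with open(path) as f:
                self.done = json.load(f)

    def mark_done(self, item: str) -> None:
        """Record a finished item and persist it via write-then-rename."""
        self.done.append(item)
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.done, f)
        os.replace(tmp_path, self.path)  # replaces the old file in one step


def run_with_checkpoint(items: Iterable[str],
                        process: Callable[[str], None],
                        checkpoint: FileCheckpoint) -> List[str]:
    """Process items in order, skipping any that are already checkpointed."""
    processed = []
    for item in items:
        if item in checkpoint.done:
            continue
        process(item)  # may raise; earlier progress is already on disk
        checkpoint.mark_done(item)
        processed.append(item)
    return processed


def demo_checkpoint() -> Dict:
    """Fail partway through a run, then resume from the checkpoint file."""
    dates = ["2024-01-15", "2024-01-16", "2024-01-17"]
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

    def flaky(date: str) -> None:
        if date == "2024-01-16":
            raise ConnectionError("simulated outage")

    first_error = None
    try:
        run_with_checkpoint(dates, flaky, FileCheckpoint(path))
    except ConnectionError as e:
        first_error = str(e)

    # Second run: a fresh FileCheckpoint reloads progress from disk,
    # and the processor no longer fails.
    resumed = run_with_checkpoint(dates, lambda d: None, FileCheckpoint(path))
    return {"first_error": first_error, "resumed": resumed}


if __name__ == "__main__":
    print(demo_checkpoint())
```

The checkpoint is written after each item, so a crash loses at most the item in flight. That item runs again on resume, which is why the processing step should be idempotent.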

---

## **9.8 Pipeline Testing**

Testing catches pipeline defects before they reach production.

```python
"""
Pipeline Testing Strategies

Types of tests:
1. Unit tests: Individual functions
2. Integration tests: Database connections, API calls
3. Data tests: Schema, values, distributions
4. End-to-end tests: Full pipeline run
"""

import unittest
import pandas as pd
from datetime import datetime


class TestNEPSEPipeline(unittest.TestCase):
    """Unit tests for NEPSE pipeline components."""
    
    def setUp(self):
        """Set up test fixtures."""
        self.sample_data = pd.DataFrame({
            'symbol': ['NABIL', 'NICA'],
            'date': ['2024-01-15', '2024-01-15'],
            'open': [850.0, 780.0],
            'high': [870.0, 795.0],
            'low': [845.0, 775.0],
            'close': [865.0, 790.0],
            'volume': [100000, 80000]
        })
    
    def test_data_validation(self):
        """Test validation logic."""
        # Should pass with valid data
        self.assertTrue(len(self.sample_data) > 0)
        self.assertTrue(all(self.sample_data['close'] > 0))
    
    def test_feature_calculation(self):
        """Test technical indicator calculations."""
        # Two consecutive closes for one symbol; returns must not span symbols
        prices = pd.DataFrame({
            'symbol': ['NABIL', 'NABIL'],
            'date': ['2024-01-15', '2024-01-16'],
            'close': [865.0, 870.0]
        })
        prices['return'] = prices.groupby('symbol')['close'].pct_change()
        
        # First row should be NaN (no previous data)
        self.assertTrue(pd.isna(prices['return'].iloc[0]))
        
        # Second row should equal (870 - 865) / 865
        self.assertAlmostEqual(prices['return'].iloc[1], 5.0 / 865.0)
    
    def test_duplicate_detection(self):
        """Test that duplicates are caught."""
        # Create duplicate
        dup_data = pd.concat([self.sample_data, self.sample_data.iloc[[0]]])
        
        duplicates = dup_data.duplicated(subset=['symbol', 'date']).sum()
        self.assertEqual(duplicates, 1)


class TestDataQuality(unittest.TestCase):
    """Tests for data quality checks."""
    
    def test_schema_validation(self):
        """Ensure required columns exist."""
        required_cols = ['symbol', 'date', 'open', 'high', 'low', 'close', 'volume']
        data = pd.DataFrame(columns=required_cols)
        
        for col in required_cols:
            self.assertIn(col, data.columns)
    
    def test_price_consistency(self):
        """Ensure high >= low."""
        data = pd.DataFrame({
            'high': [100, 100],
            'low': [90, 110]  # Second row invalid
        })
        
        valid = data['high'] >= data['low']
        self.assertTrue(valid.iloc[0])
        self.assertFalse(valid.iloc[1])


def run_tests():
    """Run the test suite."""
    print("=" * 70)
    print("Pipeline Testing")
    print("=" * 70)
    
    # Create test suite
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    
    suite.addTests(loader.loadTestsFromTestCase(TestNEPSEPipeline))
    suite.addTests(loader.loadTestsFromTestCase(TestDataQuality))
    
    # Run tests
    runner = unittest.TextTestRunner(verbosity=2)
    result = runner.run(suite)
    
    return result


if __name__ == "__main__":
    run_tests()
```
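
The docstring above also lists end-to-end tests. One way to sketch such a test is to run a tiny pipeline against files in a temporary directory and assert on what lands on disk. `run_mini_pipeline` is a hypothetical stand-in for a full pipeline run, and the CSV layout is an assumption made for this example only.

```python
import os
import tempfile
import unittest

import pandas as pd

# Small raw input with one duplicated (symbol, date) row
RAW_CSV = """symbol,date,close
NABIL,2024-01-15,865.0
NABIL,2024-01-16,870.0
NICA,2024-01-15,790.0
NICA,2024-01-15,790.0
"""


def run_mini_pipeline(raw_path: str, out_path: str) -> int:
    """Hypothetical stand-in for a full run: extract, transform, load."""
    df = pd.read_csv(raw_path)                            # extract
    df = df.drop_duplicates(subset=["symbol", "date"])    # transform
    df = df.sort_values(["symbol", "date"])
    df["return"] = df.groupby("symbol")["close"].pct_change()
    df.to_csv(out_path, index=False)                      # load (overwrites)
    return len(df)


class TestMiniPipelineEndToEnd(unittest.TestCase):
    """Runs the whole mini pipeline against temporary files."""

    def setUp(self):
        self.tmp = tempfile.TemporaryDirectory()
        self.raw_path = os.path.join(self.tmp.name, "raw.csv")
        self.out_path = os.path.join(self.tmp.name, "out.csv")
        with open(self.raw_path, "w") as f:
            f.write(RAW_CSV)

    def tearDown(self):
        self.tmp.cleanup()

    def test_full_run(self):
        rows = run_mini_pipeline(self.raw_path, self.out_path)
        self.assertEqual(rows, 3)  # the duplicate NICA row is dropped

        out = pd.read_csv(self.out_path)
        self.assertEqual(list(out.columns),
                         ["symbol", "date", "close", "return"])

        # Returns are computed per symbol, so each symbol's first row is NaN
        first_rows = out.groupby("symbol").head(1)
        self.assertTrue(first_rows["return"].isna().all())

    def test_rerun_is_idempotent(self):
        run_mini_pipeline(self.raw_path, self.out_path)
        first = pd.read_csv(self.out_path)
        run_mini_pipeline(self.raw_path, self.out_path)
        second = pd.read_csv(self.out_path)
        pd.testing.assert_frame_equal(first, second)


def run_end_to_end_tests() -> unittest.TestResult:
    """Run only the end-to-end test case and return the result object."""
    loader = unittest.TestLoader()
    suite = loader.loadTestsFromTestCase(TestMiniPipelineEndToEnd)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_end_to_end_tests()
```

Because the test works on temporary files, it needs no database or network access and cleans up after itself. A real end-to-end test would point the same assertions at a staging environment.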

---

## **9.9 Scalability Considerations**

Design pipelines so that they keep up as data volumes grow.

```python
"""
Scalability Strategies for NEPSE Pipelines

Horizontal Scaling:
- Distribute work across multiple workers
- Partition data by symbol or date
- Use message queues (Kafka, RabbitMQ)

Vertical Scaling:
- Optimize code (vectorization, caching)
- Use faster storage (SSD, memory)
- Parallel processing within single machine
"""

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import pandas as pd
import numpy as np
from typing import List, Dict


def process_symbol_chunk(symbol_data: pd.DataFrame) -> pd.DataFrame:
    """
    Process a single symbol's data (for parallel execution).
    
    This function is picklable and can be sent to worker processes.
    """
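    # Sort chronologically so rolling windows and returns follow time order.
    # sort_values returns a new DataFrame, so the caller's data is not modified.
    symbol_data = symbol_data.sort_values('date')
    
    # Calculate indicators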
    symbol_data['sma_20'] = symbol_data['close'].rolling(20, min_periods=1).mean()
    symbol_data['returns'] = symbol_data['close'].pct_change()
    
    return symbol_data


class ScalablePipeline:
    """
    Pipeline that parallelizes work across CPU cores using multiprocessing.
    
    Useful when processing many symbols independently.
    """
    
    def __init__(self, max_workers: int = None):
        self.max_workers = max_workers or mp.cpu_count()
    
    def process_parallel(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Process data in parallel by symbol.
        
        Each symbol is processed independently in a separate process,
        utilizing all CPU cores.
        """
        # Group by symbol
        symbol_groups = [group for _, group in data.groupby('symbol')]
        
        # Process in parallel
        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
            results = list(executor.map(process_symbol_chunk, symbol_groups))
        
        # Combine results
        return pd.concat(results, ignore_index=True)
    
    def process_distributed(self, 
                          data: pd.DataFrame,
                          partition_col: str = 'date') -> List[Dict]:
        """
        Partition data for distributed processing (e.g., Spark, Dask).
        
        Returns a list of dicts, one per partition, each holding the partition
        key, the partition's rows, and the row count. Each partition can be
        processed on a different node.
        """
        partitions = []
        
        for partition_value, group in data.groupby(partition_col):
            partitions.append({
                'partition_key': partition_value,
                'data': group,
                'count': len(group)
            })
        
        return partitions


def demonstrate_scalability():
    """
    Demonstrate scalability concepts.
    """
    print("=" * 70)
    print("Scalability Considerations")
    print("=" * 70)
    
    # Generate large dataset
    print("\nGenerating test data (about 100,000 records)...")
    dates = pd.date_range('2020-01-01', '2024-01-01', freq='B')
    symbols = [f'STOCK_{i}' for i in range(100)]  # 100 symbols
    
    data = []
    for date in dates:
        for symbol in symbols:
            data.append({
                'symbol': symbol,
                'date': date,
                'close': np.random.uniform(100, 1000),
                'volume': np.random.randint(10000, 1000000)
            })
    
    df = pd.DataFrame(data)
    print(f"Total records: {len(df)}")
    
    # Parallel processing
    pipeline = ScalablePipeline(max_workers=4)
    
    print("\nProcessing with 4 workers...")
    import time
    start = time.time()
    result = pipeline.process_parallel(df.head(10000))  # Process subset for demo
    duration = time.time() - start
    
    print(f"Processed in {duration:.2f} seconds")
    print(f"Worker processes used: {pipeline.max_workers}")
    
    # Partitioning strategy
    print("\nData Partitioning for Distributed Processing:")
    partitions = pipeline.process_distributed(df.head(1000), 'date')
    print(f"  Created {len(partitions)} daily partitions")
    print(f"  Average partition size: {np.mean([p['count'] for p in partitions]):.0f} records")
    
    print("\nScaling Recommendations for NEPSE:")
    print("  1. Current scale (< 1M records): Single machine, pandas")
    print("  2. Medium scale (1M-100M): Dask or multiprocessing")
    print("  3. Large scale (> 100M): Apache Spark on cluster")
    print("  4. Real-time: Kafka + Flink streaming")


if __name__ == "__main__":
    demonstrate_scalability()
```
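
The vertical-scaling advice in the docstring above (vectorization) is often worth trying before adding processes. The sketch below builds the same kind of test dataset and per-symbol indicators without Python-level row loops. `generate_test_data` and `add_indicators` are hypothetical helper names, and the fixed seed is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd


def generate_test_data(n_symbols: int = 100,
                       start: str = "2020-01-01",
                       end: str = "2024-01-01",
                       seed: int = 42) -> pd.DataFrame:
    """Build one row per (date, symbol) pair without a Python row loop."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range(start, end, freq="B")
    symbols = [f"STOCK_{i}" for i in range(n_symbols)]

    # Cartesian product of dates and symbols, ordered by date then symbol
    index = pd.MultiIndex.from_product([dates, symbols],
                                       names=["date", "symbol"])
    df = index.to_frame(index=False)

    # Draw all random values in one call each
    n = len(df)
    df["close"] = rng.uniform(100, 1000, size=n)
    df["volume"] = rng.integers(10_000, 1_000_000, size=n)
    return df[["symbol", "date", "close", "volume"]]


def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Compute per-symbol indicators with groupby in a single process."""
    out = df.sort_values(["symbol", "date"]).copy()
    out["sma_20"] = out.groupby("symbol")["close"].transform(
        lambda s: s.rolling(20, min_periods=1).mean()
    )
    out["returns"] = out.groupby("symbol")["close"].pct_change()
    return out


if __name__ == "__main__":
    data = generate_test_data()
    print(f"Generated {len(data)} records")
    print(add_indicators(data).head())
```

For data that fits in memory, a single-process groupby like this avoids the cost of pickling each group and sending it to a worker. Whether it beats `ProcessPoolExecutor` depends on the workload, so it is worth timing both on your own data.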

---

## **9.10 Cost Optimization**

Manage infrastructure costs without giving up the performance the pipeline needs.

```python
"""
Cost Optimization Strategies

1. Storage Tiers: Hot (SSD) -> Warm (Disk) -> Cold (S3 Glacier)
2. Compute: Spot instances, auto-scaling, serverless
3. Data Lifecycle: Archive old data, compression
4. Query Optimization: Partition pruning, column selection
"""

from datetime import datetime, timedelta


class CostOptimizer:
    """
    Strategies to minimize pipeline operating costs.
    """
    
    def __init__(self):
        self.costs = {
            'storage_gb_month': 0.023,  # S3 Standard per GB
            'compute_hour': 0.05,       # EC2 on-demand rate per vCPU-hour (illustrative)
            'api_call': 0.001           # Per 1000 API calls
        }
    
    def calculate_storage_cost(self, 
                              data_size_gb: float,
                              storage_class: str = 'standard') -> float:
        """
        Calculate monthly storage cost.
        
        Storage classes:
        - standard: $0.023/GB (frequent access)
        - infrequent: $0.0125/GB (monthly access)
        - glacier: $0.004/GB (archive, rare access)
        """
        rates = {
            'standard': 0.023,
            'infrequent': 0.0125,
            'glacier': 0.004
        }
        
        rate = rates.get(storage_class, 0.023)
        return data_size_gb * rate
    
    def recommend_storage_tier(self, 
                              last_access_days: int,
                              access_frequency: str) -> str:
        """
        Recommend storage tier based on access patterns.
        """
        if last_access_days > 365:
            return 'glacier'
        elif last_access_days > 30 or access_frequency == 'rare':
            return 'infrequent'
        else:
            return 'standard'
    
    def calculate_processing_cost(self,
                                 run_time_hours: float,
                                 vcpu_count: int,
                                 memory_gb: int,
                                 use_spot: bool = True) -> float:
        """
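        Calculate compute cost for a pipeline run.
        
        Spot instances can save up to 90% but can be interrupted; this model
        assumes a flat 70% discount. memory_gb is accepted for completeness
        but is not priced in this simplified model.
        """
        base_rate = 0.05  # per hour per vCPU
        
        if use_spot:
            base_rate *= 0.3  # assumed 70% spot discount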
        
        return run_time_hours * vcpu_count * base_rate
    
    def optimize_query_cost(self, 
                           data_scanned_gb: float,
                           column_optimization: bool = True) -> float:
        """
        Calculate query cost (BigQuery model: $5 per TB scanned).
        
        Column optimization (SELECT only needed columns) reduces cost.
        """
        cost_per_tb = 5.0
        
        if column_optimization:
            # Typical reduction: 80% less data scanned
            data_scanned_gb *= 0.2
        
        return (data_scanned_gb / 1024) * cost_per_tb


def demonstrate_cost_optimization():
    """
    Demonstrate cost calculations.
    """
    print("=" * 70)
    print("Cost Optimization")
    print("=" * 70)
    
    optimizer = CostOptimizer()
    
    # NEPSE data cost calculation
    data_size_gb = 0.5  # 500 MB historical data
    growth_rate = 0.001  # 1 MB per day
    
    print("\nStorage Cost Analysis (Monthly):")
    print(f"Current data size: {data_size_gb} GB")
    
    # Current month
    standard_cost = optimizer.calculate_storage_cost(data_size_gb, 'standard')
    glacier_cost = optimizer.calculate_storage_cost(data_size_gb, 'glacier')
    
    print(f"  Standard storage: ${standard_cost:.4f}/month")
    print(f"  Glacier archive: ${glacier_cost:.4f}/month")
    print(f"  Potential savings: ${standard_cost - glacier_cost:.4f}/month")
    
    # Processing cost
    print("\nProcessing Cost (Daily Pipeline):")
    daily_cost = optimizer.calculate_processing_cost(
        run_time_hours=0.1,  # 6 minutes
        vcpu_count=2,
        memory_gb=4,
        use_spot=True
    )
    print(f"  Cost per run: ${daily_cost:.3f}")
    print(f"  Monthly cost (22 trading days): ${daily_cost * 22:.2f}")
    
    # Query optimization
    print("\nQuery Cost Optimization:")
    full_scan_cost = optimizer.optimize_query_cost(0.5, column_optimization=False)
    optimized_cost = optimizer.optimize_query_cost(0.5, column_optimization=True)
    
    print(f"  Full table scan: ${full_scan_cost:.4f} per query")
    print(f"  Column-optimized: ${optimized_cost:.4f} per query")
    print(f"  Savings per query: {(1 - optimized_cost/full_scan_cost)*100:.0f}%")
    
    print("\nCost Optimization Recommendations:")
    print("  1. Move data > 1 year old to Glacier (saves about 80% at the rates above)")
    print("  2. Use Spot instances for batch processing (saves 70% in this model)")
    print("  3. Partition data by date (date-filtered queries scan only matching partitions)")
    print("  4. Compress files (Parquet with Snappy)")


if __name__ == "__main__":
    demonstrate_cost_optimization()
```
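
The demo's `growth_rate` (1 MB per trading day) can be folded into a simple projection of how storage cost grows over time. The sketch below is a worked calculation under stated assumptions: the per-GB rates are the illustrative ones from `CostOptimizer`, growth is linear, and a month has 22 trading days. `project_storage_cost` is a hypothetical helper name.

```python
from typing import Dict, List

# Illustrative monthly per-GB rates, copied from CostOptimizer above
RATES = {"standard": 0.023, "infrequent": 0.0125, "glacier": 0.004}


def project_storage_cost(initial_gb: float,
                         daily_growth_gb: float,
                         months: int,
                         days_per_month: int = 22,
                         storage_class: str = "standard") -> List[Dict[str, float]]:
    """
    Project end-of-month data size and monthly storage cost.

    Assumes linear growth: size increases by daily_growth_gb on each of
    days_per_month trading days.
    """
    rate = RATES[storage_class]
    projection = []
    size_gb = initial_gb

    for month in range(1, months + 1):
        size_gb += daily_growth_gb * days_per_month
        projection.append({
            "month": month,
            "size_gb": round(size_gb, 3),
            "cost_usd": round(size_gb * rate, 4),
        })

    return projection


if __name__ == "__main__":
    for row in project_storage_cost(0.5, 0.001, months=12):
        print(f"  Month {row['month']:>2}: "
              f"{row['size_gb']:.3f} GB -> ${row['cost_usd']:.4f}")
```

At this scale storage stays in the range of cents per month, which suggests that compute and query costs deserve attention before storage tiering does.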

---

## **9.11 Building Production Pipelines**

This section combines the earlier pieces into a production-ready pipeline.

```python
"""
Production Pipeline for NEPSE

Complete integration of all concepts:
- Orchestration with Airflow
- Data quality gates
- Error handling with retries
- Monitoring and alerting
- Cost optimization
"""

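import logging
from datetime import datetime

import pandas as pd

# PipelineMonitor, DeadLetterQueue, RetryWithBackoff, DataQualitySuite, and the
# check_* functions are defined in earlier sections of this chapter and are
# assumed to be in scope (same notebook or imported from your own modules).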

# Configure production logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
    handlers=[
        logging.FileHandler('nepse_pipeline.log'),
        logging.StreamHandler()
    ]
)


class ProductionNEPSEPipeline:
    """
    Production-ready pipeline for NEPSE daily data ingestion.
    
    Features:
    - Comprehensive error handling
    - Data quality validation
    - Performance monitoring
    - Idempotent operations
    - Automatic retries
    """
    
    def __init__(self):
        self.logger = logging.getLogger('ProductionNEPSEPipeline')
        self.monitor = PipelineMonitor('nepse_production')
        self.dlq = DeadLetterQueue('./production_dlq.json')
        self.quality_suite = DataQualitySuite()
        
        # Setup quality checks
        self.quality_suite.add_check(check_no_missing_symbols)
        self.quality_suite.add_check(check_price_positive)
        self.quality_suite.add_check(check_no_duplicates)
        self.quality_suite.add_check(check_price_range_consistency)
    
    @RetryWithBackoff(max_retries=3, base_delay=60)
    def extract(self, date: str):
        """Extract with retry logic."""
        self.logger.info(f"Extracting data for {date}")
        # API call here
        return pd.DataFrame()  # Placeholder
    
    def validate(self, df: pd.DataFrame) -> bool:
        """Run quality gates."""
        self.logger.info("Running data quality checks")
        is_valid = self.quality_suite.validate(df)
        
        report = self.quality_suite.get_report()
        self.monitor.record_metric('quality_checks_passed', report['passed'])
        self.monitor.record_metric('quality_checks_failed', report['failed'])
        
        return is_valid
    
    def transform_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Transform validated data (placeholder: returns an unchanged copy)."""
        self.logger.info("Transforming data")
        return df.copy()
    
    def load_to_production_warehouse(self, df: pd.DataFrame) -> None:
        """Load processed data (placeholder: replace with an idempotent write)."""
        self.logger.info(f"Loading {len(df)} records")
    
    def run(self, date: str = None):
        """Execute full pipeline."""
        if date is None:
            date = datetime.now().strftime('%Y-%m-%d')
        
        run_id = self.monitor.start_run()
        self.logger.info(f"Starting production run: {run_id}")
        
        try:
            # Extract
            raw_data = self.extract(date)
            self.monitor.record_metric('records_extracted', len(raw_data))
            
            # Validate
            if not self.validate(raw_data):
                raise ValueError("Data quality validation failed")
            
            # Transform
            processed = self.transform_data(raw_data)
            
            # Load
            self.load_to_production_warehouse(processed)
            self.monitor.record_metric('records_loaded', len(processed))
            
            # Success
            self.monitor.end_run('success')
            self.logger.info(f"Pipeline completed successfully: {run_id}")
            
            return {'status': 'success', 'run_id': run_id}
            
        except Exception as e:
            self.logger.error(f"Pipeline failed: {str(e)}")
            self.monitor.end_run('failed')
            
            # Save to DLQ for investigation
            self.dlq.add({'date': date}, str(e))
            self.dlq.save()
            
            raise


def demonstrate_production_pipeline():
    """
    Demonstrate production pipeline structure.
    """
    print("=" * 70)
    print("Production Pipeline Architecture")
    print("=" * 70)
    
    print("""
    Production Pipeline Components:
    
    1. Orchestration: Apache Airflow
       - Schedule: Daily at 18:00 (after market close)
       - Retries: 3 attempts with exponential backoff
       - Alerts: Email on failure
    
    2. Data Quality: Great Expectations
       - Schema validation
       - Statistical profiling
       - Automatic documentation
    
    3. Storage: Tiered approach
       - Hot: Last 30 days (SSD)
       - Warm: 1 month to 1 year (Standard S3)
       - Cold: > 1 year (Glacier)
    
    4. Monitoring: Prometheus + Grafana
       - Pipeline duration
       - Success rates
       - Data quality metrics
       - Cost tracking
    
    5. Disaster Recovery:
       - Daily backups to S3
       - Cross-region replication
       - 4-hour RPO (Recovery Point Objective)
    
    6. Security:
       - Encryption at rest (AES-256)
       - TLS in transit
       - IAM roles for service accounts
       - Audit logging
    """)
    
    # Initialize production pipeline
    pipeline = ProductionNEPSEPipeline()
    print(f"\nProduction pipeline initialized: {pipeline.monitor.pipeline_id}")
    
    return pipeline


if __name__ == "__main__":
    demonstrate_production_pipeline()
```
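
The feature list above includes idempotent operations, which matter most in the load step: a retried run must not create duplicate rows. The sketch below shows one way to get that property, assuming SQLite as a stand-in for the production warehouse. The `daily_prices` table and the `create_table` and `idempotent_load` functions are hypothetical names for illustration.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[str, str, float]  # (symbol, date, close)


def create_table(conn: sqlite3.Connection) -> None:
    """Create the target table keyed by (symbol, date)."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_prices (
            symbol TEXT NOT NULL,
            date   TEXT NOT NULL,
            close  REAL NOT NULL,
            PRIMARY KEY (symbol, date)
        )
    """)


def idempotent_load(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """
    Insert rows, replacing any existing row with the same (symbol, date).

    Returns the total number of rows in the table after the load.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO daily_prices (symbol, date, close) "
            "VALUES (?, ?, ?)",
            list(rows),
        )
    return conn.execute("SELECT COUNT(*) FROM daily_prices").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    rows = [("NABIL", "2024-01-15", 865.0), ("NICA", "2024-01-15", 790.0)]

    print(f"After first load: {idempotent_load(conn, rows)} rows")
    print(f"After retry:      {idempotent_load(conn, rows)} rows")
```

The primary key on `(symbol, date)` is what makes the retry safe: loading the same day twice overwrites rather than appends. Other warehouses offer the same guarantee through their own upsert or merge statements.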

---

## **Chapter Summary**

In this chapter, we covered comprehensive data pipeline strategies:

### **Key Takeaways:**

1. **Architecture Patterns**:
   - **ETL**: Transform before load (protects warehouse)
   - **ELT**: Load then transform (leverages DB power)
   - **Lambda**: Batch + speed layers for different latencies
   - **Medallion**: Bronze (raw) → Silver (clean) → Gold (features)

2. **Batch Processing**:
   - Idempotent operations (safe to retry)
   - Checkpointing (resume from failures)
   - Validation gates (stop bad data early)

3. **Stream Processing**:
   - Real-time tick processing
   - Windowed aggregations (moving averages)
   - Anomaly detection with cooldowns

4. **Orchestration**:
   - **Airflow**: Industry standard, DAG-based
   - **Prefect**: Modern, Pythonic, better error handling
   - **Dagster**: Asset-centric, strong typing

5. **Data Quality**:
   - Schema validation (types, ranges)
   - Business rules (high >= low, positive prices)
   - Great Expectations library for comprehensive checks

6. **Monitoring**:
   - Operational metrics (runtime, success rate)
   - Data drift detection (distribution changes)
   - Alerting on failures and anomalies

7. **Error Handling**:
   - Exponential backoff for retries
   - Circuit breakers (stop hammering failing services)
   - Dead letter queues (save failed records)

8. **Testing**:
   - Unit tests for transformations
   - Integration tests for databases
   - Data quality tests (schema, duplicates)

9. **Scalability**:
   - Parallel processing (multiprocessing by symbol)
   - Partitioning strategies (date-based)
   - Tools: Dask (medium), Spark (large)

10. **Cost Optimization**:
    - Storage tiers (hot/warm/cold)
    - Spot instances for batch jobs
    - Column pruning for queries

11. **Production Readiness**:
    - Comprehensive logging
    - Automated retries
    - Quality gates
    - Disaster recovery
    - Security (encryption, IAM)

### **Next Steps:**

Chapter 10 will cover **Feature Engineering**, including:
- Creating technical indicators (SMA, RSI, MACD)
- Lag features and rolling windows
- Domain-specific features for finance
- Feature selection and importance
- Automated feature engineering with tsfresh

---

**End of Chapter 9**

