# Day 02: Data Pipeline for Capstone Project

## Week 24 - Capstone Project

**Objective**: Build a robust, production-ready data pipeline for our ML trading system capstone project.

### Topics Covered:
1. Data Pipeline Architecture Overview
2. Data Sources Configuration
3. Data Extraction Layer (APIs, market data)
4. Data Transformation Pipeline
5. Data Validation and Quality Checks
6. Feature Engineering Pipeline
7. Data Storage and Caching
8. Pipeline Orchestration
9. End-to-End Pipeline Testing

---

### Why Data Pipelines Matter in Quant Finance

A well-designed data pipeline is the foundation of any successful ML trading system:

- **Reliability**: Consistent, clean data is critical for model performance
- **Reproducibility**: Pipelines ensure experiments can be replicated
- **Scalability**: Handle growing data volumes efficiently
- **Automation**: Reduce manual intervention and human error
- **Monitoring**: Track data quality and pipeline health

## 1. Import Required Libraries

In [None]:
# Core Libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Data Fetching
import yfinance as yf

# Data Validation
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass, field
from abc import ABC, abstractmethod
import hashlib
import json

# File I/O and Caching
import os
from pathlib import Path
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Logging
import logging

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('DataPipeline')

print("✅ All libraries imported successfully!")

## 2. Data Pipeline Architecture

### Pipeline Design Pattern

```
┌──────────────────────────────────────────────────────────────────────┐
│                      DATA PIPELINE ARCHITECTURE                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐      │
│   │   DATA   │───▶│   DATA   │───▶│   DATA   │───▶│  FEATURE  │      │
│   │ SOURCES  │    │EXTRACTION│    │TRANSFORM │    │ENGINEERING│      │
│   └──────────┘    └──────────┘    └──────────┘    └───────────┘      │
│        │               │               │               │             │
│        ▼               ▼               ▼               ▼             │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐      │
│   │  CONFIG  │    │VALIDATION│    │ QUALITY  │    │  FEATURE  │      │
│   │  LAYER   │    │  LAYER   │    │  CHECKS  │    │   STORE   │      │
│   └──────────┘    └──────────┘    └──────────┘    └───────────┘      │
│                                                                      │
│                    ┌────────────────────┐                            │
│                    │   ORCHESTRATION    │                            │
│                    │   & MONITORING     │                            │
│                    └────────────────────┘                            │
└──────────────────────────────────────────────────────────────────────┘
```
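
The stage flow in the diagram can be sketched as a chain of functions, each taking and returning a DataFrame. This is a minimal illustration rather than the capstone's implementation: `run_pipeline`, `drop_empty_rows`, and `add_returns` are hypothetical names, and the input is synthetic.

```python
from typing import Callable, List

import pandas as pd

# Hypothetical stage type: every stage maps a DataFrame to a DataFrame.
Stage = Callable[[pd.DataFrame], pd.DataFrame]


def run_pipeline(df: pd.DataFrame, stages: List[Stage]) -> pd.DataFrame:
    """Apply each stage in order, passing the output of one to the next."""
    for stage in stages:
        df = stage(df)
    return df


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Transform stage: remove rows where every column is missing."""
    return df.dropna(how="all")


def add_returns(df: pd.DataFrame) -> pd.DataFrame:
    """Feature stage: add simple daily returns of the close column."""
    out = df.copy()
    out["return_1d"] = out["close"].pct_change()
    return out


# Synthetic input standing in for extracted market data.
raw = pd.DataFrame(
    {"close": [100.0, None, 102.0, 101.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
result = run_pipeline(raw, [drop_empty_rows, add_returns])
```

Keeping every stage to the same signature means validation or caching steps can be slotted into the list without changing the runner.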

## 3. Define Data Sources and Configuration

In [None]:
@dataclass
class PipelineConfig:
    """Configuration for the data pipeline."""
    
    # Ticker Configuration
    tickers: List[str] = field(default_factory=lambda: [
        'SPY', 'QQQ', 'IWM',  # Major ETFs
        'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META',  # Tech Giants
        'JPM', 'GS', 'BAC',  # Financials
        'XLE', 'GLD', 'TLT'  # Sectors/Commodities/Bonds
    ])
    
    # Date Range
    start_date: str = '2020-01-01'
    end_date: str = '2025-12-31'
    
    # Data Frequency
    frequency: str = '1d'  # '1d', '1h', '5m', etc.
    
    # Storage Configuration
    data_dir: str = './pipeline_data'
    cache_dir: str = './pipeline_cache'
    
    # Validation Thresholds
    max_missing_pct: float = 0.05  # Max 5% missing data
    min_trading_days: int = 252  # At least 1 year of data
    max_price_change_pct: float = 0.50  # Flag >50% daily moves
    
    # Feature Engineering
    lookback_windows: List[int] = field(default_factory=lambda: [5, 10, 20, 60, 120, 252])
    
    # Pipeline Settings
    enable_caching: bool = True
    parallel_downloads: bool = True
    
    def __post_init__(self):
        """Create necessary directories."""
        Path(self.data_dir).mkdir(parents=True, exist_ok=True)
        Path(self.cache_dir).mkdir(parents=True, exist_ok=True)


# Initialize configuration
config = PipelineConfig()
print("📋 Pipeline Configuration:")
print(f"   Tickers: {len(config.tickers)} symbols")
print(f"   Date Range: {config.start_date} to {config.end_date}")
print(f"   Data Directory: {config.data_dir}")
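
Because the configuration is a dataclass, a run's settings can be overridden per experiment and fingerprinted for reproducibility. The sketch below uses a trimmed-down, hypothetical `MiniConfig` (not the full `PipelineConfig`) and a hypothetical `config_fingerprint` helper to show the idea.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical, trimmed-down stand-in for PipelineConfig."""
    tickers: List[str] = field(default_factory=lambda: ["SPY", "QQQ"])
    start_date: str = "2020-01-01"
    end_date: str = "2025-12-31"
    frequency: str = "1d"


def config_fingerprint(cfg: MiniConfig) -> str:
    """Hash the config's JSON form so each run can be tagged with its settings."""
    payload = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = MiniConfig()
# dataclasses.replace returns a copy with selected fields overridden.
smoke_test = replace(base, tickers=["SPY"], start_date="2024-01-01")

base_id = config_fingerprint(base)
smoke_id = config_fingerprint(smoke_test)
```

Two runs with identical settings get the same fingerprint, so it could serve as part of a cache key or an output folder name.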

## 4. Create Data Extraction Functions

### 4.1 Base Data Extractor Interface

In [None]:
class DataExtractor(ABC):
    """Abstract base class for data extractors."""
    
    @abstractmethod
    def extract(self, symbols: List[str], start_date: str, end_date: str) -> pd.DataFrame:
        """Extract data for given symbols and date range."""
        pass
    
    @abstractmethod
    def get_source_name(self) -> str:
        """Return the name of the data source."""
        pass


class YFinanceExtractor(DataExtractor):
    """Extract market data using yfinance API."""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.logger = logging.getLogger('YFinanceExtractor')
    
    def get_source_name(self) -> str:
        return "Yahoo Finance"
    
    def extract(self, symbols: List[str], start_date: str, end_date: str) -> pd.DataFrame:
        """
        Extract OHLCV data for multiple symbols.
        
        Returns:
            DataFrame with MultiIndex columns (ticker, field)
        """
        self.logger.info(f"Extracting data for {len(symbols)} symbols...")
        
        all_data = {}
        failed_symbols = []
        
        for symbol in symbols:
            try:
                ticker = yf.Ticker(symbol)
                df = ticker.history(start=start_date, end=end_date, interval=self.config.frequency)
                
                if df.empty:
                    self.logger.warning(f"No data returned for {symbol}")
                    failed_symbols.append(symbol)
                    continue
                
                # Standardize column names
                df.columns = df.columns.str.lower().str.replace(' ', '_')
                
                # Keep essential columns
                essential_cols = ['open', 'high', 'low', 'close', 'volume']
                available_cols = [col for col in essential_cols if col in df.columns]
                df = df[available_cols]
                
                all_data[symbol] = df
                self.logger.info(f"✓ {symbol}: {len(df)} records")
                
            except Exception as e:
                self.logger.error(f"Error extracting {symbol}: {str(e)}")
                failed_symbols.append(symbol)
        
        if not all_data:
            raise ValueError("No data extracted for any symbol")
        
        # Combine into single DataFrame with MultiIndex columns
        combined_df = pd.concat(all_data, axis=1)
        combined_df.index = pd.to_datetime(combined_df.index).tz_localize(None)
        
        self.logger.info(f"✅ Extraction complete: {len(all_data)} symbols, {len(combined_df)} records")
        if failed_symbols:
            self.logger.warning(f"Failed symbols: {failed_symbols}")
        
        return combined_df
    
    def extract_single(self, symbol: str, start_date: str, end_date: str) -> pd.DataFrame:
        """Extract data for a single symbol with additional metadata."""
        ticker = yf.Ticker(symbol)
        
        # Get price data
        price_data = ticker.history(start=start_date, end=end_date, interval=self.config.frequency)
        price_data.columns = price_data.columns.str.lower().str.replace(' ', '_')
        
        # Get fundamental info (for enrichment)
        try:
            info = ticker.info
            sector = info.get('sector', 'Unknown')
            industry = info.get('industry', 'Unknown')
            market_cap = info.get('marketCap', np.nan)
        except Exception:  # metadata is optional; a bare except would also swallow KeyboardInterrupt
            sector, industry, market_cap = 'Unknown', 'Unknown', np.nan
        
        # Add metadata columns
        price_data['symbol'] = symbol
        price_data['sector'] = sector
        price_data['industry'] = industry
        price_data['market_cap'] = market_cap
        
        return price_data


# Initialize extractor
extractor = YFinanceExtractor(config)
print(f"✅ Data Extractor initialized: {extractor.get_source_name()}")
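
Network calls to a market-data API can fail transiently, and the extractor above tries each symbol only once. One common mitigation is retrying with exponential backoff. The sketch below is an illustration under assumptions: `fetch_with_retry` and `flaky_fetch` are hypothetical, and the fetch function is a stub rather than a real yfinance call.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def fetch_with_retry(fetch: Callable[[], T], max_attempts: int = 3,
                     base_delay: float = 0.01) -> T:
    """Call fetch(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Stub that fails twice before succeeding, standing in for a real API call.
calls = {"count": 0}


def flaky_fetch() -> Dict[str, float]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"close": 101.5}


data = fetch_with_retry(flaky_fetch)
```

In a real extractor the delay would likely be longer, and only errors known to be transient would be retried.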

## 5. Implement Data Transformation Pipeline

### 5.1 Data Transformers

In [None]:
class DataTransformer:
    """Transform and clean raw market data."""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.logger = logging.getLogger('DataTransformer')
    
    def handle_missing_values(self, df: pd.DataFrame, method: str = 'ffill') -> pd.DataFrame:
        """
        Handle missing values in the dataset.
        
        Methods:
        - 'ffill': Forward fill (appropriate for prices)
        - 'interpolate': Linear interpolation
        - 'drop': Drop rows with missing values
        """
        missing_before = df.isna().sum().sum()
        
        if method == 'ffill':
            df = df.ffill().bfill()  # Forward fill, then backfill for start
        elif method == 'interpolate':
            df = df.interpolate(method='time')
        elif method == 'drop':
            df = df.dropna()
        
        missing_after = df.isna().sum().sum()
        self.logger.info(f"Missing values: {missing_before} → {missing_after}")
        
        return df
    
    def detect_outliers(self, series: pd.Series, method: str = 'zscore', 
                        threshold: float = 3.0) -> pd.Series:
        """
        Detect outliers in a series.
        
        Methods:
        - 'zscore': Z-score method
        - 'iqr': Interquartile range method
        - 'mad': Median absolute deviation
        """
        if method == 'zscore':
            z_scores = np.abs((series - series.mean()) / series.std())
            return z_scores > threshold
        
        elif method == 'iqr':
            Q1, Q3 = series.quantile([0.25, 0.75])
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            return (series < lower_bound) | (series > upper_bound)
        
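        elif method == 'mad':
            median = series.median()
            # pandas' median skips NaN; np.median would return NaN if any value is missing
            mad = (series - median).abs().median()
            if mad == 0:
                # Degenerate case (at least half the values equal the median): avoid dividing by zero
                return pd.Series(False, index=series.index)
            modified_z = 0.6745 * (series - median) / mad
            return np.abs(modified_z) > threshold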
        
        return pd.Series(False, index=series.index)
    
    def handle_outliers(self, df: pd.DataFrame, columns: Optional[List[str]] = None,
                        method: str = 'clip') -> Tuple[pd.DataFrame, Dict]:
        """
        Handle outliers in specified columns.
        
        Methods:
        - 'clip': Clip to percentile bounds (winsorize)
        - 'remove': Remove outlier rows
        - 'replace': Replace with median
        """
        if columns is None:
            # Only apply to return columns
            columns = [col for col in df.columns if 'return' in str(col).lower()]
        
        df = df.copy()  # work on a copy so the caller's DataFrame is not mutated
        outlier_report = {}
        
        for col in columns:
            if col not in df.columns:
                continue
            
            outliers = self.detect_outliers(df[col])
            outlier_count = outliers.sum()
            outlier_report[col] = outlier_count
            
            if outlier_count > 0:
                if method == 'clip':
                    lower = df[col].quantile(0.01)
                    upper = df[col].quantile(0.99)
                    df[col] = df[col].clip(lower, upper)
                elif method == 'replace':
                    median = df[col].median()
                    df.loc[outliers, col] = median
                elif method == 'remove':
                    df = df[~outliers]
        
        self.logger.info(f"Outliers handled: {sum(outlier_report.values())} total")
        return df, outlier_report
    
    def normalize_data(self, df: pd.DataFrame, columns: List[str],
                       method: str = 'zscore') -> Tuple[pd.DataFrame, Dict]:
        """
        Normalize specified columns.
        
        Methods:
        - 'zscore': Standardization (mean=0, std=1)
        - 'minmax': Min-Max scaling [0, 1]
        - 'robust': Robust scaling using median/IQR
        """
        scalers = {}
        df_normalized = df.copy()
        
        for col in columns:
            if col not in df.columns:
                continue
            
            if method == 'zscore':
                scaler = StandardScaler()
            elif method == 'minmax':
                scaler = MinMaxScaler()
            elif method == 'robust':
                scaler = RobustScaler()
            else:
                raise ValueError(f"Unknown normalization method: {method}")
            
            values = df[col].values.reshape(-1, 1)
            df_normalized[col] = scaler.fit_transform(values).flatten()
            scalers[col] = scaler
        
        return df_normalized, scalers
    
    def resample_data(self, df: pd.DataFrame, frequency: str = 'W') -> pd.DataFrame:
        """
        Resample time series data to different frequency.
        
        Frequencies:
        - 'D': Daily
        - 'W': Weekly
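        - 'M': Monthly (spelled 'ME' in pandas >= 2.2, where 'M' is deprecated)
        - 'Q': Quarterly (spelled 'QE' in pandas >= 2.2, where 'Q' is deprecated)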
        """
        # OHLCV aggregation rules
        agg_rules = {
            'open': 'first',
            'high': 'max',
            'low': 'min',
            'close': 'last',
            'volume': 'sum'
        }
        
        if isinstance(df.columns, pd.MultiIndex):
            # Handle MultiIndex columns
            resampled = {}
            for symbol in df.columns.get_level_values(0).unique():
                symbol_df = df[symbol]
                symbol_agg = {col: agg_rules.get(col, 'last') 
                            for col in symbol_df.columns if col in agg_rules}
                resampled[symbol] = symbol_df.resample(frequency).agg(symbol_agg)
            return pd.concat(resampled, axis=1)
        else:
            available_rules = {col: agg_rules.get(col, 'last') 
                            for col in df.columns if col in agg_rules}
            return df.resample(frequency).agg(available_rules)


# Initialize transformer
transformer = DataTransformer(config)
print("✅ Data Transformer initialized")
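
`normalize_data` fits each scaler on the whole column. For a model that will be evaluated out of sample, that lets test-period statistics leak into the training features. A common alternative is to estimate the scaling parameters on the training window only and reuse them afterwards. The sketch below does this by hand with pandas on synthetic data; `fit_zscore` and `apply_zscore` are hypothetical helpers, and the 80/20 split is an arbitrary choice.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def fit_zscore(train: pd.Series) -> Tuple[float, float]:
    """Estimate mean and standard deviation from the training window only."""
    return float(train.mean()), float(train.std(ddof=0))


def apply_zscore(series: pd.Series, mean: float, std: float) -> pd.Series:
    """Scale any window with parameters that were fitted elsewhere."""
    return (series - mean) / std


# Synthetic daily returns standing in for a real feature column.
rng = np.random.default_rng(seed=0)
returns = pd.Series(
    rng.normal(loc=0.0, scale=0.01, size=500),
    index=pd.bdate_range("2022-01-03", periods=500),
)

split = int(len(returns) * 0.8)  # chronological split, no shuffling
train, test = returns.iloc[:split], returns.iloc[split:]

mean, std = fit_zscore(train)
train_scaled = apply_zscore(train, mean, std)
test_scaled = apply_zscore(test, mean, std)  # reuses the training parameters
```

The same pattern applies to the scikit-learn scalers: call `fit` on the training rows and `transform` on everything else.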

## 6. Build Data Validation Layer

### 6.1 Validation Rules and Quality Checks

In [None]:
@dataclass
class ValidationResult:
    """Result of a validation check."""
    check_name: str
    passed: bool
    message: str
    details: Dict = field(default_factory=dict)
    severity: str = 'error'  # 'error', 'warning', 'info'


class DataValidator:
    """Validate data quality and integrity."""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.logger = logging.getLogger('DataValidator')
        self.validation_results: List[ValidationResult] = []
    
    def check_missing_values(self, df: pd.DataFrame, threshold: Optional[float] = None) -> ValidationResult:
        """Check for missing values exceeding threshold."""
        if threshold is None:
            threshold = self.config.max_missing_pct
        
        missing_pct = df.isna().sum().sum() / df.size
        passed = missing_pct <= threshold
        
        result = ValidationResult(
            check_name="missing_values",
            passed=passed,
            message=f"Missing values: {missing_pct:.2%} (threshold: {threshold:.2%})",
            details={'missing_pct': missing_pct, 'threshold': threshold},
            severity='error' if not passed else 'info'
        )
        self.validation_results.append(result)
        return result
    
    def check_date_continuity(self, df: pd.DataFrame, max_gap_days: int = 5) -> ValidationResult:
        """Check for gaps in date series."""
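        if not isinstance(df.index, pd.DatetimeIndex):
            result = ValidationResult(
                check_name="date_continuity",
                passed=False,
                message="Index is not DatetimeIndex",
                severity='error'
            )
            # Record the failure so validate_all() and print_report() see it
            self.validation_results.append(result)
            return result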
        
        date_diffs = df.index.to_series().diff().dt.days
        max_gap = date_diffs.max()
        gap_dates = df.index[date_diffs > max_gap_days].tolist()
        
        passed = len(gap_dates) == 0
        
        result = ValidationResult(
            check_name="date_continuity",
            passed=passed,
            message=f"Max date gap: {max_gap} days, Gaps > {max_gap_days} days: {len(gap_dates)}",
            details={'max_gap': max_gap, 'gap_dates': gap_dates[:5]},  # First 5 gaps
            severity='warning' if not passed else 'info'
        )
        self.validation_results.append(result)
        return result
    
    def check_price_sanity(self, df: pd.DataFrame, price_col: str = 'close') -> ValidationResult:
        """Check for unrealistic price movements."""
        issues = []
        
        # Check for negative prices
        if isinstance(df.columns, pd.MultiIndex):
            for symbol in df.columns.get_level_values(0).unique():
                if (symbol, price_col) in df.columns:
                    prices = df[(symbol, price_col)]
                    if (prices <= 0).any():
                        issues.append(f"{symbol}: negative/zero prices")
                    
                    returns = prices.pct_change()
                    extreme_moves = (returns.abs() > self.config.max_price_change_pct).sum()
                    if extreme_moves > 0:
                        issues.append(f"{symbol}: {extreme_moves} moves > {self.config.max_price_change_pct:.0%}")
        else:
            if price_col in df.columns:
                prices = df[price_col]
                if (prices <= 0).any():
                    issues.append("Negative/zero prices found")
                
                returns = prices.pct_change()
                extreme_moves = (returns.abs() > self.config.max_price_change_pct).sum()
                if extreme_moves > 0:
                    issues.append(f"{extreme_moves} moves > {self.config.max_price_change_pct:.0%}")
        
        passed = len(issues) == 0
        
        result = ValidationResult(
            check_name="price_sanity",
            passed=passed,
            message=f"Price sanity check: {len(issues)} issues found",
            details={'issues': issues},
            severity='warning' if not passed else 'info'
        )
        self.validation_results.append(result)
        return result
    
    def check_minimum_history(self, df: pd.DataFrame, min_days: Optional[int] = None) -> ValidationResult:
        """Check for minimum historical data."""
        if min_days is None:
            min_days = self.config.min_trading_days
        
        actual_days = len(df)
        passed = actual_days >= min_days
        
        result = ValidationResult(
            check_name="minimum_history",
            passed=passed,
            message=f"Trading days: {actual_days} (minimum: {min_days})",
            details={'actual_days': actual_days, 'min_days': min_days},
            severity='error' if not passed else 'info'
        )
        self.validation_results.append(result)
        return result
    
    def check_schema(self, df: pd.DataFrame, required_columns: List[str]) -> ValidationResult:
        """Check if required columns exist."""
        if isinstance(df.columns, pd.MultiIndex):
            available_cols = df.columns.get_level_values(1).unique().tolist()
        else:
            available_cols = df.columns.tolist()
        
        missing_cols = [col for col in required_columns if col not in available_cols]
        passed = len(missing_cols) == 0
        
        result = ValidationResult(
            check_name="schema_validation",
            passed=passed,
            message=f"Schema check: {len(missing_cols)} missing columns",
            details={'missing_columns': missing_cols, 'available_columns': available_cols},
            severity='error' if not passed else 'info'
        )
        self.validation_results.append(result)
        return result
    
    def validate_all(self, df: pd.DataFrame) -> Tuple[bool, List[ValidationResult]]:
        """Run all validation checks."""
        self.validation_results = []
        
        # Run all checks
        self.check_missing_values(df)
        self.check_date_continuity(df)
        self.check_price_sanity(df)
        self.check_minimum_history(df)
        self.check_schema(df, ['open', 'high', 'low', 'close', 'volume'])
        
        # Determine overall pass/fail
        errors = [r for r in self.validation_results if r.severity == 'error' and not r.passed]
        all_passed = len(errors) == 0
        
        return all_passed, self.validation_results
    
    def print_report(self):
        """Print validation report."""
        print("\n" + "="*60)
        print("📊 DATA VALIDATION REPORT")
        print("="*60)
        
        for result in self.validation_results:
            status = "✅" if result.passed else ("⚠️" if result.severity == 'warning' else "❌")
            print(f"{status} {result.check_name}: {result.message}")
        
        error_count = sum(1 for r in self.validation_results if not r.passed and r.severity == 'error')
        # Named warning_count so it does not shadow the imported `warnings` module
        warning_count = sum(1 for r in self.validation_results if not r.passed and r.severity == 'warning')
        
        print("="*60)
        print(f"Summary: {error_count} errors, {warning_count} warnings")
        print("="*60 + "\n")


# Initialize validator
validator = DataValidator(config)
print("✅ Data Validator initialized")
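
`check_date_continuity` measures gaps in calendar days, so an ordinary weekend shows up as a 3-day gap and the default `max_gap_days=5` only flags longer breaks. The sketch below reproduces that logic on a synthetic business-day index with one week removed; `find_gaps` is a hypothetical helper, not part of `DataValidator`.

```python
import pandas as pd


def find_gaps(index: pd.DatetimeIndex, max_gap_days: int = 5) -> pd.Series:
    """Return calendar-day gaps larger than max_gap_days, keyed by the date after the gap."""
    gaps = index.to_series().diff().dt.days
    return gaps[gaps > max_gap_days]


# Business days for January 2024, with the week of 15-19 January removed.
full = pd.bdate_range("2024-01-01", "2024-01-31")
missing_week = pd.bdate_range("2024-01-15", "2024-01-19")
index = full.difference(missing_week)

gaps = find_gaps(index)
```

Here the only flagged entry is the first date after the missing week; the weekend gaps stay below the threshold.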

## 7. Create Feature Engineering Pipeline

### 7.1 Technical Indicators and Features