# **Chapter 6: Data Cleaning and Preprocessing**

---

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Develop systematic data cleaning strategies for time-series data
- Identify and handle duplicate records effectively
- Standardize data types and formats across datasets
- Analyze missing data patterns (MCAR, MAR, MNAR) and choose appropriate imputation strategies
- Apply advanced imputation techniques including KNN and iterative methods
- Detect and treat outliers using statistical, ML, and domain-specific approaches
- Implement data smoothing and noise reduction techniques
- Build automated preprocessing pipelines for production systems
- Ensure reproducibility in data preprocessing steps

---

## **Prerequisites**

- Completed Chapter 4: Data Fundamentals and Chapter 5: Data Collection
- Understanding of pandas DataFrame operations
- Basic statistical concepts (mean, median, standard deviation)
- Familiarity with NEPSE data structure (OHLCV format)

---

## **6.1 Data Cleaning Strategy**

Data cleaning is not just about fixing errors—it's about understanding your data's limitations and ensuring it meets the requirements of your prediction models. A systematic approach prevents ad-hoc decisions that can introduce bias.

```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import logging
from typing import Dict, List, Tuple, Optional

class DataCleaningStrategy:
    """
    Framework for systematic data cleaning in time-series prediction systems.
    
    The strategy follows these principles:
    1. Assess before acting (understand the data first)
    2. Document everything (track what was changed and why)
    3. Preserve raw data (never modify source files)
    4. Validate after cleaning (ensure quality improved, not degraded)
    5. Make reproducible (same cleaning steps every time)
    """
    
    def __init__(self, raw_data: pd.DataFrame, symbol: str):
        self.raw_data = raw_data.copy()
        self.symbol = symbol
        self.cleaning_log = []
        self.quality_metrics = {}
        
        # Calculate initial quality metrics
        self._assess_initial_quality()
    
    def _assess_initial_quality(self):
        """
        Baseline assessment before any cleaning.
        
        These metrics help determine if cleaning improved the data
        or accidentally removed useful information.
        """
        self.quality_metrics['initial'] = {
            'total_rows': len(self.raw_data),
            'missing_values': self.raw_data.isna().sum().sum(),
            'duplicate_rows': self.raw_data.duplicated().sum(),
            'date_range': (self.raw_data.index.min(), self.raw_data.index.max()) 
                          if isinstance(self.raw_data.index, pd.DatetimeIndex) else None,
            'numeric_outliers': self._count_outliers(),
            'memory_usage_mb': self.raw_data.memory_usage(deep=True).sum() / 1024 / 1024
        }
        
        print(f"Initial Quality Assessment for {self.symbol}:")
        print(f"  Records: {self.quality_metrics['initial']['total_rows']}")
        print(f"  Missing: {self.quality_metrics['initial']['missing_values']}")
        print(f"  Duplicates: {self.quality_metrics['initial']['duplicate_rows']}")
        print(f"  Outliers: {self.quality_metrics['initial']['numeric_outliers']}")
    
    def _count_outliers(self) -> int:
        """Count outliers using IQR method for numeric columns."""
        numeric_cols = self.raw_data.select_dtypes(include=[np.number]).columns
        outlier_count = 0
        
        for col in numeric_cols:
            Q1 = self.raw_data[col].quantile(0.25)
            Q3 = self.raw_data[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            outlier_count += ((self.raw_data[col] < lower) | (self.raw_data[col] > upper)).sum()
        
        return outlier_count
    
    def log_action(self, action: str, details: str, rows_affected: int):
        """Log every cleaning action for audit trail."""
        self.cleaning_log.append({
            'timestamp': datetime.now().isoformat(),
            'action': action,
            'details': details,
            'rows_affected': rows_affected,
            'symbol': self.symbol
        })
    
    def get_cleaning_report(self) -> pd.DataFrame:
        """Generate comprehensive cleaning report."""
        if not self.cleaning_log:
            return pd.DataFrame()
        
        log_df = pd.DataFrame(self.cleaning_log)
        
        # Calculate final metrics if not done
        if 'final' not in self.quality_metrics:
            self.quality_metrics['final'] = {
                'total_rows': len(self.raw_data),
                'missing_values': self.raw_data.isna().sum().sum(),
                'duplicate_rows': self.raw_data.duplicated().sum(),
                'numeric_outliers': self._count_outliers()
            }
        
        # Calculate changes
        initial = self.quality_metrics['initial']
        final = self.quality_metrics['final']
        
        summary = {
            'metric': ['Total Rows', 'Missing Values', 'Duplicate Rows', 'Outliers'],
            'initial': [initial['total_rows'], initial['missing_values'], 
                       initial['duplicate_rows'], initial['numeric_outliers']],
            'final': [final['total_rows'], final['missing_values'],
                     final['duplicate_rows'], final['numeric_outliers']],
            'change': [
                final['total_rows'] - initial['total_rows'],
                final['missing_values'] - initial['missing_values'],
                final['duplicate_rows'] - initial['duplicate_rows'],
                final['numeric_outliers'] - initial['numeric_outliers']
            ]
        }
        
        return pd.DataFrame(summary)

# Usage example with NEPSE data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=20, freq='B')

# Create sample data with intentional issues
raw_nepse = pd.DataFrame({
    'Open': [2850.50, 2875.25, np.nan, 2865.75, 2880.50] * 4,
    'High': [2890.00, 2910.00, 2920.00, 2900.00, 2915.00] * 4,
    'Low': [2840.00, 2860.00, 2880.00, 2850.00, 2870.00] * 4,
    'Close': [2875.25, 2895.50, 2900.00, np.nan, 2905.00] * 4,
    'Volume': [125000, 150000, 175000, 140000, np.nan] * 4
}, index=dates)

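# Add a duplicate of the first row (same date, same values).
# Note: DataFrame.duplicated() compares column values only (not the index),
# so the repeating 5-row pattern above also counts toward the duplicate total.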
duplicate_row = raw_nepse.iloc[0:1].copy()
raw_nepse = pd.concat([raw_nepse, duplicate_row])

# Initialize strategy
strategy = DataCleaningStrategy(raw_nepse, 'NABIL')

print("\nCleaning log initialized")
print("Strategy: Document all changes, preserve raw data, validate results")
```

**Explanation:**
- **Systematic cleaning** requires a framework, not ad-hoc fixes.
- **The strategy class** tracks:
  - **Initial metrics**: Baseline quality before cleaning
  - **Actions log**: Every modification documented with timestamp
  - **Final metrics**: Quality after cleaning
  - **Comparison**: Did we improve or degrade quality?
- **Key principle**: Never modify raw data files. Always work on copies and save cleaned versions separately.
- **Quality metrics**:
  - Total rows: Did we accidentally delete valid data?
  - Missing values: How complete is the data?
  - Duplicates: Are there redundant records?
  - Outliers: How many extreme values exist?
- **For NEPSE data**, tracking these metrics is crucial because:
  - Stock data should have no duplicates (each date-symbol pair is unique)
  - Missing values might indicate trading halts (information, not just errors)
  - Outliers might be genuine market events (earnings announcements, crashes)
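
The before/after comparison described above can be sketched without the full class. This is a minimal illustration under stated assumptions: the helper name `quality_snapshot` and the four sample rows are hypothetical and are not part of the chapter's framework.

```python
import numpy as np
import pandas as pd


def quality_snapshot(df: pd.DataFrame) -> dict:
    """Collect headline quality metrics for one DataFrame."""
    return {
        "total_rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        # reset_index() makes the date part of the comparison, so two
        # different days with identical prices are not counted as duplicates
        "duplicate_rows": int(df.reset_index().duplicated().sum()),
    }


# Hypothetical sample: one missing close and one repeated row
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-03"])
raw = pd.DataFrame(
    {
        "Close": [2875.25, np.nan, 2900.00, 2900.00],
        "Volume": [125000, 150000, 175000, 175000],
    },
    index=dates,
)

before = quality_snapshot(raw)

# One cleaning step, producing a new frame; `raw` itself is left untouched
cleaned = raw[~raw.reset_index().duplicated().to_numpy()]
after = quality_snapshot(cleaned)

report = pd.DataFrame({"initial": before, "final": after})
report["change"] = report["final"] - report["initial"]
print(report)
```

A negative change in `total_rows` that matches the drop in `duplicate_rows` suggests that only redundant records were removed. A larger drop would be worth investigating before trusting the cleaned data.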

---

## **6.2 Duplicate Detection and Removal**

Duplicates in time-series data can skew statistical analyses and cause data leakage in train-test splits. However, not all duplicates are errors—some represent legitimate repeated measurements.
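
To make the leakage risk concrete, the sketch below splits a small frame by position before and after de-duplication. The data is synthetic and the scenario (a collection job that re-appends the first three rows) is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 business days, then the first 3 rows re-appended,
# as might happen when a collection job retries
dates = pd.date_range("2024-01-01", periods=10, freq="B")
base = pd.DataFrame({"Close": np.linspace(2800.0, 2890.0, 10)}, index=dates)
with_dups = pd.concat([base, base.iloc[:3]])

# Positional 80/20 split without sorting or de-duplicating first
split = int(len(with_dups) * 0.8)
train, test = with_dups.iloc[:split], with_dups.iloc[split:]
leaked = train.index.intersection(test.index)
print(f"Dates present in both train and test: {len(leaked)}")

# De-duplicate and sort before splitting, and the overlap disappears
clean = with_dups[~with_dups.index.duplicated(keep="first")].sort_index()
split = int(len(clean) * 0.8)
train, test = clean.iloc[:split], clean.iloc[split:]
leaked_after = train.index.intersection(test.index)
print(f"After cleaning: {len(leaked_after)}")
```

Any date that appears on both sides of the split lets the model see test-period values during training, which inflates evaluation scores.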

```python
class DuplicateHandler:
    """
    Handle duplicates in time-series data with awareness of context.
    
    Types of duplicates in financial data:
    1. Exact duplicates: Same timestamp, same values (data entry error)
    2. Partial duplicates: Same timestamp, different values (correction/update)
    3. Near-duplicates: Slightly different timestamps (timezone issues)
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.duplicate_analysis = {}
    
    def analyze_duplicates(self) -> Dict:
        """
        Comprehensive duplicate analysis.
        """
        analysis = {}
        
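        # 1. Exact duplicates (same timestamp and all columns identical)
        # reset_index() brings the timestamp into the comparison
        exact_dups = self.df.reset_index().duplicated(keep=False).to_numpy()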
        analysis['exact_duplicates'] = {
            'count': exact_dups.sum(),
            'indices': self.df[exact_dups].index.tolist()
        }
        
        # 2. Index duplicates (same timestamp, any values)
        idx_dups = self.df.index.duplicated(keep=False)
        analysis['index_duplicates'] = {
            'count': idx_dups.sum(),
            'groups': self._analyze_index_duplicates()
        }
        
        # 3. Subset duplicates (same date and symbol)
        if 'Symbol' in self.df.columns:
            # The timestamp lives in the index, so move it into a column first
            flat = self.df.reset_index()
            subset_cols = [flat.columns[0], 'Symbol']
            subset_dups = flat.duplicated(subset=subset_cols, keep=False)
            analysis['subset_duplicates'] = {
                'count': subset_dups.sum(),
                'note': 'Same date and symbol (exact duplicates and possible corrections)'
            }
        
        self.duplicate_analysis = analysis
        return analysis
    
    def _analyze_index_duplicates(self) -> List[Dict]:
        """Analyze groups of rows with same index."""
        grouped = self.df.groupby(self.df.index)
        conflicts = []
        
        for idx, group in grouped:
            if len(group) > 1:
                # Identical only if every row in the group has the same values
                is_identical = len(group.drop_duplicates()) == 1
                conflicts.append({
                    'index': idx,
                    'count': len(group),
                    'is_identical': is_identical,
                    'variance': group.select_dtypes(include=[np.number]).var().mean() 
                               if not is_identical else 0
                })
        
        return conflicts
    
    def remove_exact_duplicates(self, keep: str = 'first') -> pd.DataFrame:
        """
        Remove rows where all values are identical.
        
        keep='first': Keep first occurrence
        keep='last': Keep last occurrence (often more recent/corrected)
        keep=False: Remove all duplicates (investigate manually)
        """
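        before_count = len(self.df)
        # Compare the timestamp as well as the values, so identical prices
        # on two different dates are not treated as duplicates
        is_dup = self.df.reset_index().duplicated(keep=keep).to_numpy()
        cleaned = self.df[~is_dup]
        removed = before_count - len(cleaned)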
        
        print(f"Removed {removed} exact duplicate rows")
        print(f"Kept '{keep}' occurrence of each duplicate")
        
        return cleaned
    
    def handle_index_duplicates(self, strategy: str = 'keep_last') -> pd.DataFrame:
        """
        Handle rows with same timestamp (index).
        
        Strategies:
        - 'keep_first': Keep earliest record
        - 'keep_last': Keep latest record (often corrected data)
        - 'average': Average numeric values (for intraday aggregations)
        - 'flag': Keep all but add flag column
        """
        if strategy == 'keep_first':
            return self.df[~self.df.index.duplicated(keep='first')]
        
        elif strategy == 'keep_last':
            return self.df[~self.df.index.duplicated(keep='last')]
        
        elif strategy == 'average':
            # Group by index and average numeric columns
            # For financial data, this might be appropriate for intraday data
            # where you want daily averages
            
            numeric_cols = self.df.select_dtypes(include=[np.number]).columns
            
            # Aggregate: mean for numeric, first for others
            agg_dict = {}
            for col in self.df.columns:
                if col in numeric_cols:
                    agg_dict[col] = 'mean'
                else:
                    agg_dict[col] = 'first'
            
            return self.df.groupby(self.df.index).agg(agg_dict)
        
        elif strategy == 'flag':
            # Keep all rows but add columns marking duplicated timestamps
            flagged = self.df.copy()
            flagged['is_duplicate'] = flagged.index.duplicated(keep=False)
            flagged['duplicate_count'] = flagged.index.map(flagged.index.value_counts())
            return flagged
        
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

# Continue with the NEPSE example
# First, let's create data with specific duplicate issues
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=10, freq='B')

# Create base NEPSE data
nepse_base = pd.DataFrame({
    'Symbol': 'NABIL',
    'Open': np.random.uniform(2800, 2900, 10),
    'High': np.random.uniform(2850, 2950, 10),
    'Low': np.random.uniform(2750, 2850, 10),
    'Close': np.random.uniform(2800, 2900, 10),
    'Volume': np.random.randint(100000, 200000, 10)
}, index=dates)

# Introduce exact duplicates (data entry error)
nepse_with_dups = pd.concat([nepse_base, nepse_base.iloc[0:2]])

# Introduce index duplicates with different values (price correction scenario)
correction_row = pd.DataFrame({
    'Symbol': 'NABIL',
    'Open': 2850.00,  # Corrected from original
    'High': 2900.00,
    'Low': 2840.00,
    'Close': 2880.00,  # Different from original
    'Volume': 150000
}, index=[dates[2]])  # Same date as row 2

nepse_with_dups = pd.concat([nepse_with_dups, correction_row])
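# A stable sort keeps the appended correction after the original row for the same date
nepse_with_dups.sort_index(inplace=True, kind='mergesort')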

print("Data with duplicates:")
print(nepse_with_dups[nepse_with_dups.index.duplicated(keep=False)])

# Analyze duplicates
dup_handler = DuplicateHandler(nepse_with_dups)
analysis = dup_handler.analyze_duplicates()

print(f"\nDuplicate Analysis:")
print(f"Exact duplicates: {analysis['exact_duplicates']['count']}")
print(f"Index duplicates: {analysis['index_duplicates']['count']}")

# Handle exact duplicates
cleaned_exact = dup_handler.remove_exact_duplicates(keep='last')
print(f"\nAfter removing exact duplicates: {len(cleaned_exact)} rows")

# Handle index duplicates (keep last - assumes corrections are later)
dup_handler2 = DuplicateHandler(cleaned_exact)
final_clean = dup_handler2.handle_index_duplicates(strategy='keep_last')
print(f"After handling index duplicates: {len(final_clean)} rows")
print("\nFinal cleaned data:")
print(final_clean.head())
```

**Explanation:**
- **Exact duplicates** occur when the same row is accidentally inserted twice. This often happens during data collection retries or when merging datasets. We remove these while keeping either the first or last occurrence.
- **Index duplicates** (same timestamp) are more complex in financial data. They might represent:
  - **Corrections**: The exchange issued a correction for a price
  - **Auction trades**: Separate auction session on the same day
  - **Errors**: Data collection glitches
- **Strategy selection**:
  - `keep_last`: Assumes later data is more accurate (common for corrections)
  - `average`: Useful for intraday data aggregated to daily level
  - `flag`: When you want to preserve all data but mark potential issues
- **For NEPSE specifically**: If you see the same date twice with different prices, it's likely a correction was issued by the exchange. The later record is usually the corrected one.
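
Relying on row order to decide which record is "later" is fragile. If the collection step records when each row was retrieved, that timestamp can drive the choice instead. The sketch below assumes a hypothetical `fetched_at` column; it is not a NEPSE field, and the prices are made up.

```python
import pandas as pd

# Hypothetical feed: the same trading day appears twice, and the correction
# was fetched later but happens to sit earlier in the file
records = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-03", "2024-01-02", "2024-01-03"]),
        "Symbol": ["NABIL", "NABIL", "NABIL"],
        "Close": [2880.00, 2895.50, 2900.00],
        # 'fetched_at' is an assumed ingestion timestamp, not a NEPSE field
        "fetched_at": pd.to_datetime(
            ["2024-01-04 09:00", "2024-01-02 16:00", "2024-01-03 16:00"]
        ),
    }
)

# Order by ingestion time, then keep the newest row per (Date, Symbol)
latest = (
    records.sort_values("fetched_at")
    .drop_duplicates(subset=["Date", "Symbol"], keep="last")
    .sort_values("Date")
    .set_index("Date")
)
print(latest[["Symbol", "Close"]])
```

Here the row fetched on 4 January wins for the 3 January trading day even though it appears first in the input, which a plain positional `keep='last'` would get wrong.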

---

## **6.3 Inconsistent Data Handling**

Inconsistent data includes mixed formats, varying units, or categorical values that should be standardized (e.g., "NABIL" vs "nabil" vs "Nabil Bank").

```python
class InconsistencyHandler:
    """
    Handle data inconsistencies common in NEPSE and financial datasets.
    
    Common inconsistencies:
    - Symbol naming variations (NABIL vs Nabil)
    - Date format variations (DD/MM/YYYY vs YYYY-MM-DD)
    - Mixed units (Volume in thousands vs actual shares)
    - Whitespace and special characters
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.changes_log = []
    
    def standardize_symbols(self, column: str = 'Symbol') -> pd.DataFrame:
        """
        Standardize stock symbols to uppercase and remove whitespace.
        
        NEPSE symbols should be consistent: NABIL, NICA, SCBL, etc.
        """
        if column not in self.df.columns:
            return self.df
        
        original_unique = self.df[column].nunique()
        
        # Standardize: uppercase, strip whitespace, remove special chars
        self.df[column] = (self.df[column]
                          .astype(str)
                          .str.upper()
                          .str.strip()
                          .str.replace(r'[^\w]', '', regex=True))
        
        new_unique = self.df[column].nunique()
        
        if original_unique != new_unique:
            self.changes_log.append(
                f"Symbol standardization: {original_unique} -> {new_unique} unique values"
            )
            print(f"Warning: Symbol consolidation reduced {original_unique} to {new_unique} unique symbols")
        
        return self.df
    
    def standardize_dates(self, date_column: str = 'Date', 
                         format: Optional[str] = None) -> pd.DataFrame:
        """
        Convert various date formats to standard datetime.
        
        Handles:
        - Different separators (/, -, .)
        - Different orders (DD/MM/YYYY vs MM/DD/YYYY)
        - String vs datetime types
        """
        if date_column not in self.df.columns:
            return self.df
        
        # Convert to datetime
        if format:
            self.df[date_column] = pd.to_datetime(self.df[date_column], format=format)
        else:
            # Parse each value on its own so that mixed formats in one column
            # are tolerated. Ambiguous strings such as 01/02/2024 are read
            # month-first; pass `format` explicitly to avoid guessing.
            self.df[date_column] = self.df[date_column].map(pd.to_datetime)
        
        # Drop any time-of-day component so dates compare cleanly
        self.df[date_column] = pd.to_datetime(self.df[date_column]).dt.normalize()
        
        return self.df
    
    def normalize_units(self, volume_col: str = 'Volume', 
                       turnover_col: str = 'Turnover',
                       unit_hint: str = 'actual') -> pd.DataFrame:
        """
        Normalize volume and turnover to consistent units.
        
        Some sources report volume in thousands or millions.
        This standardizes everything to actual shares and NPR.
        """
        if volume_col in self.df.columns:
            # Heuristic: a maximum below 10,000 for a liquid stock suggests thousands
            max_vol = self.df[volume_col].max()
            
            if max_vol < 10000 and unit_hint == 'auto':
                # Likely in thousands
                self.df[volume_col] = self.df[volume_col] * 1000
                self.changes_log.append(f"Converted {volume_col} from thousands to actual")
                print(f"Converted {volume_col} from thousands to actual (detected small values)")
            elif unit_hint == 'thousands':
                self.df[volume_col] = self.df[volume_col] * 1000
                self.changes_log.append(f"Converted {volume_col} from thousands to actual")
        
        has_inputs = all(c in self.df.columns for c in (turnover_col, volume_col, 'Close'))
        if has_inputs and len(self.df) > 0:
            # Turnover should be roughly price * volume, so the ratio between
            # the two reveals the reporting unit (~1e3 thousands, ~1e6 millions)
            expected = (self.df['Close'] * self.df[volume_col]).median()
            reported = self.df[turnover_col].median()
            ratio = expected / reported if reported > 0 else 1.0
            
            for label, factor in (('millions', 1_000_000), ('thousands', 1_000)):
                if factor / 2 < ratio < factor * 2:
                    self.df[turnover_col] = self.df[turnover_col] * factor
                    self.changes_log.append(f"Converted {turnover_col} from {label} to actual")
                    print(f"Converted {turnover_col} from {label} to actual")
                    break
        
        return self.df
    
    def remove_whitespace(self, columns: Optional[List[str]] = None) -> pd.DataFrame:
        """Remove leading/trailing whitespace from string columns."""
        if columns is None:
            columns = self.df.select_dtypes(include=['object']).columns
        
        for col in columns:
            if self.df[col].dtype == 'object':
                self.df[col] = self.df[col].str.strip()
        
        return self.df

# Usage with NEPSE data
# Create inconsistent data
inconsistent_data = pd.DataFrame({
    'Symbol': ['nabil', 'NABIL', 'Nabil ', 'nica', 'NICA'],
    'Date': ['15/01/2024', '2024-01-16', '17-01-2024', '01/18/2024', '2024-01-19'],
    'Volume': [125, 150, 175, 140, 160],  # In thousands?
    'Turnover': [359062.5, 433200, 507237.5, 403630, 465280],  # In thousands?
    'Close': [2875.25, 2895.50, 2900.00, 2880.50, 2905.00]
})

print("Original inconsistent data:")
print(inconsistent_data)
print(f"\nUnique symbols (should be 2): {inconsistent_data['Symbol'].nunique()}")

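# Fix inconsistencies
# Each method returns the updated DataFrame (not the handler), so call them in sequence
consistency_handler = InconsistencyHandler(inconsistent_data)
consistency_handler.standardize_symbols('Symbol')
consistency_handler.standardize_dates('Date')
consistency_handler.normalize_units('Volume', 'Turnover', unit_hint='thousands')
cleaned = consistency_handler.remove_whitespace()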

print(f"\nCleaned data:")
print(cleaned)
print(f"\nUnique symbols after cleaning: {cleaned['Symbol'].nunique()}")
print(f"Volume now in actual shares: {cleaned['Volume'].iloc[0]:,.0f}")
```

**Explanation:**
- **Symbol standardization** ensures that "nabil", "NABIL", and "Nabil " are treated as the same stock. This is crucial because string matching is case-sensitive and whitespace-sensitive.
- **Date parsing** handles the common issue where different data sources use different date formats. NEPSE official data might use the Nepali (Bikram Sambat) calendar or different separators than international sources.
- **Unit normalization** is critical for financial data:
  - Some APIs return volume in thousands (125 instead of 125,000)
  - Turnover might be in millions or lakhs
  - We detect this by comparing calculated turnover (Close × Volume) with reported turnover
- **Whitespace removal** prevents issues where "NABIL " (with space) doesn't match "NABIL" in lookup tables.
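
When the set of source formats is known, trying each explicit format in turn avoids guessing between day-first and month-first dates. The following is a sketch rather than the chapter's method: the helper name `parse_known_formats` and the format list are assumptions chosen for illustration.

```python
import pandas as pd


def parse_known_formats(values: pd.Series, formats: list) -> pd.Series:
    """Try each explicit format in order; later formats only fill the gaps."""
    parsed = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        parsed = parsed.fillna(attempt)
    return parsed


raw_dates = pd.Series(
    ["15/01/2024", "2024-01-16", "17-01-2024", "01/18/2024", "not a date"]
)

# Order matters: day-first is listed before month-first, so an ambiguous
# string such as 01/02/2024 would be read as 1 February
formats = ["%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%m/%d/%Y"]
parsed = parse_known_formats(raw_dates, formats)

print(parsed)
print(f"Unparsed values: {parsed.isna().sum()}")
```

Values that match none of the formats stay as `NaT`, so they can be counted and reviewed instead of being silently misread.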

---

## **6.4 Data Type Standardization**

Ensuring each column has the correct data type saves memory and prevents calculation errors.

```python
class DataTypeStandardizer:
    """
    Standardize data types for optimal memory usage and calculation accuracy.
    
    Financial data types:
    - Prices: float32 (sufficient precision, half the memory of float64)
    - Volumes: int32 or int64 (depending on market size)
    - Dates: datetime64[ns]
    - Symbols: category (repeated strings)
    - Flags: int8 or boolean
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.type_changes = []
    
    def optimize_numeric_types(self) -> pd.DataFrame:
        """
        Downcast numeric types to save memory without losing precision.
        
        For NEPSE:
        - Prices: float32 (about 7 significant digits, sufficient for NPR prices)
        - Volume: int32 (max ~2 billion, sufficient for NEPSE daily volume)
        - Turnover: int64 (can be large: 2 billion shares * 3000 NPR = 6 trillion)
        """
        # Prices: float32 is sufficient (precise to ~0.01 NPR for values up to 10,000)
        price_cols = ['Open', 'High', 'Low', 'Close', 'VWAP', 'Prev_Close', 'LTP']
        for col in price_cols:
            if col in self.df.columns:
                self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
                self.df[col] = self.df[col].astype('float32')
        
        # Volume: int32 for most cases, int64 if values exceed 2 billion
        # Note: astype to an integer type raises if NaN is present, so handle missing volumes first
        if 'Volume' in self.df.columns:
            max_vol = self.df['Volume'].max()
            if max_vol < 2_147_483_647:  # Max int32
                self.df['Volume'] = self.df['Volume'].astype('int32')
            else:
                self.df['Volume'] = self.df['Volume'].astype('int64')
        
        # Turnover: usually needs int64
        if 'Turnover' in self.df.columns:
            self.df['Turnover'] = self.df['Turnover'].astype('int64')
        
        return self.df
    
    def categorize_symbols(self, symbol_col: str = 'Symbol') -> pd.DataFrame:
        """
        Convert symbol column to category type for memory efficiency.
        
        If you have 1000 days of data for 100 stocks, that's 100,000 rows
        but only 100 unique symbols. Category stores integers internally.
        """
        if symbol_col in self.df.columns:
            # Calculate memory before
            mem_before = self.df[symbol_col].memory_usage(deep=True)
            
            self.df[symbol_col] = self.df[symbol_col].astype('category')
            
            mem_after = self.df[symbol_col].memory_usage(deep=True)
            savings = (1 - mem_after/mem_before) * 100
            
            print(f"Symbol column memory reduced by {savings:.1f}% using category type")
        
        return self.df
    
    def standardize_datetime(self, date_col: str = 'Date', 
                           set_index: bool = False) -> pd.DataFrame:
        """
        Ensure proper datetime type and optionally set as index.
        """
        if date_col in self.df.columns:
            self.df[date_col] = pd.to_datetime(self.df[date_col])
            
            if set_index:
                self.df.set_index(date_col, inplace=True)
                self.df.sort_index(inplace=True)
        
        return self.df
    
    def convert_boolean_flags(self, columns: List[str]) -> pd.DataFrame:
        """
        Convert string representations of boolean to actual boolean type.
        
        Handles: 'Yes'/'No', 'True'/'False', 1/0, '1'/'0'
        """
        for col in columns:
            if col in self.df.columns:
                # Handle various representations
                if self.df[col].dtype == 'object':
                    self.df[col] = self.df[col].str.lower().map({
                        'true': True, 'false': False,
                        'yes': True, 'no': False,
                        '1': True, '0': False,
                        't': True, 'f': False
                    }).fillna(self.df[col]).astype('boolean')
                else:
                    self.df[col] = self.df[col].astype('boolean')
        
        return self.df
    
    def get_memory_report(self) -> pd.DataFrame:
        """Generate memory usage report by column."""
        usage = self.df.memory_usage(deep=True)
        usage_mb = usage / 1024 / 1024
        
        report = pd.DataFrame({
            'Column': usage.index,
            'Bytes': usage.values,
            'MB': usage_mb.values,
            'Dtype': [str(self.df[col].dtype) if col in self.df.columns else 'N/A' 
                     for col in usage.index]
        })
        
        return report.sort_values('Bytes', ascending=False)

# Usage with NEPSE data
# Create sample with suboptimal types
nepse_types = pd.DataFrame({
    'Symbol': ['NABIL'] * 1000,  # Will benefit from category
    'Date': pd.date_range('2020-01-01', periods=1000, freq='B'),
    'Open': np.random.uniform(2800, 3000, 1000).astype('float64'),  # Overkill precision
    'High': np.random.uniform(2800, 3000, 1000).astype('float64'),
    'Low': np.random.uniform(2800, 3000, 1000).astype('float64'),
    'Close': np.random.uniform(2800, 3000, 1000).astype('float64'),
    'Volume': np.random.randint(100000, 200000, 1000).astype('int64'),  # Could be int32
    'Is_Bullish': ['Yes', 'No'] * 500  # Should be boolean
})

print("Before optimization:")
print(nepse_types.dtypes)
print(f"\nMemory usage: {nepse_types.memory_usage(deep=True).sum() / 1024:.2f} KB")

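# Optimize
# Each method returns the updated DataFrame (not the standardizer), so call them in sequence
standardizer = DataTypeStandardizer(nepse_types)
standardizer.optimize_numeric_types()
standardizer.categorize_symbols('Symbol')
standardizer.standardize_datetime('Date', set_index=True)
optimized = standardizer.convert_boolean_flags(['Is_Bullish'])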

print("\nAfter optimization:")
print(optimized.dtypes)
print(f"\nMemory usage: {optimized.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"\nMemory savings: {(1 - optimized.memory_usage(deep=True).sum() / nepse_types.memory_usage(deep=True).sum()) * 100:.1f}%")

print("\nDetailed memory report:")
print(standardizer.get_memory_report().head())
```

**Explanation:**
- **Memory optimization** is crucial when dealing with years of tick data or hundreds of stocks.
- **float32 vs float64**:
  - float32: about 7 significant decimal digits, 4 bytes
  - float64: about 15 to 16 significant decimal digits, 8 bytes
  - For stock prices under 10,000 NPR, float32 is precise to better than 0.01 NPR, which is sufficient for storing individual prices
- **int32 vs int64**:
  - int32 max: ~2.1 billion (sufficient for NEPSE daily volume)
  - int64 max: ~9 quintillion (needed for turnover calculations)
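- **Category type**:
  - Stores repeated strings as integers with a lookup table
  - For 1000 rows of "NABIL", stores [0,0,0...] + mapping {0: "NABIL"}
  - For an object column, `memory_usage(deep=True)` on 64-bit CPython reports roughly 60 KB (an 8-byte pointer plus a ~54-byte string per row); the category version needs about 1 KB (int8 codes plus a one-entry mapping)
- **Boolean type**: NumPy `bool` uses 1 byte per entry and the nullable pandas `boolean` dtype used here uses 2 (value plus mask), versus 50+ bytes per entry for strings in an object column.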
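
The precision, overflow, and memory claims above can be checked directly with NumPy and pandas. This is a small sketch with made-up values; the exact byte counts depend on the Python and pandas versions in use.

```python
import numpy as np
import pandas as pd

# float32 resolution: np.spacing gives the gap to the next representable value
for p in (2875.25, 9999.99):
    gap = np.spacing(np.float32(p))
    print(f"float32 step near {p}: {gap:.6f} NPR")

# int32 array arithmetic wraps silently once a product exceeds about 2.1 billion
volume = np.array([2_000_000], dtype=np.int32)
price = np.array([3_000], dtype=np.int32)
wrapped = (volume * price)[0]                      # overflowed, wrong value
correct = (volume.astype(np.int64) * price)[0]     # 6 billion, as expected
print("int32 turnover:", wrapped)
print("int64 turnover:", correct)

# Category vs object storage for a repeated symbol
symbols = pd.Series(["NABIL"] * 1000, dtype=object)
as_category = symbols.astype("category")
print("object bytes:  ", symbols.memory_usage(deep=True))
print("category bytes:", as_category.memory_usage(deep=True))
```

The overflow case is the reason to keep turnover, and any intermediate price-times-volume product, in int64 or float64 even when volume itself fits in int32.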

---

## **6.5 Missing Data Patterns**

Understanding why data is missing is as important as the missing data itself. The mechanism of missingness determines the appropriate handling strategy.
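
The three missingness mechanisms are easier to tell apart when they are simulated. The sketch below uses synthetic prices and volumes; the thresholds and probabilities are arbitrary assumptions, not estimates from NEPSE data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Synthetic daily data; the numbers are illustrative, not real NEPSE values
df = pd.DataFrame({
    "Close": rng.uniform(200, 3000, n),
    "Volume": rng.lognormal(mean=10, sigma=1, size=n),
})

# MCAR: every row has the same 10% chance of losing its Volume
mcar = df["Volume"].mask(rng.random(n) < 0.10)

# MAR: Volume is more likely to be missing when the observed Close is low
p_mar = np.where(df["Close"] < 500, 0.40, 0.05)
mar = df["Volume"].mask(rng.random(n) < p_mar)

# MNAR: Volume is more likely to be missing when Volume itself is low
low_volume = df["Volume"] < df["Volume"].quantile(0.25)
p_mnar = np.where(low_volume, 0.40, 0.05)
mnar = df["Volume"].mask(rng.random(n) < p_mnar)

for name, series in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    missing = series.isna()
    print(
        f"{name}: missing={missing.mean():.1%}, "
        f"mean Close when missing={df.loc[missing, 'Close'].mean():.0f}, "
        f"mean observed Volume={series.mean():.0f} "
        f"(true mean {df['Volume'].mean():.0f})"
    )
```

Under MAR, the rows with missing `Volume` have a visibly lower `Close`, which is detectable from observed data. Under MNAR, the observed mean `Volume` is biased upward, and nothing in the observed columns reveals why, which is what makes MNAR the hardest case to handle.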

```python
class MissingDataAnalyzer:
    """
    Analyze patterns of missing data to determine appropriate handling strategies.
    
    Types of missingness:
    1. MCAR (Missing Completely At Random): No systematic reason
    2. MAR (Missing At Random): Missingness related to observed data
    3. MNAR (Missing Not At Random): Missingness related to the missing value itself
    
    For NEPSE:
    - MCAR: Data collection system glitch
    - MAR: Small cap stocks missing volume on low liquidity days (observable)
    - MNAR: Delisted stocks missing data because company performed poorly
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.missing_report = {}
    
    def calculate_missing_statistics(self) -> pd.DataFrame:
        """Calculate comprehensive missing data statistics."""
        missing_stats = pd.DataFrame({
            'Column': self.df.columns,
            'Missing_Count': self.df.isnull().sum(),
            'Missing_Percent': (self.df.isnull().sum() / len(self.df)) * 100,
            'Data_Type': self.df.dtypes.values
        })
        
        missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]
        missing_stats = missing_stats.sort_values('Missing_Percent', ascending=False)
        
        return missing_stats
    
    def analyze_temporal_patterns(self, date_col: str = 'Date') -> Dict:
        """
        Analyze if missing data follows temporal patterns.
        
        Checks:
        - Day of week patterns (weekends, holidays)
        - Month-end effects
        - Trend over time (getting worse/better?)
        """
        if date_col not in self.df.columns:
            return {}
        
        dates = pd.to_datetime(self.df[date_col])
        is_missing = self.df.isnull()
        
        patterns = {}
        
        # By day of week (0=Monday, 6=Sunday): total missing cells
        missing_per_row = is_missing.sum(axis=1)
        patterns['day_of_week'] = missing_per_row.groupby(dates.dt.dayofweek).sum().to_dict()
        
        # By month: total missing cells
        patterns['month'] = missing_per_row.groupby(dates.dt.month).sum().to_dict()
        
        # Trend over time (by year): percent of cells missing.
        # Helper date columns are not added to the frame, so they cannot dilute the rate.
        pct_per_row = is_missing.mean(axis=1) * 100
        patterns['yearly_trend'] = pct_per_row.groupby(dates.dt.year).mean().to_dict()
        
        return patterns
    
    def analyze_correlation_with_missingness(self) -> pd.DataFrame:
        """
        Check if missingness in one column correlates with values in another.
        
        This helps determine if data is MAR (Missing At Random).
        """
        # Create missingness indicators (1 if missing, 0 if not)
        missing_indicators = self.df.isnull().astype(int)
        
        # Add suffix to distinguish from original data
        missing_indicators.columns = [f"{col}_missing" for col in missing_indicators.columns]
        
        # Combine with numeric data
        numeric_data = self.df.select_dtypes(include=[np.number])
        combined = pd.concat([numeric_data, missing_indicators], axis=1)
        
        # Calculate correlation matrix
        corr_matrix = combined.corr()
        
        # Extract correlations between data and missing indicators
        missing_corr = pd.DataFrame()
        for missing_col in missing_indicators.columns:
            orig_col = missing_col.replace('_missing', '')
            if orig_col in numeric_data.columns:
                correlations = corr_matrix[missing_col].drop(missing_indicators.columns)
                missing_corr[missing_col] = correlations
        
        return missing_corr
    
    def detect_mcar_little_test(self) -> Dict:
        """
        Little's MCAR test (simplified version).
        
        H0: Data is Missing Completely At Random
        If p-value < 0.05, reject H0 (data is not MCAR)
        
        Note: Full implementation requires complex statistical calculations.
        This is a heuristic check.
        """
        # Simplified check: If missingness correlates with observed values, not MCAR
        corr_analysis = self.analyze_correlation_with_missingness()
        
        strong_correlations = (corr_analysis.abs() > 0.3).sum().sum()
        
        if strong_correlations > 0:
            return {
                'mcar_likely': False,
                'reason': f'Found {strong_correlations} strong correlations between missingness and observed values',
                'recommendation': 'Use MAR-appropriate methods (conditional imputation)'
            }
        else:
            return {
                'mcar_likely': True,
                'reason': 'No strong correlations found between missingness and observed values',
                'recommendation': 'Simple imputation methods may be adequate'
            }
    
    def visualize_missing_pattern(self):
        """Create visualization of missing data pattern."""
        import matplotlib.pyplot as plt
        
        plt.figure(figsize=(12, 6))
        
        # Missingness heatmap
        plt.subplot(1, 2, 1)
        missing_matrix = self.df.isnull().astype(int)
        plt.imshow(missing_matrix.T, cmap='viridis', aspect='auto')
        plt.yticks(range(len(self.df.columns)), self.df.columns)
        plt.xlabel('Time Index')
        plt.title('Missing Data Pattern (Yellow = Missing)')
        plt.colorbar(label='Missing (1) / Present (0)')
        
        # Missingness by column
        plt.subplot(1, 2, 2)
        missing_counts = self.df.isnull().sum().sort_values(ascending=True)
        missing_counts[missing_counts > 0].plot(kind='barh')
        plt.xlabel('Count of Missing Values')
        plt.title('Missing Values by Column')
        plt.tight_layout()
        
        plt.savefig('missing_data_pattern.png', dpi=150)
        plt.close()
        print("Missing data pattern visualization saved")

# Usage with NEPSE data
# Create data with different missing patterns
dates = pd.date_range('2024-01-01', periods=100, freq='B')
np.random.seed(42)

nepse_missing = pd.DataFrame({
    'Date': dates,
    'Symbol': 'NABIL',
    'Open': np.where(np.random.random(100) > 0.9, np.nan, np.random.uniform(2800, 3000, 100)),
    'High': np.where(np.random.random(100) > 0.95, np.nan, np.random.uniform(2800, 3000, 100)),
    'Low': np.where(np.random.random(100) > 0.95, np.nan, np.random.uniform(2800, 3000, 100)),
    'Close': np.where(np.random.random(100) > 0.9, np.nan, np.random.uniform(2800, 3000, 100)),
    'Volume': np.where(np.random.random(100) > 0.8, np.nan, np.random.randint(100000, 200000, 100))
})

# Add systematic missingness (MAR): Close is set to missing on every low-volume day
low_volume_mask = nepse_missing['Volume'] < 120000
nepse_missing.loc[low_volume_mask, 'Close'] = np.nan

print("Missing Data Analysis:")
analyzer = MissingDataAnalyzer(nepse_missing)

# Basic statistics
stats = analyzer.calculate_missing_statistics()
print("\nMissing Statistics:")
print(stats)

# Temporal patterns
patterns = analyzer.analyze_temporal_patterns()
print(f"\nTemporal Patterns:")
print(f"Missing by day of week: {patterns.get('day_of_week', {})}")

# MCAR test
mcar_result = analyzer.detect_mcar_little_test()
print(f"\nMCAR Test Result:")
print(f"Likely MCAR: {mcar_result['mcar_likely']}")
print(f"Reason: {mcar_result['reason']}")
print(f"Recommendation: {mcar_result['recommendation']}")

# Visualize
analyzer.visualize_missing_pattern()
```

**Explanation:**
- **MCAR (Missing Completely At Random)**: The probability of missingness is the same for all observations. Example: A random data transmission error.
- **MAR (Missing At Random)**: Missingness depends on observed data but not the missing value itself. Example: Small-cap stocks (observable characteristic) are more likely to have missing volume data.
- **MNAR (Missing Not At Random)**: Missingness depends on the missing value itself. Example: A company delists (extremely low stock price) and data stops appearing.
- **Why this matters**:
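  - MCAR: Simple imputation (mean, median) does not bias the column's average, although it still understates variance and weakens correlations
  - MAR: Use conditional imputation based on observed values
  - MNAR: Requires specialized models, or treat the missingness itself as information
- **Little's MCAR test**: A statistical test of the null hypothesis that data is MCAR. The `detect_mcar_little_test` method above does not compute the test statistic or a p-value; it is a heuristic that flags any correlation stronger than 0.3 between a missingness indicator and an observed numeric column. If MCAR is rejected, you need more sophisticated methods than simple mean imputation.
- **Temporal patterns**: Missing data on non-trading days is expected (markets closed; NEPSE normally trades Sunday through Thursday). Missing data on a regular trading day, such as a Tuesday, might indicate a data collection issue.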
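
To make the MCAR/MAR distinction concrete, the sketch below simulates both mechanisms on synthetic data and measures how far mean imputation moves the column average. The price-volume relationship, the 30% missing rate, and all variable names are assumptions chosen for the demonstration, not properties of NEPSE data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data: Close rises with Volume (an assumption made only for this demo)
volume = rng.uniform(100_000, 200_000, n)
close = 2800 + 0.002 * (volume - 100_000) + rng.normal(0, 10, n)
true_mean = close.mean()

# MCAR: every Close value has the same 30% chance of being missing
mcar_close = pd.Series(close).mask(rng.random(n) < 0.30)

# MAR: Close is missing only on low-volume days (depends on observed Volume)
mar_close = pd.Series(close).mask(volume < 130_000)

# Mean imputation fills gaps with the observed mean, so the mean of the
# imputed column equals the mean of the observed values
mcar_bias = mcar_close.mean() - true_mean
mar_bias = mar_close.mean() - true_mean
print(f"Bias of mean imputation under MCAR: {mcar_bias:+.2f}")
print(f"Bias of mean imputation under MAR:  {mar_bias:+.2f}")

# Conditional imputation: fit Close ~ Volume on complete rows,
# then predict Close for the rows where it is missing
observed = mar_close.notna().to_numpy()
slope, intercept = np.polyfit(volume[observed], close[observed], 1)
conditional_fill = np.where(observed, close, intercept + slope * volume)
conditional_bias = conditional_fill.mean() - true_mean
print(f"Bias of regression imputation under MAR: {conditional_bias:+.2f}")
```

Under MCAR the observed rows are a random sample, so their mean stays close to the truth. Under MAR the observed rows are all high-volume days, and only an imputation that conditions on Volume corrects for that.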

---

## **6.6 Advanced Imputation Techniques**

When simple methods (mean, forward-fill) are insufficient, advanced techniques use relationships between variables and patterns in the data.

```python
# IterativeImputer is experimental: the enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.preprocessing import StandardScaler

class AdvancedImputation:
    """
    Advanced imputation methods for time-series financial data.
    
    Methods:
    1. KNN Imputation: Use similar days to fill missing values
    2. Iterative Imputation (MICE): Model each feature as function of others
    3. Interpolation with constraints: Respect OHLC relationships
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.imputed_columns = []
    
    def knn_imputation(self, n_neighbors: int = 5) -> pd.DataFrame:
        """
        K-Nearest Neighbors imputation.
        
        For a row with missing values, find k most similar rows (based on
        non-missing features) and use their average.
        
        Good for: Cross-sectional data where similar stocks/days exist
        """
        # Select numeric columns
        numeric_df = self.df.select_dtypes(include=[np.number])
        
        # Standardize for distance calculation
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(numeric_df)
        
        # Apply KNN imputation
        imputer = KNNImputer(n_neighbors=n_neighbors, weights='distance')
        imputed_scaled = imputer.fit_transform(scaled_data)
        
        # Inverse transform
        imputed_data = scaler.inverse_transform(imputed_scaled)
        
        # Update dataframe
        imputed_df = pd.DataFrame(imputed_data, 
                                 columns=numeric_df.columns, 
                                 index=numeric_df.index)
        
        # Only update missing values, preserve original where present
        for col in numeric_df.columns:
            mask = self.df[col].isnull()
            if mask.any():
                self.df.loc[mask, col] = imputed_df.loc[mask, col]
                self.imputed_columns.append(f"{col} (KNN)")
        
        return self.df
    
    def iterative_imputation(self, max_iter: int = 10) -> pd.DataFrame:
        """
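        Iterative imputation in the style of MICE (Multiple Imputation by
        Chained Equations).
        
        Models each feature with missing values as a function of other features.
        Iterates until convergence or until max_iter rounds have run.
        Note: IterativeImputer returns a single imputed dataset, not multiple draws.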
        
        Good for: Data with strong correlations between features (OHLC)
        """
        numeric_df = self.df.select_dtypes(include=[np.number])
        
        imputer = IterativeImputer(max_iter=max_iter, random_state=42)
        imputed_data = imputer.fit_transform(numeric_df)
        
        imputed_df = pd.DataFrame(imputed_data,
                                 columns=numeric_df.columns,
                                 index=numeric_df.index)
        
        # Update only missing values
        for col in numeric_df.columns:
            mask = self.df[col].isnull()
            if mask.any():
                self.df.loc[mask, col] = imputed_df.loc[mask, col]
                self.imputed_columns.append(f"{col} (Iterative)")
        
        return self.df
    
    def constrained_ohlc_imputation(self) -> pd.DataFrame:
        """
        Impute OHLC data while maintaining financial constraints.
        
        Ensures after imputation:
        - High >= max(Open, Close, Low)
        - Low <= min(Open, Close, High)
        """
        ohlc_cols = ['Open', 'High', 'Low', 'Close']
        available_cols = [c for c in ohlc_cols if c in self.df.columns]
        
        # Initial fill with linear interpolation. This draws on the next
        # observed value, so it is not causal.
        for col in available_cols:
            if self.df[col].isnull().any():
                self.imputed_columns.append(f"{col} (Constrained)")
            self.df[col] = self.df[col].interpolate(method='linear')
        
        if all(c in self.df.columns for c in ['High', 'Low', 'Close', 'Open']):
            # Gaps at the start of the series are not interpolated;
            # fall back to Close before enforcing the constraints
            for col in ['Open', 'High', 'Low']:
                mask = self.df[col].isnull()
                if mask.any():
                    self.df.loc[mask, col] = self.df.loc[mask, 'Close']
            
            # High should be >= Open, Close, Low
            self.df['High'] = self.df[['High', 'Open', 'Close', 'Low']].max(axis=1)
            
            # Low should be <= Open, Close, High
            self.df['Low'] = self.df[['Low', 'Open', 'Close', 'High']].min(axis=1)
        
        return self.df
    
    def time_weighted_imputation(self) -> pd.DataFrame:
        """
        Impute using time-weighted average (recent values weighted more).
        
        Unlike simple mean, this respects that recent prices are more relevant.
        """
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        
        for col in numeric_cols:
            if self.df[col].isnull().any():
                # Calculate exponential weighted moving average
                ewma = self.df[col].ewm(span=10, adjust=False).mean()
                
                # Fill missing with EWMA
                self.df[col] = self.df[col].fillna(ewma)
                
                # If still missing (at start of series), use forward/backward fill
                # (fillna(method=...) is deprecated in recent pandas)
                self.df[col] = self.df[col].ffill().bfill()
        
        return self.df

# Usage example
# Create NEPSE data with missing values
dates = pd.date_range('2024-01-01', periods=20, freq='B')
np.random.seed(42)

nepse_impute = pd.DataFrame({
    'Date': dates,
    'Open': [2850.50, np.nan, 2890.00, 2865.75, np.nan, 2880.50, 2905.00, np.nan, 2900.00, 2920.00] * 2,
    'High': [2890.00, 2910.00, np.nan, 2900.00, 2915.00, 2920.00, np.nan, 2925.00, 2930.00, 2940.00] * 2,
    'Low': [2840.00, 2860.00, 2880.00, np.nan, 2870.00, 2875.00, 2890.00, 2900.00, np.nan, 2910.00] * 2,
    'Close': [2875.25, 2895.50, 2900.00, 2880.50, np.nan, 2910.00, 2915.00, 2925.00, 2935.00, np.nan] * 2,
    'Volume': [125000, 150000, np.nan, 140000, 160000, 180000, 155000, np.nan, 170000, 200000] * 2
})

print("Data with missing values:")
print(nepse_impute.isnull().sum())

# Apply constrained imputation (best for OHLC data)
imputer = AdvancedImputation(nepse_impute)
imputed_data = imputer.constrained_ohlc_imputation()

print(f"\nAfter constrained imputation:")
print(imputed_data.isnull().sum())
print(f"\nConstraints maintained:")
print(f"High >= Close: {(imputed_data['High'] >= imputed_data['Close']).all()}")
print(f"Low <= Close: {(imputed_data['Low'] <= imputed_data['Close']).all()}")

# Show imputed values
print(f"\nColumns imputed: {imputer.imputed_columns}")
```

**Explanation:**
- **KNN Imputation**: Finds the 5 most similar rows (based on the standardized values of the other numeric columns, such as Volume and the remaining prices) and fills the gap with their distance-weighted average (`weights='distance'`). Good when you have cross-sectional data (multiple stocks) or features that correlate with the missing value.
- **Iterative Imputation (MICE-style)**: Creates a model for each column using other columns as predictors. Runs up to 10 rounds (`max_iter=10`), each time using updated estimates, and stops early on convergence. Well suited to OHLC data where Open, High, Low, Close are highly correlated.
- **Constrained Imputation**: Financial data has rigid constraints (High ≥ Low). Standard imputation might violate these. This method imputes first with linear interpolation, then adjusts values to ensure High is actually the highest and Low is the lowest. Because interpolation uses the next observed value, it is suitable for cleaning historical data but not for filling gaps in real time.
- **Time-Weighted**: Uses Exponentially Weighted Moving Average (EWMA) which gives more weight to recent observations. At a gap, the fill value is the EWMA of the observations before it. This is more appropriate than simple mean for time-series where recent values are more relevant.
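
One way to choose between imputation methods is to hide values that are actually known, impute them, and measure the error. The sketch below does this for four simple pandas-based fills on a synthetic random walk; the data, the 10% hiding rate, and the candidate list are assumptions for illustration. The same masking approach could be applied to the `AdvancedImputation` methods above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic close prices: a random walk (hypothetical data, not NEPSE quotes)
close = pd.Series(2900 + rng.normal(0, 15, 300).cumsum())

# Hide 10% of the known values so the imputation error can be measured
hidden = rng.random(len(close)) < 0.10
hidden[0] = False  # keep the first point so every method has a starting value
masked = close.mask(hidden)

candidates = {
    "global mean": masked.fillna(masked.mean()),
    "forward fill": masked.ffill(),
    "linear interpolation": masked.interpolate(method="linear"),
    "EWMA (span=10)": masked.fillna(masked.ewm(span=10, adjust=False).mean()),
}

# RMSE on the hidden points only
rmse = {
    name: float(np.sqrt(((filled[hidden] - close[hidden]) ** 2).mean()))
    for name, filled in candidates.items()
}
for name, err in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} RMSE = {err:.2f}")
```

Linear interpolation looks at the value after each gap, so its error is not achievable in live prediction; forward fill and EWMA use only past values.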

---

## **6.7 Outlier Analysis and Treatment**

Outliers in financial data can be errors (bad ticks) or genuine events (market crashes, earnings surprises). Distinguishing between them is crucial.

```python
class OutlierHandler:
    """
    Comprehensive outlier handling for financial time-series.
    
    Approaches:
    1. Statistical: Z-score, IQR (thresholds calibrated for roughly normal data)
    2. ML-based: Isolation Forest (unsupervised anomaly detection)
    3. Domain-specific: Price limits, volatility checks
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.outlier_mask = pd.DataFrame(False, index=df.index, columns=df.columns)
    
    def statistical_outliers(self, column: str, method: str = 'iqr', 
                            threshold: float = 3.0) -> pd.Series:
        """
        Detect outliers using statistical methods.
        
        method: 'iqr' (Interquartile Range) or 'zscore'
        threshold: For IQR, multiplier (1.5 = standard, 3.0 = extreme)
                   For Z-score, number of standard deviations
        """
        data = self.df[column].dropna()
        
        if method == 'iqr':
            Q1 = data.quantile(0.25)
            Q3 = data.quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - threshold * IQR
            upper = Q3 + threshold * IQR
            
            outliers = (self.df[column] < lower) | (self.df[column] > upper)
            
        elif method == 'zscore':
            mean = data.mean()
            std = data.std()
            z_scores = (self.df[column] - mean) / std
            outliers = z_scores.abs() > threshold
            
        else:
            raise ValueError(f"Unknown method: {method}")
        
        self.outlier_mask[column] = outliers
        return outliers
    
    def price_jump_outliers(self, price_col: str = 'Close', 
                           threshold_pct: float = 10.0) -> pd.Series:
        """
        Detect unusual price jumps (common in NEPSE during circuit breakers).
        
        threshold_pct: Percentage change considered unusual (e.g., 10%)
        """
        returns = self.df[price_col].pct_change() * 100
        outliers = returns.abs() > threshold_pct
        
        self.outlier_mask[f"{price_col}_jump"] = outliers
        return outliers
    
    def ohlc_logic_outliers(self) -> pd.Series:
        """
        Detect rows where OHLC logic is violated (data errors).
        
        Valid: High >= Open, Close, Low and Low <= Open, Close, High
        """
        violations = pd.Series(False, index=self.df.index)
        
        if all(c in self.df.columns for c in ['High', 'Low', 'Open', 'Close']):
            # High should be maximum
            high_violation = self.df['High'] < self.df[['Open', 'Close', 'Low']].max(axis=1)
            
            # Low should be minimum
            low_violation = self.df['Low'] > self.df[['Open', 'Close', 'High']].min(axis=1)
            
            violations = high_violation | low_violation
        
        self.outlier_mask['ohlc_logic_error'] = violations
        return violations
    
    def volume_outliers(self, volume_col: str = 'Volume', 
                       method: str = 'mad') -> pd.Series:
        """
        Detect unusual volume spikes.
        
        method: 'mad' (Median Absolute Deviation) - more robust than IQR for volume
        """
        data = self.df[volume_col].dropna()
        
        if method == 'mad':
            median = data.median()
            mad = (data - median).abs().median()
            modified_z = 0.6745 * (self.df[volume_col] - median) / mad
            outliers = modified_z.abs() > 3.5
            
        else:
            # Use IQR method
            Q1 = data.quantile(0.25)
            Q3 = data.quantile(0.75)
            IQR = Q3 - Q1
            outliers = (self.df[volume_col] < (Q1 - 1.5 * IQR)) | \
                      (self.df[volume_col] > (Q3 + 1.5 * IQR))
        
        self.outlier_mask[f"{volume_col}_outlier"] = outliers
        return outliers
    
    def treat_outliers(self, column: str, method: str = 'clip', 
                      lower_quantile: float = 0.01,
                      upper_quantile: float = 0.99) -> pd.DataFrame:
        """
        Treat outliers using specified method.
        
        methods:
        - 'clip': Cap at percentiles (Winsorization)
        - 'remove': Delete rows with outliers
        - 'transform': Apply log transformation
        - 'flag': Keep but add indicator column
        """
        if method == 'clip':
            lower = self.df[column].quantile(lower_quantile)
            upper = self.df[column].quantile(upper_quantile)
            self.df[f"{column}_original"] = self.df[column]
            self.df[column] = self.df[column].clip(lower, upper)
            
        elif method == 'remove':
            mask = self.outlier_mask.get(column, pd.Series(False, index=self.df.index))
            self.df = self.df[~mask]
            
        elif method == 'transform':
            # Log transform (good for right-skewed data like volume)
            self.df[f"{column}_original"] = self.df[column]
            self.df[column] = np.log1p(self.df[column])
            
        elif method == 'flag':
            mask = self.outlier_mask.get(column, pd.Series(False, index=self.df.index))
            self.df[f"{column}_is_outlier"] = mask.astype(int)
            
        return self.df

# Usage with NEPSE data
# Create data with various outlier types
dates = pd.date_range('2024-01-01', periods=50, freq='B')
np.random.seed(42)

nepse_outliers = pd.DataFrame({
    'Date': dates,
    'Open': np.random.uniform(2800, 3000, 50),
    'High': np.random.uniform(2800, 3000, 50),
    'Low': np.random.uniform(2800, 3000, 50),
    'Close': np.random.uniform(2800, 3000, 50),
    'Volume': np.random.randint(100000, 200000, 50)
})

# Add outliers
nepse_outliers.loc[10, 'Close'] = 5000  # Price spike (error or news)
nepse_outliers.loc[20, 'Volume'] = 2000000  # Volume spike
nepse_outliers.loc[30, 'High'] = 2500  # Logic error (High < Low)

# Ensure High/Low logic for non-outlier rows
for i in range(len(nepse_outliers)):
    if i != 30:  # Skip the outlier row
        row = nepse_outliers.loc[i]
        nepse_outliers.loc[i, 'High'] = max(row['High'], row['Open'], row['Close'], row['Low'])
        nepse_outliers.loc[i, 'Low'] = min(row['Low'], row['Open'], row['Close'], row['High'])

# Detect outliers
handler = OutlierHandler(nepse_outliers)

# Statistical outliers in Close price
price_outliers = handler.statistical_outliers('Close', method='iqr', threshold=1.5)
print(f"Price outliers detected: {price_outliers.sum()}")

# Price jump outliers
jump_outliers = handler.price_jump_outliers('Close', threshold_pct=5)
print(f"Price jump outliers: {jump_outliers.sum()}")

# Logic errors
logic_errors = handler.ohlc_logic_outliers()
print(f"OHLC logic errors: {logic_errors.sum()}")

# Volume outliers
vol_outliers = handler.volume_outliers('Volume', method='mad')
print(f"Volume outliers: {vol_outliers.sum()}")

# Treat outliers (clip extreme values)
handler.treat_outliers('Close', method='clip', lower_quantile=0.01, upper_quantile=0.99)
print(f"\nClose price range after treatment: {handler.df['Close'].min():.2f} - {handler.df['Close'].max():.2f}")
```

**Explanation:**
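- **Statistical methods** (Z-score, IQR): The usual thresholds (3 standard deviations, 1.5 × IQR) are calibrated for roughly normal data. Daily returns are closer to normal than price levels, which trend, but returns still have fat tails. Apply these tests to returns or detrended prices where possible, and expect more flagged points than a normal distribution would predict.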
- **Price jump detection**: A 10% daily move might be normal for crypto but is a huge outlier for NEPSE stocks. This is domain-specific.
- **OHLC logic**: Violations (High < Close) are definitely errors and should be flagged or corrected immediately.
- **MAD (Median Absolute Deviation)**: More robust than standard deviation for skewed distributions like trading volume (which has long right tail).
- **Treatment strategies**:
  - **Clip (Winsorize)**: Cap at 1st and 99th percentiles. Keeps data but reduces extreme influence.
  - **Remove**: Delete rows. Use only if certain it's an error.
  - **Transform**: Log-transform for skewed data. Reduces impact of large values.
  - **Flag**: Keep data but mark it. Models can learn that flagged data is special.
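
The robustness argument for MAD can be checked on a small example. In the sketch below (hypothetical volumes, thresholds taken from the code above), a single extreme spike inflates the mean and standard deviation so much that its own z-score stays under 3, while the modified z-score based on the median and MAD flags it clearly. With the sample standard deviation, the largest z-score possible in a sample of n points is (n-1)/sqrt(n), which is about 2.85 for n = 10.

```python
import numpy as np

# Hypothetical daily volumes with one extreme spike
volume = np.array([120_000, 135_000, 128_000, 142_000, 131_000,
                   125_000, 138_000, 2_000_000, 129_000, 133_000], dtype=float)

# Classic z-score: the spike inflates both the mean and the standard deviation
z = (volume - volume.mean()) / volume.std(ddof=1)

# Modified z-score: median and MAD barely move when one value is extreme
median = np.median(volume)
mad = np.median(np.abs(volume - median))
modified_z = 0.6745 * (volume - median) / mad

spike = int(np.argmax(volume))
print(f"z-score of spike:          {z[spike]:.2f}")
print(f"modified z-score of spike: {modified_z[spike]:.2f}")
print(f"flagged by |z| > 3:            {(np.abs(z) > 3).sum()} point(s)")
print(f"flagged by |modified z| > 3.5: {(np.abs(modified_z) > 3.5).sum()} point(s)")
```

This masking effect is the main reason to prefer median-based statistics on short windows or heavily skewed columns such as volume.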

---

## **6.8 Data Smoothing Techniques**

Smoothing reduces noise to reveal underlying trends. However, excessive smoothing removes genuine signals.

```python
class DataSmoother:
    """
    Smoothing techniques for noisy financial data.
    
    Warning: Smoothing introduces lookahead bias if not handled carefully.
    Always use causal (backward-looking) filters for prediction tasks.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
    
    def moving_average(self, column: str, window: int = 5, 
                      type: str = 'simple') -> pd.Series:
        """
        Calculate moving average.
        
        type: 'simple' (SMA), 'exponential' (EMA), 'weighted' (WMA)
        """
        if type == 'simple':
            return self.df[column].rolling(window=window).mean()
        
        elif type == 'exponential':
            # More weight to recent values
            return self.df[column].ewm(span=window, adjust=False).mean()
        
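        elif type == 'weighted':
            # Linearly increasing weights: the most recent value gets the largest weight
            weights = np.arange(1, window + 1)
            return self.df[column].rolling(window=window).apply(
                lambda x: np.dot(x, weights) / weights.sum(), raw=True
            )
        
        raise ValueError(f"Unknown moving average type: {type}")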
    
    def savgol_filter(self, column: str, window: int = 5, 
                     polyorder: int = 2) -> pd.Series:
        """
        Savitzky-Golay filter: preserves shape better than moving average.
        
        Fits polynomial to window and uses center point.
        Good for preserving peaks while removing noise.
        """
        from scipy.signal import savgol_filter
        
        # Handle NaNs by interpolation
        data = self.df[column].interpolate()
        
        smoothed = savgol_filter(data, window_length=window, 
                                polyorder=polyorder, mode='nearest')
        return pd.Series(smoothed, index=self.df.index)
    
    def kalman_filter(self, column: str) -> pd.Series:
        """
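        Kalman filter: recursive estimator, optimal for linear-Gaussian models.
        
        Local-level model with fixed noise covariances. The ratio of
        transition_covariance to observation_covariance sets the smoothness;
        the values below are illustrative and should be tuned to the data.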
        """
        from pykalman import KalmanFilter
        
        kf = KalmanFilter(transition_matrices=[1],
                         observation_matrices=[1],
                         initial_state_mean=self.df[column].iloc[0],
                         initial_state_covariance=1,
                         observation_covariance=1,
                         transition_covariance=0.01)
        
        state_means, _ = kf.filter(self.df[column].values)
        return pd.Series(state_means.flatten(), index=self.df.index)
    
    def hodrick_prescott(self, column: str, lambda_: float = 1600) -> Tuple[pd.Series, pd.Series]:
        """
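        Hodrick-Prescott filter: separate trend and cyclical components.
        
        lambda_: smoothing parameter (1600 for quarterly data, 6.25 for yearly,
        129600 for monthly; there is no standard value for daily data).
        The filter is two-sided: the trend at time t depends on later
        observations, so use it for analysis rather than prediction features.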
        """
        from statsmodels.tsa.filters.hp_filter import hpfilter
        
        cycle, trend = hpfilter(self.df[column], lamb=lambda_)
        return trend, cycle

# Usage
# Generate noisy NEPSE price data
dates = pd.date_range('2024-01-01', periods=100, freq='B')
np.random.seed(42)
trend = np.linspace(2800, 3000, 100)
noise = np.random.randn(100) * 20
price = trend + noise

nepse_smooth = pd.DataFrame({
    'Date': dates,
    'Close': price
})
nepse_smooth.set_index('Date', inplace=True)

# Apply smoothing
smoother = DataSmoother(nepse_smooth)

# Simple Moving Average (causal - only uses past data)
nepse_smooth['SMA_5'] = smoother.moving_average('Close', window=5, type='simple')

# Exponential Moving Average
nepse_smooth['EMA_5'] = smoother.moving_average('Close', window=5, type='exponential')

# Savitzky-Golay (non-causal, for analysis only)
try:
    nepse_smooth['SavGol'] = smoother.savgol_filter('Close', window=5, polyorder=2)
except ImportError:
    print("scipy not installed, skipping Savitzky-Golay")

print("Smoothing applied")
print(nepse_smooth.head(10))
print("\nNote: SMA and EMA use only past data (causal) - safe for prediction.")
print("Savitzky-Golay uses future data within window - only for visualization, not prediction features.")
```

**Explanation:**
- **Simple Moving Average (SMA)**: Average of last N periods. Lagging indicator, smooth but slow to react.
- **Exponential Moving Average (EMA)**: Gives more weight to recent prices. Reacts faster to changes than SMA.
- **Savitzky-Golay**: Fits polynomial to window. Preserves peaks and valleys better than averaging. **Warning**: Uses center of window, so for real-time prediction you need to shift it or use only past data (causal filter).
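- **Kalman Filter**: Recursive filter that is statistically optimal when the linear-Gaussian model holds. It is causal, since each filtered estimate uses only observations up to that point. The version above uses fixed noise covariances, so it does not adapt to changing volatility by itself; the covariances have to be tuned or estimated from the data.
- **Hodrick-Prescott**: Decomposes into trend and cycle. Lambda parameter controls smoothness. Common in macroeconomics. It is a two-sided (non-causal) filter.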
- **Critical for prediction**: Only use causal filters (backward-looking) for feature engineering. Non-causal filters (centered) introduce future information (lookahead bias) which makes models look great in backtests but fail in production.
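
A simple way to check whether a filter is causal is to change one future observation and see whether the filtered value at an earlier time moves. The sketch below applies that check to a trailing and a centered moving average; the helper names `trailing_sma` and `centered_sma` and the synthetic series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
close = pd.Series(2900 + rng.normal(0, 20, 30).cumsum())

def trailing_sma(s: pd.Series, window: int = 5) -> pd.Series:
    """Causal: the value at t uses observations t-window+1 .. t."""
    return s.rolling(window).mean()

def centered_sma(s: pd.Series, window: int = 5) -> pd.Series:
    """Non-causal: the value at t also uses observations after t."""
    return s.rolling(window, center=True).mean()

# Change one *future* observation and see which filter reacts at time t
t = 15
shocked = close.copy()
shocked.iloc[t + 1] += 500  # something that happens after t

trailing_changed = not np.isclose(
    trailing_sma(close).iloc[t], trailing_sma(shocked).iloc[t]
)
centered_changed = not np.isclose(
    centered_sma(close).iloc[t], centered_sma(shocked).iloc[t]
)

print(f"Trailing SMA at t changed by a future shock: {trailing_changed}")
print(f"Centered SMA at t changed by a future shock: {centered_changed}")
```

The same perturbation test works for any smoothing function, which makes it a useful unit test for feature pipelines.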

---

## **6.9 Noise Reduction**

Beyond smoothing, specific techniques reduce microstructure noise (bid-ask bounce, discrete pricing).

```python
class NoiseReducer:
    """
    Reduce market microstructure noise from price data.
    
    NEPSE data may contain:
    - Bid-ask bounce (prices oscillating between bid and ask)
    - Discrete pricing (prices at tick increments)
    - Outliers from block trades
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
    
    def remove_bid_ask_bounce(self, price_col: str = 'Close', 
                              threshold: float = 0.001) -> pd.Series:
        """
        Detect and smooth bid-ask bounce.
        
        Where consecutive returns flip sign and the latest move is smaller
        than `threshold`, replace the price with a centered 3-point mean.
        The centered window uses the next price, so this is not causal.
        """
        prices = self.df[price_col]
        
        # Detect rapid reversals (sign changes in returns)
        returns = prices.pct_change()
        reversals = (returns.shift(1) * returns < 0) & \
                   (returns.abs() < threshold)
        
        # Where reversals occur, use moving average
        cleaned = prices.copy()
        cleaned[reversals] = prices.rolling(3, center=True, min_periods=1).mean()[reversals]
        
        return cleaned
    
    def wavelet_denoising(self, column: str, wavelet: str = 'db1', 
                         level: int = 2) -> pd.Series:
        """
        Wavelet denoising: removes high-frequency noise while preserving jumps.
        
        Better than Fourier because it handles non-stationary signals (time-varying frequency).
        """
        try:
            import pywt
            
            # Work on the non-missing values and remember their index
            series = self.df[column].dropna()
            data = series.values
            
            # Wavelet decomposition
            coeffs = pywt.wavedec(data, wavelet, level=level)
            
            # Universal threshold; sigma is estimated from the finest detail coefficients
            sigma = np.median(np.abs(coeffs[-1])) / 0.6745
            uthresh = sigma * np.sqrt(2 * np.log(len(data)))
            
            # Soft thresholding of detail coefficients (approximation coeffs[0] is kept)
            coeffs[1:] = [pywt.threshold(c, value=uthresh, mode='soft') 
                         for c in coeffs[1:]]
            
            # Reconstruct; output may be longer by 1 when the input length is odd
            denoised = pywt.waverec(coeffs, wavelet)[:len(data)]
            
            # Align to the original index; rows that were missing stay NaN
            return pd.Series(denoised, index=series.index).reindex(self.df.index)
            
        except ImportError:
            print("PyWavelets not installed, returning original")
            return self.df[column]
    
    def median_filter(self, column: str, window: int = 3) -> pd.Series:
        """
        Median filter: robust to outliers (unlike mean).
        
        Good for removing single-tick errors while preserving edges.
        """
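        from scipy.ndimage import median_filter
        
        # Forward-fill gaps first. The window is centered, so each output
        # value also depends on later points (not causal).
        values = self.df[column].ffill().to_numpy()
        filtered = median_filter(values, size=window, mode='nearest')
        return pd.Series(filtered, index=self.df.index)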

# Usage example
print("Noise reduction techniques demonstrated")
print("Note: Wavelet denoising requires pywt installation")
```

**Explanation:**
- **Bid-ask bounce**: In tick data, prices alternate between bid (buy) and ask (sell) prices. This creates artificial volatility. Detection looks for rapid reversals small in magnitude.
- **Wavelet denoising**: Unlike Fourier transforms which work on the whole signal, wavelets analyze local time-frequency. This preserves sharp jumps (earnings announcements) while removing high-frequency noise. Uses "soft thresholding" on detail coefficients.
- **Median filter**: Replaces each point with the median of its neighbors. Unlike the mean, the median is unaffected by a single extreme value in the window, so it removes isolated bad ticks. Two bad ticks inside a window of 3 will still get through.
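
The difference between mean and median filtering shows up clearly on a series with one bad tick. The sketch below uses trailing pandas windows, which is an assumption made to keep both filters causal (the `median_filter` method above uses a centered window from SciPy); the prices are hypothetical.

```python
import pandas as pd

# Hypothetical closes with one bad tick at position 4
close = pd.Series([2900.0, 2904.0, 2898.0, 2902.0, 5000.0, 2906.0, 2901.0, 2905.0])

# Trailing windows (window=3) so both filters use only current and past values
rolling_mean = close.rolling(3).mean()
rolling_median = close.rolling(3).median()

comparison = pd.DataFrame({
    "close": close,
    "mean_3": rolling_mean.round(2),
    "median_3": rolling_median,
})
print(comparison)
```

The bad tick pulls the rolling mean far above the surrounding prices for three consecutive rows, while the rolling median never leaves the normal price range.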

---

## **6.10 Anomaly Detection**

Anomalies differ from outliers: an outlier is a single extreme value, while an anomaly is a pattern that doesn't conform to expected behavior, often indicating an important event.

```python
class AnomalyDetector:
    """
    Detect anomalies in time-series using various methods.
    
    Anomalies vs Outliers:
    - Outliers: Statistical extremes (single points)
    - Anomalies: Pattern deviations (sequences, contextual)
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
    
    def isolation_forest(self, columns: List[str], 
                        contamination: float = 0.05) -> pd.Series:
        """
        Isolation Forest: ML-based anomaly detection.
        
        Isolates anomalies instead of profiling normal data.
        Efficient for high-dimensional data.
        """
        from sklearn.ensemble import IsolationForest
        
        data = self.df[columns].dropna()
        
        clf = IsolationForest(contamination=contamination, 
                             random_state=42,
                             n_estimators=100)
        predictions = clf.fit_predict(data)
        
        # -1 for anomaly, 1 for normal
        anomalies = pd.Series(predictions == -1, index=data.index)
        return anomalies
    
    def break_points(self, column: str, penalty: float = 10) -> List[int]:
        """
        Detect structural break points (regime changes).
        
        Uses the PELT (Pruned Exact Linear Time) algorithm from `ruptures`.
        A larger `penalty` yields fewer break points; tune it to the scale
        of the data. Returned values are positional indices into the
        NaN-dropped series.
        """
        try:
            import ruptures as rpt
            
            data = self.df[column].dropna().values.reshape(-1, 1)
            
            # PELT picks the number of break points itself, given the penalty
            model = rpt.Pelt(model="l2").fit(data)
            break_points = model.predict(pen=penalty)
            
            return break_points[:-1]  # Last entry is always the series length
            
        except ImportError:
            print("ruptures not installed")
            return []
    
    def contextual_anomaly(self, column: str, 
                          context_window: int = 5) -> pd.Series:
        """
        Detect contextual anomalies: unusual given recent context.
        
        Example: Normal volume is 100k, but given recent news, 
        150k might be expected. 50k would be anomalously low.
        """
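        data = self.df[column]
        
        # Rolling statistics over the *previous* context_window observations.
        # shift(1) excludes the current point: if it were included, its z-score
        # within its own window could never exceed (n - 1) / sqrt(n), about
        # 1.79 for n = 5, so the threshold of 3 would never trigger.
        rolling_mean = data.shift(1).rolling(window=context_window).mean()
        rolling_std = data.shift(1).rolling(window=context_window).std()
        
        # Z-score relative to recent context
        z_scores = (data - rolling_mean) / rolling_std
        
        # Anomaly if z-score > 3 or < -3 (NaN z-scores compare as False)
        anomalies = z_scores.abs() > 3
        
        return anomalies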

# Usage would require scikit-learn and optionally ruptures
print("Anomaly detection class defined")
print("Isolation Forest detects multivariate anomalies")
print("Breakpoint detection finds regime changes")
```

**Explanation:**
- **Isolation Forest**: Randomly selects features and split values. Anomalies are easier to isolate (fewer splits needed) than normal points. Unsupervised—no labeled anomalies needed.
- **Breakpoint detection**: Finds structural changes in the time-series (e.g., NEPSE market regime change after new regulations). Uses algorithms like PELT to find optimal change points.
- **Contextual anomalies**: A volume of 1M might be normal during earnings week but anomalous during quiet periods. Compares to recent context rather than global statistics.
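
One detail worth checking in any rolling z-score is whether the window includes the point being scored. The sketch below uses a synthetic volume series (an assumption for illustration) to compare both choices. With a sample standard deviation over `n` points, a point inside its own window can never score above `(n - 1) / sqrt(n)`, which is about 1.79 for `n = 5`.

```python
import numpy as np
import pandas as pd

# Synthetic volume: a deterministic wiggle around 100k with one injected spike
volume = pd.Series(100_000 + 2_000 * np.sin(np.arange(60)))
volume.iloc[40] = 160_000

window = 5

# Context built from the previous `window` observations only
prior_mean = volume.shift(1).rolling(window).mean()
prior_std = volume.shift(1).rolling(window).std()
z_prior = (volume - prior_mean) / prior_std

# Context that includes the current observation
z_including = (volume - volume.rolling(window).mean()) / volume.rolling(window).std()

print(f"Spike z-score, prior-only context:  {z_prior.iloc[40]:.1f}")
print(f"Largest |z| when window includes it: {z_including.abs().max():.2f}")
print(f"Theoretical bound (n - 1) / sqrt(n): {(window - 1) / np.sqrt(window):.2f}")
```

Scoring against prior observations only is also the causal choice: it uses nothing that would be unavailable at prediction time.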

---

## **6.11 Data Transformation Pipelines**

Production systems require automated, reproducible pipelines that combine all preprocessing steps.

```python
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class NEPSEPreprocessor(BaseEstimator, TransformerMixin):
    """
    Custom sklearn transformer for NEPSE data preprocessing.
    
    Integrates with sklearn pipelines for cross-validation compatibility.
    """
    
    def __init__(self, handle_missing: str = 'interpolate',
                 outlier_method: str = 'clip',
                 add_features: bool = True):
        self.handle_missing = handle_missing
        self.outlier_method = outlier_method
        self.add_features = add_features
        self.fitted_params = {}
    
    def fit(self, X, y=None):
        """Learn parameters from training data (e.g., outlier thresholds)."""
        # Store training statistics for consistent transformation
        self.fitted_params['means'] = X.mean()
        self.fitted_params['stds'] = X.std()
        self.fitted_params['mins'] = X.quantile(0.01)
        self.fitted_params['maxs'] = X.quantile(0.99)
        return self
    
    def transform(self, X):
        """Apply transformations."""
        X = X.copy()
        
        # Handle missing values
        if self.handle_missing == 'interpolate':
            X = X.interpolate(method='linear')
            # ffill()/bfill() replace the deprecated fillna(method=...) form
            X = X.ffill().bfill()
        
        # Handle outliers using training thresholds
        if self.outlier_method == 'clip':
            for col in X.columns:
                if col in self.fitted_params['mins']:
                    X[col] = X[col].clip(
                        lower=self.fitted_params['mins'][col],
                        upper=self.fitted_params['maxs'][col]
                    )
        
        # Add engineered features
        if self.add_features:
            if 'Close' in X.columns:
                X['Returns'] = X['Close'].pct_change()
                X['Volatility'] = X['Returns'].rolling(5).std()
                X['MA_Ratio'] = X['Close'] / X['Close'].rolling(5).mean()
        
        return X

# Usage in a pipeline
"""
from sklearn.ensemble import RandomForestRegressor

pipeline = Pipeline([
    ('preprocessor', NEPSEPreprocessor(handle_missing='interpolate')),
    ('model', RandomForestRegressor())
])

# Fit on training data
pipeline.fit(X_train, y_train)

# Predict on test data (preprocessing applied automatically)
predictions = pipeline.predict(X_test)
"""

print("Sklearn-compatible preprocessor defined")
print("Can be used in cross-validation to prevent data leakage")
print("Fit learns parameters on train set, transform applies to test set")
```

**Explanation:**
- **Sklearn compatibility**: By inheriting from `BaseEstimator` and `TransformerMixin`, our preprocessor works with sklearn's `Pipeline`, `GridSearchCV`, etc.
- **Fit vs Transform**: Critical distinction for preventing data leakage:
  - `fit()`: Learns statistics (means, outlier thresholds) from training data only
  - `transform()`: Applies those learned parameters to new data
  - Never calculate statistics on test data—always use training parameters
- **Pipeline benefits**: Ensures preprocessing is part of model training, not separate. Prevents forgetting to apply preprocessing to validation sets.
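
The fit/transform distinction can be illustrated without scikit-learn. The following is a minimal sketch on synthetic returns (the series, the split point, and the variable names are assumptions for illustration). It learns clipping thresholds on the training period only and compares them with the "leaky" thresholds obtained from the full dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns (deterministic, for illustration only)
returns = pd.Series(0.01 * np.sin(np.arange(100)))
returns.iloc[90] = 0.5  # a bad tick that lands in the test period

# Chronological split: no shuffling for time-series
train, test = returns.iloc[:80], returns.iloc[80:]

# "fit": learn clipping thresholds from the training period only
lower, upper = train.quantile([0.01, 0.99])

# "transform": apply the training thresholds to unseen data
test_clean = test.clip(lower=lower, upper=upper)

# Leaky alternative: thresholds computed on train and test together
leaky_upper = returns.quantile(0.99)

print(f"Train-only upper threshold:    {upper:.4f}")
print(f"Leaky upper threshold:         {leaky_upper:.4f}")
print(f"Max test value after clipping: {test_clean.max():.4f}")
```

The leaky threshold is pulled upward by the bad tick in the test period, so information from the future has influenced the preprocessing. Thresholds learned on the training period alone are unaffected.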

---

## **6.12 Reproducible Preprocessing**

Ensure that the same cleaning steps produce the same results every time, across different environments.

```python
import hashlib
import json

class ReproducibleCleaner:
    """
    Ensure preprocessing is fully reproducible.
    
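    Tracks:
    - Configuration (random seed, hyperparameters)
    - Each preprocessing step and its parameters
    - Data hashes (input and output)
    
    The code version (e.g., a git commit hash) is not captured
    automatically; pass it in `config` if it should appear in the report.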
    """
    
    def __init__(self, config: Dict):
        self.config = config
        self.seed = config.get('random_seed', 42)
        np.random.seed(self.seed)
        
        self.execution_log = {
            'config': config,
            'steps': [],
            'input_hash': None,
            'output_hash': None,
            'timestamp': datetime.now().isoformat()
        }
    
    def log_step(self, name: str, params: Dict):
        """Log each preprocessing step."""
        self.execution_log['steps'].append({
            'name': name,
            'params': params,
            'timestamp': datetime.now().isoformat()
        })
    
    def hash_dataframe(self, df: pd.DataFrame) -> str:
        """Create deterministic hash of DataFrame."""
        # Sort columns and index for consistency
        df_sorted = df.sort_index(axis=0).sort_index(axis=1)
        
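        # Convert to a string representation with fixed formatting
        # (DataFrame.to_json has no sort_keys argument; ordering is already
        # fixed by the sort above)
        data_string = df_sorted.to_json(orient='split', date_format='iso',
                                        double_precision=15)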
        
        # Create hash
        return hashlib.sha256(data_string.encode()).hexdigest()[:16]
    
    def verify_reproducibility(self, df: pd.DataFrame, 
                              expected_hash: str) -> bool:
        """Verify that output matches expected hash."""
        actual_hash = self.hash_dataframe(df)
        return actual_hash == expected_hash
    
    def save_report(self, filename: str):
        """Save execution log to JSON."""
        with open(filename, 'w') as f:
            json.dump(self.execution_log, f, indent=2, default=str)
        print(f"Reproducibility report saved to {filename}")

# Usage example
config = {
    'random_seed': 42,
    'outlier_threshold': 3.0,
    'missing_strategy': 'interpolate',
    'smoothing_window': 5
}

reproducible_cleaner = ReproducibleCleaner(config)

# Load data
data = nepse_outliers.copy()
input_hash = reproducible_cleaner.hash_dataframe(data)
reproducible_cleaner.execution_log['input_hash'] = input_hash

# Apply cleaning steps (example)
cleaned_data = data.interpolate()
reproducible_cleaner.log_step('interpolate', {'method': 'linear'})

cleaned_data = cleaned_data.ffill()  # fillna(method='ffill') is deprecated
reproducible_cleaner.log_step('fillna', {'method': 'ffill'})

# Hash output
output_hash = reproducible_cleaner.hash_dataframe(cleaned_data)
reproducible_cleaner.execution_log['output_hash'] = output_hash

# Save report
reproducible_cleaner.save_report('cleaning_report.json')

print(f"\nInput hash: {input_hash}")
print(f"Output hash: {output_hash}")
print(f"Config: {config}")
print("\nThis report ensures the exact same cleaning can be reproduced later.")
```

**Explanation:**
- **Reproducibility** is crucial for scientific integrity and debugging. If your model performance drops, you need to know exactly what changed.
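- **Data hashing**: Creates a fingerprint of the dataset. Two datasets with the same hash are, for practical purposes, identical (the hash here is truncated to 16 hex characters, so a collision is possible in principle but extremely unlikely). Because rows and columns are sorted before hashing, ordering differences do not change the fingerprint. Comparing hashes verifies that preprocessing produced the expected output.
- **Execution log**: Records every step with parameters and timestamps. Like a lab notebook for data science.
- **Random seeds**: NumPy and other libraries use random numbers. Setting the NumPy seed makes operations that draw from NumPy's global generator repeatable across runs. Python's `random` module and libraries with their own generators must be seeded separately, and scikit-learn estimators and splitters should be given an explicit `random_state`.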
- **Configuration**: All parameters externalized (not hardcoded) so they can be saved, versioned, and compared.
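
As a small illustration of the fingerprint idea, the sketch below hashes a DataFrame with `pd.util.hash_pandas_object` instead of a JSON string. The `fingerprint` function and the sample data are assumptions for illustration, and this is one possible approach rather than the chapter's method. It checks that row and column order do not affect the result while a single changed value does.

```python
import hashlib
import pandas as pd

def fingerprint(df: pd.DataFrame) -> str:
    """Order-insensitive fingerprint of a DataFrame (illustrative helper)."""
    ordered = df.sort_index(axis=0).sort_index(axis=1)
    digest = hashlib.sha256()
    # Column labels are hashed explicitly, since the row hashes cover values only
    digest.update(",".join(map(str, ordered.columns)).encode())
    # One uint64 hash per row, covering the index and all column values
    row_hashes = pd.util.hash_pandas_object(ordered, index=True)
    digest.update(row_hashes.to_numpy().tobytes())
    return digest.hexdigest()[:16]

df = pd.DataFrame(
    {"Close": [100.0, 101.5, 99.8], "Volume": [1200, 1500, 900]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)

# Same data with rows and columns reordered
shuffled = df.iloc[[2, 0, 1]][["Volume", "Close"]]

# Same shape, one value changed
edited = df.copy()
edited.loc[edited.index[1], "Close"] = 101.6

print("Original:", fingerprint(df))
print("Shuffled:", fingerprint(shuffled))
print("Edited:  ", fingerprint(edited))
```

Storing such a fingerprint next to the configuration lets a later run confirm that it started from the same input and arrived at the same output.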

---

## **Chapter Summary**

In this chapter, we covered comprehensive data cleaning and preprocessing for time-series:

### **Key Takeaways:**

1. **Systematic Strategy**: Always assess before cleaning, document changes, preserve raw data, and validate results. Use the `DataCleaningStrategy` framework.

2. **Duplicate Handling**: Distinguish between exact duplicates (errors) and index duplicates (corrections). Use appropriate strategies (keep last for corrections, remove for errors).

3. **Inconsistency Resolution**: Standardize symbols (uppercase), dates (ISO format), and units (actual shares, not thousands). Use regex and mapping dictionaries.

4. **Type Optimization**: Use float32 for prices, int32 for volume, category for symbols, and datetime for dates. This reduces memory by 50-80%.

5. **Missing Data Patterns**: Understand MCAR vs MAR vs MNAR. Use Little's MCAR test to check whether values are missing completely at random. Different mechanisms require different imputation strategies.

6. **Advanced Imputation**: 
   - KNN for cross-sectional similarity
   - Iterative (MICE) for correlated features like OHLC
   - Constrained imputation to maintain financial logic (High ≥ Low)

7. **Outlier Treatment**: Use statistical (IQR, Z-score), domain-specific (price jumps), and logic-based (OHLC violations) detection. Treat by clipping, removing, or flagging based on context.

8. **Smoothing**: Use causal filters (EMA, SMA with past data only) for prediction features. Non-causal filters (centered) are only for visualization.

9. **Noise Reduction**: Wavelet denoising preserves jumps while removing noise. Median filters remove bad ticks without distorting trends.

10. **Anomaly Detection**: Isolation Forest for multivariate anomalies, breakpoint detection for regime changes, contextual anomalies for unusual patterns given recent history.

11. **Pipelines**: Build sklearn-compatible transformers to prevent data leakage and ensure preprocessing is part of model validation.

12. **Reproducibility**: Hash inputs/outputs, log all steps, set random seeds, and save configurations. Science requires reproducibility.

### **Next Steps:**

In Chapter 7, we will cover **Exploratory Data Analysis**, including:
- Statistical visualization techniques
- Time-series decomposition (trend, seasonality)
- Correlation analysis for feature selection
- Distribution analysis and normality tests
- Autocorrelation and partial autocorrelation functions

---

**End of Chapter 6**

---

*This chapter provided production-grade techniques for cleaning financial time-series data. The NEPSE examples demonstrate how to handle real-world issues like OHLC logic violations, missing trading days, and price spikes. Remember: cleaning is not just about removing bad data—it's about understanding the data generation process and preserving the signal while removing noise.*

