# **Chapter 11: Basic Feature Creation**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Extract raw value features from NEPSE CSV data and understand their temporal validity constraints
- Construct difference features (absolute and relative) that capture daily price action and volatility
- Calculate percentage change features (returns) normalized for cross-stock and cross-time comparison
- Implement lag features correctly using time-series shift operations to prevent look-ahead bias
- Compute rolling window statistics (moving averages, volatility) with appropriate window sizes for NEPSE's trading calendar
- Generate expanding window features for cumulative trend analysis
- Engineer time-based features specific to NEPSE's Sunday-Thursday trading schedule and Nepali fiscal calendar
- Create interaction features combining price and volume information
- Apply mathematical transformations (log, power) to handle skewed distributions common in financial data
- Implement efficient computation patterns using vectorized pandas/numpy operations

---

## **11.1 Raw Value Features**

Raw value features are the fundamental measurements present in the original NEPSE dataset without transformation. While seemingly simple, proper handling of raw features requires understanding their temporal characteristics, data types, and implicit information content. These features form the foundation upon which all derived features are built.

The NEPSE CSV provides several raw price and volume measurements: `Open`, `High`, `Low`, `Close`, `LTP` (Last Traded Price), `VWAP` (Volume Weighted Average Price), `Vol` (Volume), `Turnover`, and `Trans.` (Number of Transactions). Each has distinct properties regarding when the information becomes available during the trading day and how it should be used in predictive modeling.

```python
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple
import matplotlib.pyplot as plt

class NEPSEBasicFeatureEngineer:
    """
    Basic feature creation for NEPSE stock prediction system.
    Handles raw value extraction and initial preprocessing.
    """
    
    def __init__(self, csv_path: str):
        self.csv_path = csv_path
        self.df = None
        self.feature_metadata = {}
        
    def load_and_validate_raw_features(self) -> pd.DataFrame:
        """
        Load NEPSE CSV and extract raw features with validation.
        
        NEPSE CSV Structure:
        S.No, Symbol, Conf., Open, High, Low, Close, LTP, Close - LTP, 
        Close - LTP %, VWAP, Vol, Prev. Close, Turnover, Trans., Diff, 
        Range, Diff %, Range %, VWAP %, 52 Weeks High, 52 Weeks Low
        """
        print("Loading NEPSE raw data...")
        self.df = pd.read_csv(self.csv_path)
        
        # Display initial structure
        print(f"Loaded {len(self.df)} records")
        print(f"Available columns: {self.df.columns.tolist()}")
        
        # Validate required raw features exist
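        # Includes every column the later methods index without a guard
        required_features = ['Open', 'High', 'Low', 'Close', 'LTP', 'VWAP', 'Vol',
                             'Prev. Close', 'Turnover', 'Trans.',
                             '52 Weeks High', '52 Weeks Low']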
        missing = [f for f in required_features if f not in self.df.columns]
        
        if missing:
            raise ValueError(f"Missing required features: {missing}")
        
        # Data type validation and conversion
        price_columns = ['Open', 'High', 'Low', 'Close', 'LTP', 'VWAP', 
                        'Prev. Close', '52 Weeks High', '52 Weeks Low']
        
        for col in price_columns:
            if col in self.df.columns:
                # Convert to float, handling any string formatting
                self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
                
        # Volume, turnover, and transaction counts as numeric (float if any value is missing)
        volume_columns = ['Vol', 'Turnover', 'Trans.']
        for col in volume_columns:
            if col in self.df.columns:
                self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
        
        # Sort by S.No first so that fills follow temporal order
        if 'S.No' in self.df.columns:
            self.df = self.df.sort_values('S.No').reset_index(drop=True)
        
        # Handle missing values (forward fill for prices, 0 for volume)
        for col in price_columns:
            if col in self.df.columns:
                # bfill only touches leading gaps; it copies a later value backward
                self.df[col] = self.df[col].ffill().bfill()
        
        for col in volume_columns:
            if col in self.df.columns:
                self.df[col] = self.df[col].fillna(0)
        
        print("✓ Raw features validated and cleaned")
        return self.df
    
    def extract_price_features(self) -> Dict[str, Dict[str, pd.Series]]:
        """
        Extract and categorize raw price features by temporal availability.
        
        Critical for preventing look-ahead bias in feature engineering.
        """
        features = {}
        
        # Group 1: Opening information (known at 11:00 AM NPT when market opens)
        features['opening'] = {
            'Open': self.df['Open'],
            'Prev_Close': self.df['Prev. Close'],  # Yesterday's close
            'Overnight_Gap': self.df['Open'] - self.df['Prev. Close']
        }
        
        # Group 2: Intraday extremes (only known after 3:00 PM market close)
        # WARNING: These cannot be used for same-day prediction without lagging
        features['intraday_extremes'] = {
            'High': self.df['High'],
            'Low': self.df['Low'],
            # Simple High-Low range; the gap-adjusted true range also needs Prev. Close
            'True_Range': self.df['High'] - self.df['Low'],
            'Mid_Price': (self.df['High'] + self.df['Low']) / 2
        }
        
        # Group 3: Closing information (known after 3:00 PM)
        features['closing'] = {
            'Close': self.df['Close'],
            'LTP': self.df['LTP'],  # Last Traded Price (usually = Close)
            'VWAP': self.df['VWAP']  # Volume Weighted Average Price
        }
        
        # Group 4: Volume information (cumulative during day, final after close)
        features['volume'] = {
            'Volume': self.df['Vol'],
            'Turnover': self.df['Turnover'],  # Value traded
            'Transactions': self.df['Trans.'],  # Number of trades
            'Avg_Trade_Size': self.df['Turnover'] / self.df['Trans.'].replace(0, np.nan)  # NaN when no trades
        }
        
        # Document temporal constraints
        self.feature_metadata['temporal_groups'] = {
            'opening': 'Available at market open (11:00 AM NPT)',
            'intraday_extremes': 'Only available after market close (3:00 PM NPT)',
            'closing': 'Available after market close (3:00 PM NPT)',
            'volume': 'Cumulative during day, final after close'
        }
        
        return features
    
    def create_basic_price_relationships(self) -> pd.DataFrame:
        """
        Create basic relationship features from raw prices.
        These are still "basic" as they use only same-day information.
        """
        df = self.df
        
        # Position of Close within daily range (0=at low, 1=at high)
        # A one-day analogue of the stochastic oscillator's %K, which normally uses an N-day high/low
        df['Close_Position'] = (df['Close'] - df['Low']) / (df['High'] - df['Low'] + 0.0001)
        
        # Distance from VWAP (Volume Weighted Average Price)
        # Positive = closed above average price (bullish)
        # Negative = closed below average price (bearish)
        df['VWAP_Distance'] = df['Close'] - df['VWAP']
        df['VWAP_Distance_Pct'] = (df['VWAP_Distance'] / df['VWAP']) * 100
        
        # Spread between High and Low relative to Close (volatility proxy)
        df['Daily_Range_Pct'] = ((df['High'] - df['Low']) / df['Close']) * 100
        
        # Open to Close relationship (intraday trend)
        df['Open_Close_Diff'] = df['Close'] - df['Open']
        df['Open_Close_Return'] = (df['Open_Close_Diff'] / df['Open']) * 100
        
        # Distance from 52-week extremes (long-term position)
        df['Distance_52W_High'] = ((df['52 Weeks High'] - df['Close']) / df['52 Weeks High']) * 100
        df['Distance_52W_Low'] = ((df['Close'] - df['52 Weeks Low']) / df['52 Weeks Low']) * 100
        
        print("Created 8 basic relationship features")
        return df

# Usage Example
if __name__ == "__main__":
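    # Create sample NEPSE data matching the CSV structure
    # Note: columns are drawn independently, so Low <= Open/Close <= High is not
    # guaranteed and range-based features can come out negative on this sample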
    np.random.seed(42)
    n_days = 100
    
    sample_data = pd.DataFrame({
        'S.No': range(1, n_days + 1),
        'Symbol': ['NEPSE'] * n_days,
        'Conf.': [''] * n_days,
        'Open': np.random.uniform(1800, 2000, n_days),
        'High': np.random.uniform(1900, 2100, n_days),
        'Low': np.random.uniform(1700, 1900, n_days),
        'Close': np.random.uniform(1850, 2050, n_days),
        'LTP': np.random.uniform(1850, 2050, n_days),
        'Close - LTP': np.random.uniform(-5, 5, n_days),
        'Close - LTP %': np.random.uniform(-0.5, 0.5, n_days),
        'VWAP': np.random.uniform(1840, 2040, n_days),
        'Vol': np.random.randint(1000000, 5000000, n_days),
        'Prev. Close': np.random.uniform(1840, 2040, n_days),
        'Turnover': np.random.randint(1000000000, 5000000000, n_days, dtype=np.int64),  # upper bound exceeds int32
        'Trans.': np.random.randint(1000, 5000, n_days),
        'Diff': np.random.uniform(-50, 50, n_days),
        'Range': np.random.uniform(100, 200, n_days),
        'Diff %': np.random.uniform(-2, 2, n_days),
        'Range %': np.random.uniform(1, 5, n_days),
        'VWAP %': np.random.uniform(-1, 1, n_days),
        '52 Weeks High': np.random.uniform(2500, 3000, n_days),
        '52 Weeks Low': np.random.uniform(1200, 1600, n_days)
    })
    
    # Save to CSV for demonstration
    sample_data.to_csv('nepse_sample.csv', index=False)
    
    # Initialize engineer
    engineer = NEPSEBasicFeatureEngineer('nepse_sample.csv')
    
    # Load and validate
    engineer.load_and_validate_raw_features()
    
    # Extract categorized features
    raw_groups = engineer.extract_price_features()
    
    print("\nRaw Feature Groups:")
    for group, features in raw_groups.items():
        print(f"  {group}: {list(features.keys())}")
    
    # Create basic relationships
    engineer.create_basic_price_relationships()
    
    # Display sample
    print("\nSample of Basic Features:")
    display_cols = ['Close', 'Open', 'Close_Position', 'VWAP_Distance_Pct', 
                   'Daily_Range_Pct', 'Distance_52W_High']
    print(engineer.df[display_cols].head())
```

**Explanation:**

This implementation establishes the foundation of feature engineering by carefully handling raw NEPSE data with attention to temporal validity—critical for preventing look-ahead bias.

**Temporal Grouping Strategy:**
The `extract_price_features()` method categorizes raw features by when they become available during the trading day. This is essential because NEPSE operates from 11:00 AM to 3:00 PM Nepal Time, and features must respect this timeline:

- **Opening Group**: `Open` and `Prev. Close` are known at 11:00 AM when the market opens. The `Overnight_Gap` (difference between today's open and yesterday's close) captures pre-market sentiment and overnight news impact. These are safe to use for predicting intraday movements because they are available as soon as the session opens, before the rest of the day's trading unfolds.

- **Intraday Extremes Group**: `High` and `Low` represent the day's trading range but are **only known after 3:00 PM** when the market closes. The code explicitly warns that these cannot be used for same-day prediction without lagging (shifting by one period). The `True_Range` key in this group is the simple High minus Low range, a basic measure of daily volatility; the gap-adjusted true range, which also considers the previous close, is built in Section 11.2. `Mid_Price` is the midpoint of the day's range.

- **Closing Group**: `Close`, `LTP` (Last Traded Price), and `VWAP` are final values determined after market close. In NEPSE data, `LTP` usually equals `Close` for the last trade, but during the day, LTP represents the most recent transaction price. `VWAP` is particularly valuable because it represents the average price weighted by volume, indicating where the majority of trading occurred.

- **Volume Group**: `Volume`, `Turnover`, and `Transactions` measure market activity. The `Avg_Trade_Size` (Turnover divided by number of transactions) reveals whether trading is dominated by large institutional orders or small retail trades—a key characteristic of the NEPSE market structure.
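
The temporal groups above imply that values finalized at the close of day *t* can only feed a prediction for day *t+1*. A minimal sketch of that lagging step follows. It assumes a small hypothetical frame `prices` and a hypothetical helper `lag_close_only_features`; neither is part of the class above.

```python
from typing import List

import pandas as pd

# Hypothetical three-day sample; the values are illustrative only.
prices = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0],
    "High":  [105.0, 104.0, 106.0],
    "Low":   [99.0, 100.0, 100.0],
    "Close": [103.0, 101.0, 105.0],
})


def lag_close_only_features(df: pd.DataFrame, cols: List[str]) -> pd.DataFrame:
    """Shift close-only columns down one row so row t holds day t-1 values."""
    out = df.copy()
    for col in cols:
        out[f"{col}_lag1"] = out[col].shift(1)
    return out


lagged = lag_close_only_features(prices, ["High", "Low", "Close"])

# Open is usable on the same day; the lagged columns are yesterday's values.
print(lagged[["Open", "High_lag1", "Low_lag1", "Close_lag1"]])
```

The first row of every lagged column is NaN because no prior day exists; dropping that row is usually safer than filling it.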

**Basic Relationship Features:**
The `create_basic_price_relationships()` method transforms raw absolute prices into relative metrics that are more predictive:

- **Close_Position**: Normalizes the closing price within the daily range (0 to 1 scale). A value of 0.9 indicates the stock closed near its daily high (bullish sentiment), while 0.1 indicates a close near the low (bearish). This normalization allows comparison across different price levels.

- **VWAP_Distance**: Measures how far the closing price deviated from the volume-weighted average. In NEPSE, closing significantly above VWAP suggests late-day buying pressure or institutional accumulation, while closing below suggests distribution. The percentage version normalizes this for cross-stock comparison.

- **Daily_Range_Pct**: The intraday range (High-Low) as a percentage of the closing price. This is a pure volatility measure: high values indicate uncertainty or news-driven trading, while low values indicate consensus. For NEPSE, range percentages above roughly 4% may indicate circuit breaker proximity, although the exact thresholds depend on the exchange rules in force.

- **Distance_52W_High/Low**: Positions the current price within its 52-week trading range. These are mean-reversion indicators—stocks near 52-week highs may be overextended, while those near lows may be oversold. In the context of NEPSE's cyclical market behavior, these extremes often precede reversals.
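
The relationship features above can be checked by hand on a tiny frame. The sketch below uses two hypothetical bars (the numbers are made up) and, as an alternative to the `+ 0.0001` guard in the class, marks zero-range days as NaN. That substitution is an assumption of this example, not the chapter's method.

```python
import numpy as np
import pandas as pd

# Two hypothetical daily bars; the second has no intraday range.
day = pd.DataFrame({
    "High":  [520.0, 300.0],
    "Low":   [490.0, 300.0],
    "Close": [515.0, 300.0],
    "VWAP":  [505.0, 300.0],
})

# Replace a zero range with NaN so the ratio is undefined rather than ~0
day_range = (day["High"] - day["Low"]).replace(0, np.nan)

day["Close_Position"] = (day["Close"] - day["Low"]) / day_range
day["VWAP_Distance_Pct"] = (day["Close"] - day["VWAP"]) / day["VWAP"] * 100
day["Daily_Range_Pct"] = (day["High"] - day["Low"]) / day["Close"] * 100

print(day.round(3))
```

For the first bar, the close sits 25 points above the low within a 30-point range, so `Close_Position` is 25/30. The second bar has no range, so its position is left undefined instead of being reported as "closed at the low".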

**Data Validation and Cleaning:**
The `load_and_validate_raw_features()` method implements defensive data handling. It converts all price columns to numeric types (values that cannot be parsed become NaN), forward-fills missing price data (reasonable for time-series where the last known price remains valid), back-fills any gap at the very start of the series, and fills missing volume data with zeros (assuming no trading occurred). Note that the back-fill copies a later value into earlier rows, so rows repaired this way are best excluded when strict look-ahead safety matters. This keeps the pipeline from failing on real-world data quality issues common in NEPSE historical datasets.
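
The cleaning behaviour can be seen on a few hand-made rows. This sketch assumes, purely for illustration, that prices may arrive as strings with thousands separators; `pd.to_numeric` does not parse those, so they are stripped first. The frame `raw` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw export: string prices, missing cells, a leading gap
raw = pd.DataFrame({
    "Close": [None, "1,950.5", None, "1,962.0"],
    "Vol":   ["1200", None, "900", "1500"],
})

# Strip thousands separators, then coerce anything unparseable to NaN
close = pd.to_numeric(
    raw["Close"].str.replace(",", "", regex=False), errors="coerce"
)
vol = pd.to_numeric(raw["Vol"], errors="coerce").fillna(0)

clean = pd.DataFrame({
    "Close_ffill": close.ffill(),                # leading gap stays NaN
    "Close_ffill_bfill": close.ffill().bfill(),  # leading gap takes a later value
    "Vol": vol,
})
print(clean)
```

Row 0 shows the trade-off: with forward fill alone the leading gap remains missing, while adding a back-fill fills it with the next day's price.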

---

## **11.2 Difference Features**

Difference features capture absolute changes and spreads between related price measurements. Unlike percentage changes (which we'll cover next), differences preserve the magnitude information and are particularly useful when the absolute scale of price movements carries predictive information—such as when analyzing support/resistance levels or volatility clustering in the NEPSE market.

In the context of the Nepalese stock market, difference features help identify intraday momentum, opening gaps, and price compression/expansion patterns that are specific to the market's liquidity characteristics and trading behavior.

```python
class NEPSEDifferenceFeatures:
    """
    Creation of absolute difference features for NEPSE time-series.
    These capture spreads, gaps, and absolute price movements.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        
    def create_intraday_spread_features(self) -> pd.DataFrame:
        """
        Create features based on spreads between OHLC values.
        These capture intraday volatility and price action.
        """
        df = self.df
        
        # Basic range (High - Low)
        # Large values indicate high volatility, small values consolidation
        df['HL_Spread'] = df['High'] - df['Low']
        
        # Upper and lower shadows (wicks) of the candlestick
        # These represent rejected price levels
        df['Upper_Shadow'] = df['High'] - df[['Close', 'Open']].max(axis=1)
        df['Lower_Shadow'] = df[['Close', 'Open']].min(axis=1) - df['Low']
        
        # Body of the candlestick (Open to Close)
        # Large body = strong directional movement
        # Small body = indecision (doji)
        df['Body_Size'] = abs(df['Close'] - df['Open'])
        
        # Shadow to Body ratio
        # High values indicate rejection of price levels (potential reversal)
        # Low values indicate strong trend (marubozu candles)
        df['Shadow_Body_Ratio'] = (df['Upper_Shadow'] + df['Lower_Shadow']) / (df['Body_Size'] + 0.0001)
        
        # True Range (accounts for gaps from previous close)
        # More accurate volatility measure than simple HL range
        prev_close = df['Prev. Close']
        df['True_Range'] = np.maximum(
            df['High'] - df['Low'],
            np.maximum(
                abs(df['High'] - prev_close),
                abs(df['Low'] - prev_close)
            )
        )
        
        print("Created intraday spread features:")
        print("  - HL_Spread: High-Low range")
        print("  - Upper/Lower_Shadow: Rejection wicks")
        print("  - Body_Size: Open-Close magnitude")
        print("  - Shadow_Body_Ratio: Rejection strength")
        print("  - True_Range: Gap-adjusted volatility")
        
        return df
    
    def create_gap_features(self) -> pd.DataFrame:
        """
        Create features for overnight and intraday gaps.
        Gaps in NEPSE often indicate significant news or sentiment shifts.
        """
        df = self.df
        
        # Overnight gap (Open - Previous Close)
        # Positive = gap up (bullish overnight sentiment)
        # Negative = gap down (bearish overnight sentiment)
        df['Overnight_Gap'] = df['Open'] - df['Prev. Close']
        
        # Gap percentage (normalized)
        df['Overnight_Gap_Pct'] = (df['Overnight_Gap'] / df['Prev. Close']) * 100
        
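        # Gap reversal indicator (a proxy for gap fill): the session closed
        # against the direction of the gap. It does not test whether price
        # actually traded back to the previous close.
        # Gap up + Close below Open = potential exhaustion
        # Gap down + Close above Open = potential reversal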
        df['Gap_Fill'] = (
            ((df['Overnight_Gap'] > 0) & (df['Close'] < df['Open'])) |  # Gap up, closed lower
            ((df['Overnight_Gap'] < 0) & (df['Close'] > df['Open']))    # Gap down, closed higher
        ).astype(int)
        
        # Gap size categories (NEPSE context: >2% is significant)
        df['Gap_Size_Category'] = pd.cut(
            df['Overnight_Gap_Pct'].abs(),
            bins=[0, 0.5, 2, 4, 100],
            labels=['Small', 'Medium', 'Large', 'Extreme'],
            include_lowest=True  # keep zero-gap days in 'Small' instead of NaN
        )
        
        # Intraday extension (how far did price move beyond open?)
        df['Up_Extension'] = df['High'] - df['Open']  # Buying pressure
        df['Down_Extension'] = df['Open'] - df['Low']  # Selling pressure
        
        # Extension balance (buying vs selling pressure)
        df['Extension_Balance'] = df['Up_Extension'] - df['Down_Extension']
        
        print("\nCreated gap features:")
        print("  - Overnight_Gap: Open vs Previous Close")
        print("  - Gap_Fill: 1 if the session closed against the gap direction")
        print("  - Gap_Size_Category: Discretized gap magnitude")
        print("  - Up/Down_Extension: Intraday movement beyond open")
        
        return df
    
    def create_momentum_divergence_features(self) -> pd.DataFrame:
        """
        Create features showing divergence between price and volume.
        Divergences often predict trend changes.
        """
        df = self.df
        
        # Price change vs Volume change divergence
        # Rising price on falling volume = weak trend (potential reversal)
        # Falling price on rising volume = strong trend (continuation likely)
        
        # Calculate differences
        price_diff = df['Close'].diff()
        volume_diff = df['Vol'].diff()
        
        # Divergence flags
        df['Price_Volume_Divergence'] = (
            ((price_diff > 0) & (volume_diff < 0)) |  # Price up, volume down
            ((price_diff < 0) & (volume_diff > 0))    # Price down, volume up
        ).astype(int)
        
        # VWAP divergence (price vs volume-weighted price)
        # Price above VWAP on increasing volume = accumulation
        # Price below VWAP on increasing volume = distribution
        df['VWAP_Diff'] = df['Close'] - df['VWAP']
        
        # Change in VWAP difference (acceleration)
        df['VWAP_Diff_Change'] = df['VWAP_Diff'].diff()
        
        print("\nCreated divergence features:")
        print("  - Price_Volume_Divergence: Inverse relationship flag")
        print("  - VWAP_Diff: Deviation from volume-weighted price")
        print("  - VWAP_Diff_Change: Change in deviation")
        
        return df

# Demonstration
if __name__ == "__main__":
    # Use the engineer from previous section
    engineer = NEPSEBasicFeatureEngineer('nepse_sample.csv')
    engineer.load_and_validate_raw_features()
    engineer.create_basic_price_relationships()
    
    # Create difference features
    diff_engineer = NEPSEDifferenceFeatures(engineer.df)
    
    diff_engineer.create_intraday_spread_features()
    diff_engineer.create_gap_features()
    diff_engineer.create_momentum_divergence_features()
    
    # Display results
    print("\nSample Difference Features:")
    display_cols = ['Open', 'High', 'Low', 'Close', 'HL_Spread', 
                   'Upper_Shadow', 'Body_Size', 'Overnight_Gap', 
                   'Gap_Fill', 'VWAP_Diff']
    print(diff_engineer.df[display_cols].head())
```

**Explanation:**

This section focuses on **absolute difference features** that capture the magnitude of price movements and spreads without normalizing to percentages. These features are particularly valuable for NEPSE analysis because they preserve the scale of price action, which can indicate the intensity of buying/selling pressure.

**Intraday Spread Features:**
The `create_intraday_spread_features()` method analyzes the internal structure of each trading day's price bar (candlestick). In technical analysis, the relationship between the body (Open-to-Close range) and shadows (wicks) reveals market psychology:

- **HL_Spread (High-Low Spread)**: This is the raw daily trading range in Nepalese Rupees (NPR). Unlike percentage ranges, the absolute spread indicates the actual money at stake during the day's volatility. For NEPSE, spreads above NPR 50 on mid-cap stocks often indicate institutional participation or significant news events.

- **Upper_Shadow and Lower_Shadow**: These represent price rejections. The `Upper_Shadow` measures how far the price rose above the Open/Close before being rejected (selling pressure). The `Lower_Shadow` measures how far it fell below before recovering (buying support). In the NEPSE context, long upper shadows on high-volume days often indicate distribution by large holders at resistance levels.

- **Body_Size**: The absolute difference between Open and Close. Large bodies (relative to recent history) indicate strong conviction—either buying (green/bullish) or selling (red/bearish). Small bodies suggest indecision or equilibrium between buyers and sellers.

- **Shadow_Body_Ratio**: A high ratio (>2.0) indicates that the market tested levels significantly beyond the open/close but failed to hold them—a potential reversal signal. Low ratios (<0.5) indicate strong trending days where the price moved consistently in one direction (marubozu candles).

- **True_Range**: This is a more sophisticated volatility measure than simple High-Low range because it accounts for gaps from the previous day. If NEPSE opens with a large gap up (due to overnight news), the True Range captures the full extent of the movement, including the gap. This is essential for volatility-based position sizing and stop-loss calculations.
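
A single worked bar makes the candlestick arithmetic above concrete. The helper `candle_anatomy` and the prices are hypothetical, chosen so that the gap from the previous close makes the true range larger than the plain High-Low spread.

```python
def candle_anatomy(open_: float, high: float, low: float,
                   close: float, prev_close: float) -> dict:
    """Compute the spread features for one daily bar."""
    body_top = max(open_, close)
    body_bottom = min(open_, close)
    upper_shadow = high - body_top
    lower_shadow = body_bottom - low
    body_size = abs(close - open_)
    return {
        "HL_Spread": high - low,
        "Upper_Shadow": upper_shadow,
        "Lower_Shadow": lower_shadow,
        "Body_Size": body_size,
        # None when open == close, instead of dividing by a small epsilon
        "Shadow_Body_Ratio": (
            (upper_shadow + lower_shadow) / body_size if body_size else None
        ),
        # Largest of: today's range, and the distance of either extreme
        # from the previous close
        "True_Range": max(high - low,
                          abs(high - prev_close),
                          abs(low - prev_close)),
    }


# Gap-up day: previous close 90, then a session trading between 95 and 110
bar = candle_anatomy(open_=100.0, high=110.0, low=95.0,
                     close=105.0, prev_close=90.0)
print(bar)
```

Here the High-Low spread is 15, but the true range is 20 because the high sits 20 above the previous close.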

**Gap Analysis Features:**
The `create_gap_features()` method analyzes discontinuities between trading sessions, which are particularly significant in NEPSE due to the Friday-Saturday weekend and potential overnight news from the Nepali government or central bank (NRB).

- **Overnight_Gap**: The absolute difference between today's Open and yesterday's Close. In NEPSE, gaps often occur due to:
  - Policy announcements (monetary policy, fiscal budget)
  - Quarterly earnings releases
  - Regional market movements (Indian markets influence)
  - Weekend news accumulation (Friday-Saturday gap)

- **Gap_Fill**: A binary proxy for gap fills. The flag is 1 when the session closed against the direction of the gap: gap up followed by selling (close below open) or gap down followed by buying (close above open). It does not check whether price actually traded back to the previous day's close; a strict fill test would compare `Low` (for gap ups) or `High` (for gap downs) with `Prev. Close`. A flagged day suggests the initial sentiment was temporary, while an unflagged gap is more consistent with trend continuation.

- **Gap_Size_Category**: Discretizes gaps into Small (<0.5%), Medium (0.5-2%), Large (2-4%), and Extreme (>4%). In NEPSE, Extreme gaps often trigger circuit breakers or indicate major corporate actions. This categorical encoding helps tree-based models handle non-linear gap effects.
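
For comparison with the close-versus-open flag, a strict gap-fill test checks whether the session's range reached the previous close. The sketch below computes both on four hypothetical sessions; the column names `Gap_Filled_Strict` and `Gap_Reversal_Proxy` are illustrative and do not appear in the class above.

```python
import pandas as pd

# Four hypothetical sessions, all with a previous close of 100
sessions = pd.DataFrame({
    "Prev. Close": [100.0, 100.0, 100.0, 100.0],
    "Open":        [103.0, 103.0, 97.0, 100.0],
    "High":        [104.0, 104.0, 100.5, 101.0],
    "Low":         [99.5, 101.0, 96.0, 99.0],
    "Close":       [102.0, 102.0, 99.0, 100.5],
})

gap = sessions["Open"] - sessions["Prev. Close"]

# Strict fill: price traded back to (or through) the previous close
sessions["Gap_Filled_Strict"] = (
    ((gap > 0) & (sessions["Low"] <= sessions["Prev. Close"])) |
    ((gap < 0) & (sessions["High"] >= sessions["Prev. Close"]))
).astype(int)

# Proxy: the session closed against the direction of the gap
sessions["Gap_Reversal_Proxy"] = (
    ((gap > 0) & (sessions["Close"] < sessions["Open"])) |
    ((gap < 0) & (sessions["Close"] > sessions["Open"]))
).astype(int)

print(sessions[["Open", "Low", "High", "Close",
                "Gap_Filled_Strict", "Gap_Reversal_Proxy"]])
```

The second session is the instructive one: it gapped up and closed below its open, so the proxy fires, yet its low of 101 never reached the previous close of 100, so the gap was not actually filled.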

**Momentum Divergence Features:**
The `create_momentum_divergence_features()` method identifies discrepancies between price movement and volume, which often precede trend changes:

- **Price_Volume_Divergence**: A binary flag indicating when price and volume move in opposite directions (price up/volume down or price down/volume up). In low-liquidity NEPSE stocks, rising prices on declining volume suggest the move is not supported by broad participation and may reverse. Conversely, falling prices on rising volume indicate strong selling pressure that is likely to continue. Because the flag merges both cases, pair it with the sign of the price change if the model needs to tell them apart.

- **VWAP_Diff**: The absolute difference between Close and VWAP. Since VWAP represents the "fair" average price where most trading occurred, deviations indicate end-of-day sentiment shifts. Large positive VWAP_Diff suggests late buying (bullish), while negative suggests late selling (bearish).

These difference features provide the raw material for understanding market microstructure in NEPSE—how prices move, where they find support/resistance, and how volume validates (or invalidates) price movements.
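
The divergence flag can be traced on four hypothetical bars. This sketch also splits the flag into its two cases, since they carry opposite interpretations; the split columns `Up_On_Falling_Vol` and `Down_On_Rising_Vol` are an illustrative addition, not part of the class above.

```python
import pandas as pd

# Four hypothetical bars
bars = pd.DataFrame({
    "Close": [100.0, 102.0, 101.0, 103.0],
    "Vol":   [1000, 900, 1200, 1300],
})

price_diff = bars["Close"].diff()   # NaN, +2, -1, +2
volume_diff = bars["Vol"].diff()    # NaN, -100, +300, +100

# Comparisons against NaN are False, so the first row is never flagged
bars["Up_On_Falling_Vol"] = ((price_diff > 0) & (volume_diff < 0)).astype(int)
bars["Down_On_Rising_Vol"] = ((price_diff < 0) & (volume_diff > 0)).astype(int)
bars["Price_Volume_Divergence"] = (
    bars["Up_On_Falling_Vol"] | bars["Down_On_Rising_Vol"]
)

print(bars)
```

Day 2 rises on lighter volume and day 3 falls on heavier volume, so both are flagged; day 4 rises on rising volume and is not.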

---

## **11.3 Percentage Change Features**

Percentage change features (returns) normalize price movements relative to a base value, enabling cross-sectional comparison across stocks with different price levels and time periods. This normalization is essential for the NEPSE prediction system because it allows the model to learn patterns that generalize across different stocks and market regimes, rather than memorizing price levels specific to individual securities.

Returns also have more desirable statistical properties than raw prices: they are closer to stationary (their mean and variance drift far less than price levels do) and they approximate a normal distribution more closely, though financial returns typically have fat tails and their variance still clusters over time (heteroscedasticity).
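
One practical difference between simple and log returns is how they aggregate over time. A small sketch on three hypothetical closing prices shows that simple returns must be compounded, while log returns can be summed:

```python
import numpy as np
import pandas as pd

# Hypothetical closes: up 10%, then down 10%
close = pd.Series([100.0, 110.0, 99.0])

simple = close.pct_change().dropna()
log_ret = np.log(close / close.shift(1)).dropna()

total_simple = close.iloc[-1] / close.iloc[0] - 1   # true two-day return
sum_of_simple = simple.sum()                        # naive sum of daily returns
compounded = (1 + simple).prod() - 1                # correct aggregation
sum_of_logs = log_ret.sum()                         # log returns add up

print(f"true two-day return:   {total_simple:.4f}")
print(f"sum of simple returns: {sum_of_simple:.4f}")
print(f"compounded simple:     {compounded:.4f}")
print(f"exp(sum of logs) - 1:  {np.exp(sum_of_logs) - 1:.4f}")
```

A 10% gain followed by a 10% loss leaves the price 1% lower, which the naive sum of simple returns (zero) misses, while both the compounded product and the summed log returns recover it.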

```python
class NEPSEPercentageFeatures:
    """
    Percentage change (return) features for NEPSE data.
    Normalizes price movements for cross-stock and cross-time comparison.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        
    def create_daily_returns(self) -> pd.DataFrame:
        """
        Create basic return features from closing prices.
        Returns are the foundation of financial time-series analysis.
        """
        df = self.df
        
        # Simple return (arithmetic)
        # Formula: (P_t - P_{t-1}) / P_{t-1}
        df['Daily_Return'] = df['Close'].pct_change()
        
        # Log return (continuously compounded)
        # Formula: ln(P_t / P_{t-1})
        # Properties: Time-additive, symmetric for +/- moves
        df['Log_Return'] = np.log(df['Close'] / df['Close'].shift(1))
        
        # Return against the exchange-reported Prev. Close
        # Matches Daily_Return unless Prev. Close differs from the prior row's Close
        # (for example after a corporate action or a missing trading day)
        df['Adj_Return'] = (df['Close'] - df['Prev. Close']) / df['Prev. Close']
        
        # Signed returns (positive/negative classification)
        df['Return_Direction'] = np.sign(df['Daily_Return'])
        df['Is_Positive_Return'] = (df['Daily_Return'] > 0).astype(int)
        
        # Return magnitude categories (for classification models)
        df['Return_Magnitude'] = pd.cut(
            df['Daily_Return'].abs(),
            bins=[0, 0.01, 0.02, 0.05, 1.0],
            labels=['Small', 'Medium', 'Large', 'Extreme'],
            include_lowest=True  # keep zero-return days in 'Small' instead of NaN
        )
        
        # Volatility regime (rolling standard deviation of returns)
        df['Volatility_20'] = df['Daily_Return'].rolling(window=20).std()
        
        print("Created daily return features:")
        print("  - Daily_Return: Arithmetic return (P_t - P_{t-1})/P_{t-1}")
        print("  - Log_Return: Continuously compounded return ln(P_t/P_{t-1})")
        print("  - Return_Direction: Sign of return (-1, 0, +1)")
        print("  - Volatility_20: 20-day rolling standard deviation")
        
        return df
    
    def create_intraday_returns(self) -> pd.DataFrame:
        """
        Returns relative to different reference points during the day.
        Captures intraday momentum and mean-reversion.
        """
        df = self.df
        
        # Return from Open to Close (intraday trend)
        df['Intraday_Return'] = (df['Close'] - df['Open']) / df['Open']
        
        # Position of Close between Low and High (recovery strength, 0-1)
        # High values indicate strong buying from daily lows
        df['Recovery_Strength'] = (df['Close'] - df['Low']) / (df['High'] - df['Low'] + 0.0001)
        
        # Share of the daily range given back from High to Close (0-1)
        # High values indicate selling pressure from highs
        df['Give_Back'] = (df['High'] - df['Close']) / (df['High'] - df['Low'] + 0.0001)
        
        # VWAP return (performance vs average price)
        df['VWAP_Return'] = (df['Close'] - df['VWAP']) / df['VWAP']
        
        # Gap return (overnight performance)
        df['Gap_Return'] = (df['Open'] - df['Prev. Close']) / df['Prev. Close']
        
        # Post-gap drift: intraday return minus gap return
        # (positive when the session move exceeded the overnight gap)
        df['Post_Gap_Drift'] = df['Intraday_Return'] - df['Gap_Return']
        
        print("\nCreated intraday return features:")
        print("  - Intraday_Return: Open to Close performance")
        print("  - Recovery_Strength: Bounce from daily low (0-1)")
        print("  - VWAP_Return: Out/under-performance vs VWAP")
        print("  - Gap_Return: Overnight gap percentage")
        print("  - Post_Gap_Drift: Intraday move after gap")
        
        return df
    
    def create_relative_returns(self) -> pd.DataFrame:
        """
        Returns relative to long-term benchmarks.
        Contextualizes daily performance within broader trends.
        """
        df = self.df
        
        # Return from 52-week high (drawdown)
        df['Drawdown_52W'] = (df['Close'] - df['52 Weeks High']) / df['52 Weeks High']
        
        # Return from 52-week low (upside)
        df['Upside_52W'] = (df['Close'] - df['52 Weeks Low']) / df['52 Weeks Low']
        
        # Position in 52-week range (0 = at low, 1 = at high)
        df['Position_52W'] = (df['Close'] - df['52 Weeks Low']) / \
                            (df['52 Weeks High'] - df['52 Weeks Low'] + 0.0001)
        
        # Distance to 52-week high as percentage
        df['Distance_To_High_Pct'] = ((df['52 Weeks High'] - df['Close']) / df['52 Weeks High']) * 100
        
        print("\nCreated relative return features:")
        print("  - Drawdown_52W: Distance from 52-week high (negative)")
        print("  - Position_52W: Relative position in 52W range (0-1)")
        print("  - Distance_To_High_Pct: Percentage below 52-week high")
        
        return df
    
    def create_volume_normalized_returns(self) -> pd.DataFrame:
        """
        Returns weighted by volume activity.
        Distinguishes between high-conviction and low-conviction moves.
        """
        df = self.df
        
        # Volume-weighted return (return * relative volume)
        avg_volume = df['Vol'].rolling(window=20).mean()
        rel_volume = df['Vol'] / avg_volume
        
        df['Volume_Weighted_Return'] = df['Daily_Return'] * rel_volume
        
        # Turnover (rupee volume) weighted features
        df['Turnover_Normalized_Return'] = df['Daily_Return'] * (df['Turnover'] / df['Turnover'].rolling(20).mean())
        
        # Return efficiency (return per unit of relative volume)
        # Large return on low volume = "efficient" move, but possibly just a thin market
        # Large return on high volume = heavily contested move
        df['Return_Efficiency'] = df['Daily_Return'] / (rel_volume + 0.001)
        
        print("\nCreated volume-normalized return features:")
        print("  - Volume_Weighted_Return: Return scaled by relative volume")
        print("  - Return_Efficiency: Return per unit of volume")
        
        return df

# Demonstration
if __name__ == "__main__":
    # Continue from previous example
    engineer = NEPSEBasicFeatureEngineer('nepse_sample.csv')
    engineer.load_and_validate_raw_features()
    engineer.create_basic_price_relationships()
    
    # Add percentage features
    pct_engineer = NEPSEPercentageFeatures(engineer.df)
    
    pct_engineer.create_daily_returns()
    pct_engineer.create_intraday_returns()
    pct_engineer.create_relative_returns()
    pct_engineer.create_volume_normalized_returns()
    
    # Display comparison of return types
    print("\nComparison of Return Calculations:")
    return_cols = ['Close', 'Daily_Return', 'Log_Return', 'Intraday_Return', 'Gap_Return']
    print(pct_engineer.df[return_cols].head(10))
    
    # Statistical properties
    print("\nStatistical Properties of Returns:")
    stats = pct_engineer.df[['Daily_Return', 'Log_Return']].describe()
    print(stats)
```

**Explanation:**

This section implements **percentage change features** (returns) which are fundamental to financial time-series analysis. Unlike raw price differences, returns normalize movements to a percentage of the base price, enabling meaningful comparison across different stocks and time periods in the NEPSE universe.

**Daily Return Calculations:**
The `create_daily_returns()` method implements the two primary return calculations used in quantitative finance:

- **Arithmetic Return (Simple Return)**: Calculated as `(P_t - P_{t-1}) / P_{t-1}` using pandas' `pct_change()` method. This represents the actual percentage gain or loss for holding the stock for one day. For example, a move from NPR 100 to NPR 102 yields a 2% arithmetic return. This is the standard measure for reporting investment performance and calculating portfolio values.

- **Log Return (Continuously Compounded Return)**: Calculated as `ln(P_t / P_{t-1})` where `ln` is the natural logarithm. Log returns have mathematical properties that make them preferable for statistical modeling: they are time-additive (the sum of daily log returns equals the total log return over the period), and they are symmetric, in that a log return of +0.02 followed by a log return of -0.02 brings the price exactly back to its starting point. Arithmetic returns do not behave this way: a 2% gain followed by a 2% loss takes NPR 100 to 102 and then to 99.96.
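
The two properties above can be checked numerically. The following is a minimal sketch using made-up closing prices (the values are illustrative assumptions, not NEPSE data):

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices in NPR (illustrative values only)
close = pd.Series([100.0, 102.0, 99.96, 105.0])

simple_ret = close.pct_change()            # (P_t - P_{t-1}) / P_{t-1}
log_ret = np.log(close / close.shift(1))   # ln(P_t / P_{t-1})

# Time-additivity: the sum of log returns equals the log of the total price ratio
total_log = log_ret.sum()                  # NaN in the first row is skipped

# Simple returns do not add; they compound multiplicatively
total_simple = (1 + simple_ret.dropna()).prod() - 1

# Symmetry: +0.02 then -0.02 in log-return terms restores the starting price
price_after = 100.0 * np.exp(0.02) * np.exp(-0.02)

print(total_log, np.log(close.iloc[-1] / close.iloc[0]))
print(total_simple, close.iloc[-1] / close.iloc[0] - 1)
print(price_after)
```

For small daily moves the two return types are nearly identical, so the choice matters most when returns are summed over long horizons or when moves are large.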

The method also creates **directional indicators** (`Return_Direction` as -1, 0, +1, and `Is_Positive_Return` as binary) which are useful for classification models predicting whether the next day will be up or down, rather than the exact magnitude.

**Intraday Return Decomposition:**
The `create_intraday_returns()` method breaks down the daily return into components that reveal intraday dynamics:

- **Intraday_Return**: The return from market open to close, `(Close - Open) / Open`. This measures the directional trend during the trading session, separate from overnight gaps. For NEPSE, intraday returns are often driven by local sentiment and order flow, while gaps reflect overnight news.

- **Recovery_Strength**: Normalized as `(Close - Low) / (High - Low)`, this measures where the closing price sits within the daily range. Values near 1.0 indicate the stock recovered from its lows to close near highs (bullish), while values near 0.0 indicate it sold off from highs to close near lows (bearish). This is a key mean-reversion indicator in NEPSE's often range-bound market.

- **VWAP_Return**: The return relative to the volume-weighted average price, `(Close - VWAP) / VWAP`. Closing above VWAP indicates that the "smart money" (institutional volume) was buying late in the day, while closing below suggests distribution. In NEPSE, where retail investors dominate, VWAP deviations can indicate institutional positioning.

- **Gap_Return vs. Post_Gap_Drift**: The `Gap_Return` captures overnight movement (open vs. previous close), while `Post_Gap_Drift` measures the intraday continuation or reversal of that gap. In NEPSE, "gap and go" patterns (strong gap followed by continued move) often occur during earnings seasons, while "gap and fade" (gap filled during day) is common during low-volume periods.
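
Assuming `Gap_Return` is defined as open versus previous close (as described above), the close-to-close return decomposes exactly into a gap component and an intraday component when the two are compounded. A small sketch with hypothetical prices:

```python
import numpy as np
import pandas as pd

# Hypothetical Open/Close prices (illustrative values only)
df = pd.DataFrame({
    "Open":  [100.0, 103.0, 101.0],
    "Close": [102.0, 101.5, 104.0],
})

prev_close = df["Close"].shift(1)
gap = (df["Open"] - prev_close) / prev_close        # overnight move
intraday = (df["Close"] - df["Open"]) / df["Open"]  # open-to-close move
daily = df["Close"].pct_change()                    # close-to-close move

# (Open / PrevClose) * (Close / Open) = Close / PrevClose
compounded = (1 + gap) * (1 + intraday) - 1

print(pd.DataFrame({"gap": gap, "intraday": intraday,
                    "daily": daily, "compounded": compounded}))
```

The identity holds for compounding, not for simple addition, so `gap + intraday` only approximates the daily return when both components are small.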

**Relative Performance Features:**
The `create_relative_returns()` method contextualizes current performance within the 52-week trading range:

- **Drawdown_52W**: The percentage decline from the 52-week high (negative values). This measures how far the stock has fallen from its peak—a drawdown of -20% indicates a bear market for that specific stock. In NEPSE, where many stocks experience deep drawdowns followed by strong recoveries, this feature helps identify oversold conditions.

- **Position_52W**: A normalized position within the 52-week range (0 to 1). Values above 0.8 indicate the stock is near 52-week highs (overbought risk), while values below 0.2 indicate proximity to lows (oversold potential). This is a mean-reversion signal particularly effective in NEPSE's cyclical market structure.
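
One practical point about these features: `Distance_To_High_Pct` is, by construction, `Drawdown_52W` multiplied by -100, so the two carry the same information. The sketch below illustrates this with hypothetical values (the column names follow the chapter; the numbers are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the NEPSE CSV used in this chapter
df = pd.DataFrame({
    "Close":         [450.0, 300.0, 520.0],
    "52 Weeks High": [500.0, 500.0, 520.0],
    "52 Weeks Low":  [250.0, 250.0, 260.0],
})
high, low = df["52 Weeks High"], df["52 Weeks Low"]

drawdown = (df["Close"] - high) / high
distance_pct = (high - df["Close"]) / high * 100
position = (df["Close"] - low) / (high - low)

# The two features are perfectly (negatively) correlated
print(drawdown.corr(distance_pct))
print(position.tolist())
```

For linear models it is usually enough to keep one of the two; tree-based models tolerate the redundancy but gain nothing from it.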

**Volume-Normalized Returns:**
The `create_volume_normalized_returns()` method addresses a key aspect of NEPSE market microstructure—low liquidity. A 2% return on massive volume has different implications than a 2% return on minimal volume:

- **Volume_Weighted_Return**: Multiplies the daily return by relative volume (today's volume divided by 20-day average). This amplifies returns that occur on high volume (strong conviction) and diminishes those on low volume (weak participation). For NEPSE stocks with irregular liquidity, this distinguishes genuine moves from noise.

- **Return_Efficiency**: Calculates return per unit of volume, `Daily_Return / Relative_Volume`. High efficiency (large move on low volume) suggests thin markets and potential manipulation or news-driven gaps. Low efficiency (small move on high volume) suggests strong resistance/support levels where supply absorbed demand.
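
To make the difference between the two features concrete, the following sketch applies both formulas to the same 2% return under three hypothetical relative-volume levels (the levels are assumptions chosen for illustration):

```python
import pandas as pd

daily_return = 0.02  # the same 2% move in every scenario

rows = []
for rel_volume in (0.5, 1.0, 2.0):  # hypothetical relative-volume levels
    rows.append({
        "rel_volume": rel_volume,
        # Scales the return up when volume is above average
        "volume_weighted_return": daily_return * rel_volume,
        # Scales the return down when volume is above average
        "return_efficiency": daily_return / (rel_volume + 0.001),
    })

summary = pd.DataFrame(rows)
print(summary)
```

The two features move in opposite directions as volume rises, which is why they are best read together rather than in isolation.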

These percentage features transform absolute price levels into relative performance metrics that are much closer to stationary (statistical properties that stay stable over time) than raw prices, and that are comparable across the diverse universe of NEPSE stocks, from high-priced commercial banks to low-priced microfinance companies.

---

## **11.4 Lag Features**

Lag features (also called autoregressive features) are the values of a variable at previous time steps. They are the most fundamental time-series features because they directly model the temporal dependence inherent in sequential data—the idea that the past influences the future. In the context of NEPSE stock prediction, lag features capture momentum, mean-reversion, and cyclical patterns that repeat over time.

Proper construction of lag features is critical because it is the primary defense against look-ahead bias. By explicitly shifting data with the `shift()` function (a positive argument moves each past value onto a later row), we ensure that when predicting time $t$, we only use information from times $t-1, t-2, ..., t-n$, never from $t$ or $t+1$.
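
Before the full class, a minimal sketch of the difference between a safe lag and a leaking "lead" may help. The dates and prices below are hypothetical, chosen to fall on Sunday through Wednesday:

```python
import pandas as pd

# Hypothetical closes on four consecutive trading days (Sun-Wed)
close = pd.Series(
    [500.0, 505.0, 498.0, 510.0],
    index=pd.to_datetime(["2024-01-07", "2024-01-08", "2024-01-09", "2024-01-10"]),
)

features = pd.DataFrame({
    "Close": close,
    "Close_Lag_1": close.shift(1),    # yesterday's close: safe to use
    "Close_Lead_1": close.shift(-1),  # tomorrow's close: look-ahead leak
})
print(features)
```

A positive shift leaves `NaN` in the first rows (no history yet), while a negative shift leaves `NaN` in the last rows. The position of the missing values is a quick visual check of the shift direction.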

```python
class NEPSELagFeatures:
    """
    Lag (autoregressive) feature creation for NEPSE time-series.
    Critical for capturing temporal dependencies while preventing look-ahead bias.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.lag_features_created = []
        
    def create_price_lags(self, lags: List[int] = [1, 2, 3, 5, 10, 20]) -> pd.DataFrame:
        """
        Create lagged price features for autoregressive modeling.
        
        Args:
            lags: List of periods to lag. 
                  For NEPSE: 1=yesterday, 5=last week, 20=last month (~20 trading days)
        """
        df = self.df
        
        for lag in lags:
            # Lag of Close price (most important for prediction)
            col_name = f'Close_Lag_{lag}'
            df[col_name] = df['Close'].shift(lag)
            self.lag_features_created.append(col_name)
            
            # Lag of Open (captures opening sentiment)
            df[f'Open_Lag_{lag}'] = df['Open'].shift(lag)
            
            # Lag of High/Low (captures previous support/resistance)
            df[f'High_Lag_{lag}'] = df['High'].shift(lag)
            df[f'Low_Lag_{lag}'] = df['Low'].shift(lag)
            
            # Lag of VWAP (institutional price memory)
            if 'VWAP' in df.columns:
                df[f'VWAP_Lag_{lag}'] = df['VWAP'].shift(lag)
        
        print(f"Created price lag features for periods: {lags}")
        # Close, Open, High, Low, plus VWAP when the column is present
        n_price_cols = 5 if 'VWAP' in df.columns else 4
        print(f"Total price lag features: {len(lags) * n_price_cols}")
        
        return df
    
    def create_return_lags(self, lags: List[int] = [1, 2, 3, 5]) -> pd.DataFrame:
        """
        Create lagged return features.
        Returns are more stationary than prices, often better for modeling.
        """
        df = self.df
        
        # Ensure we have daily returns calculated
        if 'Daily_Return' not in df.columns:
            df['Daily_Return'] = df['Close'].pct_change()
        
        for lag in lags:
            # Lagged returns (past performance)
            df[f'Return_Lag_{lag}'] = df['Daily_Return'].shift(lag)
            
            # Sign of lagged return (direction only)
            df[f'Return_Direction_Lag_{lag}'] = np.sign(df[f'Return_Lag_{lag}'])
            
            # Absolute magnitude of lagged return (volatility memory)
            df[f'Return_Abs_Lag_{lag}'] = df[f'Return_Lag_{lag}'].abs()
        
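        # Momentum/reversal proxy: product of the two most recent past returns.
        # Positive = both moved in the same direction, negative = they reversed.
        # (A single-pair product, not a true autocorrelation coefficient.)
        df['Return_AutoCorr_1'] = df['Daily_Return'].shift(1) * df['Daily_Return'].shift(2)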
        
        print(f"\nCreated return lag features:")
        print(f"  - Return lags for periods: {lags}")
        print(f"  - Direction lags (sign)")
        print(f"  - Absolute return lags (magnitude)")
        
        return df
    
    def create_volume_lags(self, lags: List[int] = [1, 3, 5]) -> pd.DataFrame:
        """
        Create lagged volume features.
        Volume often leads price (climax volume signals turns).
        """
        df = self.df
        
        # These series do not depend on the lag, so compute them once
        avg_vol = df['Vol'].rolling(window=20).mean()
        rel_vol = df['Vol'] / avg_vol
        vol_change = df['Vol'].pct_change()
        
        for lag in lags:
            # Raw volume lag
            df[f'Volume_Lag_{lag}'] = df['Vol'].shift(lag)
            
            # Relative volume (vs recent 20-day average), lagged
            df[f'Rel_Volume_Lag_{lag}'] = rel_vol.shift(lag)
            
            # Volume change (day-over-day percentage change), lagged
            df[f'Volume_Change_Lag_{lag}'] = vol_change.shift(lag)
        
        print(f"\nCreated volume lag features for periods: {lags}")
        return df
    
    def create_lag_interactions(self) -> pd.DataFrame:
        """
        Create interaction features between different lagged variables.
        Captures complex temporal relationships.
        """
        df = self.df
        
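        # Price momentum (difference between recent and older lags).
        # Computed from shift() directly so this method does not fail with a
        # KeyError when create_price_lags() was called without lag 5 or 20.
        close_lag_1 = df['Close'].shift(1)
        close_lag_5 = df['Close'].shift(5)
        close_lag_20 = df['Close'].shift(20)
        df['Momentum_1_5'] = close_lag_1 - close_lag_5  # Short-term momentum
        df['Momentum_5_20'] = close_lag_5 - close_lag_20  # Medium-term momentum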
        
        # Price acceleration (change in momentum)
        df['Acceleration'] = df['Momentum_1_5'] - df['Momentum_1_5'].shift(1)
        
        # Volume-Price interaction lags
        if 'Volume_Lag_1' in df.columns and 'Return_Lag_1' in df.columns:
            df['Volume_Price_Interaction'] = df['Volume_Lag_1'] * df['Return_Lag_1']
        
        # Range expansion/contraction (volatility regime)
        df['Range_Lag_1'] = (df['High_Lag_1'] - df['Low_Lag_1']) / df['Close_Lag_1']
        df['Range_Lag_5'] = (df['High_Lag_5'] - df['Low_Lag_5']) / df['Close_Lag_5']
        df['Range_Contraction'] = df['Range_Lag_1'] < df['Range_Lag_5']  # Volatility squeeze
        
        print("\nCreated lag interaction features:")
        print("  - Momentum features (1-5 day, 5-20 day)")
        print("  - Acceleration (change in momentum)")
        print("  - Range contraction/expansion")
        
        return df
    
    def validate_no_lookahead(self):
        """
        Critical validation: Ensure no feature uses future information.
        """
        print("\n" + "="*60)
        print("VALIDATING NO LOOK-AHEAD BIAS")
        print("="*60)
        
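        # Check that lag features are properly shifted
        if 'Close_Lag_1' in self.df.columns:
            # Close_Lag_1 at row i should equal Close at row i-1.
            # Compare positionally via NumPy arrays: comparing the two sliced
            # Series directly raises ValueError because their index labels differ.
            lagged = self.df['Close_Lag_1'].to_numpy(dtype=float)[1:]
            previous = self.df['Close'].to_numpy(dtype=float)[:-1]
            is_valid = np.array_equal(lagged, previous, equal_nan=True)
            
            if is_valid:
                print("✓ Close_Lag_1 correctly shifted (no look-ahead)")
            else:
                print("✗ ERROR: Close_Lag_1 may contain look-ahead bias!")
            
            # Sanity check: a correctly built lag of a price series is strongly
            # correlated with the current price. This does not prove the absence
            # of look-ahead; it only flags grossly broken shifts.
            corr = self.df['Close'].corr(self.df['Close_Lag_1'])
            print(f"  Correlation Close vs Close_Lag_1: {corr:.4f} (should be high, ~0.95)")
            
            # NaNs from shift(1) belong at the start of the series; a NaN in the
            # last row would suggest a negative (forward-looking) shift.
            nan_count = self.df['Close_Lag_1'].isna().sum()
            print(f"  NaN values in lag features (expected at start): {nan_count}")
            if len(self.df) > 0 and pd.isna(self.df['Close_Lag_1'].iloc[-1]):
                print("✗ WARNING: trailing NaN in Close_Lag_1 (possible negative shift)")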
        
        return self.df

# Demonstration
if __name__ == "__main__":
    # Setup data with returns
    engineer = NEPSEBasicFeatureEngineer('nepse_sample.csv')
    engineer.load_and_validate_raw_features()
    
    pct_engineer = NEPSEPercentageFeatures(engineer.df)
    df_with_returns = pct_engineer.create_daily_returns()
    
    # Create lag features
    lag_engineer = NEPSELagFeatures(df_with_returns)
    
    lag_engineer.create_price_lags(lags=[1, 3, 5])
    lag_engineer.create_return_lags(lags=[1, 2, 5])
    lag_engineer.create_volume_lags(lags=[1, 3])
    lag_engineer.create_lag_interactions()
    lag_engineer.validate_no_lookahead()
    
    # Display lag structure
    print("\nLag Feature Structure (showing temporal alignment):")
    display_cols = ['Close', 'Close_Lag_1', 'Close_Lag_3', 'Return_Lag_1', 'Volume_Lag_1']
    print(lag_engineer.df[display_cols].head(10))
    
    # Show correlation between current price and lags
    print("\nAutocorrelation of Close Price:")
    autocorr = lag_engineer.df[['Close', 'Close_Lag_1', 'Close_Lag_3', 'Close_Lag_5']].corr()
    print(autocorr)
```

**Explanation:**

This section implements **lag features**—the temporal backbone of any time-series prediction system. The code demonstrates proper construction techniques that prevent the most common and damaging error in financial machine learning: look-ahead bias.

**Price Lag Construction:**
The `create_price_lags()` method generates shifted versions of price data. When we execute `df['Close'].shift(1)`, pandas moves every value one row later in the index, so that at row $t$ we see the value from row $t-1$ and the first row becomes `NaN`. This is equivalent to $Close_{t-1}$ in mathematical notation.

The method creates lags for multiple periods:
- **Lag 1**: Yesterday's price (the strongest predictor of today's price due to market efficiency and momentum)
- **Lag 3**: Price from 3 trading days ago (captures short-term patterns and within-week effects; since NEPSE trades Sun-Thu, lag 3 on a Wednesday refers to Sunday's close)
- **Lag 5**: Approximately one trading week in NEPSE (Sun-Thu schedule)
- **Lag 20**: Approximately one trading month (~20 trading days in NEPSE)

Creating lags for Open, High, Low, and VWAP provides additional context:
- **Open_Lag_1**: Yesterday's opening price (indicates how the market started vs. how it ended)
- **High_Lag_1/Low_Lag_1**: Yesterday's support and resistance levels (key technical analysis concepts where old highs become new resistance)

**Return Lag Construction:**
The `create_return_lags()` method shifts percentage returns rather than absolute prices. Returns are often preferred for modeling because they are closer to stationary (constant mean and variance over time). The method creates three variants:
- **Return_Lag_n**: The actual percentage return (e.g., +2.3%)
- **Return_Direction_Lag_n**: Just the sign (+1, 0, -1), useful for classification tasks predicting up/down
- **Return_Abs_Lag_n**: The absolute magnitude (2.3%), capturing volatility memory regardless of direction

The **Return_AutoCorr_1** feature multiplies yesterday's return by the day-before's return. Positive values indicate momentum (up followed by up, or down followed by down), while negative values indicate mean-reversion (up followed by down).
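
Because the product of two returns is only a single-pair proxy, it can be useful to compare it with a rolling lag-1 autocorrelation coefficient, which is bounded between -1 and 1. The sketch below uses simulated returns and a 20-day window; both are assumptions for illustration, and the rolling feature is not part of the class above:

```python
import numpy as np
import pandas as pd

# Simulated daily returns (random, for illustration only)
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.01, 60))

# Single-pair proxy, as in Return_AutoCorr_1
product_proxy = returns.shift(1) * returns.shift(2)

# Rolling lag-1 autocorrelation of past returns over 20 observations.
# Both inputs are shifted, so row t only uses returns up to t-1.
rolling_autocorr = returns.shift(1).rolling(20).corr(returns.shift(2))

print(product_proxy.tail(3))
print(rolling_autocorr.tail(3))
```

The rolling version needs a warm-up period before it produces values, but it is scale-free, which makes it easier to compare across stocks with different volatility.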

**Volume Lag Construction:**
Volume lags are particularly important in NEPSE because volume often leads price—climax volume (extremely high relative volume) frequently marks turning points. The `create_volume_lags()` method creates:
- **Volume_Lag_n**: Raw volume shifted back
- **Rel_Volume_Lag_n**: Volume relative to 20-day average (normalizes for different stocks' typical liquidity)
- **Volume_Change_Lag_n**: Percentage change in volume (acceleration/deceleration of trading interest)

**Lag Interactions:**
The `create_lag_interactions()` method combines different lagged variables to capture dynamic relationships:
- **Momentum_1_5**: The difference between yesterday's close and the close 5 days ago. Positive values indicate short-term upward momentum; negative indicates downward. This is the 4-day price change in NPR (not a percentage return) ending yesterday, calculated using lag features.
- **Momentum_5_20**: Medium-term trend (close 5 days ago vs. 20 days ago), capturing the "monthly" trend in NEPSE.
- **Acceleration**: The change in momentum—whether the trend is speeding up or slowing down. This is the second derivative of price and often signals trend exhaustion.
- **Range_Contraction**: A boolean indicator that triggers when yesterday's range (High-Low) is smaller than the range 5 days ago. This "volatility squeeze" often precedes explosive breakout moves in NEPSE stocks.
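
Since the momentum features are expressed in NPR, their magnitude depends on the share price. A percentage variant is one way to make them comparable across stocks. The sketch below is an illustrative alternative, not part of the class above, and uses hypothetical prices:

```python
import pandas as pd

# Hypothetical closes rising by NPR 4 per day
close = pd.Series([400.0, 404.0, 408.0, 412.0, 416.0, 420.0, 424.0])

lag_1 = close.shift(1)
lag_5 = close.shift(5)

momentum_abs = lag_1 - lag_5       # in NPR, as in Momentum_1_5
momentum_pct = lag_1 / lag_5 - 1   # same move as a fraction of the older price

print(pd.DataFrame({"Close": close,
                    "Momentum_abs": momentum_abs,
                    "Momentum_pct": momentum_pct}))
```

With a steady NPR 4 daily increase, the absolute momentum stays constant while the percentage momentum slowly declines as the base price rises.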

**Critical Validation:**
The `validate_no_lookahead()` method performs essential safety checks. It verifies that `Close_Lag_1` at row $i$ truly equals `Close` at row $i-1$, confirming the shift operation worked correctly. It also reports the correlation between the current close and the lagged close, which should be very high for daily stock prices (around 0.95 or above) because price levels are strongly autocorrelated. A correlation near zero or negative would indicate a bug in the shifting logic, but a high correlation alone does not prove the absence of look-ahead: a forward shift (`shift(-1)`) would produce an equally high value, which is why the positional equality check is the primary test.
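
A complementary check, sketched here as an assumption rather than as part of the class above, is a truncation test: features for the first $k$ rows should be identical whether they are computed on the full series or on the series cut off at row $k$. The helper names `build_features`, `build_leaky_features`, and `leaks_future` are hypothetical:

```python
import numpy as np
import pandas as pd

def build_features(close: pd.Series) -> pd.DataFrame:
    """Features that only use information up to t-1."""
    out = pd.DataFrame(index=close.index)
    out["Close_Lag_1"] = close.shift(1)
    out["Return_Lag_1"] = close.pct_change().shift(1)
    out["SMA_3_Lagged"] = close.rolling(3).mean().shift(1)
    return out

def build_leaky_features(close: pd.Series) -> pd.DataFrame:
    """Deliberately wrong: uses tomorrow's close."""
    out = pd.DataFrame(index=close.index)
    out["Close_Lead_1"] = close.shift(-1)
    return out

def leaks_future(build, close: pd.Series, cutoff: int) -> bool:
    """True if removing rows after `cutoff` changes earlier feature values."""
    full = build(close).iloc[:cutoff].to_numpy(dtype=float)
    truncated = build(close.iloc[:cutoff]).to_numpy(dtype=float)
    return not np.allclose(full, truncated, equal_nan=True)

# Hypothetical closing prices
close = pd.Series([100.0, 101.0, 99.0, 102.0, 104.0,
                   103.0, 105.0, 107.0, 106.0, 108.0])

print("safe features leak: ", leaks_future(build_features, close, cutoff=6))
print("leaky features leak:", leaks_future(build_leaky_features, close, cutoff=6))
```

The test works for any feature function, including rolling and expanding statistics, because it makes no assumption about how the features are built.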

For the NEPSE prediction system, these lag features capture the market's memory—how yesterday's price action influences today's opening sentiment, how last week's trend affects this week's direction, and how volume spikes three days ago might predict today's volatility. They form the autoregressive component of the model, allowing it to learn from historical patterns while strictly respecting the temporal arrow that prevents future information from contaminating predictions.

---

## **11.5 Rolling Window Features**

Rolling window features (also called moving window statistics) calculate aggregate metrics over a fixed-size window of recent observations that "rolls" forward through time. Unlike expanding windows (which grow indefinitely), rolling windows maintain a constant lookback period, making them adaptive to recent market regimes while ignoring distant history that may no longer be relevant.

For the NEPSE prediction system, rolling windows are essential because the Nepalese stock market exhibits regime changes—periods of high volatility during political instability or policy announcements, followed by quiet consolidation phases. Rolling statistics adapt to these changes by focusing on recent behavior, providing dynamic benchmarks for trend and volatility measurement.

```python
class NEPSERollingFeatures:
    """
    Rolling window (moving) statistics for NEPSE time-series.
    Captures local trends, volatility, and adaptive benchmarks.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        
    def create_moving_averages(self, windows: List[int] = [5, 10, 20, 50]) -> pd.DataFrame:
        """
        Create Simple and Exponential Moving Averages (SMA, EMA).
        Core trend indicators for technical analysis.
        """
        df = self.df
        
        for window in windows:
            # Simple Moving Average (SMA) - equal weight to all days
            df[f'SMA_{window}'] = df['Close'].rolling(window=window).mean()
            
            # Exponential Moving Average (EMA) - more weight to recent days
            # span=window sets the smoothing factor alpha = 2 / (window + 1)
            df[f'EMA_{window}'] = df['Close'].ewm(span=window, adjust=False).mean()
            
            # Moving average distance (price relative to trend)
            # Positive = above average (bullish), Negative = below (bearish)
            df[f'Dist_SMA_{window}'] = df['Close'] - df[f'SMA_{window}']
            df[f'Dist_SMA_{window}_Pct'] = (df[f'Dist_SMA_{window}'] / df[f'SMA_{window}']) * 100
            
            # Price position within the rolling High/Low channel (0-1 scale)
            # Uses the highest High and lowest Low of the window, not the MA itself
            ma_high = df['High'].rolling(window=window).max()
            ma_low = df['Low'].rolling(window=window).min()
            df[f'Position_MA_{window}'] = (df['Close'] - ma_low) / (ma_high - ma_low + 0.0001)
        
        print(f"Created moving averages for windows: {windows}")
        print(f"  - SMA (Simple Moving Average)")
        print(f"  - EMA (Exponential Moving Average)")
        print(f"  - Distance from MA (absolute and percentage)")
        print(f"  - Position within MA range")
        
        return df
    
    def create_volatility_features(self, windows: List[int] = [5, 10, 20]) -> pd.DataFrame:
        """
        Rolling volatility and dispersion measures.
        Critical for risk management and regime detection in NEPSE.
        """
        df = self.df
        
        for window in windows:
            # Standard deviation (volatility)
            df[f'Volatility_{window}'] = df['Close'].rolling(window=window).std()
            
            # Variance (for some statistical models)
            df[f'Variance_{window}'] = df['Close'].rolling(window=window).var()
            
            # Average True Range (ATR) - robust volatility measure
            # Uses True Range which accounts for gaps
            high_low = df['High'] - df['Low']
            high_close = (df['High'] - df['Close'].shift()).abs()
            low_close = (df['Low'] - df['Close'].shift()).abs()
            true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
            df[f'ATR_{window}'] = true_range.rolling(window=window).mean()
            
            # Normalized ATR (ATR as percentage of price)
            df[f'ATR_Pct_{window}'] = (df[f'ATR_{window}'] / df['Close']) * 100
            
            # Bollinger Bands (volatility envelopes)
            sma = df['Close'].rolling(window=window).mean()
            std = df['Close'].rolling(window=window).std()
            df[f'BB_Upper_{window}'] = sma + (std * 2)
            df[f'BB_Lower_{window}'] = sma - (std * 2)
            df[f'BB_Width_{window}'] = df[f'BB_Upper_{window}'] - df[f'BB_Lower_{window}']
            df[f'BB_Position_{window}'] = (df['Close'] - df[f'BB_Lower_{window}']) / \
                                          (df[f'BB_Upper_{window}'] - df[f'BB_Lower_{window}'] + 0.0001)
        
        print(f"\nCreated volatility features:")
        print(f"  - Standard deviation (price volatility)")
        print(f"  - Average True Range (ATR)")
        print(f"  - Bollinger Bands (volatility envelopes)")
        print(f"  - BB Position (relative location within bands)")
        
        return df
    
    def create_volume_rolling_features(self, windows: List[int] = [5, 10, 20]) -> pd.DataFrame:
        """
        Rolling statistics for volume analysis.
        Identifies accumulation/distribution patterns.
        """
        df = self.df
        
        for window in windows:
            # Average volume
            df[f'Volume_SMA_{window}'] = df['Vol'].rolling(window=window).mean()
            
            # Volume standard deviation (identifies unusual activity)
            df[f'Volume_Std_{window}'] = df['Vol'].rolling(window=window).std()
            
            # Relative Volume (today vs recent average)
            df[f'Rel_Volume_{window}'] = df['Vol'] / df[f'Volume_SMA_{window}']
            
            # Volume trend (increasing or decreasing liquidity)
            df[f'Volume_Trend_{window}'] = df[f'Volume_SMA_{window}'].diff(5)  # 5-day change
            
            # On-Balance Volume (OBV) - cumulative volume flow
            if 'Daily_Return' in df.columns:
                obv = (np.sign(df['Daily_Return']) * df['Vol']).cumsum()
                df[f'OBV_{window}'] = obv.rolling(window=window).mean()
        
        print(f"\nCreated volume rolling features:")
        print(f"  - Volume moving averages")
        print(f"  - Relative volume (vs average)")
        print(f"  - Volume trend (change in average)")
        print(f"  - On-Balance Volume (OBV)")
        
        return df
    
    def create_rolling_statistics(self, windows: List[int] = [10, 20]) -> pd.DataFrame:
        """
        Advanced rolling statistics (skewness, kurtosis, etc.).
        Captures distribution shape changes.
        """
        df = self.df
        
        for window in windows:
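            # Skewness (asymmetry of returns)
            # Positive = longer right tail (bigger up moves)
            # Negative = longer left tail (bigger down moves)
            # Computed on daily returns, not price levels, so that it
            # describes the return distribution.
            returns = df['Close'].pct_change()
            df[f'Skew_{window}'] = returns.rolling(window=window).skew()
            
            # Kurtosis (fat tails vs normal distribution)
            # pandas reports excess kurtosis: a normal distribution scores ~0
            # High kurtosis = extreme moves more likely (tail risk)
            df[f'Kurt_{window}'] = returns.rolling(window=window).kurt()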
            
            # Min/Max (support and resistance levels)
            df[f'Rolling_Min_{window}'] = df['Low'].rolling(window=window).min()
            df[f'Rolling_Max_{window}'] = df['High'].rolling(window=window).max()
            
            # Range (volatility proxy)
            df[f'Rolling_Range_{window}'] = df[f'Rolling_Max_{window}'] - df[f'Rolling_Min_{window}']
            
            # Percent rank (where does the current close rank among the window's closes?)
            # 1/window = lowest close in the window, 1 = highest close in the window
            df[f'Percent_Rank_{window}'] = df['Close'].rolling(window=window).apply(
                lambda x: pd.Series(x).rank(pct=True).iloc[-1], raw=True
            )
        
        print(f"\nCreated advanced rolling statistics:")
        print(f"  - Skewness and Kurtosis (distribution shape)")
        print(f"  - Rolling Min/Max (support/resistance)")
        print(f"  - Percent Rank (position in recent range)")
        
        return df

# Demonstration
if __name__ == "__main__":
    # Continue from lag features example (reuses `lag_engineer` defined there)
    
    # Add rolling features
    rolling_engineer = NEPSERollingFeatures(lag_engineer.df)
    
    rolling_engineer.create_moving_averages(windows=[5, 20])
    rolling_engineer.create_volatility_features(windows=[5, 20])
    rolling_engineer.create_volume_rolling_features(windows=[10])
    rolling_engineer.create_rolling_statistics(windows=[20])
    
    # Display rolling features
    print("\nRolling Window Features Sample:")
    display_cols = ['Close', 'SMA_20', 'EMA_20', 'Dist_SMA_20_Pct', 
                   'Volatility_20', 'ATR_20', 'BB_Position_20', 'Percent_Rank_20']
    print(rolling_engineer.df[display_cols].tail(10))
    
    # Show how rolling stats adapt to regime changes
    print("\nRolling Statistics During Different Regimes:")
    regime_cols = ['Close', 'Volatility_20', 'Skew_20', 'Rel_Volume_10']
    print(rolling_engineer.df[regime_cols].describe())
```

**Explanation:**

This section implements **rolling window features**—adaptive statistics that calculate metrics over the most recent $n$ observations, providing dynamic benchmarks that respond to changing market regimes in NEPSE.

**Moving Averages (Trend Indicators):**
The `create_moving_averages()` method implements both Simple Moving Averages (SMA) and Exponential Moving Averages (EMA). 

- **SMA**: Calculates the arithmetic mean over the window. For a 20-day window in NEPSE (approximately one trading month), `SMA_20` represents the average consensus price over the past month. Prices above the SMA indicate bullish sentiment; prices below indicate bearish.

- **EMA**: Uses exponentially decaying weights, giving more importance to recent prices. The formula `ewm(span=window, adjust=False)` applies a smoothing factor of $2/(window+1)$. For NEPSE, EMA responds faster to new information than SMA, making it better for volatile emerging markets where trends change quickly.

- **Distance from MA**: `Dist_SMA_20_Pct` measures how far the current price deviates from its 20-day average as a percentage. Values above +5% suggest overbought conditions (potential mean reversion), while values below -5% suggest oversold conditions. In NEPSE's cyclical market, these deviations often trigger algorithmic and institutional rebalancing.
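
The smoothing factor can be verified by reproducing the EMA by hand. The sketch below uses a short hypothetical price series and `span=3` (so $\alpha = 2/(3+1) = 0.5$) and compares the manual recursion with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices
close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0])

span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

ema_pandas = close.ewm(span=span, adjust=False).mean()

# With adjust=False the recursion is:
#   EMA_0 = P_0
#   EMA_t = alpha * P_t + (1 - alpha) * EMA_{t-1}
ema_manual = [close.iloc[0]]
for price in close.iloc[1:]:
    ema_manual.append(alpha * price + (1 - alpha) * ema_manual[-1])

print(ema_pandas.tolist())
print(ema_manual)
```

Unlike the SMA, the EMA computed this way has no `NaN` warm-up period: the first value is seeded with the first price, so early EMA values are dominated by that seed and should be treated with caution.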

**Volatility Features:**
The `create_volatility_features()` method measures price variability and risk:

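- **Standard Deviation**: The most common volatility measure, calculated as `df['Close'].rolling(window=20).std()`. Because this is computed on price levels, the result is in NPR and scales with the share price; to compare volatility across stocks, apply the same rolling standard deviation to `Daily_Return` instead. On that return basis, a 20-day volatility above 2% per day indicates high uncertainty (often during political events or budget announcements), while below 1% indicates consolidation.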

- **Average True Range (ATR)**: A more robust volatility measure that accounts for gaps between trading sessions. The True Range is the maximum of: (1) current High minus current Low, (2) absolute value of current High minus previous Close, (3) absolute value of current Low minus previous Close. This is crucial for NEPSE because overnight gaps are common due to the Friday-Saturday weekend and overnight news.

- **Bollinger Bands**: Volatility envelopes set at ±2 standard deviations from the moving average. The `BB_Position_20` feature (0-1 scale) indicates where the price sits within the bands: values near 0 are at the lower band (oversold), values near 1 are at the upper band (overbought), and values outside 0-1 mean the close is beyond a band. The `BB_Width` measures the distance between bands; narrowing bands often precede volatility expansion (the "squeeze" pattern).
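
The True Range definition above, and the difference between price-level and return-based volatility, can be traced on a few rows. The prices below are hypothetical, and the 3-day window is chosen only to keep the example short:

```python
import pandas as pd

# Hypothetical OHLC rows; day 2 gaps up above the previous close
df = pd.DataFrame({
    "High":  [105.0, 112.0, 111.0, 108.0, 110.0],
    "Low":   [100.0, 109.0, 104.0, 103.0, 106.0],
    "Close": [104.0, 110.0, 105.0, 107.0, 109.0],
})

prev_close = df["Close"].shift(1)
true_range = pd.concat([
    df["High"] - df["Low"],           # intraday range
    (df["High"] - prev_close).abs(),  # gap up relative to previous close
    (df["Low"] - prev_close).abs(),   # gap down relative to previous close
], axis=1).max(axis=1)                # NaN components are skipped on the first row

# Volatility in NPR (price levels) vs. in percent (returns)
price_vol = df["Close"].rolling(3).std()
return_vol_pct = df["Close"].pct_change().rolling(3).std() * 100

print(true_range.tolist())
print(price_vol.round(3).tolist())
print(return_vol_pct.round(3).tolist())
```

On the second row the intraday range is only 3, but the True Range is 8 because the session opened well above the previous close. This is the gap effect that a plain High-Low range would miss.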

**Volume Rolling Features:**
Volume analysis is particularly important in NEPSE's relatively illiquid market:

- **Relative Volume**: `Rel_Volume_10` compares today's volume to the 10-day average. Values above 2.0 indicate twice-normal activity, often signaling institutional participation or news-driven trading. In NEPSE, volume spikes above 3x average frequently mark trend reversals or breakouts.

- **On-Balance Volume (OBV)**: A cumulative indicator that adds volume on up-days and subtracts on down-days. The rolling average of OBV (`OBV_10`) smooths this to show whether volume is flowing into or out of the stock. Rising OBV with flat price suggests accumulation (smart money buying before rally).
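
A small worked example of the OBV accumulation, using hypothetical prices and volumes. Treating the first row and unchanged closes as zero flow is a choice made for this sketch; the class above leaves the first row as `NaN` instead:

```python
import numpy as np
import pandas as pd

# Hypothetical closes and volumes
df = pd.DataFrame({
    "Close": [100.0, 102.0, 101.0, 101.0, 104.0],
    "Vol":   [1000, 1500, 1200, 800, 2000],
})

# +1 on up-days, -1 on down-days, 0 when the close is unchanged
direction = np.sign(df["Close"].diff()).fillna(0)

# Add volume on up-days, subtract it on down-days, then accumulate
obv = (direction * df["Vol"]).cumsum()

print(obv.tolist())
```

The absolute level of OBV depends on where the series starts, so models generally benefit more from its changes or its slope than from the raw value.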

**Advanced Statistics:**
The `create_rolling_statistics()` method captures higher-order moments of the return distribution:

- **Skewness**: Measures asymmetry. Positive skew indicates frequent small losses and occasional large gains (bullish asymmetry); negative skew indicates frequent small gains with occasional crashes. NEPSE stocks often exhibit negative skew during bear markets as selling accelerates.

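- **Kurtosis**: Measures "fat tails": the likelihood of extreme moves compared to a normal distribution. A normal distribution has kurtosis 3 (excess kurtosis 0), so values above that indicate elevated tail risk, common in NEPSE during periods of political uncertainty. Note that pandas' `.kurt()` reports excess kurtosis, so the equivalent threshold there is 0 rather than 3.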

- **Percent Rank**: Indicates the current price's position within the recent range (0 = 20-day low, 1 = 20-day high). This is a pure mean-reversion indicator—values near 0 suggest bounce potential, values near 1 suggest pullback risk.
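
A small sketch of these statistics, under the assumption that Percent Rank is the close's position between the rolling low and high as described above (the function names are hypothetical):

```python
import numpy as np
import pandas as pd

def rolling_moments(returns: pd.Series, window: int = 20) -> pd.DataFrame:
    """Rolling skewness and excess kurtosis (pandas uses Fisher's definition)."""
    roll = returns.rolling(window)
    return pd.DataFrame({'Skew': roll.skew(), 'Excess_Kurt': roll.kurt()})

def percent_rank(close: pd.Series, window: int = 20) -> pd.Series:
    """0 = at the rolling low, 1 = at the rolling high."""
    lo = close.rolling(window).min()
    hi = close.rolling(window).max()
    # A flat window has hi == lo; return NaN instead of dividing by zero
    return (close - lo) / (hi - lo).replace(0, np.nan)

close = pd.Series([10.0, 12.0, 11.0, 14.0, 13.0])
pr = percent_rank(close, window=3)

# A symmetric sample has zero skew; a flat-topped one has negative excess kurtosis
symmetric = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
moments = rolling_moments(symmetric, window=5)
```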

These rolling features provide adaptive context that raw prices cannot. While the absolute price level of NPR 2000 for a NEPSE stock is meaningless without context, being "5% above the 20-day average with volatility expanding and volume 2x normal" provides a rich, actionable feature vector for machine learning models.

---

## **11.6 Expanding Window Features**

Expanding window features calculate cumulative statistics from the beginning of the series up to the current observation. Unlike rolling windows which use a fixed recent period, expanding windows incorporate all historical data up to time $t$, providing long-term context and historical benchmarks that remain stable over time.

For the NEPSE prediction system, expanding features are valuable because they capture the evolving "memory" of the market since the stock's listing or the start of the dataset. They provide absolute benchmarks (all-time average, all-time high) against which current prices can be compared, helping identify when stocks reach historically significant levels.

```python
class NEPSEExpandingFeatures:
    """
    Expanding window (cumulative) statistics for NEPSE.
    Capture long-term trends and historical context.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        
    def create_cumulative_statistics(self) -> pd.DataFrame:
        """
        Create expanding (cumulative) statistics.
        These grow with time, incorporating all history up to current point.
        """
        df = self.df
        
        # Cumulative mean (all-time average price)
        # Shows the long-term equilibrium level
        df['Cumulative_Mean'] = df['Close'].expanding().mean()
        
        # Cumulative standard deviation (long-term volatility)
        df['Cumulative_Std'] = df['Close'].expanding().std()
        
        # Distance from cumulative mean (z-score using all history)
        df['Dist_Cumulative_Mean'] = (df['Close'] - df['Cumulative_Mean']) / df['Cumulative_Std']
        
        # Cumulative min/max (all-time highs and lows)
        df['Cumulative_Max'] = df['High'].expanding().max()
        df['Cumulative_Min'] = df['Low'].expanding().min()
        
        # Drawdown from all-time high (peak-to-trough decline)
        df['Drawdown'] = (df['Close'] - df['Cumulative_Max']) / df['Cumulative_Max']
        
        # Distance from all-time low (recovery measure)
        df['Recovery'] = (df['Close'] - df['Cumulative_Min']) / df['Cumulative_Min']
        
        print("Created expanding window features:")
        print("  - Cumulative_Mean: All-time average price")
        print("  - Cumulative_Std: Long-term volatility")
        print("  - Drawdown: Distance from all-time high (negative)")
        print("  - Recovery: Distance from all-time low (positive)")
        
        return df
    
    def create_cumulative_returns(self) -> pd.DataFrame:
        """
        Cumulative and compound return features.
        Track total performance since inception.
        """
        df = self.df
        
        # Calculate daily returns if not present
        if 'Daily_Return' not in df.columns:
            df['Daily_Return'] = df['Close'].pct_change()
        
        # Cumulative return (total percentage gain since start)
        # Formula: (P_t / P_0) - 1
        df['Cumulative_Return'] = (df['Close'] / df['Close'].iloc[0]) - 1
        
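        # Compound Annual Growth Rate (CAGR) to current point
        # Assumes 252 trading days/year; adjust to NEPSE's actual session count if known
        # days + 1 avoids division by zero on the first row; values in the first
        # weeks are extreme because small moves get annualized over a tiny window
        days = np.arange(len(df))
        df['CAGR'] = ((df['Close'] / df['Close'].iloc[0]) ** (252 / (days + 1))) - 1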
        
        # Cumulative volume (total shares traded since start)
        df['Cumulative_Volume'] = df['Vol'].expanding().sum()
        
        # Average daily volume since start
        df['Avg_Volume_Since_Start'] = df['Vol'].expanding().mean()
        
        # Volume trend (is recent volume above or below historical average?)
        df['Volume_vs_Historical'] = df['Vol'] / df['Avg_Volume_Since_Start']
        
        print("\nCreated cumulative return features:")
        print("  - Cumulative_Return: Total return since start")
        print("  - CAGR: Compound annual growth rate")
        print("  - Cumulative_Volume: Total historical volume")
        print("  - Volume_vs_Historical: Recent volume relative to all-time average")
        
        return df
    
    def create_running_counts(self) -> pd.DataFrame:
        """
        Running counts and frequencies of specific events.
        """
        df = self.df
        
        if 'Daily_Return' in df.columns:
            # Count of positive days so far
            df['Positive_Days_Count'] = (df['Daily_Return'] > 0).expanding().sum()
            
            # Win rate (percentage of days that were positive)
            days = np.arange(1, len(df) + 1)
            df['Win_Rate'] = df['Positive_Days_Count'] / days
            
            # Count of extreme moves (>2% in NEPSE context)
            extreme_moves = df['Daily_Return'].abs() > 0.02
            df['Extreme_Move_Count'] = extreme_moves.expanding().sum()
            
            # Frequency of extreme moves (increasing = rising volatility regime)
            df['Extreme_Move_Freq'] = df['Extreme_Move_Count'] / days
        
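        # Days since all-time high (how long has it been since the peak?)
        # The running peak is built from High, so compare High (not Close) against it
        running_peak = df['High'].expanding().max()
        is_new_high = df['High'] >= running_peak
        
        # Restart the counter at 0 on every new-high day, then count sessions since
        streak_id = is_new_high.cumsum()
        df['Days_Since_High'] = (~is_new_high).astype(int).groupby(streak_id).cumsum()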
        
        print("\nCreated running count features:")
        print("  - Positive_Days_Count: Total up days")
        print("  - Win_Rate: Percentage of up days")
        print("  - Extreme_Move_Count: Days with >2% moves")
        print("  - Days_Since_High: Duration since last all-time high")
        
        return df

# Demonstration
if __name__ == "__main__":
    # Continue from previous features
    expanding_engineer = NEPSEExpandingFeatures(rolling_engineer.df)
    
    expanding_engineer.create_cumulative_statistics()
    expanding_engineer.create_cumulative_returns()
    expanding_engineer.create_running_counts()
    
    # Display expanding features
    print("\nExpanding Window Features (showing evolution over time):")
    display_cols = ['Close', 'Cumulative_Mean', 'Drawdown', 'Cumulative_Return', 
                   'Win_Rate', 'Days_Since_High']
    print(expanding_engineer.df[display_cols].iloc[::20])  # Show every 20th row
    
    # Show how features stabilize over time
    print("\nFeature Stability (first vs last 20 observations):")
    print("First 20:")
    print(expanding_engineer.df[['Cumulative_Mean', 'Win_Rate']].head(20))
    print("\nLast 20:")
    print(expanding_engineer.df[['Cumulative_Mean', 'Win_Rate']].tail(20))
```

**Explanation:**

This section implements **expanding window features**—cumulative statistics that incorporate all historical data up to the current point, providing long-term context and stable benchmarks for the NEPSE prediction system.

**Cumulative Price Statistics:**
The `create_cumulative_statistics()` method generates features that represent the "full history" context:

- **Cumulative_Mean**: The average price since the beginning of the dataset. Unlike the 20-day moving average which fluctuates, this expands to include all data and becomes increasingly stable. For a NEPSE stock listed 5 years ago, the cumulative mean represents its long-term equilibrium value—prices significantly above this level may be overvalued relative to historical norms.

- **Cumulative_Std**: The standard deviation calculated over the entire history. This measures the stock's inherent volatility characteristic. A stock with cumulative_std of NPR 50 is inherently more volatile than one with NPR 10, regardless of current price level.

- **Dist_Cumulative_Mean**: A z-score using the cumulative statistics: $(Close - Cumulative\_Mean) / Cumulative\_Std$. This indicates how many standard deviations the current price is from its all-time average. Values above +2 suggest extreme overvaluation; below -2 suggest extreme undervaluation (mean reversion opportunities).

- **Drawdown**: The percentage decline from the all-time high: $(Close - Cumulative\_Max) / Cumulative\_Max$. This is always negative or zero, measuring how far the stock has fallen from its peak. In NEPSE, drawdowns of -30% to -50% are common during bear markets and often mark good entry points for long-term investors.
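
To make the z-score and drawdown arithmetic concrete, here is a four-row worked example with made-up prices. Each row uses only data available up to that row, so no look-ahead is introduced.

```python
import pandas as pd

close = pd.Series([100.0, 120.0, 90.0, 110.0])
high = pd.Series([102.0, 125.0, 95.0, 112.0])

# Expanding statistics: row t sees rows 0..t only
cum_mean = close.expanding().mean()
cum_std = close.expanding().std()      # NaN on the first row (one observation)
z_score = (close - cum_mean) / cum_std

# Drawdown relative to the running peak of High
running_peak = high.expanding().max()
drawdown = (close - running_peak) / running_peak
```

On the third row the close of 90 sits 28% below the running peak of 125. The first z-score is undefined and the next few rest on very few observations, so in practice it may be worth passing `min_periods` to `expanding()` before trusting these values.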

**Cumulative Return Metrics:**
The `create_cumulative_returns()` method tracks total performance:

- **Cumulative_Return**: The total percentage return since the start of the data: $(P_t / P_0) - 1$. If a NEPSE stock started at NPR 100 and is now NPR 250, the cumulative return is 150%. This provides absolute performance context—has this stock been a long-term winner or loser?

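- **CAGR (Compound Annual Growth Rate)**: The annualized return rate implied by the cumulative performance. Calculated as $(P_t / P_0)^{252/(t+1)} - 1$, where $t$ is the zero-based row index (the $+1$ avoids division by zero on the first row) and 252 is the assumed number of trading days in a year; NEPSE's actual session count may differ, so treat it as a tunable constant. Values in the first weeks are extreme because small moves are annualized over a tiny window. This allows comparison across different time periods and stocks. A NEPSE stock with 15% CAGR is outperforming typical bank deposits and many fixed-income alternatives.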

- **Volume_vs_Historical**: Compares today's volume to the all-time average volume since listing. Values above 2.0 indicate today's activity is twice the historical norm, suggesting unusual interest or institutional activity in an otherwise quiet stock.
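
The NPR 100 to NPR 250 example can be worked through numerically. The `cagr` helper below is a hypothetical standalone function, and the 252-session year is the same assumption used in the class above.

```python
def cagr(p0: float, pt: float, periods_elapsed: int,
         periods_per_year: int = 252) -> float:
    """Annualized growth rate implied by moving from p0 to pt."""
    if periods_elapsed <= 0:
        raise ValueError("periods_elapsed must be positive")
    return (pt / p0) ** (periods_per_year / periods_elapsed) - 1

cumulative_return = 250 / 100 - 1          # 1.5, i.e. 150%
two_year_cagr = cagr(100, 250, 504)        # 504 sessions = two assumed years

# Early in a series the same formula explodes: a 2% gain after 5 sessions
early_cagr = cagr(100, 102, 5)
```

Over two assumed years the 150% total return annualizes to roughly 58% per year, while the 2% gain after five sessions annualizes to well over 100%, which is why early CAGR values are better masked or ignored.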

**Running Count Features:**
The `create_running_counts()` method tracks event frequencies:

- **Win_Rate**: The cumulative percentage of trading days that closed positive. This stabilizes over time to reflect the stock's underlying trend bias. A win rate above 55% indicates a strong uptrend; below 45% indicates persistent downtrend. For NEPSE, win rates cluster around 50% in sideways markets but deviate during trending periods.

- **Days_Since_High**: Counts how many trading days have passed since the last all-time high was made. In bull markets, this stays near 0 (constant new highs). During corrections, it accumulates, indicating duration of the pullback. In NEPSE, pullbacks lasting more than 60 trading days (roughly 3 months) often represent significant bear markets requiring fundamental reassessment.
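
One way to build a counter that restarts at zero on each new high is to label the stretches between highs and count within each stretch. The sketch below assumes the peak is tracked on `High`; the function name is illustrative.

```python
import pandas as pd

def days_since_high(high: pd.Series) -> pd.Series:
    """Sessions elapsed since the running all-time high was last reached."""
    running_peak = high.expanding().max()
    at_peak = high >= running_peak
    # Each new high starts a new group; count non-peak rows inside the group
    streak_id = at_peak.cumsum()
    return (~at_peak).astype(int).groupby(streak_id).cumsum()

high = pd.Series([10.0, 12.0, 11.0, 11.5, 13.0, 12.0])
counter = days_since_high(high)
```

The counter climbs to 2 during the pullback from 12 and drops back to 0 when 13 sets a new peak, which matches the "duration of the pullback" reading described above.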

**Comparison with Rolling Windows:**
While rolling windows (20-day) adapt quickly to recent changes, expanding windows provide stability and long-term context. For example:
- **Rolling_20_Mean**: Changes significantly during a month-long rally
- **Cumulative_Mean**: Moves slowly, reflecting the stock's long-term value anchor

In the NEPSE prediction system, combining both is powerful: the rolling mean identifies short-term trends, while the expanding mean identifies deviations from historical value. When `SMA_20` crosses above `Cumulative_Mean`, it signals that recent momentum has pushed the price above its long-term average, potentially indicating the start of a new bull phase in the cyclical NEPSE market.
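
A crossover like this can be flagged with a shift-based comparison. The helper below is a minimal sketch with a hypothetical name; it deliberately ignores rows where either series is still NaN so that the first valid moving-average value is not mistaken for a cross.

```python
import numpy as np
import pandas as pd

def crossed_above(fast: pd.Series, slow: pd.Series) -> pd.Series:
    """True on the row where `fast` moves from at-or-below `slow` to above it."""
    above = fast > slow
    # Comparisons with NaN are False, so warm-up rows never count as "below"
    was_at_or_below = fast.shift(1) <= slow.shift(1)
    return above & was_at_or_below

fast = pd.Series([np.nan, 1.0, 3.0, 3.0, 1.0, 4.0])
slow = pd.Series([2.0] * 6)
signal = crossed_above(fast, slow)
```

With real data the call would look like `crossed_above(df['Close'].rolling(20).mean(), df['Close'].expanding().mean())`, and the result can be cast to `int` for use as a model feature.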

---

## **11.7 Time-Based Features**

Time-based features encode calendar effects, seasonal patterns, and temporal context that influence stock market behavior. Financial markets exhibit regular patterns tied to the calendar—day-of-week effects (weekends, Monday blues), month-end effects (portfolio rebalancing), and seasonal patterns (fiscal year-end tax selling).

For the NEPSE prediction system, time-based features are particularly important because of Nepal's unique calendar structure: the Sunday-Thursday trading week (different from Western markets), the mid-July fiscal year-end (Shrawan to Ashad), and the influence of Nepali festivals (Dashain, Tihar) which create distinct seasonal liquidity patterns.

```python
class NEPSETimeFeatures:
    """
    Time-based and calendar features specific to NEPSE.
    Captures seasonal patterns, trading calendar effects, and fiscal year cycles.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        if 'Date' in self.df.columns:
            self.df['Date'] = pd.to_datetime(self.df['Date'])
        
    def create_basic_calendar_features(self) -> pd.DataFrame:
        """
        Standard calendar features (day, month, year).
        """
        df = self.df
        
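        # Fall back to a synthetic Sunday-Thursday calendar if no Date column exists
        if 'Date' not in df.columns:
            print("Warning: No Date column found, creating sequence-based features")
            nepse_week = pd.offsets.CustomBusinessDay(weekmask='Sun Mon Tue Wed Thu')
            df['Date'] = pd.date_range(start='2023-01-01', periods=len(df), freq=nepse_week)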
        
        # Extract components
        df['Year'] = df['Date'].dt.year
        df['Month'] = df['Date'].dt.month
        df['Day'] = df['Date'].dt.day
        df['Day_of_Week'] = df['Date'].dt.dayofweek  # 0=Monday, 6=Sunday
        df['Day_of_Year'] = df['Date'].dt.dayofyear
        df['Week_of_Year'] = df['Date'].dt.isocalendar().week
        
        # Quarter (calendar)
        df['Quarter'] = df['Date'].dt.quarter
        
        print("Created basic calendar features:")
        print(f"  - Date range: {df['Date'].min()} to {df['Date'].max()}")
        print(f"  - Days: {df['Day_of_Week'].nunique()} unique days")
        print(f"  - Months: {df['Month'].nunique()} unique months")
        
        return df
    
    def create_nepse_trading_calendar(self) -> pd.DataFrame:
        """
        NEPSE-specific trading calendar features.
        NEPSE trades Sunday through Thursday; Friday and Saturday are the weekend.
        """
        df = self.df
        
        # pandas dayofweek: Monday=0, Tuesday=1, Wednesday=2, Thursday=3,
        # Friday=4, Saturday=5, Sunday=6
        # NEPSE trades Sunday(6), Monday(0), Tuesday(1), Wednesday(2), Thursday(3)
        
        df['Is_Sunday'] = (df['Day_of_Week'] == 6).astype(int)
        df['Is_Monday'] = (df['Day_of_Week'] == 0).astype(int)
        df['Is_Thursday'] = (df['Day_of_Week'] == 3).astype(int)  # Thursday is last trading day
        
        # Weekend proximity (Friday/Saturday are weekend)
        # Thursday is "pre-weekend", Sunday is "post-weekend"
        df['Is_Pre_Weekend'] = df['Is_Thursday']  # Last trading day before weekend
        df['Is_Post_Weekend'] = df['Is_Sunday']   # First trading day after weekend
        
        # Days since weekend (0 for Sunday, 4 for Thursday)
        # Maps Sunday->0, Monday->1, Tuesday->2, Wednesday->3, Thursday->4
        day_map = {6: 0, 0: 1, 1: 2, 2: 3, 3: 4}
        df['Days_Since_Weekend'] = df['Day_of_Week'].map(day_map)
        
        # Position within the trading week, normalized to 0-1:
        # 0 = Sunday (first session after the two-day gap), 1 = Thursday (last session before it)
        # Rows on Friday/Saturday, if any, have no mapping and stay NaN
        df['Weekend_Risk_Score'] = df['Days_Since_Weekend'] / 4
        
        print("\nCreated NEPSE trading calendar features:")
        print("  - Is_Sunday: First trading day (gap risk)")
        print("  - Is_Thursday: Last trading day (position squaring)")
        print("  - Days_Since_Weekend: Trading day counter (0-4)")
        print("  - Weekend_Risk_Score: Position in trading week (0=Sunday, 1=Thursday)")
        
        return df
    
    def create_nepali_fiscal_calendar(self) -> pd.DataFrame:
        """
        Nepali fiscal year features.
        Nepal's fiscal year: Shrawan (mid-July) to Ashad (mid-July)
        Critical for tax and reporting seasonality.
        """
        df = self.df
        
        # Fiscal Year determination
        # FY starts mid-July (approx July 16)
        month_day = df['Date'].dt.month * 100 + df['Date'].dt.day
        
        # If before July 16, we're in previous fiscal year
        # Example: Jan 15, 2024 (1*100 + 15 = 115) < 716, so FY 2023/24 (FY 2080/81 in the Nepali calendar)
        df['Fiscal_Year'] = df['Date'].dt.year
        df.loc[month_day < 716, 'Fiscal_Year'] -= 1
        
        # Fiscal Quarter (Nepal specific)
        # Q1: Shrawan-Bhadra-Ashwin (Jul, Aug, Sep) - Monsoon, slow start
        # Q2: Kartik-Mangsir-Poush (Oct, Nov, Dec) - Post-Dashain activity
        # Q3: Magh-Falgun-Chaitra (Jan, Feb, Mar) - Winter, steady trading
        # Q4: Baisakh-Jestha-Ashad (Apr, May, Jun) - Year-end rally/tax selling
        # Vectorized: shift months so July=0, then bucket into groups of three
        df['Fiscal_Quarter'] = ((df['Month'] - 7) % 12) // 3 + 1
        
        # Fiscal month (1-12 starting from Shrawan/July)
        # July=1, August=2, ..., June=12
        fiscal_month_map = {7: 1, 8: 2, 9: 3, 10: 4, 11: 5, 12: 6,
                           1: 7, 2: 8, 3: 9, 4: 10, 5: 11, 6: 12}
        df['Fiscal_Month'] = df['Month'].map(fiscal_month_map)
        
        # Fiscal year-end proximity (days until mid-July)
        # Critical for tax-loss harvesting and window dressing
        # Fiscal_Year holds the starting calendar year, so the fiscal year
        # always ends on July 15 of the following calendar year
        year_end_date = pd.to_datetime((df['Fiscal_Year'] + 1).astype(str) + '-07-15')
        
        df['Days_to_FY_End'] = (year_end_date - df['Date']).dt.days
        
        # Is year-end quarter (Q4: Apr-Jun under the calendar-month approximation)
        df['Is_FY_End_Quarter'] = (df['Fiscal_Quarter'] == 4).astype(int)
        
        # Is fiscal year-end month (June, approximating Ashad, which runs mid-June to mid-July)
        df['Is_FY_End_Month'] = (df['Month'] == 6).astype(int)
        
        print("\nCreated Nepali fiscal calendar features:")
        print(f"  - Fiscal Year range: {df['Fiscal_Year'].min()} to {df['Fiscal_Year'].max()}")
        print("  - Fiscal_Quarter: Q1=Jul-Sep, Q2=Oct-Dec, Q3=Jan-Mar, Q4=Apr-Jun")
        print(f"  - Days_to_FY_End: {df['Days_to_FY_End'].min()} to {df['Days_to_FY_End'].max()}")
        
        return df
    
    def create_seasonal_features(self) -> pd.DataFrame:
        """
        Seasonal and cyclical encoding.
        Converts linear time into cyclical patterns.
        """
        df = self.df
        
        # Cyclical encoding of month (preserves circularity: Dec close to Jan)
        df['Month_Sin'] = np.sin(2 * np.pi * df['Month'] / 12)
        df['Month_Cos'] = np.cos(2 * np.pi * df['Month'] / 12)
        
        # Cyclical encoding of day of week
        df['DOW_Sin'] = np.sin(2 * np.pi * df['Day_of_Week'] / 7)
        df['DOW_Cos'] = np.cos(2 * np.pi * df['Day_of_Week'] / 7)
        
        # Cyclical encoding of fiscal month
        if 'Fiscal_Month' in df.columns:
            df['Fiscal_Month_Sin'] = np.sin(2 * np.pi * df['Fiscal_Month'] / 12)
            df['Fiscal_Month_Cos'] = np.cos(2 * np.pi * df['Fiscal_Month'] / 12)
        
        # Month-end effects (common in all markets)
        df['Is_Month_Start'] = (df['Date'].dt.is_month_start).astype(int)
        df['Is_Month_End'] = (df['Date'].dt.is_month_end).astype(int)
        df['Days_to_Month_End'] = (df['Date'] + pd.offsets.MonthEnd(0) - df['Date']).dt.days
        
        # Festival seasons (approximate dates for NEPSE)
        # Dashain (October), Tihar (November) - markets typically thin
        df['Is_Dashain_Season'] = (df['Month'] == 10).astype(int)
        df['Is_Tihar_Season'] = (df['Month'] == 11).astype(int)
        
        # Monsoon season (Jun-Sep) - affects hydropower stocks heavily
        df['Is_Monsoon'] = df['Month'].isin([6, 7, 8, 9]).astype(int)
        
        print("\nCreated seasonal features:")
        print("  - Cyclical encoding (sin/cos) for Month, Day of Week")
        print("  - Month start/end indicators")
        print("  - Festival season flags (Dashain, Tihar)")
        print("  - Monsoon season flag (hydropower sensitivity)")
        
        return df

# Demonstration
if __name__ == "__main__":
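    # Create sample data with dates spanning multiple years
    np.random.seed(42)
    # NEPSE sessions run Sunday-Thursday (public holidays ignored for simplicity)
    nepse_week = pd.offsets.CustomBusinessDay(weekmask='Sun Mon Tue Wed Thu')
    dates = pd.date_range(start='2022-01-01', end='2023-12-31', freq=nepse_week)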
    
    sample_df = pd.DataFrame({
        'Date': dates,
        'Close': np.random.uniform(1800, 2200, len(dates))
    })
    
    time_engineer = NEPSETimeFeatures(sample_df)
    
    time_engineer.create_basic_calendar_features()
    time_engineer.create_nepse_trading_calendar()
    time_engineer.create_nepali_fiscal_calendar()
    time_engineer.create_seasonal_features()
    
    # Display time features
    print("\nTime-Based Features Sample:")
    display_cols = ['Date', 'Day_of_Week', 'Is_Sunday', 'Is_Thursday', 
                   'Fiscal_Quarter', 'Days_to_FY_End', 'Month_Sin', 'Month_Cos']
    print(time_engineer.df[display_cols].head(20))
    
    # Show fiscal year transition
    print("\nFiscal Year Transition (June-July):")
    jun_jul_mask = (time_engineer.df['Month'] == 6) | (time_engineer.df['Month'] == 7)
    print(time_engineer.df[jun_jul_mask][['Date', 'Fiscal_Year', 'Fiscal_Month', 'Days_to_FY_End']].head(10))
```

**Explanation:**

This section implements **time-based features** that encode calendar effects specific to the NEPSE trading environment and Nepali fiscal calendar.

**NEPSE Trading Calendar:**
The `create_nepse_trading_calendar()` method addresses the unique Sunday-Thursday trading schedule of the Nepal Stock Exchange, which differs from the Monday-Friday schedule of Western markets:

- **Is_Sunday**: Identifies the first trading day of the week. In NEPSE, Sunday often exhibits "weekend gap" behavior—prices opening significantly higher or lower than Thursday's close due to news accumulation over Friday and Saturday. This creates volatility and mean-reversion opportunities distinct from other trading days.

- **Is_Thursday**: The last trading day before the weekend. In NEPSE, Thursdays often see "position squaring"—traders closing positions to avoid weekend risk (two days of potential news accumulation without ability to trade). This can create volume spikes and reversal patterns.

- **Days_Since_Weekend**: A counter from 0 (Sunday) to 4 (Thursday) that captures the progression of the trading week. It lets a model learn intraweek patterns, such as the weekend-gap volatility on Sunday and the position adjustments on Thursday described above, without hard-coding them. Check any assumed day-of-week pattern against your own NEPSE data before relying on it.
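
The Sunday-Thursday week can be generated directly with a custom business-day offset. The sketch below ignores public holidays (an assumption; a real calendar would pass them via the offset's `holidays` argument) and adds a hypothetical `Gap_Days` column measuring the calendar days since the previous session.

```python
import pandas as pd

# Sunday-Thursday sessions; public holidays are ignored in this sketch
NEPSE_WEEK = pd.offsets.CustomBusinessDay(weekmask='Sun Mon Tue Wed Thu')

sessions = pd.DataFrame({
    'Date': pd.date_range('2024-01-07', periods=10, freq=NEPSE_WEEK)  # starts on a Sunday
})

dow = sessions['Date'].dt.dayofweek                 # Monday=0 ... Sunday=6
sessions['Days_Since_Weekend'] = (dow + 1) % 7      # Sunday->0 ... Thursday->4
sessions['Gap_Days'] = sessions['Date'].diff().dt.days  # 3 on Sundays, 1 otherwise
```

`Gap_Days` is a direct measure of how long news could accumulate without trading, and it also grows automatically around holidays when the dates come from real data. Note that the modulo shortcut assigns 5 and 6 to Friday and Saturday instead of leaving them missing, so it is only appropriate once non-trading days have been filtered out.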

**Nepali Fiscal Calendar:**
The `create_nepali_fiscal_calendar()` method implements Nepal's unique fiscal year (Shrawan to Ashad, mid-July to mid-July), which drives distinct seasonal patterns in the stock market:

- **Fiscal_Year**: Determined by comparing each date, encoded as month × 100 + day, against July 16 (716). January 15, 2024 encodes to 115, which is below 716, so it belongs to Fiscal Year 2023 (mid-July 2023 to mid-July 2024). This is critical because institutional investors report performance by fiscal year.

- **Fiscal_Quarter**: Approximated by calendar month. Q1 (July-September) is the post-budget period when new government spending plans affect infrastructure and bank stocks. Q4 (April-June in the code, with the fiscal year itself closing in mid-July) is the year-end period when tax-loss harvesting and "window dressing" (buying winners to show in portfolio reports) create distinct market dynamics.

- **Days_to_FY_End**: Counts down the calendar days to July 15. As this approaches 0 (late June to mid-July), NEPSE typically sees increased volatility as investors realize tax losses (selling losers) and rebalance portfolios. Stocks down significantly for the fiscal year often face additional selling pressure in Ashad (mid-June to mid-July).
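
The fiscal-year logic can be checked on a few boundary dates. This sketch assumes a fixed July 16 start and July 15 end; the Gregorian date of Shrawan 1 shifts slightly between years, so an exact implementation would need a Bikram Sambat calendar lookup. The function name is hypothetical.

```python
import pandas as pd

def fiscal_features(dates: pd.Series) -> pd.DataFrame:
    """Fiscal year (by starting calendar year) and days left until July 15."""
    dates = pd.to_datetime(dates)
    month_day = dates.dt.month * 100 + dates.dt.day
    # Dates before July 16 still belong to the fiscal year that began last July
    fiscal_year = dates.dt.year - (month_day < 716).astype(int)
    # The fiscal year always ends in the calendar year after it started
    fy_end = pd.to_datetime((fiscal_year + 1).astype(str) + '-07-15')
    return pd.DataFrame({
        'Fiscal_Year': fiscal_year,
        'Days_to_FY_End': (fy_end - dates).dt.days,
    })

dates = pd.Series(pd.to_datetime(
    ['2023-07-15', '2023-07-16', '2024-01-15', '2024-07-15']
))
features = fiscal_features(dates)
```

July 15 is the last day of one fiscal year (0 days remaining) and July 16 is the first day of the next (365 days remaining here, because the span includes February 29, 2024). The countdown never goes negative.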

**Cyclical Encoding:**
The `create_seasonal_features()` method addresses a mathematical issue with raw month/day numbers: December (12) and January (1) are numerically far apart (difference of 11), but cyclically adjacent. The sine/cosine transformation maps months onto a circle:

- **Month_Sin/Month_Cos**: Convert month 1-12 into coordinates on a unit circle. January (month 1) maps to specific (sin, cos) values, as does December (month 12), and these points are close together on the circle, correctly representing that December flows into January. This allows machine learning models to understand seasonal continuity.
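
The effect of the encoding is easy to verify by measuring distances between months in (sin, cos) space. The helper names below are illustrative only.

```python
import numpy as np

def encode_month(month) -> np.ndarray:
    """Map month numbers 1-12 to (sin, cos) points on the unit circle."""
    angle = 2 * np.pi * np.asarray(month, dtype=float) / 12
    return np.column_stack([np.sin(angle), np.cos(angle)])

def circular_distance(m1: int, m2: int) -> float:
    """Straight-line distance between two encoded months."""
    a, b = encode_month([m1, m2])
    return float(np.linalg.norm(a - b))

dec_to_jan = circular_distance(12, 1)
jan_to_feb = circular_distance(1, 2)
jan_to_jul = circular_distance(1, 7)
```

December to January comes out exactly as close as January to February, while January and July sit on opposite sides of the circle at the maximum distance of 2. With raw month numbers, December and January would instead be 11 units apart.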

**Festival and Seasonal Effects:**
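- **Is_Dashain_Season / Is_Tihar_Season**: Dashain (Nepal's biggest festival) usually falls in late September or October, and Tihar follows in October or November; the code approximates them with calendar flags for October and November. During these periods, NEPSE trading volume drops significantly as investors focus on celebrations rather than markets, creating liquidity crunches and erratic price movements.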

- **Is_Monsoon**: June-September marks the monsoon season in Nepal. This is particularly relevant for NEPSE because a large number of listed companies are hydropower producers, which generate more electricity during the monsoon. The monsoon feature captures this sectoral seasonality, which can spill over into the broad market.

These time-based features allow the NEPSE prediction model to account for structural calendar effects that pure price-based models would miss—from the weekly rhythm of Sunday-Thursday trading to the annual cycle of fiscal year-end portfolio adjustments unique to Nepal's tax and reporting calendar.

---

## **11.8 Interaction Features**

Interaction features capture the combined effect of two or more variables that is greater than the sum of their individual effects. In financial markets, the relationship between price and volume, or between volatility and trend, often provides more predictive power than either variable alone. These features model the "chemistry" between different market dimensions.

For the NEPSE prediction system, interaction features are particularly valuable because they can identify regime-specific patterns—for example, a price increase on high volume has different implications than the same price increase on low volume in Nepal's relatively illiquid market structure.

```python
class NEPSEInteractionFeatures:
    """
    Interaction features combining multiple variables.
    Capture synergistic effects between price, volume, and volatility.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        
    def create_price_volume_interactions(self) -> pd.DataFrame:
        """
        Interactions between price movements and volume activity.
        Volume validates (or invalidates) price moves.
        """
        df = self.df
        
        # Ensure required columns exist
        if 'Daily_Return' not in df.columns:
            df['Daily_Return'] = df['Close'].pct_change()
        
        if 'Rel_Volume_10' not in df.columns:
            avg_vol = df['Vol'].rolling(10).mean()
            df['Rel_Volume_10'] = df['Vol'] / avg_vol
        
        # Volume-Confirmed Return (return weighted by relative volume)
        # High values = strong move with high conviction
        df['Volume_Confirmed_Return'] = df['Daily_Return'] * df['Rel_Volume_10']
        
        # Price change efficiency (return per unit of volume)
        # High = efficient move (large price change on modest volume)
        # Low = inefficient move (large volume for small price change)
        df['Price_Efficiency'] = df['Daily_Return'] / (df['Rel_Volume_10'] + 0.001)
        
        # Volume-Price Trend (signed volume confirmation, not cumulative)
        # log(Rel_Volume_10 + 1) is non-negative, so the sign comes from the return
        # Large positive = up day on heavy volume; large negative = down day on heavy volume
        df['Volume_Price_Trend'] = np.sign(df['Daily_Return']) * np.log(df['Rel_Volume_10'] + 1)
        
        # VWAP deviation interaction (price vs average price, weighted by volume)
        if 'VWAP_Distance_Pct' in df.columns:
            df['VWAP_Volume_Interaction'] = df['VWAP_Distance_Pct'] * df['Rel_Volume_10']
        
        # Climactic volume indicator (extreme volume + extreme price)
        # Often marks turning points
        extreme_price = df['Daily_Return'].abs() > df['Daily_Return'].rolling(20).std() * 2
        extreme_vol = df['Rel_Volume_10'] > 2.0
        df['Climactic_Move'] = (extreme_price & extreme_vol).astype(int)
        
        print("Created price-volume interactions:")
        print("  - Volume_Confirmed_Return: Return weighted by volume")
        print("  - Price_Efficiency: Return per unit volume")
        print("  - Volume_Price_Trend: Signed volume confirmation")
        print("  - Climactic_Move: Extreme volume + price (reversal signal)")
        
        return df
    
    def create_volatility_trend_interactions(self) -> pd.DataFrame:
        """
        Interactions between volatility and trend direction.
        Different volatility regimes favor different strategies.
        """
        df = self.df
        
        # Trend strength (distance from moving average)
        if 'Dist_SMA_20_Pct' in df.columns:
            trend_strength = df['Dist_SMA_20_Pct']
        else:
            sma = df['Close'].rolling(20).mean()
            trend_strength = ((df['Close'] - sma) / sma) * 100
        
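        # Volatility regime (needs daily returns, so create them if missing)
        if 'Daily_Return' not in df.columns:
            df['Daily_Return'] = df['Close'].pct_change()
        if 'Volatility_20' not in df.columns:
            df['Volatility_20'] = df['Daily_Return'].rolling(20).std()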
        
        vol_percentile = df['Volatility_20'].rolling(252).apply(
            lambda x: pd.Series(x).rank(pct=True).iloc[-1], raw=True
        )
        
        # Volatility-Adjusted Trend (trend strength relative to noise)
        # High values = strong trend, low noise (good for trend following)
        df['Trend_SNR'] = trend_strength / (df['Volatility_20'] + 0.0001)  # Signal-to-noise ratio
        
        # Volatility Regime Trend
        # How does trend behave in high vs low volatility?
        df['High_Vol_Trend'] = trend_strength * (vol_percentile > 0.75).astype(int)
        df['Low_Vol_Trend'] = trend_strength * (vol_percentile < 0.25).astype(int)
        
        # Expansion/Contraction interaction
        # Price trend during volatility expansion vs contraction
        vol_expanding = df['Volatility_20'] > df['Volatility_20'].shift(5)
        df['Expansion_Trend'] = trend_strength * vol_expanding.astype(int)
        df['Contraction_Trend'] = trend_strength * (~vol_expanding).astype(int)
        
        print("\nCreated volatility-trend interactions:")
        print("  - Trend_SNR: Trend strength relative to volatility")
        print("  - High/Low_Vol_Trend: Trend in different volatility regimes")
        print("  - Expansion/Contraction_Trend: Trend during volatility changes")
        
        return df
    
    def create_range_volume_interactions(self) -> pd.DataFrame:
        """
        Interactions between daily range (volatility) and volume.
        Identifies breakouts vs false moves.
        """
        df = self.df
        
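        # Ensure relative volume exists (normally created by create_price_volume_interactions)
        if 'Rel_Volume_10' not in df.columns:
            df['Rel_Volume_10'] = df['Vol'] / df['Vol'].rolling(10).mean()
        
        # Range expansion on volume (true breakout)
        daily_range = (df['High'] - df['Low']) / df['Close']
        df['Range_Volume_Breakout'] = daily_range * df['Rel_Volume_10']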
        
        # Narrow range on low volume (consolidation before expansion)
        narrow_range = daily_range < daily_range.rolling(20).quantile(0.2)
        low_volume = df['Rel_Volume_10'] < 0.8
        df['Consolidation_Signal'] = (narrow_range & low_volume).astype(int)
        
        # Wide range on low volume (potential trap)
        wide_range = daily_range > daily_range.rolling(20).quantile(0.8)
        df['Trap_Signal'] = (wide_range & low_volume).astype(int)
        
        print("\nCreated range-volume interactions:")
        print("  - Range_Volume_Breakout: Wide range on high volume")
        print("  - Consolidation_Signal: Tight range on low volume")
        print("  - Trap_Signal: Wide range on low volume (false breakout)")
        
        return df
    
    def create_multi_lag_interactions(self) -> pd.DataFrame:
        """
        Interactions between different time lags.
        Captures acceleration and momentum shifts.
        """
        df = self.df
        
        # Ensure every lag used below exists (not only Close_Lag_1)
        for lag in [1, 5, 20]:
            if f'Close_Lag_{lag}' not in df.columns:
                df[f'Close_Lag_{lag}'] = df['Close'].shift(lag)
        
        # Short-term vs Long-term momentum interaction
        short_term = (df['Close'] - df['Close_Lag_1']) / df['Close_Lag_1']
        long_term = (df['Close'] - df['Close_Lag_20']) / df['Close_Lag_20']
        
        # Momentum alignment (both positive or both negative = strong trend)
        df['Momentum_Alignment'] = short_term * long_term
        
        # Momentum divergence (short term opposite to long term = potential reversal)
        df['Momentum_Divergence'] = abs(short_term - long_term)
        
        # Acceleration (change in short-term momentum):
        # today's 1-day return minus yesterday's 1-day return, so both terms share a horizon
        df['Momentum_Acceleration'] = short_term - short_term.shift(1)
        
        print("\nCreated multi-lag interactions:")
        print("  - Momentum_Alignment: Short-term × Long-term (trend strength)")
        print("  - Momentum_Divergence: Difference between timeframes")
        print("  - Momentum_Acceleration: Change in momentum rate")
        
        return df

# Demonstration
if __name__ == "__main__":
    # Setup with previous features
    interaction_engineer = NEPSEInteractionFeatures(expanding_engineer.df)
    
    interaction_engineer.create_price_volume_interactions()
    interaction_engineer.create_volatility_trend_interactions()
    interaction_engineer.create_range_volume_interactions()
    interaction_engineer.create_multi_lag_interactions()
    
    # Display interaction features
    print("\nInteraction Features Sample:")
    display_cols = ['Close', 'Daily_Return', 'Rel_Volume_10', 'Volume_Confirmed_Return',
                   'Trend_SNR', 'Climactic_Move', 'Momentum_Alignment']
    print(interaction_engineer.df[display_cols].tail(10))
```

**Explanation:**

This section implements **interaction features** that combine multiple variables to capture synergistic effects not visible when examining features in isolation.

**Price-Volume Interactions:**
The `create_price_volume_interactions()` method addresses a fundamental principle of technical analysis: volume validates price. A price movement on high volume indicates broad participation and conviction, while the same movement on low volume suggests lack of participation and potential reversal:

- **Volume_Confirmed_Return**: Multiplies the daily return by relative volume. A +2% return on 3x average volume scores 0.06 (0.02 × 3), while the same +2% on 0.5x volume scores only 0.01. This amplifies moves backed by heavy participation and dampens noise from thin trading.

- **Price_Efficiency**: Measures return generated per unit of volume. High efficiency (large price move on modest volume) suggests strong supply/demand imbalance and potential continuation. Low efficiency (large volume for small price move) suggests equilibrium and potential reversal. In NEPSE's illiquid market, efficiency spikes often precede trend changes.

- **Climactic_Move**: A binary flag identifying extreme price moves (>2 standard deviations) occurring on extreme volume (>2x average). In NEPSE, these "blow-off" moves often mark trend exhaustion—parabolic rallies ending in volume spikes as the last buyers enter, or panic selling capitulations marking bottoms.
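
The arithmetic behind `Volume_Confirmed_Return` can be checked on a toy frame. This is a minimal sketch: the two-row data and the `toy` name are made up for illustration, and `Rel_Volume_10` is assumed to already hold volume divided by its 10-day average.

```python
import pandas as pd

# Hypothetical two-day example: same +2% return, very different participation
toy = pd.DataFrame({
    'Daily_Return': [0.02, 0.02],    # returns expressed as fractions
    'Rel_Volume_10': [3.0, 0.5],     # volume relative to its 10-day average
})

# Return scaled by relative volume, as described for Volume_Confirmed_Return
toy['Volume_Confirmed_Return'] = toy['Daily_Return'] * toy['Rel_Volume_10']
print(toy)
```

The heavy-volume day ends up six times larger than the thin-volume day, even though the raw return is identical.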

**Volatility-Trend Interactions:**
The `create_volatility_trend_interactions()` method combines trend strength with market noise levels:

- **Trend_SNR (Signal-to-Noise Ratio)**: Divides trend strength (distance from 20-day MA) by volatility. High SNR indicates a strong, clean trend suitable for trend-following strategies. Low SNR indicates choppy, directionless markets where mean-reversion strategies tend to work better. As a starting point for NEPSE, SNR > 2.0 can be read as strong trending conditions and < 0.5 as range-bound, but these cut-offs are rules of thumb and should be calibrated on historical data.

- **High_Vol_Trend**: Isolates trend strength during high volatility regimes (top quartile). In NEPSE, trends during high volatility are often unsustainable and prone to sharp reversals, unlike low-volatility trends which tend to persist.

**Range-Volume Interactions:**
The `create_range_volume_interactions()` method distinguishes between genuine breakouts and false moves:

- **Range_Volume_Breakout**: Multiplies daily range by relative volume. Wide ranges on high volume indicate genuine breakouts as consensus forms around new price levels. This is a high-conviction entry signal for NEPSE momentum strategies.

- **Consolidation_Signal**: Identifies tight ranges (bottom 20% of 20-day range) on low volume (<0.8x average). This "coiled spring" pattern often precedes explosive moves in NEPSE stocks as pent-up supply/demand imbalances resolve.

- **Trap_Signal**: Wide ranges on low volume suggest "bull traps" or "bear traps"—false breakouts that suck in traders before reversing. In low-liquidity NEPSE stocks, a few large orders can create wide ranges without broad participation, creating these trap patterns.

**Multi-Lag Interactions:**
The `create_multi_lag_interactions()` method combines different time horizons:

- **Momentum_Alignment**: Multiplies short-term (1-day) momentum by long-term (20-day) momentum. Positive values indicate both short and long-term trends agree (strong directional move). Negative values indicate conflict (short-term counter-trend move within longer trend—potential pullback or reversal).

- **Momentum_Divergence**: The absolute difference between short and long-term momentum. High divergence warns of potential trend changes as near-term action conflicts with established direction.
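
How the sign of `Momentum_Alignment` flips during a pullback can be seen on a small synthetic series. The sketch below assumes a made-up price path (a steady 20-day climb followed by a one-day dip) and uses `pct_change` as a shorthand for the lag-based returns in the method.

```python
import pandas as pd

# Hypothetical closes: 100, 101, ..., 120, then a dip to 118
close = pd.Series([100.0 + i for i in range(21)] + [118.0])

short_term = close.pct_change(1)     # 1-day momentum
long_term = close.pct_change(20)     # 20-day momentum

alignment = short_term * long_term             # positive when both agree
divergence = (short_term - long_term).abs()    # gap between the two horizons

print("Last up day:  ", alignment.iloc[-2])
print("Pullback day: ", alignment.iloc[-1])
```

On the last up day both horizons are positive, so the product is positive. On the pullback day the 1-day return turns negative while the 20-day return is still positive, so the product turns negative.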

These interaction features allow the NEPSE prediction model to understand context—a +2% move means different things depending on volume, volatility regime, and alignment with longer-term trends. This multi-dimensional view is essential for accurate prediction in complex, adaptive markets like NEPSE.

---

## **11.9 Transformation Features**

Transformation features apply mathematical functions to raw or engineered features to improve their statistical properties for machine learning algorithms. Financial data often exhibits skewed distributions, heteroscedasticity (changing variance), and non-linear relationships that can impair model performance. Transformations stabilize variance, reduce skewness, and linearize relationships.

For the NEPSE prediction system, transformations are crucial because financial time-series typically have fat-tailed (leptokurtic) distributions with extreme outliers during market crashes or rallies. Transformations like log, square root, and power transforms make these distributions more symmetric and closer to Gaussian. This helps algorithms that are sensitive to feature scale and outliers, such as linear regression, neural networks, and SVMs; tree-based models are largely unaffected by monotonic transforms of their inputs.

```python
class NEPSETransformationFeatures:
    """
    Mathematical transformation features for NEPSE data.
    Improve statistical properties (normality, stationarity) for ML models.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        
    def create_log_transforms(self) -> pd.DataFrame:
        """
        Logarithmic transformations.
        Compresses scale, stabilizes variance, linearizes exponential growth.
        """
        df = self.df
        
        # Log of price (turns multiplicative relationships into additive)
        # Log(P_t) - Log(P_{t-1}) = Log(P_t / P_{t-1}) = log return
        df['Log_Close'] = np.log(df['Close'])
        df['Log_Volume'] = np.log(df['Vol'] + 1)  # +1 to handle zero volume
        df['Log_Turnover'] = np.log(df['Turnover'] + 1)
        
        # Log of ranges (stabilizes volatility measures)
        df['Log_Range'] = np.log((df['High'] - df['Low']) + 0.001)
        
        # Log differences (alternative return calculation)
        df['Log_Diff_1'] = df['Log_Close'].diff(1)
        df['Log_Diff_5'] = df['Log_Close'].diff(5)
        
        print("Created log transformation features:")
        print("  - Log_Close: Natural log of price")
        print("  - Log_Volume: Log of volume (handles skewness)")
        print("  - Log_Diff: Log differences (returns)")
        
        return df
    
    def create_power_transforms(self) -> pd.DataFrame:
        """
        Power transformations (square root, cube root, signed root) plus tanh compression.
        Fixed-exponent alternatives to fitted Box-Cox / Yeo-Johnson transforms.
        Reduces skewness in volume and volatility data.
        """
        df = self.df
        
        # Square root transformation (moderate skew reduction)
        # Good for volume data which is often Poisson-like
        df['Sqrt_Volume'] = np.sqrt(df['Vol'])
        df['Sqrt_Turnover'] = np.sqrt(df['Turnover'])
        
        # Cube root (stronger than sqrt, weaker than log)
        df['Cbrt_Volume'] = np.cbrt(df['Vol'])
        
        # Power transformation for returns (reduce kurtosis/fat tails)
        # Sign(x) * |x|^0.5 compresses extreme values while preserving sign
        if 'Daily_Return' in df.columns:
            df['Signed_Sqrt_Return'] = np.sign(df['Daily_Return']) * np.sqrt(np.abs(df['Daily_Return']))
            
            # Tanh compression (sigmoid-like, strongly compresses outliers)
            df['Tanh_Return'] = np.tanh(df['Daily_Return'] * 10)  # Scale factor 10 for daily returns
        
        print("\nCreated power transformation features:")
        print("  - Sqrt_Volume: Square root of volume")
        print("  - Signed_Sqrt_Return: Signed square root of returns")
        print("  - Tanh_Return: Hyperbolic tangent compression")
        
        return df
    
    def create_rank_and_quantile_features(self) -> pd.DataFrame:
        """
        Rank and quantile transformations (robust to outliers).
        Converts values to percentiles within recent history.
        """
        df = self.df
        
        # Percentile rank (0-1 scale) - robust to outliers
        for window in [20, 60]:
            # Price position in recent range
            df[f'Price_Rank_{window}'] = df['Close'].rolling(window).apply(
                lambda x: pd.Series(x).rank(pct=True).iloc[-1], raw=True
            )
            
            # Volume rank (is today's volume high or low historically?)
            df[f'Volume_Rank_{window}'] = df['Vol'].rolling(window).apply(
                lambda x: pd.Series(x).rank(pct=True).iloc[-1], raw=True
            )
            
            # Return rank (how extreme is today's move?)
            if 'Daily_Return' in df.columns:
                df[f'Return_Rank_{window}'] = df['Daily_Return'].rolling(window).apply(
                    lambda x: pd.Series(x).rank(pct=True).iloc[-1], raw=True
                )
        
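        # Quantile bins (discretization)
        # Reduces noise, captures non-linear relationships
        # NOTE: pd.qcut derives bin edges from the full sample, which includes
        # future data. Fit the edges on the training period only for backtests.
        # Ranking first breaks ties so duplicate bin edges cannot raise an error.
        df['Volume_Quintile'] = pd.qcut(df['Vol'].rank(method='first'), q=5,
                                        labels=['Very_Low', 'Low', 'Medium', 'High', 'Very_High'])
        if 'Daily_Return' in df.columns:
            df['Return_Decile'] = pd.qcut(df['Daily_Return'].rank(method='first'), q=10, labels=False)
        
        print("\nCreated rank/quantile features:")
        print("  - Price_Rank: Percentile position in recent range")
        print("  - Volume_Rank: Volume percentile (1/window=lowest, 1=highest)")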
        print("  - Volume_Quintile: Discrete volume categories")
        
        return df
    
    def create_differencing_features(self) -> pd.DataFrame:
        """
        Differencing transformations for stationarity.
        Removes trends and seasonal components.
        """
        df = self.df
        
        # First-order differencing (removes linear trends)
        df['Close_Diff_1'] = df['Close'].diff(1)
        df['Volume_Diff_1'] = df['Vol'].diff(1)
        
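        # Second-order differencing (difference of differences; removes curvature)
        # Note: diff(2) would instead give the 2-period lag difference P_t - P_{t-2}
        df['Close_Diff_2'] = df['Close'].diff().diff()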
        
        # Seasonal differencing (for fiscal year effects in NEPSE)
        # Difference from roughly one year earlier. 252 trading days is the common
        # convention; adjust it to NEPSE's actual number of trading days per year.
        df['Close_Diff_252'] = df['Close'].diff(252)
        
        # Relative differencing (percentage change from n periods ago)
        for period in [5, 20, 252]:
            df[f'Close_Pct_Change_{period}'] = (df['Close'] - df['Close'].shift(period)) / df['Close'].shift(period)
        
        print("\nCreated differencing features:")
        print("  - Close_Diff_1: First difference (daily change)")
        print("  - Close_Diff_2: Second difference (acceleration)")
        print("  - Close_Diff_252: Year-over-year difference")
        print("  - Pct_Change: Relative changes over 5, 20, 252 days")
        
        return df
    
    def create_normalization_features(self) -> pd.DataFrame:
        """
        Normalization and scaling transformations.
        Standardize features for algorithms sensitive to scale.
        """
        df = self.df
        
        # Z-score normalization (mean=0, std=1)
        for window in [20, 60]:
            # Rolling z-score (how many std devs from recent mean?)
            rolling_mean = df['Close'].rolling(window).mean()
            rolling_std = df['Close'].rolling(window).std()
            df[f'Close_ZScore_{window}'] = (df['Close'] - rolling_mean) / rolling_std
            
            if 'Daily_Return' in df.columns:
                ret_mean = df['Daily_Return'].rolling(window).mean()
                ret_std = df['Daily_Return'].rolling(window).std()
                df[f'Return_ZScore_{window}'] = (df['Daily_Return'] - ret_mean) / ret_std
        
        # Min-Max scaling (0-1 range)
        for window in [20]:
            rolling_min = df['Close'].rolling(window).min()
            rolling_max = df['Close'].rolling(window).max()
            # A flat window (max == min) would divide by zero; emit NaN instead of inf
            price_span = (rolling_max - rolling_min).replace(0, np.nan)
            df[f'Close_MinMax_{window}'] = (df['Close'] - rolling_min) / price_span
        
        # Robust scaling (using median and IQR, resistant to outliers)
        for window in [20]:
            rolling_median = df['Close'].rolling(window).median()
            rolling_iqr = df['Close'].rolling(window).quantile(0.75) - df['Close'].rolling(window).quantile(0.25)
            # Same zero-denominator guard for windows with no spread
            rolling_iqr = rolling_iqr.replace(0, np.nan)
            df[f'Close_Robust_{window}'] = (df['Close'] - rolling_median) / rolling_iqr
        
        print("\nCreated normalization features:")
        print("  - Close_ZScore: Standard score (rolling)")
        print("  - Close_MinMax: Min-max scaling (0-1)")
        print("  - Close_Robust: Robust scaling (median/IQR)")
        
        return df

# Demonstration
if __name__ == "__main__":
    # Setup
    transform_engineer = NEPSETransformationFeatures(interaction_engineer.df)
    
    transform_engineer.create_log_transforms()
    transform_engineer.create_power_transforms()
    transform_engineer.create_rank_and_quantile_features()
    transform_engineer.create_differencing_features()
    transform_engineer.create_normalization_features()
    
    # Display transformations
    print("\nTransformation Features Sample:")
    display_cols = ['Close', 'Log_Close', 'Sqrt_Volume', 'Price_Rank_20', 
                   'Close_ZScore_20', 'Close_Diff_1', 'Signed_Sqrt_Return']
    print(transform_engineer.df[display_cols].tail(10))
    
    # Show distribution improvement
    print("\nDistribution Comparison (Skewness):")
    print(f"Raw Volume Skew: {transform_engineer.df['Vol'].skew():.2f}")
    print(f"Log Volume Skew: {transform_engineer.df['Log_Volume'].skew():.2f}")
    print(f"Sqrt Volume Skew: {transform_engineer.df['Sqrt_Volume'].skew():.2f}")
```

**Explanation:**

This section implements **transformation features** that apply mathematical functions to improve the statistical properties of raw financial data for machine learning.

**Logarithmic Transformations:**
The `create_log_transforms()` method applies the natural logarithm to compress scale and stabilize variance:

- **Log_Close**: Converts price levels to log-prices. In finance, log-prices have the desirable property that differences equal log-returns: $\log(P_t) - \log(P_{t-1}) = \log(P_t/P_{t-1})$. This turns the exponential growth of stock prices (which compounds multiplicatively) into linear growth (which compounds additively), making trends easier to model.

- **Log_Volume**: Volume data in NEPSE is typically right-skewed (few days with extremely high volume, many with moderate). The log transformation compresses the long right tail, making the distribution more symmetric and Gaussian-like. The `+1` handles potential zero-volume days (though rare in NEPSE index data).
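
The log-difference identity and the additivity of log returns can be verified numerically. The four prices below are invented for illustration.

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 99.0, 105.0])   # hypothetical closes

log_close = np.log(close)
log_ret = log_close.diff()                   # log(P_t) - log(P_{t-1})
direct = np.log(close / close.shift(1))      # log(P_t / P_{t-1})

# Log returns add up across days: their sum equals the log of the total price ratio
total_log_ret = log_ret.sum()                # sum() skips the leading NaN
whole_period = np.log(close.iloc[-1] / close.iloc[0])

print(np.allclose(log_ret.dropna(), direct.dropna()))
print(total_log_ret, whole_period)
```

Arithmetic returns do not share this property: they have to be compounded rather than summed.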

**Power Transformations:**
The `create_power_transforms()` method uses power functions to reduce skewness and kurtosis (fat tails):

- **Signed_Sqrt_Return**: Applies square root to the absolute value of returns while preserving the sign. Because returns are fractions below 1, the square root raises every magnitude, but it raises small returns proportionally more than large ones, so the gap between ordinary and extreme days narrows: a ±1% move becomes ±0.10 and a ±5% move becomes ±0.22, shrinking a 5:1 ratio to about 2.2:1. This is particularly useful for NEPSE during volatile periods when circuit breakers (4% limits) create artificial ceilings but intraday swings can still be extreme.

- **Tanh_Return**: Uses the hyperbolic tangent function to squash returns into the range (-1, 1). With the scaling factor of 10, typical daily returns (±0.02) fall in the near-linear region of tanh (tanh(0.2) ≈ 0.197), while a ±0.10 move maps to about ±0.76 and anything larger saturates toward ±1, so no single observation can dominate the feature's scale.
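
To see what the two compressions do to concrete values, the sketch below applies them to a handful of illustrative returns (the array is made up; the formulas match the ones in `create_power_transforms()`).

```python
import numpy as np

returns = np.array([-0.10, -0.02, 0.0, 0.02, 0.05, 0.10])   # hypothetical daily returns

signed_sqrt = np.sign(returns) * np.sqrt(np.abs(returns))
tanh_scaled = np.tanh(returns * 10)

for r, s, t in zip(returns, signed_sqrt, tanh_scaled):
    print(f"{r:+.2f} -> signed sqrt {s:+.3f}, tanh {t:+.3f}")

# Ratio between a 10% and a 2% move, before and after the signed square root
raw_ratio = returns[-1] / returns[3]
sqrt_ratio = signed_sqrt[-1] / signed_sqrt[3]
print(raw_ratio, sqrt_ratio)
```

Both transforms keep the sign and the ordering of returns; they only change how far apart large and small moves sit.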

**Rank and Quantile Features:**
The `create_rank_and_quantile_features()` method converts absolute values into relative percentiles:

- **Price_Rank_20**: Indicates the current price's percentile position within the last 20 days. With `rank(pct=True)` the value runs from 1/20 = 0.05 (lowest close in the window) to 1.0 (highest). Unlike raw prices, which can trend over time, rank features are bounded and far less sensitive to the price level, making them well suited to models that assume stable input distributions.

- **Volume_Quintile**: Discretizes volume into 5 equal-sized buckets (quintiles). This categorical encoding can improve tree-based models by reducing noise and highlighting regime changes—moving from "Very_Low" to "Very_High" volume is more meaningful than the exact share count difference.
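
One caveat with quantile bins is that `pd.qcut` derives its edges from whatever data it is given, so binning the full history lets future observations influence past labels. A minimal sketch of learning the edges on a training slice and reusing them on later data follows; the synthetic volume series, the `split` index and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
vol = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=500))   # synthetic volume

split = 400                                  # hypothetical train/test boundary
train, test = vol.iloc[:split], vol.iloc[split:]

# Learn quintile edges from the training period only
_, edges = pd.qcut(train, q=5, retbins=True)
edges[0], edges[-1] = -np.inf, np.inf        # catch future values outside the train range

labels = ['Very_Low', 'Low', 'Medium', 'High', 'Very_High']
train_bins = pd.cut(train, bins=edges, labels=labels)
test_bins = pd.cut(test, bins=edges, labels=labels)

print(train_bins.value_counts().sort_index())
print(test_bins.value_counts().sort_index())
```

The training bins are equal-sized by construction, while the test bins need not be, which is itself a useful signal that the volume regime has shifted.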

**Differencing Features:**
The `create_differencing_features()` method implements differencing to achieve stationarity:

- **Close_Diff_1**: First-order difference ($P_t - P_{t-1}$), equivalent to the absolute price change. This removes linear trends from price series, converting a trending series into a stationary series of changes suitable for ARMA models.

- **Close_Diff_252**: Seasonal differencing with a 252-day lag (approximately one trading year under the common 252-day convention; NEPSE's actual trading-day count depends on its holiday calendar). This compares today's price to the price roughly one year earlier, capturing year-over-year growth while removing annual trends and fiscal-year seasonality.
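
In pandas, `diff(n)` takes an n-period lag difference, which is not the same as differencing n times. The sketch below contrasts the two on a made-up quadratic price path, where the second-order difference should be constant.

```python
import pandas as pd

# Quadratic path t^2: 0, 1, 4, 9, 16, 25
close = pd.Series([float(t ** 2) for t in range(6)])

lag2_diff = close.diff(2)            # P_t - P_{t-2}
second_diff = close.diff().diff()    # (P_t - P_{t-1}) - (P_{t-1} - P_{t-2})

print(lag2_diff.tolist())
print(second_diff.tolist())
```

Only `diff().diff()` removes the curvature; the 2-period lag difference still grows with `t`.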

**Normalization Features:**
The `create_normalization_features()` method scales features to standard ranges:

- **Close_ZScore_20**: Rolling z-score calculated over 20 days: $(\text{Close} - \text{MA}_{20}) / \text{Std}_{20}$. This measures how unusual today's price is relative to roughly the past trading month. Z-scores above +2 or below -2 flag unusually large deviations (potential mean-reversion candidates), although prices are not independent normal draws, so the usual 95% interpretation holds only loosely.

- **Close_Robust_20**: Robust scaling using median and interquartile range (IQR) instead of mean and standard deviation. This is resistant to outliers: a single flash-crash day inflates the rolling standard deviation for every window that contains it, whereas the median and IQR barely move.
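
The difference in outlier sensitivity is easy to demonstrate. The sketch below uses ten invented closes and a hypothetical one-day crash, then compares how much the standard deviation and the IQR react.

```python
import pandas as pd

close = pd.Series([100.0, 101.0, 99.0, 100.0, 102.0,
                   98.0, 100.0, 101.0, 99.0, 100.0])   # hypothetical window
shocked = close.copy()
shocked.iloc[4] = 60.0                                  # hypothetical flash-crash day

def scales(s: pd.Series):
    """Return (standard deviation, interquartile range) of a window."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return s.std(), iqr

std_before, iqr_before = scales(close)
std_after, iqr_after = scales(shocked)

print(f"std: {std_before:.2f} -> {std_after:.2f}")
print(f"IQR: {iqr_before:.2f} -> {iqr_after:.2f}")
```

The standard deviation grows by roughly an order of magnitude, while the IQR stays on the order of one price point, so a robust-scaled feature keeps a comparable scale before and after the shock.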

These transformations ensure that the NEPSE prediction model receives inputs with desirable statistical properties: symmetric distributions, stable variance, and comparable scales across different features and time periods.

---

## **11.10 Implementation Patterns**

This section consolidates best practices for implementing the basic feature creation pipeline efficiently and robustly. When working with large NEPSE datasets (potentially millions of rows across hundreds of stocks), implementation details such as vectorization, memory management, and pipeline architecture significantly impact performance and maintainability.

The patterns covered include: **Vectorized Operations** (avoiding loops), **Pipeline Architecture** (scikit-learn compatible), **Memory Efficiency** (categorical dtypes, chunking), **Feature Stores** (saving computed features), and **Validation Frameworks** (ensuring correctness).

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from typing import Dict, List, Optional
import joblib

class NEPSEFeaturePipeline(BaseEstimator, TransformerMixin):
    """
    Production-ready feature engineering pipeline for NEPSE.
    Implements sklearn-compatible transformer for integration with ML workflows.
    """
    
    def __init__(self, 
                 price_lags: List[int] = [1, 5, 20],
                 ma_windows: List[int] = [5, 20],
                 volatility_windows: List[int] = [20],
                 include_time_features: bool = True):
        self.price_lags = price_lags
        self.ma_windows = ma_windows
        self.volatility_windows = volatility_windows
        self.include_time_features = include_time_features
        self.feature_names_ = None
        
    def fit(self, X, y=None):
        """Fit method (required for sklearn compatibility)."""
        return self
    
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Transform raw NEPSE data into engineered features.
        Vectorized implementation for performance.
        """
        df = X.copy()
        
        # 1. Basic Features (Vectorized)
        df['Daily_Return'] = df['Close'].pct_change()
        df['Log_Return'] = np.log(df['Close'] / df['Close'].shift(1))
        df['High_Low_Pct'] = (df['High'] - df['Low']) / df['Close']
        df['Open_Close_Pct'] = (df['Close'] - df['Open']) / df['Open']
        
        # 2. Lag Features (Vectorized using shift)
        for lag in self.price_lags:
            df[f'Close_Lag_{lag}'] = df['Close'].shift(lag)
            df[f'Return_Lag_{lag}'] = df['Daily_Return'].shift(lag)
            df[f'Volume_Lag_{lag}'] = df['Vol'].shift(lag)
        
        # 3. Rolling Features (Vectorized using rolling)
        for window in self.ma_windows:
            df[f'SMA_{window}'] = df['Close'].rolling(window=window).mean()
            df[f'EMA_{window}'] = df['Close'].ewm(span=window).mean()
            df[f'Dist_SMA_{window}'] = (df['Close'] - df[f'SMA_{window}']) / df[f'SMA_{window}']
        
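        for window in self.volatility_windows:
            df[f'Volatility_{window}'] = df['Daily_Return'].rolling(window=window).std()
            # True Range includes gaps from the previous close, not just High - Low
            prev_close = df['Close'].shift(1)
            true_range = pd.concat([
                df['High'] - df['Low'],
                (df['High'] - prev_close).abs(),
                (df['Low'] - prev_close).abs()
            ], axis=1).max(axis=1)
            df[f'ATR_{window}'] = true_range.rolling(window=window).mean()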
        
        # 4. Time Features (if enabled)
        if self.include_time_features and 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])
            df['Day_of_Week'] = df['Date'].dt.dayofweek
            df['Month'] = df['Date'].dt.month
            
            # NEPSE fiscal year
            month_day = df['Date'].dt.month * 100 + df['Date'].dt.day
            df['Fiscal_Year'] = df['Date'].dt.year
            df.loc[month_day < 716, 'Fiscal_Year'] -= 1
        
        # 5. Interaction Features (Vectorized)
        if 'Rel_Volume' not in df.columns:
            df['Rel_Volume'] = df['Vol'] / df['Vol'].rolling(20).mean()
        
        df['Volume_Confirmed_Return'] = df['Daily_Return'] * df['Rel_Volume']
        
        # Store feature names
        self.feature_names_ = [c for c in df.columns if c not in ['Date', 'Symbol', 'S.No']]
        
        return df
    
    def get_feature_names(self):
        """Return list of created feature names."""
        return self.feature_names_

class NEPSEFeatureStore:
    """
    Feature store for saving and retrieving engineered features.
    Prevents recomputation and ensures consistency.
    """
    
    def __init__(self, store_path: str = 'features/'):
        self.store_path = store_path
        
    def save_features(self, df: pd.DataFrame, symbol: str, date: str):
        """Save engineered features to parquet (efficient storage)."""
        filepath = f"{self.store_path}/{symbol}_{date}.parquet"
        df.to_parquet(filepath)
        print(f"Saved features to {filepath}")
        
    def load_features(self, symbol: str, date: str) -> pd.DataFrame:
        """Load pre-computed features."""
        filepath = f"{self.store_path}/{symbol}_{date}.parquet"
        return pd.read_parquet(filepath)

def create_optimization_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Memory-optimized feature creation for large NEPSE datasets.
    """
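    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    
    # 1. Downcast numeric types to reduce memory
    float_cols = df.select_dtypes(include=['float64']).columns
    int_cols = df.select_dtypes(include=['int64']).columns
    
    df = df.astype({col: 'float32' for col in float_cols})
    # int32 tops out near 2.1 billion; only downcast columns whose values fit
    int32_max = np.iinfo(np.int32).max
    for col in int_cols:
        if df[col].abs().max() <= int32_max:
            df[col] = df[col].astype('int32')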
    
    # 2. Use categorical for low-cardinality features
    if 'Symbol' in df.columns:
        df['Symbol'] = df['Symbol'].astype('category')
    
    # 3. Vectorized calculation (no loops)
    # Calculate all lags at once using list comprehension
    lags = [1, 5, 20]
    lag_features = pd.concat(
        [df['Close'].shift(lag).rename(f'Close_Lag_{lag}') for lag in lags],
        axis=1
    )
    df = pd.concat([df, lag_features], axis=1)
    
    # 4. Efficient rolling with min_periods to handle start of series
    df['SMA_20'] = df['Close'].rolling(window=20, min_periods=1).mean()
    
    return df

# Demonstration of complete pipeline
if __name__ == "__main__":
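    # Create sample data on NEPSE's Sunday-Thursday trading week
    np.random.seed(42)
    n = 1000
    dates = pd.bdate_range('2022-01-01', periods=n, freq='C',
                           weekmask='Sun Mon Tue Wed Thu')
    
    # Build internally consistent OHLC bars (Low <= Open, Close <= High)
    close = 1900 * np.exp(np.cumsum(np.random.normal(0, 0.01, n)))
    open_ = close * (1 + np.random.normal(0, 0.005, n))
    data = pd.DataFrame({
        'Date': dates,
        'Symbol': ['NEPSE'] * n,
        'Open': open_,
        'High': np.maximum(open_, close) * (1 + np.random.uniform(0, 0.01, n)),
        'Low': np.minimum(open_, close) * (1 - np.random.uniform(0, 0.01, n)),
        'Close': close,
        'Vol': np.random.randint(1000000, 5000000, n)
    })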
    
    # Method 1: Sklearn Pipeline (Production)
    print("Method 1: Sklearn Pipeline")
    pipeline = Pipeline([
        ('features', NEPSEFeaturePipeline(
            price_lags=[1, 5],
            ma_windows=[5, 20],
            include_time_features=True
        ))
    ])
    
    result = pipeline.fit_transform(data)
    print(f"Features created: {len(result.columns)}")
    print(f"Feature names: {pipeline.named_steps['features'].get_feature_names()[:5]}...")
    
    # Method 2: Optimized for Memory
    print("\nMethod 2: Memory Optimized")
    optimized = create_optimization_features(data)
    print(f"Memory usage: {optimized.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Method 3: Feature Store
    print("\nMethod 3: Feature Store")
    store = NEPSEFeatureStore()
    # store.save_features(result, 'NEPSE', '2023-01-01')
    
    print("\n✓ Implementation patterns demonstration complete")
```

**Explanation:**

This final section provides **production implementation patterns** that ensure the feature engineering pipeline is efficient, scalable, and maintainable for real-world NEPSE prediction systems.

**Sklearn-Compatible Pipeline:**
The `NEPSEFeaturePipeline` class inherits from `BaseEstimator` and `TransformerMixin`, making it compatible with scikit-learn's `Pipeline` and `GridSearchCV`. This integration is crucial for production ML workflows because it allows feature engineering to be treated as a first-class citizen in the modeling process—parameterized, cross-validated, and serialized along with the model.

The `transform()` method implements **vectorized operations** throughout. Its only Python loops iterate over a handful of lag and window parameters; the per-row work is done by pandas' built-in methods (`shift()`, `rolling()`, `ewm()`), which run in compiled code. For a dataset of 1 million NEPSE records, vectorized operations typically complete in seconds while row-by-row loops might take minutes.

**Feature Store Pattern:**
The `NEPSEFeatureStore` class implements a simple feature store using Apache Parquet format. In production NEPSE systems, feature stores prevent redundant computation—once features are engineered for a specific stock and date range, they are saved to disk and can be retrieved instantly for model retraining or backtesting. Parquet is chosen over CSV because it:
- Stores data in columnar format (efficient for feature matrices where we often query specific columns)
- Preserves data types (int32, float32, categorical)
- Supports compression (reduces storage for large NEPSE historical datasets)

**Memory Optimization:**
The `create_optimization_features()` function demonstrates techniques for handling large datasets:
- **Type Downcasting**: Converts `float64` to `float32` and `int64` to `int32`, halving the memory of each converted column. For NEPSE prices (rarely exceeding NPR 10,000), float32's roughly seven significant digits are sufficient. Integer columns need more care: int32 tops out near 2.1 billion, so check the value range of large count or turnover fields before downcasting.
- **Categorical Encoding**: Converts `Symbol` (stock ticker) from string to categorical. With hundreds of NEPSE symbols repeated millions of times, each row stores a small integer code (1 or 2 bytes at this cardinality) instead of a Python string object of roughly 50 bytes.
- **Concatenation Pattern**: Instead of adding lag features one-by-one in a loop (which fragments DataFrame memory), it creates all lag columns in a list and concatenates them once, reducing memory fragmentation.
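
The savings from downcasting and categorical encoding can be measured directly with `memory_usage(deep=True)`. The sketch below uses placeholder tickers (`AAA` to `DDD`) and a synthetic price column; exact byte counts depend on the pandas version and string backend.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)
symbols = rng.choice(['AAA', 'BBB', 'CCC', 'DDD'], size=n)   # hypothetical tickers
df = pd.DataFrame({'Symbol': symbols, 'Close': np.linspace(200.0, 1200.0, n)})

before = df.memory_usage(deep=True)

df['Symbol'] = df['Symbol'].astype('category')   # small integer codes + lookup table
df['Close'] = df['Close'].astype('float32')      # 4 bytes per value instead of 8

after = df.memory_usage(deep=True)

print((before / 1024).round(1))   # KB per column before
print((after / 1024).round(1))    # KB per column after
print(df['Symbol'].cat.codes.dtype)
```

With only four categories, pandas stores the codes as one-byte integers, which is where most of the saving on the `Symbol` column comes from.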

**Vectorization Strategy:**
The code emphasizes **vectorized calculations**—operations performed on entire arrays at once rather than element-by-element. For example:
- **Bad (Slow)**: `for i in range(1, len(df)): df.loc[i, 'Lag_1'] = df.loc[i-1, 'Close']`
- **Good (Fast)**: `df['Lag_1'] = df['Close'].shift(1)`

The vectorized approach delegates the iteration to pandas' compiled internals, which is often 100x or more faster on large NEPSE datasets spanning decades of daily data across hundreds of stocks.
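
A quick way to confirm both the equivalence and the speed gap is to time the two approaches on the same column. This is a rough sketch on synthetic data; the measured times will vary by machine and pandas version, so no specific speedup is asserted.

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'Close': np.random.default_rng(1).uniform(100, 200, 5_000)})

# Row-by-row loop (slow): starts at 1 because row 0 has no previous close
start = time.perf_counter()
loop_lag = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)):
    loop_lag.iloc[i] = df['Close'].iloc[i - 1]
loop_seconds = time.perf_counter() - start

# Vectorized shift (fast): one call over the whole column
start = time.perf_counter()
vec_lag = df['Close'].shift(1)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, shift: {vec_seconds:.6f}s")
print("identical:", np.allclose(loop_lag.values, vec_lag.values, equal_nan=True))
```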

**Pipeline Validation:**
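The `create_optimization_features()` function passes `min_periods=1` to its rolling mean to handle the start of the time series. Without it, the first 19 rows of a 20-day moving average are NaN (as they are in `NEPSEFeaturePipeline.transform()`, which uses the default), and those rows must be dropped or imputed before training. With `min_periods=1`, the SMA uses whatever data is available (even a single day) at the beginning of the series. The output is continuous, but the early values average fewer observations and are noisier than the rest, so treat them with caution.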
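
The effect of `min_periods` on the warm-up rows can be seen on a short synthetic series (25 made-up closes counting up from 1):

```python
import pandas as pd

close = pd.Series(range(1, 26), dtype='float64')   # 25 synthetic closes: 1, 2, ..., 25

strict = close.rolling(window=20).mean()                   # NaN until 20 observations exist
lenient = close.rolling(window=20, min_periods=1).mean()   # averages whatever is available

print("NaN rows (default):      ", strict.isna().sum())
print("NaN rows (min_periods=1):", lenient.isna().sum())
print("First lenient value:     ", lenient.iloc[0])
print("Row 20, both versions:   ", strict.iloc[19], lenient.iloc[19])
```

Once the window is full the two versions agree; they differ only in the warm-up rows, where the lenient version is averaging fewer than 20 points.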

These implementation patterns ensure that the NEPSE feature engineering pipeline can handle production-scale data efficiently while maintaining code clarity and integration with standard ML tooling.

---

## **Chapter Summary**

In this chapter, we implemented the foundational feature creation techniques for the NEPSE stock prediction system, transforming raw OHLCV data into informative predictors through systematic engineering.

### **Key Accomplishments:**

**1. Raw Value Features (11.1)**
We established proper handling of NEPSE CSV data, categorizing features by temporal availability (opening vs. intraday vs. closing) to prevent look-ahead bias. We created basic relationships like `Close_Position` (within daily range) and `VWAP_Distance` (deviation from volume-weighted average) that encode market microstructure.

**2. Difference Features (11.2)**
We engineered absolute spread features including candlestick shadows (`Upper_Shadow`, `Lower_Shadow`), True Range (gap-adjusted volatility), and overnight gap analysis. These capture intraday dynamics and opening sentiment specific to NEPSE's Sunday-Thursday trading cycle.

**3. Percentage Change Features (11.3)**
We implemented arithmetic and logarithmic returns, intraday performance metrics (Open-to-Close), and volume-normalized returns. These transformations enable cross-sectional comparison across NEPSE's diverse universe of bank, hydropower, and insurance stocks with different price levels.

**4. Lag Features (11.4)**
We created autoregressive features using proper `shift()` operations to prevent look-ahead bias, including price lags (1, 3, 5, and 20 days), return lags, and volume lags. We validated temporal integrity by checking that `Close_Lag_1` references the previous trading day's close. These lags form the backbone of time-series prediction.

**5. Rolling Window Features (11.5)**
We computed adaptive statistics including Simple and Exponential Moving Averages, Bollinger Bands, Average True Range (ATR), and rolling skewness/kurtosis. These capture local trends and volatility regimes essential for NEPSE's cyclical market behavior.

**6. Expanding Window Features (11.6)**
We implemented cumulative statistics including all-time highs/lows, drawdown calculations, and cumulative returns since inception. These provide long-term context and historical benchmarks for identifying when NEPSE stocks reach extreme levels.

**7. Time-Based Features (11.7)**
We encoded Nepal's unique calendar structure: the Sunday-Thursday trading week, the fiscal year that runs from Shrawan to Ashad and ends in mid-July, and seasonal effects (monsoon, Dashain festival). Cyclical encodings (sin/cos) preserved the continuity of seasonal patterns.

**8. Interaction Features (11.8)**
We combined variables multiplicatively to capture synergistic effects: Volume-Confirmed Returns (validating price moves with volume), Trend Signal-to-Noise ratios, and momentum alignment between short and long-term trends.

**9. Transformation Features (11.9)**
We applied mathematical transforms to improve statistical properties: log transforms for volume (reducing skew), signed square roots for returns (compressing fat tails), and z-score normalizations to put features on a comparable scale over time.

**10. Implementation Patterns (11.10)**
We established production-ready patterns: sklearn-compatible pipelines for ML integration, vectorized operations for performance (often 100x or more faster than Python loops), memory optimization (float32, categorical dtypes), and feature stores for persistence.

### **Practical Skills Acquired:**

- **Temporal Safety**: Implementing lag features with `shift()` to strictly prevent look-ahead bias in financial data
- **Domain Adaptation**: Engineering features specific to NEPSE's fiscal calendar and Sunday-Thursday trading schedule
- **Statistical Rigor**: Creating stationary features (returns, differences) suitable for time-series models
- **Production Engineering**: Building memory-efficient, vectorized pipelines capable of processing millions of NEPSE records

### **Next Steps:**

In **Chapter 12: Advanced Rolling Window Features**, we will explore sophisticated window selection strategies, multiple window harmonics, adaptive windows that adjust to volatility regimes, and efficient computation techniques for handling large-scale NEPSE datasets with high-frequency features.

---

**End of Chapter 11**