
## **Chapter 2: Understanding Time-Series Data**

---

### **2.1 What Defines Time-Series Data?**

Time-series data is a sequence of data points collected or recorded at successive points in time, typically at uniform intervals. What distinguishes time-series data from other data types is the fundamental relationship between observations: **the order matters**.

#### **Key Characteristics of Time-Series Data**

**1. Temporal Ordering**

The sequence of observations carries critical information. In our NEPSE dataset, the order of trading days tells us about price evolution, momentum, and market dynamics.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load or create NEPSE data
def load_nepse_data(filepath=None):
    """
    Load NEPSE data from CSV or create sample data for demonstration.
    
    Parameters:
    -----------
    filepath : str, optional
        Path to NEPSE CSV file
    
    Returns:
    --------
    pd.DataFrame : NEPSE data
    """
    if filepath:
        return pd.read_csv(filepath)
    else:
        # Create sample data for demonstration
        np.random.seed(42)
        n_days = 250  # About one trading year
        
        dates = pd.date_range(start='2024-01-01', periods=n_days, freq='B')  # Mon-Fri business days (a simplification of the NEPSE calendar)
        
        # Simulate ABL stock prices
        base_price = 500
        returns = np.random.normal(0.0003, 0.015, n_days)  # Slight upward drift
        prices = base_price * np.cumprod(1 + returns)
        opens = prices * (1 + np.random.uniform(-0.005, 0.005, n_days))
        
        # High/Low must bracket both Open and Close to form valid OHLC bars
        bar_top = np.maximum(opens, prices)
        bar_bottom = np.minimum(opens, prices)
        
        data = pd.DataFrame({
            'Date': dates,
            'S.No': range(1, n_days + 1),
            'Symbol': 'ABL',
            'Open': opens,
            'High': bar_top * (1 + np.abs(np.random.normal(0.01, 0.005, n_days))),
            'Low': bar_bottom * (1 - np.abs(np.random.normal(0.01, 0.005, n_days))),
            'Close': prices,
            'Vol': np.random.randint(10000, 100000, n_days),
            'VWAP': prices * (1 + np.random.uniform(-0.003, 0.003, n_days))
        })
        
        return data


def demonstrate_temporal_ordering(data):
    """
    Demonstrate why temporal ordering matters in time-series data.
    
    This function shows that shuffling time-series data destroys
    critical patterns needed for prediction.
    
    Parameters:
    -----------
    data : pd.DataFrame
        Time-series data with chronological order
    
    Returns:
    --------
    dict : Comparison metrics between ordered and shuffled data
    """
    print("=" * 70)
    print("DEMONSTRATION: Why Temporal Order Matters")
    print("=" * 70)
    
    # Extract closing prices
    close_prices = data['Close'].values.copy()
    
    # ========================================
    # Analysis 1: Autocorrelation in Ordered Data
    # ========================================
    print("\n📊 Analysis 1: Autocorrelation (Relationship with Past Values)")
    print("-" * 70)
    
    # Autocorrelation measures how correlated a time series is with
    # a lagged version of itself
    #
    # Formula: ρ(k) = Cov(X_t, X_{t-k}) / Var(X_t)
    # where k is the lag (number of time periods)
    
    def calculate_autocorrelation(series, lag):
        """
        Calculate autocorrelation at a specific lag.
        
        Autocorrelation tells us how much the current value
        depends on the value k periods ago.
        
        Parameters:
        -----------
        series : np.array
            Time series values
        lag : int
            Number of periods to look back
        
        Returns:
        --------
        float : Autocorrelation coefficient (-1 to 1)
        """
        n = len(series)
        if lag < 1 or lag >= n:
            # series[:-0] would be empty, so the lag must be at least 1
            return np.nan
        
        # Get the overlapping portion
        series_t = series[lag:]      # Current values
        series_lag = series[:-lag]   # Lagged values
        
        # Calculate means
        mean_t = np.mean(series_t)
        mean_lag = np.mean(series_lag)
        
        # Calculate covariance and variances
        covariance = np.sum((series_t - mean_t) * (series_lag - mean_lag)) / (n - lag)
        variance_t = np.var(series_t)
        variance_lag = np.var(series_lag)
        
        # Autocorrelation
        if variance_t * variance_lag > 0:
            autocorr = covariance / np.sqrt(variance_t * variance_lag)
        else:
            autocorr = 0
        
        return autocorr
    
    # Calculate autocorrelation for multiple lags
    lags = [1, 2, 3, 5, 10, 20]
    
    print("\nOrdered Data - Autocorrelation at Different Lags:")
    print("(How much does today's price depend on past prices?)")
    print()
    
    for lag in lags:
        autocorr = calculate_autocorrelation(close_prices, lag)
        bar = "█" * int(abs(autocorr) * 50)
        print(f"  Lag {lag:2d} days: {autocorr:+.4f} {bar}")
    
    # ========================================
    # Analysis 2: Shuffled Data Comparison
    # ========================================
    print("\n📊 Analysis 2: Shuffled Data Comparison")
    print("-" * 70)
    
    # Shuffle the prices to destroy temporal structure
    shuffled_prices = close_prices.copy()
    np.random.shuffle(shuffled_prices)
    
    print("\nShuffled Data - Autocorrelation at Different Lags:")
    print("(Temporal structure destroyed - should be near zero)")
    print()
    
    for lag in lags:
        autocorr = calculate_autocorrelation(shuffled_prices, lag)
        bar = "█" * int(abs(autocorr) * 50)
        print(f"  Lag {lag:2d} days: {autocorr:+.4f} {bar}")
    
    # ========================================
    # Analysis 3: Predictability Comparison
    # ========================================
    print("\n📊 Analysis 3: Predictability Analysis")
    print("-" * 70)
    
    # For ordered data: predict using previous value
    ordered_predictions = close_prices[:-1]  # Use yesterday's price
    ordered_actual = close_prices[1:]        # Today's actual price
    ordered_mae = np.mean(np.abs(ordered_actual - ordered_predictions))
    
    # For shuffled data: predict using previous value
    shuffled_predictions = shuffled_prices[:-1]
    shuffled_actual = shuffled_prices[1:]
    shuffled_mae = np.mean(np.abs(shuffled_actual - shuffled_predictions))
    
    print(f"\nNaive Prediction (predict today = yesterday):")
    print(f"  Ordered Data MAE:   {ordered_mae:.4f}")
    print(f"  Shuffled Data MAE:  {shuffled_mae:.4f}")
    print(f"\n  Ratio (Shuffled/Ordered): {shuffled_mae/ordered_mae:.2f}x")
    
    print("\n💡 Key Insight:")
    print("   The shuffled data is MUCH harder to predict because")
    print("   the temporal dependencies have been destroyed.")
    
    return {
        'ordered_mae': ordered_mae,
        'shuffled_mae': shuffled_mae,
        'ratio': shuffled_mae / ordered_mae
    }


# Run the demonstration
nepse_data = load_nepse_data()
result = demonstrate_temporal_ordering(nepse_data)
```

**Explanation of Temporal Ordering**:

The code above demonstrates the fundamental importance of temporal ordering in time-series data. Let me break down the key concepts:

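1. **Autocorrelation**: This measures how strongly the value at time t is correlated with the value at time t-k. In financial data like NEPSE:
   - Autocorrelation at lag 1 is typically very high for price levels (often above 0.9), meaning today's price is close to yesterday's
   - It gradually decreases as we look further back
   - This persistence is why the order of observations carries information. Note that it describes price *levels*: daily *returns* are usually far less autocorrelated, so a high value here does not by itself mean that price changes are easy to forecast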

2. **Shuffling Destroys Information**: When we randomly shuffle the data:
   - Autocorrelation drops to nearly zero
   - Prediction becomes much harder (higher MAE)
   - The patterns that made prediction possible are gone
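
The contrast between ordered prices, shuffled prices, and returns can be checked directly. The sketch below is a minimal illustration, assuming a simulated random walk with the same drift and volatility as the sample data above; the variable names (`prices`, `shuffled`, `returns`) are illustrative, and pandas' built-in `Series.autocorr` is used in place of the hand-written helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated random-walk prices (assumption: mirrors the sample data's parameters)
daily_returns = rng.normal(0.0003, 0.015, 250)
prices = pd.Series(500 * np.cumprod(1 + daily_returns))

# Shuffle a copy of the prices to destroy the temporal order
shuffled = pd.Series(rng.permutation(prices.to_numpy()))

# Daily percentage returns derived from the ordered prices
returns = prices.pct_change().dropna()

acf_levels = prices.autocorr(lag=1)
acf_shuffled = shuffled.autocorr(lag=1)
acf_returns = returns.autocorr(lag=1)

print(f"Price levels (ordered):  {acf_levels:+.3f}")
print(f"Price levels (shuffled): {acf_shuffled:+.3f}")
print(f"Daily returns:           {acf_returns:+.3f}")
```

With data like this, one would typically expect the ordered price levels to show strong lag-1 autocorrelation, while both the shuffled prices and the returns stay close to zero. The ordering matters, but persistence in levels is not the same thing as predictable price changes.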

**2. Time Intervals**

Time-series data is collected at specific intervals. Understanding these intervals is crucial for proper analysis.

```python
def analyze_time_intervals(data):
    """
    Analyze and explain time intervals in time-series data.
    
    Time interval (frequency) determines:
    - How often observations are recorded
    - What patterns can be detected
    - What models are appropriate
    
    Parameters:
    -----------
    data : pd.DataFrame
        Time-series data with Date or S.No column
    """
    print("=" * 70)
    print("ANALYSIS: Time Intervals in Time-Series Data")
    print("=" * 70)
    
    # ========================================
    # Different Time Intervals in Practice
    # ========================================
    print("\n📅 Common Time Intervals in Different Domains:")
    print("-" * 70)
    
    intervals = {
        'Tick Data': {
            'frequency': 'Milliseconds to seconds',
            'example': 'High-frequency trading',
            'patterns': 'Microstructure effects, order flow',
            'nepse_relevance': 'Not commonly available for NEPSE'
        },
        'Intraday': {
            'frequency': 'Minutes to hours',
            'example': 'Day trading analysis',
            'patterns': 'Intraday volatility, session effects',
            'nepse_relevance': 'Available during trading hours'
        },
        'Daily': {
            'frequency': 'One observation per trading day',
            'example': 'NEPSE closing prices',
            'patterns': 'Daily trends, day-of-week effects',
            'nepse_relevance': 'Standard NEPSE data format'
        },
        'Weekly': {
            'frequency': 'Weekly aggregation',
            'example': 'Weekly market reports',
            'patterns': 'Weekly seasonality',
            'nepse_relevance': 'Can aggregate daily data'
        },
        'Monthly': {
            'frequency': 'Monthly observations',
            'example': 'Economic indicators',
            'patterns': 'Monthly cycles, earnings seasons',
            'nepse_relevance': 'Long-term trend analysis'
        }
    }
    
    for interval_type, info in intervals.items():
        print(f"\n{interval_type}:")
        print(f"  Frequency:      {info['frequency']}")
        print(f"  Example:        {info['example']}")
        print(f"  Patterns:       {info['patterns']}")
        print(f"  NEPSE Use:      {info['nepse_relevance']}")
    
    # ========================================
    # NEPSE-Specific Interval Analysis
    # ========================================
    print("\n" + "=" * 70)
    print("NEPSE TRADING SCHEDULE")
    print("=" * 70)
    
    print("""
    Nepal Stock Exchange Trading Hours:
    ┌─────────────────────────────────────────────────────────────┐
    │  Session          │  Time (NPT)    │  Activity              │
    ├─────────────────────────────────────────────────────────────┤
    │  Pre-Open         │  10:30 - 11:00 │  Order collection      │
    │  Opening Auction  │  11:00         │  Opening price set     │
    │  Continuous       │  11:00 - 15:00 │  Regular trading       │
    │  Closing Auction  │  15:00         │  Closing price set     │
    │  Post-Close       │  15:00 - 15:30 │  Settlement            │
    └─────────────────────────────────────────────────────────────┘
    
    Trading Days: Sunday to Friday (Saturday closed)
    Holidays: Public holidays as per Nepal government calendar
    
    Implications for Time-Series Analysis:
    • Data frequency: Daily (one record per stock per day)
    • Weekly patterns: 6 trading days, Saturday gap
    • Holiday effects: Must account for market closures
    """)
    
    # ========================================
    # Creating Time Features
    # ========================================
    print("\n" + "=" * 70)
    print("CREATING TIME-BASED FEATURES")
    print("=" * 70)
    
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()
    
    # Convert to datetime if available
    if 'Date' in data.columns:
        data['Date'] = pd.to_datetime(data['Date'])
        
        # Extract time components
        data['Year'] = data['Date'].dt.year
        data['Month'] = data['Date'].dt.month
        data['Day'] = data['Date'].dt.day
        data['DayOfWeek'] = data['Date'].dt.dayofweek  # 0=Monday, 6=Sunday
        data['WeekOfYear'] = data['Date'].dt.isocalendar().week
        data['Quarter'] = data['Date'].dt.quarter
        
        print("\nTime Features Created:")
        print(data[['Date', 'Year', 'Month', 'DayOfWeek', 'Quarter']].head(10))
        
        # Analyze day-of-week patterns
        print("\n📊 Average Returns by Day of Week:")
        print("-" * 40)
        
        data['Return'] = data['Close'].pct_change()
        day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
        
        # NEPSE trades Sunday-Friday (Saturday closed)
        # dayofweek: 0=Monday, 5=Saturday, 6=Sunday
        # So NEPSE trading days are: 0-4 (Mon-Fri) and 6 (Sunday)
        
        for day_num in range(7):
            day_data = data[data['DayOfWeek'] == day_num]
            if len(day_data) > 0:
                avg_return = day_data['Return'].mean() * 100
                print(f"  {day_names[day_num]:12s}: {avg_return:+.4f}%")
    else:
        # Using S.No as proxy for time
        print("\nNote: No Date column found. Using S.No as time index.")
        print("In real NEPSE data, always ensure you have proper dates!")
    
    return data


# Analyze time intervals
nepse_with_time = analyze_time_intervals(nepse_data)
```
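
The interval table notes that weekly data can be aggregated from daily data. A minimal sketch of that step follows, assuming a hypothetical daily OHLCV frame indexed by date (the values are synthetic and only for illustration). Each column needs its own aggregation rule, which `DataFrame.resample(...).agg(...)` expresses directly.

```python
import numpy as np
import pandas as pd

# Hypothetical daily OHLCV data indexed by date (synthetic values)
dates = pd.date_range("2024-01-01", periods=20, freq="B")
rng = np.random.default_rng(0)
close = 500 + rng.normal(0, 5, len(dates)).cumsum()
daily = pd.DataFrame(
    {
        "Open": close - 1.0,
        "High": close + 2.0,
        "Low": close - 2.0,
        "Close": close,
        "Vol": rng.integers(10_000, 100_000, len(dates)),
    },
    index=dates,
)

# Moving to a coarser interval: first open, highest high, lowest low,
# last close, and total volume within each week
weekly = daily.resample("W").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Vol": "sum"}
)
print(weekly)
```

The `"W"` alias closes each week on Sunday by default. For a market whose trading week starts on Sunday, an anchored alias such as `"W-SAT"` may group sessions more naturally; which anchor fits best is a judgment call for the dataset at hand.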

**3. Irregular vs Regular Time-Series**

```python
def compare_regular_irregular():
    """
    Explain the difference between regular and irregular time-series.
    
    This is important because:
    - Regular time-series have constant intervals (easier to model)
    - Irregular time-series have varying intervals (requires special handling)
    """
    print("=" * 70)
    print("REGULAR VS IRREGULAR TIME-SERIES")
    print("=" * 70)
    
    print("""
    ┌─────────────────────────────────────────────────────────────────────┐
    │                     REGULAR TIME-SERIES                             │
    ├─────────────────────────────────────────────────────────────────────┤
    │ • Constant time intervals between observations                      │
    │ • Example: Daily closing prices (every trading day)                 │
    │ • Example: Hourly temperature readings                               │
    │                                                                     │
    │     Day 1 ── Day 2 ── Day 3 ── Day 4 ── Day 5                       │
    │     (24h)    (24h)    (24h)    (24h)                                │
    │                                                                     │
    │ • Advantages: Simpler to model, standard techniques apply           │
    │ • NEPSE Context: Daily data with weekend/holiday gaps               │
    └─────────────────────────────────────────────────────────────────────┘
    
    ┌─────────────────────────────────────────────────────────────────────┐
    │                    IRREGULAR TIME-SERIES                            │
    ├─────────────────────────────────────────────────────────────────────┤
    │ • Variable time intervals between observations                      │
    │ • Example: Stock trades (random timing)                             │
    │ • Example: Patient hospital visits                                  │
    │                                                                     │
    │     Event 1 ──── Event 2 ─ Event 3 ────── Event 4                   │
    │     (4 days)   (1 day)  (5 days)                                    │
    │                                                                     │
    │ • Challenges: Need to model time between events                     │
    │ • Solutions: Interpolation, time-aware models                       │
    └─────────────────────────────────────────────────────────────────────┘
    """)
    
    # ========================================
    # Handling Weekend Gaps in NEPSE
    # ========================================
    print("\n📊 Handling Non-Trading Days in NEPSE:")
    print("-" * 70)
    
    print("""
    NEPSE has regular daily data BUT with gaps:
    • Saturdays: Market closed
    • Public holidays: Market closed
    
    Approaches to handle these gaps:
    
    1. IGNORE GAPS (Simplest)
       - Treat data as consecutive observations
       - S.No becomes the time index
       - Works well for short-term prediction
       
    2. FILL GAPS
       - Use forward fill or interpolation
       - Preserves calendar time
       - Important for seasonality analysis
    
    3. TIME-AWARE MODELING
       - Include day-of-week features
       - Account for time between observations
       - Most sophisticated approach
    """)
    
    # Demonstration code for handling gaps
    print("\n💻 Code Example: Handling Weekend Gaps")
    print("-" * 70)
    
    code_example = '''
    import pandas as pd
    
    # Option 1: Ignore gaps (use trading days as index)
    data['Trading_Day'] = range(len(data))
    
    # Option 2: Fill gaps (add calendar days) in a separate frame
    full_date_range = pd.date_range(
        start=data['Date'].min(), 
        end=data['Date'].max(), 
        freq='D'  # Daily frequency
    )
    filled = data.set_index('Date').reindex(full_date_range)
    price_cols = ['Open', 'High', 'Low', 'Close']
    filled[price_cols] = filled[price_cols].ffill()  # Carry last traded prices forward
    filled['Vol'] = filled['Vol'].fillna(0)          # No session means zero volume
    
    # Option 3: Time-aware features (computed on the original trading-day data)
    data['Is_Monday'] = (data['Date'].dt.dayofweek == 0).astype(int)
    data['Is_Sunday'] = (data['Date'].dt.dayofweek == 6).astype(int)  # First session after the Saturday closure
    data['Days_Since_Last_Trade'] = data['Date'].diff().dt.days
    '''
    
    print(code_example)


# Run the comparison
compare_regular_irregular()
```
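
A trading calendar with a Saturday closure and occasional holidays can also be generated explicitly instead of being inferred from the data. The sketch below is one possible approach using pandas' `CustomBusinessDay` offset. It assumes the Sunday-to-Friday trading week described above, and the holiday date is hypothetical, chosen only to show the effect on the gap feature.

```python
import pandas as pd

# Assumption: Sunday-to-Friday trading week, as described in the text.
# The holiday below is hypothetical and used only for illustration.
trading_day = pd.offsets.CustomBusinessDay(
    weekmask="Sun Mon Tue Wed Thu Fri",
    holidays=["2024-01-15"],
)

# Twelve consecutive sessions starting on Sunday, 2024-01-07
sessions = pd.date_range(start="2024-01-07", periods=12, freq=trading_day)

calendar = pd.DataFrame({"Date": sessions})
# Calendar days elapsed since the previous session (NaN for the first row)
calendar["Days_Since_Last_Trade"] = calendar["Date"].diff().dt.days
calendar["Day_Name"] = calendar["Date"].dt.day_name()

print(calendar)
```

Rows that follow a Saturday or the holiday show a gap of two days, while all other rows show one. A feature like `Days_Since_Last_Trade` lets a model distinguish these cases without filling in artificial observations.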

---

### **2.2 Components of Time-Series**

Every time-series can be decomposed into fundamental components. Understanding these components is essential for building effective prediction models.
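
As a rough illustration of what "decomposed" means, the sketch below builds a synthetic series under an additive model (series = trend + seasonal + residual) and then recovers the trend with a centered moving average. The 5-observation cycle, the noise level, and all variable names are assumptions made for this example, not properties of NEPSE data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 120
period = 5  # hypothetical cycle length, in observations
t = np.arange(n)

trend = 500 + 0.5 * t                           # steady upward drift
seasonal = 10 * np.sin(2 * np.pi * t / period)  # pattern repeating every `period` steps
residual = rng.normal(0, 2, n)                  # irregular noise

# Additive model: the observed series is the sum of its components
series = pd.Series(trend + seasonal + residual)

# Averaging over exactly one full cycle cancels the seasonal swings,
# leaving an estimate of the trend (NaN at the edges of the window)
trend_est = series.rolling(window=period, center=True).mean()

max_error = (trend_est - trend).abs().max()
print(f"Largest gap between estimated and true trend: {max_error:.2f}")
```

Because the components are known here, the estimate can be compared against the true trend. With real prices the components are never observed directly, which is why the choice of estimation method matters.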

#### **2.2.1 Trend**

**Definition**: The long-term movement or direction in the data, representing the underlying tendency of the series to increase, decrease, or remain stable over time.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

class TrendAnalyzer:
    """
    Comprehensive trend analysis for time-series data.
    
    Trend represents the long-term direction of a time series,
    filtering out short-term fluctuations and seasonal patterns.
    """
    
    def __init__(self, data, price_column='Close'):
        """
        Initialize the trend analyzer.
        
        Parameters:
        -----------
        data : pd.DataFrame
            Time-series data
        price_column : str
            Column containing the price/values to analyze
        """
        self.data = data.copy()
        self.price_column = price_column
        self.trend = None
        self.detrended = None
    
    def identify_trend_methods(self):
        """
        Explain different methods for identifying trends.
        
        Each method has its strengths and is appropriate for
        different types of data and analysis goals.
        """
        print("=" * 70)
        print("TREND IDENTIFICATION METHODS")
        print("=" * 70)
        
        methods = {
            'Moving Average': {
                'description': 'Smooth data by averaging over a window',
                'pros': ['Simple to understand', 'Reduces noise', 'Flexible window size'],
                'cons': ['Lags behind actual trend', 'Edge effects', 'Window size selection'],
                'use_case': 'Identifying general direction in noisy data'
            },
            'Linear Regression': {
                'description': 'Fit a straight line to the data',
                'pros': ['Quantifies trend direction', 'Provides slope', 'Easy to interpret'],
                'cons': ['Assumes linear trend', 'Sensitive to outliers', 'May miss curvature'],
                'use_case': 'Determining if overall trend is up/down'
            },
            'Polynomial Regression': {
                'description': 'Fit a curved line to the data',
                'pros': ['Captures non-linear trends', 'More flexible'],
                'cons': ['Risk of overfitting', 'Choosing degree', 'Less interpretable'],
                'use_case': 'Data with accelerating or decelerating trends'
            },
            'Decomposition': {
                'description': 'Separate trend, seasonal, and residual components',
                'pros': ['Comprehensive view', 'Isolates trend cleanly'],
                'cons': ['Assumes additive/multiplicative model', 'Requires a known seasonal period'],
                'use_case': 'Complex time-series with multiple patterns'
            }
        }
        
        for method, info in methods.items():
            print(f"\n📊 {method}")
            print(f"   Description: {info['description']}")
            print(f"   Pros: {', '.join(info['pros'])}")
            print(f"   Cons: {', '.join(info['cons'])}")
            print(f"   Best for: {info['use_case']}")
    
    def moving_average_trend(self, window=20):
        """
        Calculate trend using moving average.
        
        The moving average smooths out short-term fluctuations
        and highlights the longer-term trend.
        
        Parameters:
        -----------
        window : int
            Number of periods to average over
        
        Returns:
        --------
        np.array : Smoothed trend values
        """
        prices = self.data[self.price_column].values
        
        # Simple Moving Average (SMA)
        # SMA_t = (P_t + P_{t-1} + ... + P_{t-n+1}) / n
        
        self.trend_sma = np.convolve(
            prices, 
            np.ones(window) / window, 
            mode='valid'
        )
        
        # Centered Moving Average
        # This aligns the average with the center of the window
        # Better for trend identification (not for prediction!)
        
        self.trend_centered = pd.Series(prices).rolling(
            window=window, 
            center=True
        ).mean().values
        
        print(f"\n📊 Moving Average Trend (Window = {window} days)")
        print(f"   Original data points: {len(prices)}")
        print(f"   SMA trend points: {len(self.trend_sma)}")
        print(f"   Centered MA points: {np.count_nonzero(~np.isnan(self.trend_centered))}")
        
        return self.trend_sma
    
    def linear_regression_trend(self):
        """
        Calculate trend using linear regression.
        
        Linear regression finds the best-fit straight line through
        the data points. The slope tells us the trend direction
        and magnitude.
        
        Returns:
        --------
        dict : Trend statistics including slope, r_squared, direction
        """
        prices = self.data[self.price_column].values
        x = np.arange(len(prices))  # Time index
        
        # Perform linear regression
        # y = mx + b, where m is slope (trend) and b is intercept
        slope, intercept, r_value, p_value, std_err = stats.linregress(x, prices)
        
        # Calculate trend line values
        self.trend_linear = slope * x + intercept
        
        # Store statistics
        self.linear_trend_stats = {
            'slope': slope,                    # Trend direction and magnitude
            'intercept': intercept,            # Starting point
            'r_squared': r_value ** 2,         # How well line fits data
            'p_value': p_value,                # Statistical significance
            'std_error': std_err,              # Uncertainty in slope
            'direction': 'Upward' if slope > 0 else 'Downward',
            'daily_change': slope              # Average daily price change
        }
        
        print("\n📊 Linear Regression Trend Analysis")
        print("-" * 50)
        print(f"   Slope:          {slope:.4f} NPR/day")
        print(f"   Intercept:      {intercept:.2f} NPR")
        print(f"   R-squared:      {r_value**2:.4f}")
        print(f"   P-value:        {p_value:.6f}")
        print(f"   Trend:          {self.linear_trend_stats['direction']}")
        
        # Interpretation
        if p_value < 0.05:
            print(f"\n   ✓ The trend is statistically significant (p < 0.05)")
        else:
            print(f"\n   ⚠ The trend is NOT statistically significant")
        
        # Annualized trend
        trading_days = 250  # Approximate trading days per year
        annual_trend = slope * trading_days
        annual_pct = (annual_trend / prices[0]) * 100
        
        print(f"\n   Projected annual change: {annual_trend:.2f} NPR ({annual_pct:+.2f}%)")
        
        return self.linear_trend_stats
    
    def polynomial_trend(self, degree=2):
        """
        Calculate trend using polynomial regression.
        
        Polynomial regression can capture non-linear trends
        (accelerating or decelerating patterns).
        
        Parameters:
        -----------
        degree : int
            Degree of polynomial (2=quadratic, 3=cubic, etc.)
        
        Returns:
        --------
        np.array : Polynomial trend values
        """
        prices = self.data[self.price_column].values
        x = np.arange(len(prices))
        
        # Fit polynomial
        # degree=2: y = ax² + bx + c
        # degree=3: y = ax³ + bx² + cx + d
        coefficients = np.polyfit(x, prices, degree)
        self.trend_poly = np.polyval(coefficients, x)
        
        # Store coefficients
        self.poly_coefficients = coefficients
        
        print(f"\n📊 Polynomial Trend (Degree = {degree})")
        print("-" * 50)
        print(f"   Polynomial equation: ", end="")
        
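        # Build signed terms so negative coefficients do not print as "+ -"
        terms = []
        for i, coef in enumerate(coefficients):
            power = degree - i
            if power > 1:
                terms.append(f"{coef:+.4g}x^{power}")
            elif power == 1:
                terms.append(f"{coef:+.4g}x")
            else:
                terms.append(f"{coef:+.4g}")
        print(" ".join(terms))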
        
        return self.trend_poly
    
    def visualize_trends(self, window=20):
        """
        Visualize different trend estimation methods.
        
        This creates a comprehensive visualization comparing
        all trend estimation methods applied to the data.
        """
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        prices = self.data[self.price_column].values
        x = np.arange(len(prices))
        
        # ========================================
        # Plot 1: Original Data with SMA
        # ========================================
        ax = axes[0, 0]
        ax.plot(x, prices, label='Original Prices', alpha=0.7, linewidth=1)
        
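        if hasattr(self, 'trend_sma'):
            # Recover the window actually used, in case it differs from `window`
            sma_window = len(prices) - len(self.trend_sma) + 1
            ax.plot(
                x[sma_window-1:], 
                self.trend_sma, 
                label=f'{sma_window}-day SMA', 
                linewidth=2, 
                color='red'
            )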
        
        ax.set_title('Moving Average Trend', fontsize=12, fontweight='bold')
        ax.set_xlabel('Trading Day')
        ax.set_ylabel('Price (NPR)')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # ========================================
        # Plot 2: Linear Regression Trend
        # ========================================
        ax = axes[0, 1]
        ax.plot(x, prices, label='Original Prices', alpha=0.7, linewidth=1)
        
        if hasattr(self, 'trend_linear'):
            ax.plot(
                x, 
                self.trend_linear, 
                label='Linear Trend', 
                linewidth=2, 
                color='green'
            )
            
            # Add trend annotation
            # (named trend_stats to avoid shadowing the scipy `stats` module)
            trend_stats = self.linear_trend_stats
            annotation = f"Slope: {trend_stats['slope']:.3f} NPR/day\nR²: {trend_stats['r_squared']:.3f}"
            ax.annotate(
                annotation, 
                xy=(0.05, 0.95), 
                xycoords='axes fraction',
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5)
            )
        
        ax.set_title('Linear Regression Trend', fontsize=12, fontweight='bold')
        ax.set_xlabel('Trading Day')
        ax.set_ylabel('Price (NPR)')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # ========================================
        # Plot 3: Polynomial Trend
        # ========================================
        ax = axes[1, 0]
        ax.plot(x, prices, label='Original Prices', alpha=0.7, linewidth=1)
        
        if hasattr(self, 'trend_poly'):
            ax.plot(
                x, 
                self.trend_poly, 
                label=f'Polynomial Trend (deg={len(self.poly_coefficients) - 1})', 
                linewidth=2, 
                color='purple'
            )
        
        ax.set_title('Polynomial Trend', fontsize=12, fontweight='bold')
        ax.set_xlabel('Trading Day')
        ax.set_ylabel('Price (NPR)')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # ========================================
        # Plot 4: Comparison
        # ========================================
        ax = axes[1, 1]
        
        if hasattr(self, 'trend_sma'):
            # Align by length so the plot works for any SMA window
            ax.plot(x[len(x) - len(self.trend_sma):], self.trend_sma, label='SMA', linewidth=2)
        if hasattr(self, 'trend_linear'):
            ax.plot(x, self.trend_linear, label='Linear', linewidth=2)
        if hasattr(self, 'trend_poly'):
            ax.plot(x, self.trend_poly, label='Polynomial', linewidth=2)
        
        ax.set_title('Trend Comparison', fontsize=12, fontweight='bold')
        ax.set_xlabel('Trading Day')
        ax.set_ylabel('Price (NPR)')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Print interpretation guide
        print("\n" + "=" * 70)
        print("TREND ANALYSIS INTERPRETATION GUIDE")
        print("=" * 70)
        print("""
        📈 Understanding Trend Analysis for NEPSE Stocks:
        
        1. MOVING AVERAGE TREND
           - Smooth line following the general price direction
           - Good for identifying current trend direction
           - Use for: Short to medium-term trend identification
        
        2. LINEAR REGRESSION TREND
           - Single straight line best fit through all data
           - Slope shows average daily change
           - Use for: Determining long-term trend direction
        
        3. POLYNOMIAL TREND
           - Curved line that can show acceleration/deceleration
           - More flexible than linear
           - Use for: Stocks with changing growth rates
        
        📊 For Trading Decisions:
           - Positive trend slope: Bullish (consider long positions)
           - Negative trend slope: Bearish (consider short or avoid)
           - Steep slope: Strong trend (may continue or reverse)
           - Flat slope: Sideways market (range trading)
        """)
    
    def detrend_data(self, method='linear'):
        """
        Remove trend from the data.
        
        Detrending is important for:
        - Analyzing cyclical components
        - Studying seasonality
        - Making the series stationary
        
        Parameters:
        -----------
        method : str
            Method to use ('linear', 'polynomial', 'difference')
        
        Returns:
        --------
        np.array : Detrended values
        """
        prices = self.data[self.price_column].values
        
        print(f"\n📊 Detrending using {method} method")
        print("-" * 50)
        
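        if method == 'linear':
            # Subtract linear trend (fit it first if not already available)
            if not hasattr(self, 'trend_linear'):
                self.linear_regression_trend()
            self.detrended = prices - self.trend_linear
            print("   Method: Subtract linear trend line")
            
        elif method == 'polynomial':
            # Subtract polynomial trend (fit it first if not already available)
            if not hasattr(self, 'trend_poly'):
                self.polynomial_trend()
            self.detrended = prices - self.trend_poly
            print("   Method: Subtract polynomial trend line")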
            
        elif method == 'difference':
            # First differencing
            # This is the most common detrending method
            # New series: y_t' = y_t - y_{t-1}
            self.detrended = np.diff(prices)
            print("   Method: First differencing (y_t - y_{t-1})")
            
        else:
            raise ValueError(f"Unknown method: {method}")
        
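        # Verify detrending effect. Residuals from a fitted trend average
        # ~0; first differences average the mean per-step change instead.
        print(f"\n   Original mean: {prices.mean():.2f}")
        if method == 'difference':
            print(f"   Detrended mean: {self.detrended.mean():.4f} (average daily change)")
        else:
            print(f"   Detrended mean: {self.detrended.mean():.4f} (should be ~0)")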
        
        return self.detrended


# ============================================================
# EXAMPLE USAGE WITH NEPSE DATA
# ============================================================

# Create sample NEPSE-like data with trend
def create_trending_nepse_data(n_days=300, trend_type='upward'):
    """
    Create sample NEPSE data with specific trend characteristics.
    
    Parameters:
    -----------
    n_days : int
        Number of trading days
    trend_type : str
        Type of trend ('upward', 'downward', 'sideways', 'nonlinear')
    
    Returns:
    --------
    pd.DataFrame : Sample stock data
    """
    np.random.seed(42)
    
    # Base price
    base = 500
    
    # Create trend component
    if trend_type == 'upward':
        trend = np.linspace(0, 100, n_days)  # Linear upward
    elif trend_type == 'downward':
        trend = np.linspace(0, -80, n_days)  # Linear downward
    elif trend_type == 'sideways':
        trend = np.zeros(n_days)  # No trend
    elif trend_type == 'nonlinear':
        trend = 50 * np.sin(np.linspace(0, 3*np.pi, n_days))  # Cyclical
    else:
        trend = np.zeros(n_days)
    
    # Add noise
    noise = np.random.normal(0, 5, n_days)
    
    # Price series
    close_prices = base + trend + noise
    
    # Create DataFrame
    data = pd.DataFrame({
        'S.No': range(1, n_days + 1),
        'Symbol': 'ABL',
        'Close': close_prices,
        'Open': close_prices * (1 + np.random.uniform(-0.01, 0.01, n_days)),
        'High': close_prices * (1 + np.abs(np.random.normal(0.01, 0.005, n_days))),
        'Low': close_prices * (1 - np.abs(np.random.normal(0.01, 0.005, n_days))),
        'Vol': np.random.randint(10000, 100000, n_days)
    })
    
    return data


# Analyze trends in NEPSE data
print("\n" + "=" * 70)
print("TREND ANALYSIS FOR NEPSE STOCK DATA")
print("=" * 70)

# Create sample data with upward trend
nepse_upward = create_trending_nepse_data(n_days=300, trend_type='upward')

# Initialize analyzer
analyzer = TrendAnalyzer(nepse_upward, price_column='Close')

# Show available methods
analyzer.identify_trend_methods()

# Apply different methods
analyzer.moving_average_trend(window=30)
linear_stats = analyzer.linear_regression_trend()
analyzer.polynomial_trend(degree=2)

# Visualize
analyzer.visualize_trends(window=30)

# Detrend
detrended = analyzer.detrend_data(method='linear')
```

**Detailed Explanation of Trend Analysis**:

The code above implements four trend-analysis tools: a moving average, a linear regression fit, a polynomial fit, and detrending. Each is explained below:

**1. Moving Average Trend**:
- **What it does**: Smooths out short-term fluctuations by averaging prices over a window
- **Formula**: SMA_t = (P_t + P_{t-1} + ... + P_{t-n+1}) / n
- **Why it works**: Averaging lets random up-and-down noise largely cancel out, revealing the underlying trend
- **NEPSE Application**: A 20-day SMA covers roughly one month of trading days, so it shows the monthly trend direction
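
The SMA formula can be checked by hand on a tiny series. The sketch below uses made-up prices (not NEPSE data) and a hypothetical helper, `simple_moving_average`, to compute a 3-day SMA with NumPy:

```python
import numpy as np

def simple_moving_average(prices, n):
    """Return the n-period SMA, one value per full window of n prices."""
    prices = np.asarray(prices, dtype=float)
    # Convolving with n equal weights of 1/n averages each run of n prices
    return np.convolve(prices, np.ones(n) / n, mode="valid")

prices = [10, 11, 12, 13, 14]           # illustrative prices only
sma3 = simple_moving_average(prices, 3)

# The first window is (10 + 11 + 12) / 3, the next (11 + 12 + 13) / 3, ...
print(sma3)
```

Note that the output is shorter than the input by `n - 1` values, because the first full window only exists once `n` prices are available.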

**2. Linear Regression Trend**:
- **What it does**: Fits a straight line through all data points
- **Formula**: y = mx + b, where m is slope
- **Interpretation**:
  - Positive slope = Upward trend (bullish)
  - Negative slope = Downward trend (bearish)
  - Slope magnitude = Rate of change per period (how steep the trend is)
- **R-squared**: The fraction of the variance in price that the linear trend explains (between 0 and 1)
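
As a rough illustration of the slope and R-squared described above, the following sketch fits a line to a small synthetic series with `np.polyfit`. The prices are invented for the example, and the variable names are not part of the `TrendAnalyzer` class:

```python
import numpy as np

# Illustrative prices: a line of slope 2 plus a small alternating wiggle
t = np.arange(10)
prices = 100 + 2 * t + np.array([0.5, -0.5] * 5)

# Fit y = m*t + b by least squares
m, b = np.polyfit(t, prices, deg=1)
fitted = m * t + b

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((prices - fitted) ** 2)
ss_tot = np.sum((prices - prices.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={m:.3f}, intercept={b:.3f}, R^2={r_squared:.4f}")
```

The slope comes out close to 2 (the average change per step), and R-squared is close to 1 because almost all of the variation in this series is the trend itself.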

**3. Polynomial Trend**:
- **What it does**: Fits a curved line to capture non-linear patterns
- **When to use**: When prices are accelerating or decelerating
- **Caution**: Higher degrees can overfit (follow noise instead of trend)
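
The overfitting caution can be seen in a small experiment. The sketch below, on synthetic data with a hypothetical helper `in_sample_rmse`, shows that the in-sample error keeps falling as the degree rises, even when the true pattern is only a gentle curve:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 60)
# Illustrative series: a gentle quadratic curve plus noise (not real prices)
prices = 500 + 40 * t ** 2 + rng.normal(0, 3, t.size)

def in_sample_rmse(degree):
    """Fit a polynomial of the given degree and return its fit error."""
    coeffs = np.polyfit(t, prices, deg=degree)
    fitted = np.polyval(coeffs, t)
    return np.sqrt(np.mean((prices - fitted) ** 2))

for degree in (1, 2, 8):
    print(f"degree {degree}: in-sample RMSE = {in_sample_rmse(degree):.3f}")
```

Because a higher-degree polynomial can always match the training points at least as well as a lower-degree one, a smaller in-sample error at degree 8 says nothing about whether the extra wiggles are trend or noise. Keeping the degree low (2 or 3) is usually the safer choice for trend estimation.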

**4. Detrending**:
- **Purpose**: Remove trend to study other components (seasonality, cycles)
- **Methods**:
  - Subtraction: y_detrended = y - trend
  - Differencing: y_detrended = y_t - y_{t-1}
- **Application**: Required for certain models that assume stationarity
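
The two detrending methods behave differently on the same input. The sketch below applies both to a made-up, perfectly linear series so the results are easy to verify by eye:

```python
import numpy as np

t = np.arange(8)
prices = 100.0 + 3.0 * t            # illustrative: pure linear trend, slope 3

# Method 1: subtract a fitted linear trend (same length as the input)
slope, intercept = np.polyfit(t, prices, deg=1)
detrended_sub = prices - (slope * t + intercept)

# Method 2: first differencing (one element shorter than the input)
detrended_diff = np.diff(prices)

print(detrended_sub.round(6))       # residuals around zero
print(detrended_diff)               # constant step equal to the slope
print(len(prices), len(detrended_sub), len(detrended_diff))
```

Subtraction leaves residuals centered on zero, while differencing leaves the per-step change, so its mean equals the average slope rather than zero.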

#### **2.2.2 Seasonality**

**Definition**: Regular, predictable patterns that repeat over fixed periods (daily, weekly, monthly, quarterly, annually).

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

class SeasonalityAnalyzer:
    """
    Comprehensive seasonality analysis for time-series data.
    
    Seasonality refers to regular, predictable patterns that repeat
    over fixed time periods. Understanding seasonality is crucial
    for accurate forecasting.
    """
    
    def __init__(self, data, price_column='Close'):
        """
        Initialize the seasonality analyzer.
        
        Parameters:
        -----------
        data : pd.DataFrame
            Time-series data
        price_column : str
            Column containing values to analyze
        """
        self.data = data.copy()
        self.price_column = price_column
        self.seasonal_component = None
        self.deseasonalized = None
    
    def explain_seasonality_types(self):
        """
        Explain different types of seasonality patterns.
        """
        print("=" * 70)
        print("TYPES OF SEASONALITY IN TIME-SERIES DATA")
        print("=" * 70)
        
        seasonality_types = {
            'Daily Seasonality': {
                'description': 'Patterns that repeat within a day',
                'examples': ['Hourly web traffic', 'Hourly electricity demand'],
                'nepse_context': 'Intraday trading patterns (not in daily data)',
                'period': '24 hours'
            },
            'Weekly Seasonality': {
                'description': 'Patterns that repeat every week',
                'examples': ['Weekend retail sales', 'Monday blues in stocks'],
                'nepse_context': 'Day-of-week effects in trading',
                'period': '7 days'
            },
            'Monthly Seasonality': {
                'description': 'Patterns within a month',
                'examples': ['Month-end salary spending', 'Bill payment cycles'],
                'nepse_context': 'Monthly investment flows',
                'period': '~30 days'
            },
            'Quarterly Seasonality': {
                'description': 'Patterns every quarter (3 months)',
                'examples': ['Quarterly earnings reports', 'Tax payments'],
                'nepse_context': 'Quarterly results announcement effects',
                'period': '~90 days'
            },
            'Annual Seasonality': {
                'description': 'Patterns that repeat every year',
                'examples': ['Holiday shopping', 'Agricultural cycles'],
                'nepse_context': 'Fiscal year effects, festival trading',
                'period': '365 days'
            }
        }
        
        for name, info in seasonality_types.items():
            print(f"\n📅 {name}")
            print(f"   Description: {info['description']}")
            print(f"   Examples: {', '.join(info['examples'])}")
            print(f"   NEPSE Context: {info['nepse_context']}")
            print(f"   Period: {info['period']}")
    
    def analyze_weekly_seasonality(self):
        """
        Analyze day-of-week patterns in NEPSE data.
        
        This checks if certain days of the week consistently have
        higher or lower returns, which is a form of weekly seasonality.
        
        Returns:
        --------
        pd.DataFrame : Day-of-week statistics
        """
        print("\n" + "=" * 70)
        print("WEEKLY SEASONALITY ANALYSIS (Day-of-Week Effects)")
        print("=" * 70)
        
        # Calculate returns
        self.data['Return'] = self.data[self.price_column].pct_change()
        
        # If Date column exists, use it; otherwise create synthetic dates
        if 'Date' not in self.data.columns:
            # Create synthetic trading dates. NEPSE normally trades
            # Sunday to Thursday, so a custom weekmask is used instead of
            # freq='B' (which would generate Monday to Friday).
            nepse_days = pd.offsets.CustomBusinessDay(weekmask='Sun Mon Tue Wed Thu')
            dates = pd.date_range(
                start='2024-01-01', 
                periods=len(self.data), 
                freq=nepse_days
            )
            self.data['Date'] = dates
        
        self.data['Date'] = pd.to_datetime(self.data['Date'])
        self.data['DayOfWeek'] = self.data['Date'].dt.dayofweek
        self.data['DayName'] = self.data['Date'].dt.day_name()
        
        # Calculate statistics by day
        day_stats = self.data.groupby('DayName').agg({
            'Return': ['mean', 'std', 'count'],
            'Vol': ['mean', 'sum']
        }).round(6)
        
        # Flatten column names
        day_stats.columns = ['Avg_Return', 'Return_Std', 'Count', 'Avg_Volume', 'Total_Volume']
        
        # Calculate return in percentage
        day_stats['Avg_Return_Pct'] = day_stats['Avg_Return'] * 100
        
        # Sort by day order
        day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
        day_stats = day_stats.reindex([d for d in day_order if d in day_stats.index])
        
        print("\n📊 Day-of-Week Statistics:")
        print("-" * 70)
        
        for day in day_stats.index:
            row = day_stats.loc[day]
            print(f"\n   {day}:")
            print(f"      Average Return:    {row['Avg_Return_Pct']:+.4f}%")
            print(f"      Return Std Dev:    {row['Return_Std']*100:.4f}%")
            print(f"      Trading Days:      {int(row['Count'])}")
            print(f"      Average Volume:    {int(row['Avg_Volume']):,}")
        
        # Test for significance
        print("\n📈 Interpretation:")
        
        best_day = day_stats['Avg_Return_Pct'].idxmax()
        worst_day = day_stats['Avg_Return_Pct'].idxmin()
        
        print(f"   Best performing day:  {best_day} ({day_stats.loc[best_day, 'Avg_Return_Pct']:+.4f}%)")
        print(f"   Worst performing day: {worst_day} ({day_stats.loc[worst_day, 'Avg_Return_Pct']:+.4f}%)")
        
        # Day-of-week effect significance
        # Using ANOVA-like comparison
        from scipy import stats
        
        day_returns = {}
        for day in day_stats.index:
            day_returns[day] = self.data[self.data['DayName'] == day]['Return'].dropna().values
        
        # Perform one-way ANOVA
        f_stat, p_value = stats.f_oneway(*[day_returns[d] for d in day_returns.keys()])
        
        print(f"\n   ANOVA F-statistic: {f_stat:.4f}")
        print(f"   ANOVA p-value:     {p_value:.4f}")
        
        if p_value < 0.05:
            print("   ✓ Day-of-week effect is statistically significant")
        else:
            print("   ⚠ Day-of-week effect is NOT statistically significant")
        
        self.weekly_stats = day_stats
        return day_stats
    
    def analyze_monthly_seasonality(self):
        """
        Analyze monthly patterns in the data.
        
        This checks if certain months consistently perform better
        or worse, which could indicate annual seasonality patterns.
        """
        print("\n" + "=" * 70)
        print("MONTHLY SEASONALITY ANALYSIS")
        print("=" * 70)
        
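        if 'Date' not in self.data.columns:
            print("⚠ Date column required for monthly analysis")
            return None
        
        # Make this method usable on its own: ensure datetime dates and
        # compute returns if the weekly analysis has not been run yet
        self.data['Date'] = pd.to_datetime(self.data['Date'])
        if 'Return' not in self.data.columns:
            self.data['Return'] = self.data[self.price_column].pct_change()
        
        self.data['Month'] = self.data['Date'].dt.month
        self.data['MonthName'] = self.data['Date'].dt.month_name()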
        
        # Monthly statistics
        month_stats = self.data.groupby('Month').agg({
            'Return': ['mean', 'std', 'count']
        }).round(6)
        
        month_stats.columns = ['Avg_Return', 'Return_Std', 'Count']
        month_stats['Avg_Return_Pct'] = month_stats['Avg_Return'] * 100
        
        print("\n📊 Monthly Statistics:")
        print("-" * 70)
        
        months = ['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November', 'December']
        
        for i, month in enumerate(months, 1):
            if i in month_stats.index:
                row = month_stats.loc[i]
                bar = "█" * int(row['Avg_Return_Pct'] * 50) if row['Avg_Return_Pct'] > 0 else ""
                bar += "░" * int(abs(row['Avg_Return_Pct']) * 50) if row['Avg_Return_Pct'] < 0 else ""
                print(f"   {month:12s}: {row['Avg_Return_Pct']:+.4f}% {bar}")
        
        # Nepal-specific monthly patterns
        print("\n📅 Nepal-Specific Seasonal Patterns:")
        print("-" * 70)
        print("""
        Key periods affecting NEPSE:
        
        • Mid-July to Mid-August (Shrawan):
          - Many companies have AGMs
          - Dividend announcements
          - Often bullish period
        
        • October-November (Dashain/Tihar):
          - Major festivals
          - Reduced trading activity
          - Often volatile
        
        • April-May (Year-end approaching):
          - Book closing for dividends
          - Rights issue announcements
          - Increased activity
        
        • July (Fiscal Year End):
          - Fiscal year ends mid-July in Nepal
          - Tax-related selling
          - Portfolio rebalancing
        """)
        
        return month_stats
    
    def decompose_time_series(self, period=None, model='additive'):
        """
        Decompose time-series into trend, seasonal, and residual components.
        
        This is a fundamental technique that separates a time series into:
        - Trend: Long-term direction
        - Seasonal: Repeating patterns
        - Residual: Random fluctuations
        
        Parameters:
        -----------
        period : int, optional
            Period of seasonality in observations (defaults to 5,
            one trading week, if None)
        model : str
            'additive' or 'multiplicative'
            - Additive: y = trend + seasonal + residual
            - Multiplicative: y = trend × seasonal × residual
        
        Returns:
        --------
        DecomposeResult : Object containing components
        """
        print("\n" + "=" * 70)
        print(f"TIME-SERIES DECOMPOSITION ({model.upper()} MODEL)")
        print("=" * 70)
        
        # Get price series
        prices = self.data[self.price_column].dropna()
        
        # Fall back to a default period if none is provided
        if period is None:
            # Weekly seasonality (5 trading days); no auto-detection is done
            period = 5
        
        print(f"\n   Decomposition Parameters:")
        print(f"      Model:  {model}")
        print(f"      Period: {period} observations")
        print(f"      Data points: {len(prices)}")
        
        # Check if we have enough data
        if len(prices) < 2 * period:
            print(f"\n   ⚠ Warning: Need at least {2 * period} observations")
            print(f"   Current: {len(prices)}")
            return None
        
        # Perform decomposition
        # Additive: y = T + S + R
        # Multiplicative: y = T × S × R
        
        decomposition = seasonal_decompose(
            prices, 
            model=model, 
            period=period,
            extrapolate_trend='freq'
        )
        
        # Store components
        self.trend_component = decomposition.trend
        self.seasonal_component = decomposition.seasonal
        self.residual_component = decomposition.resid
        
        # Print component statistics
        print("\n   Component Statistics:")
        print(f"      Trend range:     {self.trend_component.min():.2f} to {self.trend_component.max():.2f}")
        print(f"      Seasonal range:  {self.seasonal_component.min():.2f} to {self.seasonal_component.max():.2f}")
        print(f"      Residual range:  {self.residual_component.min():.2f} to {self.residual_component.max():.2f}")
        
        # Variance of each component relative to the original series.
        # The shares need not sum to 100% because the components are
        # not exactly uncorrelated with each other.
        total_var = prices.var()
        trend_var = self.trend_component.dropna().var()
        seasonal_var = self.seasonal_component.var()
        resid_var = self.residual_component.dropna().var()
        
        print("\n   Variance Share (relative to original series):")
        print(f"      Trend:     {(trend_var/total_var)*100:.1f}%")
        print(f"      Seasonal:  {(seasonal_var/total_var)*100:.1f}%")
        print(f"      Residual:  {(resid_var/total_var)*100:.1f}%")
        
        return decomposition
    
    def visualize_decomposition(self, decomposition):
        """
        Visualize the decomposed components.
        
        Parameters:
        -----------
        decomposition : DecomposeResult
            Result from seasonal_decompose
        """
        fig, axes = plt.subplots(4, 1, figsize=(14, 12))
        
        prices = self.data[self.price_column].dropna()
        
        # Original
        axes[0].plot(prices.values, linewidth=1)
        axes[0].set_title('Original Time Series', fontsize=12, fontweight='bold')
        axes[0].set_ylabel('Price (NPR)')
        axes[0].grid(True, alpha=0.3)
        
        # Trend
        axes[1].plot(decomposition.trend, linewidth=2, color='blue')
        axes[1].set_title('Trend Component', fontsize=12, fontweight='bold')
        axes[1].set_ylabel('Price (NPR)')
        axes[1].grid(True, alpha=0.3)
        
        # Seasonal
        axes[2].plot(decomposition.seasonal, linewidth=1, color='green')
        axes[2].set_title('Seasonal Component', fontsize=12, fontweight='bold')
        axes[2].set_ylabel('Effect (NPR)')
        axes[2].axhline(y=0, color='black', linestyle='--', alpha=0.5)
        axes[2].grid(True, alpha=0.3)
        
        # Residual
        axes[3].plot(decomposition.resid, linewidth=1, color='red', alpha=0.7)
        axes[3].set_title('Residual Component (Noise)', fontsize=12, fontweight='bold')
        axes[3].set_ylabel('Residual (NPR)')
        axes[3].set_xlabel('Time Index')
        axes[3].axhline(y=0, color='black', linestyle='--', alpha=0.5)
        axes[3].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Interpretation guide
        print("\n" + "=" * 70)
        print("DECOMPOSITION INTERPRETATION GUIDE")
        print("=" * 70)
        print("""
        📊 Understanding the Components:
        
        1. ORIGINAL SERIES
           - Raw price data as recorded
           - Contains all patterns mixed together
        
        2. TREND COMPONENT
           - Long-term direction of the series
           - Smoothed to remove short-term fluctuations
           - Use for: Identifying overall market direction
        
        3. SEASONAL COMPONENT
           - Repeating patterns at fixed intervals
           - Value shows deviation from trend
           - Positive: Above trend that period
           - Negative: Below trend that period
           - Use for: Timing entries/exits based on patterns
        
        4. RESIDUAL COMPONENT
           - What remains after removing trend and seasonality
           - Should appear random (white noise)
           - Large residuals may indicate anomalies
           - Use for: Anomaly detection, risk assessment
        
        💡 For NEPSE Trading:
           - Strong trend: Follow the trend direction
           - Strong seasonality: Time trades with seasonal patterns
           - High residual variance: Higher uncertainty/risk
        """)
    
    def create_seasonal_features(self):
        """
        Create features based on seasonality for machine learning.
        
        Seasonal features can help models learn cyclical patterns.
        
        Returns:
        --------
        pd.DataFrame : Data with seasonal features added
        """
        print("\n" + "=" * 70)
        print("CREATING SEASONAL FEATURES FOR ML MODELS")
        print("=" * 70)
        
        if 'Date' not in self.data.columns:
            print("⚠ Date column required")
            return None
        
        # ========================================
        # Cyclic Encoding for Periodic Features
        # ========================================
        # Instead of using raw values (1-12 for months), we use
        # sine and cosine to capture the cyclic nature
        
        print("\n📊 Cyclic Encoding Explanation:")
        print("-" * 50)
        print("""
        Problem: Month 12 (December) and month 1 (January) are adjacent,
        but raw values 12 and 1 are far apart numerically.
        
        Solution: Use sine and cosine encoding:
        
        sin(2π × month / 12)  and  cos(2π × month / 12)
        
        This creates a circular representation where December and January
        are close together, as they should be.
        """)
        
        # Derive the calendar fields here so this method works on its
        # own, even if the weekly/monthly analyses were not run first
        self.data['Date'] = pd.to_datetime(self.data['Date'])
        self.data['DayOfWeek'] = self.data['Date'].dt.dayofweek
        self.data['Month'] = self.data['Date'].dt.month
        
        # Day of week encoding
        self.data['DayOfWeek_sin'] = np.sin(2 * np.pi * self.data['DayOfWeek'] / 7)
        self.data['DayOfWeek_cos'] = np.cos(2 * np.pi * self.data['DayOfWeek'] / 7)
        
        # Month encoding
        self.data['Month_sin'] = np.sin(2 * np.pi * self.data['Month'] / 12)
        self.data['Month_cos'] = np.cos(2 * np.pi * self.data['Month'] / 12)
        
        # Quarter encoding
        self.data['Quarter'] = self.data['Date'].dt.quarter
        self.data['Quarter_sin'] = np.sin(2 * np.pi * self.data['Quarter'] / 4)
        self.data['Quarter_cos'] = np.cos(2 * np.pi * self.data['Quarter'] / 4)
        
        # Day of month (for monthly patterns)
        self.data['DayOfMonth'] = self.data['Date'].dt.day
        self.data['DayOfMonth_sin'] = np.sin(2 * np.pi * self.data['DayOfMonth'] / 31)
        self.data['DayOfMonth_cos'] = np.cos(2 * np.pi * self.data['DayOfMonth'] / 31)
        
        # Week of year (ISO weeks run 1-53; cast from pandas' nullable
        # UInt32 to plain int before applying NumPy functions)
        self.data['WeekOfYear'] = self.data['Date'].dt.isocalendar().week.astype(int)
        self.data['Week_sin'] = np.sin(2 * np.pi * self.data['WeekOfYear'] / 52)
        self.data['Week_cos'] = np.cos(2 * np.pi * self.data['WeekOfYear'] / 52)
        
        print("\n✓ Seasonal features created:")
        print("   - DayOfWeek (sin, cos)")
        print("   - Month (sin, cos)")
        print("   - Quarter (sin, cos)")
        print("   - DayOfMonth (sin, cos)")
        print("   - WeekOfYear (sin, cos)")
        
        # Show sample
        print("\n📋 Sample of seasonal features:")
        seasonal_cols = [col for col in self.data.columns if 'sin' in col or 'cos' in col]
        print(self.data[['Date'] + seasonal_cols[:6]].head())
        
        return self.data


# ============================================================
# EXAMPLE USAGE
# ============================================================

# Create sample data with seasonality
def create_seasonal_nepse_data(n_days=500):
    """
    Create NEPSE-like data with realistic seasonal patterns.
    """
    np.random.seed(42)
    
    dates = pd.date_range(start='2023-01-01', periods=n_days, freq='B')
    
    # Base price with trend
    trend = np.linspace(500, 600, n_days)
    
    # Weekly seasonality (e.g., Monday effect)
    # Note: freq='B' yields Monday-Friday dates only, so no weekend
    # branch is needed here
    day_of_week = dates.dayofweek
    weekly_pattern = np.select(
        [day_of_week == 0, day_of_week == 4],  # Monday, Friday
        [-2.0, 1.5],                           # Monday dip, Friday up
        default=0.0                            # Other days
    )
    
    # Monthly seasonality
    month = dates.month
    monthly_pattern = 3 * np.sin(2 * np.pi * month / 12)  # Annual cycle
    
    # Random noise
    noise = np.random.normal(0, 5, n_days)
    
    # Combine
    close_prices = trend + weekly_pattern + monthly_pattern + noise
    
    data = pd.DataFrame({
        'Date': dates,
        'S.No': range(1, n_days + 1),
        'Symbol': 'ABL',
        'Close': close_prices,
        'Open': close_prices * (1 + np.random.uniform(-0.01, 0.01, n_days)),
        'High': close_prices * (1 + np.abs(np.random.normal(0.01, 0.005, n_days))),
        'Low': close_prices * (1 - np.abs(np.random.normal(0.01, 0.005, n_days))),
        'Vol': np.random.randint(10000, 100000, n_days)
    })
    
    return data


# Run seasonality analysis
print("\n" + "=" * 70)
print("SEASONALITY ANALYSIS FOR NEPSE DATA")
print("=" * 70)

seasonal_data = create_seasonal_nepse_data(n_days=500)
seasonal_analyzer = SeasonalityAnalyzer(seasonal_data, price_column='Close')

# Explain seasonality types
seasonal_analyzer.explain_seasonality_types()

# Analyze weekly seasonality
weekly_stats = seasonal_analyzer.analyze_weekly_seasonality()

# Analyze monthly seasonality
monthly_stats = seasonal_analyzer.analyze_monthly_seasonality()

# Decompose time series
decomposition = seasonal_analyzer.decompose_time_series(period=5, model='additive')

# Visualize
if decomposition:
    seasonal_analyzer.visualize_decomposition(decomposition)

# Create seasonal features for ML
seasonal_features = seasonal_analyzer.create_seasonal_features()
```

**Detailed Explanation of Seasonality Analysis**:

The seasonality analyzer above provides comprehensive tools for understanding repeating patterns. Key concepts:

**1. Types of Seasonality**:
- **Weekly**: Day-of-week effects (e.g., "Monday effect" in stocks)
- **Monthly**: Patterns within months (salary cycles, bill payments)
- **Quarterly**: Earnings announcements, fiscal quarters
- **Annual**: Yearly cycles (festivals, fiscal year end)

**2. Decomposition Methods**:
- **Additive**: y = Trend + Seasonal + Residual
  - Use when seasonal variation is constant over time
- **Multiplicative**: y = Trend × Seasonal × Residual
  - Use when seasonal variation grows with trend
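
To make the difference between the two models concrete, the sketch below builds one series of each kind from the same trend and the same 5-day pattern. All numbers are invented for illustration:

```python
import numpy as np

t = np.arange(20)
trend = 100.0 + 5.0 * t
weekly = np.tile([0.98, 1.00, 1.01, 1.02, 0.99], 4)   # illustrative 5-day factors

# Multiplicative series: the seasonal swing scales with the trend level
y_mult = trend * weekly
# Additive series: the seasonal swing stays the same size throughout
y_add = trend + 100.0 * (weekly - 1.0)

swing_mult = np.abs(y_mult - trend)
swing_add = np.abs(y_add - trend)
print("multiplicative swing, first vs last week:",
      swing_mult[:5].max(), swing_mult[-5:].max())
print("additive swing, first vs last week:      ",
      swing_add[:5].max(), swing_add[-5:].max())

# Taking logs turns the multiplicative model into an additive one
assert np.allclose(np.log(y_mult), np.log(trend) + np.log(weekly))
```

The multiplicative swing grows as the trend rises while the additive swing does not. The final check also shows a common workaround: applying a log transform to a multiplicative series lets an additive decomposition be used on it.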

**3. Cyclic Encoding**:
- Raw numeric values (1-12 for months) don't capture that December and January are adjacent
- Sine/cosine encoding preserves cyclic nature
- Formula: sin(2π × value / period) and cos(2π × value / period)
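
A quick numerical check shows why the encoding helps. The sketch below uses a hypothetical helper, `encode_cyclic`, to compare distances between months before and after encoding:

```python
import numpy as np

def encode_cyclic(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jun = (encode_cyclic(m, 12) for m in (12, 1, 6))

# Raw month numbers: December and January look 11 apart
print("raw distance Dec-Jan:", abs(12 - 1))
# Encoded: December sits next to January and opposite June
print("encoded distance Dec-Jan:", np.linalg.norm(dec - jan).round(3))
print("encoded distance Dec-Jun:", np.linalg.norm(dec - jun).round(3))
```

After encoding, December and January are neighbors on the circle, while December and June sit at opposite ends, which matches their actual positions in the calendar year. Both the sine and the cosine are needed: either one alone maps two different months to the same value.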

**NEPSE-Specific Seasonality**:
- **Dashain/Tihar**: Major festivals in Oct-Nov affecting trading
- **Fiscal Year End (mid-July)**: Tax-related adjustments
- **AGM Season**: July-August when many companies hold annual meetings
- **Book Closure**: April-May for dividend declarations

#### **2.2.3 Cyclicality**

**Definition**: Longer-term fluctuations that don't have a fixed period, often tied to economic or business cycles.

```python
class CyclicalityAnalyzer:
    """
    Analyze cyclical patterns in time-series data.
    
    Unlike seasonality (fixed period), cyclicality refers to
    longer-term fluctuations without a fixed frequency.
    """
    
    def __init__(self, data, price_column='Close'):
        self.data = data.copy()
        self.price_column = price_column
    
    def explain_cyclicality_vs_seasonality(self):
        """
        Explain the difference between cyclicality and seasonality.
        """
        print("=" * 70)
        print("CYCLICALITY VS SEASONALITY")
        print("=" * 70)
        
        print("""
        ┌──────────────────────────────────────────────────────────────────┐
        │                        SEASONALITY                               │
        ├──────────────────────────────────────────────────────────────────┤
        │ • Fixed, known period (weekly, monthly, annually)                │
        │ • Predictable timing                                             │
        │ • Example: Holiday shopping spike every December                 │
        │ • Example: Quarterly earnings announcements                      │
        │                                                                  │
        │   Pattern: ▲ ▼ ▲ ▼ ▲ ▼ ▲ ▼ ▲ ▼ (Fixed frequency)                │
        │                                                                  │
        │ NEPSE Example: Dashain festival effects every year               │
        └──────────────────────────────────────────────────────────────────┘
        
        ┌──────────────────────────────────────────────────────────────────┐
        │                        CYCLICALITY                               │
        ├──────────────────────────────────────────────────────────────────┤
        │ • Variable, unknown period                                       │
        │ • Tied to economic/business cycles                               │
        │ • Example: Business cycles (3-7 years)                           │
        │ • Example: Bull/bear market cycles                               │
        │                                                                  │
        │   Pattern: ▲▲▲──▼▼▼──▲▲▲──▼▼ (Variable frequency)               │
        │                                                                  │
        │ NEPSE Example: Market bull/bear cycles lasting months to years   │
        └──────────────────────────────────────────────────────────────────┘
        """)
    
    def detect_cycles_using_spectral_analysis(self):
        """
        Detect cyclical patterns using spectral analysis (FFT).
        
        Fast Fourier Transform identifies dominant frequencies
        in the time series, helping detect cycles.
        
        Returns:
        --------
        dict : Dominant cycle periods and their strengths
        """
        print("\n" + "=" * 70)
        print("CYCLE DETECTION USING SPECTRAL ANALYSIS")
        print("=" * 70)
        
        from scipy.fft import fft, fftfreq
        
        prices = self.data[self.price_column].dropna().values
        
        # Remove trend first (cycles are easier to detect in detrended data)
        # Using differencing for detrending
        detrended = np.diff(prices)
        
        # Apply FFT
        n = len(detrended)
        
        # Compute FFT
        fft_values = fft(detrended)
        
        # Compute frequencies
        frequencies = fftfreq(n)
        
        # Get power spectrum (magnitude squared)
        power = np.abs(fft_values) ** 2
        
        # Only look at positive frequencies
        positive_freq_mask = frequencies > 0
        frequencies = frequencies[positive_freq_mask]
        power = power[positive_freq_mask]
        
        # Convert frequency to period (in days)
        periods = 1 / frequencies
        
        # Drop very short or very long cycles before ranking, so the
        # weekly (about 5-day) cycle is kept and up to 10 rows are shown
        valid = (periods >= 4) & (periods < n / 2)
        periods = periods[valid]
        power = power[valid]
        
        # Find top dominant periods (highest power first)
        top_indices = np.argsort(power)[-10:][::-1]
        
        print("\n📊 Top 10 Dominant Cycles:")
        print("-" * 50)
        print(f"{'Period (Days)':<15} {'Period (Weeks)':<15} {'Power':<15}")
        print("-" * 50)
        
        dominant_cycles = []
        for idx in top_indices:
            period_days = periods[idx]
            period_weeks = period_days / 5  # 5 trading days per week
            cycle_power = power[idx]
            
            print(f"{period_days:<15.1f} {period_weeks:<15.1f} {cycle_power:<15.1f}")
            dominant_cycles.append({
                'period_days': period_days,
                'period_weeks': period_weeks,
                'power': cycle_power
            })
        
        self.dominant_cycles = dominant_cycles
        
        # Interpretation
        print("\n📈 Cycle Interpretation for NEPSE:")
        print("-" * 50)
        
        if dominant_cycles:
            # Check for common cycle patterns
            for cycle in dominant_cycles[:3]:
                period = cycle['period_days']
                
                if 4 <= period <= 6:
                    print(f"   • {period:.0f}-day cycle: Weekly pattern detected")
                elif 20 <= period <= 25:
                    print(f"   • {period:.0f}-day cycle: Monthly pattern detected")
                elif 60 <= period <= 70:
                    print(f"   • {period:.0f}-day cycle: Quarterly pattern detected")
                elif 240 <= period <= 260:
                    print(f"   • {period:.0f}-day cycle: Annual pattern detected")
                else:
                    print(f"   • {period:.0f}-day cycle: Custom cycle detected")
        
        return dominant_cycles
    
    def identify_market_regimes(self, window=50):
        """
        Identify market regimes (bull/bear markets) as cyclical behavior.
        
        Market regimes represent longer-term cyclical patterns
        in stock markets.
        
        Parameters:
        -----------
        window : int
            Window for regime identification
        
        Returns:
        --------
        pd.DataFrame : Data with regime labels
        """
        print("\n" + "=" * 70)
        print("MARKET REGIME IDENTIFICATION")
        print("=" * 70)
        
        prices = self.data[self.price_column].values
        
        # Calculate rolling returns
        returns = pd.Series(prices).pct_change(window)
        
        # Define regimes based on rolling returns
        # Bull market: Strong positive returns
        # Bear market: Strong negative returns
        # Sideways: Near-zero returns
        
        def classify_regime(ret):
            if pd.isna(ret):
                return 'Unknown'
            elif ret > 0.05:  # >5% return over window
                return 'Bull'
            elif ret < -0.05:  # <-5% return over window
                return 'Bear'
            else:
                return 'Sideways'
        
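        # Assign by position: `returns` has a fresh RangeIndex that may not
        # match the index of self.data (e.g., a DatetimeIndex)
        self.data['Regime'] = returns.apply(classify_regime).values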
        
        # Count regimes
        regime_counts = self.data['Regime'].value_counts()
        
        print(f"\n📊 Market Regimes (based on {window}-day returns):")
        print("-" * 50)
        
        for regime, count in regime_counts.items():
            pct = count / len(self.data) * 100
            bar = "█" * int(pct / 2)
            print(f"   {regime:12s}: {count:4d} days ({pct:5.1f}%) {bar}")
        
        # Calculate regime statistics
        print("\n📈 Regime Statistics:")
        print("-" * 50)
        
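        # Compute daily returns on the full series first, so that no return
        # is calculated across a gap between non-adjacent regime days
        daily_returns = self.data[self.price_column].pct_change()
        
        for regime in ['Bull', 'Bear', 'Sideways']:
            regime_returns = daily_returns[self.data['Regime'] == regime]
            if len(regime_returns) > 0:
                avg_return = regime_returns.mean() * 100
                volatility = regime_returns.std() * 100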
                
                print(f"\n   {regime}:")
                print(f"      Average daily return: {avg_return:+.4f}%")
                print(f"      Daily volatility:     {volatility:.4f}%")
        
        return self.data
    
    def visualize_regimes(self):
        """
        Visualize market regimes over time.
        """
        if 'Regime' not in self.data.columns:
            print("⚠ Run identify_market_regimes() first")
            return
        
        fig, ax = plt.subplots(figsize=(14, 6))
        
        prices = self.data[self.price_column].values
        
        # Plot prices
        ax.plot(prices, linewidth=1, color='black', alpha=0.7)
        
        # Color background by regime
        colors = {'Bull': 'lightgreen', 'Bear': 'lightcoral', 'Sideways': 'lightgray', 'Unknown': 'white'}
        
        current_regime = None
        start_idx = 0
        
        for i, regime in enumerate(self.data['Regime']):
            if regime != current_regime:
                if current_regime is not None:
                    # End the span at i so adjacent regime bands meet without a gap
                    ax.axvspan(start_idx, i, alpha=0.3, color=colors.get(current_regime, 'white'))
                current_regime = regime
                start_idx = i
        
        # Last segment
        if current_regime is not None:
            ax.axvspan(start_idx, len(self.data)-1, alpha=0.3, color=colors.get(current_regime, 'white'))
        
        ax.set_title('Market Regimes Over Time', fontsize=12, fontweight='bold')
        ax.set_xlabel('Trading Day')
        ax.set_ylabel('Price (NPR)')
        
        # Legend
        from matplotlib.patches import Patch
        legend_elements = [Patch(facecolor='lightgreen', label='Bull', alpha=0.3),
                          Patch(facecolor='lightcoral', label='Bear', alpha=0.3),
                          Patch(facecolor='lightgray', label='Sideways', alpha=0.3)]
        ax.legend(handles=legend_elements, loc='upper left')
        
        ax.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()


# Run cyclicality analysis
print("\n" + "=" * 70)
print("CYCLICALITY ANALYSIS FOR NEPSE DATA")
print("=" * 70)

cyclicality_analyzer = CyclicalityAnalyzer(seasonal_data, price_column='Close')
cyclicality_analyzer.explain_cyclicality_vs_seasonality()

# Detect cycles
cycles = cyclicality_analyzer.detect_cycles_using_spectral_analysis()

# Identify market regimes
regime_data = cyclicality_analyzer.identify_market_regimes(window=50)

# Visualize regimes
cyclicality_analyzer.visualize_regimes()
```
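
As a sanity check on the frequency-to-period conversion used in the spectral analysis, the following minimal sketch builds a synthetic series with a known 20-day cycle and recovers that period with NumPy's FFT. The series, its noise level, and all variable names are illustrative assumptions, not NEPSE data.

```python
import numpy as np

# Synthetic series (an assumption for illustration): a 20-day cycle plus noise
rng = np.random.default_rng(0)
n = 400
t = np.arange(n)
true_period = 20
signal = 10 * np.sin(2 * np.pi * t / true_period) + rng.normal(0, 1, n)

# Remove the mean so the zero-frequency term does not dominate the spectrum
centered = signal - signal.mean()

# One-sided spectrum; with d=1.0 the frequencies are in cycles per day
frequencies = np.fft.rfftfreq(n, d=1.0)
power = np.abs(np.fft.rfft(centered)) ** 2

# Skip index 0 (frequency 0) to avoid dividing by zero when converting to periods
periods = 1.0 / frequencies[1:]
dominant_period = periods[np.argmax(power[1:])]

print(f"Dominant period: {dominant_period:.1f} days")
```

Because 400 is an exact multiple of 20, the cycle falls on a single FFT bin. With real prices the peak is usually smeared across neighbouring bins, so detected periods should be read as approximate.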

#### **2.2.4 Irregularity (Noise/Residual)**

**Definition**: Random, unpredictable fluctuations that remain after removing trend, seasonality, and cyclical components.

```python
class IrregularityAnalyzer:
    """
    Analyze irregular/random components in time-series data.
    
    Irregularity represents the unpredictable noise that remains
    after removing systematic patterns.
    """
    
    def __init__(self, data, price_column='Close'):
        self.data = data.copy()
        self.price_column = price_column
    
    def explain_noise_types(self):
        """
        Explain different types of noise in time-series.
        """
        print("=" * 70)
        print("TYPES OF IRREGULARITY IN TIME-SERIES DATA")
        print("=" * 70)
        
        print("""
        ┌──────────────────────────────────────────────────────────────────┐
        │                     WHITE NOISE                                  │
        ├──────────────────────────────────────────────────────────────────┤
        │ • Completely random, no pattern                                  │
        │ • Zero autocorrelation at all lags                               │
        │ • Constant mean and variance                                     │
        │ • Desirable residual after modeling                              │
        │                                                                  │
        │   Pattern: │││││││││││││││││ (Random, no structure)            │
        └──────────────────────────────────────────────────────────────────┘
        
        ┌──────────────────────────────────────────────────────────────────┐
        │                     RED NOISE                                    │
        ├──────────────────────────────────────────────────────────────────┤
        │ • Some autocorrelation at short lags                             │
        │ • Common in financial returns                                    │
        │ • Indicates model hasn't captured all patterns                   │
        │                                                                  │
        │   Pattern: ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ (Some persistence)                  │
        └──────────────────────────────────────────────────────────────────┘
        
        ┌──────────────────────────────────────────────────────────────────┐
        │                HETEROSCEDASTIC NOISE                             │
        ├──────────────────────────────────────────────────────────────────┤
        │ • Variance changes over time                                     │
        │ • Common in financial data (volatility clustering)               │
        │ • Requires special handling (GARCH models)                       │
        │                                                                  │
        │   Pattern: │││││░░░░░░░░░░░░░░░░│││││││││ (Variable spread)    │
        └──────────────────────────────────────────────────────────────────┘
        """)
    
    def analyze_residuals(self, residuals):
        """
        Analyze the residual component for randomness.
        
        A good model should have residuals that are random (white noise).
        
        Parameters:
        -----------
        residuals : np.array
            Residual values from decomposition or model
        
        Returns:
        --------
        dict : Analysis results
        """
        print("\n" + "=" * 70)
        print("RESIDUAL ANALYSIS")
        print("=" * 70)
        
        # Accept either a pandas Series or a NumPy array, as the docstring promises
        residuals = pd.Series(residuals).dropna().values
        
        # ========================================
        # 1. Basic Statistics
        # ========================================
        print("\n📊 Basic Statistics:")
        print("-" * 50)
        print(f"   Mean:      {residuals.mean():.6f} (should be ~0)")
        print(f"   Std Dev:   {residuals.std():.4f}")
        print(f"   Min:       {residuals.min():.4f}")
        print(f"   Max:       {residuals.max():.4f}")
        print(f"   Skewness:  {pd.Series(residuals).skew():.4f} (should be ~0)")
        print(f"   Kurtosis:  {pd.Series(residuals).kurtosis():.4f} (should be ~0)")
        
        # ========================================
        # 2. Autocorrelation Test (Ljung-Box)
        # ========================================
        print("\n📊 Autocorrelation Test (Ljung-Box):")
        print("-" * 50)
        
        from statsmodels.stats.diagnostic import acorr_ljungbox
        
        # Test for autocorrelation at multiple lags
        lags = [5, 10, 20]
        lb_results = acorr_ljungbox(residuals, lags=lags)
        
        for lag in lags:
            p_value = lb_results.loc[lag, 'lb_pvalue']
            status = "✓ No autocorrelation" if p_value > 0.05 else "⚠ Autocorrelation detected"
            print(f"   Lag {lag}: p-value = {p_value:.4f} {status}")
        
        # ========================================
        # 3. Normality Test
        # ========================================
        print("\n📊 Normality Test (Jarque-Bera):")
        print("-" * 50)
        
        from scipy.stats import jarque_bera
        
        jb_stat, jb_pvalue = jarque_bera(residuals)
        
        print(f"   Jarque-Bera statistic: {jb_stat:.4f}")
        print(f"   p-value:               {jb_pvalue:.4f}")
        
        if jb_pvalue > 0.05:
            print("   ✓ Residuals appear normally distributed")
        else:
            print("   ⚠ Residuals may not be normally distributed")
        
        # ========================================
        # 4. Heteroscedasticity Test
        # ========================================
        print("\n📊 Heteroscedasticity Test:")
        print("-" * 50)
        
        # Split residuals into two halves and compare variances
        mid = len(residuals) // 2
        first_half_var = residuals[:mid].var()
        second_half_var = residuals[mid:].var()
        
        variance_ratio = second_half_var / first_half_var
        
        print(f"   First half variance:  {first_half_var:.4f}")
        print(f"   Second half variance: {second_half_var:.4f}")
        print(f"   Variance ratio:       {variance_ratio:.4f}")
        
        if 0.5 < variance_ratio < 2.0:
            print("   ✓ Variance appears relatively constant (homoscedastic)")
        else:
            print("   ⚠ Variance appears to change over time (heteroscedastic)")
        
        return {
            'mean': residuals.mean(),
            'std': residuals.std(),
            'is_random': all(lb_results['lb_pvalue'] > 0.05),
            'is_normal': jb_pvalue > 0.05,
            'is_homoscedastic': 0.5 < variance_ratio < 2.0
        }
    
    def check_model_quality(self, residuals):
        """
        Check if residuals indicate a good model fit.
        
        Good residuals should be:
        1. Random (no autocorrelation)
        2. Zero mean
        3. Constant variance
        4. Ideally normally distributed
        """
        print("\n" + "=" * 70)
        print("MODEL QUALITY ASSESSMENT")
        print("=" * 70)
        
        results = self.analyze_residuals(pd.Series(residuals))
        
        print("\n📋 Model Quality Checklist:")
        print("-" * 50)
        
        checks = [
            ("Random residuals (no pattern)", results['is_random']),
            ("Zero mean residuals", abs(results['mean']) < 0.01),
            ("Constant variance", results['is_homoscedastic']),
            ("Normal distribution", results['is_normal'])
        ]
        
        all_passed = True
        for check_name, passed in checks:
            status = "✓ PASS" if passed else "✗ FAIL"
            print(f"   {check_name:<35} {status}")
            all_passed = all_passed and passed
        
        print("\n" + "=" * 50)
        if all_passed:
            print("   🎉 EXCELLENT! Model residuals meet all quality criteria.")
        else:
            print("   ⚠ Model may need improvement based on residual analysis.")
        
        return all_passed


# Analyze irregularity
irregularity_analyzer = IrregularityAnalyzer(seasonal_data, price_column='Close')
irregularity_analyzer.explain_noise_types()

# Use residuals from decomposition
if hasattr(seasonal_analyzer, 'residual_component'):
    residual_results = irregularity_analyzer.analyze_residuals(seasonal_analyzer.residual_component)
    quality = irregularity_analyzer.check_model_quality(seasonal_analyzer.residual_component)
```
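
To make the Ljung-Box check less of a black box, here is a minimal NumPy sketch of its Q statistic, Q = n(n+2) Σ ρ̂ₖ² / (n − k), applied to simulated white noise and to a simulated AR(1) series. The helper name `ljung_box_q`, the simulated data, and the AR coefficient of 0.7 are assumptions for illustration; `acorr_ljungbox` remains the practical choice because it also reports p-values.

```python
import numpy as np

def ljung_box_q(x, max_lag=10):
    """Ljung-Box Q statistic over lags 1..max_lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k
        rho_k = np.sum(centered[k:] * centered[:-k]) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

rng = np.random.default_rng(1)
n = 500
white = rng.normal(0, 1, n)

# AR(1) series: each value keeps 70% of the previous one plus a fresh shock
shocks = rng.normal(0, 1, n)
ar1 = np.zeros(n)
for i in range(1, n):
    ar1[i] = 0.7 * ar1[i - 1] + shocks[i]

# 5% critical value of the chi-square distribution with 10 degrees of freedom
CHI2_CRIT_10 = 18.307

for name, series in [("white noise", white), ("AR(1)", ar1)]:
    q = ljung_box_q(series, max_lag=10)
    verdict = "autocorrelation detected" if q > CHI2_CRIT_10 else "no evidence of autocorrelation"
    print(f"{name:12s} Q = {q:8.2f} -> {verdict}")
```

A Q value above the critical value corresponds to a Ljung-Box p-value below 0.05, which is the "⚠ Autocorrelation detected" branch in `analyze_residuals`.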

---

### **2.3 Time-Series Properties**

#### **2.3.1 Stationarity**

**Definition**: A time series is stationary if its statistical properties (mean, variance, autocorrelation) remain constant over time.

```python
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.stattools import acf, pacf

class StationarityAnalyzer:
    """
    Comprehensive stationarity analysis for time-series data.
    
    Stationarity is a critical property because many time-series
    models assume the data is stationary.
    """
    
    def __init__(self, data, price_column='Close'):
        self.data = data.copy()
        self.price_column = price_column
    
    def explain_stationarity(self):
        """
        Explain stationarity concepts and importance.
        """
        print("=" * 70)
        print("UNDERSTANDING STATIONARITY")
        print("=" * 70)
        
        print("""
        ┌──────────────────────────────────────────────────────────────────┐
        │                   STATIONARY TIME SERIES                         │
        ├──────────────────────────────────────────────────────────────────┤
        │ • Constant mean over time                                        │
        │ • Constant variance over time                                    │
        │ • Constant autocorrelation structure                             │
        │ • No trend or seasonality                                        │
        │                                                                  │
        │   Visual: ─────────────────────────────── (Flat, constant)       │
        │                                                                  │
        │   Example: White noise, differenced random walk                  │
        └──────────────────────────────────────────────────────────────────┘
        
        ┌──────────────────────────────────────────────────────────────────┐
        │                 NON-STATIONARY TIME SERIES                       │
        ├──────────────────────────────────────────────────────────────────┤
        │ • Mean changes over time (trend)                                 │
        │ • Variance changes over time (heteroscedasticity)                │
        │ • Seasonal patterns present                                      │
        │                                                                  │
        │   Visual: ────/────────/────────/────── (Trending)               │
        │            or /\//\//\//\//\//\//\\ (Seasonal)                   │
        │                                                                  │
        │   Example: Stock prices (typically trending), sales data         │
        └──────────────────────────────────────────────────────────────────┘
        
        ⚠ WHY STATIONARITY MATTERS:
        
        Many statistical models assume stationarity:
        • ARMA models require stationary data (ARIMA differences the series first)
        • Linear regression requires stationary residuals
        • Granger causality tests need stationary variables
        
        Non-stationary data can lead to:
        • Spurious regression (false relationships)
        • Unreliable forecasts
        • Invalid statistical inference
        """)
    
    def adf_test(self, series=None, significance=0.05):
        """
        Perform Augmented Dickey-Fuller test for stationarity.
        
        Null Hypothesis (H0): Series has a unit root (non-stationary)
        Alternative Hypothesis (H1): Series is stationary
        
        If p-value < significance, reject H0 (series is stationary)
        
        Parameters:
        -----------
        series : np.array, optional
            Series to test (uses price column if None)
        significance : float
            Significance level (default 0.05)
        
        Returns:
        --------
        dict : Test results
        """
        print("\n" + "=" * 70)
        print("AUGMENTED DICKEY-FULLER (ADF) TEST")
        print("=" * 70)
        
        if series is None:
            series = self.data[self.price_column].dropna().values
        
        print("""
        📊 Test Explanation:
        
        The ADF test checks if a unit root is present in the time series.
        A unit root indicates non-stationarity.
        
        Hypotheses:
        • H0 (Null):     Series has unit root (non-stationary)
        • H1 (Alt):      Series has no unit root (stationary)
        
        Decision Rule:
        • If p-value < 0.05: Reject H0 → Series is stationary
        • If p-value >= 0.05: Cannot reject H0 → Series is non-stationary
        """)
        
        # Perform test
        result = adfuller(series, autolag='AIC')
        
        adf_statistic = result[0]
        p_value = result[1]
        critical_values = result[4]
        
        print("\n📊 Test Results:")
        print("-" * 50)
        print(f"   ADF Statistic:      {adf_statistic:.4f}")
        print(f"   p-value:            {p_value:.6f}")
        print(f"\n   Critical Values:")
        for key, value in critical_values.items():
            print(f"      {key}: {value:.4f}")
        
        # Interpretation
        print("\n📈 Interpretation:")
        print("-" * 50)
        
        if p_value < significance:
            print(f"   ✓ p-value ({p_value:.6f}) < {significance}")
            print("   → Reject null hypothesis")
            print("   → Series is STATIONARY")
        else:
            print(f"   ⚠ p-value ({p_value:.6f}) >= {significance}")
            print("   → Cannot reject null hypothesis")
            print("   → Series is NON-STATIONARY")
        
        # Compare ADF statistic with critical values
        if adf_statistic < critical_values['5%']:
            print(f"\n   ✓ ADF statistic ({adf_statistic:.4f}) < 5% critical value ({critical_values['5%']:.4f})")
            print("   → Strong evidence of stationarity")
        
        return {
            'adf_statistic': adf_statistic,
            'p_value': p_value,
            'critical_values': critical_values,
            'is_stationary': p_value < significance
        }
    
    def kpss_test(self, series=None, significance=0.05):
        """
        Perform KPSS test for stationarity.
        
        Opposite of ADF test:
        Null Hypothesis (H0): Series is stationary
        Alternative Hypothesis (H1): Series is non-stationary
        
        Parameters:
        -----------
        series : np.array, optional
            Series to test
        significance : float
            Significance level
        
        Returns:
        --------
        dict : Test results
        """
        print("\n" + "=" * 70)
        print("KPSS (KWIATKOWSKI-PHILLIPS-SCHMIDT-SHIN) TEST")
        print("=" * 70)
        
        if series is None:
            series = self.data[self.price_column].dropna().values
        
        print("""
        📊 Test Explanation:
        
        The KPSS test has opposite hypotheses from ADF:
        
        Hypotheses:
        • H0 (Null):     Series is stationary
        • H1 (Alt):      Series is non-stationary
        
        Decision Rule:
        • If p-value < 0.05: Reject H0 → Series is non-stationary
        • If p-value >= 0.05: Cannot reject H0 → Series is stationary
        """)
        
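        # Perform test (kpss is already imported at module level).
        # Note: statsmodels interpolates the KPSS p-value from a lookup table,
        # so it is only reported within the range 0.01 to 0.10.
        statistic, p_value, n_lags, critical_values = kpss(series, regression='c')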
        
        print("\n📊 Test Results:")
        print("-" * 50)
        print(f"   KPSS Statistic:     {statistic:.4f}")
        print(f"   p-value:            {p_value:.6f}")
        print(f"   Lags used:          {n_lags}")
        print(f"\n   Critical Values:")
        for key, value in critical_values.items():
            print(f"      {key}: {value:.4f}")
        
        # Interpretation
        print("\n📈 Interpretation:")
        print("-" * 50)
        
        if p_value < significance:
            print(f"   ⚠ p-value ({p_value:.6f}) < {significance}")
            print("   → Reject null hypothesis")
            print("   → Series is NON-STATIONARY")
        else:
            print(f"   ✓ p-value ({p_value:.6f}) >= {significance}")
            print("   → Cannot reject null hypothesis")
            print("   → Series is STATIONARY")
        
        return {
            'kpss_statistic': statistic,
            'p_value': p_value,
            'critical_values': critical_values,
            'is_stationary': p_value >= significance
        }
    
    def make_stationary(self, method='difference', order=1):
        """
        Transform non-stationary series to stationary.
        
        Common methods:
        1. Differencing: y_t' = y_t - y_{t-1}
        2. Log transformation: y_t' = log(y_t)
        3. Log difference: y_t' = log(y_t) - log(y_{t-1})
        4. Detrending: Remove trend component
        
        Parameters:
        -----------
        method : str
            Transformation method
        order : int
            Order of differencing
        
        Returns:
        --------
        np.array : Transformed stationary series
        """
        print("\n" + "=" * 70)
        print(f"MAKING SERIES STATIONARY (Method: {method.upper()})")
        print("=" * 70)
        
        series = self.data[self.price_column].dropna().values.copy()
        original = series.copy()
        
        if method == 'difference':
            # Apply first differencing `order` times
            for _ in range(order):
                series = np.diff(series)
            print(f"\n   Applied differencing of order {order}")
            print(f"   Original length: {len(original)}")
            print(f"   Transformed length: {len(series)}")
            
        elif method == 'log':
            # Log transformation (for multiplicative patterns)
            if (series <= 0).any():
                print("   ⚠ Warning: Series contains non-positive values")
                series = series + abs(series.min()) + 1
            series = np.log(series)
            print("   Applied log transformation")
            
        elif method == 'log_difference':
            # Log returns (common in finance)
            if (series <= 0).any():
                print("   ⚠ Warning: Series contains non-positive values; shifting before taking logs")
                series = series + abs(series.min()) + 1
            series = np.diff(np.log(series))
            print("   Applied log differencing (log returns)")
            
        elif method == 'detrend':
            # Remove linear trend
            x = np.arange(len(series))
            slope, intercept = np.polyfit(x, series, 1)
            trend = slope * x + intercept
            series = series - trend
            print("   Removed linear trend")
            
        elif method == 'pct_change':
            # Percentage change
            series = np.diff(series) / series[:-1]
            print("   Applied percentage change transformation")
        
        # Test stationarity of transformed series
        print("\n📊 Testing Transformed Series for Stationarity:")
        print("-" * 50)
        
        # ADF test
        adf_result = adfuller(series, autolag='AIC')
        print(f"   ADF p-value: {adf_result[1]:.6f}")
        
        if adf_result[1] < 0.05:
            print("   ✓ Series is now STATIONARY")
        else:
            print("   ⚠ Series is still NON-STATIONARY")
            print("   Consider trying a different method or higher order")
        
        # Store transformed series
        self.stationary_series = series
        
        # Compare statistics
        print("\n📊 Comparison of Original vs Transformed:")
        print("-" * 50)
        print(f"   Original Mean:     {original.mean():.4f}")
        print(f"   Original Std:      {original.std():.4f}")
        print(f"   Transformed Mean:  {series.mean():.6f}")
        print(f"   Transformed Std:   {series.std():.4f}")
        
        return series
    
    def visualize_stationarity(self, original, transformed):
        """
        Visualize original vs transformed series.
        """
        fig, axes = plt.subplots(2, 2, figsize=(14, 8))
        
        # Original series
        axes[0, 0].plot(original, linewidth=1)
        axes[0, 0].set_title('Original Series', fontsize=11, fontweight='bold')
        axes[0, 0].set_xlabel('Time')
        axes[0, 0].set_ylabel('Price (NPR)')
        axes[0, 0].axhline(y=original.mean(), color='r', linestyle='--', label=f'Mean: {original.mean():.2f}')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # Transformed series
        axes[0, 1].plot(transformed, linewidth=1, color='green')
        axes[0, 1].set_title('Transformed Series (Differenced)', fontsize=11, fontweight='bold')
        axes[0, 1].set_xlabel('Time')
        axes[0, 1].set_ylabel('Differenced Value')
        axes[0, 1].axhline(y=transformed.mean(), color='r', linestyle='--', label=f'Mean: {transformed.mean():.4f}')
        axes[0, 1].axhline(y=0, color='black', linestyle='-', alpha=0.3)
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
        
        # Original distribution
        axes[1, 0].hist(original, bins=50, edgecolor='black', alpha=0.7)
        axes[1, 0].set_title('Original Distribution', fontsize=11, fontweight='bold')
        axes[1, 0].axvline(x=original.mean(), color='r', linestyle='--', label='Mean')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
        
        # Transformed distribution
        axes[1, 1].hist(transformed, bins=50, edgecolor='black', alpha=0.7, color='green')
        axes[1, 1].set_title('Transformed Distribution', fontsize=11, fontweight='bold')
        axes[1, 1].axvline(x=transformed.mean(), color='r', linestyle='--', label='Mean')
        axes[1, 1].legend()
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        print("""
        📈 Visual Assessment of Stationarity:
        
        Original Series (Non-Stationary):
        • Clear trend visible
        • Mean is not constant
        • Variance may change over time
        
        Transformed Series (Stationary):
        • Fluctuates around zero
        • Mean is approximately constant
        • Variance is more uniform
        """)


# ============================================================
# EXAMPLE USAGE
# ============================================================

# Create trending NEPSE data
nepse_trending = create_trending_nepse_data(n_days=500, trend_type='upward')

# Initialize analyzer
stationarity_analyzer = StationarityAnalyzer(nepse_trending, price_column='Close')

# Explain stationarity
stationarity_analyzer.explain_stationarity()

# Perform ADF test
adf_result = stationarity_analyzer.adf_test()

# Perform KPSS test  
kpss_result = stationarity_analyzer.kpss_test()

# Make stationary
transformed = stationarity_analyzer.make_stationary(method='difference', order=1)

# Visualize
stationarity_analyzer.visualize_stationarity(
    nepse_trending['Close'].values, 
    transformed
)
```

**Understanding the Code Output**:

When you run the above code, here's what you should observe:

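1. **ADF Test on Original Data**:
   - The ADF statistic will typically lie well above the 5% critical value (roughly -2.87); it may even be positive (e.g., +0.5)
   - The p-value will be large (e.g., 0.8)
   - Conclusion: the unit-root null cannot be rejected, so the series is treated as **NON-STATIONARY**

2. **ADF Test on Differenced Data**:
   - The ADF statistic will be strongly negative (e.g., -15)
   - The p-value will be very small (e.g., below 0.0001)
   - Conclusion: the unit-root null is rejected, so the series is treated as **STATIONARY**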
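
Because ADF and KPSS have opposite null hypotheses, their verdicts are often read together. The sketch below encodes a commonly used four-case reading. The function name `combine_stationarity_tests` is hypothetical, and its inputs are assumed to be the `is_stationary` flags returned by `adf_test()` and `kpss_test()`. The two conflicting cases are suggestive rather than conclusive.

```python
def combine_stationarity_tests(adf_is_stationary, kpss_is_stationary):
    """
    Combine ADF and KPSS verdicts into one reading (hypothetical helper).

    Both arguments are booleans: True means that test's outcome is
    consistent with stationarity.
    """
    if adf_is_stationary and kpss_is_stationary:
        return "stationary"
    if not adf_is_stationary and not kpss_is_stationary:
        return "non-stationary"
    if kpss_is_stationary:
        # KPSS does not reject stationarity, ADF does not reject a unit root
        return "trend-stationary (try detrending)"
    # ADF rejects the unit root, KPSS rejects stationarity
    return "difference-stationary (try differencing)"


# Example with literal flags; in the chapter's code these would come from
# adf_result['is_stationary'] and kpss_result['is_stationary']
for adf_flag in (True, False):
    for kpss_flag in (True, False):
        verdict = combine_stationarity_tests(adf_flag, kpss_flag)
        print(f"ADF stationary={adf_flag!s:5} KPSS stationary={kpss_flag!s:5} -> {verdict}")
```

For a trending price series the expected outcome is the "non-stationary" case: ADF fails to reject its unit-root null while KPSS rejects its stationarity null.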

**Why Differencing Works**:
```
Original:    500, 502, 505, 501, 503, ...  (Trending upward)
Differenced:     2,   3,  -4,   2, ...  (Fluctuates around 0)
```
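
The same arithmetic can be reproduced with `np.diff`. The second half of this sketch applies it to a simulated random walk with drift (assumed parameters, not NEPSE data) and compares how far the mean moves between the two halves of the series, before and after differencing.

```python
import numpy as np

# Toy prices from the example above
prices = np.array([500, 502, 505, 501, 503])
print(np.diff(prices))  # [ 2  3 -4  2]

# Simulated random walk with drift (assumed parameters, for illustration only)
rng = np.random.default_rng(7)
steps = rng.normal(0.5, 2.0, 1000)   # average gain of 0.5 per day
walk = 500 + np.cumsum(steps)
diffs = np.diff(walk)

# Compare the mean of the first and second half of each series
half = len(walk) // 2
level_gap = abs(walk[half:].mean() - walk[:half].mean())
diff_gap = abs(diffs[half:].mean() - diffs[:half].mean())

print(f"Gap between half-sample means (levels):      {level_gap:.2f}")
print(f"Gap between half-sample means (differences): {diff_gap:.2f}")
```

The level means drift apart because the trend accumulates over time, while the differenced series keeps roughly the same mean in both halves.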

#### **2.3.2 Autocorrelation**

**Definition**: Autocorrelation measures the correlation between a time series and a lagged version of itself.
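
As a quick illustration of this definition, the following sketch computes the lag-k sample autocorrelation directly with NumPy and contrasts a memoryless series with a random walk. The helper `autocorr` and both simulated series are assumptions for illustration; the `acf` function from statsmodels provides the same kind of estimate for real work.

```python
import numpy as np

def autocorr(x, k):
    """Sample autocorrelation of x at lag k (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return 1.0
    centered = x - x.mean()
    # Covariance between the series and its k-step lag, scaled by the variance
    return np.sum(centered[k:] * centered[:-k]) / np.sum(centered ** 2)

rng = np.random.default_rng(3)
noise = rng.normal(0, 1, 1000)   # independent draws: no memory
walk = np.cumsum(noise)          # random walk: today's value carries yesterday's

for k in (1, 5, 20):
    print(f"lag {k:2d}: noise = {autocorr(noise, k):+.3f}, walk = {autocorr(walk, k):+.3f}")
```

The noise series should show values near zero at every lag, while the random walk stays close to +1 and decays slowly, which is the signature of a non-stationary series.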

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf

class AutocorrelationAnalyzer:
    """
    Analyze autocorrelation patterns in time-series data.
    
    Autocorrelation is fundamental to time-series analysis because:
    1. It reveals how current values relate to past values
    2. It helps identify model orders (AR, MA terms)
    3. It detects seasonality and cyclic patterns
    """
    
    def __init__(self, data, price_column='Close'):
        """
        Initialize the autocorrelation analyzer.
        
        Parameters:
        -----------
        data : pd.DataFrame
            Time-series data
        price_column : str
            Column to analyze
        """
        self.data = data.copy()
        self.price_column = price_column
        self.acf_values = None
        self.pacf_values = None
    
    def explain_autocorrelation(self):
        """
        Explain autocorrelation concepts in detail.
        """
        print("=" * 70)
        print("UNDERSTANDING AUTOCORRELATION")
        print("=" * 70)
        
        print("""
        📊 AUTOCORRELATION FUNCTION (ACF)
        
        Definition:
        ACF measures the correlation between a time series and
        its lagged values.
        
        Formula:
        ρ(k) = Cov(Y_t, Y_{t-k}) / Var(Y_t)
        
        Where:
        • k = lag (number of periods)
        • ρ(k) = autocorrelation at lag k
        • Values range from -1 to +1
        
        Interpretation:
        • ρ(k) = +1: Perfect positive correlation
        • ρ(k) = 0: No correlation
        • ρ(k) = -1: Perfect negative correlation
        
        ┌─────────────────────────────────────────────────────────────────┐
        │                  ACF PATTERNS AND MEANINGS                       │
        ├─────────────────────────────────────────────────────────────────┤
        │                                                                  │
        │  Pattern 1: Slow Decay                                          │
        │  ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                         │
        │  ████░░░░░░░░░░░░░░░░░░░░░░░░░░░                                 │
        │  ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                                 │
        │  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                                 │
        │  Lag: 1  5  10 15 20 25                                         │
        │                                                                  │
        │  → Non-stationary series (has trend)                            │
        │  → Need differencing                                            │
        │                                                                  │
        ├─────────────────────────────────────────────────────────────────┤
        │                                                                  │
        │  Pattern 2: Sharp Cutoff                                        │
        │  ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                         │
        │  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                         │
        │  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                         │
        │  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                         │
        │                                                                  │
        │  → Stationary series, MA process                                │
        │  → Suggests MA(q) model where q = cutoff lag                    │
        │                                                                  │
        ├─────────────────────────────────────────────────────────────────┤
        │                                                                  │
        │  Pattern 3: Seasonal Spikes                                     │
        │  ██████░░░░░░░░░░░░░░░░██████░░░░░░░░░░░░░░░░██████             │
        │  ████░░░░░░░░░░░░░░░░░░████░░░░░░░░░░░░░░░░░░████░░             │
        │  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             │
        │  Lag: 1    7    14   21                                         │
        │                                                                  │
        │  → Seasonal pattern with period = 7                             │
        │  → Need seasonal differencing or seasonal terms                 │
        │                                                                  │
        └─────────────────────────────────────────────────────────────────┘
        
        📊 PARTIAL AUTOCORRELATION FUNCTION (PACF)
        
        Definition:
        PACF measures the correlation between Y_t and Y_{t-k}
        AFTER removing the effect of intermediate lags.
        
        Why PACF?
        • ACF at lag k mixes the direct effect of lag k with indirect effects passed through lags 1 to k-1
        • PACF isolates the direct effect of lag k
        
        Interpretation for Model Selection:
        
        ┌─────────────────────────────────────────────────────────────────┐
        │  AR(p) Process:                                                 │
        │  • ACF: Decays exponentially or as damped sine wave             │
        │  • PACF: Sharp cutoff after lag p                               │
        │  → Look at PACF cutoff to determine p                           │
        │                                                                  │
        │  MA(q) Process:                                                 │
        │  • ACF: Sharp cutoff after lag q                                │
        │  • PACF: Decays exponentially or as damped sine wave            │
        │  → Look at ACF cutoff to determine q                            │
        │                                                                  │
        │  ARMA(p,q) Process:                                             │
        │  • Both ACF and PACF decay gradually                            │
        │  → Need more sophisticated model selection                      │
        └─────────────────────────────────────────────────────────────────┘
        """)
    
    def calculate_acf(self, nlags=40):
        """
        Calculate autocorrelation function values.
        
        Parameters:
        -----------
        nlags : int
            Maximum number of lags to calculate
        
        Returns:
        --------
        np.array : ACF values
        """
        series = self.data[self.price_column].dropna().values
        
        # Calculate ACF
        # acf function returns values for lags 0 to nlags
        self.acf_values = acf(series, nlags=nlags, fft=True)
        
        print(f"\n📊 Autocorrelation Values (showing lags 0-{min(14, nlags)} of {nlags} computed):")
        print("-" * 50)
        
        # Display with visual bars
        for i in range(min(15, nlags + 1)):
            val = self.acf_values[i]
            # Create visual bar
            if val >= 0:
                bar = "█" * int(val * 40)
            else:
                bar = "░" * int(abs(val) * 40)
            
            print(f"   Lag {i:2d}: {val:+.4f}  {bar}")
        
        # Confidence interval
        # For 95% confidence: ±1.96 / sqrt(n)
        n = len(series)
        conf_interval = 1.96 / np.sqrt(n)
        
        print(f"\n   95% Confidence Interval: ±{conf_interval:.4f}")
        print("   (Values outside this range are statistically significant)")
        
        return self.acf_values
    
    def calculate_pacf(self, nlags=40):
        """
        Calculate partial autocorrelation function values.
        
        Parameters:
        -----------
        nlags : int
            Maximum number of lags to calculate
        
        Returns:
        --------
        np.array : PACF values
        """
        series = self.data[self.price_column].dropna().values
        
        # Calculate PACF
        self.pacf_values = pacf(series, nlags=nlags, method='yw')
        
        print(f"\n📊 Partial Autocorrelation Values (showing lags 0-{min(14, nlags)} of {nlags} computed):")
        print("-" * 50)
        
        # Display with visual bars
        for i in range(min(15, nlags + 1)):
            val = self.pacf_values[i]
            if val >= 0:
                bar = "█" * int(val * 40)
            else:
                bar = "░" * int(abs(val) * 40)
            
            print(f"   Lag {i:2d}: {val:+.4f}  {bar}")
        
        n = len(series)
        conf_interval = 1.96 / np.sqrt(n)
        
        print(f"\n   95% Confidence Interval: ±{conf_interval:.4f}")
        
        return self.pacf_values
    
    def plot_acf_pacf(self, nlags=40):
        """
        Create ACF and PACF plots.
        
        These plots are essential tools for:
        1. Identifying stationarity
        2. Determining model orders
        3. Detecting seasonality
        """
        series = self.data[self.price_column].dropna()
        
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
        # ACF Plot
        plot_acf(series, lags=nlags, ax=axes[0], alpha=0.05)
        axes[0].set_title('Autocorrelation Function (ACF)', fontsize=12, fontweight='bold')
        axes[0].set_xlabel('Lag')
        axes[0].set_ylabel('Correlation')
        
        # PACF Plot
        plot_pacf(series, lags=nlags, ax=axes[1], alpha=0.05, method='yw')
        axes[1].set_title('Partial Autocorrelation Function (PACF)', fontsize=12, fontweight='bold')
        axes[1].set_xlabel('Lag')
        axes[1].set_ylabel('Correlation')
        
        plt.tight_layout()
        plt.show()
        
        # Interpretation guide
        print("\n" + "=" * 70)
        print("HOW TO INTERPRET ACF/PACF PLOTS")
        print("=" * 70)
        print("""
        📈 Reading the Plots:
        
        1. BLUE SHADED AREA = 95% Confidence Interval
           • Spikes outside this area are statistically significant
           • Spikes inside may just be random noise
        
        2. LAG 0 ALWAYS = 1.0
           • A series is perfectly correlated with itself
           • This is not informative, ignore it
        
        3. PATTERNS TO LOOK FOR:
        
           Slow decay in ACF → Non-stationary (trending)
           Quick cutoff in ACF → Suggests MA process
           Quick cutoff in PACF → Suggests AR process
           Spikes at regular intervals → Seasonality
        
        📊 Example Interpretations for NEPSE:
        
        Case 1: ACF decays slowly, PACF has spike at lag 1
        → Stock prices are non-stationary
        → Need differencing before modeling
        
        Case 2: After differencing, ACF cuts off at lag 2
        → Consider MA(2) model
        
        Case 3: After differencing, PACF cuts off at lag 1
        → Consider AR(1) model
        
        Case 4: Weekly pattern visible (spikes at lag 5)
        → Include weekly seasonality in model
        """)
    
    def lagrange_multiplier_test(self, maxlag=10):
        """
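        Perform the Ljung-Box test for autocorrelation.
        
        Despite the method name, this is the Ljung-Box portmanteau test,
        not a Lagrange multiplier test. It checks whether the
        autocorrelations up to each lag are jointly different from zero.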
        
        Parameters:
        -----------
        maxlag : int
            Maximum lag to test
        """
        from statsmodels.stats.diagnostic import acorr_ljungbox
        
        print("\n" + "=" * 70)
        print("LJUNG-BOX TEST FOR AUTOCORRELATION")
        print("=" * 70)
        
        series = self.data[self.price_column].dropna()
        
        print("""
        📊 Test Explanation:
        
        Null Hypothesis (H0): No autocorrelation up to the tested lag
        Alternative Hypothesis (H1): Autocorrelation exists at one or more of those lags
        
        Decision Rule:
        • If p-value < 0.05: Reject H0 → Autocorrelation present
        • If p-value >= 0.05: Cannot reject H0 → No evidence of autocorrelation
        """)
        
        # Perform test
        result = acorr_ljungbox(series, lags=range(1, maxlag + 1), return_df=True)
        
        print("\n📊 Test Results:")
        print("-" * 50)
        print(f"{'Lag':<8} {'Test Stat':<15} {'p-value':<15} {'Result':<20}")
        print("-" * 50)
        
        for lag in range(1, maxlag + 1):
            stat = result.loc[lag, 'lb_stat']
            pval = result.loc[lag, 'lb_pvalue']
            
            if pval < 0.05:
                result_text = "⚠ Autocorrelation"
            else:
                result_text = "✓ No autocorrelation"
            
            print(f"{lag:<8} {stat:<15.4f} {pval:<15.6f} {result_text}")
        
        # Interpretation
        print("\n📈 Implications for NEPSE Analysis:")
        print("-" * 50)
        
        # Check if autocorrelation is present
        has_autocorr = any(result['lb_pvalue'] < 0.05)
        
        if has_autocorr:
            print("""
        ✓ Autocorrelation is present in the data
        
        This means:
        • Past values carry information about future values
        • Time-series models are worth trying
        • On raw prices most of this comes from the trend, so
          re-run the test on returns before treating the
          pattern as exploitable
        
        For modeling:
        • Consider AR/MA components
        • Include lagged features in ML models
        • Use appropriate time-series models
            """)
        else:
            print("""
        The data appears to be random (white noise)
        
        This means:
        • Past values don't help predict future values
        • Simple models may be as good as complex ones
        • Focus on other factors (external variables)
            """)
        
        return result
    
    def suggest_model_order(self):
        """
        Suggest ARIMA model orders based on ACF/PACF patterns.
        """
        print("\n" + "=" * 70)
        print("MODEL ORDER SUGGESTIONS")
        print("=" * 70)
        
        if self.acf_values is None:
            self.calculate_acf()
        if self.pacf_values is None:
            self.calculate_pacf()
        
        series = self.data[self.price_column].dropna().values
        n = len(series)
        conf_interval = 1.96 / np.sqrt(n)
        
        # Find significant lags in ACF
        sig_acf_lags = []
        for i in range(1, len(self.acf_values)):
            if abs(self.acf_values[i]) > conf_interval:
                sig_acf_lags.append(i)
        
        # Find significant lags in PACF
        sig_pacf_lags = []
        for i in range(1, len(self.pacf_values)):
            if abs(self.pacf_values[i]) > conf_interval:
                sig_pacf_lags.append(i)
        
        print(f"\n📊 Significant Lags:")
        print(f"   ACF:  {sig_acf_lags[:10]}{'...' if len(sig_acf_lags) > 10 else ''}")
        print(f"   PACF: {sig_pacf_lags[:10]}{'...' if len(sig_pacf_lags) > 10 else ''}")
        
        # Suggest models based on patterns
        print("\n📈 Model Suggestions:")
        print("-" * 50)
        
        # Check for slow decay (non-stationarity)
        if len(sig_acf_lags) > 10:
            print("""
        ⚠ Pattern detected: Slow ACF decay
        
        Suggestion:
        • Data is likely non-stationary
        • Apply differencing first
        • Then re-examine ACF/PACF
        
        Recommended: ARIMA(p, 1, q) where:
        • p = last significant PACF lag after differencing
        • q = last significant ACF lag after differencing
            """)
        else:
            # Check for AR pattern (PACF cutoff)
            if len(sig_pacf_lags) <= 3 and len(sig_pacf_lags) > 0:
                p_suggestion = max(sig_pacf_lags)
                print(f"\n   ✓ AR pattern detected")
                print(f"     Suggested AR order (p): {p_suggestion}")
            
            # Check for MA pattern (ACF cutoff)
            if len(sig_acf_lags) <= 3 and len(sig_acf_lags) > 0:
                q_suggestion = max(sig_acf_lags)
                print(f"\n   ✓ MA pattern detected")
                print(f"     Suggested MA order (q): {q_suggestion}")
        
        print("\n💡 Note: These are starting suggestions.")
        print("   Always validate models with proper testing.")


# ============================================================
# EXAMPLE USAGE
# ============================================================

# Use the trending NEPSE data
print("\n" + "=" * 70)
print("AUTOCORRELATION ANALYSIS FOR NEPSE DATA")
print("=" * 70)

# Create sample data
nepse_sample = create_trending_nepse_data(n_days=300, trend_type='upward')

# Initialize analyzer
autocorr_analyzer = AutocorrelationAnalyzer(nepse_sample, price_column='Close')

# Explain concepts
autocorr_analyzer.explain_autocorrelation()

# Calculate and display ACF
acf_values = autocorr_analyzer.calculate_acf(nlags=20)

# Calculate and display PACF
pacf_values = autocorr_analyzer.calculate_pacf(nlags=20)

# Plot ACF and PACF
autocorr_analyzer.plot_acf_pacf(nlags=30)

# Perform Ljung-Box test
lb_result = autocorr_analyzer.lagrange_multiplier_test(maxlag=10)

# Suggest model order
autocorr_analyzer.suggest_model_order()
```

**Detailed Explanation of Autocorrelation Analysis**:

The code above provides a comprehensive toolkit for understanding autocorrelation:

**1. ACF (Autocorrelation Function)**:
- Measures correlation at each lag
- **Slow decay** indicates non-stationarity
- **Sharp cutoff** suggests MA process
- **Seasonal spikes** indicate seasonality
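
As a rough illustration of the formula behind these points, the sketch below computes the sample ACF directly from its definition using NumPy only. The helper name `sample_acf` and both simulated series (an AR(1) process with coefficient 0.7 and a random walk) are assumptions made for this example; they are not part of the NEPSE dataset or of the analyzer class above.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags, straight from the definition."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)  # proportional to Var(Y_t)
    # Numerator at lag k: sum of products of values that are k steps apart
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

rng = np.random.default_rng(0)
n, phi = 2000, 0.7

# Hypothetical stationary series: AR(1), whose theoretical ACF is phi**k
noise = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + noise[t]

# Hypothetical non-stationary series: a random walk, similar in spirit to a price level
walk = np.cumsum(rng.normal(size=n))

acf_ar1 = sample_acf(ar1, 10)
acf_walk = sample_acf(walk, 10)

print("AR(1) ACF, lags 0-3:      ", np.round(acf_ar1[:4], 3))
print("Random-walk ACF, lags 0-3:", np.round(acf_walk[:4], 3))
```

In a run of this sketch the AR(1) values should fall off roughly like 0.7^k, while the random-walk values should stay close to 1 for many lags, which is the slow-decay signature of non-stationarity.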

**2. PACF (Partial Autocorrelation Function)**:
- Measures direct effect of each lag
- Removes effects of intermediate lags
- **Sharp cutoff** suggests AR process
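
One way to see what "removing the effects of intermediate lags" means is to regress Y_t on its first k lags and read off the coefficient on Y_{t-k}. The sketch below does that with ordinary least squares in NumPy. The helper `pacf_ols` and the simulated AR(1) series are illustrative assumptions; this is the regression view of the PACF, not the Yule-Walker estimator (`method='yw'`) used in the class above, so the numbers can differ slightly.

```python
import numpy as np

def pacf_ols(x, nlags):
    """PACF at lag k = coefficient on Y_{t-k} when Y_t is regressed on lags 1..k."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    values = [1.0]  # lag 0 is 1 by convention
    for k in range(1, nlags + 1):
        # Columns hold Y_{t-1}, ..., Y_{t-k}; rows cover t = k .. n-1
        lagged = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        target = x[k:]
        coefs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
        values.append(coefs[-1])  # direct effect of lag k
    return np.array(values)

rng = np.random.default_rng(0)
n, phi = 2000, 0.7

# Hypothetical AR(1) series: only lag 1 has a direct effect
noise = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + noise[t]

pacf_vals = pacf_ols(ar1, 5)
print("PACF, lags 0-5:", np.round(pacf_vals, 3))
```

For an AR(1) series like this one, the lag-1 value should land near the coefficient 0.7 and the higher lags near zero, which is the sharp cutoff that points to the AR order.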

**3. Model Selection Using ACF/PACF**:

| Process | ACF | PACF | Model Suggestion |
|---------|-----|------|------------------|
| AR(p) | Decays gradually | Cuts off after lag p | ARIMA(p, 0, 0) |
| MA(q) | Cuts off after lag q | Decays gradually | ARIMA(0, 0, q) |
| ARMA(p,q) | Decays gradually | Decays gradually | ARIMA(p, 0, q) |
| ARIMA(p,d,q), d ≥ 1 | Decays very slowly | Large spike near 1 at lag 1 | Difference first, then re-read ACF/PACF |
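
The "difference first" advice can be sketched on simulated data. The example below builds a hypothetical price path whose log is a random walk with drift, then compares the price level with its differenced log series in two ways: the number of ACF lags outside the ±1.96/√n band, and a hand-computed Ljung-Box statistic Q = n(n+2) Σ ρ̂ₖ²/(n−k). The helper names and simulation parameters are assumptions for illustration, and 18.307 is the 95% chi-square critical value for 10 degrees of freedom.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

def ljung_box_q(x, nlags):
    """Ljung-Box statistic: n(n+2) * sum over k of rho_k^2 / (n - k)."""
    n = len(x)
    rho = sample_acf(x, nlags)[1:]
    lags = np.arange(1, nlags + 1)
    return n * (n + 2) * np.sum(rho ** 2 / (n - lags))

def count_significant(x, nlags):
    """Number of lags 1..nlags whose ACF lies outside +/- 1.96 / sqrt(n)."""
    band = 1.96 / np.sqrt(len(x))
    return int(np.sum(np.abs(sample_acf(x, nlags)[1:]) > band))

rng = np.random.default_rng(1)
n = 500

# Hypothetical price path: log price is a random walk with a small drift
log_returns = rng.normal(0.0003, 0.015, n)
prices = 500 * np.exp(np.cumsum(log_returns))
diffed = np.diff(np.log(prices))  # first difference of log prices

CHI2_95_DF10 = 18.307  # 95% critical value, chi-square with 10 d.o.f.

sig_levels = count_significant(prices, 20)
sig_diffed = count_significant(diffed, 20)
q_levels = ljung_box_q(prices, 10)
q_diffed = ljung_box_q(diffed, 10)

print(f"Significant ACF lags (of 20): levels={sig_levels}, differenced={sig_diffed}")
print(f"Ljung-Box Q(10): levels={q_levels:.1f}, differenced={q_diffed:.1f}, "
      f"critical value={CHI2_95_DF10}")
```

The level series should show far more significant lags and a much larger Q than the differenced series. Because the simulated returns are independent, any remaining significant lag after differencing is sampling noise rather than structure.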

#### **2.3.3 Heteroscedasticity**

**Definition**: A time series is heteroscedastic when its variance changes over time instead of staying constant.

```python
class HeteroscedasticityAnalyzer:
    """
    Analyze heteroscedasticity in time-series data.
    
    Heteroscedasticity is common in financial data, where volatility
    clusters: turbulent periods and calm periods each tend to persist.
    """
    
    def __init__(self, data, price_column='Close'):
        self.data = data.copy()
        self.price_column = price_column
    
    def explain_heteroscedasticity(self):
        """
        Explain heteroscedasticity concepts.
        """
        print("=" * 70)
        print("UNDERSTANDING HETEROSCEDASTICITY")
        print("=" * 70)
        
        print("""
        📊 DEFINITION
        
        Heteroscedasticity occurs when the variance of errors
        (or the spread of data) changes over time.
        
        Opposite: Homoscedasticity (constant variance)
        
        ┌─────────────────────────────────────────────────────────────────┐
        │              HOMOSCEDASTIC (Constant Variance)                  │
        ├─────────────────────────────────────────────────────────────────┤
        │                                                                  │
        │     │  · · · · · · · · · · · · · · ·                            │
        │     │  · · · · · · · · · · · · · · ·                            │
        │     │  · · · · · · · · · · · · · · ·                            │
        │     │  · · · · · · · · · · · · · · ·                            │
        │     └──────────────────────────────────                        │
        │                                                                  │
        │     Spread is constant over time                                │
        │     Standard models work well                                   │
        │                                                                  │
        └─────────────────────────────────────────────────────────────────┘
        
        ┌─────────────────────────────────────────────────────────────────┐
        │            HETEROSCEDASTIC (Changing Variance)                  │
        ├─────────────────────────────────────────────────────────────────┤
        │                                                                  │
        │     │                    ·  ·  ·  ·                             │
        │     │              ·  ·  ·  ·  ·  ·  ·  ·                       │
        │     │        ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·                   │
        │     │  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·               │
        │     └────────────────────────────────────────────────────────── │
        │                                                                  │
        │     Spread increases over time (or varies)                      │
        │     Need special handling (GARCH models)                        │
        │                                                                  │
        └─────────────────────────────────────────────────────────────────┘
        
        📊 WHY IT MATTERS FOR NEPSE
        
        Stock returns often exhibit:
        1. Volatility clustering - high volatility periods cluster together
        2. Leverage effect - negative returns raise volatility more than
           positive returns of the same size
        3. Mean reversion - volatility tends to return to its long-run average
        
        These patterns violate the constant variance assumption of
        many models, leading to:
        • Inefficient parameter estimates
        • Invalid standard errors
        • Poor prediction intervals
        """)
    
    def visualize_volatility_clustering(self):
        """
        Visualize volatility clustering in returns.
        
        Volatility clustering is a form of heteroscedasticity where
        large changes tend to be followed by large changes, and
        small changes by small changes.
        """
        series = self.data[self.price_column].dropna().values
        
        # Calculate returns
        returns = np.diff(np.log(series)) * 100  # Log returns in %
        
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Price series
        axes[0, 0].plot(series, linewidth=1)
        axes[0, 0].set_title('Price Series', fontsize=11, fontweight='bold')
        axes[0, 0].set_xlabel('Time')
        axes[0, 0].set_ylabel('Price (NPR)')
        axes[0, 0].grid(True, alpha=0.3)
        
        # Returns series
        axes[0, 1].plot(returns, linewidth=1, color='green')
        axes[0, 1].axhline(y=0, color='black', linestyle='-', alpha=0.3)
        axes[0, 1].set_title('Log Returns (%)', fontsize=11, fontweight='bold')
        axes[0, 1].set_xlabel('Time')
        axes[0, 1].set_ylabel('Return (%)')
        axes[0, 1].grid(True, alpha=0.3)
        
        # Highlight volatility clustering
        # Rolling standard deviation (volatility)
        volatility = pd.Series(returns).rolling(window=20).std().values
        
        axes[1, 0].plot(volatility, linewidth=1, color='red')
        axes[1, 0].set_title('Rolling Volatility (20-day Std Dev)', fontsize=11, fontweight='bold')
        axes[1, 0].set_xlabel('Time')
        axes[1, 0].set_ylabel('Volatility (%)')
        axes[1, 0].grid(True, alpha=0.3)
        
        # Squared returns (a simple proxy for daily variance)
        squared_returns = returns ** 2
        
        axes[1, 1].plot(squared_returns, linewidth=1, color='purple', alpha=0.7)
        axes[1, 1].set_title('Squared Returns (Volatility Indicator)', fontsize=11, fontweight='bold')
        axes[1, 1].set_xlabel('Time')
        axes[1, 1].set_ylabel('Squared Return')
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Interpretation
        print("\n" + "=" * 70)
        print("VOLATILITY CLUSTERING ANALYSIS")
        print("=" * 70)
        
        print("""
        📈 What to Look For:
        
        1. RETURNS PLOT (Top Right):
           • Clusters of large swings indicate volatility clustering
           • Look for periods where returns swing widely vs. quietly
        
        2. ROLLING VOLATILITY (Bottom Left):
           • Peaks indicate high-volatility periods
           • Troughs indicate calm periods
           • Persistence suggests GARCH effects
        
        3. SQUARED RETURNS (Bottom Right):
           • Used in ARCH/GARCH models
           • Autocorrelation here indicates ARCH effects
        """)
        
        # Calculate statistics
        print(f"\n📊 Volatility Statistics:")
        print("-" * 50)
        print(f"   Mean return:          {returns.mean():.4f}%")
        print(f"   Std dev of returns:   {returns.std():.4f}%")
        print(f"   Min return:           {returns.min():.4f}%")
        print(f"   Max return:           {returns.max():.4f}%")
        print(f"   Mean volatility:      {np.nanmean(volatility):.4f}%")
        
        return returns, volatility
    
    def arch_test(self, returns=None, lags=10):
        """
        Perform Engle's ARCH test for heteroscedasticity.
        
        ARCH (Autoregressive Conditional Heteroscedasticity) test
        checks if variance of errors depends on past squared errors.
        
        Parameters:
        -----------
        returns : np.array, optional
            Return series (calculated if not provided)
        lags : int
            Number of lags to test
        
        Returns:
        --------
        dict : Test results
        """
        from statsmodels.stats.diagnostic import het_arch
        
        print("\n" + "=" * 70)
        print("ENGLE'S ARCH TEST FOR HETEROSCEDASTICITY")
        print("=" * 70)
        
        if returns is None:
            series = self.data[self.price_column].dropna().values
            returns = np.diff(np.log(series)) * 100
        
        print("""
        📊 Test Explanation:
        
        Null Hypothesis (H0): No ARCH effects (homoscedastic)
        Alternative Hypothesis (H1): ARCH effects present (heteroscedastic)
        
        ARCH effects mean that past squared residuals help predict
        current variance - indicating volatility clustering.
        
        Decision Rule:
        • If p-value < 0.05: Reject H0 → ARCH effects present
        • If p-value >= 0.05: Cannot reject H0 → No evidence of ARCH effects
        """)
        
        # Perform test
        lm_stat, lm_pvalue, f_stat, f_pvalue = het_arch(returns, nlags=lags)
        
        print("\n📊 Test Results:")
        print("-" * 50)
        print(f"   LM Statistic:     {lm_stat:.4f}")
        print(f"   LM p-value:       {lm_pvalue:.6f}")
        print(f"   F Statistic:      {f_stat:.4f}")
        print(f"   F p-value:        {f_pvalue:.6f}")
        
        print("\n📈 Interpretation:")
        print("-" * 50)
        
        if lm_pvalue < 0.05:
            print(f"   ⚠ p-value ({lm_pvalue:.6f}) < 0.05")
            print("   → Reject null hypothesis")
            print("   → ARCH EFFECTS ARE PRESENT")
            print("\n   Implications:")
            print("   • Volatility clustering exists")
            print("   • Consider GARCH models for volatility")
            print("   • Standard models may underestimate risk")
        else:
            print(f"   ✓ p-value ({lm_pvalue:.6f}) >= 0.05")
            print("   → Cannot reject null hypothesis")
            print("   → No significant ARCH effects")
            print("\n   Standard models should be adequate.")
        
        return {
            'lm_statistic': lm_stat,
            'lm_pvalue': lm_pvalue,
            'f_statistic': f_stat,
            'f_pvalue': f_pvalue,
            'has_arch_effects': lm_pvalue < 0.05
        }
    
    def rolling_volatility_analysis(self, window=20):
        """
        Analyze rolling volatility characteristics.
        
        Parameters:
        -----------
        window : int
            Rolling window for volatility calculation
        """
        print("\n" + "=" * 70)
        print("ROLLING VOLATILITY ANALYSIS")
        print("=" * 70)
        
        series = self.data[self.price_column].dropna().values
        returns = pd.Series(np.diff(np.log(series)) * 100)
        
        # Calculate rolling volatility (standard deviation of returns)
        rolling_std = returns.rolling(window=window).std()
        
        # Find high and low volatility periods
        vol_threshold_high = rolling_std.mean() + rolling_std.std()
        vol_threshold_low = rolling_std.mean() - rolling_std.std()
        
        high_vol_periods = (rolling_std > vol_threshold_high).sum()
        low_vol_periods = (rolling_std < vol_threshold_low).sum()
        
        print(f"\n📊 Rolling Volatility Statistics (Window = {window} days):")
        print("-" * 50)
        print(f"   Average volatility:    {rolling_std.mean():.4f}%")
        print(f"   Volatility std dev:    {rolling_std.std():.4f}%")
        print(f"   Min volatility:        {rolling_std.min():.4f}%")
        print(f"   Max volatility:        {rolling_std.max():.4f}%")
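        # Percentages use only complete windows (the first window-1 values are NaN)
        valid_windows = rolling_std.count()
        print(f"   High volatility days:  {high_vol_periods} ({high_vol_periods/valid_windows*100:.1f}%)")
        print(f"   Low volatility days:   {low_vol_periods} ({low_vol_periods/valid_windows*100:.1f}%)")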
        
        # Volatility of volatility
        vol_of_vol = rolling_std.std() / rolling_std.mean()
        print(f"\n   Volatility of Volatility: {vol_of_vol:.4f}")
        
        if vol_of_vol > 0.5:
            print("   → High variation in volatility over time")
            print("   → Strong heteroscedasticity present")
        else:
            print("   → Relatively stable volatility")
        
        return {
            'rolling_std': rolling_std,
            'avg_volatility': rolling_std.mean(),
            'high_vol_periods': high_vol_periods,
            'low_vol_periods': low_vol_periods
        }
    
    def suggest_volatility_model(self):
        """
        Suggest appropriate volatility model based on analysis.
        """
        print("\n" + "=" * 70)
        print("VOLATILITY MODEL RECOMMENDATIONS")
        print("=" * 70)
        
        series = self.data[self.price_column].dropna().values
        returns = np.diff(np.log(series)) * 100
        
        # Run ARCH test
        arch_result = self.arch_test(returns)
        
        print("""
        📊 Volatility Modeling Options:
        
        ┌─────────────────────────────────────────────────────────────────┐
        │  MODEL          │ USE CASE                         │ COMPLEXITY │
        ├─────────────────────────────────────────────────────────────────┤
        │  Constant Vol   │ No ARCH effects                  │ Low        │
        │  ARCH(q)        │ ARCH effects, simple             │ Medium     │
        │  GARCH(1,1)     │ ARCH effects, persistent vol     │ Medium     │
        │  EGARCH         │ Asymmetric effects               │ High       │
        │  GJR-GARCH      │ Leverage effect                  │ High       │
        │  TGARCH         │ Threshold effects                │ High       │
        └─────────────────────────────────────────────────────────────────┘
        """)
        
        if arch_result['has_arch_effects']:
            print("""
        ✓ RECOMMENDATION: Use GARCH-type models
        
        For NEPSE data with ARCH effects:
        
        1. START WITH: GARCH(1,1)
           - Captures volatility clustering
           - Most commonly used
           - Formula: σ²_t = ω + α·ε²_{t-1} + β·σ²_{t-1}
        
        2. IF ASYMMETRY: Use EGARCH or GJR-GARCH
           - Negative returns may increase volatility more
           - Common in stock markets
        
        3. IMPLEMENTATION (requires the `arch` package):
        
            from arch import arch_model
            
            # Fit GARCH(1,1)
            model = arch_model(returns, vol='Garch', p=1, q=1)
            results = model.fit()
            
            # Get conditional volatility
            conditional_vol = results.conditional_volatility
            """)
        else:
            print("""
        ✓ RECOMMENDATION: Standard models adequate
        
        No significant ARCH effects detected:
        • Use constant volatility assumption
        • Standard prediction intervals should be valid
        • No need for GARCH modeling
            """)


# ============================================================
# EXAMPLE USAGE
# ============================================================

print("\n" + "=" * 70)
print("HETEROSCEDASTICITY ANALYSIS FOR NEPSE DATA")
print("=" * 70)

# Create sample data with varying volatility
def create_heteroscedastic_data(n_days=500):
    """Create NEPSE-like data with volatility clustering."""
    np.random.seed(42)
    
    # Simulate volatility clustering with two regimes:
    # calm (std 1%) and turbulent (std 3%), each lasting ~20 days on average
    volatility = np.ones(n_days)  # Start from the calm level
    
    # Create volatility regimes
    regime = 0
    for i in range(n_days):
        if np.random.random() < 0.05:  # 5% chance to change regime
            regime = 1 - regime
        
        if regime == 1:
            volatility[i] = 3.0  # High volatility
        else:
            volatility[i] = 1.0  # Low volatility
    
    # Generate returns with varying volatility
    returns = np.random.normal(0.05, volatility)  # Mean 0.05%, varying std
    
    # Convert returns to prices
    prices = 500 * np.cumprod(1 + returns / 100)
    
    data = pd.DataFrame({
        'S.No': range(1, n_days + 1),
        'Symbol': 'ABL',
        'Close': prices
    })
    
    return data


hetero_data = create_heteroscedastic_data(n_days=500)

# Initialize analyzer
hetero_analyzer = HeteroscedasticityAnalyzer(hetero_data, price_column='Close')

# Explain concepts
hetero_analyzer.explain_heteroscedasticity()

# Visualize volatility clustering
returns, volatility = hetero_analyzer.visualize_volatility_clustering()

# Perform ARCH test
arch_result = hetero_analyzer.arch_test(returns)

# Rolling volatility analysis
rolling_results = hetero_analyzer.rolling_volatility_analysis(window=20)

# Suggest model
hetero_analyzer.suggest_volatility_model()
```
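
To make the GARCH(1,1) recursion and the ARCH test less abstract, the sketch below simulates returns from σ²ₜ = ω + α·ε²ₜ₋₁ + β·σ²ₜ₋₁ and then computes Engle's LM statistic by hand as n·R² from a regression of squared residuals on their own lags. It uses NumPy only. The function names, the parameter values (ω = 0.05, α = 0.10, β = 0.85), and the comparison series are assumptions chosen for illustration, and 11.070 is the 95% chi-square critical value for 5 degrees of freedom.

```python
import numpy as np

def simulate_garch11(n, omega, alpha, beta, rng):
    """Simulate returns with sigma2_t = omega + alpha*eps_{t-1}^2 + beta*sigma2_{t-1}."""
    eps = np.zeros(n)
    sigma2 = np.zeros(n)
    sigma2[0] = omega / (1 - alpha - beta)  # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * rng.normal()
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.normal()
    return eps, sigma2

def arch_lm_stat(resid, nlags):
    """Engle's LM statistic: (number of rows) * R^2 from regressing e_t^2 on its lags."""
    resid = np.asarray(resid, dtype=float)
    e2 = (resid - resid.mean()) ** 2
    y = e2[nlags:]
    # Design matrix: a constant plus e2 lagged by 1..nlags
    cols = [np.ones(len(y))]
    cols += [e2[nlags - j:len(e2) - j] for j in range(1, nlags + 1)]
    X = np.column_stack(cols)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coefs
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return len(y) * r2

rng = np.random.default_rng(7)

# Hypothetical series: one with GARCH(1,1) volatility, one with constant volatility
garch_returns, garch_sigma2 = simulate_garch11(2000, omega=0.05, alpha=0.10,
                                               beta=0.85, rng=rng)
iid_returns = rng.normal(0, 1, 2000)

CHI2_95_DF5 = 11.070  # 95% critical value, chi-square with 5 d.o.f.

lm_garch = arch_lm_stat(garch_returns, 5)
lm_iid = arch_lm_stat(iid_returns, 5)

print(f"LM statistic, GARCH(1,1) returns:   {lm_garch:.2f}")
print(f"LM statistic, constant-vol returns: {lm_iid:.2f}")
print(f"Reject 'no ARCH effects' when the statistic exceeds {CHI2_95_DF5}")
```

The GARCH series should produce a statistic well above the critical value, while the constant-volatility series usually stays below it. This mirrors the decision rule that `arch_test` applies through `het_arch`.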

---

### **2.4 Common Data Challenges**

Time-series data comes with various challenges that must be addressed for successful prediction.

```python
class DataChallengeAnalyzer:
    """
    Analyze and address common data challenges in time-series.
    
    Understanding these challenges is crucial for building
    robust prediction systems.
    """
    
    def __init__(self, data, price_column='Close'):
        self.data = data.copy()
        self.price_column = price_column
    
    def explain_challenges(self):
        """
        Explain common time-series data challenges.
        """
        print("=" * 70)
        print("COMMON DATA CHALLENGES IN TIME-SERIES")
        print("=" * 70)
        
        challenges = {
            'Missing Values': {
                'description': 'Gaps in the time series where data is absent',
                'causes': ['System downtime', 'Trading holidays', 'Data entry errors', 'API failures'],
                'impact': 'Breaks temporal continuity, affects lag calculations',
                'solutions': ['Forward fill', 'Interpolation', 'Model-based imputation']
            },
            'Outliers': {
                'description': 'Extreme values that deviate significantly from pattern',
                'causes': ['Market crashes', 'Data errors', 'Corporate actions', 'News events'],
                'impact': 'Distorts statistics, affects model training',
                'solutions': ['Winsorization', 'Removal', 'Robust methods', 'Domain rules']
            },
            'Non-Stationarity': {
                'description': 'Statistical properties change over time',
                'causes': ['Trends', 'Seasonality', 'Structural breaks', 'Market regime changes'],
                'impact': 'Many models assume stationarity',
                'solutions': ['Differencing', 'Detrending', 'Transformation']
            },
            'Irregular Sampling': {
                'description': 'Non-uniform time intervals between observations',
                'causes': ['Missing data', 'Different data sources', 'Schedule changes'],
                'impact': 'Complicates modeling, affects lag calculations',
                'solutions': ['Resampling', 'Interpolation', 'Time-aware models']
            },
            'Noise': {
                'description': 'Random fluctuations that obscure signal',
                'causes': ['Measurement errors', 'Market microstructure', 'Random trading'],
                'impact': 'Reduces predictability, affects accuracy',
                'solutions': ['Smoothing', 'Filtering', 'Aggregation']
            },
            'Limited History': {
                'description': 'Insufficient historical data for training',
                'causes': ['New stocks', 'Recent IPOs', 'Data loss'],
                'impact': 'Cannot train complex models, poor generalization',
                'solutions': ['Transfer learning', 'Simple models', 'Similar stock data']
            }
        }
        
        for challenge, info in challenges.items():
            print(f"\n📊 {challenge.upper()}")
            print(f"   Description: {info['description']}")
            print(f"   Causes: {', '.join(info['causes'])}")
            print(f"   Impact: {info['impact']}")
            print(f"   Solutions: {', '.join(info['solutions'])}")
    
    def analyze_missing_patterns(self):
        """
        Analyze patterns of missing data.
        
        Understanding the type of missing data helps choose
        the right imputation strategy.
        """
        print("\n" + "=" * 70)
        print("MISSING DATA PATTERN ANALYSIS")
        print("=" * 70)
        
        print("""
        📊 TYPES OF MISSING DATA:
        
        1. MCAR (Missing Completely At Random)
           • Missing values are random, no pattern
           • Safe to delete or impute
           • Example: Random system glitches
        
        2. MAR (Missing At Random)
           • Missing depends on observed variables
           • Can be modeled using other features
           • Example: Missing volume on low-activity days
        
        3. MNAR (Missing Not At Random)
           • Missing depends on unobserved values
           • Most problematic, requires domain knowledge
           • Example: Missing prices during market crashes
        
        📈 NEPSE-SPECIFIC MISSING PATTERNS:
        
        1. Weekend/Holiday Gaps
           - Expected and systematic
           - Use forward fill or ignore
        
        2. Trading Halts
           - During extreme volatility
           - May need special handling
        
        3. Data Feed Issues
           - Random, typically MCAR
           - Can interpolate
        """)
        
        # Simulate missing data analysis
        series = self.data[self.price_column].copy()
        
        # Check for missing values
        missing_count = series.isnull().sum()
        missing_pct = missing_count / len(series) * 100
        
        print(f"\n📊 Missing Data Summary:")
        print("-" * 50)
        print(f"   Total observations: {len(series)}")
        print(f"   Missing values:     {missing_count}")
        print(f"   Missing percentage: {missing_pct:.2f}%")
        
        if missing_count > 0:
            # Check for patterns
            # Are missing values clustered?
            missing_indices = series[series.isnull()].index
            
            if len(missing_indices) > 1:
                gaps = np.diff(missing_indices)
                avg_gap = gaps.mean()
                print(f"\n   Missing value gaps:")
                print(f"   Average gap: {avg_gap:.1f} observations")
                print(f"   Max gap:     {gaps.max()} observations")
                
                if avg_gap < 2:
                    print("   → Missing values are clustered together")
                else:
                    print("   → Missing values are scattered")
        
        return {
            'missing_count': missing_count,
            'missing_pct': missing_pct
        }
    
    def demonstrate_imputation(self):
        """
        Demonstrate different imputation methods.
        """
        print("\n" + "=" * 70)
        print("TIME-SERIES IMPUTATION METHODS")
        print("=" * 70)
        
        print("""
        📊 COMMON IMPUTATION METHODS:
        
        1. FORWARD FILL (ffill)
           • Use last known value
           • Best for prices (maintains level)
           • Code: data.ffill()
        
        2. BACKWARD FILL (bfill)
           • Use next known value
           • Good for post-processing
           • Code: data.bfill()
        
        3. LINEAR INTERPOLATION
           • Connect known values with line
           • Good for smooth data
           • Code: data.interpolate(method='linear')
        
        4. SPLINE INTERPOLATION
           • Smooth curve fitting
           • Better for non-linear patterns
           • Code: data.interpolate(method='spline', order=3)  (requires SciPy)
        
        5. MEAN/MEDIAN IMPUTATION
           • Use historical average
           • Simple but can distort trends
           • Code: data.fillna(data.mean())
        
        📈 NEPSE EXAMPLE:
        """)
        
        # Create example with missing values
        np.random.seed(42)
        n = 20
        prices = 500 + np.cumsum(np.random.randn(n) * 5)
        prices_with_missing = prices.copy()
        missing_indices = [5, 6, 7, 12, 13]
        prices_with_missing[missing_indices] = np.nan
        
        # Create DataFrame
        example_df = pd.DataFrame({
            'Original': prices,
            'With_Missing': prices_with_missing,
            'Forward_Fill': pd.Series(prices_with_missing).ffill(),
            'Linear_Interp': pd.Series(prices_with_missing).interpolate(method='linear'),
            'Mean_Fill': pd.Series(prices_with_missing).fillna(np.nanmean(prices_with_missing))
        })
        
        print("\n" + "-" * 70)
        print(example_df.round(2))
        print("-" * 70)
        
        print("""
        💡 RECOMMENDATIONS FOR NEPSE:
        
        For Price Data:
        • Use forward fill (most recent price is best estimate)
        • Or ignore missing rows (for model training)
        
        For Volume Data:
        • Use mean of recent days (volume patterns repeat)
        • Or set to 0 (no trading occurred)
        
        For Derived Features:
        • Recalculate after filling raw data
        • Don't fill derived features directly
        """)
        
        return example_df
    
    def analyze_outliers(self, threshold=3):
        """
        Detect and analyze outliers in the time series.
        
        Parameters:
        -----------
        threshold : float
            Z-score threshold for outlier detection
        """
        print("\n" + "=" * 70)
        print("OUTLIER DETECTION AND ANALYSIS")
        print("=" * 70)
        
        series = self.data[self.price_column].dropna()
        returns = series.pct_change().dropna()
        
        # Method 1: Z-score based
        z_scores = (returns - returns.mean()) / returns.std()
        z_outliers = returns[np.abs(z_scores) > threshold]
        
        # Method 2: IQR based
        Q1 = returns.quantile(0.25)
        Q3 = returns.quantile(0.75)
        IQR = Q3 - Q1
        iqr_outliers = returns[(returns < Q1 - 1.5*IQR) | (returns > Q3 + 1.5*IQR)]
        
        # Method 3: Rolling Z-score
        rolling_mean = returns.rolling(window=20).mean()
        rolling_std = returns.rolling(window=20).std()
        rolling_z = (returns - rolling_mean) / rolling_std
        rolling_outliers = returns[np.abs(rolling_z) > threshold]
        
        print(f"\n📊 Outlier Detection Results:")
        print("-" * 50)
        print(f"   Z-score method (threshold={threshold}):  {len(z_outliers)} outliers")
        print(f"   IQR method:                              {len(iqr_outliers)} outliers")
        print(f"   Rolling Z-score method:                  {len(rolling_outliers)} outliers")
        
        print(f"\n📊 Outlier Statistics:")
        print("-" * 50)
        print(f"   Return mean:     {returns.mean()*100:.4f}%")
        print(f"   Return std:      {returns.std()*100:.4f}%")
        print(f"   Return min:      {returns.min()*100:.4f}%")
        print(f"   Return max:      {returns.max()*100:.4f}%")
        
        if len(z_outliers) > 0:
            print(f"\n📊 Extreme Outliers (|Z| > {threshold}):")
            print("-" * 50)
            for idx in z_outliers.index[:5]:
                ret = returns.loc[idx] * 100
                print(f"   Index {idx}: {ret:+.2f}%")
        
        print("""
        💡 OUTLIER HANDLING STRATEGIES:
        
        1. INVESTIGATE FIRST
           • Check news/events on outlier days
           • Verify data accuracy
           • Understand the cause
        
        2. FOR DATA ERRORS
           • Correct if possible
           • Remove if uncorrectable
        
        3. FOR GENUINE OUTLIERS
           • Keep for risk modeling
           • Winsorize (cap at threshold)
           • Use robust statistics
        
        4. FOR NEPSE
           • Check for stock splits, dividends
           • Verify against official sources
           • Consider regulatory announcements
        """)
        
        return {
            'z_outliers': z_outliers,
            'iqr_outliers': iqr_outliers,
            'rolling_outliers': rolling_outliers
        }


# ============================================================
# EXAMPLE USAGE
# ============================================================

print("\n" + "=" * 70)
print("DATA CHALLENGE ANALYSIS FOR NEPSE")
print("=" * 70)

# Create sample data
sample_data = create_trending_nepse_data(n_days=300, trend_type='upward')

# Initialize analyzer
challenge_analyzer = DataChallengeAnalyzer(sample_data, price_column='Close')

# Explain challenges
challenge_analyzer.explain_challenges()

# Analyze missing patterns
missing_results = challenge_analyzer.analyze_missing_patterns()

# Demonstrate imputation
imputation_example = challenge_analyzer.demonstrate_imputation()

# Analyze outliers
outlier_results = challenge_analyzer.analyze_outliers(threshold=2.5)
```
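
Winsorization is listed among the outlier strategies above but is not demonstrated. A minimal sketch, assuming percentile-based caps: the 1st/99th percentile limits, the injected extreme values, and the helper name `winsorize_returns` are illustrative choices and not part of `DataChallengeAnalyzer`.

```python
import numpy as np
import pandas as pd


def winsorize_returns(returns: pd.Series, lower: float = 0.01,
                      upper: float = 0.99) -> pd.Series:
    """Cap returns at the given lower/upper quantiles instead of dropping them."""
    lo, hi = returns.quantile([lower, upper])
    return returns.clip(lower=lo, upper=hi)


# Hypothetical daily returns with two injected extreme moves
rng = np.random.default_rng(0)
raw = pd.Series(rng.normal(0.0005, 0.015, 500))
raw.iloc[100] = 0.25    # simulated data error or corporate action
raw.iloc[300] = -0.20

capped = winsorize_returns(raw)

print(f"Raw range:        {raw.min():+.4f} to {raw.max():+.4f}")
print(f"Capped range:     {capped.min():+.4f} to {capped.max():+.4f}")
print(f"Std before/after: {raw.std():.4f} / {capped.std():.4f}")
```

Because the caps here are computed from the full sample, they use information from the whole period. When the capped series feeds a model, compute the quantiles on the training window only so that future data does not leak into the past.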

---

### **2.5 Exploring Your First Time-Series Dataset**

Now let's create a comprehensive exploration class for NEPSE data:

```python
class NEPSEDataExplorer:
    """
    Comprehensive data exploration for NEPSE time-series data.
    
    This class provides a systematic approach to understanding
    time-series data before modeling.
    """
    
    def __init__(self, data_path=None, data=None):
        """
        Initialize the explorer.
        
        Parameters:
        -----------
        data_path : str, optional
            Path to NEPSE CSV file
        data : pd.DataFrame, optional
            Pre-loaded data
        """
        if data is not None:
            self.data = data.copy()
        elif data_path is not None:
            self.data = pd.read_csv(data_path)
        else:
            raise ValueError("Provide either data_path or data")
        
        self.numeric_columns = None
        self.exploration_results = {}
    
    def initial_overview(self):
        """
        Generate initial overview of the dataset.
        """
        print("=" * 70)
        print("NEPSE TIME-SERIES DATA EXPLORATION")
        print("=" * 70)
        
        print("\n📊 DATASET OVERVIEW")
        print("-" * 70)
        
        # Basic info
        print(f"   Shape:           {self.data.shape}")
        print(f"   Rows:            {self.data.shape[0]:,}")
        print(f"   Columns:         {self.data.shape[1]}")
        print(f"   Memory usage:    {self.data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
        
        # Column types
        print(f"\n   Column Types:")
        print(f"   - Numeric:       {self.data.select_dtypes(include=[np.number]).shape[1]}")
        print(f"   - Categorical:   {self.data.select_dtypes(include=['object']).shape[1]}")
        print(f"   - DateTime:      {self.data.select_dtypes(include=['datetime']).shape[1]}")
        
        # Unique stocks
        if 'Symbol' in self.data.columns:
            print(f"\n   Stock Information:")
            print(f"   - Unique stocks: {self.data['Symbol'].nunique()}")
            print(f"   - Records/stock: {self.data.groupby('Symbol').size().describe()['mean']:.1f} avg")
        
        # Sample data
        print("\n📋 SAMPLE DATA (First 5 Rows):")
        print("-" * 70)
        print(self.data.head())
        
        # Data types
        print("\n📋 DATA TYPES:")
        print("-" * 70)
        print(self.data.dtypes.to_string())
        
        # DataFrame.info() prints its report and returns None
        self.data.info()
    
    def identify_numeric_columns(self):
        """
        Identify numeric columns for analysis.
        """
        self.numeric_columns = self.data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        
        # Exclude index-like columns
        exclude_cols = ['S.No']
        self.numeric_columns = [c for c in self.numeric_columns 
                                if c not in exclude_cols]
        
        print(f"\n📊 Numeric Columns for Analysis:")
        print(f"   {self.numeric_columns}")
        
        return self.numeric_columns
    
    def statistical_summary(self):
        """
        Generate comprehensive statistical summary.
        """
        print("\n" + "=" * 70)
        print("STATISTICAL SUMMARY")
        print("=" * 70)
        
        if self.numeric_columns is None:
            self.identify_numeric_columns()
        
        summary = self.data[self.numeric_columns].describe()
        
        print("\n" + "-" * 70)
        print(summary.round(2))
        print("-" * 70)
        
        # Additional statistics
        print("\n📊 Additional Statistics:")
        print("-" * 70)
        
        for col in self.numeric_columns[:5]:  # First 5 columns
            data = self.data[col].dropna()
            print(f"\n   {col}:")
            print(f"      Skewness: {data.skew():.4f}")
            print(f"      Kurtosis: {data.kurtosis():.4f}")
            print(f"      CV:       {data.std()/data.mean()*100:.2f}%")
        
        self.exploration_results['summary'] = summary
        return summary
    
    def time_range_analysis(self):
        """
        Analyze the time range and data completeness.
        """
        print("\n" + "=" * 70)
        print("TIME RANGE ANALYSIS")
        print("=" * 70)
        
        if 'S.No' in self.data.columns:
            print(f"\n   Serial Number Range:")
            print(f"   - Min:     {self.data['S.No'].min()}")
            print(f"   - Max:     {self.data['S.No'].max()}")
            print(f"   - Records: {self.data['S.No'].nunique()}")
        
        if 'Date' in self.data.columns:
            self.data['Date'] = pd.to_datetime(self.data['Date'])
            print(f"\n   Date Range:")
            print(f"   - Start:   {self.data['Date'].min()}")
            print(f"   - End:     {self.data['Date'].max()}")
            print(f"   - Days:    {(self.data['Date'].max() - self.data['Date'].min()).days}")
        
        # Records per stock
        if 'Symbol' in self.data.columns:
            print(f"\n   Records per Stock:")
            records_per_stock = self.data.groupby('Symbol').size()
            print(f"   - Min:     {records_per_stock.min()}")
            print(f"   - Max:     {records_per_stock.max()}")
            print(f"   - Mean:    {records_per_stock.mean():.1f}")
    
    def correlation_analysis(self):
        """
        Analyze correlations between numeric variables.
        """
        print("\n" + "=" * 70)
        print("CORRELATION ANALYSIS")
        print("=" * 70)
        
        if self.numeric_columns is None:
            self.identify_numeric_columns()
        
        # Calculate correlation matrix
        corr_matrix = self.data[self.numeric_columns].corr()
        
        print("\n📊 Correlation Matrix (first 5 columns):")
        print("-" * 70)
        print(corr_matrix.iloc[:5, :5].round(3))
        
        # Find highly correlated pairs
        print("\n📊 Highly Correlated Pairs (|r| > 0.8):")
        print("-" * 70)
        
        for i in range(len(corr_matrix.columns)):
            for j in range(i+1, len(corr_matrix.columns)):
                col1 = corr_matrix.columns[i]
                col2 = corr_matrix.columns[j]
                corr_val = corr_matrix.iloc[i, j]
                
                if abs(corr_val) > 0.8:
                    print(f"   {col1} <-> {col2}: {corr_val:.3f}")
        
        return corr_matrix
    
    def run_full_exploration(self):
        """
        Run complete data exploration pipeline.
        """
        self.initial_overview()
        self.statistical_summary()
        self.time_range_analysis()
        self.correlation_analysis()
        
        print("\n" + "=" * 70)
        print("EXPLORATION COMPLETE")
        print("=" * 70)
        
        return self.exploration_results


# ============================================================
# COMPLETE EXAMPLE
# ============================================================

# Create comprehensive NEPSE sample data
def create_comprehensive_nepse_data():
    """
    Create comprehensive NEPSE-like dataset for exploration.
    """
    np.random.seed(42)
    
    symbols = ['ABL', 'ADBL', 'NABIL', 'NICA', 'SCB']
    all_data = []
    
    for symbol in symbols:
        n_days = np.random.randint(200, 400)
        
        # Generate realistic price data
        base_price = np.random.uniform(300, 800)
        trend = np.random.uniform(-0.0002, 0.0005)
        volatility = np.random.uniform(0.015, 0.025)
        
        returns = np.random.normal(trend, volatility, n_days)
        prices = base_price * np.cumprod(1 + returns)
        
        for i in range(n_days):
            close = prices[i]
            high = close * (1 + np.abs(np.random.normal(0, 0.01)))
            low = close * (1 - np.abs(np.random.normal(0, 0.01)))
            open_price = low + (high - low) * np.random.random()
            
            volume = int(np.random.lognormal(9 + np.random.uniform(-1, 1), 0.5))
            vwap = (high + low + close) / 3 * (1 + np.random.normal(0, 0.002))
            
            record = {
                'S.No': i + 1,
                'Symbol': symbol,
                'Conf.': 'Standard',
                'Open': round(open_price, 2),
                'High': round(high, 2),
                'Low': round(low, 2),
                'Close': round(close, 2),
                'LTP': round(close, 2),
                'Close - LTP': 0.0,
                'Close - LTP %': 0.0,
                'VWAP': round(vwap, 2),
                'Vol': volume,
                'Prev. Close': round(prices[i-1] if i > 0 else base_price, 2),
                'Turnover': int(volume * close),
                'Trans.': int(volume * 0.1),
                'Diff': round(close - prices[i-1] if i > 0 else 0, 2),
                'Range': round(high - low, 2),
                'Diff %': round((close - prices[i-1]) / prices[i-1] * 100 if i > 0 else 0, 2),
                'Range %': round((high - low) / close * 100, 2),
                'VWAP %': round((close - vwap) / vwap * 100, 2),
                '52 Weeks High': round(close * 1.3, 2),
                '52 Weeks Low': round(close * 0.7, 2)
            }
            all_data.append(record)
    
    df = pd.DataFrame(all_data)
    return df


# Run complete exploration
print("\n" + "=" * 70)
print("RUNNING COMPLETE NEPSE DATA EXPLORATION")
print("=" * 70)

nepse_full_data = create_comprehensive_nepse_data()
explorer = NEPSEDataExplorer(data=nepse_full_data)
exploration_results = explorer.run_full_exploration()
```
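
`time_range_analysis` reports the date span but does not check whether every expected trading day is present. A minimal sketch of such a check, assuming a `Date` column and a Monday-to-Friday business-day calendar (`freq='B'`), as in this chapter's sample data. A real exchange calendar, with its own trading week and public holidays, would have to be supplied instead. The helper name `find_missing_sessions` is hypothetical.

```python
import pandas as pd


def find_missing_sessions(df: pd.DataFrame, date_col: str = "Date",
                          freq: str = "B") -> pd.DatetimeIndex:
    """Return expected calendar dates that have no row in df."""
    observed = pd.DatetimeIndex(pd.to_datetime(df[date_col])).normalize()
    expected = pd.date_range(observed.min(), observed.max(), freq=freq)
    return expected.difference(observed)


# Hypothetical example: 10 business days with two sessions removed
dates = pd.date_range("2024-01-01", periods=10, freq="B")
sample = pd.DataFrame({"Date": dates, "Close": range(500, 510)})
sample = sample.drop(index=[3, 4]).reset_index(drop=True)

missing = find_missing_sessions(sample)
print(f"Missing sessions: {list(missing.strftime('%Y-%m-%d'))}")
```

For a multi-stock frame, the check would be run once per symbol by looping over `df.groupby('Symbol')`, since each stock can have its own gaps.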

---

### **2.6 Visual Inspection Techniques**

Visual inspection is one of the most powerful tools for understanding time-series data. A good visualization can reveal patterns that statistics alone might miss.
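
To make that claim concrete, the sketch below (an illustration with simulated data, not NEPSE prices) builds two series from the same set of values: a random walk and a shuffled copy of it. Their mean and standard deviation match, yet only a plot or an order-aware statistic such as lag-1 autocorrelation shows that one has temporal structure and the other does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Random-walk "price" series and a shuffled copy containing the same values
walk = pd.Series(500 + np.cumsum(rng.normal(0, 5, 300)))
shuffled = pd.Series(rng.permutation(walk.to_numpy()))

summary = pd.DataFrame({
    "mean": [walk.mean(), shuffled.mean()],
    "std": [walk.std(), shuffled.std()],
    "lag1_autocorr": [walk.autocorr(1), shuffled.autocorr(1)],
}, index=["random_walk", "shuffled"])

print(summary.round(3))
```

Summary statistics such as the mean and standard deviation ignore ordering, which is why the visual techniques in this section all keep the time axis, or a lagged version of it, in view.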

```python
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()


class TimeSeriesVisualizer:
    """
    Comprehensive visualization toolkit for time-series data.
    
    Visualization is crucial because:
    1. Reveals patterns invisible in summary statistics
    2. Helps identify data quality issues
    3. Guides modeling decisions
    4. Communicates findings effectively
    """
    
    def __init__(self, data, price_column='Close', date_column=None):
        """
        Initialize the visualizer.
        
        Parameters:
        -----------
        data : pd.DataFrame
            Time-series data
        price_column : str
            Column containing values to visualize
        date_column : str, optional
            Date column (creates synthetic dates if None)
        """
        self.data = data.copy()
        self.price_column = price_column
        self.date_column = date_column
        
        # Ensure we have a time index
        if date_column and date_column in data.columns:
            self.data[date_column] = pd.to_datetime(self.data[date_column])
            self.data = self.data.sort_values(date_column)
        
        # Set style
        plt.style.use('seaborn-v0_8-whitegrid')
    
    def line_plot_overview(self, symbol=None):
        """
        Create basic line plot of the time series.
        
        Line plots are fundamental for:
        - Seeing overall trends
        - Identifying abrupt changes
        - Spotting outliers visually
        """
        fig, ax = plt.subplots(figsize=(14, 6))
        
        if symbol and 'Symbol' in self.data.columns:
            plot_data = self.data[self.data['Symbol'] == symbol]
            title = f'Price Series for {symbol}'
        else:
            plot_data = self.data
            title = 'Price Series Overview'
        
        ax.plot(plot_data[self.price_column].values, linewidth=1, alpha=0.8)
        
        ax.set_title(title, fontsize=14, fontweight='bold')
        ax.set_xlabel('Time Index', fontsize=12)
        ax.set_ylabel('Price (NPR)', fontsize=12)
        
        # Add mean line
        mean_price = plot_data[self.price_column].mean()
        ax.axhline(y=mean_price, color='red', linestyle='--', alpha=0.5, 
                   label=f'Mean: {mean_price:.2f}')
        ax.legend()
        
        plt.tight_layout()
        plt.show()
        
        print("""
        📈 What to Look For:
        
        1. TREND: Is the line generally going up, down, or sideways?
        2. VOLATILITY: Are the swings large or small?
        3. OUTLIERS: Are there any extreme spikes or dips?
        4. REGIME CHANGES: Are there periods with different behavior?
        5. GAPS: Are there any missing periods?
        """)
    
    def candlestick_style_plot(self, symbol=None, n_points=100):
        """
        Create OHLC-style visualization.
        
        For NEPSE data, we can visualize the Open, High, Low, Close
        to understand daily price action.
        """
        if symbol and 'Symbol' in self.data.columns:
            plot_data = self.data[self.data['Symbol'] == symbol].tail(n_points)
        else:
            plot_data = self.data.tail(n_points)
        
        fig, axes = plt.subplots(2, 1, figsize=(14, 8), 
                                  gridspec_kw={'height_ratios': [3, 1]})
        
        # Price plot with High-Low range
        x = range(len(plot_data))
        
        # Plot High-Low range as vertical lines
        for i, (idx, row) in enumerate(plot_data.iterrows()):
            color = 'green' if row['Close'] >= row['Open'] else 'red'
            # High-Low line
            axes[0].plot([i, i], [row['Low'], row['High']], color=color, linewidth=1)
            # Open-Close rectangle (represented by thicker line)
            axes[0].plot([i, i], [row['Open'], row['Close']], color=color, linewidth=3)
        
        axes[0].set_title(f'Price Action (OHLC) - Last {n_points} Periods', 
                          fontsize=14, fontweight='bold')
        axes[0].set_ylabel('Price (NPR)')
        axes[0].grid(True, alpha=0.3)
        
        # Volume plot
        colors = ['green' if c >= o else 'red' 
                  for c, o in zip(plot_data['Close'], plot_data['Open'])]
        axes[1].bar(x, plot_data['Vol'], color=colors, alpha=0.7)
        axes[1].set_title('Volume', fontsize=12)
        axes[1].set_ylabel('Volume')
        axes[1].set_xlabel('Time Index')
        
        plt.tight_layout()
        plt.show()
        
        print("""
        📊 Reading OHLC Charts:
        
        GREEN bars: Close >= Open (Bullish or flat day)
        RED bars:   Close < Open (Bearish day)
        
        Vertical line: High to Low range
        Thick part: Open to Close range
        
        Volume bars show trading activity
        """)
    
    def distribution_analysis(self, symbol=None):
        """
        Analyze the distribution of prices and returns.
        """
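        # Work on a copy so adding the 'Month' column below neither mutates
        # self.data nor triggers a SettingWithCopyWarning
        if symbol and 'Symbol' in self.data.columns:
            plot_data = self.data[self.data['Symbol'] == symbol].copy()
        else:
            plot_data = self.data.copy()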
        
        prices = plot_data[self.price_column]
        returns = prices.pct_change().dropna()
        
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Price distribution
        axes[0, 0].hist(prices, bins=50, edgecolor='black', alpha=0.7, density=True)
        prices.plot(kind='kde', ax=axes[0, 0], color='red', linewidth=2)
        axes[0, 0].set_title('Price Distribution', fontsize=12, fontweight='bold')
        axes[0, 0].set_xlabel('Price (NPR)')
        axes[0, 0].axvline(prices.mean(), color='blue', linestyle='--', label='Mean')
        axes[0, 0].legend()
        
        # Return distribution
        axes[0, 1].hist(returns, bins=50, edgecolor='black', alpha=0.7, density=True)
        returns.plot(kind='kde', ax=axes[0, 1], color='red', linewidth=2)
        axes[0, 1].set_title('Return Distribution', fontsize=12, fontweight='bold')
        axes[0, 1].set_xlabel('Daily Return')
        axes[0, 1].axvline(0, color='blue', linestyle='--')
        
        # QQ plot for normality check
        from scipy import stats
        stats.probplot(returns, dist="norm", plot=axes[1, 0])
        axes[1, 0].set_title('Q-Q Plot (Normality Check)', fontsize=12, fontweight='bold')
        axes[1, 0].grid(True, alpha=0.3)
        
        # Box plot by period (e.g., month)
        if 'Date' in plot_data.columns:
            plot_data['Month'] = pd.to_datetime(plot_data['Date']).dt.month
            plot_data.boxplot(column=self.price_column, by='Month', ax=axes[1, 1])
            axes[1, 1].set_title('Price by Month', fontsize=12, fontweight='bold')
        else:
            # Rolling statistics instead
            rolling_mean = prices.rolling(window=20).mean()
            rolling_std = prices.rolling(window=20).std()
            axes[1, 1].plot(rolling_mean.values, label='Rolling Mean')
            axes[1, 1].fill_between(range(len(rolling_mean)), 
                                     rolling_mean - 2*rolling_std,
                                     rolling_mean + 2*rolling_std, 
                                     alpha=0.2, label='±2 Std Dev')
            axes[1, 1].set_title('Rolling Statistics (20-period)', fontsize=12, fontweight='bold')
            axes[1, 1].legend()
        
        plt.tight_layout()
        plt.show()
        
        print("""
        📊 Distribution Analysis Insights:
        
        1. PRICE DISTRIBUTION:
           - Bell-shaped: Normal distribution
           - Skewed right: More low prices, few high prices
           - Multiple peaks: Multiple regimes or stocks
        
        2. RETURN DISTRIBUTION:
           - Should be centered around 0
           - Fat tails: Extreme returns more common than normal
           - Common in stock markets (leptokurtic)
        
        3. Q-Q PLOT:
           - Points on line: Normal distribution
           - S-curves: Heavy tails
           - Deviations at ends: Outliers present
        """)
    
    def seasonal_subseries_plot(self, period=5):
        """
        Create subseries plot to visualize seasonal patterns.
        
        Each subseries shows the values for a specific "season"
        (e.g., day of week) across time.
        """
        prices = self.data[self.price_column].values
        n = len(prices)
        
        # Group by period position
        fig, axes = plt.subplots(1, period, figsize=(14, 4), sharey=True)
        
        for i in range(period):
            # Get all values at position i in each period
            indices = range(i, n, period)
            values = prices[list(indices)]
            
            axes[i].plot(values, linewidth=1)
            axes[i].axhline(y=values.mean(), color='red', linestyle='--', linewidth=2)
            axes[i].set_title(f'Position {i+1}')
            axes[i].set_xlabel('Period Number')
        
        axes[0].set_ylabel('Price (NPR)')
        fig.suptitle('Subseries Plot (by Position in Period)', fontsize=14, fontweight='bold')
        
        plt.tight_layout()
        plt.show()
        
        print("""
        📈 Subseries Plot Interpretation:
        
        Each panel shows all values at a specific position within each period.
        - For period=5 (trading week): each position maps to a fixed weekday
          only if no trading days are missing from the series
        - Horizontal red line: Mean for that position
        - Consistent patterns across positions suggest seasonality
        """)
    
    def lag_scatter_plot(self, lags=(1, 2, 5, 10)):
        """
        Create scatter plots of y(t) vs y(t-k).
        
        This visualizes autocorrelation structure. The 2x2 grid
        holds at most four lags.
        """
        prices = self.data[self.price_column].dropna()
        
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        
        for i, lag in enumerate(lags):
            row, col = i // 2, i % 2
            ax = axes[row, col]
            
            y = prices.values[lag:]
            y_lag = prices.values[:-lag]
            
            ax.scatter(y_lag, y, alpha=0.3, s=10)
            
            # Add regression line
            z = np.polyfit(y_lag, y, 1)
            p = np.poly1d(z)
            ax.plot(y_lag, p(y_lag), "r--", alpha=0.8, linewidth=2)
            
            # Calculate correlation
            corr = np.corrcoef(y_lag, y)[0, 1]
            
            ax.set_title(f'Lag {lag}: Correlation = {corr:.3f}', fontsize=12)
            ax.set_xlabel(f'y(t-{lag})')
            ax.set_ylabel('y(t)')
            ax.grid(True, alpha=0.3)
        
        fig.suptitle('Lag Scatter Plots', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        print("""
        📊 Lag Scatter Plot Interpretation:
        
        Strong diagonal pattern = High autocorrelation
        Scattered points = Low autocorrelation
        
        For NEPSE price levels:
        - Lag 1 usually shows strong positive correlation
        - Correlation decreases as lag increases
        - A tight diagonal mostly reflects persistence in price levels;
          repeat on returns before inferring a model type (AR, MA, etc.)
        """)
    
    def comprehensive_visualization(self, symbol=None):
        """
        Create a comprehensive multi-panel visualization.
        """
        if symbol and 'Symbol' in self.data.columns:
            plot_data = self.data[self.data['Symbol'] == symbol].copy()
            title_suffix = f' - {symbol}'
        else:
            plot_data = self.data.copy()
            title_suffix = ''
        
        prices = plot_data[self.price_column]
        returns = prices.pct_change().dropna()
        
        fig = plt.figure(figsize=(16, 12))
        
        # 1. Price time series
        ax1 = plt.subplot(3, 2, 1)
        ax1.plot(prices.values, linewidth=1)
        ax1.set_title(f'Price Series{title_suffix}', fontweight='bold')
        ax1.set_ylabel('Price (NPR)')
        ax1.grid(True, alpha=0.3)
        
        # 2. Returns
        ax2 = plt.subplot(3, 2, 2)
        ax2.plot(returns.values, linewidth=1, color='green')
        ax2.axhline(0, color='black', linestyle='-', alpha=0.3)
        ax2.set_title('Daily Returns', fontweight='bold')
        ax2.set_ylabel('Return')
        ax2.grid(True, alpha=0.3)
        
        # 3. Return distribution
        ax3 = plt.subplot(3, 2, 3)
        ax3.hist(returns, bins=50, edgecolor='black', alpha=0.7, density=True)
        ax3.set_title('Return Distribution', fontweight='bold')
        ax3.set_xlabel('Return')
        ax3.set_ylabel('Density')
        ax3.grid(True, alpha=0.3)
        
        # 4. ACF
        ax4 = plt.subplot(3, 2, 4)
        from statsmodels.graphics.tsaplots import plot_acf
        plot_acf(returns.dropna(), lags=30, ax=ax4)
        ax4.set_title('ACF of Returns', fontweight='bold')
        
        # 5. Rolling volatility
        ax5 = plt.subplot(3, 2, 5)
        volatility = returns.rolling(window=20).std()
        ax5.plot(volatility.values, linewidth=1, color='red')
        ax5.set_title('Rolling Volatility (20-day)', fontweight='bold')
        ax5.set_ylabel('Std Dev')
        ax5.set_xlabel('Time Index')
        ax5.grid(True, alpha=0.3)
        
        # 6. Price vs Volume
        ax6 = plt.subplot(3, 2, 6)
        if 'Vol' in plot_data.columns:
            ax6_twin = ax6.twinx()
            ax6.plot(prices.values, linewidth=1, label='Price')
            ax6_twin.bar(range(len(plot_data)), plot_data['Vol'].values, 
                         alpha=0.3, color='orange', label='Volume')
            ax6.set_title('Price vs Volume', fontweight='bold')
            ax6.set_ylabel('Price (NPR)')
            ax6_twin.set_ylabel('Volume')
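            # Merge handles from both axes so 'Volume' also appears in the legend
            handles, labels = ax6.get_legend_handles_labels()
            handles_twin, labels_twin = ax6_twin.get_legend_handles_labels()
            ax6.legend(handles + handles_twin, labels + labels_twin, loc='upper left')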
            ax6.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        print(f"""
        📊 Comprehensive Visualization Summary{title_suffix}
        
        Panel 1 (Price Series):
        - Shows overall trend and level
        
        Panel 2 (Returns):
        - Centered around 0 for stationary returns
        - Clusters indicate volatility clustering
        
        Panel 3 (Distribution):
        - Should be roughly bell-shaped
        - Fat tails common in stock returns
        
        Panel 4 (ACF):
        - Shows autocorrelation structure
        - Helps determine model order
        
        Panel 5 (Volatility):
        - Shows changing volatility over time
        - Peaks indicate turbulent periods
        
        Panel 6 (Price vs Volume):
        - High volume often accompanies large moves
        - Can reveal support/resistance levels
        """)


# ============================================================
# EXAMPLE USAGE
# ============================================================

print("\n" + "=" * 70)
print("TIME-SERIES VISUALIZATION FOR NEPSE DATA")
print("=" * 70)

# Create sample data
viz_data = create_comprehensive_nepse_data()
single_stock_data = viz_data[viz_data['Symbol'] == 'ABL'].copy()

# Initialize visualizer
visualizer = TimeSeriesVisualizer(single_stock_data, price_column='Close')

# Create visualizations
visualizer.line_plot_overview()
visualizer.candlestick_style_plot(n_points=50)
visualizer.distribution_analysis()
visualizer.seasonal_subseries_plot(period=5)
visualizer.lag_scatter_plot()
visualizer.comprehensive_visualization()
```
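
The correlation printed in each lag scatter panel is the lag-k sample autocorrelation. The sketch below uses a synthetic random-walk price series rather than real NEPSE data (the seed, drift, and volatility are illustrative assumptions, and `lag_correlation` is a hypothetical helper). It checks that the `np.corrcoef` calculation used in `lag_scatter_plot` agrees with pandas' `Series.autocorr`, and contrasts price levels with returns.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (illustrative assumption, not real NEPSE data)
rng = np.random.default_rng(42)
sim_returns = rng.normal(0.0003, 0.015, 500)
sim_prices = pd.Series(500 * np.cumprod(1 + sim_returns))


def lag_correlation(series, lag):
    """Pearson correlation between y(t-lag) and y(t), as in the scatter plots."""
    values = series.to_numpy()
    return np.corrcoef(values[:-lag], values[lag:])[0, 1]


pct_returns = sim_prices.pct_change().dropna()

for lag in (1, 2, 5, 10):
    price_corr = lag_correlation(sim_prices, lag)
    return_corr = lag_correlation(pct_returns, lag)
    print(f"Lag {lag:2d}: prices = {price_corr:.3f}, returns = {return_corr:.3f}")

# pandas computes the same quantity via Series.autocorr
print(np.isclose(lag_correlation(sim_prices, 1), sim_prices.autocorr(lag=1)))
```

For a random walk like this one, price levels typically stay highly correlated across lags while return correlations sit near zero, so a tight diagonal in the price scatter plot does not by itself imply predictable returns.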

---

### **2.7 Statistical Summary and Diagnostics**

This final section gathers basic statistics, stationarity tests, autocorrelation analysis, normality tests, and volatility analysis into one systematic diagnostic report.
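
For reference, two of the tests used by the report below can be written out explicitly. The notation here ($y_t$ for the series, $n$ for the number of observations, $h$ for the number of lags tested, $p$ for the number of lagged differences) is chosen for this summary and is an assumption, not something taken from the code.

The Augmented Dickey-Fuller test with a constant term fits the regression

$$
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
$$

and tests the null hypothesis $\gamma = 0$ (a unit root, i.e. non-stationarity) against $\gamma < 0$.

The Ljung-Box statistic for the first $h$ sample autocorrelations $\hat{\rho}_k$ is

$$
Q(h) = n(n+2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^{2}}{n-k}
$$

which is approximately $\chi^2_h$ distributed under the null hypothesis of no autocorrelation up to lag $h$. A small p-value therefore indicates significant autocorrelation.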

```python
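# Imports repeated here so this cell runs on its own
from statsmodels.tsa.stattools import adfuller, acf
from statsmodels.stats.diagnostic import acorr_ljungbox, het_arch


class TimeSeriesDiagnostics: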
    """
    Comprehensive statistical diagnostics for time-series data.
    
    This class provides systematic methods to evaluate
    time-series properties and generate diagnostic reports.
    """
    
    def __init__(self, data, price_column='Close'):
        self.data = data.copy()
        self.price_column = price_column
        self.diagnostics = {}
    
    def generate_full_report(self):
        """
        Generate a comprehensive diagnostic report.
        """
        print("=" * 70)
        print("TIME-SERIES DIAGNOSTIC REPORT")
        print("=" * 70)
        
        prices = self.data[self.price_column].dropna()
        returns = prices.pct_change().dropna()
        
        # ========================================
        # 1. Basic Statistics
        # ========================================
        print("\n📊 SECTION 1: BASIC STATISTICS")
        print("-" * 70)
        
        self.diagnostics['basic_stats'] = {
            'n_observations': len(prices),
            'mean': prices.mean(),
            'std': prices.std(),
            'min': prices.min(),
            'max': prices.max(),
            'median': prices.median(),
            'skewness': prices.skew(),
            'kurtosis': prices.kurtosis()
        }
        
        for key, value in self.diagnostics['basic_stats'].items():
            if isinstance(value, float):
                print(f"   {key:20s}: {value:.4f}")
            else:
                print(f"   {key:20s}: {value}")
        
        # ========================================
        # 2. Stationarity Tests
        # ========================================
        print("\n📊 SECTION 2: STATIONARITY TESTS")
        print("-" * 70)
        
        # ADF test on prices
        adf_price = adfuller(prices, autolag='AIC')
        print(f"\n   ADF Test on Prices:")
        print(f"      Statistic: {adf_price[0]:.4f}")
        print(f"      p-value:   {adf_price[1]:.6f}")
        print(f"      Result:    {'Stationary' if adf_price[1] < 0.05 else 'Non-Stationary'}")
        
        # ADF test on returns
        adf_return = adfuller(returns.dropna(), autolag='AIC')
        print(f"\n   ADF Test on Returns:")
        print(f"      Statistic: {adf_return[0]:.4f}")
        print(f"      p-value:   {adf_return[1]:.6f}")
        print(f"      Result:    {'Stationary' if adf_return[1] < 0.05 else 'Non-Stationary'}")
        
        self.diagnostics['stationarity'] = {
            'price_adf_pvalue': adf_price[1],
            'return_adf_pvalue': adf_return[1],
            'price_stationary': adf_price[1] < 0.05,
            'return_stationary': adf_return[1] < 0.05
        }
        
        # ========================================
        # 3. Autocorrelation Analysis
        # ========================================
        print("\n📊 SECTION 3: AUTOCORRELATION ANALYSIS")
        print("-" * 70)
        
        # Ljung-Box test
        lb_result = acorr_ljungbox(returns.dropna(), lags=[5, 10, 20], return_df=True)
        
        print("\n   Ljung-Box Test on Returns:")
        for lag in [5, 10, 20]:
            pval = lb_result.loc[lag, 'lb_pvalue']
            result = 'Significant' if pval < 0.05 else 'Not Significant'
            print(f"      Lag {lag}: p-value = {pval:.6f} ({result})")
        
        # ACF of squared returns (for ARCH effects)
        sq_returns = returns.dropna() ** 2
        acf_sq = acf(sq_returns, nlags=10, fft=True)
        
        print(f"\n   ACF of Squared Returns (ARCH indicator):")
        print(f"      Lag 1: {acf_sq[1]:.4f}")
        print(f"      Lag 5: {acf_sq[5]:.4f}")
        
        if acf_sq[1] > 0.1:
            print("      → ARCH effects may be present")
        
        # ========================================
        # 4. Normality Tests
        # ========================================
        print("\n📊 SECTION 4: NORMALITY TESTS")
        print("-" * 70)
        
        from scipy.stats import jarque_bera, shapiro
        
        # Jarque-Bera test
        jb_stat, jb_pvalue = jarque_bera(returns.dropna())
        print(f"\n   Jarque-Bera Test:")
        print(f"      Statistic: {jb_stat:.4f}")
        print(f"      p-value:   {jb_pvalue:.6f}")
        print(f"      Result:    {'Not Normal' if jb_pvalue < 0.05 else 'Normal'}")
        
        # Shapiro-Wilk test (SciPy's p-value is less reliable above ~5000 observations)
        if len(returns) <= 5000:
            sw_stat, sw_pvalue = shapiro(returns.values)
            print(f"\n   Shapiro-Wilk Test:")
            print(f"      Statistic: {sw_stat:.4f}")
            print(f"      p-value:   {sw_pvalue:.6f}")
            print(f"      Result:    {'Not Normal' if sw_pvalue < 0.05 else 'Normal'}")
        
        print(f"\n   Skewness: {returns.skew():.4f}")
        print(f"   Excess kurtosis: {returns.kurtosis():.4f}")
        print("   (Normal = skewness ≈ 0, excess kurtosis ≈ 0)")
        
        # ========================================
        # 5. Volatility Analysis
        # ========================================
        print("\n📊 SECTION 5: VOLATILITY ANALYSIS")
        print("-" * 70)
        
        # ARCH test
        arch_result = het_arch(returns.dropna(), nlags=5)
        
        print(f"\n   Engle's ARCH Test:")
        print(f"      LM Statistic: {arch_result[0]:.4f}")
        print(f"      p-value:      {arch_result[1]:.6f}")
        print(f"      Result:       {'ARCH Effects Present' if arch_result[1] < 0.05 else 'No ARCH Effects'}")
        
        # Rolling volatility stats
        rolling_vol = returns.rolling(window=20).std()
        
        print(f"\n   Rolling Volatility Statistics:")
        print(f"      Mean:   {rolling_vol.mean():.6f}")
        print(f"      Std:    {rolling_vol.std():.6f}")
        print(f"      Min:    {rolling_vol.min():.6f}")
        print(f"      Max:    {rolling_vol.max():.6f}")
        
        # ========================================
        # 6. Summary and Recommendations
        # ========================================
        print("\n" + "=" * 70)
        print("DIAGNOSTIC SUMMARY AND RECOMMENDATIONS")
        print("=" * 70)
        
        recommendations = []
        
        # Check stationarity
        if not self.diagnostics['stationarity']['price_stationary']:
            recommendations.append({
                'issue': 'Non-stationary prices',
                'impact': 'Cannot use standard AR models directly',
                'solution': 'Apply differencing or use ARIMA models'
            })
        
        # Check for ARCH effects
        if arch_result[1] < 0.05:
            recommendations.append({
                'issue': 'ARCH effects detected',
                'impact': 'Volatility clustering present',
                'solution': 'Consider GARCH models for volatility forecasting'
            })
        
        # Check normality
        if jb_pvalue < 0.05:
            recommendations.append({
                'issue': 'Non-normal returns',
                'impact': 'Normal-based confidence and prediction intervals may be unreliable',
                'solution': 'Use robust standard errors or bootstrap'
            })
        
        # Check autocorrelation
        if lb_result.loc[5, 'lb_pvalue'] < 0.05:
            recommendations.append({
                'issue': 'Significant autocorrelation in returns',
                'impact': 'Returns show some linear dependence on past values',
                'solution': 'Use AR/MA terms or lagged features'
            })
        
        print("\n📋 Issues Detected and Solutions:")
        print("-" * 70)
        
        if recommendations:
            for i, rec in enumerate(recommendations, 1):
                print(f"\n   Issue {i}: {rec['issue']}")
                print(f"      Impact:   {rec['impact']}")
                print(f"      Solution: {rec['solution']}")
        else:
            print("\n   ✓ No major issues detected!")
        
        # Model suggestions
        print("\n📋 Suggested Modeling Approaches:")
        print("-" * 70)
        
        if self.diagnostics['stationarity']['return_stationary']:
            print("   ✓ Returns are stationary → Can use ARMA models")
        
        if arch_result[1] < 0.05:
            print("   ✓ ARCH effects present → Consider GARCH modeling")
        
        print("\n   Recommended models for NEPSE price prediction:")
        print("   1. ARIMA for point forecasts (after differencing)")
        print("   2. GARCH for volatility forecasting")
        print("   3. Machine Learning (Random Forest, XGBoost) with lag features")
        print("   4. LSTM/Transformer for sequence modeling")
        
        return self.diagnostics


# ============================================================
# FINAL EXAMPLE - RUNNING COMPLETE DIAGNOSTICS
# ============================================================

print("\n" + "=" * 70)
print("COMPLETE TIME-SERIES DIAGNOSTICS FOR NEPSE DATA")
print("=" * 70)

# Recreate the sample data with the helper defined earlier in the chapter
diagnostic_data = create_comprehensive_nepse_data()
single_stock = diagnostic_data[diagnostic_data['Symbol'] == 'ABL'].copy()

# Run diagnostics
diagnostics = TimeSeriesDiagnostics(single_stock, price_column='Close')
report = diagnostics.generate_full_report()


# ============================================================
# CHAPTER SUMMARY
# ============================================================

print("\n" + "=" * 70)
print("CHAPTER 2 SUMMARY")
print("=" * 70)

print("""
📚 KEY CONCEPTS COVERED:

1. TIME-SERIES CHARACTERISTICS
   • Temporal ordering - sequence matters
   • Time intervals - regular vs irregular
   • Sequential dependence - past affects future

2. TIME-SERIES COMPONENTS
   • Trend - long-term direction
   • Seasonality - repeating patterns (fixed period)
   • Cyclicality - longer-term fluctuations (variable period)
   • Irregularity - random noise/residual

3. TIME-SERIES PROPERTIES
   • Stationarity - constant statistical properties
   • Autocorrelation - correlation with lagged self
   • Heteroscedasticity - changing variance

4. DATA CHALLENGES
   • Missing values and imputation
   • Outliers and their treatment
   • Non-stationarity and transformation

5. ANALYSIS TECHNIQUES
   • Visual inspection methods
   • Statistical tests (ADF, KPSS, ARCH, Ljung-Box)
   • Diagnostic reporting

💡 FOR NEPSE PREDICTION:

1. ALWAYS CHECK STATIONARITY FIRST
   - Prices are typically non-stationary
   - Returns are typically stationary
   - Difference prices (or model returns) before fitting

2. ANALYZE AUTOCORRELATION
   - Helps determine model order
   - Identifies exploitable patterns
   - Guides feature engineering

3. CONSIDER VOLATILITY CLUSTERING
   - Common in stock markets
   - May need GARCH models
   - Important for risk management

4. VISUALIZE YOUR DATA
   - Patterns visible in plots
   - Helps identify issues
   - Guides modeling decisions

📖 NEXT STEPS:

In Chapter 3, we will set up the development environment
for building time-series prediction systems, including:
- Python installation and configuration
- Essential libraries for time-series analysis
- IDE setup and best practices
- Project structure for prediction systems
""")
```
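
The Jarque-Bera statistic reported in Section 4 of the diagnostic report can be reproduced by hand from the sample skewness and kurtosis. The following is a minimal sketch under stated assumptions: the helper name `jarque_bera_manual` and both simulated samples are hypothetical, and the p-value relies on the fact that a chi-square distribution with 2 degrees of freedom has survival function $e^{-x/2}$.

```python
import numpy as np


def jarque_bera_manual(x):
    """Jarque-Bera statistic and asymptotic p-value (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)               # biased variance
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2   # equals 3 for a normal distribution
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = np.exp(-jb / 2.0)               # chi-square(2) survival function
    return jb, p_value


# Simulated samples (illustrative assumptions, not real NEPSE returns)
rng = np.random.default_rng(0)
normal_sample = rng.normal(0.0, 0.015, 2000)       # thin-tailed
fat_tail_sample = 0.01 * rng.standard_t(3, 2000)   # fat-tailed (Student t, 3 df)

for name, sample in [("normal", normal_sample), ("fat-tailed", fat_tail_sample)]:
    jb, p = jarque_bera_manual(sample)
    print(f"{name:10s}: JB = {jb:10.2f}, p-value = {p:.6f}")
```

The fat-tailed sample produces a far larger statistic because the kurtosis term dominates. This is the pattern to expect from real stock returns, and it is why the report often flags them as non-normal.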

---



