# **Chapter 7: Exploratory Data Analysis**

---

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Conduct systematic univariate analysis to understand individual feature distributions
- Perform bivariate and multivariate analysis to identify relationships between variables
- Apply time-series specific EDA techniques (trend, seasonality, autocorrelation)
- Decompose time-series into constituent components
- Create publication-quality visualizations following best practices
- Generate automated EDA reports for rapid data understanding
- Communicate data insights effectively to technical and non-technical stakeholders
- Develop a comprehensive EDA checklist for production systems

---

## **Prerequisites**

- Completed Chapter 6: Data Cleaning and Preprocessing
- Understanding of statistical concepts (mean, variance, correlation)
- Familiarity with matplotlib and seaborn basics
- NEPSE dataset loaded and cleaned from previous chapters

---

## **7.1 The EDA Process**

Exploratory Data Analysis (EDA) is the critical bridge between data collection and model building. For time-series prediction systems, EDA must go beyond standard statistical summaries to uncover temporal patterns, regime changes, and feature interactions that drive predictive signals.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set visualization style for professional output
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

class NEPSEEDAFramework:
    """
    Comprehensive EDA framework for NEPSE time-series data.
    
    The EDA process follows these phases:
    1. Understanding (What do we have?)
    2. Cleaning validation (Did our cleaning work?)
    3. Univariate analysis (How does each feature behave?)
    4. Bivariate analysis (How do features relate?)
    5. Temporal analysis (How does it change over time?)
    6. Quality assessment (Is this data suitable for modeling?)
    """
    
    def __init__(self, df: pd.DataFrame, symbol: str):
        self.df = df.copy()
        self.symbol = symbol
        
        # Ensure datetime index before reading the date range
        if 'Date' in self.df.columns:
            self.df['Date'] = pd.to_datetime(self.df['Date'])
            self.df.set_index('Date', inplace=True)
        
        has_dates = isinstance(self.df.index, pd.DatetimeIndex)
        self.report = {
            'symbol': symbol,
            'start_date': self.df.index.min() if has_dates else None,
            'end_date': self.df.index.max() if has_dates else None,
            'findings': []
        }
        
        print(f"EDA Framework initialized for {symbol}")
        print(f"Data shape: {self.df.shape}")
        print(f"Date range: {self.report['start_date']} to {self.report['end_date']}")
    
    def add_finding(self, category: str, description: str, severity: str = 'info'):
        """Log findings for final report."""
        self.report['findings'].append({
            'category': category,
            'description': description,
            'severity': severity,
            'timestamp': pd.Timestamp.now()
        })
        print(f"[{severity.upper()}] {category}: {description}")

# Initialize with NEPSE data
# Creating comprehensive sample data for this chapter
np.random.seed(42)
dates = pd.date_range('2022-01-01', '2024-01-01', freq='B')  # 2 years of business days
n = len(dates)

# Generate realistic NEPSE data with trends and seasonality
trend = np.linspace(2800, 3200, n)
seasonal = 100 * np.sin(2 * np.pi * np.arange(n) / 252)  # Annual seasonality
noise = np.cumsum(np.random.randn(n) * 5)  # Random walk
volume_trend = np.linspace(100000, 150000, n)

# OHLC with realistic relationships
close = trend + seasonal + noise
open_price = close + np.random.randn(n) * 10
high = np.maximum(open_price, close) + np.random.uniform(10, 50, n)
low = np.minimum(open_price, close) - np.random.uniform(10, 50, n)
volume = volume_trend + np.random.randint(-20000, 20000, n)

nepse_eda = pd.DataFrame({
    'Open': open_price,
    'High': high,
    'Low': low,
    'Close': close,
    'Volume': volume,
    'Symbol': 'NABIL'
}, index=dates)

# Add some features for analysis
nepse_eda['Returns'] = nepse_eda['Close'].pct_change()
nepse_eda['Log_Returns'] = np.log(nepse_eda['Close'] / nepse_eda['Close'].shift(1))
nepse_eda['Volatility'] = nepse_eda['Returns'].rolling(20).std()
nepse_eda['MA_20'] = nepse_eda['Close'].rolling(20).mean()
nepse_eda['MA_50'] = nepse_eda['Close'].rolling(50).mean()

eda_framework = NEPSEEDAFramework(nepse_eda, 'NABIL')
eda_framework.add_finding('Data Loading', f'Loaded {len(nepse_eda)} records', 'info')
```

**Explanation:**
- **The EDA Framework** provides structure to the exploration process. Without structure, EDA becomes random plotting without clear hypotheses.
- **Phases of EDA**:
  1. **Understanding**: Basic shape, types, ranges
  2. **Cleaning validation**: Verify preprocessing worked (no unexpected nulls, ranges correct)
  3. **Univariate**: Distribution of each variable (normal? skewed? bimodal?)
  4. **Bivariate**: Relationships between pairs (correlation, causation hints)
  5. **Temporal**: How things change over time (trends, cycles, anomalies)
  6. **Quality assessment**: Suitability for modeling (stationarity, feature engineering opportunities)
- **Data Generation**: We create 2 years of synthetic NEPSE data with realistic components:
  - **Trend**: Long-term upward drift (2800 → 3200)
  - **Seasonality**: Annual cycle (252 trading days)
  - **Noise**: Random walk (more realistic than white noise for prices)
  - **OHLC Logic**: High ≥ max(Open, Close), Low ≤ min(Open, Close)
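
The OHLC logic above doubles as a cleaning-validation check: any row where High or Low fails to bracket Open and Close points to a data problem. The following is a minimal sketch, assuming a DataFrame with `Open`, `High`, `Low`, and `Close` columns; the helper name `count_ohlc_violations` is hypothetical and not part of the framework class.

```python
import pandas as pd

def count_ohlc_violations(df: pd.DataFrame) -> int:
    """Count rows where High/Low do not bracket Open and Close."""
    body_top = df[['Open', 'Close']].max(axis=1)
    body_bottom = df[['Open', 'Close']].min(axis=1)
    bad = (df['High'] < body_top) | (df['Low'] > body_bottom)
    return int(bad.sum())

# Tiny illustrative frame: the second row has High below Close
sample = pd.DataFrame({
    'Open':  [100.0, 102.0, 101.0],
    'High':  [105.0, 103.0, 104.0],
    'Low':   [99.0, 100.0, 100.0],
    'Close': [104.0, 104.0, 102.0],
})
n_bad = count_ohlc_violations(sample)
print(f"Rows violating OHLC logic: {n_bad}")
```

A non-zero count could be logged through `add_finding` with a `'warning'` severity before moving on to the univariate phase.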

---

## **7.2 Univariate Analysis**

Understanding individual variables is the foundation of EDA. For time-series, we must examine both the distribution of values and the distribution over time.

### **7.2.1 Distribution Analysis**

```python
class UnivariateAnalyzer:
    """Analyze individual features in isolation."""
    
    def __init__(self, df: pd.DataFrame):
        self.df = df
    
    def numerical_summary(self, column: str) -> pd.Series:
        """
        Comprehensive statistical summary beyond basic describe().
        
        Includes moments, percentiles, and distribution shape metrics.
        """
        data = self.df[column].dropna()
        
        summary = {
            'count': len(data),
            'mean': data.mean(),
            'std': data.std(),
            'min': data.min(),
            'max': data.max(),
            'range': data.max() - data.min(),
            'skewness': stats.skew(data),
            'kurtosis': stats.kurtosis(data),
            'median': data.median(),
            'iqr': data.quantile(0.75) - data.quantile(0.25),
            'cv': data.std() / data.mean(),  # Coefficient of variation
            'jarque_bera_pvalue': stats.jarque_bera(data)[1],  # Normality test
            # Fixed random_state keeps the subsample (and the p-value) reproducible
            'shapiro_pvalue': stats.shapiro(data.sample(min(5000, len(data)), random_state=42))[1] if len(data) >= 3 else np.nan
        }
        
        # Add percentiles
        for p in [1, 5, 10, 25, 50, 75, 90, 95, 99]:
            summary[f'p{p}'] = data.quantile(p/100)
        
        return pd.Series(summary)
    
    def plot_distribution(self, column: str, figsize=(15, 10)):
        """
        Create comprehensive distribution visualization.
        
        Includes: histogram, KDE, box plot, Q-Q plot, and time series.
        """
        data = self.df[column].dropna()
        
        fig, axes = plt.subplots(2, 2, figsize=figsize)
        fig.suptitle(f'Univariate Analysis: {column}', fontsize=16, fontweight='bold')
        
        # 1. Histogram with KDE
        ax1 = axes[0, 0]
        sns.histplot(data, kde=True, ax=ax1, color='skyblue', alpha=0.7)
        ax1.axvline(data.mean(), color='red', linestyle='--', label=f'Mean: {data.mean():.2f}')
        ax1.axvline(data.median(), color='green', linestyle='--', label=f'Median: {data.median():.2f}')
        ax1.set_title('Distribution with Central Tendencies')
        ax1.legend()
        
        # 2. Box plot
        ax2 = axes[0, 1]
        sns.boxplot(y=data, ax=ax2, color='lightcoral')
        ax2.set_title('Box Plot (Quartiles & Outliers)')
        
        # 3. Q-Q plot for normality assessment
        ax3 = axes[1, 0]
        stats.probplot(data, dist="norm", plot=ax3)
        ax3.set_title('Q-Q Plot vs Normal Distribution')
        
        # 4. Time series plot
        ax4 = axes[1, 1]
        ax4.plot(data.index, data, alpha=0.7, color='steelblue')
        ax4.set_title('Time Series View')
        ax4.set_xlabel('Date')
        ax4.tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        return fig

# Analyze Close price distribution
analyzer = UnivariateAnalyzer(nepse_eda)
close_stats = analyzer.numerical_summary('Close')
print("Close Price Statistical Summary:")
print(close_stats)

# Create visualization
fig = analyzer.plot_distribution('Close')
plt.show()

# Analyze returns (more interesting distribution)
returns_stats = analyzer.numerical_summary('Returns')
print("\nReturns Statistical Summary:")
print(returns_stats)
print(f"\nSkewness: {returns_stats['skewness']:.4f} (0=symmetric, >0=right tail)")
print(f"Kurtosis: {returns_stats['kurtosis']:.4f} (0=normal, >0=fat tails)")
print(f"Jarque-Bera p-value: {returns_stats['jarque_bera_pvalue']:.2e} (<0.05 = reject normality)")
```

**Explanation:**
- **Statistical Moments**:
  - **Skewness**: Measures asymmetry. Equity returns often show negative skew (crashes are sudden, gains gradual).
  - **Kurtosis**: Measures tail thickness. Financial data typically has excess kurtosis (>0, "fat tails"), meaning extreme events are more likely than a normal distribution predicts. The synthetic series in this chapter is built from Gaussian noise, so its returns will look closer to normal than real market data usually does.
  - **Jarque-Bera Test**: Statistical test for normality. A low p-value (< 0.05) means the data are unlikely to be normally distributed, which matters when choosing statistical models.
- **Visualization Components**:
  - **Histogram + KDE**: Shows shape (unimodal, bimodal?), central tendency, spread
  - **Box Plot**: Identifies outliers (points beyond whiskers), shows IQR (box)
  - **Q-Q Plot**: Compares data quantiles to theoretical normal distribution. Straight line = normal. Curved = skewed. Heavy tails = S-shaped.
  - **Time Series**: Shows if distribution changes over time (non-stationarity)
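
To make the moment-based diagnostics concrete, the sketch below compares a normal sample with a heavy-tailed one. It assumes a Student-t distribution with 3 degrees of freedom as a stand-in for fat-tailed returns; that choice is purely illustrative and is not a claim about NEPSE data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.standard_normal(5000)
fat_tailed_sample = rng.standard_t(df=3, size=5000)  # heavy-tailed stand-in

for name, sample in [('normal', normal_sample), ('student-t(3)', fat_tailed_sample)]:
    excess_kurt = stats.kurtosis(sample)      # Fisher definition: normal is near 0
    jb_pvalue = stats.jarque_bera(sample)[1]  # small p-value rejects normality
    print(f"{name:>13}: excess kurtosis={excess_kurt:8.3f}, JB p-value={jb_pvalue:.3g}")
```

The heavy-tailed sample should show clearly positive excess kurtosis and a Jarque-Bera p-value near zero, which is the pattern to look for in real return series.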

---

### **7.2.2 Statistical Summaries**

```python
def generate_comprehensive_summary(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generate publication-ready summary statistics table.
    
    Separate handling for price levels vs returns (different interpretations).
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    
    summary_rows = []
    
    for col in numeric_cols:
        data = df[col].dropna()
        
        if 'return' in col.lower():
            # Returns interpretation
            # Rebuild a wealth curve so drawdown reflects compounding
            if 'log' in col.lower():
                wealth = np.exp(data.cumsum())
            else:
                wealth = (1 + data).cumprod()
            max_dd = (1 - wealth / wealth.cummax()).max()
            row = {
                'Variable': col,
                'Mean (%)': f"{data.mean()*100:.4f}",
                'Std Dev (%)': f"{data.std()*100:.4f}",
                'Min (%)': f"{data.min()*100:.2f}",
                'Max (%)': f"{data.max()*100:.2f}",
                'Annualized Vol (%)': f"{data.std() * np.sqrt(252) * 100:.2f}",
                # Sharpe ratio below assumes a zero risk-free rate
                'Sharpe Ratio': f"{data.mean() / data.std() * np.sqrt(252):.3f}" if data.std() > 0 else "N/A",
                'Skewness': f"{stats.skew(data):.3f}",
                'Max Drawdown (%)': f"{max_dd*100:.2f}"
            }
        else:
            # Price/Volume interpretation
            row = {
                'Variable': col,
                'Mean': f"{data.mean():.2f}",
                'Std Dev': f"{data.std():.2f}",
                'Min': f"{data.min():.2f}",
                'Max': f"{data.max():.2f}",
                'CV (%)': f"{(data.std()/data.mean())*100:.2f}",
                'P5-P95 Range': f"{data.quantile(0.95) - data.quantile(0.05):.2f}",
                'Trend (daily change)': f"{np.polyfit(range(len(data)), data, 1)[0]:.4f}",
                'N': len(data)
            }
        
        summary_rows.append(row)
    
    return pd.DataFrame(summary_rows)

# Generate summary
summary_table = generate_comprehensive_summary(nepse_eda)
print("Comprehensive NEPSE Data Summary:")
print(summary_table.to_string(index=False))

# Special analysis for trading days
def analyze_trading_patterns(df: pd.DataFrame):
    """Analyze patterns specific to trading days."""
    df = df.copy()
    df['DayOfWeek'] = df.index.dayofweek  # Monday=0, Friday=4
    df['Month'] = df.index.month
    df['Quarter'] = df.index.quarter
    
    print("\n" + "="*60)
    print("TRADING PATTERN ANALYSIS")
    print("="*60)
    
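    # Day of week effect (Monday effect, Weekend effect)
    dow_stats = df.groupby('DayOfWeek')['Returns'].agg(['mean', 'std', 'count'])
    # Map codes to names rather than assigning a fixed 5-item list, which
    # fails if a weekday is missing or the calendar is not Monday-Friday
    day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    dow_stats.index = [day_names[d] for d in dow_stats.index]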
    print("\nReturns by Day of Week:")
    print(dow_stats)
    
    # Monthly seasonality
    monthly_stats = df.groupby('Month')['Returns'].mean()
    print(f"\nBest performing month: {monthly_stats.idxmax()} ({monthly_stats.max()*100:.3f}%)")
    print(f"Worst performing month: {monthly_stats.idxmin()} ({monthly_stats.min()*100:.3f}%)")

analyze_trading_patterns(nepse_eda)
```

**Explanation:**
- **Comprehensive Summary**: Different metrics for different variable types:
  - **Returns**: Annualized volatility, Sharpe ratio (risk-adjusted return, computed here with a zero risk-free rate), maximum drawdown (worst peak-to-trough decline)
  - **Prices**: Coefficient of variation (relative dispersion), trend slope (linear trend coefficient)
- **Trading Patterns**:
  - **Day-of-Week Effect**: A calendar anomaly reported in many markets, where the first session after the weekend (Monday on a Monday-Friday calendar) tends to show lower returns
  - **Monthly Seasonality**: "Sell in May and go away" or end-of-year effects
  - These calendar effects may be predictive features for models
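
Maximum drawdown is easiest to reason about on a compounded wealth curve. The sketch below is one way to compute it, assuming simple (not log) returns and a starting wealth of 1.0; the function name `max_drawdown` is hypothetical.

```python
import pandas as pd

def max_drawdown(simple_returns: pd.Series) -> float:
    """Largest peak-to-trough loss of the compounded wealth curve."""
    wealth = (1 + simple_returns).cumprod()
    # Running peak includes the starting wealth of 1.0,
    # so a loss on the very first day is counted too
    running_peak = wealth.cummax().clip(lower=1.0)
    return float((1 - wealth / running_peak).max())

# Wealth path: 1.20 -> 0.84 -> 0.588, i.e. a 51% fall from the peak
toy_returns = pd.Series([0.20, -0.30, -0.30])
dd = max_drawdown(toy_returns)
print(f"Max drawdown: {dd:.2%}")
```

Simply summing the two losses would give 60%, which overstates the decline; compounding is what makes the difference once losses are large.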

---

### **7.2.3 Visualization Techniques**

```python
class EDAVisualizations:
    """Advanced visualization techniques for financial EDA."""
    
    def __init__(self, df: pd.DataFrame):
        self.df = df
    
    def plot_ohlc_evolution(self, n_days: int = 60):
        """
        Plot OHLC evolution with volume overlay.
        
        Professional financial chart style.
        """
        data = self.df.tail(n_days).copy()
        
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), 
                                       gridspec_kw={'height_ratios': [3, 1]},
                                       sharex=True)
        
        # Price evolution with fill between high-low
        ax1.fill_between(data.index, data['Low'], data['High'], 
                        alpha=0.3, color='gray', label='Daily Range')
        ax1.plot(data.index, data['Close'], color='black', linewidth=1.5, label='Close')
        ax1.plot(data.index, data['Open'], color='blue', linewidth=1, alpha=0.7, label='Open')
        
        # Color code by return (green up, red down)
        for i in range(len(data)-1):
            if data['Close'].iloc[i+1] >= data['Close'].iloc[i]:
                color = 'green'
            else:
                color = 'red'
            ax1.plot(data.index[i:i+2], data['Close'].iloc[i:i+2], 
                    color=color, linewidth=2, alpha=0.6)
        
        ax1.set_title(f'NEPSE OHLC Evolution (Last {n_days} days)', fontsize=14, fontweight='bold')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        ax1.set_ylabel('Price (NPR)')
        
        # Volume bars
        colors = ['green' if data['Close'].iloc[i] >= data['Open'].iloc[i] else 'red' 
                 for i in range(len(data))]
        ax2.bar(data.index, data['Volume'], color=colors, alpha=0.7, width=0.8)
        ax2.set_ylabel('Volume')
        ax2.set_xlabel('Date')
        ax2.grid(True, alpha=0.3)
        
        plt.tight_layout()
        return fig
    
    def distribution_comparison(self, columns: list):
        """
        Compare distributions of multiple variables.
        """
        fig, axes = plt.subplots(2, len(columns), figsize=(5*len(columns), 8))
        
        if len(columns) == 1:
            axes = axes.reshape(-1, 1)
        
        for i, col in enumerate(columns):
            data = self.df[col].dropna()
            
            # Histogram
            axes[0, i].hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='black')
            axes[0, i].set_title(f'{col} Distribution')
            axes[0, i].axvline(data.mean(), color='red', linestyle='--', label='Mean')
            axes[0, i].axvline(data.median(), color='green', linestyle='--', label='Median')
            axes[0, i].legend()  # Without this the Mean/Median labels never appear
            
            # Time series
            axes[1, i].plot(data.index, data, alpha=0.7, color='coral')
            axes[1, i].set_title(f'{col} Time Series')
            axes[1, i].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        return fig

# Create visualizations
viz = EDAVisualizations(nepse_eda)

# OHLC chart
fig1 = viz.plot_ohlc_evolution(n_days=60)
plt.show()

# Distribution comparison
fig2 = viz.distribution_comparison(['Close', 'Returns', 'Volume'])
plt.show()
```

**Explanation:**
- **OHLC Evolution**: Professional financial chart showing:
  - Gray shaded area between High and Low (daily range)
  - Black line for Close (most important price)
  - Color-coded segments (green for up days, red for down days)
  - Volume bars synchronized below
- **Distribution Comparison**: Side-by-side comparison of price levels (trending, non-stationary), returns (roughly stationary, fluctuating around a near-zero mean), and volume (often right-skewed).
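
The contrast between trending price levels and roughly stationary returns can be checked numerically by comparing the two halves of a sample. This is a rough sketch on a synthetic random walk with drift (the drift of 0.5 per step and the helper `half_sample_gap` are illustrative assumptions), not a formal stationarity test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 500
# Synthetic price: random walk with a drift of 0.5 per step (illustrative only)
steps = 0.5 + rng.standard_normal(n)
price = pd.Series(100 + np.cumsum(steps))
returns = price.pct_change().dropna()

def half_sample_gap(series: pd.Series) -> float:
    """Absolute difference between first-half and second-half means."""
    mid = len(series) // 2
    return float(abs(series.iloc[:mid].mean() - series.iloc[mid:].mean()))

price_gap = half_sample_gap(price)
returns_gap = half_sample_gap(returns)
print(f"Mean shift in price levels: {price_gap:.2f}")
print(f"Mean shift in returns:      {returns_gap:.5f}")
```

A large shift in the level series alongside a negligible shift in returns is the usual reason models are fitted on returns rather than raw prices.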

---

## **7.3 Bivariate Analysis**

Understanding relationships between pairs of variables reveals potential predictive features and multicollinearity issues.

### **7.3.1 Correlation Analysis**

```python
class CorrelationAnalyzer:
    """Analyze pairwise relationships between variables."""
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.select_dtypes(include=[np.number])
    
    def correlation_matrix(self, method: str = 'pearson') -> pd.DataFrame:
        """
        Calculate correlation matrix with multiple methods.
        
        Methods:
        - pearson: Linear correlation (sensitive to outliers)
        - spearman: Rank correlation (monotonic relationships, robust)
        - kendall: Concordance correlation (good for small samples)
        """
        valid_methods = ('pearson', 'spearman', 'kendall')
        if method not in valid_methods:
            raise ValueError(f"method must be one of {valid_methods}, got {method!r}")
        
        return self.df.corr(method=method)
    
    def plot_correlation_heatmap(self, method: str = 'pearson', figsize=(12, 10)):
        """
        Create annotated heatmap of correlations.
        """
        corr = self.correlation_matrix(method)
        
        # Create mask for upper triangle (redundant)
        mask = np.triu(np.ones_like(corr, dtype=bool))
        
        fig, ax = plt.subplots(figsize=figsize)
        
        sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', 
                   cmap='RdBu_r', center=0, ax=ax,
                   square=True, linewidths=0.5,
                   cbar_kws={"shrink": 0.8})
        
        ax.set_title(f'{method.capitalize()} Correlation Matrix', fontsize=14, fontweight='bold')
        plt.tight_layout()
        return fig
    
    def find_high_correlations(self, threshold: float = 0.8) -> pd.DataFrame:
        """
        Find pairs with correlation above threshold.
        
        Important for detecting multicollinearity in features.
        """
        corr = self.correlation_matrix('pearson').abs()
        
        # Get upper triangle (avoid duplicates)
        upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
        
        # Find pairs above threshold
        high_corr = []
        for col in upper.columns:
            for idx in upper.index:
                if upper.loc[idx, col] > threshold:
                    high_corr.append({
                        'Feature_1': idx,
                        'Feature_2': col,
                        'Correlation': corr.loc[idx, col],
                        'Issue': 'Multicollinearity risk' if upper.loc[idx, col] > 0.9 else 'High correlation'
                    })
        
        return pd.DataFrame(high_corr)

# Analyze correlations
corr_analyzer = CorrelationAnalyzer(nepse_eda)

# Pearson (linear)
fig_corr = corr_analyzer.plot_correlation_heatmap('pearson')
plt.show()

# Find problematic correlations
high_corr_pairs = corr_analyzer.find_high_correlations(threshold=0.7)
print("High Correlation Pairs (potential redundancy):")
print(high_corr_pairs)

# Compare Pearson vs Spearman for Returns vs Volume
pearson_corr = nepse_eda[['Returns', 'Volume']].corr().iloc[0,1]
spearman_corr = nepse_eda[['Returns', 'Volume']].corr(method='spearman').iloc[0,1]

print(f"\nReturns vs Volume:")
print(f"Pearson (linear): {pearson_corr:.4f}")
print(f"Spearman (rank): {spearman_corr:.4f}")
print("A large gap between the two would suggest a non-linear relationship or outliers")
```

**Explanation:**
- **Correlation Methods**:
  - **Pearson**: Measures linear relationships (-1 to 1). Sensitive to outliers (one bad tick can ruin correlation).
  - **Spearman**: Measures monotonic relationships (rank-based). Robust to outliers. Good for financial data with fat tails.
  - **Kendall**: Measures concordance (how often pairs agree in ranking). Good for small samples.
- **Multicollinearity**: Very high correlation between features (|r| > 0.9) causes problems in linear models (unstable coefficients). The heatmap helps identify redundant features. For example, Open and Close are highly correlated, so Close plus the High-Low range may be enough.
- **Divergence between Pearson and Spearman**: Indicates non-linear relationship or heavy influence of outliers.
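
A small example shows how the two coefficients can diverge. The sketch below uses an exponential relationship, chosen only because it is strictly increasing yet clearly non-linear; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0, 10, 50))
y = np.exp(x)  # strictly increasing in x, but far from linear

pearson_r = x.corr(y, method='pearson')
spearman_r = x.corr(y, method='spearman')
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Spearman reports a perfect monotonic relationship because the ranks agree exactly, while Pearson is pulled well below 1 by the curvature.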

---

### **7.3.2 Scatter Plots and Relationships**

```python
def plot_bivariate_relationships(df: pd.DataFrame, target: str = 'Returns'):
    """
    Create scatter plot matrix for key relationships.
    """
    # Select features for analysis
    features = ['Open', 'High', 'Low', 'Close', 'Volume', 'MA_20', target]
    features = [f for f in features if f in df.columns]
    
    data = df[features].dropna()
    
    # Create pairplot
    g = sns.pairplot(data, 
                     diag_kind='kde',
                     plot_kws={'alpha': 0.6, 's': 20},
                     diag_kws={'fill': True},
                     height=2.5,
                     aspect=1)
    
    g.fig.suptitle(f'Bivariate Relationships (Target: {target})', 
                   y=1.02, fontsize=16, fontweight='bold')
    
    plt.tight_layout()
    return g

# Create scatter matrix
g = plot_bivariate_relationships(nepse_eda, 'Returns')
plt.show()

# Specific analysis: Volume vs Returns (common hypothesis: high volume accompanies high returns)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot with regression line
ax1 = axes[0]
sns.regplot(data=nepse_eda, x='Volume', y='Returns', 
           scatter_kws={'alpha': 0.5}, ax=ax1)
ax1.set_title('Volume vs Returns (with regression line)')
ax1.set_ylabel('Daily Returns')

# Hexbin for density (better for large datasets)
ax2 = axes[1]
hb = ax2.hexbin(nepse_eda['Volume'], nepse_eda['Returns'], 
                gridsize=30, cmap='Blues', mincnt=1)
ax2.set_xlabel('Volume')
ax2.set_ylabel('Returns')
ax2.set_title('Volume vs Returns (density plot)')
plt.colorbar(hb, ax=ax2, label='Count')

plt.tight_layout()
plt.show()

# Calculate rolling correlation to see time-varying relationships
rolling_corr = nepse_eda['Returns'].rolling(60).corr(nepse_eda['Volume'])

plt.figure(figsize=(12, 5))
plt.plot(rolling_corr.index, rolling_corr, color='purple')
plt.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
plt.title('60-Day Rolling Correlation: Returns vs Volume')
plt.ylabel('Correlation')
plt.grid(True, alpha=0.3)
plt.show()
```

**Explanation:**
- **Pairplot**: Matrix of all pairwise scatter plots. Diagonal shows univariate distributions (KDE). Off-diagonal shows relationships. Quickly identifies linear/non-linear patterns.
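- **Volume-Return Relationship**: Common hypothesis is that high volume accompanies large price moves (information arrival). The scatter plot gives a first look at this. Because the hypothesis concerns the size of moves rather than their direction, plotting Volume against absolute returns is the more direct test.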
- **Hexbin**: For large datasets, scatter plots become overplotted (black blob). Hexbin aggregates points into hexagonal bins, color-coded by density.
- **Rolling Correlation**: Relationships in financial markets are not static—they change over time (regime changes). Rolling correlation shows if Volume-Return relationship is strengthening or weakening.
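
The value of a rolling window is clearest when a relationship changes sign partway through the sample. The following sketch builds such a case synthetically (the series names `driver` and `follower` and the 30-step window are assumptions for illustration) and compares the full-sample correlation with rolling values from each regime.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
driver = pd.Series(rng.standard_normal(n))
noise = pd.Series(rng.standard_normal(n)) * 0.1

# First half: follower moves with the driver; second half: against it
sign = np.where(np.arange(n) < n // 2, 1.0, -1.0)
follower = sign * driver + noise

full_sample_corr = driver.corr(follower)
rolling_corr = driver.rolling(30).corr(follower)

print(f"Full-sample correlation: {full_sample_corr:.3f}")
print(f"Rolling corr at t=99:    {rolling_corr.iloc[99]:.3f}")
print(f"Rolling corr at t=199:   {rolling_corr.iloc[199]:.3f}")
```

The full-sample figure averages the two regimes toward zero and hides a relationship that is strong in both halves, just with opposite signs.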

---

### **7.3.3 Cross-Correlation**

```python
def cross_correlation_analysis(df: pd.DataFrame, x: str, y: str, max_lags: int = 10):
    """
    Analyze lead-lag relationships between two time series.
    
    Important for feature engineering: does past volume predict future returns?
    """
    from scipy.signal import correlate
    
    # Remove NaN
    data = df[[x, y]].dropna()
    x_series = data[x]
    y_series = data[y]
    
    # Normalize
    x_norm = (x_series - x_series.mean()) / x_series.std()
    y_norm = (y_series - y_series.mean()) / y_series.std()
    
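    # Calculate cross-correlation.
    # scipy's correlate(a, b) at lag k pairs a[t] with b[t - k], so y goes
    # first to make a positive lag mean "x leads y", matching the plot title
    correlation = correlate(y_norm.to_numpy(), x_norm.to_numpy(), mode='full')
    lags = np.arange(-len(x_norm) + 1, len(x_norm))
    
    # Normalize by length
    correlation = correlation / len(x_norm)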
    
    # Keep only relevant lags
    mask = (lags >= -max_lags) & (lags <= max_lags)
    correlation = correlation[mask]
    lags = lags[mask]
    
    # Plot
    plt.figure(figsize=(12, 6))
    plt.bar(lags, correlation, color='steelblue', alpha=0.7)
    plt.axvline(x=0, color='red', linestyle='--', label='Contemporaneous')
    plt.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    plt.xlabel('Lag (days)')
    plt.ylabel('Cross-Correlation')
    plt.title(f'Cross-Correlation: {x} vs {y}\n(Positive lag = {x} leads {y})')
    plt.grid(True, alpha=0.3)
    plt.legend()
    
    # Find significant peaks
    max_corr_idx = np.argmax(np.abs(correlation))
    max_lag = lags[max_corr_idx]
    max_corr = correlation[max_corr_idx]
    
    plt.annotate(f'Max: {max_corr:.3f} at lag {max_lag}', 
                xy=(max_lag, max_corr), 
                xytext=(max_lag+1, max_corr+0.05),
                arrowprops=dict(arrowstyle='->', color='red'))
    
    plt.show()
    
    return pd.DataFrame({'Lag': lags, 'Correlation': correlation})

# Analyze: Does past volume predict future returns?
print("Analyzing lead-lag relationship: Volume -> Returns")
ccf = cross_correlation_analysis(nepse_eda, 'Volume', 'Returns', max_lags=5)
print(ccf)

# Interpretation
max_row = ccf.loc[ccf['Correlation'].abs().idxmax()]
print(f"\nStrongest relationship: Lag {int(max_row['Lag'])} with correlation {max_row['Correlation']:.4f}")
if max_row['Lag'] > 0:
    print("Volume LEADS returns (predictive signal)")
elif max_row['Lag'] < 0:
    print("Returns LEAD volume (reactionary)")
else:
    print("Contemporaneous relationship")
```

**Explanation:**
- **Cross-Correlation**: Measures correlation between two series at different time lags. Unlike regular correlation (lag 0), this detects lead-lag relationships.
- **Interpretation**:
  - **Lag +1**: Yesterday's Volume correlates with today's Returns (predictive!)
  - **Lag -1**: Yesterday's Returns correlate with today's Volume (reaction)
  - **Lag 0**: Same-day relationship (simultaneous)
- **Feature Engineering**: If Volume at lag 1 predicts Returns, include Volume.shift(1) as a feature in your model.
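
Lag sign conventions are easy to get backwards, so it helps to test them on data where the answer is known. The sketch below builds a series that copies another with a 2-step delay and scans lags with `shift`, the same operation used to create lagged features; the names `leader` and `follower` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 500
leader = pd.Series(rng.standard_normal(n))
# Follower copies the leader with a 2-step delay plus small noise
follower = leader.shift(2) + 0.1 * pd.Series(rng.standard_normal(n))

# Positive k compares the leader's value k steps ago with the follower today
lag_corr = {k: follower.corr(leader.shift(k)) for k in range(-5, 6)}
best_lag = max(lag_corr, key=lambda k: abs(lag_corr[k]))
print(f"Strongest correlation at lag {best_lag}: {lag_corr[best_lag]:.3f}")
```

The peak lands at lag +2, confirming that under this convention a positive lag means the first series leads. Running the same check against any cross-correlation routine is a quick way to confirm its sign convention before trusting a "predictive" reading.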

---

## **7.4 Multivariate Analysis**

When dealing with many features (technical indicators, fundamental data), we need dimensionality reduction and latent structure discovery.

### **7.4.1 Principal Component Analysis**

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

class PCAnalyzer:
    """
    PCA for dimensionality reduction and feature understanding.
    
    Useful when you have many correlated technical indicators
    and want to reduce to uncorrelated principal components.
    """
    
    def __init__(self, df: pd.DataFrame):
        # Select numeric features, drop target
        self.features = df.select_dtypes(include=[np.number]).drop(['Returns', 'Log_Returns'], axis=1, errors='ignore')
        self.feature_names = self.features.columns
        
        # Standardize (PCA is sensitive to feature scale); rows with NaN are dropped first
        self.scaler = StandardScaler()
        self.scaled_data = self.scaler.fit_transform(self.features.dropna())
    
    def fit_pca(self, n_components: int = None):
        """Fit PCA model."""
        self.pca = PCA(n_components=n_components)
        self.components = self.pca.fit_transform(self.scaled_data)
        
        print(f"PCA fitted")
        print(f"Explained variance ratio: {self.pca.explained_variance_ratio_}")
        print(f"Cumulative variance: {np.cumsum(self.pca.explained_variance_ratio_)}")
    
    def plot_variance_explained(self):
        """Scree plot for component selection."""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
        
        # Individual variance
        ax1.bar(range(1, len(self.pca.explained_variance_ratio_) + 1), 
               self.pca.explained_variance_ratio_,
               alpha=0.7, color='steelblue')
        ax1.set_xlabel('Principal Component')
        ax1.set_ylabel('Explained Variance Ratio')
        ax1.set_title('Scree Plot')
        ax1.grid(True, alpha=0.3)
        
        # Cumulative variance
        cumvar = np.cumsum(self.pca.explained_variance_ratio_)
        ax2.plot(range(1, len(cumvar) + 1), cumvar, 'bo-', linewidth=2, markersize=8)
        ax2.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
        ax2.set_xlabel('Number of Components')
        ax2.set_ylabel('Cumulative Explained Variance')
        ax2.set_title('Cumulative Variance Explained')
        ax2.legend()
        ax2.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    def plot_component_weights(self, n_components: int = 3):
        """Show feature loadings for top components."""
        fig, axes = plt.subplots(1, n_components, figsize=(5*n_components, 6))
        
        if n_components == 1:
            axes = [axes]
        
        for i in range(n_components):
            weights = self.pca.components_[i]
            feature_weights = pd.Series(weights, index=self.feature_names).sort_values()
            
            colors = ['red' if w < 0 else 'green' for w in feature_weights]
            feature_weights.plot(kind='barh', ax=axes[i], color=colors, alpha=0.7)
            axes[i].set_title(f'PC{i+1} ({self.pca.explained_variance_ratio_[i]:.1%} variance)')
            axes[i].axvline(x=0, color='black', linestyle='-', linewidth=0.5)
        
        plt.tight_layout()
        plt.show()

# Prepare features for PCA
# Create technical indicators
features_df = nepse_eda.copy()
features_df['RSI'] = 50 + np.random.randn(len(features_df)) * 10  # Mock RSI
features_df['MACD'] = np.random.randn(len(features_df)) * 5
features_df['BB_Upper'] = features_df['Close'] + np.random.uniform(20, 50, len(features_df))
features_df['BB_Lower'] = features_df['Close'] - np.random.uniform(20, 50, len(features_df))

pca_analysis = PCAnalyzer(features_df)
pca_analysis.fit_pca(n_components=5)
pca_analysis.plot_variance_explained()
pca_analysis.plot_component_weights(n_components=3)
```

**Explanation:**
- **PCA**: Transforms correlated features into uncorrelated principal components. First PC captures maximum variance, second captures maximum remaining variance, etc.
- **Scree Plot**: Shows how much variance each component explains. "Elbow" indicates optimal number of components (diminishing returns after that point).
- **Cumulative Variance**: A common rule of thumb is to retain about 95% of the variance. If the first 3 components explain 95%, you can reduce from N features to 3 with little information loss.
- **Component Weights**: Shows which original features contribute to each PC. PC1 might be "trend" (positive weights on MA, Close), PC2 might be "volatility" (weights on High-Low range).
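
The 95% rule of thumb can be applied mechanically once the explained-variance ratios are known. The sketch below does this with NumPy's SVD on standardized synthetic data rather than with the chapter's `PCAnalyzer`; the function name `components_for_variance` and the toy features are assumptions made for illustration.

```python
import numpy as np

def components_for_variance(X: np.ndarray, threshold: float = 0.95):
    """Return (k, ratios): the smallest k whose cumulative explained variance >= threshold."""
    # Standardize so that no feature dominates purely because of its scale
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of the standardized data give the component variances
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values**2 / (len(Z) - 1)
    ratios = variances / variances.sum()
    cumulative = np.cumsum(ratios)
    # First position where the cumulative share reaches the threshold
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(ratios))
    return k, ratios

# Toy data (hypothetical): two independent drivers behind four observed features
rng = np.random.default_rng(0)
trend = rng.normal(size=500)
vol = rng.normal(size=500)
X = np.column_stack([trend, 2 * trend, vol, -vol])

k, ratios = components_for_variance(X)
print(f"Components needed: {k}")
print(f"Explained variance ratios: {np.round(ratios, 3)}")
```

Because the four toy features are built from only two drivers, two components carry essentially all of the variance. Real indicator sets are rarely this clean, so the threshold remains a judgment call.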

---

## **7.5 Time-Series Specific Analysis**

### **7.5.1 Trend Analysis**

```python
from scipy.signal import detrend

class TrendAnalyzer:
    """Analyze deterministic trends in time-series."""
    
    def __init__(self, df: pd.DataFrame):
        self.df = df
    
    def analyze_trend(self, column: str = 'Close'):
        """
        Decompose trend using multiple methods.
        """
        data = self.df[column].dropna()
        
        # 1. Linear trend (OLS)
        x = np.arange(len(data))
        slope, intercept, r_value, p_value, std_err = stats.linregress(x, data)
        
        # 2. Moving average trend
        ma_trend = data.rolling(window=50, center=True).mean()
        
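        # 3. HP Filter trend
        # lamb=1600 is the conventional value for quarterly data; daily series
        # usually call for a much larger value, so treat it as a tuning knob.
        from statsmodels.tsa.filters.hp_filter import hpfilter
        cycle, hp_trend = hpfilter(data, lamb=1600)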
        
        # Plotting
        fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)
        
        # Original with trends
        axes[0].plot(data.index, data, label='Original', alpha=0.7, color='gray')
        axes[0].plot(data.index, slope * x + intercept, label=f'Linear (slope={slope:.2f}/day)', 
                    color='red', linestyle='--')
        axes[0].plot(data.index, ma_trend, label='MA(50)', color='blue', linewidth=2)
        axes[0].plot(data.index, hp_trend, label='HP Filter', color='green', linewidth=2)
        axes[0].set_title('Trend Decomposition')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Detrended series
        detrended = data - hp_trend
        axes[1].plot(data.index, detrended, color='purple', alpha=0.7)
        axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
        axes[1].set_title('Detrended Series (Cyclical Component)')
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        print(f"Linear Trend: {slope:.4f} per day ({slope*252:.2f} per year)")
        print(f"R-squared: {r_value**2:.4f} ({r_value**2*100:.1f}% of variance explained by trend)")
        print(f"Trend significance: p-value = {p_value:.2e}")
        
        return {
            'slope': slope,
            'annualized_return': slope * 252,  # price units (NPR) per year, not a percentage
            'r_squared': r_value**2,
            'p_value': p_value
        }

trend_analysis = TrendAnalyzer(nepse_eda)
trend_stats = trend_analysis.analyze_trend('Close')
```

**Explanation:**
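- **Linear Trend**: Simple regression of price on time. The slope is the average daily change in price units (NPR). Multiplying by 252 trading days gives an approximate change per year, which is not a percentage return. Adjust 252 if the exchange calendar has a different number of trading days.
- **R-squared**: Share of price variance explained by a straight-line time trend (0-1). A high value indicates a persistent drift over the sample. Because prices are strongly autocorrelated, even a random walk can produce a high R-squared and a tiny p-value, so treat both as descriptive rather than as formal evidence of a trend.
- **HP Filter**: Smoother trend that adapts to local curvature (better for financial trends that aren't perfectly linear). The smoothing parameter `lamb=1600` is the quarterly-data convention; daily data generally needs a larger value.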
- **Detrending**: Removing trend to analyze cyclical components. Essential for mean-reversion strategies.
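
Because `slope * 252` is expressed in price units, it does not read directly as a growth rate. One common alternative, sketched here on a synthetic series, is to regress log prices on time so that the slope becomes a continuously compounded daily growth rate. The helper name `log_trend_growth` and the 252-day convention are assumptions; the trading-day count should match the exchange calendar.

```python
import numpy as np

def log_trend_growth(prices: np.ndarray, periods_per_year: int = 252) -> dict:
    """Fit log(price) = a + b*t and convert b into an annualized growth rate."""
    t = np.arange(len(prices))
    # np.polyfit returns coefficients from the highest degree to the lowest
    b, a = np.polyfit(t, np.log(prices), deg=1)
    return {
        'daily_log_growth': b,
        'annualized_growth': np.exp(b * periods_per_year) - 1,
    }

# Synthetic series growing at exactly 0.04% per day (continuously compounded)
prices = 1000 * np.exp(0.0004 * np.arange(750))
result = log_trend_growth(prices)
print(f"Daily log growth:  {result['daily_log_growth']:.6f}")
print(f"Annualized growth: {result['annualized_growth']:.2%}")
```

The same caveat about autocorrelated prices applies here: the fitted growth rate describes the sample and is not a forecast.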

---

### **7.5.2 Seasonality Detection**

```python
class SeasonalityAnalyzer:
    """Detect and visualize seasonal patterns."""
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.df['DayOfWeek'] = self.df.index.dayofweek
        self.df['Month'] = self.df.index.month
        self.df['Quarter'] = self.df.index.quarter
        self.df['DayOfYear'] = self.df.index.dayofyear
    
    def plot_seasonal_decomposition(self, column: str = 'Close'):
        """
        Decompose into trend, seasonal, and residual.
        """
        from statsmodels.tsa.seasonal import seasonal_decompose
        
        data = self.df[column].dropna()
        
        # Decompose (additive: Y = Trend + Seasonal + Residual)
        decomposition = seasonal_decompose(data, model='additive', period=252)  # Annual seasonality
        
        fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)
        
        decomposition.observed.plot(ax=axes[0], title='Original')
        decomposition.trend.plot(ax=axes[1], title='Trend')
        decomposition.seasonal.plot(ax=axes[2], title='Seasonal')
        decomposition.resid.plot(ax=axes[3], title='Residual')
        
        for ax in axes:
            ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    def seasonal_heatmap(self, column: str = 'Returns'):
        """Create heatmap of average returns by month and year."""
        pivot = self.df.pivot_table(values=column, 
                                   index=self.df.index.month, 
                                   columns=self.df.index.year, 
                                   aggfunc='mean')
        
        plt.figure(figsize=(12, 6))
        sns.heatmap(pivot, annot=True, fmt='.2%', cmap='RdYlGn', center=0,
                   cbar_kws={'label': 'Average Return'})
        plt.title('Seasonal Heatmap: Average Returns by Month and Year')
        plt.ylabel('Month')
        plt.xlabel('Year')
        plt.show()

seasonality = SeasonalityAnalyzer(nepse_eda)
seasonality.plot_seasonal_decomposition('Close')
seasonality.seasonal_heatmap('Returns')
```

**Explanation:**
- **Decomposition**: Separates time-series into:
  - **Trend**: Long-term direction
  - **Seasonal**: Repeating pattern (annual, quarterly)
  - **Residual**: Irregular noise
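- **Period**: 252 trading days approximates one year of annual seasonality for stocks. `seasonal_decompose` needs at least two full periods (504 observations here) and leaves the first and last half-period of the trend and residual as NaN.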
- **Heatmap**: Shows if certain months consistently outperform (e.g., January effect) and how seasonality varies by year.
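
The constructor above also builds `DayOfWeek`, `Quarter`, and `DayOfYear` columns that the two plotting methods never use. A minimal sketch of how such calendar groupings might be summarized follows; the helper `calendar_effect` and the synthetic returns are assumptions for illustration. Day numbering follows pandas (Monday = 0), so the mapping to an exchange's trading week needs to be checked separately.

```python
import numpy as np
import pandas as pd

def calendar_effect(returns: pd.Series, by: str = 'dayofweek') -> pd.DataFrame:
    """Summarize returns grouped by a DatetimeIndex attribute (e.g. dayofweek, month)."""
    keys = getattr(returns.index, by)
    summary = returns.groupby(keys).agg(['mean', 'std', 'count'])
    summary.index.name = by
    return summary

# Synthetic example: four business weeks in which only Mondays gain 1%
dates = pd.bdate_range('2024-01-01', periods=20)
returns = pd.Series(np.where(dates.dayofweek == 0, 0.01, 0.0), index=dates)

effect = calendar_effect(returns)
print(effect)
```

With real data, the `count` column matters as much as the mean: a large average built on a handful of observations is weak evidence of a calendar effect.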

---

### **7.5.3 Decomposition Methods**

```python
def compare_decomposition_methods(df: pd.DataFrame, column: str):
    """
    Compare STL vs Classical decomposition.
    """
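    # seasonal_decompose must be imported here too; the earlier import was local to a method
    from statsmodels.tsa.seasonal import STL, seasonal_decompose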
    
    data = df[column].dropna()
    
    # Classical decomposition
    classical = seasonal_decompose(data, model='additive', period=252)
    
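    # STL decomposition (more robust)
    # `period` sets the cycle length and is required when the index has no frequency;
    # `seasonal` (left at its default) is the odd-length smoother window, not the period.
    stl = STL(data, period=252, robust=True)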
    stl_result = stl.fit()
    
    # Plot comparison
    fig, axes = plt.subplots(3, 2, figsize=(14, 10), sharex=True)
    
    # Classical
    classical.trend.plot(ax=axes[0,0], title='Classical Trend')
    classical.seasonal.plot(ax=axes[1,0], title='Classical Seasonal')
    classical.resid.plot(ax=axes[2,0], title='Classical Residual')
    
    # STL
    stl_result.trend.plot(ax=axes[0,1], title='STL Trend')
    stl_result.seasonal.plot(ax=axes[1,1], title='STL Seasonal')
    stl_result.resid.plot(ax=axes[2,1], title='STL Residual')
    
    plt.tight_layout()
    plt.show()

compare_decomposition_methods(nepse_eda, 'Close')
```

**Explanation:**
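- **Classical Decomposition**: Estimates the trend with a centered moving average and the seasonal component as a fixed per-period average. Fast, but it assumes the seasonal pattern is constant and loses observations at both ends of the series.
- **STL (Seasonal and Trend decomposition using Loess)**: More flexible, handles seasonality that changes over time, and is robust to outliers when fitted with `robust=True`. Generally preferred for modern time-series analysis.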
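
To make the "moving averages" description concrete, the following hand-rolled sketch reproduces the classical additive steps on a small synthetic series. It is a simplified illustration, not the statsmodels implementation: the function name `classical_additive` is hypothetical, and it handles only odd periods to avoid the extra averaging step that even periods require.

```python
import numpy as np
import pandas as pd

def classical_additive(series: pd.Series, period: int) -> pd.DataFrame:
    """Hand-rolled additive decomposition for an odd period (illustrative only)."""
    if period % 2 == 0:
        raise ValueError("This sketch only handles odd periods")
    # Trend: centered moving average spanning exactly one full period
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position within the cycle
    phase = np.arange(len(series)) % period
    phase_means = detrended.groupby(phase).mean()
    phase_means -= phase_means.mean()  # force the pattern to average zero
    seasonal = pd.Series(phase_means.to_numpy()[phase], index=series.index)
    resid = series - trend - seasonal
    return pd.DataFrame({'trend': trend, 'seasonal': seasonal, 'resid': resid})

# Synthetic series: linear trend plus a fixed five-step pattern that sums to zero
pattern = np.array([2.0, -1.0, 0.0, -2.0, 1.0])
t = np.arange(60)
series = pd.Series(0.1 * t + pattern[t % 5])

parts = classical_additive(series, period=5)
print(parts.head(8))
```

The seasonal column repeats the same five values for the whole sample, which is exactly the constant-seasonality assumption that STL relaxes.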

---

## **7.6 Visualization Best Practices**

```python
def create_publication_quality_chart(df: pd.DataFrame):
    """
    Demonstrate best practices for financial charts.
    """
    fig, ax = plt.subplots(figsize=(12, 6))
    
    # Data
    data = df['Close'].tail(100)
    ma20 = df['MA_20'].tail(100)
    ma50 = df['MA_50'].tail(100)
    
    # Plotting
    ax.plot(data.index, data, label='NABIL Close', linewidth=1.5, color='black')
    ax.plot(ma20.index, ma20, label='MA(20)', linewidth=1, color='blue', alpha=0.8)
    ax.plot(ma50.index, ma50, label='MA(50)', linewidth=1, color='red', alpha=0.8)
    
    # Annotations
    max_idx = data.idxmax()
    min_idx = data.idxmin()
    ax.annotate(f'High: {data.max():.0f}', 
                xy=(max_idx, data.max()), 
                xytext=(10, 10), textcoords='offset points',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7),
                arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    
    # Styling
    ax.set_title('NEPSE Stock Price with Moving Averages', fontsize=14, fontweight='bold', pad=20)
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Price (NPR)', fontsize=12)
    ax.grid(True, alpha=0.3, linestyle='--')
    ax.legend(loc='upper left', framealpha=0.9)
    
    # Remove top and right spines
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    # Format y-axis as currency
    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'Rs.{x:,.0f}'))
    
    plt.tight_layout()
    plt.show()

create_publication_quality_chart(nepse_eda)
```

**Explanation:**
- **Best Practices**:
  - Descriptive title, with units on the axis labels
  - Clear legend with meaningful labels
  - Grid for readability (light, dashed)
  - Annotations for key events/extremes
  - Remove chart junk (top/right borders)
  - Proper formatting (currency symbols, dates)
  - Color blindness friendly (avoid red/green only distinctions)
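
The chart above is only shown on screen, and its moving averages are distinguished by color alone. As a sketch of two of the practices listed, the example below saves a high-resolution file and varies line style together with colors taken from the Okabe-Ito palette. The function name `save_price_chart`, the output path, and the synthetic prices are assumptions for illustration.

```python
from pathlib import Path
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; works without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Three colors from the Okabe-Ito palette, chosen to stay distinguishable
# under common forms of color-vision deficiency
COLORS = {'price': '#000000', 'fast': '#0072B2', 'slow': '#E69F00'}

def save_price_chart(close: pd.Series, path: Path) -> Path:
    """Plot the close with two moving averages and save a 300-dpi PNG."""
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.plot(close.index, close, color=COLORS['price'], linewidth=1.5, label='Close')
    # Vary line style as well as color so the series survive grayscale printing
    ax.plot(close.index, close.rolling(20).mean(), color=COLORS['fast'],
            linestyle='--', label='MA(20)')
    ax.plot(close.index, close.rolling(50).mean(), color=COLORS['slow'],
            linestyle=':', label='MA(50)')
    ax.set_xlabel('Date')
    ax.set_ylabel('Price (NPR)')
    ax.legend(loc='upper left')
    fig.savefig(path, dpi=300, bbox_inches='tight')
    plt.close(fig)  # release the figure's memory
    return path

# Synthetic prices (hypothetical data, for illustration only)
dates = pd.bdate_range('2024-01-01', periods=120)
close = pd.Series(1000 + 50 * np.sin(np.arange(120) / 15), index=dates)
out = save_price_chart(close, Path(tempfile.gettempdir()) / 'price_chart.png')
print(f"Saved {out}")
```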

---

## **7.7 Interactive Visualization**

```python
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

def create_interactive_chart(df: pd.DataFrame):
    """
    Create Plotly interactive chart for web dashboards.
    """
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, 
                       vertical_spacing=0.03, 
                       row_heights=[0.7, 0.3],
                       subplot_titles=('Price', 'Volume'))
    
    # Candlestick
    fig.add_trace(go.Candlestick(
        x=df.index,
        open=df['Open'],
        high=df['High'],
        low=df['Low'],
        close=df['Close'],
        name='OHLC'
    ), row=1, col=1)
    
    # Volume
    colors = ['green' if df['Close'].iloc[i] >= df['Open'].iloc[i] else 'red' 
             for i in range(len(df))]
    
    fig.add_trace(go.Bar(
        x=df.index,
        y=df['Volume'],
        marker_color=colors,
        name='Volume'
    ), row=2, col=1)
    
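    # Layout
    fig.update_layout(
        title='NEPSE Interactive Chart',
        xaxis_rangeslider_visible=False,  # Hide range slider
        height=600
    )
    # Label each subplot explicitly; layout-level axis titles only reach the first subplot
    fig.update_yaxes(title_text='Price (NPR)', row=1, col=1)
    fig.update_yaxes(title_text='Volume', row=2, col=1)
    fig.update_xaxes(title_text='Date', row=2, col=1)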
    
    fig.show()

# Uncomment to run (requires plotly)
# create_interactive_chart(nepse_eda.tail(100))
```

**Explanation:**
- **Plotly**: Interactive web-based visualization. Users can zoom, pan, hover for tooltips, toggle traces.
- **Candlestick**: Standard financial chart showing OHLC in single visual element.
- **Subplots**: Share x-axis (time) between price and volume.
- **Dashboards**: Export to HTML or integrate with Dash/Streamlit for interactive dashboards.

---

## **7.8 Automated EDA Reports**

```python
def generate_automated_report(df: pd.DataFrame, symbol: str):
    """
    Generate a comprehensive automated report with ydata-profiling
    (formerly pandas-profiling).
    """
    from ydata_profiling import ProfileReport  # pip install ydata-profiling
    
    # Create profile
    profile = ProfileReport(df, title=f'NEPSE {symbol} Data Profile', 
                           explorative=True)
    
    # Save to HTML
    profile.to_file(f'nepse_{symbol}_report.html')
    print(f"Report saved to nepse_{symbol}_report.html")

# Alternative: Sweetviz
def generate_sweetviz_report(df: pd.DataFrame, symbol: str):
    """Generate a single-dataset report using sweetviz (sv.compare handles comparisons)."""
    import sweetviz as sv
    
    report = sv.analyze(df)
    report.show_html(f'nepse_{symbol}_sweetviz.html')

print("Automated report functions defined")
print("These libraries generate comprehensive HTML reports with:")
print("- Distribution analysis for all columns")
print("- Correlation matrices")
print("- Missing value analysis")
print("- Sample data preview")
```

**Explanation:**
- **Automated EDA**: Libraries like `pandas-profiling` (now ydata-profiling) and `sweetviz` generate comprehensive HTML reports with one line of code.
- **Use Cases**: Quick data quality checks, sharing initial findings with stakeholders, documentation.
- **Limitations**: Generic—may miss domain-specific insights (e.g., OHLC relationships). Always supplement with manual analysis.
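
As an example of the domain-specific checks that generic profilers skip, the sketch below flags rows that break basic OHLC invariants. The function name `check_ohlc_consistency` and the sample rows are hypothetical; the column names follow the `Open`/`High`/`Low`/`Close`/`Volume` convention used elsewhere in this chapter.

```python
import pandas as pd

def check_ohlc_consistency(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that violate at least one OHLC invariant."""
    checks = pd.DataFrame({
        # High must be at least as large as both Open and Close
        'high_below_open_close': df['High'] < df[['Open', 'Close']].max(axis=1),
        # Low must be no larger than both Open and Close
        'low_above_open_close': df['Low'] > df[['Open', 'Close']].min(axis=1),
        'non_positive_price': (df[['Open', 'High', 'Low', 'Close']] <= 0).any(axis=1),
        'negative_volume': df['Volume'] < 0,
    })
    return checks[checks.any(axis=1)]

# Hypothetical sample: the second and third rows each contain one violation
sample = pd.DataFrame({
    'Open':   [100, 105, 110],
    'High':   [106, 104, 112],
    'Low':    [99, 101, 111],
    'Close':  [105, 103, 109],
    'Volume': [1000, 1200, 900],
}, index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']))

violations = check_ohlc_consistency(sample)
print(violations)
```

A check like this can run alongside the automated report so that structural errors surface before any modeling begins.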

---

## **7.9 Communicating Insights**

```python
def generate_executive_summary(df: pd.DataFrame, symbol: str, findings: list):
    """
    Generate text summary for non-technical stakeholders.
    """
    summary = f"""
    NEPSE STOCK ANALYSIS EXECUTIVE SUMMARY
    ======================================
    Symbol: {symbol}
    Analysis Period: {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}
    Trading Days: {len(df)}
    
    KEY METRICS:
    - Average Closing Price: Rs.{df['Close'].mean():,.2f}
    - Volatility (Annualized): {df['Returns'].std() * np.sqrt(252) * 100:.1f}%
    - Total Return: {((df['Close'].iloc[-1] / df['Close'].iloc[0]) - 1) * 100:.1f}%
    - Sharpe Ratio: {(df['Returns'].mean() / df['Returns'].std()) * np.sqrt(252):.2f}
    
    KEY FINDINGS:
    """
    
    for i, finding in enumerate(findings, 1):
        summary += f"{i}. {finding}\n"
    
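    # Build the time axis on df's own index; a bare pd.Series(range(len(df)))
    # would not align with the dates, so .corr() would return NaN.
    time_axis = pd.Series(np.arange(len(df)), index=df.index)
    trend_corr = df['Close'].corr(time_axis)
    # Rough proxy only: average within-month volatility of daily returns
    monthly_vol = df.groupby(df.index.month)['Returns'].std().mean()
    
    summary += f"""
    RECOMMENDATIONS:
    - {'Strong trend detected' if abs(trend_corr) > 0.7 else 'No strong trend detected'}
    - {'High seasonality observed' if monthly_vol > 0.02 else 'Seasonal effects minimal'}
    - Data Quality: {'Excellent' if df.isnull().sum().sum() == 0 else 'Requires attention'}
    """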
    
    return summary

# Generate summary
# The findings below are illustrative placeholders, not results computed from the data
findings = [
    "Price shows strong upward trend with 15% annual growth",
    "Volume spikes correlate with positive returns (correlation: 0.35)",
    "October historically shows highest volatility",
    "No significant multicollinearity detected in features"
]

summary = generate_executive_summary(nepse_eda, 'NABIL', findings)
print(summary)
```

**Explanation:**
- **Executive Summary**: Translates technical findings into business language.
- **Key Elements**:
  - Period and scope
  - Key metrics (returns, risk, and a Sharpe ratio computed here with a zero risk-free rate)
  - Actionable findings
  - Data quality assessment
- **Audience Adaptation**: Technical details (autocorrelation, kurtosis) for data scientists; trends and risks for executives.
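
The metrics in the summary can be checked by hand on a tiny series. The sketch below recomputes them with an explicit risk-free rate, which the summary above implicitly sets to zero. The function name `summary_metrics`, its parameters, and the five-price series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def summary_metrics(close: pd.Series, risk_free_annual: float = 0.0,
                    periods_per_year: int = 252) -> dict:
    """Compute total return, annualized volatility, and Sharpe ratio from closes."""
    returns = close.pct_change().dropna()
    # Convert the annual risk-free rate to a simple per-period rate
    excess = returns - risk_free_annual / periods_per_year
    return {
        'total_return': close.iloc[-1] / close.iloc[0] - 1,
        'annualized_volatility': returns.std() * np.sqrt(periods_per_year),
        'sharpe': excess.mean() / returns.std() * np.sqrt(periods_per_year),
    }

# Five closes whose daily returns are +1%, -1%, +2%, and 0%
close = pd.Series([100.0, 101.0, 99.99, 101.9898, 101.9898])

metrics = summary_metrics(close)
metrics_rf = summary_metrics(close, risk_free_annual=0.05)
print(f"Total return: {metrics['total_return']:.2%}")
print(f"Sharpe (rf=0%): {metrics['sharpe']:.2f}")
print(f"Sharpe (rf=5%): {metrics_rf['sharpe']:.2f}")
```

A sample this short produces an unrealistically large Sharpe ratio, which is a useful reminder to state the sample length alongside any annualized figure.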

---

## **7.10 EDA Checklist**

```python
class EDAChecklist:
    """
    Comprehensive checklist for EDA completion.
    """
    
    def __init__(self):
        self.checklist = {
            'Data Understanding': [
                'Loaded data shape verified',
                'Column types inspected',
                'Date range confirmed',
                'Missing values quantified',
                'Duplicates checked'
            ],
            'Univariate Analysis': [
                'Distributions plotted (histograms, box plots)',
                'Summary statistics calculated',
                'Normality tests performed',
                'Outliers identified and explained'
            ],
            'Bivariate Analysis': [
                'Correlation matrix computed',
                'Scatter plots for key relationships',
                'Cross-correlations for time lags',
                'Multicollinearity assessed'
            ],
            'Time-Series Specific': [
                'Trend visualized and quantified',
                'Seasonality detected',
                'Stationarity tested (ADF test)',
                'Autocorrelation analyzed (ACF/PACF)',
                'Volatility clustering checked'
            ],
            'Quality Assessment': [
                'Data cleaning validated',
                'Feature engineering opportunities identified',
                'Missing data mechanism understood',
                'Outlier treatment justified'
            ],
            'Documentation': [
                'Visualizations saved',
                'Key findings documented',
                'Executive summary written',
                'Code commented and reproducible'
            ]
        }
    
    def display_checklist(self):
        """Display formatted checklist."""
        print("="*60)
        print("EDA COMPLETION CHECKLIST")
        print("="*60)
        
        for category, items in self.checklist.items():
            print(f"\n{category}:")
            print("-" * len(category))
            for item in items:
                print(f"  [ ] {item}")
        
        print("\n" + "="*60)
        print("Tip: Print this checklist and mark items as you complete them")

checklist = EDAChecklist()
checklist.display_checklist()
```

**Explanation:**
- **Structured Approach**: Ensures no critical step is missed.
- **Categories**: Organized from basic (data loading) to advanced (time-series specific).
- **Reproducibility**: Checklist ensures analysis can be replicated or audited.
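
For auditing, it can help to record which items were actually completed instead of relying on a printed list. The sketch below shows one possible way to do that; `ChecklistTracker` is a hypothetical helper, not part of the `EDAChecklist` class above, and it accepts any dictionary with the same shape as `EDAChecklist().checklist`.

```python
import json

class ChecklistTracker:
    """Track completion of checklist items (hypothetical helper)."""

    def __init__(self, checklist: dict):
        # Every item starts as not done
        self.status = {category: {item: False for item in items}
                       for category, items in checklist.items()}

    def mark_done(self, category: str, item: str) -> None:
        if item not in self.status.get(category, {}):
            raise KeyError(f"Unknown checklist item: {category} / {item}")
        self.status[category][item] = True

    def completion(self) -> float:
        """Fraction of all items marked done (0.0 to 1.0)."""
        flags = [done for items in self.status.values() for done in items.values()]
        return sum(flags) / len(flags) if flags else 0.0

    def to_json(self) -> str:
        """Serialize the status so it can be stored next to the analysis outputs."""
        return json.dumps(self.status, indent=2)

tracker = ChecklistTracker({
    'Data Understanding': ['Loaded data shape verified', 'Missing values quantified'],
    'Time-Series Specific': ['Stationarity tested (ADF test)', 'Seasonality detected'],
})
tracker.mark_done('Data Understanding', 'Loaded data shape verified')
print(f"{tracker.completion():.0%} complete")  # 25% complete
```

Saving the JSON output with each analysis run gives reviewers a record of what was checked and what was skipped.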

---

## **Chapter Summary**

In this chapter, we covered comprehensive EDA for time-series prediction:

### **Key Takeaways:**

1. **Systematic Process**: EDA proceeds in phases: understanding, cleaning validation, univariate, bivariate, temporal, and quality assessment.

2. **Univariate Analysis**: Examine distributions (histograms, Q-Q plots), calculate moments (skewness, kurtosis), test normality. Financial data is rarely normal.

3. **Bivariate Analysis**: Correlation matrices (Pearson vs Spearman), scatter plots, cross-correlations for lead-lag relationships. Watch for multicollinearity.

4. **Multivariate Analysis**: PCA for dimensionality reduction, understanding feature loadings.

5. **Time-Series Specific**:
   - **Trend**: Linear regression, HP filter, detrending
   - **Seasonality**: Decomposition (STL preferred), seasonal heatmaps
   - **Decomposition**: Separate trend, seasonal, residual components

6. **Visualization**: Publication-quality standards (clear labels, annotations, removed chart junk), interactive Plotly for dashboards.

7. **Automation**: Use ydata-profiling (formerly pandas-profiling) or sweetviz for rapid assessment, but supplement with domain-specific analysis.

8. **Communication**: Executive summaries translate technical findings into actionable business insights.

9. **Checklist**: Structured approach ensures completeness and reproducibility.

### **Next Steps:**

In Chapter 8, we will cover **Data Storage and Management**, including:
- Storage architecture decisions (files vs databases)
- Time-series database optimization (InfluxDB, TimescaleDB)
- Data partitioning strategies
- Backup and recovery procedures
- Cloud storage solutions

---

**End of Chapter 7**

---

*This chapter provided the analytical foundation for understanding NEPSE data before building models. The visualizations and statistical tests demonstrated here should be run before any model development to ensure data quality and to identify predictive features.*