# Data Exploration and Preprocessing
## Exploratory Data Analysis (EDA), Feature Correlation, and Preprocessing

This notebook covers the essential steps for understanding and preparing economic data for modeling:
- **Exploratory Data Analysis (EDA)**: Uncover patterns, anomalies, and insights.
- **Correlation Analysis**: Understand relationships between variables.
- **Feature Engineering**: Create new variables to improve model performance.
- **Data Preprocessing**: Prepare data for machine learning models.

### Key Features:
- Comprehensive summary statistics and data quality checks.
- Advanced visualizations for distribution, correlation, and time-series analysis.
- Interactive plots for deep-dive analysis.
- Standard preprocessing pipelines (scaling, transformations).

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context('talk')

print("📚 Libraries imported successfully")

## 1. Data Loading and Initial Inspection

Load the dataset and perform an initial high-level inspection to understand its structure, data types, and basic properties.

In [None]:
# Parameters (can be overridden by Streamlit)
correlation_threshold = 0.7
outlier_threshold = 2.5

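# Generate synthetic sample data (replace with the actual data source)
def load_sample_data():
    """Generate synthetic monthly economic data with injected gaps and one outlier."""
    # In a real scenario, load from data/raw or data/cleaned
    np.random.seed(42)  # fixed seed so the synthetic data is reproducible
    # Month-end dates; pandas >= 2.2 deprecates the 'M' alias in favor of 'ME'
    dates = pd.date_range('2010-01-01', periods=120, freq='M')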
    data = pd.DataFrame({
        'GDP': 1000 + np.arange(120) * 10 + np.random.normal(0, 20, 120),
        'Inflation': 2 + np.sin(np.arange(120) / 12) * 0.5 + np.random.normal(0, 0.2, 120),
        'Unemployment': 5 - np.cos(np.arange(120) / 24) * 1.5 + np.random.normal(0, 0.3, 120),
        'Interest_Rate': 1.5 + 0.5 * (2 + np.sin(np.arange(120) / 12) * 0.5) + np.random.normal(0, 0.1, 120),
        'Public_Debt': 500 * np.exp(np.cumsum(np.random.normal(0.01, 0.005, 120))),
        'Trade_Balance': -10 + 20 * np.sin(np.arange(120) / 12 + np.pi/4) + np.random.normal(0, 8, 120),
        'Region': np.random.choice(['North', 'South', 'East', 'West'], 120)
    }, index=dates)
    
    # Add some missing values and outliers
    for col in ['GDP', 'Inflation']:
        data.loc[data.sample(frac=0.05).index, col] = np.nan
    data.loc[data.sample(1).index, 'Unemployment'] = 20 # Outlier
    
    return data

df = load_sample_data()

print("📊 DATA OVERVIEW")
print("="*50)
print(f"Shape of data: {df.shape}")
print(f"Date range: {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")

print("\n📋 FIRST 5 ROWS:")
print(df.head())

print("\nℹ️ DATA INFO:")
df.info()

## 2. Data Quality and Cleaning

Assess data quality by checking for missing values, duplicates, and inconsistencies. Apply cleaning techniques to prepare the data for analysis.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})

print("🗑️ MISSING VALUES ANALYSIS")
print("="*50)
print(missing_df[missing_df['Missing Values'] > 0])

# Visualize missing values (cast booleans to 0/1 so the color scale is numeric)
fig = px.imshow(df.isnull().astype(int), title='Missing Value Heatmap', color_continuous_scale='gray_r')
fig.show()

# Handle missing values (imputation)
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print("\n✅ Missing values handled using mean imputation.")

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\n📋 DUPLICATE ROWS: {duplicates}")
if duplicates > 0:
    df = df.drop_duplicates()
    print("✅ Duplicate rows removed.")
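
Mean imputation replaces every gap with a single column-wide value, which can sit far from the neighbouring observations in a trending series such as GDP. One alternative worth comparing is time-based linear interpolation in pandas. The sketch below uses a small hypothetical series (the values are illustrative, not taken from the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical trending series with two gaps (illustrative values only)
idx = pd.date_range('2020-01-01', periods=6, freq='D')
gdp = pd.Series([100.0, 110.0, np.nan, 130.0, np.nan, 150.0], index=idx, name='GDP')

mean_filled = gdp.fillna(gdp.mean())          # every gap gets the column mean
time_filled = gdp.interpolate(method='time')  # gaps follow the local trend

comparison = pd.DataFrame({'original': gdp, 'mean': mean_filled, 'time': time_filled})
print(comparison)
```

Here the mean fill puts 122.5 in both gaps, while interpolation gives 120 and 140, which stay on the trend line. Which approach suits a real dataset depends on why the values are missing.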

## 3. Exploratory Data Analysis (EDA)

Perform EDA to understand the distribution of each variable, identify trends, and uncover relationships between variables.

In [None]:
# Summary statistics
print("📊 SUMMARY STATISTICS")
print("="*50)
print(df.describe().round(2))

# Distribution of numeric variables
numeric_cols = df.select_dtypes(include=np.number).columns
n_cols = 3
n_rows = (len(numeric_cols) - 1) // n_cols + 1

fig = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=list(numeric_cols))

for i, col in enumerate(numeric_cols):
    row = i // n_cols + 1
    col_pos = i % n_cols + 1
    fig.add_trace(go.Histogram(x=df[col], name=col, nbinsx=30), row=row, col=col_pos)

fig.update_layout(title='Distribution of Economic Indicators', height=300*n_rows, showlegend=False)
fig.show()

# Time series plots
fig = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=list(numeric_cols))

for i, col in enumerate(numeric_cols):
    row = i // n_cols + 1
    col_pos = i % n_cols + 1
    fig.add_trace(go.Scatter(x=df.index, y=df[col], name=col, mode='lines'), row=row, col=col_pos)

fig.update_layout(title='Time Series of Economic Indicators', height=300*n_rows, showlegend=False)
fig.show()
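
The `outlier_threshold` parameter set in the loading cell is not used by any cell in this notebook. One way it could be applied, assuming it is meant as a z-score cutoff (an assumption; the notebook does not say), is sketched below on a hypothetical series with one injected spike. `flag_outliers` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def flag_outliers(series: pd.Series, threshold: float = 2.5) -> pd.Series:
    """Return a boolean mask marking values whose |z-score| exceeds threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold

rng = np.random.default_rng(0)
unemployment = pd.Series(5 + rng.normal(0, 0.3, 60))  # hypothetical data
unemployment.iloc[10] = 20.0  # injected spike, mirroring the notebook's outlier

mask = flag_outliers(unemployment, threshold=2.5)
print(unemployment[mask])
```

A large outlier inflates the standard deviation it is measured against, so a median-based score may be a more robust choice when several extreme values are expected.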

## 4. Correlation and Relationship Analysis

Analyze the correlation between variables to identify multicollinearity and understand key relationships.

In [None]:
# Correlation matrix
corr_matrix = df[numeric_cols].corr()

print("🔗 CORRELATION MATRIX")
print("="*50)
print(corr_matrix.round(2))

# Visualize correlation matrix
fig = px.imshow(corr_matrix, 
                title='Correlation Matrix of Economic Indicators',
                color_continuous_scale='RdBu_r',
                zmin=-1, zmax=1,
                text_auto=True)
fig.update_layout(height=600)
fig.show()

# Identify highly correlated pairs (upper triangle only, so each pair is listed once
# and the diagonal of self-correlations is excluded)
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
high_corr_pairs = corr_matrix.abs().where(upper_mask).stack().sort_values(ascending=False)
high_corr_pairs = high_corr_pairs[high_corr_pairs > correlation_threshold]

print(f"\n🎯 HIGHLY CORRELATED PAIRS (|r| > {correlation_threshold}):")
print(high_corr_pairs.head(10))

# Pair plot for detailed relationship analysis
print("\n📈 Generating pair plot (this may take a moment)...")
fig = px.scatter_matrix(df, dimensions=numeric_cols, 
                        title='Pair Plot of Economic Indicators',
                        color='Region')
fig.update_layout(height=1000)
fig.show()
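
Pairwise correlations can miss multicollinearity that involves three or more variables. A common complementary check is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix of the predictors. The sketch below uses hypothetical columns `a`, `b`, and `c` rather than the notebook's data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + rng.normal(scale=0.1, size=n)  # nearly a linear combination of a and b
X = pd.DataFrame({'a': a, 'b': b, 'c': c})

# VIF_j is the j-th diagonal entry of the inverse correlation matrix
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name='VIF')
print(vif.round(1))
```

All three VIFs come out large even though `a` and `b` are almost uncorrelated with each other, which a pairwise threshold between those two columns alone would not reveal. A VIF above roughly 5 to 10 is a commonly used rule of thumb for concern.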

## 5. Feature Engineering

Create new features from existing data to enhance model performance.

In [None]:
print("🛠️ FEATURE ENGINEERING")
print("="*50)

df_eng = df.copy()

# Time-based features
df_eng['Year'] = df_eng.index.year
df_eng['Month'] = df_eng.index.month
df_eng['Quarter'] = df_eng.index.quarter
print("✅ Added time-based features (Year, Month, Quarter)")

# Lag features
for col in ['GDP', 'Inflation']:
    for lag in [1, 3, 6]:
        df_eng[f'{col}_lag_{lag}'] = df_eng[col].shift(lag)
print("✅ Added lag features for GDP and Inflation")

# Rolling window features
for col in ['Unemployment', 'Interest_Rate']:
    for window in [3, 6]:
        df_eng[f'{col}_rolling_mean_{window}'] = df_eng[col].rolling(window=window).mean()
        df_eng[f'{col}_rolling_std_{window}'] = df_eng[col].rolling(window=window).std()
print("✅ Added rolling window features for Unemployment and Interest Rate")

# Interaction features
df_eng['Debt_to_GDP'] = df_eng['Public_Debt'] / df_eng['GDP']
print("✅ Added interaction feature: Debt_to_GDP ratio")

# Drop rows with NaNs created by feature engineering
df_eng = df_eng.dropna()

print(f"\n📊 Shape of engineered data: {df_eng.shape}")
print("\n📋 ENGINEERED FEATURES PREVIEW:")
print(df_eng.head())
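
Note that `rolling(window).mean()` includes the current row in each window. If these features later feed a model that predicts the same-period value of that column, shifting by one step before rolling keeps the window strictly in the past. Whether this matters depends on the modeling target, which this notebook does not specify. A small sketch on a hypothetical series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='x')  # illustrative values

features = pd.DataFrame({
    'x': s,
    'lag_1': s.shift(1),                                       # previous value
    'roll_mean_3': s.rolling(window=3).mean(),                 # includes current row
    'past_roll_mean_3': s.shift(1).rolling(window=3).mean(),   # past rows only
})
rows_lost = len(features) - len(features.dropna())

print(features)
print(f"Rows lost to NaN after dropna: {rows_lost}")
```

The shifted rolling mean needs one extra warm-up row, so `dropna()` removes three rows here instead of two.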

## 6. Data Preprocessing

Prepare the data for machine learning models through scaling, encoding, and transformations.

In [None]:
print("🔄 DATA PREPROCESSING")
print("="*50)

df_processed = df_eng.copy()

# 1. Encoding categorical variables
encoder = OneHotEncoder(sparse_output=False, drop='first')
region_encoded = encoder.fit_transform(df_processed[['Region']])
region_df = pd.DataFrame(region_encoded, columns=encoder.get_feature_names_out(['Region']), index=df_processed.index)
df_processed = pd.concat([df_processed.drop('Region', axis=1), region_df], axis=1)
print("✅ Encoded 'Region' using One-Hot Encoding")

# 2. Scaling numerical features
numeric_cols_to_scale = df_processed.select_dtypes(include=np.number).columns
scaler = StandardScaler()
df_processed[numeric_cols_to_scale] = scaler.fit_transform(df_processed[numeric_cols_to_scale])
print("✅ Scaled numerical features using StandardScaler")

print("\n📊 PROCESSED DATA PREVIEW:")
print(df_processed.head().round(2))

# 3. Principal Component Analysis (PCA) for dimensionality reduction
pca = PCA(n_components=0.95) # Retain 95% of variance
df_pca = pca.fit_transform(df_processed)

print(f"\n🔬 PCA ANALYSIS:")
print(f"   Original number of features: {df_processed.shape[1]}")
print(f"   Number of components to retain 95% variance: {pca.n_components_}")

# Visualize explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(1, len(explained_variance) + 1), y=explained_variance, mode='lines+markers'))
fig.add_hline(y=0.95, line_dash='dash', line_color='red', annotation_text='95% Variance')
fig.update_layout(title='PCA Explained Variance', xaxis_title='Number of Components', yaxis_title='Cumulative Explained Variance')
fig.show()
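
The cell above fits the encoder, scaler, and PCA on the full dataset. When a model is evaluated on held-out data, fitting these steps on the training portion only avoids leaking test-set statistics into the features. A minimal sketch with scikit-learn's `Pipeline` and `ColumnTransformer`, using a hypothetical frame and an 80/20 chronological split chosen purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data shaped like a slice of the notebook's columns
rng = np.random.default_rng(0)
n = 100
frame = pd.DataFrame({
    'GDP': 1000 + np.arange(n) * 10 + rng.normal(0, 20, n),
    'Inflation': 2 + rng.normal(0, 0.2, n),
    'Region': rng.choice(['North', 'South', 'East', 'West'], n),
})

# Chronological split: earlier rows train, later rows test
split = int(n * 0.8)
train, test = frame.iloc[:split], frame.iloc[split:]

preprocess = ColumnTransformer(
    [
        ('num', StandardScaler(), ['GDP', 'Inflation']),
        ('cat', OneHotEncoder(drop='first'), ['Region']),
    ],
    sparse_threshold=0,  # always return a dense array for PCA
)
pipeline = Pipeline([
    ('preprocess', preprocess),
    ('pca', PCA(n_components=0.95)),
])

train_pcs = pipeline.fit_transform(train)  # statistics learned from train only
test_pcs = pipeline.transform(test)        # fitted steps reused unchanged
print(train_pcs.shape, test_pcs.shape)
```

Wrapping the steps in one pipeline also means the same object can be passed to cross-validation utilities, which refit it inside each fold.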

## 7. Summary and Next Steps

Key findings from the data exploration and preprocessing steps, and recommendations for modeling.

In [None]:
print("🎯 DATA EXPLORATION SUMMARY")
print("="*50)

print("📊 DATA QUALITY:")
print("  - Initial dataset contained missing values, which were imputed.")
print("  - No duplicate rows were found after initial cleaning.")

print("📈 KEY INSIGHTS:")
print("  - Most economic indicators show clear time-dependent trends and seasonality.")
print("  - Strong correlation observed between Interest Rate and Inflation.")
print("  - The 'Unemployment' variable contained a significant outlier, which should be handled during modeling.")

print("🛠️ FEATURE ENGINEERING:")
print("  - Created time-based, lag, and rolling window features to capture temporal dynamics.")
print("  - Engineered 'Debt_to_GDP' ratio, a critical economic indicator.")

print("🔄 PREPROCESSING:")
print("  - Categorical features were one-hot encoded.")
print("  - Numerical features were standardized to have zero mean and unit variance.")
print("  - PCA suggests that dimensionality can be significantly reduced while retaining most of the variance.")

print("\n💡 NEXT STEPS & RECOMMENDATIONS:")
print("  1. Modeling: Use the preprocessed data to train predictive models.")
print("  2. Feature Selection: Use techniques like RFE or feature importance from tree-based models to select the most relevant features.")
print("  3. Outlier Handling: Implement robust scaling or outlier removal techniques before training sensitive models like linear regression.")
print("  4. Cross-Validation: Use time-series cross-validation (e.g., TimeSeriesSplit) to evaluate models robustly.")

print("\n✅ Data exploration and preprocessing complete!")
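
The cross-validation recommendation can be sketched with scikit-learn's `TimeSeriesSplit`, which yields folds where each test block comes strictly after its training block. The array below is a placeholder for 120 time-ordered rows, and `n_splits=5` is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(120).reshape(-1, 1)  # placeholder for 120 time-ordered rows
tscv = TimeSeriesSplit(n_splits=5)

folds = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```

Unlike shuffled k-fold, the training window only grows forward in time, so no fold is trained on observations that come after its test period.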