# Comprehensive Economic Data Exploration

## Exploratory Data Analysis (EDA) of Kenyan Economic Indicators

This notebook explores three Kenyan economic datasets: annual GDP, public debt, and the monthly KES/USD exchange rate. We load, clean, and merge them at annual frequency, then visualize trends and relationships between the series.

### Key Objectives:
- **Data Ingestion & Cleaning**: Load multiple CSV files, handle inconsistencies, and prepare data for analysis.
- **Time-Series Analysis**: Visualize trends in GDP growth, public debt, the exchange rate, and the debt-to-GDP ratio.
- **Correlation & Feature Analysis**: Understand the interplay between different economic variables.
- **Advanced Visualization**: Utilize Plotly for interactive and insightful charts.
- **Preprocessing**: Prepare a cleaned, merged dataset for advanced modeling.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context('talk')

print("📚 Libraries imported successfully")

## 1. Data Loading and Initial Inspection

Load the three source files, merge them on an annual index, and inspect the merged frame's structure, data types, and date range.
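
Before joining annual series, it can help to confirm that they label each year with the same date. The sketch below uses made-up numbers and hypothetical frame names (`gdp_demo`, `debt_demo`) to show how a year parsed with `format='%Y'` (1 January) fails to line up with a year-end label (31 December), and one possible way to relabel the index.

```python
import pandas as pd

# Hypothetical annual GDP frame: parsing a bare year yields 1 January of that year
gdp_demo = pd.DataFrame(
    {"Nominal_GDP_Ksh_M": [100.0, 110.0, 120.0]},
    index=pd.to_datetime(["2019", "2020", "2021"], format="%Y"),
)

# Hypothetical debt frame labelled at year end, as annual resampling typically produces
debt_demo = pd.DataFrame(
    {"Total_Debt_Ksh_B": [0.05, 0.06, 0.07]},
    index=pd.to_datetime(["2019-12-31", "2020-12-31", "2021-12-31"]),
)

# The labels differ, so an inner join finds no common dates
mismatched = gdp_demo.join(debt_demo, how="inner")
print(len(mismatched))  # 0

# Relabel the year-end dates to the start of their year, then join again
debt_aligned = debt_demo.copy()
debt_aligned.index = debt_aligned.index.to_period("Y").to_timestamp()
aligned = gdp_demo.join(debt_aligned, how="inner")
print(len(aligned))  # 3
```

Checking the shape of the joined frame right after merging makes this kind of silent mismatch easy to spot.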

In [None]:
# Parameters
data_dir = '../data/raw/'

# --- Data Loading and Cleaning Functions ---

def load_gdp_data(path):
    """Loads and cleans the annual GDP data."""
    gdp = pd.read_csv(path)
    gdp.columns = ['Year', 'Nominal_GDP_Ksh_M', 'Real_GDP_Ksh_M', 'Real_GDP_Growth']
    gdp['Year'] = pd.to_datetime(gdp['Year'], format='%Y')
    gdp.set_index('Year', inplace=True)
    for col in gdp.columns:
        gdp[col] = pd.to_numeric(gdp[col].astype(str).str.replace(',', ''), errors='coerce')
    return gdp

def load_public_debt_data(path):
    """Loads and cleans the public debt data."""
    debt = pd.read_csv(path, skiprows=3) # Adjust skiprows as needed
    debt = debt.iloc[:, :2] # Select first two columns
    debt.columns = ['Date', 'Total_Debt_Ksh_B']
    debt['Date'] = pd.to_datetime(debt['Date'], errors='coerce')
    debt.dropna(subset=['Date'], inplace=True)
    debt.set_index('Date', inplace=True)
    debt['Total_Debt_Ksh_B'] = pd.to_numeric(debt['Total_Debt_Ksh_B'].astype(str).str.replace(',', ''), errors='coerce')
    return debt.resample('A').last() # Resample to annual

def load_exchange_rate_data(path):
    """Loads and cleans the monthly exchange rate data."""
    fx = pd.read_csv(path, skiprows=3)
    fx = fx.iloc[:, :2]
    fx.columns = ['Date', 'KES_USD']
    fx['Date'] = pd.to_datetime(fx['Date'], errors='coerce')
    fx.dropna(subset=['Date'], inplace=True)
    fx.set_index('Date', inplace=True)
    fx['KES_USD'] = pd.to_numeric(fx['KES_USD'].astype(str).str.replace(',', ''), errors='coerce')
    return fx.resample('A').mean() # Resample to annual average

# --- Load and Merge Data ---
gdp_df = load_gdp_data(f'{data_dir}Annual GDP.csv')
debt_df = load_public_debt_data(f'{data_dir}Public Debt.csv')
fx_df = load_exchange_rate_data(f'{data_dir}Monthly exchange rate (end period).csv')

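# Merge datasets
# Relabel every index to the start of its year first: the GDP index is YYYY-01-01,
# while the resampled debt and FX indexes are year-end (YYYY-12-31), so an inner
# join on the raw indexes would return no rows
for frame in (gdp_df, debt_df, fx_df):
    frame.index = frame.index.to_period('Y').to_timestamp()
df = gdp_df.join(debt_df, how='inner').join(fx_df, how='inner')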
df['Debt_to_GDP_Ratio'] = (df['Total_Debt_Ksh_B'] * 1000) / df['Nominal_GDP_Ksh_M']

print("📊 DATA OVERVIEW")
print("="*50)
print(f"Shape of merged data: {df.shape}")
print(f"Date range: {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")

print("\n📋 FIRST 5 ROWS:")
print(df.head())

print("\nℹ️ DATA INFO:")
df.info()

## 2. Data Quality and Cleaning

Assess data quality by checking for missing values, duplicates, and inconsistencies. Apply cleaning techniques to prepare the data for analysis.
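
For trending time series, a mean-imputed value can land far from its neighbours. As a point of comparison only, the sketch below contrasts mean imputation with linear interpolation on a made-up series (`debt_demo` and its values are hypothetical, not taken from the datasets).

```python
import numpy as np
import pandas as pd

# Made-up annual series with one missing year (values are illustrative only)
debt_demo = pd.Series(
    [10.0, np.nan, 30.0, 80.0],
    index=pd.to_datetime(["2018", "2019", "2020", "2021"], format="%Y"),
    name="Total_Debt_Ksh_B",
)

# Mean imputation fills the gap with the average of the observed values
mean_filled = debt_demo.fillna(debt_demo.mean())

# Linear interpolation fills the gap from its two neighbours instead
interpolated = debt_demo.interpolate(method="linear")

print(mean_filled.iloc[1])   # 40.0
print(interpolated.iloc[1])  # 20.0
```

Which approach is appropriate depends on how many values are missing and where they fall; interpolation cannot fill a gap at the very start of a series.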

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})

print("🗑️ MISSING VALUES ANALYSIS")
print("="*50)
print(missing_df[missing_df['Missing Values'] > 0])

# Visualize missing values
fig = px.imshow(df.isnull(), title='Missing Value Heatmap', color_continuous_scale='gray_r')
fig.show()

# Handle missing values (imputation)
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print("\n✅ Missing values handled using mean imputation.")

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\n📋 DUPLICATE ROWS: {duplicates}")
if duplicates > 0:
    df = df.drop_duplicates()
    print("✅ Duplicate rows removed.")

## 3. Exploratory Data Analysis (EDA)

Perform EDA to understand the distribution of each variable, identify trends, and uncover relationships between variables.
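
Year-to-year volatility can obscure an underlying trend, and a rolling mean is one common way to smooth it. The sketch below uses made-up growth figures in a hypothetical `growth_demo` series; the 3-year centred window is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up annual growth figures (percent); not taken from the notebook's data
growth_demo = pd.Series(
    [5.0, 1.0, 6.0, 2.0, 7.0],
    index=pd.to_datetime(["2017", "2018", "2019", "2020", "2021"], format="%Y"),
    name="Real_GDP_Growth",
)

# 3-year centred rolling mean: each value averages the previous year,
# the year itself, and the following year
trend = growth_demo.rolling(window=3, center=True).mean()

# The first and last years have no full window, so they come out as NaN
print(trend.round(2).tolist())
```

A wider window smooths more but loses more observations at both ends, which matters for short annual series.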

In [None]:
# Summary statistics
print("📊 SUMMARY STATISTICS")
print("="*50)
print(df.describe().round(2))

# --- Advanced Time-Series Visualization ---

fig = make_subplots(
    rows=3, cols=1,
    shared_xaxes=True,
    subplot_titles=('Real GDP Growth vs. KES/USD Exchange Rate', 'Public Debt (Ksh Billions)', 'Debt-to-GDP Ratio'),
    vertical_spacing=0.1,
    # The first panel needs a secondary y-axis for the exchange rate trace
    specs=[[{"secondary_y": True}], [{}], [{}]]
)

# Plot 1: Real GDP Growth (left axis) and FX (right axis)
fig.add_trace(go.Bar(x=df.index, y=df['Real_GDP_Growth'], name='Real GDP Growth %', marker_color='lightblue'), row=1, col=1)
fig.add_trace(go.Scatter(x=df.index, y=df['KES_USD'], name='KES/USD Rate', mode='lines', marker_color='purple'), secondary_y=True, row=1, col=1)

# Plot 2: Public Debt
fig.add_trace(go.Scatter(x=df.index, y=df['Total_Debt_Ksh_B'], name='Total Debt (Ksh B)', mode='lines+markers', fill='tozeroy', marker_color='green'), row=2, col=1)

# Plot 3: Debt-to-GDP Ratio
fig.add_trace(go.Scatter(x=df.index, y=df['Debt_to_GDP_Ratio'], name='Debt-to-GDP Ratio', mode='lines', marker_color='red'), row=3, col=1)

fig.update_layout(
    title='Key Kenyan Economic Indicators Over Time',
    height=900,
    showlegend=True,
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
)
# Label the two axes of the first panel separately
fig.update_yaxes(title_text="Real GDP Growth (%)", row=1, col=1, secondary_y=False)
fig.update_yaxes(title_text="KES per USD", row=1, col=1, secondary_y=True)
fig.update_yaxes(title_text="Ksh Billions", row=2, col=1)
fig.update_yaxes(title_text="Ratio", row=3, col=1)

fig.show()

## 4. Correlation and Relationship Analysis

Analyze the correlation between variables to identify multicollinearity and understand key relationships.
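
Series that both trend upward tend to correlate strongly in levels even when they are otherwise unrelated, so comparing year-on-year changes is a common sanity check. The sketch below illustrates this with two synthetic, independently generated series (`series_a` and `series_b` are hypothetical and share only a linear trend).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two synthetic series: the same linear trend plus independent noise
trend = np.arange(30, dtype=float)
demo = pd.DataFrame(
    {
        "series_a": trend + rng.normal(0.0, 1.0, size=30),
        "series_b": trend + rng.normal(0.0, 1.0, size=30),
    },
    index=np.arange(1991, 2021),
)

# Correlation of the levels is dominated by the shared trend
level_corr = demo["series_a"].corr(demo["series_b"])

# Correlation of the first differences removes the trend
change_corr = demo["series_a"].diff().corr(demo["series_b"].diff())

print(f"levels: {level_corr:.2f}, year-on-year changes: {change_corr:.2f}")
```

A large gap between the two figures suggests the level correlation mostly reflects a common trend and says little about how the series move together from year to year.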

In [None]:
# Correlation matrix
numeric_cols = df.select_dtypes(include=np.number).columns
corr_matrix = df[numeric_cols].corr()

print("🔗 CORRELATION MATRIX")
print("="*50)
print(corr_matrix.round(2))

# Visualize correlation matrix
fig = px.imshow(corr_matrix, 
                title='Correlation Matrix of Key Economic Indicators',
                color_continuous_scale='RdBu_r',
                zmin=-1, zmax=1,
                text_auto=True)
fig.update_layout(height=600)
fig.show()

# Pair plot for detailed relationship analysis
print("\n📈 Generating pair plot of all numeric indicators...")
fig = px.scatter_matrix(df, dimensions=numeric_cols,
                        title='Pair Plot of Economic Indicators',
                        height=800)
fig.update_traces(diagonal_visible=False)
fig.show()

## 5. Feature Engineering

Derive year-over-year changes, lagged GDP growth, and an interaction term from the merged data for use in later modeling.
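
To make the mechanics concrete, here is a minimal sketch of a year-over-year change and a one-period lag on made-up values. The column name mirrors the notebook's `KES_USD`, but the numbers and the `KES_USD_Lag_1` column are purely illustrative.

```python
import pandas as pd

# Made-up annual values; the numbers are illustrative only
demo = pd.DataFrame({"KES_USD": [100.0, 110.0, 121.0]}, index=[2019, 2020, 2021])

# Year-over-year percentage change: (current - previous) / previous * 100
demo["KES_USD_YoY_Change"] = demo["KES_USD"].pct_change() * 100

# One-period lag: each row carries the previous year's value
demo["KES_USD_Lag_1"] = demo["KES_USD"].shift(1)

print(demo.round(2))

# The first year has no predecessor, so both new columns are NaN there
# and dropna() removes that row
print(demo.dropna().shape)  # (2, 3)
```

Each additional lag costs one more row at the start of the series, which is worth keeping in mind with a short annual sample.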

In [None]:
print("🛠️ FEATURE ENGINEERING")
print("="*50)

df_eng = df.copy()

# Year-over-Year changes
for col in ['Nominal_GDP_Ksh_M', 'Total_Debt_Ksh_B', 'KES_USD']:
    df_eng[f'{col}_YoY_Change'] = df_eng[col].pct_change() * 100
    print(f"✅ Added {col} YoY Change")

# Lag features
for lag in [1, 2]:
    df_eng[f'Real_GDP_Growth_Lag_{lag}'] = df_eng['Real_GDP_Growth'].shift(lag)
    print(f"✅ Added Real GDP Growth Lag {lag}")

# Interaction term
df_eng['FX_x_DebtRatio'] = df_eng['KES_USD'] * df_eng['Debt_to_GDP_Ratio']
print("✅ Added interaction feature: FX_x_DebtRatio")

# Drop rows with NaNs created by feature engineering
df_eng = df_eng.dropna()

print(f"\n📊 Shape of engineered data: {df_eng.shape}")
print("\n📋 ENGINEERED FEATURES PREVIEW:")
print(df_eng.head())

## 6. Data Preprocessing

Prepare the data for machine learning models by standardizing the numeric features and applying PCA for dimensionality reduction.
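
As a rough illustration of what standardization and the 95% explained-variance threshold mean, the sketch below reproduces both steps with plain NumPy on a made-up matrix. It is a simplified stand-in for the idea, not the scikit-learn implementation, and the matrix `X` is hypothetical.

```python
import numpy as np

# Made-up matrix: 6 observations of 3 features. The second feature is exactly
# twice the first, so those two columns carry the same information.
X = np.array([
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 8.0],
    [4.0, 8.0, 1.0],
    [5.0, 10.0, 9.0],
    [6.0, 12.0, 4.0],
])

# Standardize: subtract each column's mean and divide by its population
# standard deviation (ddof=0), which is what StandardScaler does
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized data: the squared singular values are
# proportional to the variance captured by each component
singular_values = np.linalg.svd(Z, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained_ratio.round(3), n_components)
```

Because two of the three columns are redundant, two components are enough here; strongly correlated economic indicators can behave similarly.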

In [None]:
print("🔄 DATA PREPROCESSING")
print("="*50)

df_processed = df_eng.copy()

# Scaling numerical features
numeric_cols_to_scale = df_processed.select_dtypes(include=np.number).columns
scaler = StandardScaler()
df_processed[numeric_cols_to_scale] = scaler.fit_transform(df_processed[numeric_cols_to_scale])
print("✅ Scaled numerical features using StandardScaler")

print("\n📊 PROCESSED DATA PREVIEW:")
print(df_processed.head().round(2))

# Principal Component Analysis (PCA) for dimensionality reduction
pca = PCA(n_components=0.95) # Retain 95% of variance
df_pca = pca.fit_transform(df_processed)

print(f"\n🔬 PCA ANALYSIS:")
print(f"   Original number of features: {df_processed.shape[1]}")
print(f"   Number of components to retain 95% variance: {pca.n_components_}")

# Visualize explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(1, len(explained_variance) + 1), y=explained_variance, mode='lines+markers'))
fig.add_hline(y=0.95, line_dash='dash', line_color='red', annotation_text='95% Variance')
fig.update_layout(title='PCA Explained Variance on Economic Data', xaxis_title='Number of Components', yaxis_title='Cumulative Explained Variance')
fig.show()

## 7. Summary and Next Steps

Key findings from the data exploration and preprocessing steps, and recommendations for modeling.

In [None]:
print("🎯 COMPREHENSIVE EDA SUMMARY")
print("="*50)

print("📊 DATA QUALITY & INTEGRATION:")
print("  - Successfully loaded and merged multiple datasets (GDP, Debt, FX rates).")
print("  - Handled data type inconsistencies and resampled to a consistent annual frequency.")

print("📈 KEY INSIGHTS:")
print("  - Strong visual correlation between the rise in public debt and the depreciation of the KES.")
print("  - The Debt-to-GDP ratio shows a significant upward trend, a key indicator of fiscal pressure.")
print("  - Real GDP growth has been volatile, influenced by both domestic and external factors.")

print("🛠️ FEATURE ENGINEERING:")
print("  - Created Year-over-Year change features to capture economic momentum.")
print("  - Lagged GDP growth to account for delayed effects in economic systems.")

print("🔄 PREPROCESSING:")
print("  - Standardized all numerical features to prepare for modeling.")
print("  - PCA indicates that a smaller number of components can explain most of the variance in the dataset.")

print("\n💡 NEXT STEPS & RECOMMENDATIONS:")
print("  1. Advanced Modeling: Use the preprocessed and engineered data to train forecasting models (e.g., VAR, Prophet, LSTMs).")
print("  2. Causality Analysis: Investigate causal relationships between variables (e.g., Granger causality).")
print("  3. Scenario Modeling: Use the insights to model different economic scenarios and their potential impact.")

print("\n✅ Data exploration and preprocessing complete!")