# 🌍 African Commodities Paradox - Quickstart Guide

**Author:** Abraham Adegoke  
**Date:** December 2025

This notebook provides an interactive introduction to the African Commodities Paradox analysis tool.

---

## 📋 What This Tool Does

This project analyzes the relationship between **commodity dependence** and **economic volatility** in African countries.

**Key Questions:**
- Do resource-rich countries experience more volatile growth?
- Which factors (commodity dependence, inflation, governance) amplify instability?
- Can we predict GDP growth volatility using structural indicators?

## ⚙️ Setup: Import Libraries and Configure

In [None]:
import sys
from pathlib import Path

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("✅ Libraries imported successfully")
print(f"📁 Project root: {project_root}")

## 📝 User Configuration

Customize your analysis by modifying the parameters below:

In [None]:
# ============================================
# 📝 USER CONFIGURATION
# ============================================

# Option 1: Choose specific countries (ISO3 codes)
COUNTRIES = ['NGA', 'ZAF', 'KEN', 'GHA', 'EGY', 'DZA', 'AGO', 'ETH']

# Option 2: Use a predefined subset (uncomment to use)
# from yaml import safe_load
# with open(project_root / 'configs/countries.yaml', 'r') as f:
#     config = safe_load(f)
# COUNTRIES = config['oil_exporters']  # or: mineral_dependent, agricultural, all_countries

# Time period
START_YEAR = 2000
END_YEAR = 2023

# Download settings
DOWNLOAD_FRESH_DATA = False  # Set to True to download new data

print("📊 Analysis Configuration:")
print(f"  Countries: {', '.join(COUNTRIES)}")
print(f"  Period: {START_YEAR} - {END_YEAR}")
print(f"  Download new data: {DOWNLOAD_FRESH_DATA}")
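
Before downloading anything, it can help to sanity-check the configuration. The helper below is a hypothetical sketch and not part of the project: `validate_config` is an assumed name, and it only checks that each code has the shape of an ISO3 code (three uppercase letters) and that the year range is ordered. It does not verify that a code actually exists.

```python
import re

def validate_config(countries, start_year, end_year):
    """Return a list of problems found in the configuration (empty if none).

    Hypothetical helper for illustration; not part of the project code.
    """
    problems = []
    for code in countries:
        # ISO3-style codes are three uppercase ASCII letters, e.g. 'NGA'
        if not re.fullmatch(r"[A-Z]{3}", code):
            problems.append(f"not an ISO3-style code: {code!r}")
    if len(set(countries)) != len(countries):
        problems.append("duplicate country codes")
    if start_year > end_year:
        problems.append(f"START_YEAR {start_year} is after END_YEAR {end_year}")
    return problems

# 'ken' is lowercase on purpose to trigger one problem
issues = validate_config(['NGA', 'ZAF', 'ken'], 2000, 2023)
print(issues)
```

In the notebook, the call would be `validate_config(COUNTRIES, START_YEAR, END_YEAR)`, with an empty list meaning nothing was flagged.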

## 📥 Step 1: Load or Download Data

In [None]:
data_path = project_root / 'data/raw/worldbank_wdi.csv'

if DOWNLOAD_FRESH_DATA or not data_path.exists():
    print("📥 Downloading data from World Bank...")
    print("(This may take 2-5 minutes depending on the number of countries)\n")
    
    from data_io.worldbank import fetch_wdi_data
    
    df_raw = fetch_wdi_data(
        countries=COUNTRIES,
        start_year=START_YEAR,
        end_year=END_YEAR,
        output_path=str(data_path)
    )
    print(f"\n✅ Downloaded {len(df_raw)} records")
else:
    print("📂 Loading existing data...")
    df_raw = pd.read_csv(data_path)
    print(f"✅ Loaded {len(df_raw)} records")

# Display sample
print("\n📋 Sample of raw data:")
df_raw.head(10)
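
The later cells refer to a fixed set of column names, so a quick schema check right after loading can save a confusing error further down. The sketch below is an illustration on a toy frame; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the list is simply collected from the columns this notebook uses, not from the project's data dictionary.

```python
import pandas as pd

# Column names referenced by the later cells of this notebook (assumed list)
EXPECTED_COLUMNS = [
    'country', 'country_name', 'year', 'cdi_raw',
    'gdp_growth', 'inflation', 'trade_openness', 'investment',
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]

# Toy frame with made-up values, standing in for df_raw
sample = pd.DataFrame({'country': ['NGA'], 'year': [2000], 'gdp_growth': [5.0]})
print(missing_columns(sample))
```

Calling `missing_columns(df_raw)` in the notebook would list any columns whose steps will be skipped by the `if ... in df.columns` guards below.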

## 🔍 Step 2: Data Exploration

In [None]:
# Basic statistics
print("📊 Dataset Overview:")
print(f"  Shape: {df_raw.shape}")
print(f"  Countries: {df_raw['country'].nunique()}")
print(f"  Years: {df_raw['year'].min()} - {df_raw['year'].max()}")
print(f"  Total observations: {len(df_raw)}")

print("\n📈 Countries in dataset:")
country_counts = df_raw.groupby('country_name')['year'].count().sort_values(ascending=False)
print(country_counts)

print("\n❌ Missing values:")
missing = df_raw.isnull().sum()
missing_pct = (missing / len(df_raw) * 100).round(1)
missing_df = pd.DataFrame({'missing': missing, 'pct': missing_pct})
print(missing_df[missing_df['missing'] > 0].sort_values('missing', ascending=False))
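
The table above reports missing values for the whole dataset, which can hide the fact that gaps tend to cluster in particular countries. A minimal sketch of a per-country breakdown follows, using a toy frame with made-up values; only the column names are taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: AAA is mostly complete, BBB has no CDI values at all
toy_raw = pd.DataFrame({
    'country': ['AAA', 'AAA', 'BBB', 'BBB'],
    'cdi_raw': [80.0, np.nan, np.nan, np.nan],
    'gdp_growth': [3.0, 4.0, 1.0, np.nan],
})

# Share of missing values (in %) per country and per indicator
missing_share = (
    toy_raw.drop(columns='country')
    .isnull()
    .groupby(toy_raw['country'])
    .mean() * 100
)
print(missing_share)
```

A country with a high share for `cdi_raw` or `gdp_growth` will contribute few rows once the rolling windows and lags are applied.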

In [None]:
# Visualize Commodity Dependence Index (CDI)
if 'cdi_raw' in df_raw.columns:
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot 1: CDI distribution
    axes[0].hist(df_raw['cdi_raw'].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[0].set_xlabel('Commodity Dependence Index (%)')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Distribution of Commodity Dependence (CDI)')
    axes[0].axvline(df_raw['cdi_raw'].mean(), color='red', linestyle='--', 
                    label=f'Mean: {df_raw["cdi_raw"].mean():.1f}%')
    axes[0].legend()
    
    # Plot 2: Average CDI by country
    cdi_by_country = df_raw.groupby('country_name')['cdi_raw'].mean().sort_values(ascending=False).head(10)
    colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(cdi_by_country)))
    axes[1].barh(range(len(cdi_by_country)), cdi_by_country.values, color=colors)
    axes[1].set_yticks(range(len(cdi_by_country)))
    axes[1].set_yticklabels(cdi_by_country.index)
    axes[1].set_xlabel('Average CDI (%)')
    axes[1].set_title('Top 10 Most Commodity-Dependent Countries')
    axes[1].invert_yaxis()
    
    plt.tight_layout()
    plt.show()
    
    print("\n🔥 Most commodity-dependent countries:")
    print(cdi_by_country)
else:
    print("⚠️ CDI column not found in data")

## ⚙️ Step 3: Feature Engineering

Now we'll create the features needed for modeling:
1. **CDI smoothing** (3-year moving average)
2. **GDP growth volatility** (5-year rolling std)
3. **Lagged features** (t-1)
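
To see what these three transformations do before applying them to the real data, here is a small sketch on a toy panel. The values and the column names `ma3`, `vol5`, and `growth_lag1` are made up for illustration; the window sizes and `min_periods` settings mirror the ones used in the cell below.

```python
import pandas as pd

# Two toy countries, five years each (illustrative values)
toy = pd.DataFrame({
    'country': ['AAA'] * 5 + ['BBB'] * 5,
    'year': list(range(2000, 2005)) * 2,
    'gdp_growth': [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 2.0, 2.0, 2.0, 2.0],
})

# 3-year moving average; min_periods=1 gives a value from the first year on
toy['ma3'] = toy.groupby('country')['gdp_growth'].transform(
    lambda x: x.rolling(window=3, min_periods=1).mean()
)

# 5-year rolling standard deviation; min_periods=3 leaves the first two years NaN
toy['vol5'] = toy.groupby('country')['gdp_growth'].transform(
    lambda x: x.rolling(window=5, min_periods=3).std()
)

# Lag by one year within each country, so values never leak across countries
toy['growth_lag1'] = toy.groupby('country')['gdp_growth'].shift(1)

print(toy)
```

Grouping by country matters in all three steps: the first row of `BBB` gets `NaN` for its lag instead of inheriting the last value of `AAA`, and each rolling window restarts at the country boundary.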

In [None]:
# Sort data
df = df_raw.sort_values(['country', 'year']).copy()

# 1. Smooth CDI with 3-year moving average
print("⚙️  Applying 3-year moving average to CDI...")
if 'cdi_raw' in df.columns:
    df['cdi_smooth'] = df.groupby('country')['cdi_raw'].transform(
        lambda x: x.rolling(window=3, min_periods=1).mean()
    )
else:
    print("⚠️ cdi_raw not found, skipping CDI smoothing")

# 2. Calculate 5-year rolling volatility of GDP growth
print("📊 Calculating GDP growth volatility (5-year rolling std)...")
if 'gdp_growth' in df.columns:
    df['gdp_volatility'] = df.groupby('country')['gdp_growth'].transform(
        lambda x: x.rolling(window=5, min_periods=3).std()
    )
    # Log-transform volatility (as per proposal); the 0.01 offset avoids log(0)
    df['log_gdp_volatility'] = np.log(df['gdp_volatility'] + 0.01)
else:
    print("⚠️ gdp_growth not found, skipping volatility calculation")

# 3. Create lagged features (t-1)
print("🔄 Creating lagged features (t-1)...")
lag_features = ['cdi_smooth', 'inflation', 'trade_openness', 'investment']

for feature in lag_features:
    if feature in df.columns:
        df[f'{feature}_lag1'] = df.groupby('country')[feature].shift(1)

# Remove rows with NaN in target; copy so that later column assignments
# (e.g. cdi_quartile) modify an independent DataFrame
if 'log_gdp_volatility' in df.columns:
    df_features = df.dropna(subset=['log_gdp_volatility']).copy()
else:
    df_features = df.copy()

print("\n✅ Feature engineering complete!")
print(f"  Final dataset shape: {df_features.shape}")
print(f"  Features created: {[col for col in df_features.columns if 'lag1' in col or 'smooth' in col or 'volatility' in col]}")

# Display sample
print("\n📋 Sample with engineered features:")
display_cols = ['country_name', 'year', 'cdi_raw', 'cdi_smooth', 'gdp_growth', 'gdp_volatility', 'log_gdp_volatility']
display_cols = [c for c in display_cols if c in df_features.columns]
df_features[display_cols].head(10)

In [None]:
# Visualize relationship between CDI and volatility
if 'cdi_smooth' in df_features.columns and 'log_gdp_volatility' in df_features.columns:
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot 1: Scatter plot
    scatter_data = df_features.dropna(subset=['cdi_smooth', 'log_gdp_volatility'])
    axes[0].scatter(scatter_data['cdi_smooth'], scatter_data['log_gdp_volatility'], alpha=0.5)
    axes[0].set_xlabel('Commodity Dependence Index (smoothed)')
    axes[0].set_ylabel('Log GDP Growth Volatility')
    axes[0].set_title('CDI vs Economic Volatility')
    
    # Add trend line
    if len(scatter_data) > 1:
        z = np.polyfit(scatter_data['cdi_smooth'], scatter_data['log_gdp_volatility'], 1)
        p = np.poly1d(z)
        x_line = np.linspace(scatter_data['cdi_smooth'].min(), scatter_data['cdi_smooth'].max(), 100)
        axes[0].plot(x_line, p(x_line), "r--", alpha=0.8, label='Trend')
        axes[0].legend()
    
    # Plot 2: Boxplot by CDI quartiles
    df_features['cdi_quartile'] = pd.qcut(
        df_features['cdi_smooth'].dropna(), 
        q=4, 
        labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)']
    )
    
    quartile_data = df_features.dropna(subset=['cdi_quartile', 'gdp_volatility'])
    quartile_data.boxplot(column='gdp_volatility', by='cdi_quartile', ax=axes[1])
    axes[1].set_xlabel('CDI Quartile')
    axes[1].set_ylabel('GDP Growth Volatility')
    axes[1].set_title('Economic Volatility by Commodity Dependence')
    plt.suptitle('')  # Remove default title
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Volatility statistics by CDI quartile:")
    print(quartile_data.groupby('cdi_quartile')['gdp_volatility'].describe())
else:
    print("⚠️ Required columns not available for visualization")
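
The boxplot relies on `pd.qcut` to split the smoothed CDI into four equal-sized groups. The sketch below shows that behaviour on a toy series with made-up values, including how a missing value is handled.

```python
import numpy as np
import pandas as pd

# Eight toy CDI values plus one missing value
cdi = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, np.nan], dtype=float)

# qcut places bin edges at the quantiles, so each bin gets about the same count
quartiles = pd.qcut(cdi, q=4, labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)'])

print(quartiles.value_counts(sort=False))
print(quartiles.isna().sum())  # the NaN input stays unassigned
```

One caveat: `pd.qcut` raises a `ValueError` when the quantile edges are not unique, which can happen if many observations share the same CDI value. Passing `duplicates='drop'` avoids that error but yields fewer bins, so the four fixed labels would then no longer match.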

## 💾 Step 4: Save Processed Data

In [None]:
# Save to processed folder
output_path = project_root / 'data/processed/features_ready.csv'
output_path.parent.mkdir(parents=True, exist_ok=True)

# Drop temporary columns
if 'cdi_quartile' in df_features.columns:
    df_features = df_features.drop(columns=['cdi_quartile'])

df_features.to_csv(output_path, index=False)

print(f"✅ Processed data saved to: {output_path}")
print("\n📊 Final dataset summary:")
print(f"  Shape: {df_features.shape}")
print(f"  Countries: {df_features['country'].nunique()}")
print(f"  Years: {df_features['year'].min()} - {df_features['year'].max()}")
print("\n💡 Next steps:")
print("  1. Train models: python scripts/train_models.py")
print("  2. Or continue exploring in this notebook")

## 🎯 Step 5: Quick Model Training (Optional)

Let's train a quick model to see initial results:

In [None]:
# Prepare data for modeling
features = ['cdi_smooth_lag1', 'inflation_lag1', 'trade_openness_lag1', 'investment_lag1']
target = 'log_gdp_volatility'

# Check which features are available
available_features = [f for f in features if f in df_features.columns]
print(f"Available features: {available_features}")

if target in df_features.columns and len(available_features) >= 2:
    # Drop NaN
    df_model = df_features[available_features + [target]].dropna()
    print(f"\nModeling dataset: {len(df_model)} observations")
    
    X = df_model[available_features]
    y = df_model[target]
    
    # Train/test split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    print(f"Train set: {len(X_train)} | Test set: {len(X_test)}")
else:
    print("⚠️ Not enough features or target variable for modeling")
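
One thing to keep in mind with the random split above: the target is a 5-year rolling standard deviation, so neighbouring years of the same country share most of their window. A random split can therefore place nearly identical targets in both train and test, which tends to make test scores look better than they would on truly unseen years. A possible alternative is to split by year. The sketch below is an assumption-laden illustration on a toy panel; `split_by_year` and the cutoff are hypothetical.

```python
import pandas as pd

def split_by_year(df, cutoff_year, year_col='year'):
    """Rows up to and including cutoff_year go to train, later rows to test.

    Hypothetical helper for illustration; not part of the project code.
    """
    train = df[df[year_col] <= cutoff_year]
    test = df[df[year_col] > cutoff_year]
    return train, test

# Toy panel: two countries, years 2010-2015
panel = pd.DataFrame({
    'country': ['AAA'] * 6 + ['BBB'] * 6,
    'year': list(range(2010, 2016)) * 2,
})

train, test = split_by_year(panel, cutoff_year=2013)
print(len(train), len(test))
```

To use this in the notebook, `year` would need to be kept alongside the features and target, since `df_model` above selects only those columns. Windows that straddle the cutoff still overlap, so leaving a gap of a few years between train and test would be a stricter variant.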

In [None]:
# Quick Ridge Regression
if 'X_train' in dir():
    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
    
    # Train Ridge
    ridge = RidgeCV(alphas=np.logspace(-2, 3, 50), cv=5)
    ridge.fit(X_train, y_train)
    
    # Predict
    y_pred = ridge.predict(X_test)
    
    # Evaluate
    print("\n📊 Ridge Regression Results:")
    print(f"  Best alpha: {ridge.alpha_:.4f}")
    print(f"  R² Score: {r2_score(y_test, y_pred):.4f}")
    print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
    print(f"  MAE: {mean_absolute_error(y_test, y_pred):.4f}")
    
    # Coefficients
    print("\n📈 Feature Coefficients:")
    coef_df = pd.DataFrame({
        'feature': available_features,
        'coefficient': ridge.coef_
    }).sort_values('coefficient', key=abs, ascending=False)
    print(coef_df.to_string(index=False))
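
A note on reading the coefficients: Ridge penalises all coefficients equally, so features measured on different scales are shrunk unevenly, and ranking raw coefficients by absolute size mostly reflects units. One common remedy is to standardise the features inside a pipeline. The sketch below uses synthetic data, generated only for illustration, with the same `RidgeCV` settings as above; it is not the project's training script.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustration only)
X_demo = np.column_stack([
    rng.normal(50, 20, 200),   # e.g. a percentage-like feature
    rng.normal(0, 0.5, 200),   # e.g. a small-scale feature
])
y_demo = 0.02 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# Scaling happens inside the pipeline, so it is fitted on training data only
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-2, 3, 50), cv=5),
)
model.fit(X_demo, y_demo)

# Coefficients are now per standard deviation of each feature
coefs = model.named_steps['ridgecv'].coef_
print(coefs)
```

In the notebook, the same pipeline could be fitted with `model.fit(X_train, y_train)`, after which the coefficients are comparable across features.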