{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🌍 African Commodities Paradox - Quickstart Guide\n",
    "\n",
    "**Author:** Abraham Adegoke  \n",
    "**Date:** November 2025\n",
    "\n",
    "This notebook provides an interactive introduction to the African Commodities Paradox analysis tool.\n",
    "\n",
    "---\n",
    "\n",
    "## 📋 What This Tool Does\n",
    "\n",
    "This project analyzes the relationship between **commodity dependence** and **economic volatility** in African countries.\n",
    "\n",
    "**Key Questions:**\n",
    "- Do resource-rich countries experience more volatile growth?\n",
    "- Which factors (commodity dependence, inflation, governance) amplify instability?\n",
    "- Can we predict GDP growth volatility using structural indicators?\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ⚙️ Setup: Choose Your Analysis Parameters\n",
    "\n",
    "**Customize your analysis by selecting:**\n",
    "- Countries to analyze\n",
    "- Time period\n",
    "- Whether to download fresh data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# 📝 USER CONFIGURATION\n",
    "# =====================\n",
    "\n",
    "# Option 1: Choose specific countries (ISO3 codes)\n",
    "COUNTRIES = ['NGA', 'ZAF', 'KEN', 'GHA', 'EGY', 'DZA', 'AGO', 'ETH']\n",
    "\n",
    "# Option 2: Or use a predefined subset (uncomment both lines below to use)\n",
    "# SUBSET = 'oil_exporters'  # Options: oil_exporters, mineral_dependent, agricultural, all_countries\n",
    "# COUNTRIES = None  # Set to None when using SUBSET\n",
    "\n",
    "# Time period\n",
    "START_YEAR = 2000\n",
    "END_YEAR = 2023\n",
    "\n",
    "# Download settings\n",
    "DOWNLOAD_FRESH_DATA = True  # Set to False to use existing data\n",
    "\n",
    "print(\"📊 Analysis Configuration:\")\n",
    "print(f\"  Countries: {', '.join(COUNTRIES) if COUNTRIES else SUBSET}\")\n",
    "print(f\"  Period: {START_YEAR} - {END_YEAR}\")\n",
    "print(f\"  Download new data: {DOWNLOAD_FRESH_DATA}\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📥 Step 1: Data Collection\n",
    "\n",
    "Download economic indicators from the World Bank for your selected countries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Add src to path\n",
    "sys.path.insert(0, str(Path.cwd().parent / 'src'))\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Set plot style\n",
    "sns.set_style('whitegrid')\n",
    "plt.rcParams['figure.figsize'] = (12, 6)\n",
    "\n",
    "print(\"✅ Libraries imported successfully\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "if DOWNLOAD_FRESH_DATA:\n",
    "    print(\"📥 Downloading data from World Bank...\")\n",
    "    print(\"(This may take 2-5 minutes depending on the number of countries)\\n\")\n",
    "    \n",
    "    from data_io.worldbank import fetch_wdi_data\n",
    "    \n",
    "    # Load countries from config if using subset\n",
    "    if COUNTRIES is None:\n",
    "        import yaml\n",
    "        with open('../configs/countries.yaml', 'r') as f:\n",
    "            config = yaml.safe_load(f)\n",
    "        COUNTRIES = config[SUBSET]\n",
    "    \n",
    "    # Fetch data\n",
    "    df_raw = fetch_wdi_data(\n",
    "        countries=COUNTRIES,\n",
    "        start_year=START_YEAR,\n",
    "        end_year=END_YEAR,\n",
    "        output_path='../data/raw/worldbank_wdi.csv'\n",
    "    )\n",
    "    \n",
    "    print(f\"\\n✅ Downloaded {len(df_raw)} records\")\n",
    "else:\n",
    "    print(\"📂 Loading existing data...\")\n",
    "    df_raw = pd.read_csv('../data/raw/worldbank_wdi.csv')\n",
    "    print(f\"✅ Loaded {len(df_raw)} records\")\n",
    "\n",
    "# Display sample\n",
    "print(\"\\n📋 Sample of raw data:\")\n",
    "df_raw.head(10)"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔍 Step 2: Data Exploration\n",
    "\n",
    "Let's explore the data we just downloaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Basic statistics\n",
    "print(\"📊 Dataset Overview:\")\n",
    "print(f\"  Shape: {df_raw.shape}\")\n",
    "print(f\"  Countries: {df_raw['country'].nunique()}\")\n",
    "print(f\"  Years: {df_raw['year'].min()} - {df_raw['year'].max()}\")\n",
    "print(f\"  Total observations: {len(df_raw)}\")\n",
    "\n",
    "print(\"\\n📈 Countries in dataset:\")\n",
    "country_counts = df_raw.groupby('country_name')['year'].count().sort_values(ascending=False)\n",
    "print(country_counts)\n",
    "\n",
    "print(\"\\n❌ Missing values:\")\n",
    "missing = df_raw.isnull().sum()\n",
    "missing = missing[missing > 0].sort_values(ascending=False)\n",
    "print(missing)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Visualize Commodity Dependence Index (CDI)\n",
    "fig, axes = plt.subplots(1, 2, figsize=(15, 5))\n",
    "\n",
    "# Plot 1: CDI distribution, with the sample mean marked\n",
    "cdi_mean = df_raw['cdi_raw'].mean()\n",
    "axes[0].hist(df_raw['cdi_raw'].dropna(), bins=30, edgecolor='black')\n",
    "axes[0].set_xlabel('Commodity Dependence Index (%)')\n",
    "axes[0].set_ylabel('Frequency')\n",
    "axes[0].set_title('Distribution of Commodity Dependence (CDI)')\n",
    "axes[0].axvline(cdi_mean, color='red', linestyle='--', label=f'Mean: {cdi_mean:.1f}%')\n",
    "axes[0].legend()\n",
    "\n",
    "# Plot 2: Average CDI by country (top 10)\n",
    "cdi_by_country = df_raw.groupby('country_name')['cdi_raw'].mean().sort_values(ascending=False).head(10)\n",
    "axes[1].barh(range(len(cdi_by_country)), cdi_by_country.values)\n",
    "axes[1].set_yticks(range(len(cdi_by_country)))\n",
    "axes[1].set_yticklabels(cdi_by_country.index)\n",
    "axes[1].set_xlabel('Average CDI (%)')\n",
    "axes[1].set_title('Top 10 Most Commodity-Dependent Countries')\n",
    "axes[1].invert_yaxis()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\\n🔥 Most commodity-dependent countries:\")\n",
    "print(cdi_by_country)"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ⚙️ Step 3: Feature Engineering\n",
    "\n",
    "Now we'll create the features needed for modeling:\n",
    "1. **CDI smoothing** (3-year moving average)\n",
    "2. **GDP growth volatility** (5-year rolling std)\n",
    "3. **Lagged features** (t-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Sort data so rolling windows and lags follow time order within each country\n",
    "df = df_raw.sort_values(['country', 'year']).copy()\n",
    "\n",
    "# 1. Smooth CDI with 3-year moving average\n",
    "print(\"⚙️  Applying 3-year moving average to CDI...\")\n",
    "df['cdi_smooth'] = df.groupby('country')['cdi_raw'].transform(\n",
    "    lambda x: x.rolling(window=3, min_periods=1).mean()\n",
    ")\n",
    "\n",
    "# 2. Calculate 5-year rolling volatility of GDP growth (needs at least 3 observations)\n",
    "print(\"📊 Calculating GDP growth volatility (5-year rolling std)...\")\n",
    "df['gdp_volatility'] = df.groupby('country')['gdp_growth'].transform(\n",
    "    lambda x: x.rolling(window=5, min_periods=3).std()\n",
    ")\n",
    "\n",
    "# Log-transform volatility (as per proposal); the 0.01 offset avoids log(0)\n",
    "df['log_gdp_volatility'] = np.log(df['gdp_volatility'] + 0.01)\n",
    "\n",
    "# 3. Create lagged features (t-1)\n",
    "print(\"🔄 Creating lagged features (t-1)...\")\n",
    "lag_features = ['cdi_smooth', 'inflation', 'trade_openness', 'investment']\n",
    "\n",
    "for feature in lag_features:\n",
    "    df[f'{feature}_lag1'] = df.groupby('country')[feature].shift(1)\n",
    "\n",
    "# Remove rows with NaN in target (copy so later column assignments do not hit a view of df)\n",
    "df_features = df.dropna(subset=['log_gdp_volatility']).copy()\n",
    "\n",
    "print(\"\\n✅ Feature engineering complete!\")\n",
    "print(f\"  Final dataset shape: {df_features.shape}\")\n",
    "print(f\"  Features created: {[col for col in df_features.columns if 'lag1' in col or 'smooth' in col or 'volatility' in col]}\")\n",
    "\n",
    "# Display sample\n",
    "print(\"\\n📋 Sample with engineered features:\")\n",
    "df_features[['country_name', 'year', 'cdi_raw', 'cdi_smooth', 'gdp_growth', 'gdp_volatility', 'log_gdp_volatility']].head(10)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Visualize relationship between CDI and volatility\n",
    "fig, axes = plt.subplots(1, 2, figsize=(15, 5))\n",
    "\n",
    "# Plot 1: Scatter plot\n",
    "axes[0].scatter(df_features['cdi_smooth'], df_features['log_gdp_volatility'], alpha=0.5)\n",
    "axes[0].set_xlabel('Commodity Dependence Index (smoothed)')\n",
    "axes[0].set_ylabel('Log GDP Growth Volatility')\n",
    "axes[0].set_title('CDI vs Economic Volatility')\n",
    "\n",
    "# Add trend line, fitted only on rows where both variables are present\n",
    "# (dropping NaNs per column separately can leave x and y with different lengths)\n",
    "trend_data = df_features[['cdi_smooth', 'log_gdp_volatility']].dropna().sort_values('cdi_smooth')\n",
    "z = np.polyfit(trend_data['cdi_smooth'], trend_data['log_gdp_volatility'], 1)\n",
    "p = np.poly1d(z)\n",
    "axes[0].plot(trend_data['cdi_smooth'], p(trend_data['cdi_smooth']), \"r--\", alpha=0.8, label='Trend')\n",
    "axes[0].legend()\n",
    "\n",
    "# Plot 2: Boxplot by CDI quartiles\n",
    "df_features = df_features.assign(\n",
    "    cdi_quartile=pd.qcut(df_features['cdi_smooth'], q=4, labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)'])\n",
    ")\n",
    "df_features.boxplot(column='gdp_volatility', by='cdi_quartile', ax=axes[1])\n",
    "axes[1].set_xlabel('CDI Quartile')\n",
    "axes[1].set_ylabel('GDP Growth Volatility')\n",
    "axes[1].set_title('Economic Volatility by Commodity Dependence')\n",
    "plt.suptitle('')  # Remove default title\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\\n📊 Volatility statistics by CDI quartile:\")\n",
    "print(df_features.groupby('cdi_quartile', observed=True)['gdp_volatility'].describe())"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 💾 Step 4: Save Processed Data\n",
    "\n",
    "Save the feature-engineered dataset for modeling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Save to processed folder\n",
    "output_path = '../data/processed/features_ready.csv'\n",
    "Path(output_path).parent.mkdir(parents=True, exist_ok=True)\n",
    "df_features.to_csv(output_path, index=False)\n",
    "\n",
    "print(f\"✅ Processed data saved to: {output_path}\")\n",
    "print(\"\\n📊 Final dataset summary:\")\n",
    "print(f\"  Shape: {df_features.shape}\")\n",
    "print(f\"  Countries: {df_features['country'].nunique()}\")\n",
    "print(f\"  Years: {df_features['year'].min()} - {df_features['year'].max()}\")\n",
    "print(\"\\n💡 Next steps:\")\n",
    "print(\"  1. Explore further: notebooks/01_data_exploration.ipynb\")\n",
    "print(\"  2. Train models: notebooks/03_modeling.ipynb\")\n",
    "print(\"  3. Or use CLI: python scripts/train_models.py\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎯 Summary\n",
    "\n",
    "**What we accomplished:**\n",
    "\n",
    "✅ Downloaded economic data for selected African countries  \n",
    "✅ Calculated Commodity Dependence Index (CDI)  \n",
    "✅ Engineered features: smoothed CDI, GDP volatility, lagged variables  \n",
    "✅ Explored relationship between commodity dependence and economic volatility  \n",
    "✅ Saved processed data for modeling  \n",
    "\n",
    "**Key Insights:**\n",
    "- Higher commodity dependence appears correlated with greater economic volatility\n",
    "- Country-specific patterns vary significantly\n",
    "- Ready for machine learning modeling!\n",
    "\n",
    "---\n",
    "\n",
    "**Continue to the modeling notebooks to:**\n",
    "- Train Ridge Regression and Gradient Boosting models\n",
    "- Predict GDP growth volatility\n",
    "- Identify key drivers of economic instability\n",
    "- Generate insights for policy recommendations"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}