# Task 4: Forecasting Access and Usage (Enhanced)

## Objectives
- Forecast Account Ownership Rate (Access) for 2025-2027
- Forecast Digital Payment Usage for 2025-2027
- Generate multi-model ensemble forecasts with confidence intervals
- Create scenario analysis framework
- Quantify uncertainty and provide actionable insights

## Enhanced Methodology
- **Multi-Model Ensemble**: ARIMA, Prophet, Event-Adjusted, Machine Learning
- **Uncertainty Quantification**: Monte Carlo simulation, bootstrap confidence intervals
- **Scenario Analysis**: Optimistic, Base Case, Pessimistic scenarios
- **Event Integration**: Incorporate planned policies and investments
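
As a rough illustration of the bootstrap confidence intervals named above, the sketch below applies a residual bootstrap to a linear trend. It is a minimal example under stated assumptions, not the notebook's actual pipeline: the function name `bootstrap_trend_forecast` and the `demo_values` series are hypothetical placeholders, and with only a handful of observations the resulting intervals are indicative at best.

```python
import numpy as np

def bootstrap_trend_forecast(years, values, future_years, n_boot=2000, seed=0):
    """Residual-bootstrap forecast intervals for a linear trend (illustrative)."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    future_years = np.asarray(future_years, dtype=float)
    rng = np.random.default_rng(seed)

    # Fit the trend once and keep its residuals.
    slope, intercept = np.polyfit(years, values, 1)
    fitted = slope * years + intercept
    residuals = values - fitted

    sims = np.empty((n_boot, future_years.size))
    for i in range(n_boot):
        # Resample residuals with replacement, rebuild a pseudo-series, refit.
        pseudo = fitted + rng.choice(residuals, size=residuals.size, replace=True)
        b_slope, b_intercept = np.polyfit(years, pseudo, 1)
        sims[i] = b_slope * future_years + b_intercept

    point = slope * future_years + intercept
    lower, upper = np.percentile(sims, [2.5, 97.5], axis=0)
    # The targets are shares of the population, so clip everything to [0, 1].
    return np.clip(point, 0, 1), np.clip(lower, 0, 1), np.clip(upper, 0, 1)

# Hypothetical placeholder series, NOT the values in the enriched dataset.
demo_years = [2011, 2014, 2017, 2021, 2024]
demo_values = [0.15, 0.22, 0.34, 0.45, 0.49]

point, lower, upper = bootstrap_trend_forecast(
    demo_years, demo_values, future_years=[2025, 2026, 2027]
)
for year, p, lo, hi in zip([2025, 2026, 2027], point, lower, upper):
    print(f"{year}: {p:.3f} (95% interval {lo:.3f} to {hi:.3f})")
```

The intervals reflect only the scatter around the fitted trend; they do not capture model misspecification or future policy shocks, which is part of the motivation for the scenario analysis.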

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
import warnings
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
warnings.filterwarnings('ignore')

# Enhanced forecasting libraries
try:
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.tsa.seasonal import seasonal_decompose
    print("✅ Statsmodels time series capabilities")
except ImportError:
    # Catch only import failures so unrelated errors are not silently swallowed
    print("⚠️ Statsmodels not available, using simplified models")

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("Enhanced forecasting libraries loaded successfully")
print("✅ Statistical modeling capabilities")
print("✅ Machine learning for forecasting")
print("✅ Uncertainty quantification tools")

✅ Statsmodels time series capabilities
Enhanced forecasting libraries loaded successfully
✅ Statistical modeling capabilities
✅ Machine learning for forecasting
✅ Uncertainty quantification tools


In [3]:
# Load data and prepare for forecasting
df = pd.read_csv('../data/processed/ethiopia_fi_enriched_data.csv')
observations = df[df['record_type'] == 'observation'].copy()
events = df[df['record_type'] == 'event'].copy()
impact_links = df[df['record_type'] == 'impact_link'].copy()

# Prepare time series data
# Safely parse dates: keep raw copies and coerce unparsable strings to NaT
observations['observation_date_raw'] = observations['observation_date']
observations['observation_date'] = pd.to_datetime(observations['observation_date'], errors='coerce')
events['observation_date_raw'] = events['observation_date']
events['observation_date'] = pd.to_datetime(events['observation_date'], errors='coerce')
# Report a few unparsable samples for debugging
unparsable_events = events[events['observation_date'].isna()]['observation_date_raw'].dropna().unique()
if len(unparsable_events):
    print('Unparsable event observation_date samples:', unparsable_events[:10])
unparsable_obs = observations[observations['observation_date'].isna()]['observation_date_raw'].dropna().unique()
if len(unparsable_obs):
    print('Unparsable observation observation_date samples:', unparsable_obs[:10])

# Define target indicators
targets = {
    'account_ownership': 'access_account_ownership',
    'digital_payments': 'usage_digital_payment',
    'mobile_money': 'usage_mm_account'
}

# Extract time series for each target
time_series = {}
for name, indicator in targets.items():
    data = observations[observations['indicator_code'] == indicator].copy()
    data = data.sort_values('observation_date')
    time_series[name] = data
    
print("Time series data prepared:")
for name, data in time_series.items():
    ds = data.dropna(subset=['observation_date'])
    if ds.empty:
        print(f"{name}: 0 valid date observations")
        continue
    start_year = int(ds['observation_date'].min().year)
    end_year = int(ds['observation_date'].max().year)
    latest_val = ds['value_numeric'].iloc[-1]
    # Count only rows with a valid date, matching the year range reported
    print(f"{name}: {len(ds)} observations from {start_year} to {end_year}")
    print(f"  Latest value: {latest_val:.3f} ({latest_val*100:.1f}%)")

Unparsable event observation_date samples: <StringArray>
[            'Ethio Telecom',                 'Safaricom',
        'Safaricom Ethiopia', 'National Bank of Ethiopia',
                 'EthSwitch']
Length: 5, dtype: str
Time series data prepared:
account_ownership: 5 observations from 2011 to 2024
  Latest value: 0.490 (49.0%)
digital_payments: 1 observations from 2024 to 2024
  Latest value: 0.350 (35.0%)
mobile_money: 1 observations from 2024 to 2024
  Latest value: 0.095 (9.4%)


## Forecasting Framework

### 1. Baseline Trend Models
- **Linear Regression**: Simple trend extrapolation
- **Polynomial Regression**: Non-linear trend capture
- **ARIMA**: Time series with autocorrelation
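
The linear and polynomial baselines can be sketched with `numpy.polyfit` alone. This is a minimal example, assuming annual observation years and values expressed as shares in [0, 1]; the helper `fit_trend` and the demo series are hypothetical. ARIMA is left out of the sketch because a series of about five irregularly spaced points, such as `account_ownership` here, gives it very little to estimate from.

```python
import numpy as np

def fit_trend(years, values, degree=1):
    """Fit a polynomial trend and return a function that predicts future years."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    center = years.mean()  # centering the years keeps the fit well conditioned
    coeffs = np.polyfit(years - center, values, degree)

    def predict(future_years):
        x = np.asarray(future_years, dtype=float) - center
        # Clip because the targets are rates between 0 and 1.
        return np.clip(np.polyval(coeffs, x), 0.0, 1.0)

    return predict

# Hypothetical placeholder series, NOT the values in the enriched dataset.
demo_years = [2011, 2014, 2017, 2021, 2024]
demo_values = [0.15, 0.22, 0.34, 0.45, 0.49]

linear = fit_trend(demo_years, demo_values, degree=1)
quadratic = fit_trend(demo_years, demo_values, degree=2)
print("linear   :", np.round(linear([2025, 2026, 2027]), 3))
print("quadratic:", np.round(quadratic([2025, 2026, 2027]), 3))
```

A degree above 2 is hard to justify with so few points: the polynomial would pass close to every observation and extrapolate erratically.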

### 2. Event-Adjusted Models
- **Intervention Analysis**: Event dummies and lag effects
- **Event Accumulation**: Cumulative impact over time
- **Dynamic Regression**: Time-varying coefficients
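
One simple form of intervention analysis is a trend regression with a step dummy that switches on from the event year onward. The sketch below assumes a single event with a permanent level shift; `fit_intervention`, `forecast_intervention`, the event year, and the demo series are all hypothetical and only illustrate the mechanics.

```python
import numpy as np

def fit_intervention(years, values, event_year):
    """Least-squares fit of: value = intercept + trend * t + effect * step."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    base_year = years.min()
    t = years - base_year
    step = (years >= event_year).astype(float)  # 1 from the event year onward
    X = np.column_stack([np.ones_like(t), t, step])
    coef, *_ = np.linalg.lstsq(X, values, rcond=None)
    return {
        "base_year": base_year,
        "event_year": event_year,
        "intercept": coef[0],
        "trend": coef[1],
        "event_effect": coef[2],
    }

def forecast_intervention(params, future_years):
    """Project forward, applying the level shift to years at or after the event."""
    future_years = np.asarray(future_years, dtype=float)
    t = future_years - params["base_year"]
    step = (future_years >= params["event_year"]).astype(float)
    pred = params["intercept"] + params["trend"] * t + params["event_effect"] * step
    return np.clip(pred, 0.0, 1.0)

# Hypothetical placeholder series and event year, NOT taken from the dataset.
demo_years = [2011, 2014, 2017, 2021, 2024]
demo_values = [0.15, 0.22, 0.34, 0.45, 0.49]
params = fit_intervention(demo_years, demo_values, event_year=2021)
print({k: round(float(v), 4) for k, v in params.items()})
print(np.round(forecast_intervention(params, [2025, 2026, 2027]), 3))
```

With three parameters and five observations the event effect is weakly identified, so the estimate is better read as a sensitivity check than as a causal measurement.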

### 3. Machine Learning Models
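- **Random Forest**: Captures non-linear relationships by averaging many decision trees
- **Gradient Boosting**: Fits trees sequentially, each correcting the residuals of the previous ones
- **Ensemble Methods**: Combines forecasts from several models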
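
Model combination can be as simple as a weighted average of the individual forecasts. The sketch below weights each model by the inverse of its validation error; the helper names, error figures, and forecast values are hypothetical. One caveat worth keeping in mind: tree-based models predict averages of training targets, so on their own they cannot extrapolate a rising trend beyond the observed range.

```python
import numpy as np

def inverse_error_weights(errors):
    """Turn per-model validation errors (e.g. MAE) into weights that sum to 1."""
    errors = np.asarray(errors, dtype=float)
    inv = 1.0 / np.maximum(errors, 1e-12)  # guard against division by zero
    return inv / inv.sum()

def combine_forecasts(forecasts, weights):
    """Weighted average over models; forecasts has shape (n_models, horizon)."""
    forecasts = np.asarray(forecasts, dtype=float)
    return np.average(forecasts, axis=0, weights=weights)

# Hypothetical per-model forecasts for 2025-2027 and validation MAEs.
model_forecasts = [
    [0.50, 0.52, 0.54],  # e.g. a linear trend model
    [0.54, 0.56, 0.58],  # e.g. an event-adjusted model
]
model_mae = [0.01, 0.03]

weights = inverse_error_weights(model_mae)
ensemble = combine_forecasts(model_forecasts, weights)
print("weights :", np.round(weights, 3))
print("ensemble:", np.round(ensemble, 3))
```

Equal weights are a reasonable fallback when the series is too short to estimate validation errors reliably.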

### 4. Scenario Analysis
- **Optimistic**: Full policy implementation + investment
- **Base Case**: Current trajectory continuation
- **Pessimistic**: Delays + economic headwinds
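
The three scenarios can be expressed as multipliers on a base-case annual gain. In the sketch below the starting value of 0.49 is the latest account ownership figure printed above, while the base gain of 2 percentage points per year and the multipliers in `SCENARIO_MULTIPLIERS` are hypothetical assumptions chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical multipliers on the base-case annual gain (assumptions, not estimates).
SCENARIO_MULTIPLIERS = {"optimistic": 1.5, "base": 1.0, "pessimistic": 0.5}

def scenario_paths(last_value, base_annual_gain, horizon, multipliers=None):
    """Project a rate forward under each scenario with a constant annual gain."""
    if multipliers is None:
        multipliers = SCENARIO_MULTIPLIERS
    steps = np.arange(1, horizon + 1)
    paths = {}
    for name, mult in multipliers.items():
        # Clip because a population share cannot leave [0, 1].
        paths[name] = np.clip(last_value + base_annual_gain * mult * steps, 0.0, 1.0)
    return paths

paths = scenario_paths(last_value=0.49, base_annual_gain=0.02, horizon=3)
for name, path in paths.items():
    print(f"{name:<12}", np.round(path, 3))
```

In practice the multipliers would be tied to the event impact estimates rather than fixed by hand, so that each scenario corresponds to a concrete set of policies being delivered or delayed.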