# Task 4 — Forecasting Access and Usage (2025–2027)
 **Objective:** Forecast (i) Account Ownership (Access) and (ii) Digital Financial Usage (Usage)
 for Ethiopia over 2025–2027.

 **Targets (rubric + implementation reality):**
 - Access: `ACC_OWNERSHIP` = % of adults with an account (FI or mobile money)
 - Usage (rubric): `USG_DIGITAL_PAYMENT` = % of adults who made or received a digital payment
 - Usage (implemented proxy): `USG_ACTIVE_RATE` = % of adults active in digital finance, used
   because `USG_DIGITAL_PAYMENT` is not present in the compiled dataset.

 **Approach (sparse data):**
 1) Baseline trend forecast: OLS linear trend if the series has ≥2 points, else a flat carry-forward of the last observed value
 2) Event-augmented forecast = baseline + annual Task 3 event effects
 3) Scenarios (pessimistic / base / optimistic)

 **Uncertainty:**
 - OLS trend: approximate prediction intervals
 - Flat baseline: SD grows with horizon (configurable)
 - Events: scenario scaling + assumed event uncertainty (implemented in forecasting module)
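
 To make the flat-baseline uncertainty concrete, here is a minimal sketch. It is not the `fi.forecasting` implementation: it assumes the SD grows with the square root of the horizon from a configurable one-year value and uses a normal 95% band. The function name `flat_baseline_with_band` and the parameters `sd_1y_pp` and `z` are hypothetical, and the inputs are placeholders rather than observed data.

```python
import math


def flat_baseline_with_band(last_value_pp, last_year, years, sd_1y_pp=2.0, z=1.96):
    """Carry the last observed value forward and widen the band with horizon.

    Hypothetical helper: sd_1y_pp (SD one year ahead, in percentage points)
    and the sqrt-of-horizon growth rule are illustrative assumptions.
    """
    rows = []
    for year in years:
        horizon = year - last_year          # years ahead of the last observation
        sd = sd_1y_pp * math.sqrt(horizon)  # assumed growth rule
        lo = max(0.0, last_value_pp - z * sd)    # shares are bounded to [0, 100]
        hi = min(100.0, last_value_pp + z * sd)
        rows.append({"year": year, "pred_pp": last_value_pp, "lo_pp": lo, "hi_pp": hi})
    return rows


# Placeholder inputs, not observed data
band_rows = flat_baseline_with_band(40.0, 2024, [2025, 2026, 2027])
for row in band_rows:
    print(row)
```

 The point forecast stays constant while the band widens each year, which is the behaviour the flat baseline is meant to convey when no trend can be estimated.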

 **Outputs:**
 - `outputs/task_4/forecast_table_task4.csv`
 - `outputs/task_4/scenario_plot_access_usage.png`
 - `outputs/task_4/top_event_contributors_2027.csv`

In [None]:
from __future__ import annotations

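import os
import sys
from os import PathLike
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

PROJECT_ROOT = Path.cwd()
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent
    # Later cells write to relative "outputs/task_4/..." paths; switch to the
    # project root so they land next to the Task 3 outputs under PROJECT_ROOT.
    os.chdir(PROJECT_ROOT)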

SRC = PROJECT_ROOT / "src"
if str(SRC) not in sys.path:
    sys.path.insert(0, str(SRC))

from fi.data_io import load_csv
from fi.forecasting import (
    DEFAULT_SCENARIOS,
    baseline_forecast_auto,
    event_effects_by_year,
    forecast_event_augmented,
    prepare_findex_series,
    top_event_contributors,
)

OBS_PATH = PROJECT_ROOT / "data" / "processed" / "eda_enriched" / "observations.csv"
EFFECTS_PATH = PROJECT_ROOT / "outputs" / "task_3" / "event_effects_tidy.csv"

# quick guardrail
assert OBS_PATH.exists(), f"Missing: {OBS_PATH}"
assert EFFECTS_PATH.exists(), f"Missing: {EFFECTS_PATH}"

# Final Task 4 targets (Usage uses the proxy indicator code)
TARGETS = {
    "ACC_OWNERSHIP": "Account ownership (Access)",
    "USG_ACTIVE_RATE": "Active usage rate (Usage proxy)",
}

YEARS_FCST = [2025, 2026, 2027]

# Optional: quick sanity thresholds for plotting
Y_MIN, Y_MAX = 0, 100

## Load

In [None]:
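# Note: this local definition shadows the `load_csv` imported from
# `fi.data_io` in the setup cell; it is a thin wrapper around pd.read_csv.
def load_csv(path: str | PathLike[str]) -> pd.DataFrame:
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(path)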

df_obs = load_csv(OBS_PATH)
df_eff = load_csv(EFFECTS_PATH)

display(df_obs.head())
display(df_eff.head())

## Guardrail: verify Task 3 tidy contains forecast years

If the `year` column of `event_effects_tidy.csv` has no rows for 2025, 2026, or 2027, Task 4 event impacts are zero
for the missing years, even if an `event_date` falls in 2025. This check is informational and does not stop execution.

In [None]:
years_present = set(pd.to_numeric(df_eff["year"], errors="coerce").dropna().astype(int).unique().tolist())
missing = [y for y in YEARS_FCST if y not in years_present]
if missing:
    print(
        "WARNING: event_effects_tidy.csv is missing forecast years: "
        + ", ".join(map(str, missing))
        + "\nEvent impacts in Task 4 will be 0 for those years until Task 3 tidy is regenerated with an expanded year grid."
    )
else:
    print("OK: event_effects_tidy.csv contains all forecast years.")

## 1) Baseline forecasts + 2) Event-augmented forecasts + scenarios

Notes:
 - `baseline_forecast_auto()` uses OLS if enough historical points exist; otherwise it carries forward the last observed value.
 - `event_effects_by_year()` should already include proxy mapping logic (e.g., `USG_P2P_COUNT` → `USG_ACTIVE_RATE`) if configured in the module.
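
 The selection rule in the first note can be sketched as follows. This is an illustrative stand-in, not the code of `baseline_forecast_auto()`: the name `baseline_sketch` and the sample series are assumptions, and it returns point forecasts only (no intervals).

```python
def baseline_sketch(history, years):
    """Point forecasts (pp) from a {year: value_pp} history.

    Hypothetical helper: fits a straight line by ordinary least squares when
    at least two points exist, otherwise carries the last value forward.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    xs = sorted(history)
    ys = [history[x] for x in xs]
    if len(xs) < 2:
        # Too few points for a trend: flat carry-forward
        return {year: ys[-1] for year in years}
    # Closed-form simple OLS: slope = Sxy / Sxx
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return {year: intercept + slope * year for year in years}


# Placeholder series lying on an exact line (2 pp per year), not survey data
trend_fc = baseline_sketch({2014: 20.0, 2017: 26.0, 2021: 34.0}, [2025, 2026, 2027])
flat_fc = baseline_sketch({2021: 34.0}, [2025, 2026, 2027])
print(trend_fc)
print(flat_fc)
```

 With only a handful of survey waves the fitted slope is sensitive to each point, which is why the trend branch should be read together with its prediction interval.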


In [None]:
all_fc = []

for code, label in TARGETS.items():
    series = prepare_findex_series(
        df_obs,
        indicator_code=code,
        gender="all",
        location="national",
    )
    print(f"\nHistorical series — {code} — {label}")
    display(series)

    # Baseline (auto handles sparse series)
    base = baseline_forecast_auto(series, YEARS_FCST)

    # Annual event effects (should be non-zero only if Task 3 tidy includes 2025–2027 rows)
    ev = event_effects_by_year(df_eff, indicator_code=code, years=YEARS_FCST)

    for scen in DEFAULT_SCENARIOS:
        fc = forecast_event_augmented(base, ev, scenario=scen)
        fc.insert(0, "indicator_code", code)
        fc.insert(1, "indicator_label", label)
        all_fc.append(fc)

forecast_table = pd.concat(all_fc, ignore_index=True)
forecast_table = forecast_table.sort_values(["indicator_code", "scenario", "year"]).reset_index(drop=True)
display(forecast_table)

## Save forecast table

In [None]:
Path("outputs/task_4").mkdir(parents=True, exist_ok=True)
forecast_table.to_csv("outputs/task_4/forecast_table_task4.csv", index=False)
print("Saved: outputs/task_4/forecast_table_task4.csv")

## Scenario visualization

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=False)

for ax, (code, label) in zip(axes, TARGETS.items()):
    sub = forecast_table[forecast_table["indicator_code"] == code]
    for scen in ["pessimistic", "base", "optimistic"]:
        ssub = sub[sub["scenario"] == scen].sort_values("year")
        ax.plot(ssub["year"], ssub["pred_pp"], label=scen)
        ax.fill_between(ssub["year"], ssub["lo_pp"], ssub["hi_pp"], alpha=0.15)

    ax.set_title(label)
    ax.set_xlabel("Year")
    ax.set_ylabel("% of adults")
    ax.set_ylim(Y_MIN, Y_MAX)
    ax.grid(True, alpha=0.3)

axes[0].legend(title="Scenario")
fig.suptitle("Task 4 Forecasts (2025–2027): scenarios with uncertainty bands")
plt.tight_layout()
plt.show()

fig.savefig("outputs/task_4/scenario_plot_access_usage.png", dpi=200, bbox_inches="tight")
print("Saved: outputs/task_4/scenario_plot_access_usage.png")

## Which events have the largest potential impact? (interpretation support)

This uses `top_event_contributors()` against Task 3 tidy output.
If Task 3 tidy does not contain year=2027 rows, the table will be empty or near-zero.

In [None]:
for code, label in TARGETS.items():
    top = top_event_contributors(df_eff, indicator_code=code, year=2027, k=5)
    print(f"\nTop event contributors (2027) — {code} — {label}")
    display(top)

## Save top contributors table (optional but helpful for audit trail)

In [None]:
tops = []
for code, label in TARGETS.items():
    top = top_event_contributors(df_eff, indicator_code=code, year=2027, k=10)
    top.insert(0, "indicator_code", code)
    top.insert(1, "indicator_label", label)
    tops.append(top)

tops_df = pd.concat(tops, ignore_index=True) if len(tops) else pd.DataFrame()
tops_df.to_csv("outputs/task_4/top_event_contributors_2027.csv", index=False)
print("Saved: outputs/task_4/top_event_contributors_2027.csv")
display(tops_df.head(20))

## Written interpretation (rubric deliverable)

### What does the model predict?
 - The **baseline** uses a linear regression when enough survey points exist; otherwise it uses a conservative flat baseline.
 - The **event-augmented forecast** adds annualized event effects from Task 3 (e.g., product launches, infrastructure).
 - We provide **pessimistic/base/optimistic** scenarios for 2025–2027.
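
 As a rough illustration of how the three scenarios relate to the baseline, the sketch below adds scaled annual event effects to a baseline path. It is an assumption-laden example, not the behaviour of `forecast_event_augmented()`: the multipliers in `SCENARIO_SCALE`, the function name, and all numbers are hypothetical, and the module's own settings in `DEFAULT_SCENARIOS` may differ.

```python
# Hypothetical multipliers for illustration only
SCENARIO_SCALE = {"pessimistic": 0.5, "base": 1.0, "optimistic": 1.5}


def event_augmented_sketch(baseline_pp, event_effect_pp, scenario):
    """Add scaled annual event effects (pp) to a baseline path (pp), year by year."""
    scale = SCENARIO_SCALE[scenario]
    return {
        # A year with no event-effect entry contributes zero
        year: baseline_pp[year] + scale * event_effect_pp.get(year, 0.0)
        for year in baseline_pp
    }


# Placeholder numbers, not model output
baseline = {2025: 42.0, 2026: 44.0, 2027: 46.0}
effects = {2025: 1.0, 2026: 2.0}  # no 2027 entry, so 2027 equals the baseline

scenario_paths = {
    name: event_augmented_sketch(baseline, effects, name) for name in SCENARIO_SCALE
}
for name, path in scenario_paths.items():
    print(name, path)
```

 Note that all three scenarios collapse onto the baseline in any year without event-effect rows, which is the situation the Task 3 year-grid guardrail warns about.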

### What events have the largest potential impact?
 - The table `outputs/task_4/top_event_contributors_2027.csv` lists the biggest event contributions in 2027.
 - These are the events the model is most sensitive to (largest absolute pp effect).
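
 Ranking by absolute effect means a large negative event can outrank a smaller positive one. The sketch below shows that ordering on placeholder rows; it is not the implementation of `top_event_contributors()`, and the function name and dictionary keys (`event`, `effect_pp`) are hypothetical.

```python
def top_contributors_sketch(rows, indicator_code, year, k=5):
    """Rank events by absolute effect (pp) for one indicator and year.

    rows: list of dicts with hypothetical keys
    'event', 'indicator_code', 'year', 'effect_pp'.
    """
    matches = [
        r for r in rows
        if r["indicator_code"] == indicator_code and r["year"] == year
    ]
    # Sort on magnitude so that negative effects are not pushed to the bottom
    return sorted(matches, key=lambda r: abs(r["effect_pp"]), reverse=True)[:k]


# Placeholder rows, not Task 3 output
sample_rows = [
    {"event": "A", "indicator_code": "ACC_OWNERSHIP", "year": 2027, "effect_pp": 0.8},
    {"event": "B", "indicator_code": "ACC_OWNERSHIP", "year": 2027, "effect_pp": -1.5},
    {"event": "C", "indicator_code": "ACC_OWNERSHIP", "year": 2026, "effect_pp": 3.0},
]
ranked = top_contributors_sketch(sample_rows, "ACC_OWNERSHIP", 2027)
print([r["event"] for r in ranked])
```

 Event C is excluded because it belongs to 2026, which mirrors why an effects file without 2027 rows yields an empty table.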

### Key uncertainties / limitations
 - **Sparse data** ⇒ slopes can be weakly identified; intervals widen quickly.
 - **Proxy usage KPI** ⇒ `USG_ACTIVE_RATE` is used as a proxy for the rubric's `USG_DIGITAL_PAYMENT` due to data availability.
 - **Additivity** ⇒ event effects add linearly and do not saturate; real adoption may saturate.
 - **Forecast-year event effects** ⇒ Task 3 tidy must include 2025–2027 in its year grid or event impacts will be zero.
 - **Structural breaks** (macro shocks, regulation changes) are not explicitly modeled.
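
 The additivity limitation can be seen with a small numeric sketch. It is not part of the forecasting module: it compares plain addition with one possible alternative, applying the same effect on the logit scale so that the result stays strictly between 0 and 100%. The function names and inputs are illustrative assumptions.

```python
import math


def additive(base_pp, effect_pp):
    """Assumption described in the text: effects add linearly, with no ceiling."""
    return base_pp + effect_pp


def logit_additive(base_pp, effect_pp):
    """Illustrative alternative: add the effect on the logit scale.

    The pp effect is converted with the local slope p * (1 - p), so small
    effects match the additive rule while large ones flatten near 0 or 100.
    Requires 0 < base_pp < 100.
    """
    p = base_pp / 100.0
    shift = (effect_pp / 100.0) / (p * (1.0 - p))
    z = math.log(p / (1.0 - p)) + shift
    return 100.0 / (1.0 + math.exp(-z))


# Placeholder values chosen to show the ceiling problem
print(additive(90.0, 15.0))        # exceeds 100
print(logit_additive(90.0, 15.0))  # stays below 100
```

 For small effects near the middle of the range the two rules agree closely, so the difference matters mainly when an indicator approaches its ceiling.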