# Predictive Modeling of County-Level Acute Food Insecurity in Kenya

## Business Understanding

### Context and Motivation

Food insecurity remains a persistent challenge in Kenya, particularly in arid and semi-arid lands (ASALs). According to the [World Food Programme – Kenya](https://www.wfp.org/countries/kenya), **approximately 4.3 million people were projected to face acute food insecurity during the peak of the 2023 drought**, highlighting the scale of climate-driven vulnerability. The [Global Hunger Index – Kenya](https://www.globalhungerindex.org/kenya.html) classifies Kenya’s hunger level as **“serious”**, indicating ongoing structural food security challenges. Projections from the [EU Knowledge for Policy platform](https://knowledge4policy.ec.europa.eu/publication/kenya-acute-food-insecurity-situation-july-september-2024-projection-october-2024_en) further showed that **multiple counties were expected to experience IPC Phase 3 (Crisis) or worse during seasonal periods in 2024**.

These recent assessments demonstrate that large populations in Kenya repeatedly face acute food insecurity. However, IPC analyses are released only periodically, so between releases decision-makers mostly see conditions after they occur. This limits early action and often leads to reactive humanitarian response. There is therefore a clear need for predictive systems that can anticipate food insecurity and support proactive decision-making.

---

### Project Goal

This project develops a predictive early-warning model for acute food insecurity risk at the county level in Kenya by integrating historical IPC classifications with key drivers including climate conditions, staple food price dynamics, conflict events, and baseline poverty indicators. The resulting model and dashboard will provide forward-looking risk scores to support early action by county governments, humanitarian organizations, and policy planners, helping shift decision-making from reactive response to anticipatory planning.

---

### Project Objectives

1. Build a county-level early-warning system that anticipates IPC Phase 3+ food insecurity 1–3 months ahead.
2. Reveal the most influential climate, market, conflict, and poverty drivers shaping food insecurity risk across Kenya.
3. Produce reliable risk scores and hotspot signals to support timely policy and humanitarian action.
4. Deliver a decision-ready dashboard that visualizes risk trends, spatial patterns, and key contributing factors.
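
Objective 1 implies pairing the information available in a given month with the crisis label observed one to three months later. The following is a minimal sketch of that label shift, not the project's final pipeline. It assumes a tidy table with one row per county-month and columns `county`, `month` (month-start timestamps), and a binary `ipc_crisis`; the function name `add_lead_targets` and the `ipc_crisis_lead{h}` column names are hypothetical.

```python
import pandas as pd

def add_lead_targets(df, horizons=(1, 2, 3)):
    """Attach the crisis label observed h calendar months ahead (hypothetical helper).

    Assumes one row per (county, month); duplicate keys would multiply rows in the merge.
    """
    out = df.sort_values(["county", "month"]).reset_index(drop=True)
    for h in horizons:
        # Shift the month key back by h months so a future label lines up with "today"
        future = df[["county", "month", "ipc_crisis"]].copy()
        future["month"] = future["month"] - pd.DateOffset(months=h)
        future = future.rename(columns={"ipc_crisis": f"ipc_crisis_lead{h}"})
        out = out.merge(future, on=["county", "month"], how="left")
    return out

# Toy labels for illustration only, not real IPC outcomes
demo = pd.DataFrame({
    "county": ["Marsabit"] * 4,
    "month": pd.date_range("2024-01-01", periods=4, freq="MS"),
    "ipc_crisis": [0, 0, 1, 1],
})
labeled = add_lead_targets(demo)
print(labeled)
```

Joining on the calendar month, rather than shifting rows within each county, keeps the horizon correct when a county has gaps in its monthly series; months with no future label are left as missing.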

---

## Data Understanding

This project leverages five key data sources that represent both the target outcome and predictive signals:

### IPC Kenya Classifications

The IPC dataset provides historical classifications of food insecurity severity (Phase 1–5) by area and validity period, reported as population counts and shares per phase. It serves as the **target variable** for supervised learning: a county-month is flagged as in crisis when at least 20% of its analysed population is classified in Phase 3 or worse.

### Subnational Rainfall Indicators

Rainfall data contains dekadal (10-day) totals, rolling 1-month and 3-month totals, their long-term averages, and anomalies for Kenyan administrative units. These indicators capture **climate stress**, particularly drought conditions, which are a known driver of crop failure and livestock losses.
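
In the raw file, records are dated the 1st, 11th, and 21st of each month, so a monthly feature has to be aggregated from three dekads. The sketch below shows one possible aggregation; it assumes `rfh` is the dekadal rainfall total and `rfh_avg` its long-term average, and the function name `dekadal_to_monthly` and the output column names are hypothetical.

```python
import pandas as pd

def dekadal_to_monthly(rain):
    """Aggregate dekadal rainfall to one row per PCODE and month (hypothetical helper)."""
    rain = rain.copy()
    rain["date"] = pd.to_datetime(rain["date"], errors="coerce")
    rain["month"] = rain["date"].dt.to_period("M").dt.to_timestamp()
    monthly = (
        rain.groupby(["PCODE", "month"], as_index=False)
            .agg(
                rain_mm=("rfh", "sum"),          # monthly total from dekadal totals
                rain_mm_avg=("rfh_avg", "sum"),  # long-term average for the same dekads
                n_dekads=("rfh", "count"),       # flags incomplete months (< 3 dekads)
            )
    )
    # Percent of normal; a zero long-term average would produce inf and needs handling
    monthly["rain_pct_of_normal"] = 100 * monthly["rain_mm"] / monthly["rain_mm_avg"]
    return monthly

# The three January 1981 dekads for KE019 from the raw rainfall file
sample = pd.DataFrame({
    "date": ["1981-01-01", "1981-01-11", "1981-01-21"],
    "PCODE": ["KE019"] * 3,
    "rfh": [7.3724, 4.3255, 5.5691],
    "rfh_avg": [15.7594, 19.2948, 16.2654],
})
monthly_sample = dekadal_to_monthly(sample)
print(monthly_sample)
```

For this sample the monthly total equals the `r1h` value reported on 1981-01-21 (17.267), which is consistent with reading `r1h` as a rolling one-month sum of `rfh`.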

### WFP Food Prices

Food price data tracks the cost of staple commodities — such as maize, beans, and rice — in various markets over time. Changes in staple food prices reflect **market stress** and reduced food access for net buyers, especially among low-income households.
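
One way to turn price levels into a market-stress signal is a month-over-month percentage change per market and commodity. The sketch below is illustrative only: the function name `monthly_price_change` and the `price_mom_pct` column are hypothetical, and it assumes prices are comparable only within the same `market`, `commodity`, `unit`, and `pricetype`.

```python
import pandas as pd

def monthly_price_change(prices):
    """Month-over-month % change in mean price per series (hypothetical helper)."""
    df = prices.copy()
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["month"] = (
        pd.to_datetime(df["date"], errors="coerce").dt.to_period("M").dt.to_timestamp()
    )
    df = df.dropna(subset=["price", "month"])

    keys = ["market", "commodity", "unit", "pricetype"]
    monthly = (
        df.groupby(keys + ["month"], as_index=False)["price"].mean()
          .sort_values(keys + ["month"])
    )
    # Change relative to the previous available observation of the same series
    monthly["price_mom_pct"] = monthly.groupby(keys)["price"].pct_change() * 100
    return monthly

# Toy prices for illustration only
toy = pd.DataFrame({
    "date": ["2006-01-15", "2006-02-15"],
    "market": ["Mombasa"] * 2,
    "commodity": ["Maize"] * 2,
    "unit": ["KG"] * 2,
    "pricetype": ["Wholesale"] * 2,
    "price": ["20", "22"],
})
toy_change = monthly_price_change(toy)
print(toy_change)
```

Note that `pct_change` compares consecutive rows, so a series with missing months yields a change over a longer span; reindexing each series to a complete monthly range first would avoid that.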

### Political Violence Events

Conflict event data records the number of political violence events and associated fatalities per month. In the file used here, the series is national rather than county-level. Political violence disrupts markets, drives displacement, and restricts access to humanitarian support, making it a critical **shock indicator** for food insecurity modeling.

### Multidimensional Poverty Index (MPI)

The MPI dataset provides county-level socioeconomic vulnerability measures, including headcount poverty rates and intensity of deprivation. These indicators act as **baseline contextual features**, helping the model account for structural vulnerability that can exacerbate the impact of shocks.

### 1. Notebook Setup & Environment

In [1]:
# Core data handling
import pandas as pd
import numpy as np

# Date and time utilities
from datetime import datetime

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Gradient Boosting
import xgboost as xgb

# Imbalance handling
from imblearn.over_sampling import SMOTE

# Model explainability
import shap

# File system utilities
from pathlib import Path
import os

# Notebook display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:.4f}".format)

# Visualization style
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Project directory structure
BASE_DIR = Path.cwd()

DATA_DIR = BASE_DIR / "data"
MODELS_DIR = BASE_DIR / "models"
OUTPUTS_DIR = BASE_DIR / "outputs"
FIGURES_DIR = OUTPUTS_DIR / "figures"

# Create directories if they don't exist
for directory in [DATA_DIR, MODELS_DIR, OUTPUTS_DIR, FIGURES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

print("Project directories initialized.")
print(f"Base directory: {BASE_DIR}")

Project directories initialized.
Base directory: c:\Users\Jasho\Documents\Moringa\Phase 5\Food Insecurity\PHASE-5-PROJECT---PREDICTING-FOOD-INSECURITY\notebooks


### 2. Data Loading & Inventory

In [4]:
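# 2.1 Define File Paths
# Re-resolve BASE_DIR to the project root when the notebook runs from notebooks/
# (Section 1 set BASE_DIR to the current working directory).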
BASE_DIR = Path.cwd().parent if Path.cwd().name.lower() == "notebooks" else Path.cwd()
DATA_DIR = BASE_DIR / "data"

ipc_path = DATA_DIR / "ipc_ken_area_long.csv"
rainfall_path = DATA_DIR / "ken-rainfall-subnat-full.csv"
prices_path = DATA_DIR / "wfp_food_prices_ken.csv"
conflict_path = DATA_DIR / "kenya_political_violence_events_and_fatalities_by_month-year_as-of-11feb2026.xlsx"
mpi_path = DATA_DIR / "ken_mpi.csv"

# Quick existence check for all files
paths = {
    "IPC": ipc_path,
    "Rainfall": rainfall_path,
    "Food Prices": prices_path,
    "Conflict": conflict_path,
    "MPI": mpi_path,
}
missing_files = [name for name, p in paths.items() if not p.exists()]
if missing_files:
    print("Base directory:", BASE_DIR)
    print("Data directory:", DATA_DIR)
    for name, p in paths.items():
        print(f"{name} -> {p} (exists={p.exists()})")
    raise FileNotFoundError(f"Missing file(s) in DATA_DIR: {missing_files}")

# 2.2 Load Raw Datasets
def _read_csv_safely(path: Path) -> pd.DataFrame:
    """Read CSV with a safe fallback for encoding."""
    try:
        return pd.read_csv(path)
    except UnicodeDecodeError:
        return pd.read_csv(path, encoding="latin-1")

ipc_df = _read_csv_safely(ipc_path)
rainfall_df = _read_csv_safely(rainfall_path)
prices_df = _read_csv_safely(prices_path)
conflict_df = pd.read_excel(conflict_path, sheet_name="Data")
mpi_df = _read_csv_safely(mpi_path)

# 2.3 Initial Inspection
datasets = {
    "IPC": ipc_df,
    "Rainfall": rainfall_df,
    "Food Prices": prices_df,
    "Conflict": conflict_df,
    "MPI": mpi_df
}

for name, df in datasets.items():
    print(f"\n{name} Dataset")
    print("-" * 40)
    print("Shape:", df.shape)
    print("\nColumns:")
    print(df.columns.tolist())
    print("\nPreview:")
    display(df.head())

# 2.4 Data Types Check
for name, df in datasets.items():
    print(f"\n{name} Data Types")
    print("-" * 40)
    display(df.dtypes)

# 2.5 Missing Value Summary
for name, df in datasets.items():
    print(f"\n{name} Missing Values")
    print("-" * 40)
    display(df.isna().sum().sort_values(ascending=False))


IPC Dataset
----------------------------------------
Shape: (4522, 11)

Columns:
['Date of analysis', 'Country', 'Total country population', 'Level 1', 'Area', 'Validity period', 'From', 'To', 'Phase', 'Number', 'Percentage']

Preview:


Unnamed: 0,Date of analysis,Country,Total country population,Level 1,Area,Validity period,From,To,Phase,Number,Percentage
0,Jul 2025,KEN,16617000,Others,Marsabit,current,2025-07-01,2025-09-30,all,515000,1.0
1,Jul 2025,KEN,16617000,Others,Marsabit,current,2025-07-01,2025-09-30,3+,103000,0.2
2,Jul 2025,KEN,16617000,Others,Marsabit,current,2025-07-01,2025-09-30,1,128750,0.25
3,Jul 2025,KEN,16617000,Others,Marsabit,current,2025-07-01,2025-09-30,2,283250,0.55
4,Jul 2025,KEN,16617000,Others,Marsabit,current,2025-07-01,2025-09-30,3,103000,0.2



Rainfall Dataset
----------------------------------------
Shape: (131544, 15)

Columns:
['date', 'adm_level', 'adm_id', 'PCODE', 'n_pixels', 'rfh', 'rfh_avg', 'r1h', 'r1h_avg', 'r3h', 'r3h_avg', 'rfq', 'r1q', 'r3q', 'version']

Preview:


Unnamed: 0,date,adm_level,adm_id,PCODE,n_pixels,rfh,rfh_avg,r1h,r1h_avg,r3h,r3h_avg,rfq,r1q,r3q,version
0,1981-01-01,1,51325,KE019,427.0,7.3724,15.7594,,,,,59.5988,,,final
1,1981-01-11,1,51325,KE019,427.0,4.3255,19.2948,,,,,38.3849,,,final
2,1981-01-21,1,51325,KE019,427.0,5.5691,16.2654,17.267,51.3196,,,49.7008,39.5368,,final
3,1981-02-01,1,51325,KE019,427.0,5.8829,12.7193,15.7775,48.2795,,,61.4184,38.9972,,final
4,1981-02-11,1,51325,KE019,427.0,17.1803,18.7686,28.6323,47.7533,,,93.3177,63.7539,,final



Food Prices Dataset
----------------------------------------
Shape: (17365, 16)

Columns:
['date', 'admin1', 'admin2', 'market', 'market_id', 'latitude', 'longitude', 'category', 'commodity', 'commodity_id', 'unit', 'priceflag', 'pricetype', 'currency', 'price', 'usdprice']

Preview:


Unnamed: 0,date,admin1,admin2,market,market_id,latitude,longitude,category,commodity,commodity_id,unit,priceflag,pricetype,currency,price,usdprice
0,#date,#adm1+name,#adm2+name,#loc+market+name,#loc+market+code,#geo+lat,#geo+lon,#item+type,#item+name,#item+code,#item+unit,#item+price+flag,#item+price+type,#currency+code,#value,#value+usd
1,2006-01-15,Coast,Mombasa,Mombasa,191,-4.05,39.67,cereals and tubers,Maize,51,KG,actual,Wholesale,KES,16.13,0.22
2,2006-01-15,Coast,Mombasa,Mombasa,191,-4.05,39.67,cereals and tubers,Maize (white),67,90 KG,actual,Wholesale,KES,1480,20.58
3,2006-01-15,Coast,Mombasa,Mombasa,191,-4.05,39.67,pulses and nuts,Beans,50,KG,actual,Wholesale,KES,33.63,0.47
4,2006-01-15,Eastern,Kitui,Kitui,187,-1.37,38.02,cereals and tubers,Maize (white),67,KG,actual,Retail,KES,17,0.24



Conflict Dataset
----------------------------------------
Shape: (350, 5)

Columns:
['Country', 'Month', 'Year', 'Events', 'Fatalities']

Preview:


Unnamed: 0,Country,Month,Year,Events,Fatalities
0,Kenya,January,1997,3,6
1,Kenya,February,1997,3,9
2,Kenya,March,1997,6,179
3,Kenya,April,1997,4,7
4,Kenya,May,1997,4,22



MPI Dataset
----------------------------------------
Shape: (49, 11)

Columns:
['Country ISO3', 'Admin 1 PCode', 'Admin 1 Name', 'MPI', 'Headcount Ratio', 'Intensity of Deprivation', 'Vulnerable to Poverty', 'In Severe Poverty', 'Survey', 'Start Date', 'End Date']

Preview:


Unnamed: 0,Country ISO3,Admin 1 PCode,Admin 1 Name,MPI,Headcount Ratio,Intensity of Deprivation,Vulnerable to Poverty,In Severe Poverty,Survey,Start Date,End Date
0,#country+code,#adm1+code,#adm1+name,#indicator+mpi,#indicator+headcount_ratio,#indicator+intensity_of_deprivation,#indicator+vulnerable_to_poverty,#indicator+in_severe_poverty,#meta+survey,#date+start,#date+end
1,KEN,,,0.1134,25.3523,44.7108,26.4044,7.4594,DHS,2022-01-01 00:00:00+00:00,2022-12-31 23:59:59+00:00
2,KEN,KE001,Mombasa,0.0518,12.8866,40.2193,16.8930,2.2801,DHS,2022-01-01 00:00:00+00:00,2022-12-31 23:59:59+00:00
3,KEN,KE002,Kwale,0.2105,44.5123,47.2996,27.2794,17.2902,DHS,2022-01-01 00:00:00+00:00,2022-12-31 23:59:59+00:00
4,KEN,KE003,Kilifi,0.2026,46.3581,43.7119,23.8772,13.1869,DHS,2022-01-01 00:00:00+00:00,2022-12-31 23:59:59+00:00



IPC Data Types
----------------------------------------


Date of analysis             object
Country                      object
Total country population      int64
Level 1                      object
Area                         object
Validity period              object
From                         object
To                           object
Phase                        object
Number                        int64
Percentage                  float64
dtype: object


Rainfall Data Types
----------------------------------------


date          object
adm_level      int64
adm_id         int64
PCODE         object
n_pixels     float64
rfh          float64
rfh_avg      float64
r1h          float64
r1h_avg      float64
r3h          float64
r3h_avg      float64
rfq          float64
r1q          float64
r3q          float64
version       object
dtype: object


Food Prices Data Types
----------------------------------------


date            object
admin1          object
admin2          object
market          object
market_id       object
latitude        object
longitude       object
category        object
commodity       object
commodity_id    object
unit            object
priceflag       object
pricetype       object
currency        object
price           object
usdprice        object
dtype: object


Conflict Data Types
----------------------------------------


Country       object
Month         object
Year           int64
Events         int64
Fatalities     int64
dtype: object


MPI Data Types
----------------------------------------


Country ISO3                object
Admin 1 PCode               object
Admin 1 Name                object
MPI                         object
Headcount Ratio             object
Intensity of Deprivation    object
Vulnerable to Poverty       object
In Severe Poverty           object
Survey                      object
Start Date                  object
End Date                    object
dtype: object


IPC Missing Values
----------------------------------------


Level 1                     1610
Country                        0
Date of analysis               0
Total country population       0
Area                           0
Validity period                0
From                           0
To                             0
Phase                          0
Number                         0
Percentage                     0
dtype: int64


Rainfall Missing Values
----------------------------------------


r3q          648
r3h          648
r3h_avg      648
r1h_avg      162
r1h          162
r1q          162
date           0
rfh_avg        0
rfh            0
n_pixels       0
PCODE          0
adm_id         0
adm_level      0
rfq            0
version        0
dtype: int64


Food Prices Missing Values
----------------------------------------


admin1          50
admin2          50
longitude       50
latitude        50
market           0
date             0
market_id        0
category         0
commodity        0
commodity_id     0
unit             0
priceflag        0
pricetype        0
currency         0
price            0
usdprice         0
dtype: int64


Conflict Missing Values
----------------------------------------


Country       0
Month         0
Year          0
Events        0
Fatalities    0
dtype: int64


MPI Missing Values
----------------------------------------


Admin 1 PCode               1
Admin 1 Name                1
Country ISO3                0
MPI                         0
Headcount Ratio             0
Intensity of Deprivation    0
Vulnerable to Poverty       0
In Severe Poverty           0
Survey                      0
Start Date                  0
End Date                    0
dtype: int64
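
The Food Prices and MPI previews show a first data row made of HXL hashtags (`#date`, `#adm1+name`, `#indicator+mpi`, ...), which is why every column in those two frames was read as `object`. The sketch below shows one way to remove such a row and recover numeric dtypes. The helper name `drop_hxl_row` is hypothetical, and the sketch assumes that a tag row is one whose non-missing values all start with `#`.

```python
import pandas as pd

def drop_hxl_row(df):
    """Drop a leading HXL tag row, then convert fully numeric columns (hypothetical helper)."""
    if len(df) > 0:
        first = df.iloc[0].dropna().astype(str)
        if len(first) > 0 and first.str.startswith("#").all():
            df = df.iloc[1:]
    out = df.reset_index(drop=True).copy()

    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        # Convert only when every non-missing value parses as a number
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out

# Small frame mimicking the MPI preview (tag row followed by data rows)
mpi_like = pd.DataFrame({
    "Admin 1 PCode": ["#adm1+code", "KE001", "KE002"],
    "Admin 1 Name": ["#adm1+name", "Mombasa", "Kwale"],
    "MPI": ["#indicator+mpi", "0.0518", "0.2105"],
})
mpi_clean = drop_hxl_row(mpi_like)
print(mpi_clean.dtypes)
```

Date columns are left as strings here and still need `pd.to_datetime`. Applying a step like this before standardization would also keep the tag row out of any reference table built from these frames.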

### 3. Key Standardization & Alignment

In [5]:
# 3.1 Helper functions
def clean_text(s: pd.Series) -> pd.Series:
    """Standardize text fields for joins (county names, etc.)."""
    return (
        s.astype(str)
         .str.strip()
         .str.replace(r"\s+", " ", regex=True)
         .str.title()
    )

def to_month_start(dt_series: pd.Series) -> pd.Series:
    """Convert dates to month-start timestamps (YYYY-MM-01)."""
    dt = pd.to_datetime(dt_series, errors="coerce")
    return dt.dt.to_period("M").dt.to_timestamp()

# 3.2 MPI: build county reference table
mpi_ref = (
    mpi_df
    .dropna(subset=["Admin 1 PCode", "Admin 1 Name"])  # drop the national-level row, which has no PCode or county name
    .assign(
        PCODE=lambda d: d["Admin 1 PCode"].astype(str).str.strip(),
        county=lambda d: clean_text(d["Admin 1 Name"])
    )
    .loc[:, ["PCODE", "county", "MPI", "Headcount Ratio", "Intensity of Deprivation",
             "Vulnerable to Poverty", "In Severe Poverty"]]
    .drop_duplicates(subset=["PCODE", "county"])
)

# quick sanity check
print("MPI reference table shape:", mpi_ref.shape)
display(mpi_ref.head())

# 3.3 IPC: standardize county + month keys
ipc_std = (
    ipc_df
    .assign(
        county=lambda d: clean_text(d["Area"]),
        # Use 'From' as the period anchor: it marks the start of the IPC validity period, the reference point for prediction
        month=lambda d: to_month_start(d["From"])
    )
    .drop(columns=[c for c in ["Level 1"] if c in ipc_df.columns])  # not useful + highly missing
)

print("IPC standardized shape:", ipc_std.shape)
display(ipc_std[["county", "From", "To", "month", "Phase", "Number", "Percentage"]].head())

# 3.4 Rainfall: map PCODE -> county and standardize month
rainfall_std = (
    rainfall_df
    .assign(
        PCODE=lambda d: d["PCODE"].astype(str).str.strip(),
        month=lambda d: to_month_start(d["date"])
    )
    .merge(mpi_ref[["PCODE", "county"]], on="PCODE", how="left")  # county name from MPI reference
)

# Check if any PCODEs failed to map
unmapped_rainfall = rainfall_std["county"].isna().sum()
print("Rainfall rows with unmapped county (missing county after PCODE join):", unmapped_rainfall)

display(rainfall_std[["PCODE", "county", "date", "month", "rfq", "r1q", "r3q", "rfh", "r1h", "r3h"]].head())

# 3.5 Food prices: standardize county + month; drop rows missing county
prices_std = (
    prices_df
    .assign(
        county=lambda d: clean_text(d["admin1"]),
        month=lambda d: to_month_start(d["date"])
    )
)

# Drop rows where county is missing (cannot be used in county-level model)
prices_std = prices_std.dropna(subset=["admin1"])

print("Food prices standardized shape:", prices_std.shape)
display(prices_std[["county", "date", "month", "market", "commodity", "usdprice", "price"]].head())

# 3.6 Conflict: create month key (national-level series unless county is later added)
conflict_std = (
    conflict_df
    .assign(
        month=lambda d: pd.to_datetime(
            d["Year"].astype(str) + "-" + d["Month"].astype(str) + "-01",
            errors="coerce"
        )
    )
    .loc[:, ["month", "Events", "Fatalities"]]
    .sort_values("month")
)

print("Conflict standardized shape:", conflict_std.shape)
display(conflict_std.head())

# 3.7 Final quick key checks (what we’ll merge on later)
print("\nKey fields check:")
print("IPC keys:", ipc_std[["county", "month"]].isna().sum().to_dict())
print("Rainfall keys:", rainfall_std[["county", "month"]].isna().sum().to_dict())
print("Prices keys:", prices_std[["county", "month"]].isna().sum().to_dict())
print("Conflict keys:", conflict_std[["month"]].isna().sum().to_dict())
print("MPI keys:", mpi_ref[["county", "PCODE"]].isna().sum().to_dict())

MPI reference table shape: (48, 7)


Unnamed: 0,PCODE,county,MPI,Headcount Ratio,Intensity of Deprivation,Vulnerable to Poverty,In Severe Poverty
0,#adm1+code,#Adm1+Name,#indicator+mpi,#indicator+headcount_ratio,#indicator+intensity_of_deprivation,#indicator+vulnerable_to_poverty,#indicator+in_severe_poverty
2,KE001,Mombasa,0.0518,12.8866,40.2193,16.8930,2.2801
3,KE002,Kwale,0.2105,44.5123,47.2996,27.2794,17.2902
4,KE003,Kilifi,0.2026,46.3581,43.7119,23.8772,13.1869
5,KE004,Tana River,0.3780,67.2861,56.1740,18.3876,44.4231


IPC standardized shape: (4522, 12)


Unnamed: 0,county,From,To,month,Phase,Number,Percentage
0,Marsabit,2025-07-01,2025-09-30,2025-07-01,all,515000,1.0
1,Marsabit,2025-07-01,2025-09-30,2025-07-01,3+,103000,0.2
2,Marsabit,2025-07-01,2025-09-30,2025-07-01,1,128750,0.25
3,Marsabit,2025-07-01,2025-09-30,2025-07-01,2,283250,0.55
4,Marsabit,2025-07-01,2025-09-30,2025-07-01,3,103000,0.2


Rainfall rows with unmapped county (missing county after PCODE join): 118552


Unnamed: 0,PCODE,county,date,month,rfq,r1q,r3q,rfh,r1h,r3h
0,KE019,Nyeri,1981-01-01,1981-01-01,59.5988,,,7.3724,,
1,KE019,Nyeri,1981-01-11,1981-01-01,38.3849,,,4.3255,,
2,KE019,Nyeri,1981-01-21,1981-01-01,49.7008,39.5368,,5.5691,17.267,
3,KE019,Nyeri,1981-02-01,1981-02-01,61.4184,38.9972,,5.8829,15.7775,
4,KE019,Nyeri,1981-02-11,1981-02-01,93.3177,63.7539,,17.1803,28.6323,
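
Most rainfall rows (118,552 of 131,544) did not receive a county name from the PCODE join, so it is worth checking where the mismatches sit before building features. The cause is not established here; possible explanations include rows at a different admin level or PCODE formats that differ from the MPI file. A minimal diagnostic sketch follows; the function name `summarize_unmatched` is hypothetical, and the admin-2 style codes in the toy data are made up.

```python
import pandas as pd

def summarize_unmatched(left, ref, key="PCODE", by=("adm_level",)):
    """Count matched vs. unmatched rows of `left` against `ref`, grouped by `by`."""
    ref_keys = ref[[key]].drop_duplicates()
    # indicator=True adds a `_merge` column: "both" or "left_only" for a left join
    merged = left.merge(ref_keys, on=key, how="left", indicator=True)
    merged["matched"] = merged["_merge"].eq("both")
    return (
        merged.groupby(list(by) + ["matched"])
              .size()
              .reset_index(name="n_rows")
    )

# Toy frames: the admin-2 codes below are invented for illustration
rain_toy = pd.DataFrame({
    "adm_level": [1, 1, 2, 2],
    "PCODE": ["KE019", "KE019", "KE019001", "KE019002"],
})
ref_toy = pd.DataFrame({"PCODE": ["KE019"], "county": ["Nyeri"]})
match_summary = summarize_unmatched(rain_toy, ref_toy)
print(match_summary)
```

Passing `by=("adm_level", "PCODE")` would list the individual codes that fail to match, which helps separate a formatting problem from a genuine admin-level difference.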


Food prices standardized shape: (17315, 18)



Unnamed: 0,county,date,month,market,commodity,usdprice,price
0,#Adm1+Name,#date,NaT,#loc+market+name,#item+name,#value+usd,#value
1,Coast,2006-01-15,2006-01-01,Mombasa,Maize,0.22,16.13
2,Coast,2006-01-15,2006-01-01,Mombasa,Maize (white),20.58,1480
3,Coast,2006-01-15,2006-01-01,Mombasa,Beans,0.47,33.63
4,Eastern,2006-01-15,2006-01-01,Kitui,Maize (white),0.24,17
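
The `county` values in this preview (`Coast`, `Eastern`) are former provinces rather than counties, while the `admin2` values shown earlier for the same rows (`Mombasa`, `Kitui`) look like county names. Before merging on `county`, it may help to measure which column actually overlaps with a county reference list. The sketch below does that; the function name `county_overlap` is hypothetical, and the reference names would come from a trusted source such as the MPI table.

```python
import pandas as pd

def county_overlap(prices, county_names, candidates=("admin1", "admin2")):
    """Compare candidate columns against a reference set of county names."""
    ref = {str(n).strip().title() for n in county_names}
    rows = []
    for col in candidates:
        values = prices[col].dropna().astype(str).str.strip().str.title()
        unique_vals = set(values.unique())
        rows.append({
            "column": col,
            "n_unique": len(unique_vals),
            "n_matching": len(unique_vals & ref),          # distinct names found in ref
            "row_share_matching": values.isin(ref).mean(), # share of rows that would join
        })
    return pd.DataFrame(rows)

# Rows taken from the price preview; the reference list is a short illustrative subset
price_toy = pd.DataFrame({
    "admin1": ["Coast", "Coast", "Eastern"],
    "admin2": ["Mombasa", "Mombasa", "Kitui"],
})
overlap = county_overlap(price_toy, ["Mombasa", "Kitui", "Kwale"])
print(overlap)
```

If `admin2` shows the higher match rate on the full data, deriving `county` from `admin2` would be the safer key for the later merge with IPC targets.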


Conflict standardized shape: (350, 3)


Unnamed: 0,month,Events,Fatalities
0,1997-01-01,3,6
1,1997-02-01,3,9
2,1997-03-01,6,179
3,1997-04-01,4,7
4,1997-05-01,4,22



Key fields check:
IPC keys: {'county': 0, 'month': 0}
Rainfall keys: {'county': 118552, 'month': 0}
Prices keys: {'county': 0, 'month': 1}
Conflict keys: {'month': 0}
MPI keys: {'county': 0, 'PCODE': 0}


### 4. Target Construction (IPC)

In [6]:
# 4.1 Parse dates + clean Phase
ipc_target = ipc_std.copy()

ipc_target["From"] = pd.to_datetime(ipc_target["From"], errors="coerce")
ipc_target["To"] = pd.to_datetime(ipc_target["To"], errors="coerce")

# Normalize Phase strings (handles "3+", "all", numeric phases)
ipc_target["Phase"] = ipc_target["Phase"].astype(str).str.strip()

# Quick checks
print("IPC date nulls:", ipc_target[["From", "To"]].isna().sum().to_dict())
print("Unique phases:", sorted(ipc_target["Phase"].unique())[:20])

# 4.2 Build IPC3+ measures per county-period
#     Priority: use Phase == "3+" if present; otherwise sum phases 3/4/5
def build_ipc3plus(df: pd.DataFrame) -> pd.DataFrame:
    gcols = ["county", "From", "To"]

    # Identify whether "3+" exists for each county-period
    has_3plus = (
        df.assign(is_3plus=lambda d: d["Phase"].eq("3+"))
          .groupby(gcols)["is_3plus"]
          .any()
          .reset_index()
          .rename(columns={"is_3plus": "has_3plus_row"})
    )

    df = df.merge(has_3plus, on=gcols, how="left")

    # Case A: use explicit "3+"
    df_a = (
        df[df["has_3plus_row"] & df["Phase"].eq("3+")]
        .groupby(gcols, as_index=False)
        .agg(
            ipc3plus_population=("Number", "sum"),
            ipc3plus_share=("Percentage", "sum")
        )
    )

    # Case B: sum phases 3,4,5 (only for groups without "3+")
    df_b_src = df[~df["has_3plus_row"]].copy()
    df_b_src["phase_num"] = pd.to_numeric(df_b_src["Phase"], errors="coerce")
    df_b = (
        df_b_src[df_b_src["phase_num"].isin([3, 4, 5])]
        .groupby(gcols, as_index=False)
        .agg(
            ipc3plus_population=("Number", "sum"),
            ipc3plus_share=("Percentage", "sum")
        )
    )

    out = pd.concat([df_a, df_b], ignore_index=True)

    # Attach total population if "all" row exists (helps sanity checks)
    total_pop = (
        df[df["Phase"].str.lower().eq("all")]
        .groupby(gcols, as_index=False)
        .agg(total_population=("Number", "sum"))
    )

    out = out.merge(total_pop, on=gcols, how="left")
    return out

ipc3plus_period = build_ipc3plus(ipc_target)

print("IPC3+ period table shape:", ipc3plus_period.shape)
display(ipc3plus_period.head())

# 4.3 Expand From/To into monthly rows
#     (Each IPC assessment period becomes repeated across months in that period)
def expand_periods_to_months(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # month-start range for each row
    df["start_month"] = df["From"].dt.to_period("M").dt.to_timestamp()
    df["end_month"] = df["To"].dt.to_period("M").dt.to_timestamp()

    # build monthly sequences
    df["month"] = df.apply(
        lambda r: pd.date_range(r["start_month"], r["end_month"], freq="MS"),
        axis=1
    )

    df = df.explode("month").drop(columns=["start_month", "end_month"])
    return df

ipc3plus_monthly = expand_periods_to_months(ipc3plus_period)

print("IPC3+ monthly expanded shape:", ipc3plus_monthly.shape)
display(ipc3plus_monthly.head())

# 4.4 Final target variables
#     ipc_crisis is a binary label derived from ipc3plus_share.
#     Default threshold: 0.20 (20% of county population in IPC3+).
#     Keep share/population for flexibility.
CRISIS_SHARE_THRESHOLD = 0.20  # adjust later if needed, but keep consistent for evaluation

ipc_targets = (
    ipc3plus_monthly
    .assign(
        ipc_crisis=lambda d: (d["ipc3plus_share"] >= CRISIS_SHARE_THRESHOLD).astype(int)
    )
    .loc[:, ["county", "month", "ipc_crisis", "ipc3plus_population", "ipc3plus_share", "total_population"]]
    .sort_values(["county", "month"])
)

print("Final IPC targets shape:", ipc_targets.shape)
display(ipc_targets.head())

# Optional: sanity check distribution
print("\nTarget distribution (ipc_crisis):")
print(ipc_targets["ipc_crisis"].value_counts(dropna=False))

IPC date nulls: {'From': 0, 'To': 0}
Unique phases: ['1', '2', '3', '3+', '4', '5', 'all']
IPC3+ period table shape: (646, 6)


Unnamed: 0,county,From,To,ipc3plus_population,ipc3plus_share,total_population
0,Bangladesh,2020-08-01,2020-09-30,89879,0.55,163415
1,Bangladesh,2020-10-01,2020-12-31,81708,0.5,163415
2,Baringo,2019-07-01,2019-07-31,105555,0.15,703697
3,Baringo,2019-08-01,2019-10-31,140740,0.2,703697
4,Baringo,2020-02-01,2020-03-31,33339,0.05,666783


IPC3+ monthly expanded shape: (1813, 7)


Unnamed: 0,county,From,To,ipc3plus_population,ipc3plus_share,total_population,month
0,Bangladesh,2020-08-01,2020-09-30,89879,0.55,163415,2020-08-01
0,Bangladesh,2020-08-01,2020-09-30,89879,0.55,163415,2020-09-01
1,Bangladesh,2020-10-01,2020-12-31,81708,0.5,163415,2020-10-01
1,Bangladesh,2020-10-01,2020-12-31,81708,0.5,163415,2020-11-01
1,Bangladesh,2020-10-01,2020-12-31,81708,0.5,163415,2020-12-01


Final IPC targets shape: (1813, 6)


Unnamed: 0,county,month,ipc_crisis,ipc3plus_population,ipc3plus_share,total_population
0,Bangladesh,2020-08-01,1,89879,0.55,163415
0,Bangladesh,2020-09-01,1,89879,0.55,163415
1,Bangladesh,2020-10-01,1,81708,0.5,163415
1,Bangladesh,2020-11-01,1,81708,0.5,163415
1,Bangladesh,2020-12-01,1,81708,0.5,163415



Target distribution (ipc_crisis):
ipc_crisis
0    1243
1     570
Name: count, dtype: int64
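
About 31% of the county-months are labelled as crisis (570 of 1,813). Two things in the output above deserve a check before modeling: the table contains areas such as `Bangladesh` that are not among Kenya's 47 counties, and nothing in the construction guarantees one row per county-month if different IPC analyses cover the same months. The following audit is a hedged sketch; the function name `audit_targets` is hypothetical and the toy labels are illustrative, not real outcomes.

```python
import pandas as pd

def audit_targets(targets, known_counties):
    """Report duplicate county-month rows, unknown areas, and the positive rate."""
    known = {str(c).strip().title() for c in known_counties}
    # keep=False marks every member of a duplicated (county, month) group
    dup_mask = targets.duplicated(subset=["county", "month"], keep=False)
    unknown = sorted(set(targets["county"]) - known)
    return {
        "n_duplicate_rows": int(dup_mask.sum()),
        "unknown_areas": unknown,
        "positive_rate": float(targets["ipc_crisis"].mean()),
    }

# Toy target table for illustration only
targets_toy = pd.DataFrame({
    "county": ["Bangladesh", "Baringo", "Baringo"],
    "month": pd.to_datetime(["2020-08-01", "2019-07-01", "2019-07-01"]),
    "ipc_crisis": [1, 0, 1],
})
report = audit_targets(targets_toy, known_counties=["Baringo"])
print(report)
```

On the real table, `known_counties` could be the `county` column of `mpi_ref`. Any duplicates would need an explicit rule, for example preferring one validity period over another, and that rule is a modeling decision rather than something this sketch settles.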
