# RTM — Phase 1b v0: Synthetic outcome proxy (Y_damage_v1b)

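**Goal:** Generate a minimal synthetic binary outcome from the locked data-generating process (DGP), a logistic model in the standardized inputs:

- baseline damage rate: 5% (at mean exposure and mean hazard)
- α = log(0.05 / 0.95) ≈ -2.944
- β_E = 1.0 (log-odds per standard deviation of `E_hat`)
- β_H = 0.6 (log-odds per standard deviation of `H_pluvial_v1_mm`)
- seed = 42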
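
Written out, the parameters above correspond to the logistic model below, which matches the generation code in section 3. The building index $i$ and the symbol $p_i$ are notation introduced here for illustration; $E^{\mathrm{std}}$ and $H^{\mathrm{std}}$ denote the z-scored inputs from section 2.

```latex
\begin{aligned}
p_i &= \frac{1}{1 + \exp\!\left(-\left(\alpha + \beta_E\, E^{\mathrm{std}}_i + \beta_H\, H^{\mathrm{std}}_i\right)\right)}, \\
Y_i &\sim \operatorname{Bernoulli}(p_i), \qquad
\alpha = \log\frac{0.05}{0.95} \approx -2.944, \quad \beta_E = 1.0, \quad \beta_H = 0.6 .
\end{aligned}
```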

**Inputs:**
- `E_hat` from `outputs/rtm/water_exposure_Ehat_v0.parquet`
- `H_pluvial_v1_mm` from Habnetic/data via loader `src/rtm/io_hazard.py`

**Output:**
- `outputs/rtm/outcomes/Y_damage_v1b.parquet`

## 0) Setup

This notebook expects to be run from `resilient-housing-bayes/notebooks/` (or from the repo root).
It resolves the repo root and adds it to `sys.path` so that `src` is importable.

In [1]:
from __future__ import annotations

from pathlib import Path
import sys
import numpy as np
import pandas as pd

# --- resolve actual repo root (notebooks/ -> repo root) ---
REPO_ROOT = Path.cwd().resolve()
if REPO_ROOT.name.lower() == "notebooks":
    REPO_ROOT = REPO_ROOT.parent

assert (REPO_ROOT / "src").exists(), f"Expected src/ at repo root: {REPO_ROOT}"

if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

print("Repo root:", REPO_ROOT)

from src.rtm.io_hazard import load_rtm_pluvial_v1_buildings

OUT_DIR = REPO_ROOT / "outputs" / "rtm" / "outcomes"
OUT_DIR.mkdir(parents=True, exist_ok=True)
print("Outcome outputs:", OUT_DIR)

Repo root: C:\Users\C.Price\Habnetic\resilient-housing-bayes
Outcome outputs: C:\Users\C.Price\Habnetic\resilient-housing-bayes\outputs\rtm\outcomes


## 1) Load inputs: E_hat and H_pluvial_v1_mm

In [2]:
E_PATH = REPO_ROOT / "outputs" / "rtm" / "water_exposure_Ehat_v0.parquet"
if not E_PATH.exists():
    raise FileNotFoundError(f"Missing E_hat parquet at {E_PATH}")

E = pd.read_parquet(E_PATH)

if "bldg_id" not in E.columns:
    raise ValueError(f"E_hat table missing bldg_id. Columns: {list(E.columns)}")

if "E_hat" not in E.columns:
    # try common alternates
    candidates = [c for c in E.columns if c.lower() in {"ehat", "e_hat", "water_exposure_ehat", "exposure"}]
    if candidates:
        E = E.rename(columns={candidates[0]: "E_hat"})
    else:
        raise ValueError(f"Could not find E_hat column. Columns: {list(E.columns)}")

E = E[["bldg_id", "E_hat"]].copy()
print("E rows:", len(E), "unique bldg_id:", E["bldg_id"].is_unique)
E.head()

E rows: 221324 unique bldg_id: True


| | bldg_id | E_hat |
|---|---|---|
| 0 | 305012 | 0.536838 |
| 1 | 313960 | 0.677579 |
| 2 | 313263 | 0.251841 |
| 3 | 310491 | 0.189019 |
| 4 | 313127 | -0.292821 |


In [3]:
haz = load_rtm_pluvial_v1_buildings()
print("Haz rows:", len(haz), "unique bldg_id:", haz["bldg_id"].is_unique)
haz.head()

Haz rows: 221324 unique bldg_id: True


| | bldg_id | H_pluvial_v1_mm |
|---|---|---|
| 0 | 305012 | 25.422161 |
| 1 | 313960 | 25.418823 |
| 2 | 313263 | 25.423113 |
| 3 | 310491 | 25.4245 |
| 4 | 313127 | 25.423491 |


In [4]:
df = E.merge(haz, on="bldg_id", how="inner", validate="one_to_one")
print("Joined rows:", len(df))
print("Drops vs E:", len(E) - len(df))
print("Drops vs hazard:", len(haz) - len(df))

# Basic sanity
assert len(df) == 221_324, "Unexpected join row count; investigate before generating outcome."
assert df["E_hat"].notna().all()
assert df["H_pluvial_v1_mm"].notna().all()

df.head()

Joined rows: 221324
Drops vs E: 0
Drops vs hazard: 0


| | bldg_id | E_hat | H_pluvial_v1_mm |
|---|---|---|---|
| 0 | 305012 | 0.536838 | 25.422161 |
| 1 | 313960 | 0.677579 | 25.418823 |
| 2 | 313263 | 0.251841 | 25.423113 |
| 3 | 310491 | 0.189019 | 25.4245 |
| 4 | 313127 | -0.292821 | 25.423491 |


## 2) Standardize inputs

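We standardize both inputs to mean 0 and standard deviation 1 across all RTM buildings.
With z-scored inputs, α is the log-odds of damage for a building at mean exposure and mean hazard (`E_std = 0`, `H_std = 0`), so that building's damage probability equals the 5% baseline.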
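
To make the interpretation of α concrete, here is a minimal sketch using only the locked parameters from this notebook. The helper names `inv_logit` and `odds` and the example building one standard deviation above mean exposure are hypothetical, introduced purely for illustration.

```python
import math


def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Locked DGP parameters from this notebook
alpha = math.log(0.05 / 0.95)
beta_E, beta_H = 1.0, 0.6

# Building at the mean of both inputs (E_std = 0, H_std = 0)
p_at_mean = inv_logit(alpha + beta_E * 0.0 + beta_H * 0.0)

# Hypothetical building one SD above mean exposure, at mean hazard
p_high_E = inv_logit(alpha + beta_E * 1.0 + beta_H * 0.0)

# On the odds scale, +1 SD of exposure multiplies the odds by exp(beta_E)
odds_ratio_E = odds(p_high_E) / odds(p_at_mean)

print(f"p at mean inputs:     {p_at_mean:.4f}")
print(f"p at E_std = 1:       {p_high_E:.4f}")
print(f"odds ratio per +1 SD: {odds_ratio_E:.4f}")
```

Because the inputs are z-scored, the coefficients read as changes in log-odds per standard deviation, which keeps β_E and β_H comparable even though `E_hat` and `H_pluvial_v1_mm` are on very different raw scales.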

In [5]:
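def zscore(x: pd.Series) -> pd.Series:
    """Standardize a series to mean 0 and sample standard deviation 1."""
    x = x.astype(float)
    mu = x.mean()
    sd = x.std(ddof=1)
    # sd is NaN for fewer than two rows and 0 for a constant column
    if not np.isfinite(sd) or sd <= 0:
        raise ValueError("Standard deviation is zero or undefined; cannot z-score.")
    return (x - mu) / sd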

df["E_std"] = zscore(df["E_hat"])
df["H_std"] = zscore(df["H_pluvial_v1_mm"])

print("E_std mean/std:", float(df["E_std"].mean()), float(df["E_std"].std(ddof=1)))
print("H_std mean/std:", float(df["H_std"].mean()), float(df["H_std"].std(ddof=1)))

E_std mean/std: 4.4945863533287865e-19 0.9999999999999999
H_std mean/std: 1.561788497311158e-15 0.9999999999999999


## 3) Generate synthetic outcome (Phase 1b v0)

Locked DGP:
- baseline: 5%  => α = log(0.05/0.95)
- β_E = 1.0
- β_H = 0.6
- seed = 42

In [6]:
seed = 42
rng = np.random.default_rng(seed)

alpha = float(np.log(0.05 / 0.95))   # ≈ -2.944
beta_E = 1.0
beta_H = 0.6

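# Linear predictor on the log-odds scale
linpred = alpha + beta_E * df["E_std"].values + beta_H * df["H_std"].values
# Inverse logit maps log-odds to a probability per building
p_true = 1.0 / (1.0 + np.exp(-linpred))

# One Bernoulli draw per building
Y = rng.binomial(n=1, p=p_true, size=len(df)).astype(np.int8)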

df["p_true"] = p_true.astype(np.float32)
df["Y_damage_v1b"] = Y

print("Realized damage rate:", float(df["Y_damage_v1b"].mean()))
print("p_true min/median/p95/max:",
      float(np.min(p_true)),
      float(np.median(p_true)),
      float(np.quantile(p_true, 0.95)),
      float(np.max(p_true)))

Realized damage rate: 0.07909219063454483
p_true min/median/p95/max: 0.0008382414571056146 0.0578358856024795 0.21316818491280945 0.7388597100921469
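
The realized rate above (about 7.9%) is higher than the 5% baseline. That is expected for a logistic model: 5% is the probability at mean inputs, whereas the realized rate averages `p_true` over all buildings, and the inverse logit is convex at low probabilities, so spread in the inputs pushes the average up. The sketch below illustrates the effect under a loud assumption: `E_std` and `H_std` are replaced by independent standard normal draws, which is not the actual RTM distribution. The seed and sample size are also illustrative.

```python
import numpy as np

# Illustrative seed and sample size, unrelated to the notebook's seed = 42
rng = np.random.default_rng(0)
n = 200_000

# ASSUMPTION: independent standard-normal stand-ins for E_std and H_std
E_std = rng.standard_normal(n)
H_std = rng.standard_normal(n)

# Locked DGP parameters from this notebook
alpha = np.log(0.05 / 0.95)
beta_E, beta_H = 1.0, 0.6

linpred = alpha + beta_E * E_std + beta_H * H_std
p = 1.0 / (1.0 + np.exp(-linpred))

# Probability at mean inputs versus the population-average probability
p_at_mean = float(1.0 / (1.0 + np.exp(-alpha)))
mean_p = float(p.mean())

print(f"p at mean inputs: {p_at_mean:.4f}")
print(f"average p:        {mean_p:.4f}")
```

If the population-average rate itself needed to be 5%, α would have to be calibrated against the empirical input distribution; the locked DGP here instead fixes α at the mean-input baseline.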


## 4) Export outcome table (canonical modeling artifact)

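We write a small table keyed by `bldg_id`, containing the outcome, the true probability `p_true`, and the DGP parameters as provenance columns.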

In [7]:
out = df[["bldg_id", "Y_damage_v1b", "p_true"]].copy()
out["outcome_src"] = "synthetic"
out["outcome_version"] = "v1b"
out["seed"] = seed
out["alpha"] = alpha
out["beta_E"] = beta_E
out["beta_H"] = beta_H

out_path = OUT_DIR / "Y_damage_v1b.parquet"
out.to_parquet(out_path, index=False)

print("Saved:", out_path)
print("Rows:", len(out))
out.head()

Saved: C:\Users\C.Price\Habnetic\resilient-housing-bayes\outputs\rtm\outcomes\Y_damage_v1b.parquet
Rows: 221324


| | bldg_id | Y_damage_v1b | p_true | outcome_src | outcome_version | seed | alpha | beta_E | beta_H |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 305012 | 0 | 0.044659 | synthetic | v1b | 42 | -2.944439 | 1.0 | 0.6 |
| 1 | 313960 | 0 | 0.051868 | synthetic | v1b | 42 | -2.944439 | 1.0 | 0.6 |
| 2 | 313263 | 0 | 0.032318 | synthetic | v1b | 42 | -2.944439 | 1.0 | 0.6 |
| 3 | 310491 | 0 | 0.030184 | synthetic | v1b | 42 | -2.944439 | 1.0 | 0.6 |
| 4 | 313127 | 0 | 0.017186 | synthetic | v1b | 42 | -2.944439 | 1.0 | 0.6 |


## 5) Minimal QA

- Ensure exactly one row per `bldg_id`
- Ensure the outcome takes only values in {0, 1}
- Ensure no NaNs in `Y_damage_v1b` or `p_true`

In [8]:
assert out["bldg_id"].is_unique
assert set(out["Y_damage_v1b"].unique()).issubset({0, 1})
assert out["Y_damage_v1b"].isna().sum() == 0
assert out["p_true"].isna().sum() == 0

print("QA passed.")

QA passed.
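
The checks above are structural. One optional extra check, not part of this notebook, is to confirm that the realized damage rate is consistent with the mean of `p_true` up to sampling noise. The sketch below uses a hypothetical helper `rate_z_score` and toy data standing in for `out["p_true"]` and `out["Y_damage_v1b"]`.

```python
import numpy as np


def rate_z_score(y: np.ndarray, p: np.ndarray) -> float:
    """z-score of the realized rate against its expectation under p.

    Hypothetical helper for illustration; assumes independent Bernoulli(p_i) draws.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    expected = p.mean()
    # Standard error of the mean of independent Bernoulli(p_i) draws
    se = np.sqrt(np.sum(p * (1.0 - p))) / len(p)
    return float((y.mean() - expected) / se)


# Toy data (ASSUMPTION): stand-ins for the notebook's p_true and Y_damage_v1b
rng = np.random.default_rng(123)
p_toy = rng.uniform(0.01, 0.30, size=50_000)
y_toy = rng.binomial(n=1, p=p_toy)

z = rate_z_score(y_toy, p_toy)
print(f"z-score of realized rate: {z:.2f}")
```

In the notebook this would be called as `rate_z_score(out["Y_damage_v1b"].to_numpy(), out["p_true"].to_numpy())`. A value of `|z|` well above 3 would suggest that the outcome column and `p_true` are out of sync, for example after a misaligned join.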
