# Individual risk modeling with Cox proportional hazards and Random Survival Forests

What you will learn  
- Fit a Cox model and interpret hazard ratios with confidence intervals  
- Check proportional hazards using Schoenfeld residuals  
- Fit a Random Survival Forest (RSF) for non-linear effects and interactions  
- Evaluate with concordance index, time-dependent AUC at 7, 30, 90 days, and integrated Brier score  
- Calibrate 30-day survival probabilities and stratify patients into risk quintiles

Clinical lens  
- Cox yields interpretable hazard ratios, which suits explanation and policy rules  
- RSF suits flexible prediction when relationships are non-linear or involve interactions
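
For reference, the hazard ratios mentioned above come from the standard Cox model form. This is a brief sketch; the symbols $h_0(t)$ (baseline hazard), $\beta$ (coefficients), and $x$ (covariates) are notation introduced here, not names used in the notebook code.

$$
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^\top x\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
$$

Under this form, the hazard ratio between two patients $a$ and $b$ does not depend on time, which is the proportional hazards assumption that Schoenfeld residuals are used to check:

$$
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^\top (x_a - x_b)\right)
$$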


In [None]:
# Why: import once and confirm environment
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance

import lifelines
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test
from lifelines.calibration import survival_probability_calibration

from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import (
    concordance_index_censored,
    cumulative_dynamic_auc,
    integrated_brier_score,
)

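import platform
import sksurv

# Print versions of every library used for modeling, so results can be reproduced
print("Python", platform.python_version())
print("pandas", pd.__version__, "numpy", np.__version__, "scikit-learn", sklearn.__version__)
print("lifelines", lifelines.__version__, "scikit-survival", sksurv.__version__)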


## Validate labels and define analysis targets

We align the time scale to in-hospital death and verify label consistency:

- Duration = `Length_of_stay` in days  
- Event = inferred from (`Survival`, `Length_of_stay`): a death counts as in-hospital when `2 <= Survival <= Length_of_stay` (strict rule), cross-checked with `In-hospital_death`  
- If provided and inferred labels disagree on more than 1% of rows, we use the inferred labels

This teaches students to check assumptions before building models and to connect modeling targets to their clinical definitions.
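
The inference rule above can be tried on a few hand-made rows before touching the real data. This is a minimal sketch: the helper name `infer_event` and the toy values are illustrative assumptions, and the coding of `Survival` (`-1` or missing meaning no recorded death) is assumed from the rule as described here.

```python
import numpy as np
import pandas as pd


def infer_event(survival: pd.Series, length_of_stay: pd.Series, strict: bool = True) -> pd.Series:
    """Return 1 where the recorded death falls within the hospital stay, else 0.

    Hypothetical helper. Assumed coding: Survival is days to death, and -1 or
    NaN means no recorded death. The strict rule requires Survival >= 2, the
    general rule accepts Survival >= 0.
    """
    lower = 2 if strict else 0
    # Comparisons against NaN evaluate to False, so missing values yield 0
    died_in_stay = survival.ge(lower) & survival.le(length_of_stay)
    return died_in_stay.astype(int)


# Toy rows (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "Survival":       [5, 30, -1, np.nan, 1, 10],
    "Length_of_stay": [10, 12, 8, 6, 4, 10],
})
toy["event_strict"] = infer_event(toy["Survival"], toy["Length_of_stay"], strict=True)
toy["event_general"] = infer_event(toy["Survival"], toy["Length_of_stay"], strict=False)
print(toy)
```

The two rules differ only on the fifth row (`Survival = 1`), which the strict rule excludes because it requires at least 2 days.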


In [None]:
df = pd.read_csv("PhysionetChallenge2012-set-a.csv.gz", compression="gzip").copy()

# Clean sentinels for basic demographics if present
for col in ["Height","Weight"]:
    if col in df.columns:
        df.loc[df[col] < 0, col] = np.nan

# Types
df["Length_of_stay"] = pd.to_numeric(df["Length_of_stay"], errors="coerce")
df["Survival"] = pd.to_numeric(df["Survival"], errors="coerce")
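# The provided label is optional (its presence is checked again below)
if "In-hospital_death" in df.columns:
    df["In-hospital_death"] = pd.to_numeric(df["In-hospital_death"], errors="coerce").astype("Int64")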

# Duration
df["duration"] = df["Length_of_stay"].clip(lower=0)

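# Infer event per dataset logic
# Strict rule: in-hospital death when 2 <= Survival <= Length_of_stay
# General rule: same upper bound, but accepts any Survival >= 0
# Comparisons against NaN evaluate to False, so missing values default to 0
USE_STRICT = True
strict_event = (df["Survival"].ge(2) & df["Survival"].le(df["Length_of_stay"])).astype("Int64")
general_event = (df["Survival"].ge(0) & df["Survival"].le(df["Length_of_stay"])).astype("Int64")
df["event_inferred"] = strict_event if USE_STRICT else general_event
# Missing Survival or the -1 sentinel is treated as no event (censored)
df.loc[df["Survival"].isna() | (df["Survival"] == -1), "event_inferred"] = 0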

# Compare to provided label if available
use_provided = "In-hospital_death" in df.columns
if use_provided:
    comp = pd.crosstab(df["In-hospital_death"].astype("Int64"), df["event_inferred"], rownames=["provided"], colnames=["inferred"])
    mismatch = (df["In-hospital_death"].astype("Int64") != df["event_inferred"]).sum()
    total = len(df)
    print("Provided vs inferred event label")
    display(comp)
    print(f"Mismatches: {mismatch} of {total} rows ({mismatch/total:.2%}) using {'strict' if USE_STRICT else 'general'} rule")

    # Switch to inferred if mismatches exceed tolerance
    if mismatch/total > 0.01:
        print("Mismatch exceeds tolerance, using inferred event for modeling")
        use_provided = False

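# Final event: use the provided label where present; rows with a missing
# provided label fall back to the inferred one (a plain astype(int) would fail on NA)
if use_provided:
    df["event"] = df["In-hospital_death"].fillna(df["event_inferred"]).astype(int)
else:
    df["event"] = df["event_inferred"].fillna(0).astype(int)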

# Keep rows with a known duration (event is already a complete 0/1 integer column,
# and duration was clipped at 0 above)
df = df.loc[df["duration"].notna()].copy()

print("Event rate used for modeling =", df["event"].mean())
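
One detail worth checking in the label comparison above: with pandas nullable `Int64` columns, `!=` returns `<NA>` wherever either side is missing, and `.sum()` skips those entries, so rows with a missing provided label are silently counted as agreeing. The sketch below reports them separately. The helper name `label_agreement` and the toy series are hypothetical, and dividing by comparable rows only is one possible choice of denominator, not the notebook's.

```python
import pandas as pd


def label_agreement(provided: pd.Series, inferred: pd.Series) -> dict:
    """Compare two 0/1 label columns, counting missing labels separately."""
    provided = provided.astype("Int64")
    inferred = inferred.astype("Int64")
    # Only rows where both labels are present can agree or disagree
    comparable = provided.notna() & inferred.notna()
    n_comparable = int(comparable.sum())
    n_mismatch = int((provided[comparable] != inferred[comparable]).sum())
    return {
        "n_rows": len(provided),
        "n_missing": int((~comparable).sum()),
        "n_mismatch": n_mismatch,
        "mismatch_rate": n_mismatch / n_comparable if n_comparable else float("nan"),
    }


# Toy labels (made up for illustration): the third provided label is missing
provided = pd.Series([1, 0, pd.NA, 0, 1], dtype="Int64")
inferred = pd.Series([1, 1, 0, 0, 1], dtype="Int64")

report = label_agreement(provided, inferred)
# Naive rate: the NA comparison is skipped by sum() but still counted in the denominator
naive_rate = (provided != inferred).sum() / len(provided)

print(report)
print("naive rate:", naive_rate)
```

If the real data has no missing provided labels, both rates coincide and the simpler expression is sufficient.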
