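# 3.2 SEER Survival Causal Analysis: IPTW + Cox Regression + Kaplan-Meier (Template)

This notebook provides a survival causal-analysis workflow for **SEER / cancer cohorts**:
1) propensity score → stabilized IPTW; 2) **weighted CoxPH** (to estimate the HR); 3) **weighted KM curves**;
4) key points for checking the PH assumption and sensitivity.

> ⚠️ By default the notebook first generates a **runnable simulated survival dataset**; replace it with your own SEER extract.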

## 0. Dependencies (skip if already installed)

```bash
pip install -U pandas numpy scikit-learn lifelines matplotlib
```

In [None]:
# 1) Load or simulate survival data: columns must include ['time','event','treatment', covariates...]
import os
import numpy as np
import pandas as pd

CSV_PATH = ''  # <- set to your SEER extract, e.g. 'seer_chemo_survival.csv'

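def make_synthetic_seer(n=4000, seed=11):
    """Simulate a confounded survival cohort with exponential event and censoring times."""
    rng = np.random.default_rng(seed)
    age = rng.integers(30, 90, n)
    stage = rng.integers(1, 5, n)  # stage 1-4
    sex = rng.integers(0, 2, n)
    # Treatment assignment depends on age and advanced stage (confounding)
    p_t = 1/(1+np.exp(-( -2 + 0.02*age + 0.6*(stage>=3) )))
    t = rng.binomial(1, p_t)
    # Log-hazard: treatment lowers the hazard (log HR = -0.3 given age and stage)
    linpred = -5 + 0.03*age + 0.8*(stage>=3) - 0.3*t
    rate = np.exp(linpred)
    time = rng.exponential(1/rate)
    # Independent exponential censoring
    censor = rng.exponential(1/np.exp(-3), size=n)
    event = (time <= censor).astype(int)
    obs_time = np.minimum(time, censor)
    df = pd.DataFrame({
        'time': obs_time,
        'event': event,
        'treatment': t,
        'age': age,
        'stage': stage.astype(float),  # float so missing values can be stored as NaN
        'sex': sex
    })
    # Set 3% of stage values to missing to exercise the cleaning step
    miss_idx = rng.choice(n, size=int(0.03*n), replace=False)
    df.loc[miss_idx, 'stage'] = np.nan
    return df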

if CSV_PATH and os.path.exists(CSV_PATH):
    df = pd.read_csv(CSV_PATH)
else:
    df = make_synthetic_seer()

df.head()


## 2) Cleaning: missing/extreme values + categorical handling

In [None]:
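def median_impute(df, cols):
    """Fill missing values in each listed column with that column's median."""
    df = df.copy()  # leave the caller's frame untouched
    for c in cols:
        df[c] = df[c].fillna(df[c].median())
    return df

# 'stage' is ordinal; it is treated as numeric here for simplicity
cont = ['age','stage']
df = median_impute(df, cont)
df[cont].describe()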


## 3) Propensity score + stabilized IPTW

In [None]:
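from sklearn.linear_model import LogisticRegression

X = df[['age','stage','sex']]
T = df['treatment'].values

# Propensity score: P(treatment = 1 | covariates) from a logistic model
logit = LogisticRegression(max_iter=2000)
logit.fit(X, T)
ps = logit.predict_proba(X)[:,1]
df['ps'] = ps

# Stabilized weights: marginal treatment probability over the propensity score
p_treat = T.mean()
df['weight_sw'] = np.where(T==1, p_treat/ps, (1-p_treat)/(1-ps))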
df[['ps','weight_sw']].head()
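
Before the weights are passed to a weighted model, it can help to look at their tails. The following is a minimal sketch, not part of the original notebook: `truncate_weights` and `effective_sample_size` are hypothetical helper names, the 1st to 99th percentile range is just one common choice, and the toy weights are made up for illustration.

```python
import numpy as np

def truncate_weights(w, lower_pct=1.0, upper_pct=99.0):
    """Clip weights to a percentile range (hypothetical helper)."""
    w = np.asarray(w, dtype=float)
    lo, hi = np.percentile(w, [lower_pct, upper_pct])
    return np.clip(w, lo, hi)

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Toy weights with a single extreme value, for illustration only
rng = np.random.default_rng(0)
w_demo = np.append(rng.uniform(0.5, 1.5, size=999), 50.0)

w_trunc = truncate_weights(w_demo)
ess_raw = effective_sample_size(w_demo)
ess_trunc = effective_sample_size(w_trunc)

print(f"max weight: {w_demo.max():.2f} -> {w_trunc.max():.2f}")
print(f"effective sample size: {ess_raw:.1f} -> {ess_trunc:.1f}")
```

In the notebook this would be applied as `df['weight_sw'] = truncate_weights(df['weight_sw'])`. Truncation trades a little bias for lower variance, so it is worth reporting results both with and without it.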


## 4) Weighted CoxPH regression (lifelines) + results summary

In [None]:
from lifelines import CoxPHFitter

cph = CoxPHFitter()
fit_df = df[['time','event','treatment','age','stage','sex','weight_sw']].copy()
# IPTW weights are non-integer sampling weights, so request robust (sandwich) standard errors
cph.fit(fit_df, duration_col='time', event_col='event', weights_col='weight_sw', robust=True)
cph.print_summary()


## 5) Weighted Kaplan-Meier curves: treatment=0 vs 1

In [None]:
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt

# Draw both arms on one set of axes so the curves can be compared directly
fig, ax = plt.subplots()
for t in [0,1]:
    mask = df['treatment']==t
    kmf = KaplanMeierFitter()
    kmf.fit(df.loc[mask,'time'], df.loc[mask,'event'], weights=df.loc[mask,'weight_sw'], label=f'treatment={t}')
    kmf.plot_survival_function(ax=ax)
ax.set_title('Weighted KM: treatment=0 vs 1')
ax.set_xlabel('Time')
ax.set_ylabel('Survival probability')
plt.show()


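## 6) Balance (SMD) and sensitivity checklist
- Compute the covariate **SMD** after IPTW (it should drop substantially; an absolute SMD below 0.1 is a common rule of thumb);
- Inspect the weights for extreme values (consider truncating, e.g. at the 1st–99th percentiles);
- Check the **PH assumption** of the CoxPH model (`lifelines` provides tests and residual plots, e.g. `CoxPHFitter.check_assumptions`);
- If the survival data involve **time-varying treatment or mediators**, consider MSM/g-methods.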
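
The first item in the checklist can be sketched as follows. This is an illustrative example rather than the notebook's own method: `weighted_smd` is a hypothetical helper, the toy data use a known propensity score instead of a fitted one, and the denominator pools the weighted variances of the two arms (some authors use the unweighted variances instead).

```python
import numpy as np

def weighted_smd(x, treat, w=None):
    """Standardized mean difference of x between arms, optionally weighted."""
    x = np.asarray(x, dtype=float)
    treat = np.asarray(treat).astype(bool)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)

    def mean_var(mask):
        # Weighted mean and weighted variance within one arm
        m = np.average(x[mask], weights=w[mask])
        v = np.average((x[mask] - m) ** 2, weights=w[mask])
        return m, v

    m1, v1 = mean_var(treat)
    m0, v0 = mean_var(~treat)
    pooled_sd = np.sqrt((v1 + v0) / 2.0)
    return (m1 - m0) / pooled_sd

# Toy confounded data: older patients are more likely to be treated
rng = np.random.default_rng(0)
n = 20000
age = rng.normal(60, 10, n)
ps_true = 1 / (1 + np.exp(-0.08 * (age - 60)))  # assumed known here
treat = rng.binomial(1, ps_true)

# Stabilized IPTW, same formula as in section 3
p_treat = treat.mean()
sw = np.where(treat == 1, p_treat / ps_true, (1 - p_treat) / (1 - ps_true))

smd_before = weighted_smd(age, treat)
smd_after = weighted_smd(age, treat, sw)
print(f"SMD(age) before weighting: {smd_before:.3f}")
print(f"SMD(age) after weighting:  {smd_after:.3f}")
```

In the notebook, the same function could be looped over `['age','stage','sex']` with `df['treatment']` and `df['weight_sw']` to build a before/after balance table.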