
# Selection Models & Non‑Random Treatment Effects — Heckman + Propensity Scores — Python Notebook

**When to Use**  
- Treatment/exposure is **not randomized** (e.g., targeted ads shown to likely buyers), or your outcome is observed only for a **selected sample** (e.g., spend observed only for purchasers).  
- Want to correct for **selection bias** or estimate **causal lift** using observational data.

**Best Application**  
- **Sample selection**: two‑step Heckman when the decision to be observed is modeled separately (e.g., purchase vs. spend).  
- **Propensity scores**: IPW or matching when you can defend **selection on observables**.  
- Combine with business rules and **experiments** for validation.

**When Not to Use**  
- When a **strong instrument** exists or randomization is feasible → prefer **IV / RCT / geo‑experiments**.  
- If unobserved confounding is large and unaddressed, PS methods will still be biased.

**How to Interpret Results**  
- **Heckman**: the **inverse Mills ratio (IMR)** captures selection; a statistically significant coefficient on the IMR term in the outcome equation is evidence of non‑random selection.  
- **PSM/IPW**: effects are **conditional on assumptions** (no hidden bias). Always perform **balance checks** and **sensitivity analyses**.
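
For reference, the IMR for the selected sample is the ratio of the standard normal density to its CDF, evaluated at the probit linear predictor. The sketch below is one way to compute it; the helper name `inverse_mills_ratio` is illustrative and not part of any library. It works in log space because a direct `pdf / cdf` division can underflow to `0/0` for very negative indices.

```python
import numpy as np
from scipy.stats import norm

def inverse_mills_ratio(xb):
    """IMR for the selected sample: phi(xb) / Phi(xb), computed in log space."""
    xb = np.asarray(xb, dtype=float)
    # exp(log phi - log Phi) avoids dividing two numbers that both underflow to zero
    return np.exp(norm.logpdf(xb) - norm.logcdf(xb))

# Illustrative inputs: the ratio grows as the selection index becomes more negative
imr_demo = inverse_mills_ratio([-40.0, -2.0, 0.0, 2.0])
```

Observations that are unlikely to be selected (very negative index) receive a large IMR, which is how the correction term carries information about selection into the outcome equation.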


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import norm

pd.set_option('display.max_columns', 160)
plt.rcParams['figure.figsize'] = (8,4)
rng = np.random.default_rng(222)


### Data: Simulate purchase selection and ad exposure with confounding

In [None]:

n = 5000
# Covariates
x1 = rng.normal(0,1,n)                 # e.g., prior engagement
x2 = rng.normal(0,1,n)                 # e.g., income proxy
u  = rng.normal(0,1,n)                 # unobserved taste

# Treatment (ad exposure) depends on x1, x2, and unobserved u (endogeneity)
p_treat = 1/(1+np.exp(-(0.5*x1 + 0.3*x2 + 0.6*u)))
treat = rng.binomial(1, p_treat)

# Selection: purchase indicator depends on x1, treat, and u
p_buy = 1/(1+np.exp(-(-0.3 + 0.8*x1 + 0.4*treat + 0.7*u)))
buy = rng.binomial(1, p_buy)

# Outcome only observed if buy==1 (e.g., spend conditional on purchase)
eps = rng.normal(0, 1, n)
spend_star = 10 + 2.0*x1 + 1.2*x2 + 1.5*treat + 0.8*u + eps  # latent spend
spend = np.where(buy==1, spend_star, np.nan)

df = pd.DataFrame({'x1':x1,'x2':x2,'treat':treat,'buy':buy,'spend':spend})
df.head()


### Naive Estimates (biased)

In [None]:

# Naive treatment effect on spend ignoring selection (use observed spend only)
naive_ols = smf.ols("spend ~ treat + x1 + x2", data=df[df['buy']==1]).fit()
naive_ate = df.loc[df['buy']==1].groupby('treat')['spend'].mean().diff().iloc[-1]
{'naive_ols_coef_treat': naive_ols.params['treat'], 'group_diff_observed': naive_ate}


### Heckman Two‑Step Sample Selection Correction

In [None]:

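# Step 1: Probit selection (buy) using variables that affect selection; include an exclusion restriction if available
# Toy variable z: it is a function of x2, which also enters the outcome equation, so it is NOT a
# valid exclusion restriction; identification here rests largely on functional form.
# buy was simulated from a logit, so the probit's normality assumption holds only approximately.
z = (df['x2'] > 0).astype(int)
df['z'] = z

probit = smf.probit("buy ~ x1 + x2 + treat + z", data=df.assign(buy=df['buy'].astype(float))).fit(disp=False)
df['XB'] = probit.fittedvalues  # linear predictor x'b from the probit
# IMR for buy==1, computed in log space for numerical stability; used only on observed-spend rows.
# The column is named 'imr' because 'lambda' is a Python keyword and cannot appear bare in a formula.
df['imr'] = np.exp(norm.logpdf(df['XB']) - norm.logcdf(df['XB']))

# Step 2: Outcome on observed sample with IMR
# Plain OLS standard errors ignore that imr is itself estimated; treat them as approximate.
heckman = smf.ols("spend ~ treat + x1 + x2 + imr", data=df[df['buy']==1]).fit()
heckman.params, heckman.bse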


In [None]:

print(heckman.summary())


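**Interpretation:** A statistically significant coefficient on the inverse Mills ratio term suggests selection bias. The `treat` coefficient is corrected for selection only under the model's assumptions (jointly normal errors and, ideally, a valid exclusion restriction).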

### Propensity Scores (PS) and Balance Checks

In [None]:

# Estimate PS using observables (exclude u by design)
X = df[['x1','x2']].values
ps_model = LogisticRegression(max_iter=500).fit(X, df['treat'])
df['ps'] = ps_model.predict_proba(X)[:,1]

def std_diff(a, b):
    mu1, mu0 = a.mean(), b.mean()
    s = np.sqrt(0.5*(a.var()+b.var()) + 1e-9)
    return (mu1 - mu0)/s

balance_before = {
    'x1': std_diff(df.loc[df.treat==1,'x1'], df.loc[df.treat==0,'x1']),
    'x2': std_diff(df.loc[df.treat==1,'x2'], df.loc[df.treat==0,'x2']),
    'ps': std_diff(df.loc[df.treat==1,'ps'], df.loc[df.treat==0,'ps']),
}
balance_before
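
The same standardized difference can be computed with weights, which gives a balance check for IPW rather than for matching. This is a sketch on separately simulated data; `weighted_std_diff` and the `_demo` variables are illustrative names, and the propensity score is the true one only because the data are simulated.

```python
import numpy as np

def weighted_std_diff(x, t, w):
    """Standardized mean difference of x between treated and control, using weights w."""
    x, t, w = (np.asarray(a, dtype=float) for a in (x, t, w))
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    # Pool the unweighted group variances so the scale matches the unweighted std_diff
    s = np.sqrt(0.5 * (x[t == 1].var(ddof=1) + x[t == 0].var(ddof=1)))
    return (m1 - m0) / s

rng_demo = np.random.default_rng(0)
x_demo = rng_demo.normal(size=20000)
e_demo = 1 / (1 + np.exp(-0.8 * x_demo))          # true propensity (known by construction)
t_demo = rng_demo.binomial(1, e_demo)
w_demo = np.where(t_demo == 1, 1 / e_demo, 1 / (1 - e_demo))

smd_raw = weighted_std_diff(x_demo, t_demo, np.ones_like(x_demo))  # before weighting
smd_ipw = weighted_std_diff(x_demo, t_demo, w_demo)                # after IPW
```

An absolute standardized difference below about 0.1 is a commonly used threshold for acceptable balance; weighting can only balance the covariates that enter the propensity model.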


### IPW: Inverse Probability Weighting for ATE

In [None]:

# Spend is observed only for purchasers (buy==1), so the estimate below is an IPW contrast among
# purchasers, not a population ATE; conditioning on buy (affected by treat and u) leaves selection bias.
# Keep the latent outcome for reference: in this simulation the true effect of treat on spend_star is 1.5.
df['spend_star'] = spend_star

# Use observed spend only; drop rows where spend is NaN
ipw_df = df.dropna(subset=['spend']).copy()
e_obs = ipw_df['ps'].clip(1e-3, 1-1e-3)
w_obs = np.where(ipw_df['treat']==1, 1/e_obs, 1/(1-e_obs))

# Normalized (Hajek) IPW: weighted mean of treated minus weighted mean of controls.
# Each unit is weighted exactly once, by 1/e if treated and 1/(1-e) if control.
is_t = (ipw_df['treat']==1).to_numpy()
y_obs = ipw_df['spend'].to_numpy()
ipw_ate = np.average(y_obs[is_t], weights=w_obs[is_t]) - np.average(y_obs[~is_t], weights=w_obs[~is_t])
ipw_ate
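
A quick sanity check for any IPW estimator is to run it on data where the true effect is known and selection depends only on observables. The sketch below does that with a normalized (Hajek) estimator; `hajek_ate` and the `_ipw` variables are hypothetical names, and the true effect is set to 2.0 by construction.

```python
import numpy as np

def hajek_ate(y, t, e, clip=1e-3):
    """Normalized IPW estimate of the ATE given outcomes y, treatment t, propensity e."""
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, e))
    e = np.clip(e, clip, 1 - clip)                 # guard against extreme weights
    treated = t == 1
    mu1 = np.average(y[treated], weights=1 / e[treated])
    mu0 = np.average(y[~treated], weights=1 / (1 - e[~treated]))
    return mu1 - mu0

rng_ipw = np.random.default_rng(7)
n_ipw = 20000
x_ipw = rng_ipw.normal(size=n_ipw)                 # observed confounder
e_ipw = 1 / (1 + np.exp(-0.8 * x_ipw))             # true propensity (known by construction)
t_ipw = rng_ipw.binomial(1, e_ipw)
y_ipw = 1.0 + 2.0 * t_ipw + 1.5 * x_ipw + rng_ipw.normal(size=n_ipw)

naive_diff = y_ipw[t_ipw == 1].mean() - y_ipw[t_ipw == 0].mean()  # confounded by x_ipw
ipw_est = hajek_ate(y_ipw, t_ipw, e_ipw)
```

Here the naive difference overstates the effect because treated units have higher `x_ipw`, while the weighted estimate should land near 2.0. No such recovery should be expected when an unobserved driver of treatment (like `u` in the notebook's data) is left out of the propensity model.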


### Propensity Score Matching (Nearest Neighbor, 1:1)

In [None]:

m_df = df.dropna(subset=['spend']).copy()
treated = m_df[m_df['treat']==1].copy()
control = m_df[m_df['treat']==0].copy()

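# 1:1 nearest-neighbor matching on the propensity score, with replacement (a control can be reused)
nbrs = NearestNeighbors(n_neighbors=1).fit(control[['ps']].values)
dist, idx = nbrs.kneighbors(treated[['ps']].values)
matched_controls = control.iloc[idx.flatten()].copy()
matched_controls.index = treated.index  # align each matched control with its treated unit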

att_psm = (treated['spend'].values - matched_controls['spend'].values).mean()
att_psm


### Comparison of Estimates

In [None]:

comp = pd.Series({
    'Naive_OLS_observed_treat_coef': naive_ols.params['treat'],
    'Heckman_treat_coef': heckman.params['treat'],
    'IPW_ATE_on_observed': ipw_ate,
    'PSM_ATT_on_observed': att_psm
}).to_frame('estimate')
comp


### Balance Check After Matching

In [None]:

matched = pd.concat([treated[['x1','x2','ps']], matched_controls[['x1','x2','ps']].rename(columns=lambda c: c+"_c")], axis=1)

balance_after = {
    'x1': std_diff(treated['x1'], matched_controls['x1']),
    'x2': std_diff(treated['x2'], matched_controls['x2']),
    'ps': std_diff(treated['ps'], matched_controls['ps']),
}
balance_before, balance_after



---

### Practical Guidance
- Include an **exclusion restriction** in Heckman (a variable that affects selection, not the outcome).  
- With propensity scores, check **overlap** (as a rule of thumb, most scores in both groups falling within roughly 0.1–0.9) and improve balance (calipers, stratification, weighting).  
- Validate with **experiments** when possible; report **assumption checks** and **sensitivity** (e.g., Rosenbaum bounds).  
- If a strong instrument exists, prefer **IV/2SLS**.
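
As one illustration of the caliper idea mentioned above, nearest-neighbor matches can be discarded when the propensity-score distance exceeds a threshold. This is a minimal sketch: `caliper_match` is a hypothetical helper, and the caliper of 0.05 is an arbitrary illustrative value rather than a recommendation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def caliper_match(ps_treated, ps_control, caliper=0.05):
    """1:1 nearest-neighbor matching with replacement, dropping matches beyond the caliper.

    Returns positional indices of the retained treated units and their matched controls.
    """
    ps_t = np.asarray(ps_treated, dtype=float).reshape(-1, 1)
    ps_c = np.asarray(ps_control, dtype=float).reshape(-1, 1)
    nn = NearestNeighbors(n_neighbors=1).fit(ps_c)
    dist, idx = nn.kneighbors(ps_t)
    keep = dist.ravel() <= caliper                 # treated units with a close enough control
    return np.flatnonzero(keep), idx.ravel()[keep]

# Toy scores: the third treated unit (0.95) has no control within the caliper
t_pos, c_pos = caliper_match([0.20, 0.50, 0.95], [0.22, 0.48, 0.60], caliper=0.05)
```

Dropping unmatched treated units changes the estimand to the effect among treated units with comparable controls, so the share of units discarded is worth reporting alongside the estimate.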

### References (non‑link citations)
1. Heckman — *Sample Selection Bias as a Specification Error*.  
2. Rosenbaum & Rubin — *The Central Role of the Propensity Score*.  
3. Angrist & Pischke — *Mostly Harmless Econometrics*.
