
# Proxies
**Hands‑on Notebook**


**In this notebook**
Use a **proxy variable** to partially adjust for an unobserved confounder.


In [8]:
%pip install statsmodels



In [2]:
import numpy as np
import pandas as pd


## **Proxy variable** for an unobserved confounder

Unobserved `U` (true smoking exposure) affects both `YellowTeeth (Z)` and `Cancer (Y)`;  
`Smoking (X)` is a noisy self-report that we cannot rely on because of its low precision. The nicotine level in the body, `NL`, is a biomarker meant to be measured accurately, and it serves as a **proxy** for `U`.

We compare the naive estimate `P(Y|X)` with the estimate adjusted for the proxy `NL` (back-door adjustment via the proxy).


In [5]:
N = 200_000
rng = np.random.default_rng(12)

# Unobserved true exposure
U = rng.normal(0, 1, N)                 # unobserved driver of risk

# Observed variables
X  = U + rng.normal(0, 1.0, N)          # self-report (noisy, low precision)
Z  = U + rng.normal(0, 0.8, N)          # yellow teeth (crude indicator; we won't use it for adjustment here)
NL = U + rng.normal(0, 1.5, N)          # biomarker proxy; sd=1.5 makes it the noisiest measure in this run (lower it to model an accurate biomarker)

# Outcome depends on TRUE exposure (U), not X directly
logit = -0.7 + 1.4 * U
pY = 1 / (1 + np.exp(-logit))
Y = (rng.random(N) < pY).astype(int)

dfp = pd.DataFrame(dict(X=X, Z=Z, NL=NL, Y=Y))

# --- Models: naive vs proxy-adjusted ---
import statsmodels.api as sm

# 1) Naive: Y ~ X  (confounded by U)
m_naive = sm.Logit(dfp["Y"], sm.add_constant(dfp[["X"]])).fit(disp=False)

# 2) Proxy-adjusted with accurate biomarker: Y ~ X + NL
m_proxy = sm.Logit(dfp["Y"], sm.add_constant(dfp[["X","NL"]])).fit(disp=False)

# 3) Biomarker only: Y ~ NL  (approaches the "oracle" fit on U only as NL's noise shrinks)
m_biomarker = sm.Logit(dfp["Y"], sm.add_constant(dfp[["NL"]])).fit(disp=False)

print("Predicting Y (cancer) using self reported smoking only")
print("Naive (Y ~ X):")
print(f"  beta_X = {m_naive.params['X']:.3f}")

print("\nProxy-adjusted to include both self report and nicotine level biomarker (Y ~ X + NL):")
print(f"  beta_X  = {m_proxy.params['X']:.3f}   (should shrink toward 0)")
print(f"  beta_NL = {m_proxy.params['NL']:.3f}  (picks up part of the U effect)")

print("\nPredicting Y (cancer) using Biomarker only (Y ~ NL):")
print(f"  beta_NL = {m_biomarker.params['NL']:.3f}")

# (Optional instructor check: uncomment to peek at "truth")
# corr_U_X  = np.corrcoef(U, X)[0,1]
# corr_U_Z  = np.corrcoef(U, Z)[0,1]
# corr_U_NL = np.corrcoef(U, NL)[0,1]
# print(f"\n[Hidden truth] corr(U,X)={corr_U_X:.2f}, corr(U,Z)={corr_U_Z:.2f}, corr(U,NL)={corr_U_NL:.2f}")


Predicting Y (cancer) using self reported smoking only
Naive (Y ~ X):
  beta_X = 0.579

Proxy-adjusted to include both self report and nicotine level biomarker (Y ~ X + NL):
  beta_X  = 0.486   (should shrink toward 0)
  beta_NL = 0.219  (picks up part of the U effect)

Predicting Y (cancer) using Biomarker only (Y ~ NL):
  beta_NL = 0.339
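
The optional instructor check in the cell above can also be anticipated analytically. As a minimal sketch, assuming `U` has unit variance and each measurement adds independent Gaussian noise (which is how the simulation is written), the correlation between `U` and a measurement `U + eps` is `1 / sqrt(1 + sd**2)`. The helper name `proxy_correlation` is made up for this illustration.

```python
import math

def proxy_correlation(noise_sd: float) -> float:
    """Correlation between U ~ N(0, 1) and U + eps, with eps ~ N(0, noise_sd**2)."""
    # Cov(U, U + eps) = 1 and Var(U + eps) = 1 + noise_sd**2 when eps is independent of U.
    return 1.0 / math.sqrt(1.0 + noise_sd ** 2)

# Noise standard deviations copied from the simulation cell.
noise_sds = {"X": 1.0, "Z": 0.8, "NL": 1.5}
correlations = {name: proxy_correlation(sd) for name, sd in noise_sds.items()}

for name, rho in correlations.items():
    print(f"corr(U, {name}) = {rho:.2f}")
```

Under these assumptions the values come out near 0.71 for `X`, 0.78 for `Z`, and 0.55 for `NL`, so in this run `NL` tracks `U` less closely than the self-report does.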



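> **Observation:** With only `X`, the coefficient picks up confounding from `U`.  
> Adding the proxy `NL` absorbs part of `U`'s influence and moves `beta_X` toward the *direct* effect of `X`, which is zero in this simulation. The shift is modest here (0.579 to 0.486) because `NL` is simulated with a noise standard deviation of 1.5.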


## Exercise:

**Proxy strength:** In section C, increase the noise in the proxy `NL` (e.g., `NL = U + 1.5*eps`, which corresponds to `rng.normal(0, 1.5, N)` in the cell above).  
   - How do `beta_X` and `beta_NL` change? What does this say about **weak proxies**?

Weak proxies fail to fully adjust for confounding.
They remove only part of the bias, leaving the estimated effect of X still confounded.