In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
try:
    import graphviz
    GRAPHVIZ_AVAILABLE = True
except ImportError:
    GRAPHVIZ_AVAILABLE = False
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg, **kwargs):
    display(Markdown(f"<div class='alert alert-info'>📝 {textwrap.fill(msg, width=100)}</div>"))
def sec(title):
    print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

if not GRAPHVIZ_AVAILABLE: note("The 'graphviz' library is not installed (`pip install graphviz`). Some DAG visualizations will be skipped.")
note("Environment initialized for Advanced Causal Inference.")

# Part 6: Econometrics
## Chapter 6.04: Causal Inference: The Quest for "Why"

### Introduction: The Credibility Revolution
This chapter marks a pivotal transition from prediction and correlation to the far more ambitious world of **causal inference**. The famous mantra "correlation does not imply causation" is the starting point for this entire field. The focused pursuit of credible causal identification that began in the late 20th century has been dubbed the **"credibility revolution"** in economics. Its contribution was recognized by the 2021 Nobel Prize awarded to David Card, Joshua Angrist, and Guido Imbens.

This chapter provides a PhD-level introduction to the modern causal inference toolkit, covering:
1.  **The Potential Outcomes Framework:** A formal language for defining causality.
2.  **Directed Acyclic Graphs (DAGs):** A graphical language for visualizing causal assumptions and identifying sources of bias.
3.  **Matching Methods:** An overview of methods to control for observed confounders.
4.  **A Bridge to Quasi-Experiments:** An introduction to the Local Average Treatment Effect (LATE).

### 1. The Potential Outcomes Framework
The core difficulty of causal inference is that for any given individual, we can never simultaneously observe their outcome in two different states of the world. The **Potential Outcomes framework** formalizes this.

For each individual $i$, we define two potential outcomes:
*   $Y_i(1)$: The potential outcome for individual $i$ *if they receive the treatment* ($D_i=1$).
*   $Y_i(0)$: The potential outcome for individual $i$ *if they do not receive the treatment* ($D_i=0$).

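The **individual causal effect** is $\tau_i = Y_i(1) - Y_i(0)$. We can never observe it, because only one potential outcome is realized for each individual: the observed outcome is $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$. Instead, we focus on estimating **average treatment effects**, such as the **ATE**, $E[Y(1) - Y(0)]$ (for the whole population), or the **ATT**, $E[Y(1) - Y(0) | D=1]$ (for the treated population).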

A naive comparison of the average outcomes for the treated and untreated groups decomposes into:
$$ E[Y|D=1] - E[Y|D=0] = \underbrace{E[Y(1) - Y(0) | D=1]}_{\text{ATT}} + \underbrace{E[Y(0) | D=1] - E[Y(0) | D=0]}_{\text{Selection Bias}} $$ 
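The goal of every causal inference method is to eliminate the **selection bias** term, which measures how the treated and untreated groups would differ even if no one were treated. The gold standard is **random assignment**: it makes $D$ independent of the potential outcomes, so $E[Y(0) | D=1] = E[Y(0) | D=0]$ and the selection bias is zero in expectation.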
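
The decomposition can be checked numerically. The following is a minimal sketch on simulated data, where both potential outcomes are visible (which never happens with real data). The data-generating process, the names `ability`, `y0`, `y1`, and `decompose`, and the treatment effect of 2 are illustrative assumptions, not part of the chapter's own examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP: an unobserved trait raises both the untreated outcome
# and the probability of taking the treatment (self-selection).
ability = rng.normal(0, 1, n)
y0 = 10 + 2 * ability + rng.normal(0, 1, n)   # potential outcome without treatment
y1 = y0 + 2                                   # constant treatment effect of 2 (assumed)

def decompose(d):
    """Return (naive difference, ATT, selection bias) for a 0/1 treatment vector d."""
    y = np.where(d == 1, y1, y0)              # observed outcome
    naive = y[d == 1].mean() - y[d == 0].mean()
    att = (y1 - y0)[d == 1].mean()
    selection_bias = y0[d == 1].mean() - y0[d == 0].mean()
    return naive, att, selection_bias

# Self-selection: higher-ability individuals are more likely to be treated.
d_selected = rng.binomial(1, 1 / (1 + np.exp(-ability)))
# Random assignment: a coin flip, independent of the potential outcomes.
d_random = rng.binomial(1, 0.5, n)

results = {}
for label, d in [("self-selected", d_selected), ("randomized", d_random)]:
    results[label] = decompose(d)
    naive, att, bias = results[label]
    print(f"{label:>13}: naive={naive:.3f}  ATT={att:.3f}  selection bias={bias:.3f}")
```

In both scenarios the naive difference equals the ATT plus the selection bias exactly, since the decomposition is an identity. Under self-selection the bias term should be clearly positive, while under random assignment it should be close to zero, up to sampling noise.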

### 2. Directed Acyclic Graphs (DAGs) and Causal Identification
Developed by Judea Pearl, Directed Acyclic Graphs (DAGs) are a powerful tool for visualizing causal assumptions and identifying sources of bias.

#### 2.1 Confounding and The Backdoor Criterion
A **confounder** is a variable that causes both the treatment and the outcome, opening a non-causal **backdoor path** between them (for example, $D \leftarrow Z \rightarrow Y$). To identify the causal effect, we must find a set of observable variables $Z$ that satisfies the **backdoor criterion**: no variable in $Z$ is a descendant of the treatment, and $Z$ blocks every path between treatment and outcome that begins with an arrow into the treatment. If such a set exists, we can obtain a causal estimate by **conditioning** on $Z$ (e.g., including $Z$ in a regression).
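
To see what conditioning on $Z$ buys, here is a small sketch on simulated data with a single confounder and the structure $Z \rightarrow D$, $Z \rightarrow Y$, $D \rightarrow Y$. The coefficients and the true effect of 3 are illustrative assumptions, and `ols` is a hypothetical helper wrapped around `np.linalg.lstsq`. It is not the only way to adjust for $Z$, only the simplest.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

z = rng.normal(0, 1, n)                                 # confounder
d = (0.8 * z + rng.normal(0, 1, n) > 0).astype(float)   # binary treatment, depends on z
y = 3.0 * d + 2.0 * z + rng.normal(0, 1, n)             # assumed true effect of d on y is 3

def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_naive = ols(y, d)[1]        # backdoor path through z left open
beta_adjusted = ols(y, d, z)[1]  # conditioning on z blocks the backdoor path

print(f"naive coefficient on d:    {beta_naive:.3f}")
print(f"adjusted coefficient on d: {beta_adjusted:.3f}")
```

The naive coefficient absorbs part of the effect of `z` on `y`, whereas the adjusted coefficient should land near the assumed value of 3. The adjustment works here because the simulated outcome is linear in `z`; with real data the functional form is itself an assumption.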

In [None]:
sec("Visualizing Confounding")
if GRAPHVIZ_AVAILABLE:
    dot = graphviz.Digraph()
    dot.node('D', 'Treatment'); dot.node('Y', 'Outcome'); dot.node('Z', 'Confounder')
    dot.edge('Z', 'D'); dot.edge('Z', 'Y'); dot.edge('D', 'Y')
    note("The path D <- Z -> Y is a backdoor path. Controlling for Z closes it.")
    display(dot)
else: note("Graphviz not installed, skipping DAG visualization.")

#### 2.2 Collider Bias
A common and counter-intuitive source of bias is **collider bias**. A collider is a variable that is caused by two or more other variables. In the path $A \rightarrow C \leftarrow B$, the node $C$ is a collider. Conditioning on a collider (or a descendant of a collider) is a grave error: it *opens* a non-causal path between its parents, inducing a spurious correlation where none existed before.

**Example:** Suppose a student's `Talent` and `Luck` both contribute to them getting an `Elite University Admission`. Talent and Luck are independent in the general population. However, if we look *only* at students admitted to an elite university (i.e., we condition on the collider), we will find a negative correlation between Talent and Luck. Among the admitted students, the lucky ones are, on average, less talented, and the talented ones are, on average, less lucky.

In [None]:
sec("Illustrating Collider Bias")
if GRAPHVIZ_AVAILABLE:
    dot = graphviz.Digraph()
    dot.node('T', 'Talent'); dot.node('L', 'Luck'); dot.node('A', 'Admission')
    dot.edge('T', 'A'); dot.edge('L', 'A')
    note("Controlling for Admission (the collider) creates a spurious link between Talent and Luck.")
    display(dot)
else: note("Graphviz not installed, skipping DAG visualization.")

rng = np.random.default_rng(123)
n = 5000
talent = rng.normal(0, 1, n); luck = rng.normal(0, 1, n)
admission_score = talent + luck
admitted = admission_score > np.quantile(admission_score, 0.9) # Top 10% get admitted

corr_all = np.corrcoef(talent, luck)[0, 1]
corr_admitted = np.corrcoef(talent[admitted], luck[admitted])[0, 1]
note(f"Correlation in full population: {corr_all:.3f} (close to zero as expected)")
note(f"Correlation among admitted students: {corr_admitted:.3f} (spuriously negative!)")

### 3. Matching Methods: Controlling for Observables
When randomization is not possible, but we believe we have observed all important confounding variables ($Z$), we can use **matching methods**. The core assumption is **Conditional Independence**: $(Y(0), Y(1)) \perp D | Z$. The goal is to create a control group that is as similar as possible to the treated group on all observables $Z$.

**Propensity Score Matching (PSM):** Rosenbaum and Rubin (1983) showed that it is sufficient to match on a one-dimensional summary of the covariates: the **propensity score**, $p(Z) = P(D=1 | Z)$. A crucial diagnostic is to check for **common support**: the propensity scores of the treated and control groups must overlap substantially.

In [None]:
sec("Code Lab: Propensity Score Matching")
# Generate synthetic data with selection on observables
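rng = np.random.default_rng(42)
n = 2000
Z = rng.normal(0, 1, n)                    # observed confounder
D_prob = 1 / (1 + np.exp(-(Z - 0.5)))      # true propensity score, increasing in Z
D = rng.binomial(1, D_prob)                # treatment assignment
Y0 = 10 + 2*Z + rng.normal(0, 1, n)        # untreated potential outcome
Y = Y0 + 5*D                               # constant treatment effect of 5
df = pd.DataFrame({'Y': Y, 'D': D, 'Z': Z})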

# 1. Estimate propensity scores
ps_model = smf.logit('D ~ Z', data=df).fit(disp=0)
df['pscore'] = ps_model.predict(df)

# 2. Check for common support
sns.histplot(data=df, x='pscore', hue='D', stat='density', common_norm=False)
plt.title('Propensity Score Distributions (Common Support Check)'); plt.show()
note("The distributions overlap well, indicating good common support.")

# 3. Match and estimate ATT (1-nearest-neighbor on the propensity score, with replacement)
treated = df[df['D'] == 1]; control = df[df['D'] == 0]
matches = []
for p in treated['pscore']:
    best_match_idx = (control['pscore'] - p).abs().idxmin()  # closest control unit
    matches.append(control.loc[best_match_idx, 'Y'])

att_psm = (treated['Y'].values - np.array(matches)).mean()
note(f"Propensity Score Matching estimate of ATT: {att_psm:.3f} (True effect is 5)")
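
Since the goal of matching is a control group that resembles the treated group on $Z$, one common follow-up is to compare covariate balance before and after matching. The sketch below is self-contained and uses only NumPy. It regenerates data in the same style as the lab (not the identical draws), fits the logit with a hand-rolled Newton-Raphson routine in place of `smf.logit`, and reports the standardized mean difference (SMD) of `z`. The helpers `fit_logit` and `smd` are hypothetical names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic data in the style of the lab: selection into treatment depends on z.
z = rng.normal(0, 1, n)
d = rng.binomial(1, 1 / (1 + np.exp(-(z - 0.5))))

def fit_logit(x, d, n_iter=25):
    """Logit of d on [1, x] by Newton-Raphson; returns fitted probabilities."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        gradient = X.T @ (d - p)
        hessian = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hessian, gradient)
    return 1 / (1 + np.exp(-X @ beta))

def smd(x_treated, x_control):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

pscore = fit_logit(z, d)
treated, control = d == 1, d == 0

# 1-nearest-neighbor matching with replacement on the propensity score.
dist = np.abs(pscore[treated][:, None] - pscore[control][None, :])
match_idx = dist.argmin(axis=1)
z_matched = z[control][match_idx]

smd_before = smd(z[treated], z[control])
smd_after = smd(z[treated], z_matched)
print(f"SMD of z before matching: {smd_before:.3f}")
print(f"SMD of z after matching:  {smd_after:.3f}")
```

A large SMD before matching reflects the selection on `z` built into the simulation; after matching it should shrink toward zero. A frequently used rule of thumb treats an absolute SMD below 0.1 as acceptable balance, though that threshold is a convention rather than a guarantee.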

---

***A Look Ahead:*** *The matching methods discussed here work well when the set of confounders, $Z$, is low-dimensional. However, in many modern applications, researchers have access to hundreds or even thousands of potential control variables. In such high-dimensional settings, standard matching on a propensity score can become fragile. The **Causal Machine Learning** chapter introduces state-of-the-art methods, such as Double/Debiased Machine Learning, that are specifically designed to handle high-dimensional confounding.*

### 4. A Bridge to Quasi-Experiments: The Local Average Treatment Effect (LATE)
Often, we cannot ensure conditional independence holds for the entire population. Quasi-experimental methods like Instrumental Variables (IV) rely on finding some source of variation that is "as good as random" for a subset of the population.

This leads to the concept of the **Local Average Treatment Effect (LATE)**, which is the average treatment effect for the specific sub-population whose treatment status is influenced by the instrument. These are the **compliers**: individuals who would take the treatment if encouraged by the instrument, but not otherwise. When treatment effects differ across individuals, IV methods, which we will see in the next chapter, identify the LATE (under standard assumptions, including that the instrument never pushes anyone away from treatment), not the ATE or ATT. This is a crucial distinction for policy interpretation: the effect of a policy on the group of people who can be induced to take it up may be very different from the effect on the general population.
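
The gap between the LATE and the ATE or ATT can be made concrete with a simulated population in which each individual's compliance type is known. Everything in this sketch is an illustrative assumption: the type shares, the effect sizes, and the use of the Wald ratio (the difference in mean outcomes by instrument, divided by the difference in take-up) as a simple IV estimator ahead of the fuller treatment in the next chapter.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical population: 50% compliers, 30% never-takers, 20% always-takers
# (no defiers, so the instrument only ever moves people toward treatment).
types = rng.choice(["complier", "never", "always"], size=n, p=[0.5, 0.3, 0.2])
z = rng.binomial(1, 0.5, n)                    # randomly assigned instrument

# Treatment take-up as a function of type and instrument.
d = np.where(types == "always", 1, np.where(types == "complier", z, 0))

# Heterogeneous effects (assumed): compliers gain 2, always-takers 6, never-takers 1.
effect = np.select([types == "complier", types == "always"], [2.0, 6.0], default=1.0)
y0 = rng.normal(0, 1, n) + 1.5 * (types == "always")   # always-takers differ at baseline
y = y0 + effect * d

# Wald estimator: difference in outcomes by z, scaled by difference in take-up by z.
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
late_hat = reduced_form / first_stage

ate = effect.mean()                 # population value: 0.5*2 + 0.3*1 + 0.2*6 = 2.5
att = effect[d == 1].mean()         # averages over always-takers and treated compliers
print(f"Wald/IV estimate: {late_hat:.3f} (assumed complier effect is 2)")
print(f"ATE: {ate:.3f}, ATT: {att:.3f}")
```

In this simulated population the Wald ratio should land near the complier effect of 2, while the ATE and ATT are larger because always-takers, whose take-up the instrument does not change, are assumed to have larger effects.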

### 5. Exercises

1.  **Selection Bias:** The potential outcomes decomposition in Section 1 gives the bias of the naive estimator relative to the ATT. Derive the corresponding bias relative to the ATE. What is the key assumption needed for the naive estimator to be a valid estimate of the ATT?

2.  **DAGs:** Draw a DAG that represents the following scenario: A student's `Innate Ability` affects both their choice to attend a `Private College` and their future `Earnings`. Their parents' `Income` also affects their choice of college. To estimate the causal effect of `Private College` on `Earnings`, which variables must you control for? Which variable should you *not* control for?

3.  **Collider Bias in Science:** A common observation is that among PhD students who successfully publish in top journals, there appears to be a negative correlation between the originality of their idea and the rigor of their execution. Draw a DAG to explain why this might be an example of collider bias. What is the collider?

4.  **Common Support:** What would the propensity score distribution plot look like if the common support assumption were violated? What would this imply about your ability to estimate the ATT?

5.  **LATE:** A job training program is offered, but only some people sign up. As an instrument, an economist uses a lottery that randomly gives some people an encouragement (a small cash bonus) to sign up. Who are the 'compliers' in this setting? What specific causal effect does this IV strategy identify?