# Identification and Exclusion Restrictions in Heckman Selection Models

**Tutorial Series**: Censored and Selection Models with PanelBox

**Notebook**: 06 - Identification and Exclusion Restrictions

**Author**: PanelBox Contributors

**Estimated Duration**: 60-75 minutes

**Difficulty Level**: Intermediate

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand why exclusion restrictions are critical for identification in Heckman selection models
2. Distinguish between models identified by functional form versus exclusion restrictions
3. Implement Heckman models with proper exclusion restrictions using PanelBox
4. Diagnose identification problems through coefficient instability and collinearity
5. Evaluate the economic validity of candidate exclusion restrictions
6. Conduct sensitivity analyses across alternative instrument specifications
7. Apply best practices for selecting exclusion restrictions in applied research

---

## Prerequisites

- Familiarity with the Heckman two-step estimator (Notebooks 01-05)
- Understanding of probit models and the inverse Mills ratio
- Basic knowledge of instrumental variable logic

---

## Table of Contents

1. [The Identification Problem](#section1)
2. [What Are Exclusion Restrictions?](#section2)
3. [Loading Data](#section3)
4. [Example 1: Labor Supply with Exclusion Restrictions](#section4)
5. [Example 2: College Wages with Exclusion Restrictions](#section5)
6. [What Happens Without Exclusion Restrictions?](#section6)
7. [Testing Exclusion Restrictions](#section7)
8. [Sensitivity Analysis](#section8)
9. [Best Practices](#section9)
10. [Summary and Key Takeaways](#section10)
11. [Exercises](#exercises)

## Setup

Import all required libraries and configure the environment.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
import statsmodels.api as sm

from panelbox.models.selection import PanelHeckman

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

np.random.seed(42)

BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
TABLES_DIR = OUTPUT_DIR / 'tables'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete!')

<a id='section1'></a>
## 1. The Identification Problem

### 1.1 Review: The Heckman Selection Model

The Heckman model consists of two equations:

**Selection equation** (who is observed?):
$$s_i^* = Z_i'\gamma + u_i, \quad s_i = \mathbf{1}[s_i^* > 0]$$

**Outcome equation** (what is the outcome for those observed?):
$$y_i = X_i'\beta + \varepsilon_i \quad \text{if } s_i = 1$$

where $(u_i, \varepsilon_i)$ are jointly normal with $\text{Var}(u_i) = 1$ (the usual probit normalization), $\text{Var}(\varepsilon_i) = \sigma_\varepsilon^2$, and correlation $\rho$.

The key correction formula for the expected outcome, conditional on being selected, is:

$$E[y_i \mid s_i = 1, X_i, Z_i] = X_i'\beta + \rho\sigma_\varepsilon \lambda(Z_i'\gamma)$$

where $\lambda(\cdot) = \phi(\cdot) / \Phi(\cdot)$ is the **inverse Mills ratio** (IMR).
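
As a quick sanity check on the correction formula, the sketch below simulates bivariate normal errors at a single fixed index value and compares the simulated conditional mean of $\varepsilon$ with $\rho\sigma_\varepsilon\lambda(c)$. The parameter values ($\rho = 0.6$, $\sigma_\varepsilon = 2$, $c = 0.5$) and the helper name `inverse_mills` are illustrative assumptions, not part of PanelBox.

```python
import numpy as np
from scipy import stats

def inverse_mills(c):
    """Inverse Mills ratio phi(c) / Phi(c), computed in logs for numerical stability."""
    c = np.asarray(c, dtype=float)
    return np.exp(stats.norm.logpdf(c) - stats.norm.logcdf(c))

rng = np.random.default_rng(0)
rho, sigma_eps, c = 0.6, 2.0, 0.5   # illustrative values (assumptions)
n = 500_000

# Draw (u, eps) with Var(u) = 1, Var(eps) = sigma_eps^2, Corr(u, eps) = rho
u = rng.standard_normal(n)
eps = sigma_eps * (rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(n))

# Selection rule s = 1[Z'gamma + u > 0], evaluated at Z'gamma = c
selected = (c + u) > 0

simulated = float(eps[selected].mean())
theoretical = float(rho * sigma_eps * inverse_mills(c))

print(f"Simulated   E[eps | s = 1]:        {simulated:.4f}")
print(f"Theoretical rho * sigma * lambda(c): {theoretical:.4f}")
```

The two numbers should agree up to simulation noise, which is what justifies adding $\lambda(Z_i'\gamma)$ as a regressor in the outcome equation.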

### 1.2 The Core Problem: Collinearity

The Heckman two-step estimator augments the outcome equation with $\lambda(Z_i'\gamma)$:

$$y_i = X_i'\beta + \theta \lambda(Z_i'\gamma) + \eta_i$$

where $\theta = \rho \sigma_\varepsilon$.

**What happens if $Z = X$** (no exclusion restrictions)?

- $\lambda(X_i'\gamma)$ is a nonlinear function of $X_i$
- But over the range of typical data, $\lambda(\cdot)$ is **approximately linear**
- This means $\lambda(X_i'\gamma) \approx a + b \cdot X_i'\gamma$ for some constants
- The augmented equation becomes approximately: $y_i \approx X_i'\beta + \theta(a + b \cdot X_i'\gamma) + \eta_i$
- **Result**: Near-perfect multicollinearity between $X$ and $\lambda$
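
Collecting terms makes the consequence explicit. Treating the linear approximation as exact (an idealization used here only for illustration), with $a$ and $b$ the constants introduced above:

$$y_i \approx \theta a + X_i'(\beta + \theta b \gamma) + \eta_i$$

The linear part of the regression pins down only the composite vector $\beta + \theta b \gamma$ (with $\theta a$ absorbed into the intercept). Separating $\beta$ from $\theta$ then depends entirely on the curvature of $\lambda(\cdot)$ that the linear approximation leaves out.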

### 1.3 Why This Matters

Without exclusion restrictions, identification relies **entirely** on the nonlinearity of $\lambda(\cdot)$, which comes from the bivariate normality assumption. This is problematic because:

1. **Functional form dependence**: Small deviations from normality can drastically change estimates
2. **Unstable estimates**: Parameters become sensitive to sample composition
3. **Large standard errors**: Multicollinearity inflates variance
4. **Lack of robustness**: Results are not credible for policy analysis
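
The variance inflation in point 3 can be quantified with a variance inflation factor (VIF) for the IMR column. The sketch below is a minimal illustration on synthetic data, not on the notebook's datasets: the coefficients, the excluded variable `w`, and the helper `vif_of_imr` are all assumptions, and the true probit index is used in place of a fitted one to keep the example short.

```python
import numpy as np
from scipy import stats

def inverse_mills(c):
    """Inverse Mills ratio phi(c) / Phi(c), computed in logs for stability."""
    return np.exp(stats.norm.logpdf(c) - stats.norm.logcdf(c))

def vif_of_imr(index, x, selected):
    """VIF of lambda(index) given [1, x], computed on the selected sample."""
    lam = inverse_mills(index[selected])
    design = np.column_stack([np.ones(int(selected.sum())), x[selected]])
    fitted = design @ np.linalg.lstsq(design, lam, rcond=None)[0]
    r2 = 1.0 - np.sum((lam - fitted) ** 2) / np.sum((lam - lam.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
n = 20_000
x = rng.standard_normal(n)   # regressor that appears in both equations
w = rng.standard_normal(n)   # hypothetical excluded variable
u = rng.standard_normal(n)   # selection error

# Case 1: Z = X, so the index depends on x only
index_no_excl = 0.3 + 0.7 * x
sel_no_excl = (index_no_excl + u) > 0

# Case 2: Z = [X, W], so the index also moves with w
index_excl = 0.3 + 0.7 * x + 1.0 * w
sel_excl = (index_excl + u) > 0

vif_no_excl = vif_of_imr(index_no_excl, x, sel_no_excl)
vif_excl = vif_of_imr(index_excl, x, sel_excl)

print(f"VIF of IMR without exclusion restriction: {vif_no_excl:.1f}")
print(f"VIF of IMR with exclusion restriction:    {vif_excl:.1f}")
```

A large VIF in the first case means that the IMR is almost a linear combination of the outcome regressors, so the sampling variance of its coefficient is inflated accordingly.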

In [None]:
# Visualize the identification problem:
# Show that lambda(z) is approximately linear over typical data ranges

z = np.linspace(-3, 3, 500)
phi_z = stats.norm.pdf(z)
Phi_z = stats.norm.cdf(z)
lambda_z = phi_z / np.clip(Phi_z, 1e-10, None)

# Fit a linear approximation over a typical range
mask_typical = (z > -1.5) & (z < 2.0)  # typical probit index range
slope, intercept, r_value, _, _ = stats.linregress(z[mask_typical], lambda_z[mask_typical])
lambda_linear = intercept + slope * z

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: IMR function and linear approximation
axes[0].plot(z, lambda_z, linewidth=2.5, label=r'$\lambda(z) = \phi(z)/\Phi(z)$', color='#2980b9')
axes[0].plot(z, lambda_linear, '--', linewidth=2, label=f'Linear approx. ($R^2$ = {r_value**2:.4f})',
             color='#e74c3c')
axes[0].axvspan(-1.5, 2.0, alpha=0.1, color='green', label='Typical data range')
axes[0].set_xlabel(r"Probit index $Z'\gamma$", fontsize=12)
axes[0].set_ylabel(r'$\lambda(z)$', fontsize=12)
axes[0].set_title('Inverse Mills Ratio: Nearly Linear\nin Typical Data Range', fontsize=13)
axes[0].legend(fontsize=10)
axes[0].set_xlim([-3, 3])
axes[0].set_ylim([0, 4])
axes[0].grid(True, alpha=0.3)

# Right panel: Residuals from linear approximation
residuals = lambda_z - lambda_linear
axes[1].plot(z, residuals, linewidth=2, color='#8e44ad')
axes[1].axvspan(-1.5, 2.0, alpha=0.1, color='green', label='Typical data range')
axes[1].axhline(y=0, color='black', linewidth=0.8, linestyle='-')
axes[1].set_xlabel(r"Probit index $Z'\gamma$", fontsize=12)
axes[1].set_ylabel(r'$\lambda(z) - $ linear approx.', fontsize=12)
axes[1].set_title('Nonlinear Component of IMR\n(Source of Identification Without Exclusion)', fontsize=13)
axes[1].legend(fontsize=10)
axes[1].set_xlim([-3, 3])
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'imr_linearity.png', dpi=150, bbox_inches='tight')
plt.show()

print(f'\nLinear approximation R-squared in typical range: {r_value**2:.4f}')
print(f'Max deviation in typical range: {np.max(np.abs(residuals[mask_typical])):.4f}')
print('\nConclusion: The IMR is close to linear over typical probit index values.')
print('Without exclusion restrictions, the only "identification" comes from the')
print('remaining nonlinear residual -- a fragile basis for estimation.')

*Figure: The inverse Mills ratio $\lambda(z) = \phi(z)/\Phi(z)$ plotted alongside a linear approximation fitted over the shaded range (left). Within that range the linear fit is close; its $R^2$ is printed by the cell above. The right panel shows the remaining nonlinear component, which is the sole source of identification without exclusion restrictions and is small relative to the overall variation in $\lambda$.*

<a id='section2'></a>
## 2. What Are Exclusion Restrictions?

### 2.1 Definition

An **exclusion restriction** is a variable that:

1. **Appears in the selection equation** ($Z$): it affects whether we observe the outcome
2. **Does NOT appear in the outcome equation** ($X$): it does not directly affect the outcome itself

Formally, if $Z = [X, W]$ where $W$ are the excluded instruments, then:
- $W$ shifts the probability of selection
- $W$ has no direct effect on $y$ (conditional on $X$)

### 2.2 Why Exclusion Restrictions Solve the Problem

With exclusion restrictions:
- $\lambda(Z_i'\gamma) = \lambda(X_i'\gamma_1 + W_i'\gamma_2)$
- Variation in $W$ generates variation in $\lambda$ that is **not explained by** $X$
- This breaks the collinearity between $X$ and $\lambda$
- The model is now identified by **genuine exclusion-based variation**, not just functional form
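
To make the mechanics concrete, here is a hand-rolled two-step estimator on synthetic data with one excluded variable. It is a minimal sketch and not PanelBox's implementation: the helpers `fit_probit` and `heckman_two_step`, the data-generating values, and the variable `w` are assumptions made for illustration, and no standard errors are computed.

```python
import numpy as np
from scipy import optimize, stats

def fit_probit(s, Z):
    """Probit MLE via BFGS, using the analytic score."""
    q = 2 * s - 1  # +1 for selected, -1 for not selected

    def objective(g):
        qi = q * (Z @ g)
        score = q * np.exp(stats.norm.logpdf(qi) - stats.norm.logcdf(qi))
        return -stats.norm.logcdf(qi).sum(), -(Z.T @ score)

    res = optimize.minimize(objective, np.zeros(Z.shape[1]), jac=True, method="BFGS")
    return res.x

def heckman_two_step(y, X, s, Z):
    """Step 1: probit of s on Z. Step 2: OLS of y on [X, IMR] for s == 1."""
    gamma_hat = fit_probit(s, Z)
    idx = Z @ gamma_hat
    imr = np.exp(stats.norm.logpdf(idx) - stats.norm.logcdf(idx))
    sel = s == 1
    design = np.column_stack([X[sel], imr[sel]])
    coefs = np.linalg.lstsq(design, y[sel], rcond=None)[0]
    return coefs[:-1], coefs[-1]  # (beta_hat, theta_hat)

rng = np.random.default_rng(2)
n, rho = 5_000, 0.6
x = rng.standard_normal(n)
w = rng.standard_normal(n)  # excluded variable: enters selection only
u = rng.standard_normal(n)
eps = rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # sigma_eps = 1

s = ((0.3 + 0.7 * x + 1.0 * w + u) > 0).astype(float)
y = 1.0 + 0.5 * x + eps  # true beta = (1.0, 0.5), true theta = rho * sigma_eps = 0.6

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([X, w])  # Z = [X, W]

beta_hat, theta_hat = heckman_two_step(y, X, s, Z)

# Naive OLS on the selected sample, for comparison
sel = s == 1
beta_ols = np.linalg.lstsq(X[sel], y[sel], rcond=None)[0]

print("Two-step beta:", np.round(beta_hat, 3), " theta:", round(float(theta_hat), 3))
print("Naive OLS beta:", np.round(beta_ols, 3))
```

Passing `X` in place of `Z` in the call to `heckman_two_step` reproduces the $Z = X$ case, where the second-step regression has to rely on the curvature of $\lambda$ alone.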

### 2.3 The Analogy with Instrumental Variables

Exclusion restrictions in sample selection models are analogous to instruments in IV estimation:

| IV Estimation | Heckman Selection Model |
|---|---|
| Instrument correlated with the endogenous regressor | Excluded variable $W$ affects selection $s$ |
| Instrument uncorrelated with the error $\varepsilon$ | Excluded variable $W$ has no direct effect on outcome $y$ |
| Relevance condition | Strong predictor of selection |
| Exclusion restriction | Excluded from outcome equation |

### 2.4 Classic Examples

| Application | Selection | Outcome | Exclusion Restriction |
|---|---|---|---|
| Female wages | Labor force participation | Log wage | Number of children, husband's income |
| College wage premium | College attendance | Post-college wage | Distance to college, tuition |
| Union wage effect | Union membership | Log wage | State right-to-work laws |
| Program evaluation | Program participation | Earnings | Distance to program site |

<a id='section3'></a>
## 3. Loading Data

We will work with two classic datasets throughout this notebook:

1. **Mroz (1987)**: Married women's labor force participation and wages
2. **College Wage**: College attendance decisions and post-college earnings

In [None]:
# Load both datasets
mroz = pd.read_csv(DATA_DIR / 'mroz_1987.csv')
college = pd.read_csv(DATA_DIR / 'college_wage.csv')

print('=== Mroz (1987) Dataset ===')
print(f'Observations: {len(mroz)}')
print(f'Participation rate: {mroz["lfp"].mean():.1%}')
print(f'\nColumns: {list(mroz.columns)}')
print(f'\nSummary statistics:')
display(mroz.describe().round(2))

In [None]:
print('=== College Wage Dataset ===')
print(f'Observations: {len(college)}')
print(f'College attendance rate: {college["college"].mean():.1%}')
print(f'\nColumns: {list(college.columns)}')
print(f'\nSummary statistics:')
display(college.describe().round(2))

In [None]:
# Visualize the selection patterns in both datasets
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Mroz: wage distribution by participation status
wages_observed = mroz.loc[mroz['lfp'] == 1, 'wage'].dropna()
axes[0].hist(wages_observed, bins=30, edgecolor='black', alpha=0.7, color='#3498db')
axes[0].axvline(wages_observed.mean(), color='red', linestyle='--', linewidth=2,
                label=f'Mean = {wages_observed.mean():.2f}')
axes[0].set_xlabel('Observed Wage', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title(f'Mroz: Wage Distribution (Workers Only)\n'
                   f'N observed = {len(wages_observed)}, '
                   f'N censored = {(mroz["lfp"] == 0).sum()}', fontsize=12)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# College: wage distribution by college attendance
wages_college = college.loc[college['college'] == 1, 'wage'].dropna()
axes[1].hist(wages_college, bins=30, edgecolor='black', alpha=0.7, color='#27ae60')
axes[1].axvline(wages_college.mean(), color='red', linestyle='--', linewidth=2,
                label=f'Mean = {wages_college.mean():.2f}')
axes[1].set_xlabel('Observed Wage', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title(f'College: Wage Distribution (Graduates Only)\n'
                   f'N observed = {len(wages_college)}, '
                   f'N censored = {(college["college"] == 0).sum()}', fontsize=12)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'selection_patterns.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Distribution of observed wages for the two datasets. Left: Mroz data shows wages only for women who participate in the labor force. Right: College data shows wages only for individuals who attended college. In both cases, a substantial portion of the sample is censored (outcome not observed), creating the sample selection problem.*

<a id='section4'></a>
## 4. Example 1: Labor Supply with Exclusion Restrictions

### 4.1 The Economic Setting

In Mroz (1987), married women decide whether to participate in the labor market. We observe:

- **Outcome**: Log wage (observed only for participants)
- **Selection**: Labor force participation (lfp = 1 if working)

### 4.2 Choosing Exclusion Restrictions

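For the labor supply application, the classic exclusion restrictions are listed below. The specification in this notebook also keeps `age` out of the outcome equation, so `age` is treated as a fourth excluded variable: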

1. **Number of young children (children_lt6)**: 
   - Affects participation (childcare costs reduce labor supply)
   - Should not directly affect hourly wage rate (conditional on experience, education)

2. **Number of older children (children_6_18)**:
   - Similar logic, weaker effect than young children

3. **Husband's income (husband_income)**:
   - Higher household income reduces need to work (income effect)
   - Should not affect the woman's own wage rate

### 4.3 Equation Specification

**Outcome equation** (wage determination):
$$\log(wage_i) = \beta_0 + \beta_1 \cdot education_i + \beta_2 \cdot experience_i + \beta_3 \cdot experience^2_i + \varepsilon_i$$

**Selection equation** (labor force participation):
$$s_i^* = \gamma_0 + \gamma_1 \cdot education_i + \gamma_2 \cdot experience_i + \gamma_3 \cdot experience^2_i + \underbrace{\gamma_4 \cdot age_i + \gamma_5 \cdot children\_lt6_i + \gamma_6 \cdot children\_6\_18_i + \gamma_7 \cdot husband\_income_i}_{\text{Excluded from the outcome equation}} + u_i$$

In [None]:
# Prepare data for Mroz example
# Outcome: log wage (for workers)
# Mask non-positive wages first so the log yields NaN rather than -inf
mroz['log_wage'] = np.log(mroz['wage'].where(mroz['wage'] > 0))

# Selection indicator
selection = mroz['lfp'].values

# For outcome equation: replace NaN wages with 0 (PanelHeckman uses selection indicator)
y = mroz['log_wage'].fillna(0).values

# Outcome equation variables (X): const, education, experience, experience_sq
X = sm.add_constant(mroz[['education', 'experience', 'experience_sq']].values)
X_names = ['const', 'education', 'experience', 'experience_sq']

# Selection equation variables (Z): X + exclusion restrictions
Z = sm.add_constant(mroz[['education', 'experience', 'experience_sq',
                           'age', 'children_lt6', 'children_6_18',
                           'husband_income']].values)
Z_names = ['const', 'education', 'experience', 'experience_sq',
           'age', 'children_lt6', 'children_6_18', 'husband_income']

print('Outcome equation (X) variables:', X_names)
print(f'  Shape: {X.shape}')
print(f'\nSelection equation (Z) variables:', Z_names)
print(f'  Shape: {Z.shape}')
print(f'\nExclusion restrictions: age, children_lt6, children_6_18, husband_income')
print(f'  (Variables in Z but NOT in X)')
print(f'\nSelected observations: {selection.sum()} / {len(selection)} ({selection.mean():.1%})')

In [None]:
# Estimate Heckman model WITH exclusion restrictions (properly identified)
model_excl = PanelHeckman(
    endog=y,
    exog=X,
    selection=selection,
    exog_selection=Z,
    method='two_step'
)
result_excl = model_excl.fit()

print('=' * 70)
print('   HECKMAN MODEL WITH EXCLUSION RESTRICTIONS (Mroz 1987)')
print('=' * 70)
print(result_excl.summary())

print('\n' + '=' * 70)
print('   OUTCOME EQUATION COEFFICIENTS')
print('=' * 70)
for name, coef in zip(X_names, result_excl.outcome_params):
    print(f'  {name:20s}: {coef:10.4f}')

print('\n' + '=' * 70)
print('   SELECTION EQUATION COEFFICIENTS (Probit)')
print('=' * 70)
for name, coef in zip(Z_names, result_excl.probit_params):
    print(f'  {name:20s}: {coef:10.4f}')

print('\n' + '=' * 70)
print('   SELECTION PARAMETERS')
print('=' * 70)
print(f'  sigma:               {result_excl.sigma:10.4f}')
print(f'  rho:                 {result_excl.rho:10.4f}')
print(f'  lambda (rho*sigma):  {result_excl.rho * result_excl.sigma:10.4f}')

In [None]:
# Inspect the selection-equation coefficients on the excluded variables
# (signs and magnitudes only; this cell does not compute significance tests)

print('=== Exclusion Restriction Relevance Check ===')
print('\nDo the excluded variables have the expected signs in the selection equation?')
print()

# Look up the probit coefficients of the excluded variables
# (taken from the selection-equation estimates)
exclusion_vars = ['age', 'children_lt6', 'children_6_18', 'husband_income']
exclusion_indices = [Z_names.index(v) for v in exclusion_vars]

print(f'{"Variable":20s} {"Coefficient":>12s} {"Interpretation"}')
print('-' * 70)
for var, idx in zip(exclusion_vars, exclusion_indices):
    coef = result_excl.probit_params[idx]
    sign = '+' if coef > 0 else '-'
    if var == 'children_lt6':
        interp = f'({sign}) Young children reduce participation'
    elif var == 'children_6_18':
        interp = f'({sign}) Older children reduce participation'
    elif var == 'husband_income':
        interp = f'({sign}) Higher household income reduces need to work'
    elif var == 'age':
        interp = f'({sign}) Age affects labor supply decision'
    else:
        interp = ''
    print(f'{var:20s} {coef:12.4f}   {interp}')

print('\nCompare each estimated sign with the economic interpretation above.')
print('Relevance also requires statistical significance, and validity (no direct')
print('effect on the wage rate) cannot be verified from these estimates alone.')
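
Coefficient signs do not by themselves establish relevance. One common complement is a likelihood-ratio test of the joint significance of the excluded variables in the probit. The sketch below runs such a test on synthetic data; the helper `probit_loglik`, the variables `w1` and `w2`, and all coefficient values are assumptions for illustration and are not taken from PanelBox or from the Mroz data.

```python
import numpy as np
from scipy import optimize, stats

def probit_loglik(s, Z):
    """Maximized probit log-likelihood (BFGS with the analytic score)."""
    q = 2 * s - 1  # +1 for selected, -1 for not selected

    def objective(g):
        qi = q * (Z @ g)
        score = q * np.exp(stats.norm.logpdf(qi) - stats.norm.logcdf(qi))
        return -stats.norm.logcdf(qi).sum(), -(Z.T @ score)

    res = optimize.minimize(objective, np.zeros(Z.shape[1]), jac=True, method="BFGS")
    return -res.fun

rng = np.random.default_rng(3)
n = 3_000
x = rng.standard_normal(n)
w1 = rng.standard_normal(n)  # hypothetical excluded variable 1
w2 = rng.standard_normal(n)  # hypothetical excluded variable 2
u = rng.standard_normal(n)
s = ((0.2 + 0.5 * x - 0.8 * w1 - 0.4 * w2 + u) > 0).astype(float)

X = np.column_stack([np.ones(n), x])  # restricted: excluded variables dropped
Z = np.column_stack([X, w1, w2])      # unrestricted: Z = [X, W]

ll_restricted = probit_loglik(s, X)
ll_unrestricted = probit_loglik(s, Z)

# LR statistic is chi-squared with df = number of excluded variables under H0
lr_stat = 2.0 * (ll_unrestricted - ll_restricted)
df = Z.shape[1] - X.shape[1]
p_value = stats.chi2.sf(lr_stat, df)

print(f"LR statistic: {lr_stat:.2f} (df = {df}), p-value: {p_value:.4g}")
```

With the notebook's arrays, the analogous comparison would be between `probit_loglik(selection, X)` and `probit_loglik(selection, Z)`, since `X` is nested in `Z`. Unscaled regressors such as income may need rescaling for the optimizer to converge reliably.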

<a id='section5'></a>
## 5. Example 2: College Wages with Exclusion Restrictions

### 5.1 The Economic Setting

Individuals choose whether to attend college. We observe wages only for college graduates. The central question is: how do ability, family background, and other characteristics affect wages, after correcting for the fact that those who attend college are a self-selected group?

### 5.2 Choosing Exclusion Restrictions

For the college wage application:

1. **Distance to nearest college (distance_college)**:
   - Greater distance increases costs of attending (travel, relocation)
   - Distance to college should not directly affect a worker's productivity or wage
   - Classic instrument from Card (1995)

2. **Local tuition (tuition)**:
   - Higher tuition increases the cost of attendance
   - Tuition paid years ago should not directly affect current wages

### 5.3 Equation Specification

**Outcome equation** (wage determination):
$$\log(wage_i) = \beta_0 + \beta_1 \cdot ability_i + \beta_2 \cdot parent\_educ_i + \beta_3 \cdot family\_income_i + \beta_4 \cdot urban_i + \beta_5 \cdot female_i + \varepsilon_i$$

**Selection equation** (college attendance):
$$s_i^* = \gamma_0 + \gamma_1 \cdot ability_i + \cdots + \underbrace{\gamma_6 \cdot distance\_college_i + \gamma_7 \cdot tuition_i}_{\text{Exclusion restrictions}} + u_i$$

In [None]:
# Prepare data for College Wage example
# Mask non-positive wages first so the log yields NaN rather than -inf
college['log_wage'] = np.log(college['wage'].where(college['wage'] > 0))

selection_c = college['college'].values
y_c = college['log_wage'].fillna(0).values

# Outcome equation variables (X)
X_c = sm.add_constant(college[['ability', 'parent_education', 'family_income',
                                'urban', 'female']].values)
X_c_names = ['const', 'ability', 'parent_education', 'family_income', 'urban', 'female']

# Selection equation variables (Z) = X + exclusion restrictions
Z_c = sm.add_constant(college[['ability', 'parent_education', 'family_income',
                                'urban', 'female',
                                'distance_college', 'tuition']].values)
Z_c_names = ['const', 'ability', 'parent_education', 'family_income', 'urban', 'female',
             'distance_college', 'tuition']

print('Outcome equation (X) variables:', X_c_names)
print(f'Selection equation (Z) variables:', Z_c_names)
print(f'Exclusion restrictions: distance_college, tuition')
print(f'\nCollege attendance: {selection_c.sum()} / {len(selection_c)} ({selection_c.mean():.1%})')

In [None]:
# Estimate Heckman model WITH exclusion restrictions
model_college_excl = PanelHeckman(
    endog=y_c,
    exog=X_c,
    selection=selection_c,
    exog_selection=Z_c,
    method='two_step'
)
result_college_excl = model_college_excl.fit()

print('=' * 70)
print('   HECKMAN MODEL WITH EXCLUSION RESTRICTIONS (College Wage)')
print('=' * 70)
print(result_college_excl.summary())

print('\n' + '=' * 70)
print('   OUTCOME EQUATION COEFFICIENTS')
print('=' * 70)
for name, coef in zip(X_c_names, result_college_excl.outcome_params):
    print(f'  {name:20s}: {coef:10.4f}')

print('\n' + '=' * 70)
print('   SELECTION EQUATION COEFFICIENTS (Probit)')
print('=' * 70)
for name, coef in zip(Z_c_names, result_college_excl.probit_params):
    print(f'  {name:20s}: {coef:10.4f}')

print('\n' + '=' * 70)
print('   SELECTION PARAMETERS')
print('=' * 70)
print(f'  sigma:               {result_college_excl.sigma:10.4f}')
print(f'  rho:                 {result_college_excl.rho:10.4f}')
print(f'  lambda (rho*sigma):  {result_college_excl.rho * result_college_excl.sigma:10.4f}')

In [None]:
# Check relevance of exclusion restrictions in the college equation
print('=== Exclusion Restriction Relevance: College Application ===')
print()

excl_vars_c = ['distance_college', 'tuition']
excl_indices_c = [Z_c_names.index(v) for v in excl_vars_c]

print(f'{"Variable":20s} {"Coefficient":>12s} {"Interpretation"}')
print('-' * 70)
for var, idx in zip(excl_vars_c, excl_indices_c):
    coef = result_college_excl.probit_params[idx]
    sign = '+' if coef > 0 else '-'
    if var == 'distance_college':
        interp = f'({sign}) Greater distance reduces college attendance'
    elif var == 'tuition':
        interp = f'({sign}) Higher tuition reduces college attendance'
    print(f'{var:20s} {coef:12.4f}   {interp}')

print('\nExpected signs: both coefficients should be negative.')
print('  - Higher costs (distance, tuition) should reduce college attendance')
print('  - But these costs should not affect post-college wages directly')

<a id='section6'></a>
## 6. What Happens Without Exclusion Restrictions?

Now we demonstrate what goes wrong when the model is identified only through the functional form (normality) assumption. We compare three specifications:

- **(a) Properly identified**: with valid exclusion restrictions
- **(b) No exclusion restrictions**: $Z = X$ (same variables in both equations)
- **(c) Weak exclusion restrictions**: poorly chosen instruments

### 6.1 Mroz Data: Three Specifications

In [None]:
# Specification (a): With proper exclusion restrictions (already estimated)
# result_excl is our baseline

# Specification (b): No exclusion restrictions (Z = X)
# Selection equation uses the SAME variables as outcome equation
model_no_excl = PanelHeckman(
    endog=y,
    exog=X,
    selection=selection,
    exog_selection=X,  # Z = X: NO exclusion restrictions!
    method='two_step'
)
result_no_excl = model_no_excl.fit()

# Specification (c): Weak exclusion restriction
# Use 'age' alone as exclusion (questionable: age correlates with experience
# and may affect wages directly, so its validity is doubtful)
Z_weak = sm.add_constant(mroz[['education', 'experience', 'experience_sq', 'age']].values)
Z_weak_names = ['const', 'education', 'experience', 'experience_sq', 'age']

model_weak = PanelHeckman(
    endog=y,
    exog=X,
    selection=selection,
    exog_selection=Z_weak,
    method='two_step'
)
result_weak = model_weak.fit()

print('All three specifications estimated successfully.')

In [None]:
# Compare outcome equation coefficients across the three specifications
comparison_mroz = pd.DataFrame({
    'Variable': X_names,
    '(a) With Exclusion': result_excl.outcome_params,
    '(b) No Exclusion (Z=X)': result_no_excl.outcome_params,
    '(c) Weak Exclusion': result_weak.outcome_params,
}).set_index('Variable')

# Add selection parameters
selection_params = pd.DataFrame({
    'Variable': ['sigma', 'rho', 'lambda (rho*sigma)'],
    '(a) With Exclusion': [result_excl.sigma, result_excl.rho,
                           result_excl.rho * result_excl.sigma],
    '(b) No Exclusion (Z=X)': [result_no_excl.sigma, result_no_excl.rho,
                                result_no_excl.rho * result_no_excl.sigma],
    '(c) Weak Exclusion': [result_weak.sigma, result_weak.rho,
                            result_weak.rho * result_weak.sigma],
}).set_index('Variable')

full_comparison = pd.concat([comparison_mroz, selection_params])

print('=' * 75)
print('  COMPARISON: MROZ DATA -- EFFECT OF EXCLUSION RESTRICTIONS')
print('=' * 75)
print()
print(full_comparison.round(4).to_string())

print('\n' + '=' * 75)
print('  INTERPRETATION')
print('=' * 75)
print()
print('(a) With proper exclusion restrictions:')
print('    - Coefficients are economically meaningful and stable')
print(f'    - rho = {result_excl.rho:.4f}: a value away from zero suggests selection on unobservables')
print()
print('(b) Without exclusion restrictions (Z = X):')
print('    - Model relies ONLY on functional form for identification')
print(f'    - rho = {result_no_excl.rho:.4f}: may be unreliable')
print('    - Coefficients can differ substantially from (a)')
print()
print('(c) Weak exclusion restriction (age only):')
print('    - Age is correlated with experience and may affect wages')
print(f'    - rho = {result_weak.rho:.4f}')
print('    - Validity of exclusion is questionable')

In [None]:
# Visualize coefficient instability across specifications
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Outcome equation coefficients (excluding constant for scale)
vars_to_plot = ['education', 'experience', 'experience_sq']
idx_to_plot = [X_names.index(v) for v in vars_to_plot]

x_pos = np.arange(len(vars_to_plot))
width = 0.25

bars_a = axes[0].bar(x_pos - width, [result_excl.outcome_params[i] for i in idx_to_plot],
                      width, label='(a) With Exclusion', color='#27ae60', alpha=0.8)
bars_b = axes[0].bar(x_pos, [result_no_excl.outcome_params[i] for i in idx_to_plot],
                      width, label='(b) No Exclusion', color='#e74c3c', alpha=0.8)
bars_c = axes[0].bar(x_pos + width, [result_weak.outcome_params[i] for i in idx_to_plot],
                      width, label='(c) Weak Exclusion', color='#f39c12', alpha=0.8)

axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(vars_to_plot, fontsize=11)
axes[0].set_ylabel('Coefficient Value', fontsize=12)
axes[0].set_title('Outcome Equation: Coefficient Comparison\n(Mroz Data)', fontsize=13)
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axhline(y=0, color='black', linewidth=0.8)

# Right: Selection parameters
sel_labels = ['sigma', 'rho', 'lambda']
sel_a = [result_excl.sigma, result_excl.rho, result_excl.rho * result_excl.sigma]
sel_b = [result_no_excl.sigma, result_no_excl.rho, result_no_excl.rho * result_no_excl.sigma]
sel_c = [result_weak.sigma, result_weak.rho, result_weak.rho * result_weak.sigma]

x_pos2 = np.arange(len(sel_labels))

axes[1].bar(x_pos2 - width, sel_a, width, label='(a) With Exclusion', color='#27ae60', alpha=0.8)
axes[1].bar(x_pos2, sel_b, width, label='(b) No Exclusion', color='#e74c3c', alpha=0.8)
axes[1].bar(x_pos2 + width, sel_c, width, label='(c) Weak Exclusion', color='#f39c12', alpha=0.8)

axes[1].set_xticks(x_pos2)
axes[1].set_xticklabels([r'$\sigma$', r'$\rho$', r'$\lambda = \rho\sigma$'], fontsize=12)
axes[1].set_ylabel('Parameter Value', fontsize=12)
axes[1].set_title('Selection Parameters: Sensitivity\nto Exclusion Restrictions', fontsize=13)
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].axhline(y=0, color='black', linewidth=0.8)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'exclusion_comparison_mroz.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Comparison of outcome equation coefficients (left) and selection parameters (right) across three specifications of the Mroz model. The properly identified model (green) provides the benchmark. When exclusion restrictions are removed (red), or when weak instruments are used (orange), both the outcome coefficients and the estimated selection correction can shift substantially, demonstrating the sensitivity of the Heckman estimator to identification strategy.*

### 6.2 College Data: Three Specifications

In [None]:
# Specification (a): With proper exclusion restrictions (already estimated)
# result_college_excl is our baseline

# Specification (b): No exclusion restrictions (Z = X)
model_college_no_excl = PanelHeckman(
    endog=y_c,
    exog=X_c,
    selection=selection_c,
    exog_selection=X_c,  # Z = X
    method='two_step'
)
result_college_no_excl = model_college_no_excl.fit()

# Specification (c): Single exclusion (distance_college only, without tuition)
Z_c_weak = sm.add_constant(college[['ability', 'parent_education', 'family_income',
                                     'urban', 'female', 'distance_college']].values)
Z_c_weak_names = ['const', 'ability', 'parent_education', 'family_income',
                   'urban', 'female', 'distance_college']

model_college_weak = PanelHeckman(
    endog=y_c,
    exog=X_c,
    selection=selection_c,
    exog_selection=Z_c_weak,
    method='two_step'
)
result_college_weak = model_college_weak.fit()

# Build comparison table
comparison_college = pd.DataFrame({
    'Variable': X_c_names,
    '(a) With Exclusion': result_college_excl.outcome_params,
    '(b) No Exclusion (Z=X)': result_college_no_excl.outcome_params,
    '(c) Single Exclusion': result_college_weak.outcome_params,
}).set_index('Variable')

sel_params_c = pd.DataFrame({
    'Variable': ['sigma', 'rho', 'lambda (rho*sigma)'],
    '(a) With Exclusion': [result_college_excl.sigma, result_college_excl.rho,
                            result_college_excl.rho * result_college_excl.sigma],
    '(b) No Exclusion (Z=X)': [result_college_no_excl.sigma, result_college_no_excl.rho,
                                result_college_no_excl.rho * result_college_no_excl.sigma],
    '(c) Single Exclusion': [result_college_weak.sigma, result_college_weak.rho,
                              result_college_weak.rho * result_college_weak.sigma],
}).set_index('Variable')

full_comparison_c = pd.concat([comparison_college, sel_params_c])

print('=' * 75)
print('  COMPARISON: COLLEGE DATA -- EFFECT OF EXCLUSION RESTRICTIONS')
print('=' * 75)
print()
print(full_comparison_c.round(4).to_string())

print('\n' + '-' * 75)
print('Note: Specification (c) uses only distance_college as exclusion restriction.')
print('This may still provide reasonable identification if distance is a strong predictor.')

In [None]:
# Visualize college comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot outcome equation coefficients (exclude constant for readability)
vars_plot_c = ['ability', 'parent_education', 'family_income', 'urban', 'female']
idx_plot_c = [X_c_names.index(v) for v in vars_plot_c]

x_pos = np.arange(len(vars_plot_c))
width = 0.25

axes[0].bar(x_pos - width, [result_college_excl.outcome_params[i] for i in idx_plot_c],
            width, label='(a) With Exclusion', color='#27ae60', alpha=0.8)
axes[0].bar(x_pos, [result_college_no_excl.outcome_params[i] for i in idx_plot_c],
            width, label='(b) No Exclusion', color='#e74c3c', alpha=0.8)
axes[0].bar(x_pos + width, [result_college_weak.outcome_params[i] for i in idx_plot_c],
            width, label='(c) Single Exclusion', color='#f39c12', alpha=0.8)

axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(vars_plot_c, fontsize=9, rotation=15)
axes[0].set_ylabel('Coefficient Value', fontsize=12)
axes[0].set_title('Outcome Equation: Coefficient Comparison\n(College Data)', fontsize=13)
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axhline(y=0, color='black', linewidth=0.8)

# Selection parameters
sel_labels = ['sigma', 'rho', 'lambda']
sel_a = [result_college_excl.sigma, result_college_excl.rho,
         result_college_excl.rho * result_college_excl.sigma]
sel_b = [result_college_no_excl.sigma, result_college_no_excl.rho,
         result_college_no_excl.rho * result_college_no_excl.sigma]
sel_c = [result_college_weak.sigma, result_college_weak.rho,
         result_college_weak.rho * result_college_weak.sigma]

x_pos2 = np.arange(len(sel_labels))
axes[1].bar(x_pos2 - width, sel_a, width, label='(a) With Exclusion', color='#27ae60', alpha=0.8)
axes[1].bar(x_pos2, sel_b, width, label='(b) No Exclusion', color='#e74c3c', alpha=0.8)
axes[1].bar(x_pos2 + width, sel_c, width, label='(c) Single Exclusion', color='#f39c12', alpha=0.8)

axes[1].set_xticks(x_pos2)
axes[1].set_xticklabels([r'$\sigma$', r'$\rho$', r'$\lambda = \rho\sigma$'], fontsize=12)
axes[1].set_ylabel('Parameter Value', fontsize=12)
axes[1].set_title('Selection Parameters: Sensitivity\nto Exclusion Restrictions', fontsize=13)
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].axhline(y=0, color='black', linewidth=0.8)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'exclusion_comparison_college.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Coefficient comparison across three specifications of the college wage model. With proper exclusion restrictions (green), estimates are stable and economically meaningful. Removing exclusion restrictions (red) leads to identification purely through functional form, potentially distorting the estimated returns to ability, parental education, and other factors. The single-exclusion specification (orange) provides an intermediate case.*

### 6.3 Diagnosing the Problem: Collinearity Analysis

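When exclusion restrictions are absent ($Z = X$), the inverse Mills ratio $\lambda(X'\hat{\gamma})$ is a smooth, nearly linear function of the index $X'\hat{\gamma}$ over the typical range of the data, so it is close to a linear combination of the columns of $X$. This creates severe multicollinearity in the augmented outcome equation. We can diagnose this directly.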
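
The near-linearity of the inverse Mills ratio can be checked numerically without any data. The sketch below evaluates the IMR on an illustrative grid of index values (the range $[-1, 2]$ is an assumption chosen for illustration, not a range estimated from the Mroz or College data) and measures how well a straight line fits it. The helper name `inverse_mills` is hypothetical.

```python
import numpy as np
from scipy import stats

def inverse_mills(index):
    """Inverse Mills ratio phi(c) / Phi(c), computed on the log scale for stability."""
    return np.exp(stats.norm.logpdf(index) - stats.norm.logcdf(index))

# Illustrative grid of selection-index values (an assumed range, not estimated)
index_grid = np.linspace(-1.0, 2.0, 301)
imr_grid = inverse_mills(index_grid)

# Best straight-line approximation of the IMR over this range
slope, intercept = np.polyfit(index_grid, imr_grid, deg=1)
fitted = intercept + slope * index_grid
ss_res = np.sum((imr_grid - fitted) ** 2)
ss_tot = np.sum((imr_grid - imr_grid.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f'Linear fit: IMR ~ {intercept:.3f} + ({slope:.3f}) * index')
print(f'R-squared of the linear approximation: {r_squared:.3f}')
```

An R-squared close to one means that, when $Z = X$, the correction term adds little variation beyond what a linear combination of $X$ already provides.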

In [None]:
# Demonstrate collinearity: Correlate IMR with X variables

# Model WITH exclusion restrictions
Zg_excl = Z @ result_excl.probit_params
Phi_excl = stats.norm.cdf(Zg_excl)
imr_excl = stats.norm.pdf(Zg_excl) / np.clip(Phi_excl, 1e-10, None)

# Model WITHOUT exclusion restrictions
Xg_no_excl = X @ result_no_excl.probit_params
Phi_no_excl = stats.norm.cdf(Xg_no_excl)
imr_no_excl = stats.norm.pdf(Xg_no_excl) / np.clip(Phi_no_excl, 1e-10, None)

# Compute correlations between IMR and outcome equation variables
# (for selected observations only)
sel_mask = selection == 1

print('=== Correlation Between IMR and Outcome Variables (Selected Sample) ===')
print()
print(f'{"Variable":20s} {"With Exclusion":>15s} {"Without Exclusion":>18s}')
print('-' * 58)

for i, name in enumerate(X_names):
    if name == 'const':
        continue
    corr_excl = np.corrcoef(X[sel_mask, i], imr_excl[sel_mask])[0, 1]
    corr_no_excl = np.corrcoef(X[sel_mask, i], imr_no_excl[sel_mask])[0, 1]
    flag = ' *** HIGH' if abs(corr_no_excl) > 0.7 else ''
    print(f'{name:20s} {corr_excl:15.4f} {corr_no_excl:18.4f}{flag}')

# Also check correlation of IMR with linear predictor Xb
Xb = X[sel_mask] @ result_excl.outcome_params
corr_xb_excl = np.corrcoef(Xb, imr_excl[sel_mask])[0, 1]
corr_xb_no = np.corrcoef(Xb, imr_no_excl[sel_mask])[0, 1]

print('-' * 58)
print(f'{"X*beta (linear pred)":20s} {corr_xb_excl:15.4f} {corr_xb_no:18.4f}')

print('\n*** HIGH marks correlations above 0.7 in absolute value')
print('\nConclusion: Without exclusion restrictions, the IMR is highly')
print('correlated with X variables, creating the multicollinearity')
print('that makes estimation unreliable.')

In [None]:
# Scatter plots: IMR vs linear predictor (with vs without exclusion)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# With exclusion restrictions
axes[0].scatter(Xb, imr_excl[sel_mask], alpha=0.4, s=15, color='#27ae60')
axes[0].set_xlabel(r"$X'\hat{\beta}$ (outcome linear predictor)", fontsize=12)
axes[0].set_ylabel(r'$\lambda(Z\'\hat{\gamma})$ (IMR)', fontsize=12)
axes[0].set_title(f'WITH Exclusion Restrictions\n'
                   f'Corr(X\'b, IMR) = {corr_xb_excl:.3f}', fontsize=13)
axes[0].grid(True, alpha=0.3)

# Without exclusion restrictions
axes[1].scatter(Xb, imr_no_excl[sel_mask], alpha=0.4, s=15, color='#e74c3c')
axes[1].set_xlabel(r"$X'\hat{\beta}$ (outcome linear predictor)", fontsize=12)
axes[1].set_ylabel(r"$\lambda(X'\hat{\gamma})$ (IMR)", fontsize=12)
axes[1].set_title(f'WITHOUT Exclusion Restrictions\n'
                   f'Corr(X\'b, IMR) = {corr_xb_no:.3f}', fontsize=13)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'imr_collinearity.png', dpi=150, bbox_inches='tight')
plt.show()

print('With exclusion restrictions (left):')
print('  - The IMR has variation independent of X\'b')
print('  - The scatter shows dispersion, NOT a tight line')
print()
print('Without exclusion restrictions (right):')
print('  - The IMR is nearly perfectly correlated with X\'b')
print('  - Points form a tight curve (near-linear relationship)')
print('  - This collinearity makes it impossible to separately estimate')
print('    the effect of X and the selection correction')

*Figure: Scatter plots of the inverse Mills ratio against the outcome linear predictor $X'\hat{\beta}$ for selected observations. Left: With exclusion restrictions, the IMR has substantial variation independent of $X'\hat{\beta}$, providing genuine identifying power. Right: Without exclusion restrictions, the IMR collapses onto a near-perfect function of $X'\hat{\beta}$, making it impossible to disentangle the selection correction from the outcome equation covariates.*
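
One way to put a number on this collinearity is the variance inflation factor (VIF) of the IMR given the outcome covariates. The sketch below uses a purely hypothetical data-generating process (one covariate `x`, one excluded variable `w`) and, as a simplifying assumption, the true selection index rather than a fitted probit. The helper `imr_vif` is illustrative and not part of PanelBox.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000

# Hypothetical DGP: x enters both equations, w is an excluded variable
x = rng.normal(size=n)
w = rng.normal(size=n)
u = rng.normal(size=n)  # selection-equation error

def imr_vif(index, x, u):
    """VIF of the IMR given [1, x], on the selected sample (index + u > 0)."""
    selected = (index + u) > 0
    imr = np.exp(stats.norm.logpdf(index) - stats.norm.logcdf(index))[selected]
    design = np.column_stack([np.ones(selected.sum()), x[selected]])
    coef, *_ = np.linalg.lstsq(design, imr, rcond=None)
    resid = imr - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((imr - imr.mean()) ** 2)
    return 1.0 / (1.0 - r2)

vif_no_excl = imr_vif(0.5 + 0.5 * x, x, u)           # index depends on x only
vif_excl = imr_vif(0.5 + 0.5 * x + 1.0 * w, x, u)    # index also depends on w

print(f'VIF of IMR without exclusion restriction: {vif_no_excl:.1f}')
print(f'VIF of IMR with exclusion restriction:    {vif_excl:.1f}')
```

A large VIF without the excluded variable indicates that the standard error of the IMR coefficient is inflated by its overlap with $X$.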

<a id='section7'></a>
## 7. Testing Exclusion Restrictions

A credible exclusion restriction must satisfy two conditions:

1. **Relevance**: The instrument must meaningfully predict selection
2. **Validity (Excludability)**: The instrument must not directly affect the outcome

### 7.1 Testing Relevance

Relevance is testable. Under $H_0$, the coefficients on the excluded variables in the probit selection equation are jointly zero; a likelihood ratio test compares the probit fitted with and without those variables.

In [None]:
# Test relevance: Compare restricted vs unrestricted probit models
# Restricted: probit without exclusion variables
# Unrestricted: probit with exclusion variables

def probit_log_likelihood(gamma, Z, selection):
    """Compute probit log-likelihood."""
    linear_pred = Z @ gamma
    prob = stats.norm.cdf(linear_pred)
    prob = np.clip(prob, 1e-10, 1 - 1e-10)
    return np.sum(selection * np.log(prob) + (1 - selection) * np.log(1 - prob))

# --- Mroz dataset ---
print('=== Relevance Test: Mroz Data ===')
print()

# Unrestricted: full selection equation
ll_unrestricted = probit_log_likelihood(result_excl.probit_params, Z, selection)

# Restricted: selection equation WITHOUT exclusion variables (same as outcome vars)
from scipy.optimize import minimize as sp_minimize

def neg_probit_llf(gamma, Z, sel):
    return -probit_log_likelihood(gamma, Z, sel)

# Fit restricted probit (Z = X only)
res_restricted = sp_minimize(neg_probit_llf, np.zeros(X.shape[1]),
                              args=(X, selection), method='BFGS')
ll_restricted = -res_restricted.fun

# Likelihood ratio test
n_restrictions = Z.shape[1] - X.shape[1]  # Number of excluded variables
lr_stat = 2 * (ll_unrestricted - ll_restricted)
lr_pvalue = stats.chi2.sf(lr_stat, df=n_restrictions)  # survival function avoids precision loss in 1 - cdf

print(f'Log-likelihood (unrestricted, with exclusions): {ll_unrestricted:.2f}')
print(f'Log-likelihood (restricted, without exclusions): {ll_restricted:.2f}')
print(f'Number of exclusion restrictions tested: {n_restrictions}')
print(f'\nLikelihood Ratio statistic: {lr_stat:.4f}')
print(f'Chi-squared degrees of freedom: {n_restrictions}')
print(f'p-value: {lr_pvalue:.6f}')
print()

if lr_pvalue < 0.05:
    print('RESULT: Reject H0 at 5% level.')
    print('The exclusion restrictions are JOINTLY SIGNIFICANT in the selection equation.')
    print('This confirms the RELEVANCE condition is satisfied.')
else:
    print('RESULT: Fail to reject H0 at 5% level.')
    print('WARNING: The exclusion restrictions are NOT jointly significant.')
    print('The instruments may be too weak for reliable identification.')

In [None]:
# --- College dataset ---
print('=== Relevance Test: College Data ===')
print()

ll_unrestricted_c = probit_log_likelihood(result_college_excl.probit_params, Z_c, selection_c)

res_restricted_c = sp_minimize(neg_probit_llf, np.zeros(X_c.shape[1]),
                                args=(X_c, selection_c), method='BFGS')
ll_restricted_c = -res_restricted_c.fun

n_restrictions_c = Z_c.shape[1] - X_c.shape[1]
lr_stat_c = 2 * (ll_unrestricted_c - ll_restricted_c)
lr_pvalue_c = stats.chi2.sf(lr_stat_c, df=n_restrictions_c)  # survival function avoids precision loss in 1 - cdf

print(f'Log-likelihood (unrestricted, with exclusions): {ll_unrestricted_c:.2f}')
print(f'Log-likelihood (restricted, without exclusions): {ll_restricted_c:.2f}')
print(f'Number of exclusion restrictions tested: {n_restrictions_c}')
print(f'\nLikelihood Ratio statistic: {lr_stat_c:.4f}')
print(f'Chi-squared degrees of freedom: {n_restrictions_c}')
print(f'p-value: {lr_pvalue_c:.6f}')
print()

if lr_pvalue_c < 0.05:
    print('RESULT: Reject H0 at 5% level.')
    print('The exclusion restrictions are JOINTLY SIGNIFICANT in the selection equation.')
    print('Relevance condition is satisfied.')
else:
    print('RESULT: Fail to reject H0 at 5% level.')
    print('WARNING: Weak instruments detected.')

### 7.2 Testing Validity (The Hard Part)

The **validity** of exclusion restrictions -- that the instrument does not directly affect the outcome -- is fundamentally **untestable** with the data alone. This is the same problem as with instrumental variables: the exclusion restriction is an identifying assumption.

However, we can provide supporting evidence:

#### A. Economic Reasoning (Most Important)

The strongest case for validity comes from **economic theory**:

| Instrument | Argument for Validity |
|---|---|
| Children (Mroz) | Number of children affects time allocation, not hourly wage rate |
| Husband's income | Other household income affects reservation wage, not market wage |
| Distance to college | Geographic proximity affects costs, not worker productivity |
| Tuition | Historical tuition costs do not affect current employer's willingness to pay |

#### B. Informal Over-Identification Test

If we have **more exclusion restrictions than needed**, we can check whether the results are stable across different subsets of instruments. This is an informal robustness check, not a formal over-identification test.

In [None]:
# Informal over-identification check: Mroz data
# Estimate with each exclusion restriction individually, then with combined sets

print('=== Over-Identification Check: Mroz Data ===')
print('Estimate the model with individual exclusion restrictions and with combined sets.')
print('Consistent results are in line with, but do not prove, instrument validity.')
print()

individual_results = {}
exclusion_sets = {
    'children_lt6 only': ['education', 'experience', 'experience_sq', 'children_lt6'],
    'children_6_18 only': ['education', 'experience', 'experience_sq', 'children_6_18'],
    'husband_income only': ['education', 'experience', 'experience_sq', 'husband_income'],
    'All children vars': ['education', 'experience', 'experience_sq', 'children_lt6', 'children_6_18'],
    'Full (all excl.)': ['education', 'experience', 'experience_sq', 'age',
                          'children_lt6', 'children_6_18', 'husband_income'],
}

for name, z_cols in exclusion_sets.items():
    Z_test = sm.add_constant(mroz[z_cols].values)
    try:
        model_test = PanelHeckman(
            endog=y, exog=X, selection=selection,
            exog_selection=Z_test, method='two_step'
        )
        res_test = model_test.fit()
        individual_results[name] = {
            'education': res_test.outcome_params[1],
            'experience': res_test.outcome_params[2],
            'rho': res_test.rho,
            'sigma': res_test.sigma,
        }
    except Exception as e:
        individual_results[name] = {'error': str(e)}

overid_df = pd.DataFrame(individual_results).T
print(overid_df.round(4).to_string())

print('\nInterpretation:')
print('If the education and experience coefficients are similar across')
print('specifications, that is consistent with valid exclusion restrictions')
print('(it does not prove validity: all instruments could share the same flaw).')
if 'education' in overid_df.columns:
    edu_range = overid_df['education'].max() - overid_df['education'].min()
    print(f'\nRange of education coefficient: {edu_range:.4f}')
    # The thresholds below are illustrative cutoffs, not formal critical values
    if edu_range < 0.05:
        print('  -> Coefficients are stable, which is consistent with instrument validity.')
    elif edu_range < 0.15:
        print('  -> Moderate variation. Results are somewhat sensitive to instrument choice.')
    else:
        print('  -> Large variation. Instruments may not all be valid.')

In [None]:
# Visualize the over-identification check
if 'education' in overid_df.columns and 'rho' in overid_df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Education coefficient across specifications
    specs = overid_df.index.tolist()
    x_pos = np.arange(len(specs))

    colors = ['#3498db', '#e74c3c', '#f39c12', '#9b59b6', '#27ae60']

    axes[0].bar(x_pos, overid_df['education'].values, color=colors[:len(specs)], alpha=0.8,
                edgecolor='black', linewidth=0.5)
    axes[0].axhline(y=overid_df['education'].mean(), color='black', linestyle='--',
                     linewidth=1.5, label=f'Mean = {overid_df["education"].mean():.4f}')
    axes[0].set_xticks(x_pos)
    axes[0].set_xticklabels(specs, fontsize=8, rotation=20, ha='right')
    axes[0].set_ylabel('Education Coefficient', fontsize=12)
    axes[0].set_title('Stability of Education Coefficient\nAcross Instrument Sets', fontsize=13)
    axes[0].legend(fontsize=10)
    axes[0].grid(True, alpha=0.3, axis='y')

    # Rho across specifications
    axes[1].bar(x_pos, overid_df['rho'].values, color=colors[:len(specs)], alpha=0.8,
                edgecolor='black', linewidth=0.5)
    axes[1].axhline(y=overid_df['rho'].mean(), color='black', linestyle='--',
                     linewidth=1.5, label=f'Mean = {overid_df["rho"].mean():.4f}')
    axes[1].set_xticks(x_pos)
    axes[1].set_xticklabels(specs, fontsize=8, rotation=20, ha='right')
    axes[1].set_ylabel(r'$\rho$ (selection correlation)', fontsize=12)
    axes[1].set_title(r'Stability of $\rho$ Across Instrument Sets', fontsize=13)
    axes[1].legend(fontsize=10)
    axes[1].grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    plt.savefig(FIGURES_DIR / 'overidentification_check.png', dpi=150, bbox_inches='tight')
    plt.show()

*Figure: Stability of key estimates (education coefficient on the left, selection correlation rho on the right) across different subsets of exclusion restrictions. Consistent estimates across instrument sets provide informal evidence supporting the validity of the exclusion restrictions. If estimates varied wildly, it would raise concerns about instrument validity.*

### 7.3 Potential Problems with Common Exclusion Restrictions

Not all commonly used exclusion restrictions are above reproach:

| Instrument | Potential Concern |
|---|---|
| **Number of children** | Children might affect human capital accumulation (time out of labor force reduces skills), which in turn affects wages |
| **Husband's income** | Assortative mating: women married to high-income men may have different unobserved skills |
| **Distance to college** | Distance correlates with rurality, which may directly affect wages through labor market thickness |
| **Tuition** | Tuition varies by state/time, correlating with other state-level factors affecting wages |

**Key lesson**: There is no perfect exclusion restriction. The researcher must make a judgment call and defend it with economic reasoning.

<a id='section8'></a>
## 8. Sensitivity Analysis

A thorough applied analysis should examine how results change with different instrument specifications. This builds confidence (or reveals fragility) in the estimates.
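
When there are several candidate instruments, listing the specifications by hand becomes error-prone. As a minimal sketch, the hypothetical helper below enumerates every non-empty subset of candidate exclusion variables and returns the selection-equation column list for each. The column names are the Mroz variables used in this notebook; the function itself is an illustration, not a PanelBox utility.

```python
from itertools import combinations

def enumerate_exclusion_sets(base_vars, candidates):
    """Map a readable label to the selection-equation columns for each subset."""
    specs = {}
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            specs[' + '.join(subset)] = list(base_vars) + list(subset)
    return specs

base_vars = ['education', 'experience', 'experience_sq']
candidates = ['children_lt6', 'children_6_18', 'husband_income']

specs = enumerate_exclusion_sets(base_vars, candidates)
for label, cols in specs.items():
    print(f'{label:50s} -> {cols}')
```

Each column list can then be passed through `sm.add_constant(mroz[cols].values)` to build the selection matrix for one specification.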

### 8.1 Systematic Sensitivity Analysis: Mroz Data

In [None]:
# Comprehensive sensitivity analysis for Mroz data
# Try many different combinations of exclusion restrictions

import sys
sys.path.insert(0, str(BASE_DIR / 'utils'))
from comparison_tools import sensitivity_analysis

sensitivity_specs = {
    '1. No exclusion (Z=X)': X,
    '2. children_lt6': sm.add_constant(mroz[['education', 'experience', 'experience_sq',
                                              'children_lt6']].values),
    '3. children_6_18': sm.add_constant(mroz[['education', 'experience', 'experience_sq',
                                               'children_6_18']].values),
    '4. husband_income': sm.add_constant(mroz[['education', 'experience', 'experience_sq',
                                                'husband_income']].values),
    '5. Both children': sm.add_constant(mroz[['education', 'experience', 'experience_sq',
                                               'children_lt6', 'children_6_18']].values),
    '6. Children + income': sm.add_constant(mroz[['education', 'experience', 'experience_sq',
                                                    'children_lt6', 'children_6_18',
                                                    'husband_income']].values),
    '7. Full (all + age)': Z,
}

sensitivity_results = []
for name, Z_spec in sensitivity_specs.items():
    try:
        m = PanelHeckman(endog=y, exog=X, selection=selection,
                         exog_selection=Z_spec, method='two_step')
        r = m.fit()
        sensitivity_results.append({
            'Specification': name,
            'beta_education': r.outcome_params[1],
            'beta_experience': r.outcome_params[2],
            'beta_exper_sq': r.outcome_params[3],
            'sigma': r.sigma,
            'rho': r.rho,
            'lambda': r.rho * r.sigma,
        })
    except Exception as e:
        sensitivity_results.append({
            'Specification': name,
            'error': str(e)
        })

sens_df = pd.DataFrame(sensitivity_results).set_index('Specification')

print('=' * 85)
print('  SENSITIVITY ANALYSIS: MROZ DATA')
print('  How do estimates change across different exclusion restrictions?')
print('=' * 85)
print()
print(sens_df.round(4).to_string())

print('\n' + '-' * 85)
if 'beta_education' in sens_df.columns:
    edu_std = sens_df['beta_education'].std()
    rho_std = sens_df['rho'].std()
    print(f'Std. dev. of education coefficient: {edu_std:.4f}')
    print(f'Std. dev. of rho across specs:      {rho_std:.4f}')

In [None]:
# Visualize sensitivity analysis
if 'beta_education' in sens_df.columns:
    fig, axes = plt.subplots(2, 1, figsize=(12, 10))

    specs = sens_df.index.tolist()
    x_pos = np.arange(len(specs))

    # Top: Outcome coefficients across specifications
    width = 0.25
    axes[0].bar(x_pos - width, sens_df['beta_education'].values, width,
                label='Education', color='#3498db', alpha=0.8)
    axes[0].bar(x_pos, sens_df['beta_experience'].values, width,
                label='Experience', color='#27ae60', alpha=0.8)
    axes[0].bar(x_pos + width, sens_df['beta_exper_sq'].values * 100, width,
                label=r'Experience$^2$ ($\times 100$)', color='#e74c3c', alpha=0.8)
    axes[0].set_xticks(x_pos)
    axes[0].set_xticklabels(specs, fontsize=8, rotation=25, ha='right')
    axes[0].set_ylabel('Coefficient Value', fontsize=12)
    axes[0].set_title('Outcome Equation Coefficients Across Specifications', fontsize=13)
    axes[0].legend(fontsize=10)
    axes[0].grid(True, alpha=0.3, axis='y')
    axes[0].axhline(y=0, color='black', linewidth=0.8)

    # Bottom: rho and lambda across specifications
    ax2 = axes[1]
    color_rho = '#8e44ad'
    color_lambda = '#d35400'

    bars1 = ax2.bar(x_pos - 0.15, sens_df['rho'].values, 0.3,
                     label=r'$\rho$', color=color_rho, alpha=0.8)
    bars2 = ax2.bar(x_pos + 0.15, sens_df['lambda'].values, 0.3,
                     label=r'$\lambda = \rho\sigma$', color=color_lambda, alpha=0.8)

    ax2.set_xticks(x_pos)
    ax2.set_xticklabels(specs, fontsize=8, rotation=25, ha='right')
    ax2.set_ylabel('Parameter Value', fontsize=12)
    ax2.set_title('Selection Parameters Across Specifications', fontsize=13)
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3, axis='y')
    ax2.axhline(y=0, color='black', linewidth=0.8)

    plt.tight_layout()
    plt.savefig(FIGURES_DIR / 'sensitivity_analysis_mroz.png', dpi=150, bbox_inches='tight')
    plt.show()

*Figure: Comprehensive sensitivity analysis showing how outcome equation coefficients (top) and selection parameters (bottom) vary across seven instrument specifications for the Mroz data. Stable estimates across the specifications with at least one exclusion restriction (2-7) provide confidence in the results. Specification 1 (no exclusion) stands out as potentially unreliable, illustrating the importance of having at least one valid exclusion restriction.*

### 8.2 Sensitivity Analysis: College Data

In [None]:
# Sensitivity analysis for College data
sensitivity_specs_c = {
    '1. No exclusion (Z=X)': X_c,
    '2. distance only': sm.add_constant(college[['ability', 'parent_education', 'family_income',
                                                  'urban', 'female', 'distance_college']].values),
    '3. tuition only': sm.add_constant(college[['ability', 'parent_education', 'family_income',
                                                 'urban', 'female', 'tuition']].values),
    '4. Both (dist + tuit)': Z_c,
}

sensitivity_results_c = []
for name, Z_spec in sensitivity_specs_c.items():
    try:
        m = PanelHeckman(endog=y_c, exog=X_c, selection=selection_c,
                         exog_selection=Z_spec, method='two_step')
        r = m.fit()
        sensitivity_results_c.append({
            'Specification': name,
            'beta_ability': r.outcome_params[1],
            'beta_parent_ed': r.outcome_params[2],
            'beta_family_inc': r.outcome_params[3],
            'beta_urban': r.outcome_params[4],
            'beta_female': r.outcome_params[5],
            'sigma': r.sigma,
            'rho': r.rho,
            'lambda': r.rho * r.sigma,
        })
    except Exception as e:
        sensitivity_results_c.append({'Specification': name, 'error': str(e)})

sens_df_c = pd.DataFrame(sensitivity_results_c).set_index('Specification')

print('=' * 85)
print('  SENSITIVITY ANALYSIS: COLLEGE DATA')
print('=' * 85)
print()
print(sens_df_c.round(4).to_string())

print('\n' + '-' * 85)
if 'beta_ability' in sens_df_c.columns:
    ability_range = sens_df_c['beta_ability'].max() - sens_df_c['beta_ability'].min()
    print(f'Range of ability coefficient: {ability_range:.4f}')
    print(f'Range of rho: {sens_df_c["rho"].max() - sens_df_c["rho"].min():.4f}')

In [None]:
# Compare OLS (ignoring selection) vs Heckman (with correction)
# This shows the bias from ignoring selection

print('=' * 70)
print('  OLS vs HECKMAN: THE COST OF IGNORING SELECTION BIAS')
print('=' * 70)

# --- Mroz ---
sel_mask_m = selection == 1
beta_ols_m = np.linalg.lstsq(X[sel_mask_m], y[sel_mask_m], rcond=None)[0]

print('\n--- Mroz Data (Log Wage Equation) ---')
print(f'{"Variable":20s} {"OLS (biased)":>12s} {"Heckman":>12s} {"Difference":>12s} {"Bias %":>10s}')
print('-' * 70)
for i, name in enumerate(X_names):
    diff = beta_ols_m[i] - result_excl.outcome_params[i]
    pct = 100 * diff / (abs(result_excl.outcome_params[i]) + 1e-10)
    print(f'{name:20s} {beta_ols_m[i]:12.4f} {result_excl.outcome_params[i]:12.4f} '
          f'{diff:12.4f} {pct:9.1f}%')

# --- College ---
sel_mask_c = selection_c == 1
beta_ols_c = np.linalg.lstsq(X_c[sel_mask_c], y_c[sel_mask_c], rcond=None)[0]

print('\n--- College Data (Log Wage Equation) ---')
print(f'{"Variable":20s} {"OLS (biased)":>12s} {"Heckman":>12s} {"Difference":>12s} {"Bias %":>10s}')
print('-' * 70)
for i, name in enumerate(X_c_names):
    diff = beta_ols_c[i] - result_college_excl.outcome_params[i]
    pct = 100 * diff / (abs(result_college_excl.outcome_params[i]) + 1e-10)
    print(f'{name:20s} {beta_ols_c[i]:12.4f} {result_college_excl.outcome_params[i]:12.4f} '
          f'{diff:12.4f} {pct:9.1f}%')

print('\n' + '=' * 70)
print('OLS on the selected sample ignores selection bias.')
print('When rho is non-zero, OLS coefficients can be substantially biased.')
print('The Heckman model with proper exclusion restrictions corrects this.')

<a id='section9'></a>
## 9. Best Practices

### 9.1 Guidelines for Choosing Exclusion Restrictions

Based on the analysis above and the econometric literature, here are guidelines for selecting exclusion restrictions:

#### Rule 1: Start with Economic Theory

The exclusion restriction must be motivated by a clear economic argument:
- **Why does the variable affect selection?** There must be a plausible economic mechanism
- **Why does it NOT affect the outcome?** This must be defensible
- Think of the variable as shifting the **cost** of participation without affecting the **return**

#### Rule 2: Use Multiple Exclusion Restrictions

- Having more than one exclusion restriction allows:
  - Over-identification tests (checking consistency across instruments)
  - Stronger first-stage prediction of selection
  - More robust identification
- But additional instruments must also be valid!

#### Rule 3: Test Relevance Statistically

- Run the probit selection equation and test joint significance of excluded variables
- If the exclusion restrictions are weak predictors, the model is poorly identified
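- Rule of thumb (a heuristic borrowed from the weak-instrument IV literature, not a formal threshold for selection models): F-statistic > 10 for the excluded variables in a linear probability selection equation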

#### Rule 4: Report Sensitivity Analyses

- Always show results with and without exclusion restrictions
- Show results with different subsets of instruments
- If results are highly sensitive to the choice of instruments, be transparent about this

#### Rule 5: Be Honest About Limitations

- No exclusion restriction is perfect
- Acknowledge potential threats to validity
- Present the case for your chosen instruments clearly

### 9.2 Common Pitfalls

Each practice in the table below guards against a specific pitfall, stated in its `Why` entry.

In [None]:
# Summary table of best practices
practices = pd.DataFrame({
    'Practice': [
        'Use exclusion restrictions',
        'Justify with economic theory',
        'Test instrument relevance',
        'Use multiple instruments',
        'Run sensitivity analysis',
        'Compare OLS vs Heckman',
        'Report specifications without exclusion',
        'Check IMR collinearity',
    ],
    'Why': [
        'Identification from functional form alone is fragile',
        'Statistical tests cannot validate the exclusion restriction',
        'Weak instruments lead to unreliable estimates',
        'Allows over-identification checks and stronger first stage',
        'Reveals how robust conclusions are to instrument choice',
        'Quantifies the magnitude and direction of selection bias',
        'Transparency about identification strategy',
        'Diagnoses multicollinearity in the augmented equation',
    ],
    'How in PanelBox': [
        'Set exog_selection to include variables not in exog',
        'Think about DGP and institutional details',
        'LR test on probit selection equation',
        'Include 2+ variables in Z but not X',
        'Re-estimate with different subsets of exclusions',
        'Use result.compare_ols_heckman()',
        'Estimate with exog_selection=exog as a specification',
        'Correlate IMR with X variables',
    ],
})

print('=' * 90)
print('  BEST PRACTICES FOR EXCLUSION RESTRICTIONS IN HECKMAN MODELS')
print('=' * 90)
print()
for _, row in practices.iterrows():
    print(f'  {row["Practice"]}')
    print(f'    WHY: {row["Why"]}')
    print(f'    HOW: {row["How in PanelBox"]}')
    print()

### 9.3 Decision Flowchart

When deciding on your identification strategy:

```
1. Do you have a candidate exclusion restriction?
   |
   +-- YES --> Is there a clear economic argument for validity?
   |           |
   |           +-- YES --> Does it significantly predict selection? (LR test)
   |           |           |
   |           |           +-- YES --> Use it! Run sensitivity analysis.
   |           |           +-- NO  --> Instrument is too weak. Find a better one.
   |           |
   |           +-- NO  --> Do NOT use it. Seek alternatives.
   |
   +-- NO  --> Options:
               a) Search for valid instruments (institutional details, policy variation)
               b) Estimate without exclusion restrictions (acknowledge limitations)
               c) Use alternative methods (bounds, control function approaches)
```
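
The flowchart can also be written as a small function, which is convenient when screening several candidate instruments in a loop. This is a sketch only: the function name, its arguments, and the 5% default threshold are assumptions, and the economic-argument flag must still come from the researcher's own judgment.

```python
def recommend_strategy(has_candidate, has_economic_argument=False,
                       lr_pvalue=None, alpha=0.05):
    """Return the flowchart's recommendation for one candidate instrument."""
    if not has_candidate:
        return ('No candidate: search for valid instruments, estimate without '
                'exclusion restrictions (acknowledging limitations), or use '
                'alternative methods.')
    if not has_economic_argument:
        return 'Do not use it. Seek alternatives.'
    # Relevance: the LR test on the probit must reject at level alpha
    if lr_pvalue is not None and lr_pvalue < alpha:
        return 'Use it and run a sensitivity analysis.'
    return 'Instrument is too weak. Find a better one.'

print(recommend_strategy(True, has_economic_argument=True, lr_pvalue=0.001))
print(recommend_strategy(True, has_economic_argument=True, lr_pvalue=0.40))
print(recommend_strategy(True, has_economic_argument=False))
```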

<a id='section10'></a>
## 10. Summary and Key Takeaways

### What We Learned

1. **The identification problem**: Without exclusion restrictions, the Heckman model is identified only through the nonlinearity of the inverse Mills ratio, which follows from the normality assumption. Because $\lambda(X'\gamma)$ is nearly linear in $X'\gamma$ over typical data ranges, the correction term is nearly collinear with $X$, creating severe multicollinearity.

2. **Exclusion restrictions**: Variables that affect selection but not the outcome break the collinearity and provide genuine identifying variation. They play a role analogous to instruments in IV estimation.

3. **Practical demonstration**: We showed that removing exclusion restrictions from both the Mroz and College datasets leads to substantially different (and potentially unreliable) estimates of outcome equation parameters and selection correlation.

4. **Testing**: Relevance can be tested statistically (LR test on the probit). Validity, however, relies on economic reasoning -- the same untestable assumption as in IV estimation.

5. **Sensitivity analysis**: A credible applied analysis reports results across multiple instrument specifications. Stability of key estimates across specifications provides confidence in the identification strategy.

### Key Equations

| Concept | Formula |
|---|---|
| Conditional expectation | $E[y \mid s=1, X, Z] = X'\beta + \rho\sigma\lambda(Z'\gamma)$ |
| Without exclusion ($Z=X$) | $\lambda(X'\gamma) \approx a + bX'\gamma$ (collinearity!) |
| Relevance test | LR = $2(\ell_{\text{unrestricted}} - \ell_{\text{restricted}}) \sim \chi^2_q$ under $H_0$, where $q$ is the number of excluded variables |
| Selection correction | $\hat{\theta} = \hat{\rho}\hat{\sigma}$ (coefficient on IMR) |
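
The conditional expectation in the first row follows from a standard property of the bivariate normal distribution. As a sketch, write $u$ for the selection-equation error and $\varepsilon$ for the outcome-equation error (these symbols, and the normalization $\mathrm{Var}(u) = 1$, are notational assumptions made here), with $s = 1$ when $Z'\gamma + u > 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\mathrm{Corr}(u, \varepsilon) = \rho$. Then:

$$
\begin{aligned}
E[y \mid s = 1, X, Z] &= X'\beta + E[\varepsilon \mid u > -Z'\gamma] \\
&= X'\beta + \rho\sigma \, E[u \mid u > -Z'\gamma] \\
&= X'\beta + \rho\sigma \, \frac{\phi(Z'\gamma)}{\Phi(Z'\gamma)} \\
&= X'\beta + \rho\sigma \, \lambda(Z'\gamma).
\end{aligned}
$$

The second line uses $E[\varepsilon \mid u] = \rho\sigma u$ under joint normality. The third line uses the mean of a truncated standard normal, $E[u \mid u > -c] = \phi(c)/\Phi(c)$, which relies on the symmetry of $\phi$.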

### Practical Takeaway

> **Always use exclusion restrictions when possible.** The strongest applied work combines credible economic reasoning for the exclusion restriction with statistical evidence of relevance and sensitivity analyses showing robustness.

<a id='exercises'></a>
## 11. Exercises

Test your understanding with these exercises!

---

### Exercise 1: Evaluate Candidate Instruments (Conceptual)

For each proposed exclusion restriction below, evaluate whether it satisfies:
- **Relevance**: Does it plausibly affect selection?
- **Validity**: Can we argue it does NOT directly affect the outcome?

| Application | Selection | Outcome | Proposed Instrument |
|---|---|---|---|
| (a) Female labor supply | Work vs not work | Hourly wage | Husband's age |
| (b) College wage premium | Attend college | Post-college wage | SAT score |
| (c) Union wage gap | Union member | Log wage | State right-to-work law |
| (d) Training program | Participate in training | Quarterly earnings | Distance to training site |

For each, write 2-3 sentences explaining your assessment.

In [None]:
# Exercise 1: Write your assessments here

# (a) Husband's age as instrument for female labor supply:
# YOUR ANSWER:
# Relevance: ...
# Validity: ...
# Assessment: ...

# (b) SAT score as instrument for college wage premium:
# YOUR ANSWER:
# Relevance: ...
# Validity: ...
# Assessment: ...

# (c) State right-to-work law for union wage gap:
# YOUR ANSWER:
# Relevance: ...
# Validity: ...
# Assessment: ...

# (d) Distance to training site:
# YOUR ANSWER:
# Relevance: ...
# Validity: ...
# Assessment: ...

---

### Exercise 2: Implement and Compare Specifications (Hands-On)

Using the Mroz dataset, implement the following three models and compare:

1. **Model A**: Exclusion restrictions = `children_lt6` + `husband_income`
2. **Model B**: Exclusion restrictions = `children_6_18` + `husband_income`
3. **Model C**: Exclusion restrictions = `age` only

**Tasks**:
- Estimate all three models
- Create a comparison table of outcome coefficients and selection parameters
- Which specification produces the most stable results? Why?
- Run the LR relevance test for each set of exclusion restrictions

In [None]:
# Exercise 2: Your solution here

# Step 1: Define the three selection equation variable matrices
# Model A: X + children_lt6 + husband_income
# Z_A = sm.add_constant(mroz[['education', 'experience', 'experience_sq',
#                              'children_lt6', 'husband_income']].values)

# Model B: X + children_6_18 + husband_income
# Z_B = ...

# Model C: X + age
# Z_C = ...

# Step 2: Estimate each model
# TODO: Use PanelHeckman with method='two_step'

# Step 3: Compare results
# TODO: Create comparison DataFrame

# Step 4: LR relevance test for each specification
# TODO: Compute LR statistics

---

### Exercise 3: Collinearity Diagnostic (Intermediate)

For the College Wage dataset:

1. Estimate the Heckman model with and without exclusion restrictions
2. For each specification, compute the correlation matrix of the IMR and the X variables (for selected observations only)
3. Create a heatmap visualization of both correlation matrices side by side
4. Compute the condition number of the augmented design matrix $[X, \lambda]$ for both cases
5. Discuss: does adding exclusion restrictions reduce the condition number, and if so, why?
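
The two diagnostics in this exercise can be sketched on synthetic data before applying them to the College Wage dataset. The sketch below assumes a single regressor `x` and a single excluded variable `w`, and plugs in population probit indices instead of fitted ones. In the exercise itself the indices would come from the estimated selection equations. The helper names `inverse_mills` and `imr_diagnostics` are hypothetical.

```python
import numpy as np
from scipy import stats


def inverse_mills(index):
    """lambda(z) = phi(z) / Phi(z), evaluated in logs for numerical stability."""
    return np.exp(stats.norm.logpdf(index) - stats.norm.logcdf(index))


def imr_diagnostics(x, index, selected):
    """Return corr(x, IMR) and cond([1, x, IMR]) on the selected sample."""
    x_sel = x[selected]
    lam = inverse_mills(index[selected])
    corr = np.corrcoef(x_sel, lam)[0, 1]
    design = np.column_stack([np.ones(x_sel.size), x_sel, lam])
    return corr, np.linalg.cond(design)


rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
w = rng.normal(size=n)  # assumed exclusion restriction: enters selection only
selected = 0.5 + 0.5 * x + 1.0 * w + rng.normal(size=n) > 0

# Selection index when w is included in the probit
index_excl = 0.5 + 0.5 * x + 1.0 * w
# Selection index when w is omitted: w is absorbed into the error, whose
# variance becomes 2, so the probit coefficients shrink by 1 / sqrt(2)
index_no_excl = (0.5 + 0.5 * x) / np.sqrt(2.0)

corr_excl, cond_excl = imr_diagnostics(x, index_excl, selected)
corr_no_excl, cond_no_excl = imr_diagnostics(x, index_no_excl, selected)

print(f"With exclusion:    corr(x, IMR) = {corr_excl:.3f}, cond = {cond_excl:.1f}")
print(f"Without exclusion: corr(x, IMR) = {corr_no_excl:.3f}, cond = {cond_no_excl:.1f}")
```

Without the excluded variable, the IMR is a smooth function of `x` alone and is close to linear over the relevant range, so it is nearly collinear with the outcome regressors.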

In [None]:
# Exercise 3: Your solution here

# Step 1: Estimate both models (already done above)
# With exclusion: result_college_excl
# Without exclusion: result_college_no_excl

# Step 2: Compute IMR for both specifications
# TODO: Compute lambda(Z'gamma) for both models

# Step 3: Correlation matrices
# TODO: Correlate IMR with each X variable in the selected sample

# Step 4: Condition numbers
# Hint: np.linalg.cond(X_augmented)
# TODO: Compute condition numbers

# Step 5: Create heatmap visualization
# TODO: Use seaborn heatmap

---

### Exercise 4: Monte Carlo Simulation (Advanced)

Design a Monte Carlo experiment to demonstrate the importance of exclusion restrictions:

1. Generate data from a known DGP with:
   - True $\beta = [1.0, 0.5]$ (outcome equation)
   - True $\rho = 0.5$ (selection correlation)
   - A valid exclusion restriction $W$ that affects selection but not the outcome

2. For 200 replications, estimate the Heckman model:
   - (a) With the exclusion restriction
   - (b) Without the exclusion restriction

3. Compare the sampling distributions of $\hat{\beta}_1$, $\hat{\rho}$, and $\hat{\sigma}$ across both specifications

4. Create histograms showing the distributions and mark the true parameter values

**Expected result**: Specification (a) should show less bias and smaller variance.
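
One way to set up the data-generating process for a single replication is sketched below. It assumes one outcome regressor `x`, one excluded variable `w`, and hypothetical selection coefficients `gamma`, none of which are fixed by the exercise. The function name `simulate_selection_sample` is illustrative as well.

```python
import numpy as np


def simulate_selection_sample(n, beta, rho, sigma, gamma, rng):
    """Draw one sample from a Heckman-type DGP (illustrative assumptions).

    Outcome:   y* = beta[0] + beta[1] * x + eps
    Selection: s  = 1[gamma[0] + gamma[1] * x + gamma[2] * w + u > 0]
    Errors:    Var(u) = 1, Var(eps) = sigma**2, Corr(u, eps) = rho
    """
    x = rng.normal(size=n)
    w = rng.normal(size=n)  # affects selection but not the outcome
    u = rng.normal(size=n)
    # Build eps so that it has the required correlation with u
    eps = sigma * (rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n))
    selected = gamma[0] + gamma[1] * x + gamma[2] * w + u > 0
    # The outcome is observed only for selected units
    y = np.where(selected, beta[0] + beta[1] * x + eps, np.nan)
    return x, w, selected, y


rng = np.random.default_rng(42)
x, w, selected, y = simulate_selection_sample(
    n=500,
    beta=np.array([1.0, 0.5]),
    rho=0.5,
    sigma=1.0,
    gamma=np.array([0.0, 0.5, 1.0]),  # hypothetical selection coefficients
    rng=rng,
)
print(f"Share selected: {selected.mean():.2f}")
```

Passing one `rng` object through every replication keeps the whole experiment reproducible from a single seed.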

In [None]:
# Exercise 4: Your solution here

# Step 1: Define DGP
# np.random.seed(42)
# n = 500
# true_beta = np.array([1.0, 0.5])
# true_rho = 0.5
# true_sigma = 1.0

# Step 2: Monte Carlo loop
# n_reps = 200
# results_with_excl = []
# results_without_excl = []

# for rep in range(n_reps):
#     # Generate data
#     # TODO: Create X, W, selection, y
#
#     # Estimate with exclusion
#     # TODO
#
#     # Estimate without exclusion
#     # TODO
#     pass  # placeholder so the loop body is valid once uncommented

# Step 3: Compare distributions
# TODO: Create histograms

---

## References

### Essential Reading

1. **Heckman, J. J. (1979)**. "Sample Selection Bias as a Specification Error." *Econometrica*, 47(1), 153-161.

2. **Wooldridge, J. M. (2010)**. *Econometric Analysis of Cross Section and Panel Data* (2nd ed.). MIT Press. Chapter 19.

3. **Cameron, A. C., & Trivedi, P. K. (2005)**. *Microeconometrics: Methods and Applications*. Cambridge University Press. Chapter 16.

### On Identification

4. **Mroz, T. A. (1987)**. "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions." *Econometrica*, 55(4), 765-799.

5. **Puhani, P. A. (2000)**. "The Heckman Correction for Sample Selection and Its Critique." *Journal of Economic Surveys*, 14(1), 53-68.

6. **Card, D. (1995)**. "Using Geographic Variation in College Proximity to Estimate the Return to Schooling." In L. N. Christofides, E. K. Grant, & R. Swidinsky (Eds.), *Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp*. University of Toronto Press, 201-222.

### On Weak Identification

7. **Leung, S. F., & Yu, S. (1996)**. "On the Choice Between Sample Selection and Two-Part Models." *Journal of Econometrics*, 72(1-2), 197-229.

---

**Thank you for completing this tutorial!**

Questions or feedback? Visit: https://github.com/panelbox/panelbox/issues