# Heckman Two-Step Selection Correction -- SOLUTIONS

**This is the worked solution notebook.**  
It provides complete, working solutions for all 4 exercises from `04_heckman_selection.ipynb`.

> Instructors: do not distribute this file to students before they complete the tutorial notebook.

## Exercises covered

| # | Title | Level |
|---|-------|-------|
| 1 | Evaluate Candidate Instruments (Conceptual) | Conceptual |
| 2 | Implement and Compare Specifications (Hands-On) | Hands-On |
| 3 | Collinearity Diagnostic (Intermediate) | Intermediate |
| 4 | Monte Carlo Simulation (Advanced) | Advanced |

In [None]:
# ============================================================
# Setup
# ============================================================
import sys, pathlib

ROOT = pathlib.Path('..').resolve()
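# NOTE: machine-specific path to a local PanelBox checkout; edit it (or drop it
# if panelbox is already installed in the environment) before running elsewhere.
PANELBOX_ROOT = pathlib.Path('/home/guhaase/projetos/panelbox')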
for p in [str(ROOT), str(PANELBOX_ROOT)]:
    if p not in sys.path:
        sys.path.insert(0, p)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Statistical libraries
from scipy import stats
import statsmodels.api as sm

# PanelBox imports
from panelbox.models.selection import PanelHeckman, compute_imr, imr_diagnostics, test_selection_effect

# Visualization configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Set random seed for reproducibility
np.random.seed(42)

# Define paths (relative to notebook location in examples/censored/solutions/)
BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
TABLES_DIR = OUTPUT_DIR / 'tables'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Solution notebook loaded -- setup complete!')

---

## Data Loading

We load both datasets used across the exercises: **Mroz (1987)** for Exercises 1-2 and **College Wage** for Exercises 1 and 3. Exercise 4 simulates its own data.

In [None]:
# Load Mroz (1987) dataset
df_mroz = pd.read_csv(DATA_DIR / 'mroz_1987.csv')
print('Mroz dataset:', df_mroz.shape)
print('Columns:', list(df_mroz.columns))
print(f'LFP rate: {df_mroz["lfp"].mean():.1%}')
print()

# Load College Wage dataset
df_college = pd.read_csv(DATA_DIR / 'college_wage.csv')
print('College Wage dataset:', df_college.shape)
print('Columns:', list(df_college.columns))
print(f'College attendance rate: {df_college["college"].mean():.1%}')
print()

# Quick summaries
display(df_mroz.describe().round(3))
display(df_college.describe().round(3))

---

# Exercise 1: Evaluate Candidate Instruments (Conceptual)

For each proposed exclusion restriction, evaluate **relevance** (does the instrument
predict selection?) and **validity** (is the instrument excludable from the outcome
equation?).

A valid exclusion restriction must satisfy two conditions:

1. **Relevance**: The variable significantly predicts the selection decision,
   conditional on the other regressors  
   $\text{Cov}(Z_i, s_i \mid X_i) \neq 0$

2. **Validity (Excludability)**: The variable does NOT directly affect the outcome,
   conditional on the other regressors  
   $\text{Cov}(Z_i, \varepsilon_i \mid X_i) = 0$

### (a) Female labor supply -- Husband's age as instrument

**Setting**: We model married women's labor force participation (selection) and
hourly wages (outcome). The proposed instrument is the **husband's age**.

**Relevance assessment**:
- Husband's age is correlated with the household lifecycle stage. Older husbands
  may have higher earnings, reducing the wife's financial need to work. This
  provides a channel through which husband's age affects the participation
  decision.
- However, the relationship may be weak once we control for husband's income
  directly. If husband's income is already in the selection equation, husband's
  age adds little independent variation.
- **Verdict**: Moderate relevance. The strength depends on what other variables
  (especially husband's income) are already included.

**Validity assessment**:
- Husband's age could be correlated with the wife's age (assortative mating).
  If the wife's age affects her wage (through experience depreciation or
  vintage effects), then husband's age indirectly affects wages.
- There could also be network effects: older husbands may provide better job
  referrals or the couple may live in neighborhoods with different labor markets.
- **Verdict**: Questionable validity. The exclusion restriction is plausible only
  if the wife's own age (and experience) are already controlled for in the
  outcome equation. Even then, assortative mating creates subtle pathways.

**Overall**: Weak to moderate instrument. Use with caution and conduct sensitivity
analysis.

### (b) College wage premium -- SAT score as instrument

**Setting**: We model college attendance (selection) and post-college wages
(outcome). The proposed instrument is the **SAT score**.

**Relevance assessment**:
- SAT scores strongly predict college attendance. Students with higher SAT scores
  are more likely to be admitted and to choose to attend college.
- This is a very strong predictor of selection.
- **Verdict**: High relevance. SAT scores are among the strongest predictors of
  college enrollment.

**Validity assessment**:
- SAT scores measure cognitive ability. Cognitive ability directly affects
  wages through productivity, regardless of college attendance.
- If ability is in the outcome equation, this is partially addressed, but SAT
  scores likely capture dimensions of ability beyond what a single "ability"
  measure controls for.
- Employers may use SAT scores (or correlated signals) directly in hiring and
  wage setting, violating excludability.
- **Verdict**: Invalid. SAT scores proxy for ability, which affects wages
  directly, so the exclusion restriction is violated.

**Overall**: Despite high relevance, the SAT score **fails** as an exclusion
restriction because it directly affects wages. Better alternatives: distance to
college, local tuition levels, or cohort-level college capacity.

### (c) Union wage gap -- State right-to-work law

**Setting**: We model union membership (selection) and wages (outcome). The
proposed instrument is whether the worker lives in a **state with right-to-work
laws**.

**Relevance assessment**:
- Right-to-work laws prohibit mandatory union membership as a condition of
  employment, directly reducing union membership rates.
- States with right-to-work laws have significantly lower unionization rates
  (empirically well-documented).
- **Verdict**: High relevance. This is a strong and well-established predictor
  of union membership.

**Validity assessment**:
- Right-to-work states may differ systematically in their labor market
  conditions. These states tend to be in the South and have lower cost of
  living, different industry compositions, and different overall wage levels.
- If we do not control for state-level factors (cost of living, industry mix,
  regional labor demand), right-to-work status is likely correlated with
  wages through channels other than union membership.
- **Verdict**: Potentially valid, but requires careful conditioning. Include
  state-level controls (region, industry, cost of living) to make the exclusion
  restriction more plausible.

**Overall**: Good instrument if accompanied by appropriate state-level controls.
Without controls, likely invalid due to regional wage differentials.

### (d) Training program -- Distance to training site

**Setting**: We model participation in a job training program (selection) and
post-training earnings (outcome). The proposed instrument is **distance to the
training site**.

**Relevance assessment**:
- Distance is a practical barrier to participation: individuals who live farther
  from the training site face higher transportation costs and time costs, making
  them less likely to enroll.
- This is similar to the classic Card (1995) "distance to college" instrument.
- **Verdict**: High relevance. Geographic proximity is a strong predictor of
  program participation.

**Validity assessment**:
- If training sites are located in urban areas, distance may proxy for
  urban/rural residence, which correlates with wages independently of training.
- However, conditional on observable characteristics (education, experience,
  industry, urban/rural indicator), the remaining variation in distance is
  plausibly exogenous to earnings.
- The key assumption: conditional on observables, an individual's distance to
  the training site does not directly affect their earnings potential.
- **Verdict**: Plausibly valid, especially with appropriate controls for
  location characteristics.

**Overall**: Strong instrument, provided location characteristics are controlled
for. Distance-based exclusion restrictions are widely used in the selection model
literature, by analogy with Card's distance-to-college instrument.

### Summary Table

| Scenario | Instrument | Relevance | Validity | Overall |
|----------|-----------|-----------|----------|---------|
| (a) Female labor supply | Husband's age | Moderate | Questionable | Weak |
| (b) College wage premium | SAT score | High | Invalid | Fails |
| (c) Union wage gap | Right-to-work law | High | Conditional | Good (with controls) |
| (d) Training program | Distance to site | High | Plausible | Strong |

In [None]:
# Empirical illustration: demonstrate relevance for instruments
# we can actually test with our data

# (a) For the Mroz data, we can test relevance of candidate exclusion restrictions
print('Empirical Relevance Tests for Mroz Data')
print('=' * 60)

# Probit: regress LFP on all candidate exclusion restrictions
exclusion_candidates = ['children_lt6', 'children_6_18', 'husband_income', 'age']
other_vars = ['education', 'experience']

Z_full = sm.add_constant(df_mroz[other_vars + exclusion_candidates].values)
selection = df_mroz['lfp'].values

probit = sm.Probit(selection, Z_full)
probit_result = probit.fit(disp=0)

var_names = ['const'] + other_vars + exclusion_candidates
relevance_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': probit_result.params,
    'Std. Error': probit_result.bse,
    'z-stat': probit_result.tvalues,
    'p-value': probit_result.pvalues,
})

display(relevance_table.round(4))

print()
print('Relevance assessment (significant predictors of selection):')
for var in exclusion_candidates:
    idx = var_names.index(var)
    sig = 'Yes' if probit_result.pvalues[idx] < 0.05 else 'No'
    print(f'  {var:20s}: z = {probit_result.tvalues[idx]:7.3f}, '
          f'p = {probit_result.pvalues[idx]:.4f}, Significant: {sig}')

In [None]:
# (b) For the College Wage data, test distance_college and tuition (the
# alternatives to SAT suggested in part (b)) for relevance
print('Empirical Relevance Tests for College Wage Data')
print('=' * 60)

exclusion_cands_cw = ['distance_college', 'tuition']
other_vars_cw = ['ability', 'parent_education', 'family_income', 'urban', 'female']

Z_cw = sm.add_constant(df_college[other_vars_cw + exclusion_cands_cw].values)
sel_cw = df_college['college'].values

probit_cw = sm.Probit(sel_cw, Z_cw)
probit_cw_result = probit_cw.fit(disp=0)

var_names_cw = ['const'] + other_vars_cw + exclusion_cands_cw
rel_table_cw = pd.DataFrame({
    'Variable': var_names_cw,
    'Coefficient': probit_cw_result.params,
    'Std. Error': probit_cw_result.bse,
    'z-stat': probit_cw_result.tvalues,
    'p-value': probit_cw_result.pvalues,
})

display(rel_table_cw.round(4))

print()
for var in exclusion_cands_cw:
    idx = var_names_cw.index(var)
    sig = 'Yes' if probit_cw_result.pvalues[idx] < 0.05 else 'No'
    print(f'  {var:20s}: z = {probit_cw_result.tvalues[idx]:7.3f}, '
          f'p = {probit_cw_result.pvalues[idx]:.4f}, Significant: {sig}')

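print()
all_relevant = all(
    probit_cw_result.pvalues[var_names_cw.index(v)] < 0.05
    for v in exclusion_cands_cw
)
if all_relevant:
    print('Both distance_college and tuition significantly predict college attendance.')
else:
    print('At least one candidate is not significant at the 5% level: relevance is in doubt.')
print('Validity cannot be tested here: the claim that they do not directly affect wages')
print('(conditional on ability and other controls) rests on economic reasoning.')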

---

# Exercise 2: Implement and Compare Specifications (Hands-On)

Using the Mroz dataset, compare three Heckman model specifications that differ
in their choice of exclusion restrictions:

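| Model | Exclusion restrictions (added to the common selection variables) |
|-------|----------------------|
| A | `children_lt6` + `husband_income` |
| B | `children_6_18` + `husband_income` |
| C | none added (`age` only) |

All models share the same **outcome equation**:
$$\text{wage}_i = \beta_0 + \beta_1 \text{education}_i + \beta_2 \text{experience}_i + \beta_3 \text{experience\_sq}_i + \varepsilon_i$$

Every **selection equation** includes `education`, `experience`, and `age`, plus
the model-specific variables listed above. Because `age` never enters the outcome
equation, it acts as an excluded variable in all three models; Model C relies on
it alone. Note that `experience_sq` appears only in the outcome equation.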

In [None]:
# Prepare common data
y_all = df_mroz['wage'].fillna(0).values
selection = df_mroz['lfp'].values.astype(float)

# Outcome equation regressors (same for all models)
X_outcome = sm.add_constant(
    df_mroz[['education', 'experience', 'experience_sq']].values
)

# Common variables in selection equation (always included)
common_sel_vars = ['education', 'experience', 'age']

# Define the three specifications
specs = {
    'Model A': common_sel_vars + ['children_lt6', 'husband_income'],
    'Model B': common_sel_vars + ['children_6_18', 'husband_income'],
    'Model C': common_sel_vars,  # age is already in common; it serves as the
                                 # exclusion restriction since age is NOT in the
                                 # outcome equation
}

print('Outcome equation regressors: const, education, experience, experience_sq')
print()
for name, sel_vars in specs.items():
    excl = [v for v in sel_vars if v not in ['education', 'experience']]
    print(f'{name}: selection vars = {sel_vars}')
    print(f'        exclusion restrictions = {excl}')
    print()

In [None]:
# Estimate all three models
results = {}

for name, sel_vars in specs.items():
    print(f'Estimating {name}...')
    
    Z = sm.add_constant(df_mroz[sel_vars].values)
    
    model = PanelHeckman(
        endog=y_all,
        exog=X_outcome,
        selection=selection,
        exog_selection=Z,
        method='two_step'
    )
    result = model.fit()
    results[name] = result
    
    print(f'  rho = {result.rho:.4f}, sigma = {result.sigma:.4f}')
    print(f'  lambda = rho*sigma = {result.rho * result.sigma:.4f}')
    print()

print('All models estimated successfully.')

In [None]:
# Display full summary for each model
for name, result in results.items():
    print(f'\n{"=" * 60}')
    print(f'  {name}')
    print(f'{"=" * 60}')
    print(result.summary())
    print()

In [None]:
# Compare outcome coefficients across specifications
outcome_var_names = ['const', 'education', 'experience', 'experience_sq']

comparison_df = pd.DataFrame({
    name: pd.Series(result.outcome_params, index=outcome_var_names)
    for name, result in results.items()
})

# Add selection parameters
sel_params = pd.DataFrame({
    name: pd.Series({
        'rho': result.rho,
        'sigma': result.sigma,
        'lambda (rho*sigma)': result.rho * result.sigma,
    })
    for name, result in results.items()
})

print('Outcome Equation Coefficients Across Specifications')
print('=' * 60)
display(comparison_df.round(4))

print('\nSelection Parameters')
print('=' * 60)
display(sel_params.round(4))

In [None]:
# Visualize coefficient comparison across models
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Outcome coefficients (excluding constant for clarity)
coef_vars = ['education', 'experience', 'experience_sq']
x_pos = np.arange(len(coef_vars))
width = 0.25
colors = ['steelblue', '#D55E00', '#009E73']

for i, (name, result) in enumerate(results.items()):
    coefs = [result.outcome_params[j+1] for j in range(len(coef_vars))]
    axes[0].bar(x_pos + i * width, coefs, width, label=name,
                color=colors[i], alpha=0.8, edgecolor='black')

axes[0].set_xticks(x_pos + width)
axes[0].set_xticklabels(coef_vars, rotation=15)
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Outcome Coefficients by Specification')
axes[0].legend()
axes[0].grid(alpha=0.3, axis='y')
axes[0].axhline(y=0, color='black', linewidth=0.5)

# Right: Selection parameters (rho, sigma, lambda)
sel_labels = ['rho', 'sigma', 'lambda']
x_sel = np.arange(len(sel_labels))

for i, (name, result) in enumerate(results.items()):
    vals = [result.rho, result.sigma, result.rho * result.sigma]
    axes[1].bar(x_sel + i * width, vals, width, label=name,
                color=colors[i], alpha=0.8, edgecolor='black')

axes[1].set_xticks(x_sel + width)
axes[1].set_xticklabels(sel_labels)
axes[1].set_ylabel('Value')
axes[1].set_title('Selection Parameters by Specification')
axes[1].legend()
axes[1].grid(alpha=0.3, axis='y')
axes[1].axhline(y=0, color='black', linewidth=0.5)

plt.suptitle('Sensitivity to Choice of Exclusion Restrictions',
             fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex2_specification_comparison.png',
            dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Compare IMR diagnostics across specifications
print('IMR Diagnostics Comparison')
print('=' * 60)

diag_rows = []
for name, result in results.items():
    diag = result.imr_diagnostics()
    diag_rows.append({
        'Model': name,
        'IMR Mean': diag['imr_mean'],
        'IMR Std': diag['imr_std'],
        'IMR Min': diag['imr_min'],
        'IMR Max': diag['imr_max'],
        'High IMR (>2)': diag['high_imr_count'],
    })

diag_df = pd.DataFrame(diag_rows).set_index('Model')
display(diag_df.round(4))

In [None]:
# Selection tests for each specification
print('Selection Bias Tests')
print('=' * 60)

for name, result in results.items():
    test = result.selection_test()
    print(f'{name}:')
    print(f'  rho = {test["rho"]:.4f}, z = {test["z_statistic"]:.4f}, '
          f'p = {test["p_value"]:.4f}, Significant: {test["significant"]}')
    print()

### Exercise 2 -- Discussion

**Key findings**:

1. **Model A** (children_lt6 + husband_income) uses the two strongest exclusion
   restrictions. Young children have a large, significant effect on participation,
   and the model is well-identified.

2. **Model B** (children_6_18 + husband_income) replaces young children with
   school-age children. Since school-age children have a weaker effect on
   participation than young children, this specification is somewhat more
   weakly identified.

3. **Model C** (age only) relies solely on age as the exclusion restriction. Age
   appears in the selection equation but not in the outcome equation (where
   experience and experience-squared capture age-related productivity effects).
   This is the most weakly identified model because age is partially collinear
   with experience.

**Sensitivity**: The education and experience coefficients are reasonably stable
across specifications A and B but may shift more under specification C. This
illustrates why strong exclusion restrictions matter: they help pin down the
selection correction, making the outcome coefficients less sensitive to
specification choices.

**Recommendation**: Model A is preferred because `children_lt6` is the strongest
and most theoretically motivated exclusion restriction: young children constrain
labor supply but plausibly do not affect the offered wage directly.

---

# Exercise 3: Collinearity Diagnostic (Intermediate)

For the **College Wage** dataset, we investigate the collinearity problem that
arises when the Inverse Mills Ratio is highly correlated with the outcome
regressors. Without proper exclusion restrictions, the IMR is approximately
a linear function of the outcome regressors, leading to multicollinearity and
imprecise estimates.

**Steps**:
1. Estimate the model **with** exclusion restrictions (distance_college, tuition)
2. Estimate the model **without** exclusion restrictions
3. Compute IMR-X correlation matrices and visualize with heatmaps
4. Compute condition numbers to quantify collinearity

In [None]:
# Prepare College Wage data
y_college = df_college['wage'].fillna(0).values
sel_college = df_college['college'].values.astype(float)

# Outcome equation regressors
outcome_vars_cw = ['ability', 'parent_education', 'family_income', 'urban', 'female']
X_cw = sm.add_constant(df_college[outcome_vars_cw].values)

# Selection equation WITH exclusion restrictions
sel_vars_with = outcome_vars_cw + ['distance_college', 'tuition']
Z_cw_with = sm.add_constant(df_college[sel_vars_with].values)

# Selection equation WITHOUT exclusion restrictions
# (same variables as outcome equation -- identification from functional form only)
Z_cw_without = sm.add_constant(df_college[outcome_vars_cw].values)

print('College Wage Model Setup')
print('=' * 60)
print(f'Total observations:     {len(df_college)}')
print(f'College attendees:      {int(sel_college.sum())} ({sel_college.mean():.1%})')
print(f'Outcome regressors:     {outcome_vars_cw}')
print(f'Selection WITH excl:    {sel_vars_with}')
print(f'Selection WITHOUT excl: {outcome_vars_cw}')

In [None]:
# Estimate Model WITH exclusion restrictions
print('Model WITH Exclusion Restrictions')
print('=' * 60)

model_with = PanelHeckman(
    endog=y_college,
    exog=X_cw,
    selection=sel_college,
    exog_selection=Z_cw_with,
    method='two_step'
)
result_with = model_with.fit()
print(result_with.summary())

In [None]:
# Estimate Model WITHOUT exclusion restrictions
print('Model WITHOUT Exclusion Restrictions')
print('=' * 60)

model_without = PanelHeckman(
    endog=y_college,
    exog=X_cw,
    selection=sel_college,
    exog_selection=Z_cw_without,
    method='two_step'
)
result_without = model_without.fit()
print(result_without.summary())

In [None]:
# Compare outcome coefficients side by side
cw_outcome_names = ['const'] + outcome_vars_cw

cw_comparison = pd.DataFrame({
    'Variable': cw_outcome_names,
    'With Exclusion': result_with.outcome_params,
    'Without Exclusion': result_without.outcome_params,
})
cw_comparison['Difference'] = (
    cw_comparison['With Exclusion'] - cw_comparison['Without Exclusion']
)

print('Coefficient Comparison: With vs Without Exclusion Restrictions')
print('=' * 70)
display(cw_comparison.round(4))

print(f'\nWith exclusion:    rho = {result_with.rho:.4f}, sigma = {result_with.sigma:.4f}')
print(f'Without exclusion: rho = {result_without.rho:.4f}, sigma = {result_without.sigma:.4f}')

In [None]:
# Compute IMR for both specifications and build augmented design matrices
selected_mask = sel_college == 1

# --- WITH exclusion restrictions ---
# IMR from the model with exclusion restrictions
imr_with = result_with.lambda_imr[selected_mask]

# --- WITHOUT exclusion restrictions ---
# IMR from the model without exclusion restrictions
imr_without = result_without.lambda_imr[selected_mask]

# Build DataFrames with outcome regressors and IMR for selected sample
df_selected = df_college[df_college['college'] == 1].copy()

# Augmented data WITH exclusion
aug_with = df_selected[outcome_vars_cw].copy()
aug_with['IMR'] = imr_with

# Augmented data WITHOUT exclusion
aug_without = df_selected[outcome_vars_cw].copy()
aug_without['IMR'] = imr_without

print('IMR Statistics (selected sample)')
print('=' * 50)
print(f'WITH exclusion:    mean={imr_with.mean():.4f}, '
      f'std={imr_with.std():.4f}, range=[{imr_with.min():.4f}, {imr_with.max():.4f}]')
print(f'WITHOUT exclusion: mean={imr_without.mean():.4f}, '
      f'std={imr_without.std():.4f}, range=[{imr_without.min():.4f}, {imr_without.max():.4f}]')

In [None]:
# Compute correlation matrices
corr_with = aug_with.corr()
corr_without = aug_without.corr()

print('Correlation Matrix WITH Exclusion Restrictions')
print('=' * 60)
display(corr_with.round(3))

print('\nCorrelation Matrix WITHOUT Exclusion Restrictions')
print('=' * 60)
display(corr_without.round(3))

# Highlight the IMR correlations
print('\nIMR correlations with outcome regressors:')
print('-' * 50)
imr_corr_comparison = pd.DataFrame({
    'With Exclusion': corr_with['IMR'].drop('IMR'),
    'Without Exclusion': corr_without['IMR'].drop('IMR'),
})
# Difference in absolute correlations (positive = more collinear WITHOUT exclusion)
imr_corr_comparison['Diff in |r|'] = (
    imr_corr_comparison['Without Exclusion'].abs() -
    imr_corr_comparison['With Exclusion'].abs()
)
display(imr_corr_comparison.round(4))

print('\nHigher absolute correlations WITHOUT exclusion indicate more collinearity.')

In [None]:
# Create side-by-side heatmaps
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Heatmap: WITH exclusion
mask_with = np.triu(np.ones_like(corr_with, dtype=bool), k=1)
sns.heatmap(corr_with, mask=mask_with, annot=True, fmt='.2f',
            cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            square=True, linewidths=0.5, ax=axes[0],
            cbar_kws={'shrink': 0.8})
axes[0].set_title('WITH Exclusion Restrictions\n(distance_college, tuition)',
                  fontsize=12, fontweight='bold')

# Heatmap: WITHOUT exclusion
mask_without = np.triu(np.ones_like(corr_without, dtype=bool), k=1)
sns.heatmap(corr_without, mask=mask_without, annot=True, fmt='.2f',
            cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            square=True, linewidths=0.5, ax=axes[1],
            cbar_kws={'shrink': 0.8})
axes[1].set_title('WITHOUT Exclusion Restrictions\n(functional form identification only)',
                  fontsize=12, fontweight='bold')

plt.suptitle('IMR--Regressor Correlation: Effect of Exclusion Restrictions',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex3_collinearity_heatmap.png',
            dpi=150, bbox_inches='tight')
plt.show()

print('Key observation: Without exclusion restrictions, the IMR row/column')
print('shows higher correlations with the outcome regressors, indicating')
print('that the IMR is nearly collinear with X.')

In [None]:
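# Compute condition numbers
# The condition number measures how close to singular a matrix is.
# Higher condition number = more collinearity = less numerically stable.
# Caveat: on raw (unscaled) columns the value also reflects differences in
# column scale, so the ratio between specifications is more informative than
# the absolute level.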

# Augmented design matrix WITH exclusion (selected sample only)
X_sel = X_cw[selected_mask]
X_aug_with = np.column_stack([X_sel, imr_with])
X_aug_without = np.column_stack([X_sel, imr_without])

# Condition numbers
cond_X = np.linalg.cond(X_sel)
cond_with = np.linalg.cond(X_aug_with)
cond_without = np.linalg.cond(X_aug_without)

print('Condition Number Analysis')
print('=' * 60)
print(f'X only (no IMR):                     {cond_X:.2f}')
print(f'X + IMR (WITH exclusion):             {cond_with:.2f}')
print(f'X + IMR (WITHOUT exclusion):           {cond_without:.2f}')
print()
print(f'Ratio (without / with): {cond_without / cond_with:.2f}x')
print()
print('Interpretation (rough rules of thumb; raw values also reflect column scale):')
print('- Condition number < 30: acceptable collinearity')
print('- Condition number 30-300: moderate collinearity')
print('- Condition number > 300: severe collinearity')
print()
if cond_without > 2 * cond_with:
    print('The model WITHOUT exclusion restrictions has substantially higher')
    print('collinearity. Exclusion restrictions reduce the IMR-X correlation,')
    print('improving numerical stability and estimation precision.')
else:
    print('Both specifications show similar collinearity levels.')
    print('However, the model with exclusion restrictions is still preferred')
    print('for identification and interpretability reasons.')

In [None]:
# Additional visualization: scatter plot of IMR vs dominant regressors
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Most correlated regressor for each specification
# Use 'ability' since it is likely the strongest predictor
ability_sel = df_selected['ability'].values

# WITH exclusion
axes[0].scatter(ability_sel, imr_with, alpha=0.4, s=15, color='steelblue')
z = np.polyfit(ability_sel, imr_with, 1)
p = np.poly1d(z)
ability_sorted = np.sort(ability_sel)
axes[0].plot(ability_sorted, p(ability_sorted), 'r--', linewidth=2,
             label=f'r = {np.corrcoef(ability_sel, imr_with)[0,1]:.3f}')
axes[0].set_xlabel('Ability')
axes[0].set_ylabel('IMR')
axes[0].set_title('WITH Exclusion Restrictions')
axes[0].legend()
axes[0].grid(alpha=0.3)

# WITHOUT exclusion
axes[1].scatter(ability_sel, imr_without, alpha=0.4, s=15, color='#D55E00')
z2 = np.polyfit(ability_sel, imr_without, 1)
p2 = np.poly1d(z2)
axes[1].plot(ability_sorted, p2(ability_sorted), 'r--', linewidth=2,
             label=f'r = {np.corrcoef(ability_sel, imr_without)[0,1]:.3f}')
axes[1].set_xlabel('Ability')
axes[1].set_ylabel('IMR')
axes[1].set_title('WITHOUT Exclusion Restrictions')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.suptitle('IMR vs Ability: Collinearity with and without Exclusion Restrictions',
             fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex3_imr_vs_ability.png',
            dpi=150, bbox_inches='tight')
plt.show()

print('Without exclusion restrictions, the IMR is a deterministic, nearly linear')
print('function of the outcome regressors (ability, parent_education, etc.), since')
print('the selection equation uses the same variables. This makes the IMR nearly')
print('collinear with X, inflating standard errors and destabilizing estimates.')

### Exercise 3 -- Discussion

**Why exclusion restrictions reduce collinearity**:

Without exclusion restrictions, the selection equation and outcome equation use
the same regressors. The IMR is then $\lambda_i = \phi(X_i'\hat{\gamma}) / \Phi(X_i'\hat{\gamma})$,
which is a smooth nonlinear function of $X_i'\hat{\gamma}$. Over the range of
typical data, this function is approximately linear, so $\lambda_i \approx a + b \cdot X_i'\hat{\gamma}$.
Since $X_i'\hat{\gamma}$ is a linear combination of the same regressors in the
outcome equation, $\lambda_i$ is nearly collinear with $X_i$.

With exclusion restrictions, the selection equation includes additional variables
(e.g., distance_college, tuition) that are NOT in the outcome equation. This
means $Z_i'\hat{\gamma}$ has variation that is not explained by $X_i$, which
breaks the approximate linear dependence between $\lambda_i$ and $X_i$.
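
The near-linearity claim can be checked numerically. The sketch below evaluates the IMR on a grid of index values and measures how well a straight line fits it; the grid range of -1 to 2 is an assumption chosen for illustration, not a property of the College Wage data.

```python
import numpy as np
from scipy.stats import norm

# Grid of index values; the range [-1, 2] is an illustrative assumption
idx = np.linspace(-1.0, 2.0, 301)
imr = norm.pdf(idx) / norm.cdf(idx)

# Best linear approximation a + b * idx and its R^2
b, a = np.polyfit(idx, imr, 1)
resid = imr - (a + b * idx)
r2_linear = 1.0 - resid.var() / imr.var()

print(f'slope = {b:.3f}, intercept = {a:.3f}, R^2 of linear fit = {r2_linear:.4f}')
```

On this range a straight line accounts for most of the variation in the IMR. The narrower the spread of the fitted index, the closer the IMR gets to a linear function of it, and the harder it becomes to separate the selection term from the regressors.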

**Practical consequences of high collinearity**:
- Standard errors of the outcome coefficients are inflated
- Coefficient estimates become unstable (sensitive to small data changes)
- The selection correction parameter is poorly estimated
- Without exclusion restrictions, the model is "identified" only through the
  nonlinearity of the inverse Mills ratio implied by the normality assumption,
  which is a very fragile form of identification

---

# Exercise 4: Monte Carlo Simulation (Advanced)

We conduct a Monte Carlo experiment to demonstrate that:
1. OLS on the selected sample is **biased** when selection is present
2. The Heckman two-step estimator is **consistent** with proper exclusion restrictions
3. Without exclusion restrictions, the Heckman estimator is imprecise

**Data Generating Process**:
- Selection: $s_i^* = \gamma_0 + \gamma_1 x_i + \gamma_2 z_i + u_i$, $s_i = \mathbf{1}[s_i^* > 0]$
- Outcome: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ (observed only if $s_i = 1$)
- $(u_i, \varepsilon_i) \sim N(0, \Sigma)$ with $\text{Var}(u_i) = 1$, $\text{Var}(\varepsilon_i) = \sigma_\varepsilon^2 = 1$, and correlation $\rho = 0.5$
- $z_i$ is the exclusion restriction (affects selection but not outcome)

**Design**: 200 replications, N=500 per replication
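
Under this design, the source of the OLS bias is the standard selection term: $E[y_i \mid x_i, z_i, s_i = 1] = \beta_0 + \beta_1 x_i + \rho\,\sigma_\varepsilon\,\lambda(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)$, where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$. The sketch below checks this numerically with one large simulated sample. The parameter values match the true parameters defined in this exercise's setup cell; the sample size and seed are assumptions made for the check.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 200_000

# Parameter values matching the Monte Carlo setup of this exercise
beta0, beta1 = 1.0, 0.5
g0, g1, g2 = -0.5, 0.3, 0.6
rho, sigma_eps = 0.5, 1.0

x = rng.normal(size=n)
z = rng.normal(size=n)
cov = [[1.0, rho * sigma_eps], [rho * sigma_eps, sigma_eps**2]]
u, eps = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

index = g0 + g1 * x + g2 * z
selected = index + u > 0

# Theoretical selection term: E[eps | s = 1, x, z] = rho * sigma_eps * lambda(index)
imr = norm.pdf(index) / norm.cdf(index)
theory_mean = rho * sigma_eps * imr[selected].mean()
sample_mean = eps[selected].mean()

print(f'mean eps among selected: simulated = {sample_mean:.4f}, theory = {theory_mean:.4f}')
```

The outcome error has a nonzero mean in the selected sample, and that mean depends on $x_i$ through $\lambda$. OLS omits this term, so its slope estimate absorbs part of it.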

In [None]:
# Monte Carlo setup
n_sims = 200          # Number of replications
n_obs = 500           # Observations per replication

# True parameters
beta_true = np.array([1.0, 0.5])    # Outcome: y = 1.0 + 0.5*x + eps
gamma_true = np.array([-0.5, 0.3, 0.6])  # Selection: s* = -0.5 + 0.3*x + 0.6*z + u
rho_true = 0.5                       # Error correlation
sigma_eps_true = 1.0                 # Outcome error std dev

# Covariance matrix for bivariate normal errors
# u ~ N(0,1) and eps ~ N(0, sigma_eps^2) with correlation rho
Sigma = np.array([
    [1.0,                         rho_true * sigma_eps_true],
    [rho_true * sigma_eps_true,   sigma_eps_true**2]
])

print('Monte Carlo Design')
print('=' * 60)
print(f'Replications:  {n_sims}')
print(f'Sample size:   {n_obs}')
print(f'True beta:     {beta_true}')
print(f'True gamma:    {gamma_true}')
print(f'True rho:      {rho_true}')
print(f'True sigma:    {sigma_eps_true}')
print()
print('Error covariance matrix:')
print(f'  [[{Sigma[0,0]:.2f}, {Sigma[0,1]:.2f}]')
print(f'   [{Sigma[1,0]:.2f}, {Sigma[1,1]:.2f}]]')

In [None]:
# Run the Monte Carlo simulation
np.random.seed(42)

# Storage for results
beta1_ols = np.zeros(n_sims)           # OLS estimate of beta_1
beta1_heckman_with = np.zeros(n_sims)  # Heckman WITH exclusion restriction
beta1_heckman_no = np.zeros(n_sims)    # Heckman WITHOUT exclusion restriction
rho_heckman_with = np.zeros(n_sims)    # Estimated rho (with excl)
rho_heckman_no = np.zeros(n_sims)      # Estimated rho (without excl)
sel_rate = np.zeros(n_sims)            # Selection rate per replication

for sim in range(n_sims):
    if (sim + 1) % 50 == 0:
        print(f'  Replication {sim + 1}/{n_sims}...')
    
    # Generate regressors
    x = np.random.normal(0, 1, n_obs)   # Outcome regressor
    z = np.random.normal(0, 1, n_obs)   # Exclusion restriction (instrument)
    
    # Generate correlated errors
    errors = np.random.multivariate_normal([0, 0], Sigma, n_obs)
    u = errors[:, 0]    # Selection error
    eps = errors[:, 1]  # Outcome error
    
    # Selection equation
    s_star = gamma_true[0] + gamma_true[1] * x + gamma_true[2] * z + u
    s = (s_star > 0).astype(float)
    sel_rate[sim] = s.mean()
    
    # Outcome equation (latent for all, observed only if s=1)
    y_latent = beta_true[0] + beta_true[1] * x + eps
    y_observed = np.where(s == 1, y_latent, 0)
    
    # Skip if too few selected or too few censored
    if s.sum() < 30 or (1 - s).sum() < 10:
        beta1_ols[sim] = np.nan
        beta1_heckman_with[sim] = np.nan
        beta1_heckman_no[sim] = np.nan
        rho_heckman_with[sim] = np.nan
        rho_heckman_no[sim] = np.nan
        continue
    
    # --- OLS on selected sample (biased) ---
    sel_mask = s == 1
    X_sel_ols = sm.add_constant(x[sel_mask])
    ols_result = np.linalg.lstsq(X_sel_ols, y_latent[sel_mask], rcond=None)[0]
    beta1_ols[sim] = ols_result[1]
    
    # --- Heckman WITH exclusion restriction ---
    X_out = sm.add_constant(x.reshape(-1, 1))
    Z_sel_with = sm.add_constant(np.column_stack([x, z]))
    
    try:
        model_w = PanelHeckman(
            endog=y_observed,
            exog=X_out,
            selection=s,
            exog_selection=Z_sel_with,
            method='two_step'
        )
        res_w = model_w.fit()
        beta1_heckman_with[sim] = res_w.outcome_params[1]
        rho_heckman_with[sim] = res_w.rho
    except Exception:
        beta1_heckman_with[sim] = np.nan
        rho_heckman_with[sim] = np.nan
    
    # --- Heckman WITHOUT exclusion restriction ---
    Z_sel_no = sm.add_constant(x.reshape(-1, 1))  # Same as X (no instrument)
    
    try:
        model_no = PanelHeckman(
            endog=y_observed,
            exog=X_out,
            selection=s,
            exog_selection=Z_sel_no,
            method='two_step'
        )
        res_no = model_no.fit()
        beta1_heckman_no[sim] = res_no.outcome_params[1]
        rho_heckman_no[sim] = res_no.rho
    except Exception:
        beta1_heckman_no[sim] = np.nan
        rho_heckman_no[sim] = np.nan

print('\nSimulation complete.')
print(f'Average selection rate: {np.nanmean(sel_rate):.1%}')
print(f'Valid replications: OLS={np.isfinite(beta1_ols).sum()}, '
      f'Heckman(with)={np.isfinite(beta1_heckman_with).sum()}, '
      f'Heckman(no)={np.isfinite(beta1_heckman_no).sum()}')

In [None]:
# Summary statistics of the Monte Carlo experiment
def mc_summary(estimates, true_value, name):
    """Compute Monte Carlo summary statistics."""
    valid = estimates[np.isfinite(estimates)]
    bias = np.mean(valid) - true_value
    rmse = np.sqrt(np.mean((valid - true_value)**2))
    return {
        'Estimator': name,
        'True Value': true_value,
        'Mean': np.mean(valid),
        'Median': np.median(valid),
        'Std Dev': np.std(valid),
        'Bias': bias,
        'RMSE': rmse,
        'Valid Reps': len(valid),
    }

mc_results = pd.DataFrame([
    mc_summary(beta1_ols, beta_true[1], 'OLS (selected sample)'),
    mc_summary(beta1_heckman_with, beta_true[1], 'Heckman (with excl.)'),
    mc_summary(beta1_heckman_no, beta_true[1], 'Heckman (no excl.)'),
])

print('Monte Carlo Results for beta_1 (true value = 0.5)')
print('=' * 80)
display(mc_results.round(4))

print()
print('Key findings:')
print(f'  OLS bias:              {mc_results.iloc[0]["Bias"]:.4f} '
      f'({mc_results.iloc[0]["Bias"]/beta_true[1]*100:.1f}% of true value)')
print(f'  Heckman (with) bias:   {mc_results.iloc[1]["Bias"]:.4f} '
      f'({mc_results.iloc[1]["Bias"]/beta_true[1]*100:.1f}% of true value)')
print(f'  Heckman (no) bias:     {mc_results.iloc[2]["Bias"]:.4f} '
      f'({mc_results.iloc[2]["Bias"]/beta_true[1]*100:.1f}% of true value)')

In [None]:
# Rho estimation summary
rho_results = pd.DataFrame([
    mc_summary(rho_heckman_with, rho_true, 'Heckman (with excl.)'),
    mc_summary(rho_heckman_no, rho_true, 'Heckman (no excl.)'),
])

print('Monte Carlo Results for rho (true value = 0.5)')
print('=' * 80)
display(rho_results.round(4))

In [None]:
# Visualize sampling distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top-left: Histogram of beta_1 estimates
bins = np.linspace(-0.5, 1.5, 50)

valid_ols = beta1_ols[np.isfinite(beta1_ols)]
valid_hw = beta1_heckman_with[np.isfinite(beta1_heckman_with)]
valid_hn = beta1_heckman_no[np.isfinite(beta1_heckman_no)]

axes[0, 0].hist(valid_ols, bins=bins, alpha=0.5, density=True,
                color='#D55E00', label='OLS', edgecolor='black')
axes[0, 0].hist(valid_hw, bins=bins, alpha=0.5, density=True,
                color='steelblue', label='Heckman (with excl.)', edgecolor='black')
axes[0, 0].axvline(beta_true[1], color='black', linewidth=2.5,
                   linestyle='--', label=f'True value = {beta_true[1]}')
axes[0, 0].axvline(np.mean(valid_ols), color='#D55E00', linewidth=1.5,
                   linestyle=':', label=f'OLS mean = {np.mean(valid_ols):.3f}')
axes[0, 0].axvline(np.mean(valid_hw), color='steelblue', linewidth=1.5,
                   linestyle=':', label=f'Heckman mean = {np.mean(valid_hw):.3f}')
axes[0, 0].set_xlabel(r'$\hat{\beta}_1$')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title(r'Sampling Distribution of $\hat{\beta}_1$')
axes[0, 0].legend(fontsize=9)
axes[0, 0].grid(alpha=0.3)

# Top-right: Heckman with vs without exclusion
axes[0, 1].hist(valid_hw, bins=bins, alpha=0.5, density=True,
                color='steelblue', label='With exclusion', edgecolor='black')
axes[0, 1].hist(valid_hn, bins=bins, alpha=0.5, density=True,
                color='#009E73', label='Without exclusion', edgecolor='black')
axes[0, 1].axvline(beta_true[1], color='black', linewidth=2.5,
                   linestyle='--', label=f'True value = {beta_true[1]}')
axes[0, 1].set_xlabel(r'$\hat{\beta}_1$')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title('With vs Without Exclusion Restriction')
axes[0, 1].legend(fontsize=9)
axes[0, 1].grid(alpha=0.3)

# Bottom-left: Rho estimates
valid_rho_w = rho_heckman_with[np.isfinite(rho_heckman_with)]
valid_rho_n = rho_heckman_no[np.isfinite(rho_heckman_no)]

rho_bins = np.linspace(-1, 1, 50)
axes[1, 0].hist(valid_rho_w, bins=rho_bins, alpha=0.5, density=True,
                color='steelblue', label='With exclusion', edgecolor='black')
axes[1, 0].hist(valid_rho_n, bins=rho_bins, alpha=0.5, density=True,
                color='#009E73', label='Without exclusion', edgecolor='black')
axes[1, 0].axvline(rho_true, color='black', linewidth=2.5,
                   linestyle='--', label=f'True rho = {rho_true}')
axes[1, 0].set_xlabel(r'$\hat{\rho}$')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title(r'Sampling Distribution of $\hat{\rho}$')
axes[1, 0].legend(fontsize=9)
axes[1, 0].grid(alpha=0.3)

# Bottom-right: Bias boxplot
bias_data = [
    valid_ols - beta_true[1],
    valid_hw - beta_true[1],
    valid_hn - beta_true[1],
]
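bp = axes[1, 1].boxplot(
    bias_data,
    patch_artist=True,
    widths=0.5,
)
# Set tick labels on the axis: the boxplot `labels=` keyword is deprecated
# in recent Matplotlib releases
axes[1, 1].set_xticklabels(['OLS', 'Heckman\n(with excl.)', 'Heckman\n(no excl.)'])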
colors_box = ['#D55E00', 'steelblue', '#009E73']
for patch, color in zip(bp['boxes'], colors_box):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)
axes[1, 1].axhline(y=0, color='black', linewidth=1.5, linestyle='--')
axes[1, 1].set_ylabel(r'Bias ($\hat{\beta}_1 - \beta_1$)')
axes[1, 1].set_title('Estimation Bias Distribution')
axes[1, 1].grid(alpha=0.3, axis='y')

plt.suptitle(f'Monte Carlo Simulation: {n_sims} Replications, N={n_obs}, '
             f'rho={rho_true}',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex4_monte_carlo_results.png',
            dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Formal comparison: bias, variance, and MSE
print('Formal Comparison: Bias, Variance, and MSE')
print('=' * 70)
print(f'{"":30s} {"Bias":>10s} {"Variance":>10s} {"MSE":>10s} {"Coverage":>10s}')
print('-' * 70)

for name, est in [('OLS (selected)', valid_ols),
                   ('Heckman (with excl.)', valid_hw),
                   ('Heckman (no excl.)', valid_hn)]:
    bias = np.mean(est) - beta_true[1]
    var = np.var(est)
    mse = np.mean((est - beta_true[1])**2)
    # Approximate 95% coverage: the Monte Carlo std dev of the estimates stands
    # in for each replication's standard error. This is an approximation, not
    # the coverage of the confidence intervals reported by each fitted model.
    mc_sd = np.std(est)
    ci_lower = est - 1.96 * mc_sd
    ci_upper = est + 1.96 * mc_sd
    coverage = np.mean((ci_lower <= beta_true[1]) & (beta_true[1] <= ci_upper))
    
    print(f'{name:30s} {bias:10.4f} {var:10.4f} {mse:10.4f} {coverage:10.1%}')

print()
print('MSE = Bias^2 + Variance')
print('Coverage = fraction of replications where estimate +/- 1.96 * (Monte Carlo SD) '
      'contains the true value')

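The coverage column above relies on the Monte Carlo standard deviation rather than on each replication's own standard error. A sketch of the per-replication version is shown below, using plain OLS on a fully observed sample so that the standard error can be computed by hand. Applying the same idea to the Heckman estimator would require the standard error reported by each fitted model, which this notebook does not extract; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n, beta1 = 500, 200, 0.5
covered = np.zeros(n_reps, dtype=bool)

for r in range(n_reps):
    x = rng.normal(size=n)
    y = 1.0 + beta1 * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])

    # OLS estimate and its classical standard error for this replication
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (n - X.shape[1])
    se_b1 = np.sqrt(sigma2 * XtX_inv[1, 1])

    # Does this replication's own 95% interval contain the true slope?
    covered[r] = abs(b[1] - beta1) <= 1.96 * se_b1

coverage = covered.mean()
print(f'Empirical coverage of the nominal 95% CI: {coverage:.1%}')
```

With a correctly specified model the empirical coverage should be close to the nominal 95%, up to Monte Carlo noise.
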
In [None]:
# Additional analysis: how does bias change with rho?
# Quick demonstration with a few values of rho

rho_values = [0.0, 0.25, 0.5, 0.75]
n_sims_quick = 100
n_obs_quick = 500

np.random.seed(123)

bias_by_rho = {'rho': [], 'OLS_bias': [], 'Heckman_bias': []}

for rho_val in rho_values:
    print(f'Running rho = {rho_val}...')
    Sigma_val = np.array([
        [1.0, rho_val * sigma_eps_true],
        [rho_val * sigma_eps_true, sigma_eps_true**2]
    ])
    
    ols_ests = []
    heck_ests = []
    
    for _ in range(n_sims_quick):
        x_q = np.random.normal(0, 1, n_obs_quick)
        z_q = np.random.normal(0, 1, n_obs_quick)
        errs = np.random.multivariate_normal([0, 0], Sigma_val, n_obs_quick)
        
        s_star_q = gamma_true[0] + gamma_true[1] * x_q + gamma_true[2] * z_q + errs[:, 0]
        s_q = (s_star_q > 0).astype(float)
        y_q = beta_true[0] + beta_true[1] * x_q + errs[:, 1]
        y_obs_q = np.where(s_q == 1, y_q, 0)
        
        if s_q.sum() < 30 or (1 - s_q).sum() < 10:
            continue
        
        # OLS
        X_q = sm.add_constant(x_q[s_q == 1])
        b_ols = np.linalg.lstsq(X_q, y_q[s_q == 1], rcond=None)[0]
        ols_ests.append(b_ols[1])
        
        # Heckman
        try:
            model_q = PanelHeckman(
                endog=y_obs_q,
                exog=sm.add_constant(x_q.reshape(-1, 1)),
                selection=s_q,
                exog_selection=sm.add_constant(np.column_stack([x_q, z_q])),
                method='two_step'
            )
            res_q = model_q.fit()
            heck_ests.append(res_q.outcome_params[1])
        except Exception:
            pass
    
    bias_by_rho['rho'].append(rho_val)
    bias_by_rho['OLS_bias'].append(np.mean(ols_ests) - beta_true[1])
    bias_by_rho['Heckman_bias'].append(np.mean(heck_ests) - beta_true[1])

bias_df = pd.DataFrame(bias_by_rho)

print('\nBias as a Function of rho')
print('=' * 50)
display(bias_df.round(4))

In [None]:
# Plot bias vs rho
fig, ax = plt.subplots(figsize=(8, 5))

ax.plot(bias_df['rho'], bias_df['OLS_bias'], 'o-',
        color='#D55E00', linewidth=2, markersize=8, label='OLS (biased)')
ax.plot(bias_df['rho'], bias_df['Heckman_bias'], 's-',
        color='steelblue', linewidth=2, markersize=8, label='Heckman (corrected)')
ax.axhline(y=0, color='black', linewidth=1, linestyle='--', alpha=0.7)
ax.set_xlabel(r'True $\rho$ (selection-outcome error correlation)', fontsize=12)
ax.set_ylabel(r'Bias in $\hat{\beta}_1$', fontsize=12)
ax.set_title(r'Selection Bias Increases with $\rho$', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex4_bias_vs_rho.png', dpi=150, bbox_inches='tight')
plt.show()

print('Expected pattern (compare with the table and plot above):')
print('- When rho = 0, there is no selection bias and OLS is unbiased')
print('- As rho increases, the magnitude of the OLS bias grows')
print('- The Heckman estimator remains approximately unbiased for all rho values')

### Exercise 4 -- Discussion

**Key findings from the Monte Carlo experiment**:

1. **OLS is biased**: When $\rho \neq 0$, OLS on the selected sample produces
   estimates that are systematically different from the true value. The bias
   grows with $|\rho|$.

2. **Heckman with exclusion restriction is consistent**: The Heckman two-step
   estimator with a proper exclusion restriction ($z_i$) produces estimates
   centered around the true value with reasonable variance.

3. **Heckman without exclusion restriction is imprecise**: When the selection
   equation uses the same variables as the outcome equation (no exclusion
   restriction), the estimator may be approximately unbiased but has much
   larger variance due to collinearity between the IMR and the regressors.

4. **Bias-variance tradeoff**: The Heckman estimator trades some variance
   (wider sampling distribution) for bias reduction. This is the typical
   econometric tradeoff when correcting for endogeneity/selection.

5. **Rho determines the severity**: When $\rho = 0$, there is no selection bias,
   so all three estimators are consistent and OLS is the most precise. The
   Heckman correction matters only when $\rho$ is meaningfully different from
   zero.

**Practical implications**:
- Always test for selection bias before deciding whether to use the Heckman
  correction
- Invest effort in finding credible exclusion restrictions; they dramatically
  improve estimation precision
- The Heckman estimator without exclusion restrictions relies on functional
  form alone and should be used only as a robustness check

---

# Summary

This solution notebook covered four exercises on the Heckman two-step selection
model:

| Exercise | Key Takeaway |
|----------|-------------|
| 1. Evaluate instruments | Relevance and validity must both hold; SAT scores fail validity |
| 2. Compare specifications | Strong exclusion restrictions (children_lt6) yield stable estimates |
| 3. Collinearity diagnostic | Without exclusion restrictions, IMR is collinear with X |
| 4. Monte Carlo | OLS is biased with selection; Heckman with instruments is consistent |

**Central theme**: The Heckman model corrects for selection bias, but its
performance depends critically on having credible exclusion restrictions.
Without them, the model is weakly identified and estimates are imprecise.