# Identification and Exclusion Restrictions — SOLUTION

**This is the worked solution notebook.**
It corresponds to `06_identification.ipynb` and provides complete solutions for all 4 exercises.

> Instructors: do not distribute this file to students before they complete the tutorial notebook.

---

## Exercises Overview

| Exercise | Topic | Difficulty |
|---|---|---|
| 1 | Evaluate Candidate Instruments (Conceptual) | Introductory |
| 2 | Implement and Compare Specifications (Hands-On) | Intermediate |
| 3 | Collinearity Diagnostic (Intermediate) | Intermediate |
| 4 | Monte Carlo Simulation (Advanced) | Advanced |

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
from scipy.optimize import minimize as sp_minimize
import statsmodels.api as sm

# PanelBox imports
from panelbox.models.selection import PanelHeckman
from panelbox.models.selection.inverse_mills import compute_imr, test_selection_effect

# Visualization configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

np.random.seed(42)

# Paths (relative to solutions/ directory)
BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
TABLES_DIR = OUTPUT_DIR / 'tables'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete!')

In [None]:
# Load datasets
mroz = pd.read_csv(DATA_DIR / 'mroz_1987.csv')
college = pd.read_csv(DATA_DIR / 'college_wage.csv')

print(f'Mroz dataset: {len(mroz)} observations, columns: {list(mroz.columns)}')
print(f'College dataset: {len(college)} observations, columns: {list(college.columns)}')

# Prepare Mroz data
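# Non-participants may have missing or zero wages; mask non-positive values
# so log() yields NaN (not -inf), which fillna(0) below then replaces.
mroz['log_wage'] = np.log(mroz['wage'].where(mroz['wage'] > 0))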
y_mroz = mroz['log_wage'].fillna(0).values
selection_mroz = mroz['lfp'].values

# Outcome equation variables (X) for Mroz
X_mroz = sm.add_constant(mroz[['education', 'experience', 'experience_sq']].values)
X_mroz_names = ['const', 'education', 'experience', 'experience_sq']

# Prepare College data
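# Mask non-positive wages so log() yields NaN (not -inf) before fillna(0)
college['log_wage'] = np.log(college['wage'].where(college['wage'] > 0))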
y_college = college['log_wage'].fillna(0).values
selection_college = college['college'].values

# Outcome equation variables (X) for College
X_college = sm.add_constant(
    college[['ability', 'parent_education', 'family_income', 'urban', 'female']].values
)
X_college_names = ['const', 'ability', 'parent_education', 'family_income', 'urban', 'female']

print(f'\nMroz: {selection_mroz.sum()}/{len(selection_mroz)} selected ({selection_mroz.mean():.1%})')
print(f'College: {selection_college.sum()}/{len(selection_college)} selected ({selection_college.mean():.1%})')

---

## Exercise 1: Evaluate Candidate Instruments (Conceptual)

For each proposed exclusion restriction below, evaluate whether it satisfies:
- **Relevance**: Does it plausibly affect selection?
- **Validity**: Can we argue it does NOT directly affect the outcome?

| Application | Selection | Outcome | Proposed Instrument |
|---|---|---|---|
| (a) Female labor supply | Work vs not work | Hourly wage | Husband's age |
| (b) College wage premium | Attend college | Post-college wage | SAT score |
| (c) Union wage gap | Union member | Log wage | State right-to-work law |
| (d) Training program | Participate in training | Quarterly earnings | Distance to training site |
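
Both conditions can be read off the standard selection model. As a sketch in generic notation (the symbols $x$, $z$, $w$ are illustrative and not tied to the datasets), with $(u, \varepsilon)$ bivariate normal, $\mathrm{Var}(u) = 1$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and correlation $\rho$:

$$
s = \mathbf{1}[z'\gamma + u > 0], \qquad y = x'\beta + \varepsilon \quad \text{(observed only if } s = 1\text{)}
$$

$$
E[y \mid x, z, s = 1] = x'\beta + \rho\sigma\,\lambda(z'\gamma), \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
$$

An instrument $w$ enters $z$ but not $x$. Relevance means its coefficient in $\gamma$ is nonzero, so $\lambda(z'\gamma)$ varies independently of $x$. Validity means $w$ has no direct effect on $y$; if it does, that effect is absorbed by the $\lambda$ term and the selection correction is contaminated.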

### Solution: Exercise 1

---

#### (a) Husband's age as instrument for female labor supply

**Relevance**: Moderate. Husband's age is correlated with husband's income and career stage,
which affect the household's financial need for the wife to work. Older husbands may have
higher incomes, reducing the wife's need to participate. However, the effect is indirect
and may be weak once husband's income is controlled for directly.

**Validity**: Questionable. Husband's age is likely correlated with the wife's own age
(assortative mating), which in turn correlates with her experience and human capital
accumulation. Since experience affects wages directly, husband's age could violate
the exclusion restriction by proxying for the wife's own age/experience.

**Assessment**: **Weak/problematic instrument.** While it has some relevance through the
income channel, the strong correlation with wife's age creates a plausible direct channel
to wages. Better alternatives exist (number of young children, husband's income).

---

#### (b) SAT score as instrument for college wage premium

**Relevance**: Strong. SAT scores are highly predictive of college attendance. Higher SAT
scores increase the probability of admission and enrollment.

**Validity**: **FAILS.** SAT scores are a direct measure of academic ability and cognitive
skill, which are rewarded in the labor market independently of college attendance. Employers
value the skills that SAT scores reflect (analytical reasoning, quantitative ability), so
SAT scores almost certainly affect wages directly. This is a classic example of an instrument
that is relevant but invalid.

**Assessment**: **Invalid instrument.** SAT scores directly affect wages through the ability
channel. Using SAT as an exclusion restriction would conflate the selection correction with
the ability premium, producing biased estimates of the college wage premium.

---

#### (c) State right-to-work law for union wage gap

**Relevance**: Strong. Right-to-work laws prohibit mandatory union membership as a
condition of employment. States with such laws have significantly lower unionization rates
(about 5-7% vs. 12-15% in non-RTW states). This creates strong, policy-driven variation
in union membership.

**Validity**: Debatable. The key concern is whether right-to-work laws affect wages through
channels other than individual union membership. If RTW laws reduce overall union density,
they may weaken unions' ability to bargain for all workers (including non-members through
spillover effects), which would violate the exclusion restriction. Also, RTW laws may
correlate with other state-level policies (taxes, regulation) that affect wages.

**Assessment**: **Reasonable but imperfect instrument.** The relevance condition is strongly
satisfied. Validity depends on conditioning on state-level controls: with controls for state
economic conditions, RTW laws become a more credible instrument. State fixed effects help
only in panel data where RTW status changes over time, since a time-invariant state law is
absorbed by them. Best used in combination with other instruments.

---

#### (d) Distance to training site for training program

**Relevance**: Moderate to strong. Greater distance to the training site increases
participation costs (travel time, transportation expenses), reducing the probability
of participation. This follows the logic of Card (1995), who used proximity to a college
as an instrument for schooling.

**Validity**: Generally plausible, but requires careful consideration. The main concern is
that distance may correlate with local labor market conditions. Workers in rural areas
(farther from training sites) may face different wage structures than urban workers.
However, conditional on observable labor market characteristics (urban/rural, industry,
local unemployment rate), the remaining variation in distance is plausibly exogenous
to individual earnings potential.

**Assessment**: **Good instrument.** This is one of the most credible types of exclusion
restrictions in program evaluation. The economic argument is clear: distance shifts the
cost of participation without directly affecting a worker's productivity. Recommended
to include controls for local labor market conditions to strengthen the exclusion
restriction.

In [None]:
# Summary table for Exercise 1
assessment_df = pd.DataFrame({
    'Application': [
        '(a) Female labor supply',
        '(b) College wage premium',
        '(c) Union wage gap',
        '(d) Training program'
    ],
    'Instrument': [
        "Husband's age",
        'SAT score',
        'Right-to-work law',
        'Distance to site'
    ],
    'Relevance': [
        'Moderate (indirect via income)',
        'Strong (predicts enrollment)',
        'Strong (policy-driven)',
        'Moderate-Strong (cost channel)'
    ],
    'Validity': [
        'Questionable (corr. with wife age)',
        'FAILS (ability -> wages)',
        'Debatable (spillover effects)',
        'Plausible (with controls)'
    ],
    'Verdict': [
        'Weak / Problematic',
        'INVALID',
        'Reasonable (with caveats)',
        'GOOD'
    ]
})

print('=' * 90)
print('  EXERCISE 1: ASSESSMENT OF CANDIDATE EXCLUSION RESTRICTIONS')
print('=' * 90)
print()
print(assessment_df.to_string(index=False))

print('\n' + '=' * 90)
print('Key Lesson: A valid exclusion restriction must satisfy BOTH conditions:')
print('  1. RELEVANCE: It must significantly predict selection')
print('  2. VALIDITY:  It must NOT directly affect the outcome')
print('The SAT score example shows that strong relevance is not sufficient;')
print('validity is equally (or more) important.')

---

## Exercise 2: Implement and Compare Specifications (Hands-On)

Using the Mroz dataset, implement the following three models and compare:

1. **Model A**: Exclusion restrictions = `children_lt6` + `husband_income`
2. **Model B**: Exclusion restrictions = `children_6_18` + `husband_income`
3. **Model C**: Exclusion restrictions = `age` only

**Tasks**:
- Estimate all three models
- Create a comparison table of outcome coefficients and selection parameters
- Which specification produces the most stable results? Why?
- Run the LR relevance test for each set of exclusion restrictions

### Solution: Exercise 2

In [None]:
# Step 1: Define the selection equation variable matrices for each model

# Model A: X + children_lt6 + husband_income
Z_A = sm.add_constant(
    mroz[['education', 'experience', 'experience_sq',
          'children_lt6', 'husband_income']].values
)
Z_A_names = ['const', 'education', 'experience', 'experience_sq',
             'children_lt6', 'husband_income']

# Model B: X + children_6_18 + husband_income
Z_B = sm.add_constant(
    mroz[['education', 'experience', 'experience_sq',
          'children_6_18', 'husband_income']].values
)
Z_B_names = ['const', 'education', 'experience', 'experience_sq',
             'children_6_18', 'husband_income']

# Model C: X + age (single exclusion restriction)
Z_C = sm.add_constant(
    mroz[['education', 'experience', 'experience_sq', 'age']].values
)
Z_C_names = ['const', 'education', 'experience', 'experience_sq', 'age']

print('Model A exclusion restrictions: children_lt6, husband_income')
print(f'  Z_A shape: {Z_A.shape}')
print('Model B exclusion restrictions: children_6_18, husband_income')
print(f'  Z_B shape: {Z_B.shape}')
print('Model C exclusion restriction:  age only')
print(f'  Z_C shape: {Z_C.shape}')

In [None]:
# Step 2: Estimate each model using PanelHeckman

# Model A
model_A = PanelHeckman(
    endog=y_mroz,
    exog=X_mroz,
    selection=selection_mroz,
    exog_selection=Z_A,
    method='two_step'
)
result_A = model_A.fit()

# Model B
model_B = PanelHeckman(
    endog=y_mroz,
    exog=X_mroz,
    selection=selection_mroz,
    exog_selection=Z_B,
    method='two_step'
)
result_B = model_B.fit()

# Model C
model_C = PanelHeckman(
    endog=y_mroz,
    exog=X_mroz,
    selection=selection_mroz,
    exog_selection=Z_C,
    method='two_step'
)
result_C = model_C.fit()

print('All three models estimated successfully.')
print(f'\nModel A - rho: {result_A.rho:.4f}, sigma: {result_A.sigma:.4f}')
print(f'Model B - rho: {result_B.rho:.4f}, sigma: {result_B.sigma:.4f}')
print(f'Model C - rho: {result_C.rho:.4f}, sigma: {result_C.sigma:.4f}')

In [None]:
# Step 3: Create comparison table of outcome coefficients and selection parameters

comparison = pd.DataFrame({
    'Variable': X_mroz_names,
    'Model A (children_lt6 + husb_inc)': result_A.outcome_params,
    'Model B (children_6_18 + husb_inc)': result_B.outcome_params,
    'Model C (age only)': result_C.outcome_params,
}).set_index('Variable')

# Add selection parameters
sel_params = pd.DataFrame({
    'Variable': ['sigma', 'rho', 'lambda (rho*sigma)'],
    'Model A (children_lt6 + husb_inc)': [
        result_A.sigma, result_A.rho, result_A.rho * result_A.sigma
    ],
    'Model B (children_6_18 + husb_inc)': [
        result_B.sigma, result_B.rho, result_B.rho * result_B.sigma
    ],
    'Model C (age only)': [
        result_C.sigma, result_C.rho, result_C.rho * result_C.sigma
    ],
}).set_index('Variable')

full_comparison = pd.concat([comparison, sel_params])

print('=' * 80)
print('  EXERCISE 2: COMPARISON OF THREE EXCLUSION RESTRICTION SPECIFICATIONS')
print('=' * 80)
print()
print(full_comparison.round(4).to_string())

# Compute coefficient stability metrics
print('\n' + '-' * 80)
print('Stability Metrics:')
for var in X_mroz_names:
    vals = full_comparison.loc[var].values.astype(float)
    print(f'  {var:20s} - Range: {vals.max() - vals.min():.4f}, '
          f'Std: {vals.std():.4f}')

rho_vals = full_comparison.loc['rho'].values.astype(float)
print(f'  {"rho":20s} - Range: {rho_vals.max() - rho_vals.min():.4f}, '
      f'Std: {rho_vals.std():.4f}')

In [None]:
# Step 3 (continued): Visualization of coefficient comparison

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Outcome equation coefficients (excluding constant for scale)
vars_to_plot = ['education', 'experience', 'experience_sq']
idx_plot = [X_mroz_names.index(v) for v in vars_to_plot]

x_pos = np.arange(len(vars_to_plot))
width = 0.25

axes[0].bar(x_pos - width, [result_A.outcome_params[i] for i in idx_plot],
            width, label='Model A (children_lt6 + husb_inc)', color='#27ae60', alpha=0.8)
axes[0].bar(x_pos, [result_B.outcome_params[i] for i in idx_plot],
            width, label='Model B (children_6_18 + husb_inc)', color='#2980b9', alpha=0.8)
axes[0].bar(x_pos + width, [result_C.outcome_params[i] for i in idx_plot],
            width, label='Model C (age only)', color='#e74c3c', alpha=0.8)

axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(vars_to_plot, fontsize=11)
axes[0].set_ylabel('Coefficient Value', fontsize=12)
axes[0].set_title('Outcome Equation: Coefficients\nAcross Three Specifications', fontsize=13)
axes[0].legend(fontsize=8, loc='best')
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axhline(y=0, color='black', linewidth=0.8)

# Right: Selection parameters (rho, sigma, lambda)
sel_labels = [r'$\sigma$', r'$\rho$', r'$\lambda = \rho\sigma$']
sel_A = [result_A.sigma, result_A.rho, result_A.rho * result_A.sigma]
sel_B = [result_B.sigma, result_B.rho, result_B.rho * result_B.sigma]
sel_C = [result_C.sigma, result_C.rho, result_C.rho * result_C.sigma]

x_pos2 = np.arange(len(sel_labels))
axes[1].bar(x_pos2 - width, sel_A, width, label='Model A', color='#27ae60', alpha=0.8)
axes[1].bar(x_pos2, sel_B, width, label='Model B', color='#2980b9', alpha=0.8)
axes[1].bar(x_pos2 + width, sel_C, width, label='Model C', color='#e74c3c', alpha=0.8)

axes[1].set_xticks(x_pos2)
axes[1].set_xticklabels(sel_labels, fontsize=12)
axes[1].set_ylabel('Parameter Value', fontsize=12)
axes[1].set_title('Selection Parameters\nAcross Three Specifications', fontsize=13)
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].axhline(y=0, color='black', linewidth=0.8)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex2_specification_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Step 4: Likelihood Ratio relevance test for each specification

def probit_log_likelihood(gamma, Z, selection):
    """Compute probit log-likelihood."""
    linear_pred = Z @ gamma
    prob = stats.norm.cdf(linear_pred)
    prob = np.clip(prob, 1e-10, 1 - 1e-10)
    return np.sum(selection * np.log(prob) + (1 - selection) * np.log(1 - prob))

def neg_probit_llf(gamma, Z, sel):
    """Negative probit log-likelihood for minimization."""
    return -probit_log_likelihood(gamma, Z, sel)

# Restricted probit: selection equation with only X (no exclusion restrictions)
res_restricted = sp_minimize(
    neg_probit_llf, np.zeros(X_mroz.shape[1]),
    args=(X_mroz, selection_mroz), method='BFGS'
)
ll_restricted = -res_restricted.fun

print('=' * 80)
print('  EXERCISE 2: LIKELIHOOD RATIO RELEVANCE TESTS')
print('=' * 80)
print(f'\nRestricted log-likelihood (Z = X, no exclusions): {ll_restricted:.4f}')
print()

for name, result, Z_spec in [
    ('Model A (children_lt6 + husb_inc)', result_A, Z_A),
    ('Model B (children_6_18 + husb_inc)', result_B, Z_B),
    ('Model C (age only)', result_C, Z_C),
]:
    # Unrestricted log-likelihood
    ll_unrestricted = probit_log_likelihood(result.probit_params, Z_spec, selection_mroz)
    
    # Number of exclusion restrictions
    n_excl = Z_spec.shape[1] - X_mroz.shape[1]
    
    # LR statistic
    lr_stat = 2 * (ll_unrestricted - ll_restricted)
    lr_pvalue = stats.chi2.sf(lr_stat, df=n_excl)  # survival function avoids rounding to 0 for large LR
    
    print(f'{name}:')
    print(f'  Unrestricted LL:  {ll_unrestricted:.4f}')
    print(f'  LR statistic:     {lr_stat:.4f}')
    print(f'  df:               {n_excl}')
    print(f'  p-value:          {lr_pvalue:.6f}')
    print(f'  Verdict:          {"RELEVANT (reject H0)" if lr_pvalue < 0.05 else "WEAK (fail to reject)"}')
    print()

In [None]:
# Step 5: Discussion and conclusions

print('=' * 80)
print('  EXERCISE 2: DISCUSSION')
print('=' * 80)
print()
print('Which specification produces the most stable results?')
print('-' * 60)
print()
print('Model A (children_lt6 + husband_income) is likely the BEST specification:')
print()
print('  1. children_lt6 is the strongest exclusion restriction because young children')
print('     have a large, well-documented effect on mothers\' labor supply through the')
print('     childcare cost channel. This effect is strong and robust.')
print()
print('  2. husband_income provides an additional source of identifying variation')
print('     through the household income effect on the participation decision.')
print()
print('  3. Having TWO exclusion restrictions allows for over-identification checks')
print('     and generally produces more stable estimates than a single instrument.')
print()
print('Model B (children_6_18 + husband_income) is similar but slightly weaker:')
print('  - School-age children (6-18) have a weaker effect on labor supply than')
print('    young children (< 6), because school provides free childcare.')
print()
print('Model C (age only) is the WEAKEST specification:')
print('  - Only one exclusion restriction (just-identified: no over-identification check)')
print('  - Age is correlated with experience and experience_sq, which are in the')
print('    outcome equation. This creates a potential validity concern.')
print('  - The exclusion restriction is that age affects selection through factors')
print('    other than experience (e.g., social norms about working women by age),')
print('    but this argument is weaker than the childcare cost argument.')

---

## Exercise 3: Collinearity Diagnostic (Intermediate)

For the College Wage dataset:

1. Estimate the Heckman model with and without exclusion restrictions
2. For each specification, compute the correlation matrix between the IMR and each X variable (for selected observations only)
3. Create a heatmap visualization of both correlation matrices side by side
4. Compute the condition number of the augmented design matrix $[X, \lambda]$ for both cases
5. Discuss: how does adding exclusion restrictions reduce the condition number?
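
The mechanism behind steps 4 and 5 can be previewed on synthetic data. The sketch below makes several simplifying assumptions that are not part of the exercise: the probit indices are taken as known instead of estimated, the full sample is used instead of the selected sample, and columns are scaled to unit length so the condition number does not depend on measurement units. The helper names `scaled_cond` and `imr` and the variable `w` are hypothetical.

```python
import numpy as np
from scipy import stats


def scaled_cond(M):
    """Condition number after scaling each column to unit Euclidean length."""
    return np.linalg.cond(M / np.linalg.norm(M, axis=0))


def imr(index):
    """Inverse Mills ratio phi(c) / Phi(c)."""
    return stats.norm.pdf(index) / stats.norm.cdf(index)


rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
w = rng.standard_normal(n)             # hypothetical excluded variable

X = np.column_stack([np.ones(n), x])
imr_no = imr(0.3 * x)                  # index uses x only (Z = X)
imr_with = imr(0.3 * x + 0.8 * w)      # index uses x and w (Z = [X, w])

cond_no = scaled_cond(np.column_stack([X, imr_no]))
cond_with = scaled_cond(np.column_stack([X, imr_with]))
print(f'Scaled condition number without exclusion: {cond_no:.1f}')
print(f'Scaled condition number with exclusion:    {cond_with:.1f}')
```

Without `w`, the IMR is a smooth and nearly linear function of `x`, so the third column almost lies in the span of the first two. With `w`, most of the IMR's variation comes from a variable that is absent from `X`.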

### Solution: Exercise 3

In [None]:
# Step 1: Estimate Heckman model WITH exclusion restrictions

Z_college_full = sm.add_constant(
    college[['ability', 'parent_education', 'family_income',
             'urban', 'female', 'distance_college', 'tuition']].values
)
Z_college_full_names = ['const', 'ability', 'parent_education', 'family_income',
                        'urban', 'female', 'distance_college', 'tuition']

model_with_excl = PanelHeckman(
    endog=y_college,
    exog=X_college,
    selection=selection_college,
    exog_selection=Z_college_full,
    method='two_step'
)
result_with_excl = model_with_excl.fit()

# Estimate Heckman model WITHOUT exclusion restrictions (Z = X)
model_no_excl = PanelHeckman(
    endog=y_college,
    exog=X_college,
    selection=selection_college,
    exog_selection=X_college,  # Z = X: no exclusion restrictions!
    method='two_step'
)
result_no_excl = model_no_excl.fit()

print('Both models estimated.')
print(f'\nWith exclusion restrictions: rho = {result_with_excl.rho:.4f}')
print(f'Without exclusion restrictions: rho = {result_no_excl.rho:.4f}')

In [None]:
# Step 2: Compute IMR for both specifications

# With exclusion restrictions
Zg_with = Z_college_full @ result_with_excl.probit_params
Phi_with = stats.norm.cdf(Zg_with)
imr_with = stats.norm.pdf(Zg_with) / np.clip(Phi_with, 1e-10, None)

# Without exclusion restrictions
Xg_no = X_college @ result_no_excl.probit_params
Phi_no = stats.norm.cdf(Xg_no)
imr_no = stats.norm.pdf(Xg_no) / np.clip(Phi_no, 1e-10, None)

# Mask for selected observations
sel_mask = selection_college == 1

print(f'IMR with exclusion - Mean: {imr_with[sel_mask].mean():.4f}, Std: {imr_with[sel_mask].std():.4f}')
print(f'IMR without exclusion - Mean: {imr_no[sel_mask].mean():.4f}, Std: {imr_no[sel_mask].std():.4f}')

In [None]:
# Step 3: Compute correlation matrices between IMR and X variables

# Variable names (excluding constant for correlation analysis)
var_names = ['ability', 'parent_education', 'family_income', 'urban', 'female']
var_indices = list(range(1, X_college.shape[1]))  # skip constant (column 0)

# Build correlation DataFrames
# WITH exclusion restrictions
data_with = pd.DataFrame(
    X_college[sel_mask, 1:],  # exclude constant
    columns=var_names
)
data_with['IMR'] = imr_with[sel_mask]
corr_with = data_with.corr()

# WITHOUT exclusion restrictions
data_no = pd.DataFrame(
    X_college[sel_mask, 1:],  # exclude constant
    columns=var_names
)
data_no['IMR'] = imr_no[sel_mask]
corr_no = data_no.corr()

# Print IMR correlations
print('=== Correlation of IMR with X Variables (Selected Sample) ===')
print()
print(f'{"Variable":25s} {"With Exclusion":>15s} {"Without Exclusion":>18s} {"Difference":>12s}')
print('-' * 75)
for var in var_names:
    r_with = corr_with.loc[var, 'IMR']
    r_no = corr_no.loc[var, 'IMR']
    flag = ' *** HIGH' if abs(r_no) > 0.7 else ''
    print(f'{var:25s} {r_with:15.4f} {r_no:18.4f} {abs(r_no) - abs(r_with):12.4f}{flag}')

print()
print('*** HIGH marks correlations above 0.7 in absolute value (without exclusion).')
print('\nConclusion: Without exclusion restrictions, the IMR is more')
print('correlated with X variables, indicating worse multicollinearity.')

In [None]:
# Step 3 (continued): Create side-by-side heatmaps

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap: WITH exclusion restrictions
sns.heatmap(
    corr_with, annot=True, fmt='.3f', cmap='RdBu_r',
    center=0, vmin=-1, vmax=1, square=True,
    ax=axes[0], linewidths=0.5,
    cbar_kws={'shrink': 0.8}
)
axes[0].set_title('WITH Exclusion Restrictions\n(distance_college, tuition)', fontsize=13)
axes[0].tick_params(axis='both', labelsize=9)

# Heatmap: WITHOUT exclusion restrictions
sns.heatmap(
    corr_no, annot=True, fmt='.3f', cmap='RdBu_r',
    center=0, vmin=-1, vmax=1, square=True,
    ax=axes[1], linewidths=0.5,
    cbar_kws={'shrink': 0.8}
)
axes[1].set_title('WITHOUT Exclusion Restrictions\n(Z = X)', fontsize=13)
axes[1].tick_params(axis='both', labelsize=9)

plt.suptitle('Correlation Matrices: X Variables and IMR\n(College Wage Data, Selected Sample)',
             fontsize=14, fontweight='bold', y=1.03)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex3_collinearity_heatmaps.png', dpi=150, bbox_inches='tight')
plt.show()

print('Notice the IMR row/column:')
print('  - LEFT (with exclusion): IMR has moderate correlations with X variables')
print('  - RIGHT (without exclusion): IMR has stronger correlations, indicating collinearity')

In [None]:
# Step 4: Compute condition numbers of the augmented design matrix [X, lambda]

# Augmented matrix: [X, IMR] for selected observations
X_sel = X_college[sel_mask]

# With exclusion restrictions
X_aug_with = np.column_stack([X_sel, imr_with[sel_mask]])
cond_with = np.linalg.cond(X_aug_with)

# Without exclusion restrictions
X_aug_no = np.column_stack([X_sel, imr_no[sel_mask]])
cond_no = np.linalg.cond(X_aug_no)

# Also compute condition number of X alone (baseline)
cond_X = np.linalg.cond(X_sel)

print('=' * 70)
print('  EXERCISE 3: CONDITION NUMBER ANALYSIS')
print('=' * 70)
print()
print(f'{"Matrix":40s} {"Condition Number":>18s}')
print('-' * 62)
print(f'{"X alone (baseline)":40s} {cond_X:18.2f}')
print(f'{"[X, IMR] WITH exclusion restrictions":40s} {cond_with:18.2f}')
print(f'{"[X, IMR] WITHOUT exclusion restrictions":40s} {cond_no:18.2f}')

print(f'\nRatio (without / with): {cond_no / cond_with:.2f}x')

print()
print('Interpretation:')
print('  - Rule of thumb (for columns on comparable scales): a condition number')
print('    > 30 suggests moderate and > 100 severe multicollinearity')
print('  - Adding the IMR column increases the condition number in both cases,')
print('    but the increase is MUCH LARGER without exclusion restrictions.')
print('  - With exclusion restrictions, the IMR has independent variation from')
print('    the excluded variables (distance, tuition), keeping collinearity in check.')
print('  - Without exclusion restrictions, the IMR is nearly a linear function of X,')
print('    causing the condition number to explode.')

In [None]:
# Step 5: Visualize the collinearity problem with scatter plots

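# Compute X'beta for the selected sample. The outcome estimates from the model
# WITH exclusion restrictions are used for both panels, so the x-axis is identical.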
Xb = X_sel @ result_with_excl.outcome_params

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# With exclusion restrictions
corr_xb_with = np.corrcoef(Xb, imr_with[sel_mask])[0, 1]
axes[0].scatter(Xb, imr_with[sel_mask], alpha=0.3, s=15, color='#27ae60')
axes[0].set_xlabel(r"$X'\hat{\beta}$ (outcome linear predictor)", fontsize=12)
axes[0].set_ylabel(r"$\lambda(Z'\hat{\gamma})$ (IMR)", fontsize=12)
axes[0].set_title(f'WITH Exclusion Restrictions\n'
                   f'Corr(X\'b, IMR) = {corr_xb_with:.3f}', fontsize=13)
axes[0].grid(True, alpha=0.3)

# Without exclusion restrictions
corr_xb_no = np.corrcoef(Xb, imr_no[sel_mask])[0, 1]
axes[1].scatter(Xb, imr_no[sel_mask], alpha=0.3, s=15, color='#e74c3c')
axes[1].set_xlabel(r"$X'\hat{\beta}$ (outcome linear predictor)", fontsize=12)
axes[1].set_ylabel(r"$\lambda(X'\hat{\gamma})$ (IMR)", fontsize=12)
axes[1].set_title(f'WITHOUT Exclusion Restrictions\n'
                   f'Corr(X\'b, IMR) = {corr_xb_no:.3f}', fontsize=13)
axes[1].grid(True, alpha=0.3)

plt.suptitle('Collinearity Diagnostic: IMR vs Outcome Linear Predictor\n(College Wage Data)',
             fontsize=14, fontweight='bold', y=1.04)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex3_imr_scatter.png', dpi=150, bbox_inches='tight')
plt.show()

print(f'Correlation(X\'b, IMR) with exclusion:    {corr_xb_with:.4f}')
print(f'Correlation(X\'b, IMR) without exclusion: {corr_xb_no:.4f}')
print()
print('The right panel shows a much tighter relationship between the IMR and')
print('the linear predictor, confirming the collinearity problem.')
print('The left panel shows more dispersion, indicating that the exclusion')
print('restrictions provide independent variation in the IMR.')

In [None]:
# Step 5 (continued): Summary discussion

print('=' * 80)
print('  EXERCISE 3: SUMMARY DISCUSSION')
print('=' * 80)
print()
print('How do exclusion restrictions reduce the condition number?')
print('-' * 60)
print()
print('1. MATHEMATICAL MECHANISM:')
print('   Without exclusion restrictions (Z = X), the IMR is computed as:')
print('     lambda = phi(X\'gamma) / Phi(X\'gamma)')
print('   Since lambda(.) is approximately linear over typical data ranges,')
print('     lambda \u2248 a + b * X\'gamma = a + b * (X_1*g_1 + X_2*g_2 + ...)')
print('   This makes the IMR column nearly a linear combination of the X columns,')
print('   inflating the condition number.')
print()
print('2. WITH EXCLUSION RESTRICTIONS:')
print('   The IMR is computed from a broader set of variables Z = [X, W]:')
print('     lambda = phi(X\'g_1 + W\'g_2) / Phi(X\'g_1 + W\'g_2)')
print('   The variation in W generates variation in lambda that is NOT a linear')
print('   function of X alone. This "breaks" the collinearity.')
print()
print('3. PRACTICAL CONSEQUENCE:')
print(f'   Condition number with exclusion:    {cond_with:.1f}')
print(f'   Condition number without exclusion: {cond_no:.1f}')
print(f'   Ratio: {cond_no/cond_with:.1f}x worse without exclusion')
print()
print('4. IMPLICATIONS FOR INFERENCE:')
print('   Higher condition numbers lead to:')
print('   - Larger standard errors, especially for the IMR coefficient and the')
print('     X coefficients most collinear with it')
print('   - Numerical instability in OLS computation')
print('   - Sensitivity of estimates to small changes in the data')
print('   - Unreliable hypothesis tests')

---

## Exercise 4: Monte Carlo Simulation (Advanced)

Design a Monte Carlo experiment to demonstrate the importance of exclusion restrictions:

1. Generate data from a known DGP with:
   - True $\beta = [1.0, 0.5]$ (outcome equation)
   - True $\rho = 0.5$ (selection correlation)
   - A valid exclusion restriction $W$ that affects selection but not the outcome

2. For 200 replications, estimate the Heckman model:
   - (a) With the exclusion restriction
   - (b) Without the exclusion restriction

3. Compare the sampling distributions of $\hat{\beta}_1$, $\hat{\rho}$, and $\hat{\sigma}$

4. Create histograms showing the distributions and mark the true parameter values
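
For step 3, the sampling distributions can also be compared numerically through bias, sampling standard deviation, and RMSE. The helper below is a minimal sketch. The name `summarize_mc` and the convention of storing failed replications as NaN are assumptions, and the input values are toy numbers, not simulation output.

```python
import numpy as np


def summarize_mc(estimates, truth):
    """Bias, sampling SD, and RMSE of Monte Carlo estimates around a known truth."""
    est = np.asarray(estimates, dtype=float)
    est = est[np.isfinite(est)]  # drop failed replications stored as NaN
    return {
        'n_valid': int(est.size),
        'mean': est.mean(),
        'bias': est.mean() - truth,
        'sd': est.std(ddof=1),
        'rmse': np.sqrt(np.mean((est - truth) ** 2)),
    }


# Toy input: two usable estimates and one failed replication
summary = summarize_mc([0.4, 0.6, np.nan], truth=0.5)
for key, value in summary.items():
    print(f'{key:8s} {value:.4f}')
```

Applied to the stored `beta1`, `rho`, and `sigma` estimates from each specification, this gives a compact table to read alongside the histograms.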

### Solution: Exercise 4

In [None]:
# Step 1: Define the Data Generating Process (DGP)

def generate_heckman_data(n, beta, gamma_x, gamma_w, rho, sigma, seed=None):
    """
    Generate data from a Heckman selection model DGP.
    
    Outcome equation:  y = beta[0] + beta[1] * x + epsilon
    Selection equation: s* = gamma_x[0] + gamma_x[1] * x + gamma_w * w + u
    s = 1 if s* > 0
    y observed only if s = 1
    
    (u, epsilon) ~ Bivariate Normal with correlation rho
    
    Parameters
    ----------
    n : int
        Sample size
    beta : array-like
        [intercept, slope] for outcome equation
    gamma_x : array-like
        [intercept, slope_x] for selection equation (x also enters the outcome equation)
    gamma_w : float
        Coefficient on exclusion restriction W in selection equation
    rho : float
        Correlation between selection and outcome errors
    sigma : float
        Standard deviation of outcome error
    seed : int, optional
        Random seed
    
    Returns
    -------
    dict with keys x, w, y, y_latent, selection, u, epsilon
    """
    rng = np.random.RandomState(seed)  # seed=None falls back to OS entropy
    
    # Generate covariates
    x = rng.randn(n)  # observed in both equations
    w = rng.randn(n)  # exclusion restriction: only in selection equation
    
    # Generate correlated errors (u, epsilon)
    # u ~ N(0, 1) for probit, epsilon ~ N(0, sigma^2)
    cov_matrix = np.array([
        [1.0, rho * sigma],
        [rho * sigma, sigma**2]
    ])
    errors = rng.multivariate_normal([0, 0], cov_matrix, size=n)
    u = errors[:, 0]
    epsilon = errors[:, 1]
    
    # Selection equation: s* = gamma_x[0] + gamma_x[1]*x + gamma_w*w + u
    s_star = gamma_x[0] + gamma_x[1] * x + gamma_w * w + u
    selection = (s_star > 0).astype(float)
    
    # Outcome equation: y = beta[0] + beta[1]*x + epsilon
    y_latent = beta[0] + beta[1] * x + epsilon
    
    # Observed outcome (0 for non-selected, actual value for selected)
    y_obs = np.where(selection == 1, y_latent, 0.0)
    
    return {
        'x': x,
        'w': w,
        'y': y_obs,
        'y_latent': y_latent,
        'selection': selection,
        'u': u,
        'epsilon': epsilon,
    }

# Define true parameters
TRUE_BETA = np.array([1.0, 0.5])       # outcome: intercept=1.0, slope=0.5
TRUE_GAMMA_X = np.array([0.0, 0.3])    # selection: intercept=0, slope_x=0.3
TRUE_GAMMA_W = 0.8                      # exclusion restriction coefficient
TRUE_RHO = 0.5                          # selection correlation
TRUE_SIGMA = 1.0                        # outcome error std dev

# Test the DGP with a single draw
test_data = generate_heckman_data(
    n=1000, beta=TRUE_BETA, gamma_x=TRUE_GAMMA_X,
    gamma_w=TRUE_GAMMA_W, rho=TRUE_RHO, sigma=TRUE_SIGMA, seed=42
)

print('DGP Test (single draw, n=1000):')
print(f'  Selection rate: {test_data["selection"].mean():.1%}')
print(f'  Mean latent y: {test_data["y_latent"].mean():.3f}')
print(f'  Mean observed y (selected): {test_data["y"][test_data["selection"]==1].mean():.3f}')
print(f'  Corr(u, epsilon): {np.corrcoef(test_data["u"], test_data["epsilon"])[0,1]:.3f}')
print(f'  True rho: {TRUE_RHO}')
print('\nTrue parameters:')
print(f'  beta = {TRUE_BETA}')
print(f'  rho = {TRUE_RHO}, sigma = {TRUE_SIGMA}')
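
The single-draw printout can be compared with closed-form values implied by the DGP. Because `x`, `w`, and `u` are independent standard normals, `s*` is normal with mean `gamma_x[0]` and variance `tau^2 = 1 + gamma_x[1]^2 + gamma_w^2`, so `P(s = 1) = Phi(gamma_x[0] / tau)`, and with `rho > 0` the mean of `y` among selected observations lies above the latent mean. The sketch below is self-contained: it re-simulates the DGP instead of calling `generate_heckman_data`, and its variable names (`g0`, `g1`, `gw`, `tau`, ...) are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from scipy import stats

# Same parameter values as the notebook's TRUE_* constants
beta0, beta1 = 1.0, 0.5
g0, g1, gw = 0.0, 0.3, 0.8
rho, sigma = 0.5, 1.0

# s* = g0 + g1*x + gw*w + u is normal with mean g0 and standard deviation tau
tau = np.sqrt(1.0 + g1**2 + gw**2)
a = g0 / tau
mills = stats.norm.pdf(a) / stats.norm.cdf(a)   # E[(s* - g0)/tau | s* > 0]

p_select = stats.norm.cdf(a)
# Cov(x, s*) = g1 and Cov(epsilon, s*) = rho*sigma, so by joint normality:
mean_x_sel = (g1 / tau) * mills
mean_eps_sel = (rho * sigma / tau) * mills
mean_y_sel = beta0 + beta1 * mean_x_sel + mean_eps_sel

# Simulation check with a large sample
rng = np.random.default_rng(0)
n = 200_000
x, w, u = rng.standard_normal((3, n))
# epsilon has sd sigma and Corr(u, epsilon) = rho
eps = sigma * (rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(n))
s = (g0 + g1 * x + gw * w + u) > 0
y = beta0 + beta1 * x + eps

print(f'P(s=1):   analytic {p_select:.3f}, simulated {s.mean():.3f}')
print(f'E[y|s=1]: analytic {mean_y_sel:.3f}, simulated {y[s].mean():.3f}')
```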

In [None]:
# Step 2: Monte Carlo simulation - 200 replications

n_reps = 200
n_obs = 500

# Storage for results
results_with_excl = {'beta0': [], 'beta1': [], 'rho': [], 'sigma': []}
results_without_excl = {'beta0': [], 'beta1': [], 'rho': [], 'sigma': []}
results_ols = {'beta0': [], 'beta1': []}  # OLS on selected sample (biased)

n_failed_with = 0
n_failed_without = 0

print(f'Running Monte Carlo simulation: {n_reps} replications, n={n_obs} each...')
print()

for rep in range(n_reps):
    if (rep + 1) % 50 == 0:
        print(f'  Replication {rep + 1}/{n_reps}...')
    
    # Generate data
    data = generate_heckman_data(
        n=n_obs, beta=TRUE_BETA, gamma_x=TRUE_GAMMA_X,
        gamma_w=TRUE_GAMMA_W, rho=TRUE_RHO, sigma=TRUE_SIGMA,
        seed=1000 + rep
    )
    
    x = data['x']
    w = data['w']
    y = data['y']
    sel = data['selection']
    
    # Design matrices
    X_mc = sm.add_constant(x)
    Z_with = sm.add_constant(np.column_stack([x, w]))  # X + exclusion
    Z_without = X_mc.copy()                            # X only (no exclusion)
    
    # --- OLS on selected sample (naive, biased) ---
    sel_mask_mc = sel == 1
    if sel_mask_mc.sum() > 5:  # need enough observations
        try:
            beta_ols = np.linalg.lstsq(X_mc[sel_mask_mc], y[sel_mask_mc], rcond=None)[0]
            results_ols['beta0'].append(beta_ols[0])
            results_ols['beta1'].append(beta_ols[1])
        except Exception:
            pass
    
    # --- Heckman WITH exclusion restriction ---
    try:
        model_w = PanelHeckman(
            endog=y, exog=X_mc, selection=sel,
            exog_selection=Z_with, method='two_step'
        )
        res_w = model_w.fit()
        results_with_excl['beta0'].append(res_w.outcome_params[0])
        results_with_excl['beta1'].append(res_w.outcome_params[1])
        results_with_excl['rho'].append(res_w.rho)
        results_with_excl['sigma'].append(res_w.sigma)
    except Exception:
        n_failed_with += 1
    
    # --- Heckman WITHOUT exclusion restriction ---
    try:
        model_wo = PanelHeckman(
            endog=y, exog=X_mc, selection=sel,
            exog_selection=Z_without, method='two_step'
        )
        res_wo = model_wo.fit()
        results_without_excl['beta0'].append(res_wo.outcome_params[0])
        results_without_excl['beta1'].append(res_wo.outcome_params[1])
        results_without_excl['rho'].append(res_wo.rho)
        results_without_excl['sigma'].append(res_wo.sigma)
    except Exception:
        n_failed_without += 1

print('\nSimulation complete!')
print(f'  Successful replications (with excl):    {len(results_with_excl["beta1"])}/{n_reps}')
print(f'  Successful replications (without excl): {len(results_without_excl["beta1"])}/{n_reps}')
print(f'  Failed (with excl):    {n_failed_with}')
print(f'  Failed (without excl): {n_failed_without}')
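
`PanelHeckman(..., method='two_step')` is used as a black box in the loop above. For readers who want to see the mechanics, the following is a minimal from-scratch sketch of the textbook Heckman (1979) two-step estimator for cross-sectional data, using only NumPy and SciPy. It is not PanelBox's implementation: `probit_fit` and `heckman_two_step` are hypothetical helper names, and the sketch returns point estimates only, without the corrected second-step standard errors. Note that the two-step estimate of rho is not constrained to lie in [-1, 1].

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize


def probit_fit(Z, s):
    """Probit MLE via BFGS on the mean negative log-likelihood."""
    def negloglik(g):
        idx = Z @ g
        return -(s * stats.norm.logcdf(idx) + (1 - s) * stats.norm.logcdf(-idx)).mean()
    return minimize(negloglik, np.zeros(Z.shape[1]), method='BFGS').x


def heckman_two_step(y, X, Z, s):
    """Return (beta, rho, sigma) from the textbook two-step procedure."""
    gamma = probit_fit(Z, s)                          # step 1: selection probit
    sel = s == 1
    idx = Z[sel] @ gamma
    imr = stats.norm.pdf(idx) / stats.norm.cdf(idx)   # inverse Mills ratio
    X_aug = np.column_stack([X[sel], imr])            # step 2: OLS with IMR added
    coef = np.linalg.lstsq(X_aug, y[sel], rcond=None)[0]
    resid = y[sel] - X_aug @ coef
    beta, beta_lambda = coef[:-1], coef[-1]
    delta = imr * (imr + idx)
    # sigma^2 = RSS/n_selected + mean(delta) * beta_lambda^2
    sigma = np.sqrt(resid @ resid / sel.sum() + delta.mean() * beta_lambda**2)
    return beta, beta_lambda / sigma, sigma


# Demo on simulated data with the notebook's true parameter values
rng = np.random.default_rng(1)
n = 20_000
x, w, u = rng.standard_normal((3, n))
eps = 0.5 * u + np.sqrt(0.75) * rng.standard_normal(n)   # rho = 0.5, sigma = 1
s = (0.3 * x + 0.8 * w + u > 0).astype(float)
y = 1.0 + 0.5 * x + eps                                   # only y[s == 1] is used
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x, w])

beta_hat, rho_hat, sigma_hat = heckman_two_step(y, X, Z, s)
print('beta:', beta_hat, 'rho:', rho_hat, 'sigma:', sigma_hat)
```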

In [None]:
# Step 3: Compare sampling distributions

# Convert to arrays
beta1_with = np.array(results_with_excl['beta1'])
beta1_without = np.array(results_without_excl['beta1'])
beta1_ols = np.array(results_ols['beta1'])

rho_with = np.array(results_with_excl['rho'])
rho_without = np.array(results_without_excl['rho'])

sigma_with = np.array(results_with_excl['sigma'])
sigma_without = np.array(results_without_excl['sigma'])

# Summary statistics
print('=' * 85)
print('  EXERCISE 4: MONTE CARLO RESULTS')
print('=' * 85)
print(f'  True parameters: beta_1 = {TRUE_BETA[1]}, rho = {TRUE_RHO}, sigma = {TRUE_SIGMA}')
print()

print(f'{"":35s} {"Mean":>10s} {"Bias":>10s} {"Std Dev":>10s} {"RMSE":>10s}')
print('-' * 80)

# beta_1
for label, arr, true_val in [
    ('beta_1 (Heckman WITH excl)', beta1_with, TRUE_BETA[1]),
    ('beta_1 (Heckman WITHOUT excl)', beta1_without, TRUE_BETA[1]),
    ('beta_1 (OLS, biased)', beta1_ols, TRUE_BETA[1]),
]:
    mean_val = arr.mean()
    bias = mean_val - true_val
    std_val = arr.std()
    rmse = np.sqrt(bias**2 + std_val**2)
    print(f'{label:35s} {mean_val:10.4f} {bias:10.4f} {std_val:10.4f} {rmse:10.4f}')

print()

# rho
for label, arr, true_val in [
    ('rho (Heckman WITH excl)', rho_with, TRUE_RHO),
    ('rho (Heckman WITHOUT excl)', rho_without, TRUE_RHO),
]:
    mean_val = arr.mean()
    bias = mean_val - true_val
    std_val = arr.std()
    rmse = np.sqrt(bias**2 + std_val**2)
    print(f'{label:35s} {mean_val:10.4f} {bias:10.4f} {std_val:10.4f} {rmse:10.4f}')

print()

# sigma
for label, arr, true_val in [
    ('sigma (Heckman WITH excl)', sigma_with, TRUE_SIGMA),
    ('sigma (Heckman WITHOUT excl)', sigma_without, TRUE_SIGMA),
]:
    mean_val = arr.mean()
    bias = mean_val - true_val
    std_val = arr.std()
    rmse = np.sqrt(bias**2 + std_val**2)
    print(f'{label:35s} {mean_val:10.4f} {bias:10.4f} {std_val:10.4f} {rmse:10.4f}')
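
With only 200 replications, a small estimated bias may be simulation noise. One way to judge this is the Monte Carlo standard error of the mean estimate, `std / sqrt(R)`. The helper below is a sketch, not part of the notebook: `mc_summary` is a hypothetical name, and the demo uses made-up draws rather than results from the simulation above. It also makes explicit that, with `ddof=0`, `RMSE^2 = bias^2 + std^2` holds exactly, which is the identity the table relies on.

```python
import numpy as np


def mc_summary(estimates, true_value):
    """Bias, spread, RMSE, and Monte Carlo standard error of the mean estimate."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_value
    sd = est.std()                                   # ddof=0, as in the table above
    rmse = np.sqrt(np.mean((est - true_value) ** 2))
    mc_se = est.std(ddof=1) / np.sqrt(est.size)      # simulation noise in the mean
    return {'bias': bias, 'sd': sd, 'rmse': rmse, 'mc_se': mc_se, 'z': bias / mc_se}


# Illustrative draws only (not results from the simulation above)
rng = np.random.default_rng(0)
fake_estimates = rng.normal(loc=0.52, scale=0.10, size=200)
summary = mc_summary(fake_estimates, true_value=0.5)
print({k: round(float(v), 4) for k, v in summary.items()})
```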

In [None]:
# Step 4: Create histograms showing the sampling distributions

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# ---- Row 1: beta_1 distributions ----

# (a) Heckman WITH exclusion
axes[0, 0].hist(beta1_with, bins=25, alpha=0.7, color='#27ae60',
                edgecolor='black', linewidth=0.5, density=True)
axes[0, 0].axvline(TRUE_BETA[1], color='red', linewidth=2.5, linestyle='--',
                    label=f'True $\\beta_1$ = {TRUE_BETA[1]}')
axes[0, 0].axvline(beta1_with.mean(), color='blue', linewidth=2, linestyle=':',
                    label=f'Mean = {beta1_with.mean():.3f}')
axes[0, 0].set_title('$\\hat{\\beta}_1$: Heckman WITH Exclusion', fontsize=12)
axes[0, 0].set_xlabel(r'$\hat{\beta}_1$', fontsize=11)
axes[0, 0].set_ylabel('Density', fontsize=11)
axes[0, 0].legend(fontsize=9)
axes[0, 0].grid(True, alpha=0.3)

# (b) Heckman WITHOUT exclusion
axes[0, 1].hist(beta1_without, bins=25, alpha=0.7, color='#e74c3c',
                edgecolor='black', linewidth=0.5, density=True)
axes[0, 1].axvline(TRUE_BETA[1], color='red', linewidth=2.5, linestyle='--',
                    label=f'True $\\beta_1$ = {TRUE_BETA[1]}')
axes[0, 1].axvline(beta1_without.mean(), color='blue', linewidth=2, linestyle=':',
                    label=f'Mean = {beta1_without.mean():.3f}')
axes[0, 1].set_title('$\\hat{\\beta}_1$: Heckman WITHOUT Exclusion', fontsize=12)
axes[0, 1].set_xlabel(r'$\hat{\beta}_1$', fontsize=11)
axes[0, 1].set_ylabel('Density', fontsize=11)
axes[0, 1].legend(fontsize=9)
axes[0, 1].grid(True, alpha=0.3)

# (c) OLS (biased)
axes[0, 2].hist(beta1_ols, bins=25, alpha=0.7, color='#f39c12',
                edgecolor='black', linewidth=0.5, density=True)
axes[0, 2].axvline(TRUE_BETA[1], color='red', linewidth=2.5, linestyle='--',
                    label=f'True $\\beta_1$ = {TRUE_BETA[1]}')
axes[0, 2].axvline(beta1_ols.mean(), color='blue', linewidth=2, linestyle=':',
                    label=f'Mean = {beta1_ols.mean():.3f}')
axes[0, 2].set_title('$\\hat{\\beta}_1$: OLS (No Selection Correction)', fontsize=12)
axes[0, 2].set_xlabel(r'$\hat{\beta}_1$', fontsize=11)
axes[0, 2].set_ylabel('Density', fontsize=11)
axes[0, 2].legend(fontsize=9)
axes[0, 2].grid(True, alpha=0.3)

# ---- Row 2: rho and sigma distributions ----

# (d) rho WITH exclusion
axes[1, 0].hist(rho_with, bins=25, alpha=0.7, color='#27ae60',
                edgecolor='black', linewidth=0.5, density=True)
axes[1, 0].axvline(TRUE_RHO, color='red', linewidth=2.5, linestyle='--',
                    label=f'True $\\rho$ = {TRUE_RHO}')
axes[1, 0].axvline(rho_with.mean(), color='blue', linewidth=2, linestyle=':',
                    label=f'Mean = {rho_with.mean():.3f}')
axes[1, 0].set_title('$\\hat{\\rho}$: Heckman WITH Exclusion', fontsize=12)
axes[1, 0].set_xlabel(r'$\hat{\rho}$', fontsize=11)
axes[1, 0].set_ylabel('Density', fontsize=11)
axes[1, 0].legend(fontsize=9)
axes[1, 0].grid(True, alpha=0.3)

# (e) rho WITHOUT exclusion
axes[1, 1].hist(rho_without, bins=25, alpha=0.7, color='#e74c3c',
                edgecolor='black', linewidth=0.5, density=True)
axes[1, 1].axvline(TRUE_RHO, color='red', linewidth=2.5, linestyle='--',
                    label=f'True $\\rho$ = {TRUE_RHO}')
axes[1, 1].axvline(rho_without.mean(), color='blue', linewidth=2, linestyle=':',
                    label=f'Mean = {rho_without.mean():.3f}')
axes[1, 1].set_title('$\\hat{\\rho}$: Heckman WITHOUT Exclusion', fontsize=12)
axes[1, 1].set_xlabel(r'$\hat{\rho}$', fontsize=11)
axes[1, 1].set_ylabel('Density', fontsize=11)
axes[1, 1].legend(fontsize=9)
axes[1, 1].grid(True, alpha=0.3)

# (f) sigma comparison (overlaid)
axes[1, 2].hist(sigma_with, bins=25, alpha=0.5, color='#27ae60',
                edgecolor='black', linewidth=0.5, density=True,
                label=f'With excl. (mean={sigma_with.mean():.3f})')
axes[1, 2].hist(sigma_without, bins=25, alpha=0.5, color='#e74c3c',
                edgecolor='black', linewidth=0.5, density=True,
                label=f'Without excl. (mean={sigma_without.mean():.3f})')
axes[1, 2].axvline(TRUE_SIGMA, color='red', linewidth=2.5, linestyle='--',
                    label=f'True $\\sigma$ = {TRUE_SIGMA}')
axes[1, 2].set_title('$\\hat{\\sigma}$: Comparison', fontsize=12)
axes[1, 2].set_xlabel(r'$\hat{\sigma}$', fontsize=11)
axes[1, 2].set_ylabel('Density', fontsize=11)
axes[1, 2].legend(fontsize=8)
axes[1, 2].grid(True, alpha=0.3)

plt.suptitle('Monte Carlo Simulation: Sampling Distributions\n'
             f'({n_reps} replications, n={n_obs}, true $\\rho$={TRUE_RHO})',
             fontsize=14, fontweight='bold', y=1.03)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex4_monte_carlo_histograms.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Step 4 (continued): Box plots for a cleaner comparison

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# beta_1 box plot
bp1 = axes[0].boxplot(
    [beta1_with, beta1_without, beta1_ols],
    labels=['Heckman\n(with excl.)', 'Heckman\n(no excl.)', 'OLS\n(biased)'],
    patch_artist=True,
    medianprops=dict(color='black', linewidth=2),
    boxprops=dict(linewidth=1.5),
)
colors = ['#27ae60', '#e74c3c', '#f39c12']
for patch, color in zip(bp1['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)
axes[0].axhline(TRUE_BETA[1], color='red', linewidth=2, linestyle='--',
                label=f'True value = {TRUE_BETA[1]}')
axes[0].set_ylabel(r'$\hat{\beta}_1$', fontsize=12)
axes[0].set_title(r'Distribution of $\hat{\beta}_1$', fontsize=13)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3, axis='y')

# rho box plot
bp2 = axes[1].boxplot(
    [rho_with, rho_without],
    labels=['Heckman\n(with excl.)', 'Heckman\n(no excl.)'],
    patch_artist=True,
    medianprops=dict(color='black', linewidth=2),
    boxprops=dict(linewidth=1.5),
)
for patch, color in zip(bp2['boxes'], ['#27ae60', '#e74c3c']):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)
axes[1].axhline(TRUE_RHO, color='red', linewidth=2, linestyle='--',
                label=f'True value = {TRUE_RHO}')
axes[1].set_ylabel(r'$\hat{\rho}$', fontsize=12)
axes[1].set_title(r'Distribution of $\hat{\rho}$', fontsize=13)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3, axis='y')

# sigma box plot
bp3 = axes[2].boxplot(
    [sigma_with, sigma_without],
    labels=['Heckman\n(with excl.)', 'Heckman\n(no excl.)'],
    patch_artist=True,
    medianprops=dict(color='black', linewidth=2),
    boxprops=dict(linewidth=1.5),
)
for patch, color in zip(bp3['boxes'], ['#27ae60', '#e74c3c']):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)
axes[2].axhline(TRUE_SIGMA, color='red', linewidth=2, linestyle='--',
                label=f'True value = {TRUE_SIGMA}')
axes[2].set_ylabel(r'$\hat{\sigma}$', fontsize=12)
axes[2].set_title(r'Distribution of $\hat{\sigma}$', fontsize=13)
axes[2].legend(fontsize=10)
axes[2].grid(True, alpha=0.3, axis='y')

plt.suptitle('Monte Carlo: Box Plots of Estimator Distributions\n'
             f'({n_reps} replications, n={n_obs})',
             fontsize=14, fontweight='bold', y=1.04)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex4_monte_carlo_boxplots.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Step 5: Detailed interpretation and conclusions

print('=' * 85)
print('  EXERCISE 4: MONTE CARLO CONCLUSIONS')
print('=' * 85)
print()
print('1. BIAS COMPARISON (beta_1):')
print(f'   True beta_1 = {TRUE_BETA[1]}')
print(f'   Heckman with exclusion:    mean = {beta1_with.mean():.4f}, bias = {beta1_with.mean() - TRUE_BETA[1]:.4f}')
print(f'   Heckman without exclusion: mean = {beta1_without.mean():.4f}, bias = {beta1_without.mean() - TRUE_BETA[1]:.4f}')
print(f'   OLS (ignoring selection):  mean = {beta1_ols.mean():.4f}, bias = {beta1_ols.mean() - TRUE_BETA[1]:.4f}')
print()
print('2. VARIANCE COMPARISON (beta_1):')
print(f'   Heckman with exclusion:    std = {beta1_with.std():.4f}')
print(f'   Heckman without exclusion: std = {beta1_without.std():.4f}')
print(f'   OLS (ignoring selection):  std = {beta1_ols.std():.4f}')
print(f'   Variance ratio (without/with): {(beta1_without.std()/beta1_with.std())**2:.2f}x')
print()
print('3. RMSE COMPARISON (overall accuracy):')
rmse_with = np.sqrt(np.mean((beta1_with - TRUE_BETA[1])**2))
rmse_without = np.sqrt(np.mean((beta1_without - TRUE_BETA[1])**2))
rmse_ols = np.sqrt(np.mean((beta1_ols - TRUE_BETA[1])**2))
print(f'   Heckman with exclusion:    RMSE = {rmse_with:.4f}')
print(f'   Heckman without exclusion: RMSE = {rmse_without:.4f}')
print(f'   OLS (ignoring selection):  RMSE = {rmse_ols:.4f}')
print()
print('4. SELECTION PARAMETER RECOVERY:')
print(f'   True rho = {TRUE_RHO}')
print(f'   With exclusion:    mean = {rho_with.mean():.4f}, std = {rho_with.std():.4f}')
print(f'   Without exclusion: mean = {rho_without.mean():.4f}, std = {rho_without.std():.4f}')
print()
print('5. KEY FINDINGS (check each against the numbers printed above):')
print('   - Heckman WITH exclusion restrictions is expected to show the smallest bias')
print('     in both the outcome coefficient (beta_1) and the selection correlation (rho).')
print('   - Heckman WITHOUT exclusion restrictions has higher variance (the IMR is nearly')
print('     collinear with the X variables) and may show more bias as well.')
print('   - OLS on the selected sample is biased because it omits the selection term')
print('     rho * sigma * IMR, which is correlated with x whenever x affects selection.')
print('   - The exclusion restriction adds variation in the IMR that is independent of X,')
print('     which stabilizes the Heckman estimator and improves both bias and precision.')
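
The collinearity between the IMR and X mentioned in the findings can be checked directly by regressing the IMR on `x` in the selected sample and reading off the R-squared. The sketch below makes a simplifying assumption: it evaluates the IMR at the true (population) probit index instead of an estimated one, and `imr` and `r2_on_x` are hypothetical helper names. Without `w`, a probit of `s` on `x` alone has index `gamma_x[1] * x / sqrt(1 + gamma_w^2)`, because `gamma_w * w + u` acts as the rescaled error term.

```python
import numpy as np
from scipy import stats


def imr(index):
    """Inverse Mills ratio phi(index) / Phi(index)."""
    return stats.norm.pdf(index) / stats.norm.cdf(index)


def r2_on_x(target, x):
    """R^2 from an OLS regression of target on a constant and x."""
    A = np.column_stack([np.ones_like(x), x])
    fitted = A @ np.linalg.lstsq(A, target, rcond=None)[0]
    return 1.0 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)


rng = np.random.default_rng(0)
n = 50_000
g1, gw = 0.3, 0.8                      # notebook values of gamma_x[1] and gamma_w
x, w, u = rng.standard_normal((3, n))
sel = (g1 * x + gw * w + u) > 0

idx_with = g1 * x + gw * w                     # index when w is in the probit
idx_without = g1 * x / np.sqrt(1.0 + gw**2)    # index when the probit uses x only

r2_with = r2_on_x(imr(idx_with[sel]), x[sel])
r2_without = r2_on_x(imr(idx_without[sel]), x[sel])
print(f'R^2 of IMR on x, with exclusion:    {r2_with:.3f}')
print(f'R^2 of IMR on x, without exclusion: {r2_without:.3f}')
```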

---

## Summary

### What We Demonstrated in These Solutions

| Exercise | Key Insight |
|---|---|
| 1 (Conceptual) | Evaluating instruments requires assessing BOTH relevance AND validity. Strong relevance alone (e.g., SAT scores) does not make a valid instrument. |
| 2 (Hands-On) | Different exclusion restrictions yield different estimates. Strong instruments (children_lt6) produce more stable results than weak ones (age). Multiple instruments allow over-identification checks. |
| 3 (Collinearity) | Without exclusion restrictions, the IMR is highly collinear with X, inflating condition numbers and destabilizing estimation. Exclusion restrictions provide independent variation that breaks this collinearity. |
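| 4 (Monte Carlo) | Simulation indicates that Heckman with exclusion restrictions has lower bias than naive OLS and lower variance than Heckman without exclusion restrictions. Naive OLS can have the smallest variance of the three, but it is centered on the wrong value. |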

### Practical Guidelines for Applied Research

1. **Always use exclusion restrictions** when a credible instrument is available
2. **Economic reasoning** for instrument validity is more important than any statistical test
3. **Test instrument relevance** with a likelihood ratio test on the probit selection equation
4. **Report sensitivity analyses** across different instrument sets
5. **Use multiple instruments** when possible for over-identification checks
6. **Be transparent** about the limitations of your identification strategy

### References

- Heckman, J.J. (1979). "Sample Selection Bias as a Specification Error." *Econometrica*, 47(1), 153-161.
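- Card, D. (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling." In L.N. Christofides, E.K. Grant, and R. Swidinsky (eds.), *Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp*. Toronto: University of Toronto Press, 201-222.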
- Mroz, T.A. (1987). "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions." *Econometrica*, 55(4), 765-799.
- Puhani, P.A. (2000). "The Heckman Correction for Sample Selection and Its Critique." *Journal of Economic Surveys*, 14(1), 53-68.