# Heckman Two-Step Selection Correction

## Learning Objectives

- Understand the sample selection problem and when OLS fails
- Formulate selection and outcome equations with proper exclusion restrictions
- Estimate the Heckman two-step model using PanelBox
- Interpret the Inverse Mills Ratio, rho, and sigma parameters
- Test for selection bias and compare OLS vs Heckman estimates
- Diagnose model fit using IMR visualizations

## Duration
75-90 minutes

## Prerequisites
- OLS regression (linear models)
- Probit/logit models (binary choice)
- Basic probability theory (normal distribution, conditional expectation)

## Dataset
Mroz (1987): Married women's labor force participation and wages
- N = 753 married women
- Outcome: hourly wage (observed only for participants)
- Selection: labor force participation (0/1)
- Rich set of demographic and household variables

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Statistical libraries
from scipy import stats
import statsmodels.api as sm

# PanelBox imports - Heckman selection model and IMR utilities
from panelbox.models.selection import PanelHeckman, compute_imr, imr_diagnostics, test_selection_effect

# Visualization configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Set random seed for reproducibility
np.random.seed(42)

# Define paths (relative to notebook location in examples/censored/notebooks/)
BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
TABLES_DIR = OUTPUT_DIR / 'tables'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete!')

---

## 1. The Sample Selection Problem

### When OLS Fails

Suppose we want to estimate the returns to education on wages for married women. A natural approach would be:

$$\text{wage}_i = \beta_0 + \beta_1 \text{education}_i + \beta_2 \text{experience}_i + \beta_3 \text{experience}_i^2 + \varepsilon_i$$

**The problem**: We only observe wages for women who are *working*. Women who are not in the labor force have missing wages.

### Why is This a Problem?

If we simply run OLS on the observed (working) sample, we get **biased** estimates because:

1. **Non-random sample**: Women who choose to work are not a random sample of all women
2. **Systematic selection**: Unobserved factors (ability, motivation) that influence participation *also* affect wages
3. **Omitted variable bias**: The error term in the wage equation is correlated with the selection decision

### An Analogy

Imagine estimating the effect of training on performance, but only observing performance for employees who *chose* to participate in training. If more motivated employees self-select into training, naive OLS overstates the training effect.

### Key Insight

The working sample is **selected by choice**, not at random. If high-ability women are more likely to enter the labor force, the wages we observe overstate what a randomly chosen woman would earn:

$$E[\text{wage}_i | \text{working}] \neq E[\text{wage}_i]$$
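
A small simulation can make this inequality concrete. The sketch below uses made-up parameters rather than the Mroz data: a true education slope of 0.5, an assumed error correlation of 0.6, and a hypothetical variable `kids` that shifts participation only. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rho = 0.6  # assumed correlation between selection and wage errors

educ = rng.normal(12, 2, n)   # observed regressor
kids = rng.normal(size=n)     # hypothetical variable that shifts participation only
u = rng.normal(size=n)        # selection error
eps = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)  # wage error, corr(u, eps) = rho

wage = 1.0 + 0.5 * educ + eps                 # wage offer for everyone (true slope = 0.5)
working = 0.5 * (educ - 12) + kids + u > 0    # participation rule

# OLS slope on the full population vs. on the self-selected working sample
slope_all = np.polyfit(educ, wage, 1)[0]
slope_working = np.polyfit(educ[working], wage[working], 1)[0]
mean_eps_working = eps[working].mean()

print(f"Slope, full population: {slope_all:.3f}")
print(f"Slope, working sample:  {slope_working:.3f}")
print(f"Mean wage error among workers: {mean_eps_working:.3f}")
```

In this setup the wage error has mean zero in the population but a positive mean among workers, and the slope estimated on workers alone falls below the true 0.5, because low-education women only work when their unobserved draw is unusually favorable.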

---

## 2. Heckman's Insight

James Heckman (Nobel Prize, 2000) showed that sample selection bias can be framed as an **omitted variable problem**, and proposed an elegant two-step correction.

### The Two-Equation Framework

**Selection equation** (who participates?):

$$s_i^* = \mathbf{Z}_i'\boldsymbol{\gamma} + u_i, \qquad s_i = \mathbf{1}[s_i^* > 0]$$

- $s_i^*$ is a latent (unobserved) propensity to participate
- $s_i$ is the observed binary participation indicator
- $\mathbf{Z}_i$ includes variables that affect the participation decision
- $u_i$ is the selection error

**Outcome equation** (what is the wage?):

$$y_i = \mathbf{X}_i'\boldsymbol{\beta} + \varepsilon_i \qquad \text{(observed only if } s_i = 1\text{)}$$

- $y_i$ is the outcome of interest (wage)
- $\mathbf{X}_i$ includes variables that affect the outcome
- $\varepsilon_i$ is the outcome error

### The Critical Assumption

The errors $(u_i, \varepsilon_i)$ are assumed to be **bivariate normal**:

$$\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma_\varepsilon \\ \rho\sigma_\varepsilon & \sigma_\varepsilon^2 \end{pmatrix}\right)$$

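Where:
- $\mathrm{Var}(u_i)$ is normalized to 1, because the Probit identifies $\boldsymbol{\gamma}$ only up to scale
- $\sigma_\varepsilon$ = standard deviation of the outcome error
- $\rho$ = correlation between $u_i$ and $\varepsilon_i$
- If $\rho \neq 0$, OLS on the selected sample is generally **biased** and inconsistent
- If $\rho = 0$, there is no selection bias and OLS on the selected sample is fine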

### The Omitted Variable

Under these assumptions, Heckman showed that the conditional expectation of the outcome in the selected sample is:

$$E[y_i | s_i = 1, \mathbf{X}_i, \mathbf{Z}_i] = \mathbf{X}_i'\boldsymbol{\beta} + \rho\sigma_\varepsilon \cdot \underbrace{\frac{\phi(\mathbf{Z}_i'\boldsymbol{\gamma})}{\Phi(\mathbf{Z}_i'\boldsymbol{\gamma})}}_{\lambda_i \text{ (Inverse Mills Ratio)}}$$

The **Inverse Mills Ratio** $\lambda_i$ captures the selection effect. Omitting it from the regression is what causes the bias. In practice $\boldsymbol{\gamma}$ is unknown and is replaced by the Probit estimate $\hat{\boldsymbol{\gamma}}$.
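
The formula can be checked numerically. The sketch below draws bivariate normal errors with assumed values $\rho = 0.5$ and $\sigma_\varepsilon = 2$, fixes a single selection index $\mathbf{Z}_i'\boldsymbol{\gamma} = c = 0.3$ (all illustrative choices), and compares the simulated mean of $\varepsilon_i$ among selected observations with $\rho\sigma_\varepsilon\lambda(c)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1_000_000
rho, sigma, c = 0.5, 2.0, 0.3   # assumed values for illustration

u = rng.normal(size=n)                                              # Var(u) = 1
eps = sigma * (rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n))  # sd = sigma, corr = rho

selected = c + u > 0                       # s = 1[c + u > 0]
simulated = eps[selected].mean()           # Monte Carlo E[eps | s = 1]
theoretical = rho * sigma * norm.pdf(c) / norm.cdf(c)

print(f"Simulated   E[eps | s=1]: {simulated:.4f}")
print(f"Theoretical rho*sigma*IMR: {theoretical:.4f}")
```

The two numbers should agree up to simulation noise, which is the sense in which $\lambda_i$ is the "omitted variable" in the selected-sample regression.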

### Exclusion Restrictions

For the model to be well-identified, $\mathbf{Z}_i$ should include at least one variable that:
- Affects the **selection** decision (participation)
- Does **NOT** directly affect the **outcome** (wages)

These are called **exclusion restrictions**. They are identifying assumptions rather than testable facts. In the Mroz data, this notebook assumes:
- `children_lt6`, `children_6_18`: Children affect whether a woman works, but not the hourly wage she is offered
- `husband_income`: Husband's income affects the need to work, but not the woman's wage rate
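
To see why an exclusion restriction matters, note that without one the IMR is a function of the same variables that appear in $\mathbf{X}_i$. A quick check, using an arbitrary index range of $[-1, 1.5]$ chosen only for illustration, shows how close to linear the IMR is over a moderate range:

```python
import numpy as np
from scipy.stats import norm

# Selection index values over a moderate, arbitrarily chosen range
index = np.linspace(-1.0, 1.5, 200)
imr = norm.pdf(index) / norm.cdf(index)

# Linear correlation between the index and its IMR
corr = np.corrcoef(index, imr)[0, 1]
print(f"corr(index, IMR) on [-1, 1.5]: {corr:.3f}")
```

If $\mathbf{Z}_i = \mathbf{X}_i$, the index is a linear combination of the outcome regressors, so a correlation this close to $-1$ means the IMR is nearly collinear with them. Identification then rests only on the curvature of $\lambda(\cdot)$, and the second-step estimates tend to be imprecise.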

---

## 3. Loading and Exploring the Data

In [None]:
# Load the Mroz (1987) dataset: married women's labor force participation
df = pd.read_csv(DATA_DIR / 'mroz_1987.csv')

# Preview dataset structure
print('Dataset shape:', df.shape)
print('\nFirst 10 rows:')
display(df.head(10))

print('\nVariable types:')
print(df.dtypes)

print('\nBasic summary statistics:')
display(df.describe())

In [None]:
# Understand the selection structure: who participates in the labor force?
n_total = len(df)
n_working = df['lfp'].sum()
n_not_working = n_total - n_working

print('Sample Selection Structure')
print('=' * 50)
print(f'Total women:          {n_total}')
print(f'Working (lfp=1):      {n_working} ({n_working/n_total:.1%})')
print(f'Not working (lfp=0):  {n_not_working} ({n_not_working/n_total:.1%})')
print(f'\nWages observed:       {df["wage"].notna().sum()}')
print(f'Wages missing:        {df["wage"].isna().sum()}')

print('\n--- Working women ---')
display(df[df['lfp'] == 1][['wage', 'education', 'experience', 'age']].describe())

print('\n--- Non-working women ---')
display(df[df['lfp'] == 0][['education', 'experience', 'age',
                             'children_lt6', 'children_6_18', 'husband_income']].describe())

---

## 4. Exploratory Analysis: Who Participates? Who Doesn't?

In [None]:
# Compare characteristics of working vs non-working women
comparison_vars = ['education', 'experience', 'age', 'children_lt6',
                   'children_6_18', 'husband_income']

comparison_table = pd.DataFrame({
    'Working (lfp=1)': df[df['lfp'] == 1][comparison_vars].mean(),
    'Not Working (lfp=0)': df[df['lfp'] == 0][comparison_vars].mean(),
})
comparison_table['Difference'] = (
    comparison_table['Working (lfp=1)'] - comparison_table['Not Working (lfp=0)']
)

print('Mean Characteristics by Labor Force Status')
print('=' * 70)
display(comparison_table.round(3))

print('\nKey patterns to notice:')
print('- Working women tend to have more education (human capital incentive)')
print('- Working women have fewer young children (childcare constraint)')
print('- Working women have lower husband income (financial need)')
print('- These patterns drive selection and create potential bias')

In [None]:
# Visualize participation patterns across key variables
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Education distribution by LFP status
for lfp_val, label, color in [(1, 'Working', 'steelblue'), (0, 'Not Working', '#D55E00')]:
    subset = df[df['lfp'] == lfp_val]
    axes[0, 0].hist(subset['education'], bins=12, alpha=0.6, label=label,
                    color=color, edgecolor='black', density=True)
axes[0, 0].set_xlabel('Years of Education')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Education by LFP Status')
axes[0, 0].legend()

# Age distribution by LFP status
for lfp_val, label, color in [(1, 'Working', 'steelblue'), (0, 'Not Working', '#D55E00')]:
    subset = df[df['lfp'] == lfp_val]
    axes[0, 1].hist(subset['age'], bins=15, alpha=0.6, label=label,
                    color=color, edgecolor='black', density=True)
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title('Age by LFP Status')
axes[0, 1].legend()

# Husband income by LFP status
for lfp_val, label, color in [(1, 'Working', 'steelblue'), (0, 'Not Working', '#D55E00')]:
    subset = df[df['lfp'] == lfp_val]
    axes[0, 2].hist(subset['husband_income'], bins=15, alpha=0.6, label=label,
                    color=color, edgecolor='black', density=True)
axes[0, 2].set_xlabel('Husband Income ($1000s)')
axes[0, 2].set_ylabel('Density')
axes[0, 2].set_title('Husband Income by LFP Status')
axes[0, 2].legend()

# Participation rate by number of young children
lfp_by_children = df.groupby('children_lt6')['lfp'].mean()
axes[1, 0].bar(lfp_by_children.index, lfp_by_children.values,
               color='steelblue', edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Number of Children < 6')
axes[1, 0].set_ylabel('Participation Rate')
axes[1, 0].set_title('LFP Rate by Young Children')
axes[1, 0].set_ylim(0, 1)

# Participation rate by education level
edu_bins = pd.cut(df['education'], bins=[0, 10, 12, 14, 20],
                  labels=['<= 10', '11-12', '13-14', '15+'])
lfp_by_edu = df.groupby(edu_bins, observed=True)['lfp'].mean()
axes[1, 1].bar(range(len(lfp_by_edu)), lfp_by_edu.values,
               color='steelblue', edgecolor='black', alpha=0.7)
axes[1, 1].set_xticks(range(len(lfp_by_edu)))
axes[1, 1].set_xticklabels(lfp_by_edu.index)
axes[1, 1].set_xlabel('Education Level')
axes[1, 1].set_ylabel('Participation Rate')
axes[1, 1].set_title('LFP Rate by Education')
axes[1, 1].set_ylim(0, 1)

# Wage distribution (working women only)
working_wages = df[df['lfp'] == 1]['wage'].dropna()
axes[1, 2].hist(working_wages, bins=25, color='steelblue',
                edgecolor='black', alpha=0.7)
axes[1, 2].axvline(working_wages.mean(), color='red', linestyle='--',
                    linewidth=2, label=f'Mean = ${working_wages.mean():.2f}')
axes[1, 2].set_xlabel('Hourly Wage ($)')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].set_title('Wage Distribution (Working Women Only)')
axes[1, 2].legend()

plt.suptitle('Exploratory Analysis: Labor Force Participation Patterns',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'eda_participation_patterns.png', dpi=300, bbox_inches='tight')
plt.show()

*Figure: Six-panel exploratory analysis. Top row: density distributions of education, age, and husband income by labor force status (blue = working, orange = not working). Bottom row: participation rate by number of young children (strong negative association), participation rate by education level (strong positive association), and the wage distribution for working women (right-skewed, observed only for participants). These patterns reveal systematic differences between participants and non-participants, motivating the need for selection correction.*

---

## 5. The Two-Step Procedure

### Step 1: Probit Model for Selection

Estimate the probability of labor force participation:

$$P(s_i = 1 | \mathbf{Z}_i) = \Phi(\mathbf{Z}_i'\boldsymbol{\gamma})$$

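Where $\mathbf{Z}_i$ includes:
- `education`, `experience` (also in the outcome equation)
- `age`, `children_lt6`, `children_6_18`, `husband_income` (excluded from the outcome equation; the last three are the exclusion restrictions discussed above)

The outcome equation additionally contains `experience_sq`, which the selection equation in this notebook omits.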

### Step 2: Augmented OLS with IMR

From the Probit estimates $\hat{\boldsymbol{\gamma}}$, compute the **Inverse Mills Ratio** for each selected observation:

$$\hat{\lambda}_i = \frac{\phi(\mathbf{Z}_i'\hat{\boldsymbol{\gamma}})}{\Phi(\mathbf{Z}_i'\hat{\boldsymbol{\gamma}})}$$

Then estimate the outcome equation by OLS, augmented with $\hat{\lambda}_i$:

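$$y_i = \mathbf{X}_i'\boldsymbol{\beta} + \theta \cdot \hat{\lambda}_i + \eta_i \qquad \text{(selected sample only)}$$

Where $\theta = \rho \cdot \sigma_\varepsilon$, estimated by the OLS coefficient $\hat{\theta}$. If $\hat{\theta}$ is statistically significant, selection bias is present. Because $\hat{\lambda}_i$ is itself estimated and $\eta_i$ is heteroskedastic, the conventional OLS standard errors from this step are not valid in general, although the $t$-test of $\theta = 0$ remains valid under the null hypothesis.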
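
The two steps can be sketched end to end on simulated data, where the true parameters are known. This is a minimal illustration under assumed values ($\boldsymbol{\beta} = (1, 2)$, $\rho = 0.7$, $\sigma_\varepsilon = 2$), not the PanelBox implementation: the Probit is fit by direct likelihood maximization with `scipy.optimize.minimize`, the second step uses `numpy.linalg.lstsq`, and every variable name is illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50_000
rho, sigma = 0.7, 2.0             # assumed true selection parameters

x = rng.normal(size=n)            # regressor in both equations
w = rng.normal(size=n)            # exclusion restriction: selection equation only
u = rng.normal(size=n)
eps = sigma * (rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n))

s = 0.5 + x + w + u > 0           # selection indicator
y = 1.0 + 2.0 * x + eps           # outcome; only y[s] is treated as observed

# Step 1: Probit of s on Z = [1, x, w] by maximum likelihood
Z = np.column_stack([np.ones(n), x, w])

def neg_loglik(g):
    idx = Z @ g
    return -np.mean(np.where(s, norm.logcdf(idx), norm.logcdf(-idx)))

gamma_hat = minimize(neg_loglik, np.zeros(Z.shape[1]), method="BFGS").x

# Step 2: IMR for selected observations, then augmented OLS
idx_sel = Z[s] @ gamma_hat
lam = norm.pdf(idx_sel) / norm.cdf(idx_sel)

X_sel = np.column_stack([np.ones(s.sum()), x[s]])
beta_naive = np.linalg.lstsq(X_sel, y[s], rcond=None)[0]
design = np.column_stack([X_sel, lam])
coef_2step = np.linalg.lstsq(design, y[s], rcond=None)[0]
theta_hat = coef_2step[-1]

# Recover sigma and rho: sigma^2 = mean(e^2) + theta^2 * mean(delta),
# with delta_i = lambda_i * (lambda_i + Z_i'gamma)
resid = y[s] - design @ coef_2step
delta = lam * (lam + idx_sel)
sigma_hat = np.sqrt(np.mean(resid**2) + theta_hat**2 * np.mean(delta))
rho_hat = theta_hat / sigma_hat

print("Naive OLS slope:", round(beta_naive[1], 3))
print("Two-step slope: ", round(coef_2step[1], 3))
print("theta, sigma, rho:", round(theta_hat, 3), round(sigma_hat, 3), round(rho_hat, 3))
```

With these settings the naive slope should fall noticeably below the true value of 2, while the two-step slope and the recovered $\hat{\theta}$, $\hat{\sigma}_\varepsilon$, and $\hat{\rho}$ should land near their true values.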

### Intuition for the IMR

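The Inverse Mills Ratio is closely related to the **hazard rate** of the standard normal distribution: $\lambda(z) = \phi(z)/\Phi(z)$ equals the hazard function $\phi(t)/[1 - \Phi(t)]$ evaluated at $t = -z$. It is also $E[u_i \mid u_i > -z]$, the expected selection error among those who are selected. It captures:

- **For observations with low selection probability**: $\lambda_i$ is large. These women are "unlikely" participants: to be observed working, their unobserved propensity $u_i$ must have been unusually high, so their wages carry more information about unobserved factors.
- **For observations with high selection probability**: $\lambda_i$ is small. These women would work for almost any value of the unobserved factors, so little correction is needed.

Think of $\lambda_i$ as measuring the "surprise" of being selected.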
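
A few evaluations make the shape concrete. The helper `imr` below is an illustrative function (not a PanelBox utility) that computes the ratio in logs, which is one way to avoid the 0/0 problem in the far left tail where both $\phi$ and $\Phi$ underflow.

```python
import numpy as np
from scipy.stats import norm

def imr(z):
    """Inverse Mills Ratio phi(z) / Phi(z), computed in logs for numerical stability."""
    z = np.asarray(z, dtype=float)
    return np.exp(norm.logpdf(z) - norm.logcdf(z))

z = np.array([-2.0, 0.0, 2.0])
lam = imr(z)

# Standard normal hazard phi(t) / (1 - Phi(t)) evaluated at t = -z
hazard_at_minus_z = norm.pdf(-z) / norm.sf(-z)

print(np.round(lam, 3))                       # approximately [2.373 0.798 0.055]
print(np.allclose(lam, hazard_at_minus_z))    # True
print(imr(-40.0))                             # finite, close to 40: lambda(z) ~ -z in the left tail
```

The correction shrinks from about 2.4 at $z = -2$ to about 0.06 at $z = 2$, matching the "surprise" interpretation.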

In [None]:
# Step-by-step demonstration of the Heckman procedure
# (Before using PanelBox, let's see the mechanics)

# ---- STEP 1: Probit for selection equation ----
print('STEP 1: Probit Model for Labor Force Participation')
print('=' * 60)

# Selection equation variables (Z): broader set including exclusion restrictions
Z_vars = ['education', 'experience', 'age',
          'children_lt6', 'children_6_18', 'husband_income']

Z_raw = df[Z_vars].values
Z = sm.add_constant(Z_raw)  # Add intercept
selection = df['lfp'].values

# Fit Probit using statsmodels for illustration
probit_model = sm.Probit(selection, Z)
probit_result = probit_model.fit(disp=0)

# Display Probit results
probit_table = pd.DataFrame({
    'Variable': ['const'] + Z_vars,
    'Coefficient': probit_result.params,
    'Std. Error': probit_result.bse,
    'z-stat': probit_result.tvalues,
    'p-value': probit_result.pvalues,
})

display(probit_table.round(4))

print('\nExclusion restrictions (in selection but NOT in outcome):')
print('  - children_lt6:   strong negative effect on participation')
print('  - children_6_18:  moderate negative effect on participation')
print('  - husband_income: higher husband income reduces need to work')

In [None]:
# ---- STEP 2: Compute the Inverse Mills Ratio ----
print('STEP 2: Compute the Inverse Mills Ratio')
print('=' * 60)

# Linear prediction from Probit: Z * gamma_hat
linear_pred = Z @ probit_result.params

# Compute IMR = phi(Z*gamma) / Phi(Z*gamma)
phi_vals = stats.norm.pdf(linear_pred)  # Standard normal PDF
Phi_vals = stats.norm.cdf(linear_pred)  # Standard normal CDF

# Clip CDF to avoid division by zero
Phi_clipped = np.clip(Phi_vals, 1e-10, 1 - 1e-10)
imr_manual = phi_vals / Phi_clipped

# Compare: PanelBox compute_imr utility
imr_panelbox = compute_imr(linear_pred)

# Show IMR statistics for selected (working) women
selected_mask = df['lfp'] == 1
imr_selected = imr_manual[selected_mask]

print(f'IMR statistics (working women only):')
print(f'  Mean:   {imr_selected.mean():.4f}')
print(f'  Std:    {imr_selected.std():.4f}')
print(f'  Min:    {imr_selected.min():.4f}')
print(f'  Max:    {imr_selected.max():.4f}')
print(f'  High IMR (>2): {(imr_selected > 2).sum()} observations')
print(f'\nManual vs PanelBox IMR match: {np.allclose(imr_manual, imr_panelbox)}')

In [None]:
# ---- STEP 2 continued: Augmented OLS on selected sample ----
print('STEP 2 (continued): Augmented OLS with IMR')
print('=' * 60)

# Outcome equation variables (X): regressors that affect wages
X_vars = ['education', 'experience', 'experience_sq']

# Extract selected sample (working women only)
df_working = df[df['lfp'] == 1].copy()
y_working = df_working['wage'].values
X_working_raw = df_working[X_vars].values
X_working = sm.add_constant(X_working_raw)

# IMR for working women
lambda_working = imr_selected.reshape(-1, 1)

# Augmented design matrix: [X, lambda]
X_augmented = np.column_stack([X_working, lambda_working])

# OLS on augmented model
ols_augmented = sm.OLS(y_working, X_augmented).fit()

aug_table = pd.DataFrame({
    'Variable': ['const'] + X_vars + ['lambda (IMR)'],
    'Coefficient': ols_augmented.params,
    'Std. Error': ols_augmented.bse,
    't-stat': ols_augmented.tvalues,
    'p-value': ols_augmented.pvalues,
})

display(aug_table.round(4))

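# Extract selection parameters
theta_hat = ols_augmented.params[-1]  # IMR coefficient = rho * sigma
residuals = ols_augmented.resid
# Heckman's variance correction: Var(eps | selected) = sigma^2 * (1 - rho^2 * delta_i),
# with delta_i = lambda_i * (lambda_i + Z_i'gamma), so theta^2 * mean(delta) is added back
index_working = linear_pred[selected_mask.values]
delta_working = imr_selected * (imr_selected + index_working)
sigma_hat = np.sqrt(np.mean(residuals**2) + theta_hat**2 * np.mean(delta_working))
rho_hat = np.clip(theta_hat / sigma_hat, -0.99, 0.99)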

print(f'\nSelection parameters:')
print(f'  theta (rho*sigma) = {theta_hat:.4f}')
print(f'  sigma_hat         = {sigma_hat:.4f}')
print(f'  rho_hat           = {rho_hat:.4f}')
print(f'\nThe IMR coefficient is the key test for selection bias.')
print(f'If lambda is significant, selection bias is present.')

---

## 6. Estimation with PanelBox

Now let us use PanelBox's `PanelHeckman` class to estimate the model in a single, clean call. This handles all the steps internally: Probit estimation, IMR computation, and augmented OLS.

In [None]:
# Prepare data for PanelHeckman
# IMPORTANT: PanelHeckman requires ALL observations (selected and not selected)

# Outcome variable: wage (set to 0 for non-participants since PanelHeckman
# needs a complete vector; these values are ignored in estimation)
y_all = df['wage'].fillna(0).values

# Selection indicator: labor force participation (0/1)
selection = df['lfp'].values.astype(float)

# Outcome equation regressors (X): variables that affect wages
X_outcome_raw = df[['education', 'experience', 'experience_sq']].values
X_outcome = sm.add_constant(X_outcome_raw)  # Add intercept

# Selection equation regressors (Z): broader set with exclusion restrictions
Z_selection_raw = df[['education', 'experience', 'age',
                       'children_lt6', 'children_6_18', 'husband_income']].values
Z_selection = sm.add_constant(Z_selection_raw)  # Add intercept

print('Data dimensions for PanelHeckman:')
print(f'  y (outcome):         {y_all.shape}')
print(f'  X (outcome regs):    {X_outcome.shape}  (const + {X_outcome.shape[1]-1} vars)')
print(f'  selection:           {selection.shape}  ({int(selection.sum())} selected)')
print(f'  Z (selection regs):  {Z_selection.shape}  (const + {Z_selection.shape[1]-1} vars)')
print(f'\nExclusion restrictions: children_lt6, children_6_18, husband_income')
print(f'  (in Z but NOT in X -- required for identification)')

In [None]:
# Estimate the Heckman two-step model
print('Estimating Heckman Two-Step Model...')
print('=' * 60)

heckman_model = PanelHeckman(
    endog=y_all,               # outcome variable (full sample)
    exog=X_outcome,            # outcome regressors with constant
    selection=selection,        # binary selection indicator
    exog_selection=Z_selection, # selection regressors with constant
    method='two_step'          # Heckman two-step procedure
)

heckman_result = heckman_model.fit()

# Display full summary
print(heckman_result.summary())

---

## 7. Interpreting the Results

### Outcome Equation Coefficients

In [None]:
# Detailed outcome equation results
outcome_var_names = ['const', 'education', 'experience', 'experience_sq']

outcome_table = pd.DataFrame({
    'Variable': outcome_var_names,
    'Coefficient': heckman_result.outcome_params,
})

print('Outcome Equation: wage = X * beta')
print('=' * 60)
display(outcome_table.round(4))

print('\nInterpretation:')
print(f'  Education:    Each additional year of education increases')
print(f'                hourly wage by ${outcome_table.iloc[1]["Coefficient"]:.2f}')
print(f'  Experience:   Returns to experience (with diminishing returns')
print(f'                captured by the squared term)')
print(f'\nThese are the CORRECTED estimates, accounting for selection bias.')

In [None]:
# Detailed selection equation results
selection_var_names = ['const', 'education', 'experience', 'age',
                       'children_lt6', 'children_6_18', 'husband_income']

selection_table = pd.DataFrame({
    'Variable': selection_var_names,
    'Coefficient': heckman_result.probit_params,
})

print('Selection Equation: P(lfp=1) = Phi(Z * gamma)')
print('=' * 60)
display(selection_table.round(4))

print('\nInterpretation of exclusion restrictions:')
print(f'  children_lt6:   Young children strongly reduce participation')
print(f'  children_6_18:  School-age children have a smaller effect')
print(f'  husband_income: Higher husband income reduces need to work')
print(f'\nThese variables affect SELECTION but not WAGES directly.')

In [None]:
# Selection parameters: rho and sigma
print('Selection Parameters')
print('=' * 60)
print(f'  sigma (error std dev):  {heckman_result.sigma:.4f}')
print(f'  rho (error correlation): {heckman_result.rho:.4f}')
print(f'  lambda = rho * sigma:    {heckman_result.rho * heckman_result.sigma:.4f}')

print(f'\nInterpretation of rho = {heckman_result.rho:.4f}:')
if heckman_result.rho > 0:
    print('  POSITIVE selection: unobserved factors that increase')
    print('  participation also increase wages.')
    print('  --> Women who choose to work have higher-than-average')
    print('      unobserved wage potential, so wages observed in the working')
    print('      sample OVERSTATE what the average woman would be offered.')
elif heckman_result.rho < 0:
    print('  NEGATIVE selection: unobserved factors that increase')
    print('  participation decrease wages.')
    print('  --> Women who choose to work have lower-than-average')
    print('      unobserved wage potential (e.g., financial need).')
else:
    print('  No selection bias: participation and wages are independent.')

print(f'\n  sigma = {heckman_result.sigma:.4f} represents the standard deviation')
print(f'  of the outcome equation errors.')

---

## 8. The Inverse Mills Ratio: Visualization and Diagnostics

In [None]:
# Visualize the theoretical IMR function
# lambda(z) = phi(z) / Phi(z)
z_grid = np.linspace(-3, 3, 500)
imr_grid = stats.norm.pdf(z_grid) / stats.norm.cdf(z_grid)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: IMR as a function of z
axes[0].plot(z_grid, imr_grid, color='steelblue', linewidth=2.5)
axes[0].set_xlabel(r"$z = \mathbf{Z}'\hat{\gamma}$ (selection index)", fontsize=12)
axes[0].set_ylabel(r'$\lambda(z) = \phi(z) / \Phi(z)$', fontsize=12)
axes[0].set_title('Inverse Mills Ratio Function')
axes[0].axhline(y=0, color='gray', linestyle=':', alpha=0.5)
axes[0].grid(alpha=0.3)

# Annotate key regions
axes[0].annotate('Low selection prob\n(large correction)',
                 xy=(-2, stats.norm.pdf(-2)/stats.norm.cdf(-2)),
                 xytext=(-1.5, 6), fontsize=10,
                 arrowprops=dict(arrowstyle='->', color='red'),
                 color='red')
axes[0].annotate('High selection prob\n(small correction)',
                 xy=(2, stats.norm.pdf(2)/stats.norm.cdf(2)),
                 xytext=(0.5, 3), fontsize=10,
                 arrowprops=dict(arrowstyle='->', color='green'),
                 color='green')

# Right: Selection probability vs IMR
prob_grid = stats.norm.cdf(z_grid)
axes[1].plot(prob_grid, imr_grid, color='steelblue', linewidth=2.5)
axes[1].set_xlabel('Selection Probability P(s=1)', fontsize=12)
axes[1].set_ylabel(r'$\lambda$ (Inverse Mills Ratio)', fontsize=12)
axes[1].set_title('IMR vs Selection Probability')
axes[1].grid(alpha=0.3)
axes[1].set_xlim(0, 1)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'imr_theoretical.png', dpi=300, bbox_inches='tight')
plt.show()

print('Key insight: The IMR is a decreasing function of the selection probability.')
print('Observations with low P(selected) receive the largest corrections.')

*Figure: Left panel shows the Inverse Mills Ratio as a function of the selection index z. The IMR is large when z is negative (low selection probability), meaning observations with low participation likelihood receive the greatest correction. As z increases (high participation probability), the IMR approaches zero (minimal correction needed). Right panel shows the same relationship plotted against the selection probability directly, confirming the monotonically decreasing relationship.*

In [None]:
# Use PanelHeckman's built-in IMR diagnostics and plotting

# IMR diagnostics
diag = heckman_result.imr_diagnostics()

print('IMR Diagnostics')
print('=' * 50)
for key, value in diag.items():
    if isinstance(value, float):
        print(f'  {key:20s}: {value:.4f}')
    else:
        print(f'  {key:20s}: {value}')

print(f'\nInterpretation:')
print(f'  - Selection rate: {diag["selection_rate"]:.1%} of women participate')
if diag['high_imr_count'] > 0:
    print(f'  - {diag["high_imr_count"]} observations have high IMR (> 2),')
    print(f'    indicating strong selection effects for those individuals')
else:
    print(f'  - No observations have extremely high IMR values')
    print(f'    Selection correction is moderate across the sample')

In [None]:
# Use PanelHeckman's built-in plot_imr method
fig = heckman_result.plot_imr(figsize=(14, 5))
plt.savefig(FIGURES_DIR / 'imr_diagnostics.png', dpi=300, bbox_inches='tight')
plt.show()

*Figure: Two-panel IMR diagnostic plot from PanelHeckman. Left panel: scatter plot of IMR values against predicted selection probability for working women. The inverse relationship is clear -- women with lower selection probability receive larger corrections. The red dashed line at IMR = 2 marks the threshold for strong selection effects. Right panel: histogram of IMR values for the selected sample, showing the distribution of correction magnitudes.*

In [None]:
# Additional IMR visualization: how the correction varies with key variables
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Get IMR values for working women
imr_values = heckman_result.lambda_imr[selection == 1]
df_working_viz = df[df['lfp'] == 1].copy()
df_working_viz['imr'] = imr_values

# IMR vs Education
axes[0].scatter(df_working_viz['education'], df_working_viz['imr'],
                alpha=0.4, s=20, color='steelblue')
axes[0].set_xlabel('Years of Education')
axes[0].set_ylabel('Inverse Mills Ratio')
axes[0].set_title('IMR vs Education')
axes[0].grid(alpha=0.3)

# IMR vs Husband Income
axes[1].scatter(df_working_viz['husband_income'], df_working_viz['imr'],
                alpha=0.4, s=20, color='steelblue')
axes[1].set_xlabel('Husband Income ($1000s)')
axes[1].set_ylabel('Inverse Mills Ratio')
axes[1].set_title('IMR vs Husband Income')
axes[1].grid(alpha=0.3)

# IMR vs Wage
axes[2].scatter(df_working_viz['wage'], df_working_viz['imr'],
                alpha=0.4, s=20, color='steelblue')
axes[2].set_xlabel('Hourly Wage ($)')
axes[2].set_ylabel('Inverse Mills Ratio')
axes[2].set_title('IMR vs Observed Wage')
axes[2].grid(alpha=0.3)

plt.suptitle('Selection Correction Across Key Variables',
             fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'imr_by_variable.png', dpi=300, bbox_inches='tight')
plt.show()

print('The IMR is the size of the selection correction applied to each observation.')
print('It depends only on the selection-equation variables: women whose observables')
print('make participation unlikely (e.g., high husband income) get a larger IMR.')

*Figure: Three scatter plots showing how the Inverse Mills Ratio relates to education, husband income, and observed wages. The IMR tends to be higher for women whose observable characteristics make participation less likely (e.g., higher husband income), indicating that these women receive a larger selection correction. The relationship between IMR and wage helps visualize the selection mechanism.*

---

## 9. Testing for Selection Bias
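
The tests in this section ask whether $\rho$ (equivalently $\theta = \rho\sigma_\varepsilon$, the coefficient on the IMR) is zero. As a rough sketch of the underlying calculation, a two-sided z-test can be written in a few lines. The function name `imr_wald_test` and the numbers passed to it are purely illustrative; they are not part of PanelBox and are not taken from the Mroz estimates.

```python
from scipy.stats import norm

def imr_wald_test(theta_hat, se, alpha=0.05):
    """Two-sided z-test of H0: theta = 0 for the IMR coefficient."""
    z = theta_hat / se
    p_value = 2 * norm.sf(abs(z))   # two-sided p-value from the standard normal
    return {"z": z, "p_value": p_value, "reject": p_value < alpha}

# Made-up inputs: coefficient 1.0 with standard error 0.5
example = imr_wald_test(theta_hat=1.0, se=0.5)
print(example)
```

With these inputs the statistic is 2 and the p-value is about 0.046, so the null of no selection would be rejected at the 5% level.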

In [None]:
# Test 1: Using PanelHeckman's built-in selection_test()
print('Test for Selection Bias')
print('=' * 60)
print('H0: rho = 0 (no selection bias, OLS is consistent)')
print('H1: rho != 0 (selection bias present, OLS is biased)')
print()

test_result = heckman_result.selection_test()

print(f'Results:')
print(f'  rho           = {test_result["rho"]:.4f}')
print(f'  z-statistic   = {test_result["z_statistic"]:.4f}')
print(f'  p-value       = {test_result["p_value"]:.4f}')
print(f'  Significant?  = {test_result["significant"]}')

if test_result['significant']:
    print(f'\nConclusion: REJECT H0 at 5% level.')
    print(f'Selection bias is statistically significant.')
    print(f'OLS on the working sample would produce biased estimates.')
    print(f'The Heckman correction is warranted.')
else:
    print(f'\nConclusion: FAIL TO REJECT H0 at 5% level.')
    print(f'No strong evidence of selection bias.')
    print(f'OLS and Heckman estimates should be similar.')

In [None]:
# Test 2: Using the selection_effect() method (more detailed)
print('Detailed Selection Effect Test')
print('=' * 60)

effect_result = heckman_result.selection_effect(alpha=0.05)

print(f'  Test statistic:  {effect_result["statistic"]:.4f}')
print(f'  p-value:         {effect_result["pvalue"]:.4f}')
print(f'  Reject H0:       {effect_result["reject"]}')
print(f'\n  {effect_result["interpretation"]}')

In [None]:
# Test 3: Direct test using IMR coefficient from augmented OLS
# This is the most transparent approach
print('Direct Test: Is the IMR Coefficient Significant?')
print('=' * 60)

# Use the test_selection_effect utility with the augmented OLS results
theta = ols_augmented.params[-1]      # IMR coefficient
theta_se = ols_augmented.bse[-1]      # IMR standard error

direct_test = test_selection_effect(
    imr_coefficient=theta,
    imr_se=theta_se,
    alpha=0.05
)

print(f'  IMR coefficient (theta = rho*sigma): {direct_test["imr_coefficient"]:.4f}')
print(f'  Standard error:                      {direct_test["imr_se"]:.4f}')
print(f'  t-statistic:                         {direct_test["statistic"]:.4f}')
print(f'  p-value:                             {direct_test["pvalue"]:.4f}')
print(f'  Reject H0 (alpha=0.05):              {direct_test["reject"]}')
print(f'\n  {direct_test["interpretation"]}')

---

## 10. OLS vs Heckman: Demonstrating the Bias

In [None]:
# Compare OLS (biased) vs Heckman (corrected) using built-in method
print('OLS vs Heckman Comparison')
print('=' * 60)

comparison = heckman_result.compare_ols_heckman()

comp_table = pd.DataFrame({
    'Variable': outcome_var_names,
    'OLS (biased)': comparison['beta_ols'],
    'Heckman (corrected)': comparison['beta_heckman'],
    'Difference': comparison['difference'],
    '% Difference': comparison['pct_difference'],
})

display(comp_table.round(4))

print(f'\nMaximum absolute difference: {comparison["max_abs_difference"]:.4f}')
print(f'\n{comparison["interpretation"]}')

In [None]:
# Visualization: side-by-side coefficient comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left panel: bar chart of coefficients
x_pos = np.arange(len(outcome_var_names))
width = 0.35

bars1 = axes[0].bar(x_pos - width/2, comparison['beta_ols'],
                     width, label='OLS (biased)', color='#D55E00',
                     alpha=0.8, edgecolor='black')
bars2 = axes[0].bar(x_pos + width/2, comparison['beta_heckman'],
                     width, label='Heckman (corrected)', color='steelblue',
                     alpha=0.8, edgecolor='black')

axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(outcome_var_names, rotation=15)
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('OLS vs Heckman: Coefficient Comparison')
axes[0].legend()
axes[0].grid(alpha=0.3, axis='y')
axes[0].axhline(y=0, color='black', linewidth=0.5)

# Right panel: percentage difference
# Exclude constant for cleaner visualization
pct_diff = comparison['pct_difference'][1:]  # Skip constant
var_names_no_const = outcome_var_names[1:]

colors = ['#D55E00' if d > 0 else 'steelblue' for d in pct_diff]
axes[1].barh(var_names_no_const, pct_diff, color=colors,
              alpha=0.8, edgecolor='black')
axes[1].axvline(x=0, color='black', linewidth=1)
axes[1].set_xlabel('Percentage Difference (OLS - Heckman) / Heckman x 100')
axes[1].set_title('Selection Bias: % Difference in Coefficients')
axes[1].grid(alpha=0.3, axis='x')

plt.suptitle('Quantifying Selection Bias: OLS vs Heckman',
             fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ols_vs_heckman.png', dpi=300, bbox_inches='tight')
plt.show()

print('Positive % difference: OLS is larger in magnitude than Heckman (same sign)')
print('Negative % difference: OLS is smaller in magnitude than Heckman, or has the opposite sign')

*Figure: Left panel shows a side-by-side bar chart comparing OLS (orange) and Heckman (blue) coefficients for each variable in the outcome equation. Right panel shows the percentage difference between the two estimators for the slope coefficients (excluding the constant), computed relative to the Heckman estimate. Bars to the right of zero indicate that OLS overstates the magnitude of the coefficient; bars to the left indicate that OLS understates it or has the opposite sign. The length of each bar measures the gap between the two estimators, which reflects selection bias if the Heckman model is correctly specified.*
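
The sign convention is easiest to see on a toy example. The sketch below assumes the percentage difference follows the formula on the axis label, (OLS - Heckman) / Heckman x 100. The helper `relative_gap_pct` and the coefficient values are hypothetical: they are not part of PanelBox and are not estimates from the Mroz data.

```python
import numpy as np

def relative_gap_pct(beta_ols, beta_heckman):
    """Hypothetical helper: (OLS - Heckman) / Heckman, in percent."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    beta_heckman = np.asarray(beta_heckman, dtype=float)
    return (beta_ols - beta_heckman) / beta_heckman * 100.0

# Made-up coefficients covering positive and negative Heckman estimates
beta_heckman = np.array([2.0, 2.0, -2.0, -2.0])
beta_ols = np.array([3.0, 1.0, -3.0, -1.0])

pct = relative_gap_pct(beta_ols, beta_heckman)
for b_o, b_h, p in zip(beta_ols, beta_heckman, pct):
    print(f'OLS = {b_o:+.1f}, Heckman = {b_h:+.1f} -> {p:+.1f}%')
```

When the Heckman coefficient is negative, a positive percentage means the OLS estimate is more negative, so "positive" is best read as "larger in magnitude" rather than "larger in value".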

In [None]:
# Predicted wage comparison: OLS vs Heckman
# Show how predictions differ for the working sample

# OLS predictions (on working sample)
y_pred_ols = X_working @ comparison['beta_ols']

# Heckman predictions (conditional on selection)
y_pred_heckman = heckman_result.predict(type='conditional')
y_pred_heckman_selected = y_pred_heckman[selection == 1]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: OLS predicted vs actual
axes[0].scatter(y_working, y_pred_ols, alpha=0.3, s=15, color='#D55E00')
axes[0].plot([0, y_working.max()], [0, y_working.max()],
             'k--', linewidth=1.5, alpha=0.7, label='45-degree line')
axes[0].set_xlabel('Actual Wage ($)')
axes[0].set_ylabel('Predicted Wage ($)')
axes[0].set_title('OLS Predictions (Biased)')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Right: Heckman predicted vs actual
axes[1].scatter(y_working, y_pred_heckman_selected, alpha=0.3, s=15,
                color='steelblue')
axes[1].plot([0, y_working.max()], [0, y_working.max()],
             'k--', linewidth=1.5, alpha=0.7, label='45-degree line')
axes[1].set_xlabel('Actual Wage ($)')
axes[1].set_ylabel('Predicted Wage ($)')
axes[1].set_title('Heckman Predictions (Corrected)')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.suptitle('Predicted vs Actual Wages', fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'predictions_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

# Fit statistics
rmse_ols = np.sqrt(np.mean((y_working - y_pred_ols)**2))
rmse_heckman = np.sqrt(np.mean((y_working - y_pred_heckman_selected)**2))
corr_ols = np.corrcoef(y_working, y_pred_ols)[0, 1]
corr_heckman = np.corrcoef(y_working, y_pred_heckman_selected)[0, 1]

print('Prediction Performance:')
print(f'  OLS:     RMSE = {rmse_ols:.3f}, Corr = {corr_ols:.3f}')
print(f'  Heckman: RMSE = {rmse_heckman:.3f}, Corr = {corr_heckman:.3f}')

*Figure: Left panel shows OLS predicted wages vs actual wages for the working sample. Right panel shows Heckman predicted wages vs actual wages. The 45-degree line represents perfect prediction. Differences between the two panels reflect the impact of the selection correction on predicted values.*
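
To make the conditional prediction concrete, the sketch below applies the textbook formulas: the unconditional mean is $\mathbf{X}_i'\boldsymbol{\beta}$ and the mean conditional on selection adds $\rho\sigma_\varepsilon\lambda_i$. All parameter values are hypothetical, and whether `predict(type='conditional')` computes exactly this quantity is an assumption, not something confirmed from the PanelBox source.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters, for illustration only (not Mroz estimates)
beta = np.array([1.0, 0.5])     # outcome equation: constant, x
gamma = np.array([-0.5, 1.0])   # selection equation: constant, z
theta = 0.8                     # IMR coefficient, rho * sigma

# Three observations with identical x but different selection covariate z
x = np.array([10.0, 10.0, 10.0])
z = np.array([-1.0, 0.5, 2.0])
X = np.column_stack([np.ones_like(x), x])
Z = np.column_stack([np.ones_like(z), z])

index = Z @ gamma                         # selection index Z'gamma
imr = norm.pdf(index) / norm.cdf(index)   # Inverse Mills Ratio

y_uncond = X @ beta                       # E[y*]
y_cond = y_uncond + theta * imr           # E[y | selected]

for p, lam, yu, yc in zip(norm.cdf(index), imr, y_uncond, y_cond):
    print(f'P(select) = {p:.3f}, IMR = {lam:.3f}, '
          f'unconditional = {yu:.3f}, conditional = {yc:.3f}')
```

Because the IMR falls as the selection index rises, the two predictions differ most for observations with a low predicted probability of selection.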

---

## 11. Summary and Key Takeaways

### What We Learned

1. **The Selection Problem**
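   - When outcomes are observed only for a non-random subsample, OLS on that subsample is generally biased and inconsistent
   - The bias arises because unobservables that drive selection are correlated with the unobservables in the outcome equation
   - This is a form of omitted variable bias: the omitted term is the conditional mean of the outcome error given selection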

2. **Heckman's Two-Step Correction**
   - Step 1: Estimate a Probit model for the selection equation
   - Step 2: Compute the Inverse Mills Ratio and include it as an additional regressor in the outcome equation
   - The IMR absorbs the selection bias, yielding consistent estimates when the two errors are jointly normal

3. **Key Parameters**
   - $\rho$ (rho): correlation between selection and outcome errors
     - $\rho > 0$: positive selection (participants have higher unobserved ability)
     - $\rho < 0$: negative selection (participants have lower unobserved ability)
     - $\rho = 0$: no selection bias
   - $\sigma$: standard deviation of outcome errors
   - $\theta = \rho \sigma$: the coefficient on the IMR $\lambda_i$; a t-test on it tests directly for selection bias

4. **Exclusion Restrictions Are Essential**
   - At least one variable must affect selection but NOT the outcome
   - Without exclusion restrictions, the model is identified only through functional form (fragile)
   - In the Mroz data: `children_lt6`, `children_6_18`, `husband_income` (in this notebook's specification, `age` also enters only the selection equation)

5. **Testing for Selection Bias**
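   - Test $H_0: \rho = 0$ using a t-test on the IMR coefficient
   - If we fail to reject, there is no evidence of selection bias and OLS on the selected sample is a reasonable choice (the test can have low power when the exclusion restrictions are weak)
   - If we reject, the Heckman correction is needed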

### PanelBox Implementation Summary

```python
from panelbox.models.selection import PanelHeckman
import statsmodels.api as sm

# Prepare data (ALL observations, not just selected)
y = df['wage'].fillna(0).values
selection = df['lfp'].values
X = sm.add_constant(df[['education', 'experience', 'experience_sq']].values)
Z = sm.add_constant(df[['education', 'experience', 'age',
                         'children_lt6', 'children_6_18', 'husband_income']].values)

# Estimate
model = PanelHeckman(endog=y, exog=X, selection=selection,
                     exog_selection=Z, method='two_step')
result = model.fit()

# Interpret
print(result.summary())
print(f'rho = {result.rho:.4f}, sigma = {result.sigma:.4f}')

# Test for selection bias
test = result.selection_test()

# Compare with OLS
comparison = result.compare_ols_heckman()

# Diagnostics
diag = result.imr_diagnostics()
fig = result.plot_imr()
```

### Mathematical Summary

| Component | Formula |
|---|---|
| Selection | $s_i^* = \mathbf{Z}_i'\boldsymbol{\gamma} + u_i$, $s_i = \mathbf{1}[s_i^* > 0]$ |
| Outcome | $y_i = \mathbf{X}_i'\boldsymbol{\beta} + \varepsilon_i$, observed only if $s_i = 1$ |
| IMR | $\lambda_i = \phi(\mathbf{Z}_i'\hat{\boldsymbol{\gamma}}) / \Phi(\mathbf{Z}_i'\hat{\boldsymbol{\gamma}})$ |
| Corrected | $y_i = \mathbf{X}_i'\boldsymbol{\beta} + \rho\sigma_\varepsilon\lambda_i + \eta_i$ for $s_i = 1$ |
| Bias test | $H_0: \rho = 0$ via t-test on $\hat{\theta} = \hat{\rho}\hat{\sigma}_\varepsilon$ |
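
The IMR row can be checked numerically: for a standard normal error $u$, the mean of $u$ among observations with $u > -c$ equals $\phi(c)/\Phi(c)$. The sketch below compares that formula with a simulated truncated mean. The helper name `inverse_mills_ratio` is hypothetical and is not the PanelBox `compute_imr` function.

```python
import numpy as np
from scipy.stats import norm

def inverse_mills_ratio(index):
    """Hypothetical helper: phi(index) / Phi(index), computed in logs."""
    index = np.asarray(index, dtype=float)
    # Working in logs avoids dividing by a CDF that underflows to zero
    # when the index is very negative
    return np.exp(norm.logpdf(index) - norm.logcdf(index))

rng = np.random.default_rng(1)
u = rng.standard_normal(1_000_000)

for c in (-1.0, 0.0, 1.0):
    simulated = u[u > -c].mean()               # E[u | u > -c] by simulation
    analytic = float(inverse_mills_ratio(c))   # phi(c) / Phi(c)
    print(f'c = {c:+.1f}: simulated = {simulated:.4f}, analytic = {analytic:.4f}')
```

Multiplying this truncated mean by $\rho\sigma_\varepsilon$ gives the correction term in the "Corrected" row.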

### References

- Heckman, J.J. (1979). "Sample Selection Bias as a Specification Error." *Econometrica*, 47(1), 153-161.
- Mroz, T.A. (1987). "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions." *Econometrica*, 55(4), 765-799.
- Wooldridge, J.M. (1995). "Selection Corrections for Panel Data Models Under Conditional Mean Independence Assumptions." *Journal of Econometrics*, 68(1), 115-132.
- Wooldridge, J.M. (2010). *Econometric Analysis of Cross Section and Panel Data* (2nd ed.). MIT Press, Chapter 19.

---

## 12. Exercises

Try these exercises to reinforce your understanding:

### Exercise 1: Alternative Exclusion Restrictions
Re-estimate the Heckman model using only `children_lt6` and `husband_income` as exclusion restrictions (drop `children_6_18` from Z). Compare the results with the full specification. Are the estimates sensitive to the choice of exclusion restrictions?

### Exercise 2: Log-Wage Specification
Estimate the Heckman model with `log(wage)` as the outcome variable instead of `wage` in levels. This is more standard in labor economics. Compare the education coefficient with the level specification. Which specification is more appropriate, and why?

### Exercise 3: Unconditional vs Conditional Predictions
Use `result.predict(type='unconditional')` and `result.predict(type='conditional')` to generate both types of predictions. Plot them against each other and explain the difference. For which women is the gap largest?

### Exercise 4: Monte Carlo Selection Bias
Generate synthetic data where you know the true parameters:
- Create a DGP with known rho = 0.5
- Estimate OLS (ignoring selection) and Heckman
- Repeat 500 times and show the bias distribution
- Verify that the Heckman estimates are centered on the true values (the estimator is consistent) while the OLS estimates are not
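
As a starting point for the DGP, the sketch below draws the two error terms jointly with correlation 0.5 and checks the draw. It covers the error draw only, not the full exercise; the seed, sample size, and variable names are arbitrary choices, and it uses NumPy's `Generator.multivariate_normal` rather than the SciPy import suggested in the hint.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000          # large n so the checks below are tight
rho, sigma = 0.5, 1.0

# Covariance of (u, eps): Var(u) = 1, Var(eps) = sigma^2, Cov = rho * sigma
cov = np.array([[1.0, rho * sigma],
                [rho * sigma, sigma**2]])
errors = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
u, eps = errors[:, 0], errors[:, 1]

print(f'Sample corr(u, eps):  {np.corrcoef(u, eps)[0, 1]:.3f}')
print(f'Mean of eps, all:     {eps.mean():.3f}')
print(f'Mean of eps, u > 0:   {eps[u > 0].mean():.3f}')
```

The last line shows the source of the bias: among observations selected on `u`, the outcome error no longer has mean zero.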

In [None]:
# ---- Exercise 1: Alternative Exclusion Restrictions ----
# Try re-estimating with a different set of exclusion restrictions

# Your code here:
# Z_alt_raw = df[['education', 'experience', 'age',
#                  'children_lt6', 'husband_income']].values
# Z_alt = sm.add_constant(Z_alt_raw)
# model_alt = PanelHeckman(endog=y_all, exog=X_outcome,
#                          selection=selection, exog_selection=Z_alt,
#                          method='two_step')
# result_alt = model_alt.fit()
# print(result_alt.summary())

In [None]:
# ---- Exercise 2: Log-Wage Specification ----
# Estimate with log(wage) as the dependent variable

# Your code here:
# df['log_wage'] = np.log(df['wage'].where(df['wage'] > 0))  # NaN where wage is missing or zero
# y_log = df['log_wage'].fillna(0).values
# model_log = PanelHeckman(endog=y_log, exog=X_outcome,
#                          selection=selection, exog_selection=Z_selection,
#                          method='two_step')
# result_log = model_log.fit()
# print(result_log.summary())

In [None]:
# ---- Exercise 3: Unconditional vs Conditional Predictions ----
# Compare the two types of predictions

# Your code here:
# y_uncond = heckman_result.predict(type='unconditional')
# y_cond = heckman_result.predict(type='conditional')
# plt.scatter(y_uncond, y_cond, alpha=0.3)
# plt.xlabel('Unconditional E[y*]')
# plt.ylabel('Conditional E[y|selected]')
# plt.title('Unconditional vs Conditional Predictions')
# plt.show()

In [None]:
# ---- Exercise 4: Monte Carlo Selection Bias ----
# Demonstrate selection bias with known DGP

# Your code here:
# Hint: Use bivariate normal errors with rho=0.5
# from scipy.stats import multivariate_normal
# n_sims = 500
# n = 1000
# beta_true = np.array([1.0, 0.5])
# gamma_true = np.array([0.0, 0.3])
# rho_true = 0.5
# ...
# Compare distributions of beta_hat_ols vs beta_hat_heckman

In [None]:
# Save all results for reference
import json

results_summary = {
    'outcome_params': heckman_result.outcome_params.tolist(),
    'probit_params': heckman_result.probit_params.tolist(),
    # Cast scalars to built-in types so they are written as JSON numbers
    'sigma': float(heckman_result.sigma),
    'rho': float(heckman_result.rho),
    'n_total': int(heckman_result.n_total),
    'n_selected': int(heckman_result.n_selected),
    'selection_test': heckman_result.selection_test(),
    'imr_diagnostics': heckman_result.imr_diagnostics(),
}

# default=str stringifies any remaining values that json cannot encode
with open(TABLES_DIR / 'heckman_results.json', 'w') as f:
    json.dump(results_summary, f, indent=2, default=str)

comp_table.to_csv(TABLES_DIR / 'ols_vs_heckman_comparison.csv', index=False)

print('Results saved to:')
print(f'  {TABLES_DIR / "heckman_results.json"}')
print(f'  {TABLES_DIR / "ols_vs_heckman_comparison.csv"}')
print('\nNotebook complete!')