# Heckman MLE Estimation

**Tutorial Series**: Censored and Selection Models with PanelBox

**Notebook**: 05 - Heckman Maximum Likelihood Estimation

**Author**: PanelBox Contributors

**Date**: 2026-02-17

**Estimated Duration**: 75-90 minutes

**Difficulty Level**: Intermediate

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the full information maximum likelihood approach to the Heckman model
2. Derive and interpret the joint log-likelihood of selection and outcome equations
3. Estimate the Heckman model using both two-step and MLE methods in PanelBox
4. Compare parameter estimates, standard errors, and efficiency between the two methods
5. Understand when MLE is preferred over two-step estimation and vice versa
6. Generate and compare unconditional and conditional predictions
7. Diagnose convergence issues in MLE estimation

**Prerequisites**: Notebook 04 (Heckman two-step), familiarity with maximum likelihood concepts, basic understanding of sample selection bias.

---

## Table of Contents

1. [Two-Step vs MLE: Motivation](#section1)
2. [The Full Information MLE](#section2)
3. [Loading Data](#section3)
4. [Two-Step Estimation (Review)](#section4)
5. [MLE Estimation](#section5)
6. [Comparing Two-Step and MLE](#section6)
7. [When to Use Which Method](#section7)
8. [Prediction Comparison](#section8)
9. [Summary and Key Takeaways](#section9)
10. [Exercises](#exercises)

## Setup

Import all required libraries and configure the environment.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
import statsmodels.api as sm

from panelbox.models.selection import PanelHeckman

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

np.random.seed(42)

BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
TABLES_DIR = OUTPUT_DIR / 'tables'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete!')

<a id='section1'></a>
## 1. Two-Step vs MLE: Motivation

### 1.1 Recap of the Two-Step Estimator

In Notebook 04, we introduced the Heckman two-step estimator for correcting sample selection bias. The procedure is:

**Step 1** -- Estimate a probit model for the selection equation:

$$P(s_i = 1 | Z_i) = \Phi(Z_i' \gamma)$$

**Step 2** -- Compute the inverse Mills ratio (IMR) and include it in the outcome equation:

$$y_i = X_i' \beta + \rho \sigma \lambda(Z_i' \hat{\gamma}) + \eta_i$$

where $\lambda(\cdot) = \phi(\cdot) / \Phi(\cdot)$ is the inverse Mills ratio.

### 1.2 Limitations of Two-Step

While the two-step estimator is consistent and computationally simple, it has several drawbacks:

1. **Inefficiency**: The two-step estimator does not use all available information simultaneously. It estimates the probit and the outcome regression in separate stages, so it does not exploit the full joint distribution of the two error terms.

2. **Standard error complications**: The OLS standard errors from Step 2 are incorrect because they ignore the sampling variability from Step 1 (the estimated IMR). Proper standard errors require the Murphy-Topel or bootstrap correction.

3. **Sensitivity to specification**: The two-step estimator can be sensitive to the functional form of the selection equation, particularly when identification relies heavily on the nonlinearity of the IMR rather than exclusion restrictions.

4. **No direct likelihood**: Without a likelihood function, standard model comparison tools (AIC, BIC, likelihood ratio tests) are unavailable.

### 1.3 Why MLE?

Full information maximum likelihood (FIML) estimation addresses these limitations:

- **Efficiency**: MLE is asymptotically efficient under correct model specification
- **Correct standard errors**: The inverse information matrix yields standard errors for all parameters jointly, with no separate correction for the estimated IMR
- **Model comparison**: Log-likelihood enables AIC, BIC, and LR tests
- **Joint estimation**: All parameters ($\beta$, $\gamma$, $\sigma$, $\rho$) are estimated simultaneously

The trade-off is that MLE relies more heavily on the bivariate normality assumption and can encounter convergence difficulties.

<a id='section2'></a>
## 2. The Full Information MLE

### 2.1 The Joint Distribution

The Heckman model assumes that the outcome and selection errors are jointly normal:

$$\begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix} \right)$$

where:
- $\varepsilon_i$ is the error in the outcome equation: $y_i = X_i'\beta + \varepsilon_i$
- $u_i$ is the error in the selection equation: $s_i^* = Z_i'\gamma + u_i$, with $s_i = \mathbb{1}[s_i^* > 0]$
- $\sigma^2 = \text{Var}(\varepsilon_i)$ is the outcome error variance
- $\rho = \text{Corr}(\varepsilon_i, u_i)$ is the correlation between the errors

The variance of $u_i$ is normalized to 1 for identification (as in any probit model).
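A quick simulation can make the role of $\rho$ concrete. Under the joint normality above, $E[\varepsilon_i \mid u_i > -c] = \rho\sigma\lambda(c)$, so the outcome error no longer has mean zero among selected observations. The sketch below checks this numerically; the values of $\sigma$, $\rho$, and the selection index $c$ are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma_demo, rho_demo = 2.0, 0.5

# Covariance matrix of (eps, u) with Var(u) normalized to 1
cov = np.array([[sigma_demo**2, rho_demo * sigma_demo],
                [rho_demo * sigma_demo, 1.0]])
draws = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
eps_draws, u_draws = draws[:, 0], draws[:, 1]

c = 0.3                      # hypothetical value of the selection index Z'gamma
selected = u_draws > -c      # s = 1 when Z'gamma + u > 0

empirical = eps_draws[selected].mean()
theoretical = rho_demo * sigma_demo * stats.norm.pdf(c) / stats.norm.cdf(c)

print(f"Mean of eps among selected: {empirical:.4f}")
print(f"rho * sigma * lambda(c):    {theoretical:.4f}")
```

The two numbers should agree up to simulation noise, which is exactly the term the two-step estimator adds to the outcome regression.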

### 2.2 The Log-Likelihood Function

The joint density of observing $(y_i, s_i)$ given $(X_i, Z_i)$ can be decomposed as:

$$f(y_i, s_i | X_i, Z_i) = \left[ f(y_i | X_i) \cdot P(s_i = 1 | y_i, X_i, Z_i) \right]^{s_i} \cdot P(s_i = 0 | Z_i)^{1 - s_i}$$

This yields two types of log-likelihood contributions:

**For selected observations** ($s_i = 1$):

$$\ell_i^{\text{sel}} = \log\phi\left(\frac{y_i - X_i'\beta}{\sigma}\right) - \log\sigma + \log\Phi(z_i^*)$$

where:

$$z_i^* = \frac{Z_i'\gamma + \rho(y_i - X_i'\beta)/\sigma}{\sqrt{1 - \rho^2}}$$

This expression captures both the density of observing the outcome value $y_i$ (the first two terms) and the conditional probability of being selected given the outcome realization (the third term).

**For non-selected observations** ($s_i = 0$):

$$\ell_i^{\text{non}} = \log\Phi(-Z_i'\gamma)$$

This is simply the log-probability of not being selected.

### 2.3 The Full Log-Likelihood

The full log-likelihood is the sum over all observations:

$$\mathcal{L}(\beta, \gamma, \sigma, \rho) = \sum_{i: s_i=1} \ell_i^{\text{sel}} + \sum_{i: s_i=0} \ell_i^{\text{non}}$$
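The formulas above translate almost line by line into NumPy. The function below is a minimal sketch written for this tutorial (the name `heckman_loglik` and its argument order are assumptions, not part of the PanelBox API); it uses `logpdf` and `logcdf` for numerical stability in the tails.

```python
import numpy as np
from scipy import stats


def heckman_loglik(beta, gamma, sigma, rho, y, X, Z, s):
    """Heckman log-likelihood; y is only used where s == 1."""
    sel = np.asarray(s) == 1
    zg = Z @ gamma

    # Selected observations: log phi(resid) - log(sigma) + log Phi(z*)
    resid = (y[sel] - X[sel] @ beta) / sigma
    z_star = (zg[sel] + rho * resid) / np.sqrt(1.0 - rho**2)
    ll_sel = stats.norm.logpdf(resid) - np.log(sigma) + stats.norm.logcdf(z_star)

    # Non-selected observations: log Phi(-Z'gamma)
    ll_non = stats.norm.logcdf(-zg[~sel])

    return ll_sel.sum() + ll_non.sum()
```

A useful sanity check: with `rho=0` the expression separates into a probit log-likelihood on the full sample plus a Gaussian regression log-likelihood on the selected sample.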

### 2.4 Parameter Transformations

To ensure the optimizer respects parameter constraints, PanelBox uses:

- $\sigma > 0$: Parameterize as $\sigma = \exp(\alpha_\sigma)$ where $\alpha_\sigma \in \mathbb{R}$
- $\rho \in (-1, 1)$: Parameterize as $\rho = \tanh(\alpha_\rho)$ where $\alpha_\rho \in \mathbb{R}$

These transformations convert a constrained optimization problem into an unconstrained one.
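As a small illustration of these mappings (the helper names are hypothetical, chosen for this example), any real-valued pair $(\alpha_\sigma, \alpha_\rho)$ maps to a valid $(\sigma, \rho)$, and the inverse mapping recovers the original values:

```python
import numpy as np


def to_unconstrained(sigma, rho):
    """Map (sigma > 0, rho in (-1, 1)) to the real line."""
    return np.log(sigma), np.arctanh(rho)


def to_constrained(alpha_sigma, alpha_rho):
    """Map unconstrained values back to valid (sigma, rho)."""
    return np.exp(alpha_sigma), np.tanh(alpha_rho)


alpha_sigma, alpha_rho = to_unconstrained(1.5, -0.4)
print(to_constrained(alpha_sigma, alpha_rho))  # recovers (1.5, -0.4) up to rounding

# Whatever values the optimizer tries, the constraints hold
print(to_constrained(-5.0, 3.0))
```

One practical caveat: for very large $|\alpha_\rho|$, `np.tanh` rounds to exactly $\pm 1$ in floating point, which makes $\sqrt{1 - \rho^2}$ zero, so estimates with $\rho$ near the boundary deserve extra scrutiny.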

### 2.5 Initialization Strategy

MLE requires good starting values for reliable convergence. PanelBox uses a **warm start** strategy:

1. Run the two-step estimator to obtain initial $\hat{\beta}$, $\hat{\gamma}$, $\hat{\sigma}$, $\hat{\rho}$
2. Transform to unconstrained parameters: $\alpha_\sigma = \log(\hat{\sigma})$, $\alpha_\rho = \text{arctanh}(\hat{\rho})$
3. Use these as starting values for BFGS optimization

This approach significantly improves convergence reliability compared to starting from zeros or random values.

In [None]:
# Visualize the log-likelihood components for intuition

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Panel 1: Normal PDF contribution (selected observations)
z_vals = np.linspace(-4, 4, 200)
pdf_vals = stats.norm.pdf(z_vals)
log_pdf_vals = np.log(pdf_vals)

axes[0].plot(z_vals, log_pdf_vals, linewidth=2.5, color='#2980b9')
axes[0].set_xlabel('Standardized Residual $(y_i - X_i\'\\beta)/\\sigma$', fontsize=11)
axes[0].set_ylabel('$\\log\\phi(\\cdot)$', fontsize=11)
axes[0].set_title('Outcome Density Contribution', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim([-10, 0])

# Panel 2: log(Phi(z*)) contribution (selected observations)
z_star = np.linspace(-4, 4, 200)
log_cdf_vals = np.log(stats.norm.cdf(z_star))

axes[1].plot(z_star, log_cdf_vals, linewidth=2.5, color='#27ae60')
axes[1].set_xlabel('Adjusted Selection Index $z_i^*$', fontsize=11)
axes[1].set_ylabel('$\\log\\Phi(z_i^*)$', fontsize=11)
axes[1].set_title('Selection Correction (Selected)', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([-10, 0])

# Panel 3: log(Phi(-Zg)) contribution (non-selected observations)
zg = np.linspace(-4, 4, 200)
log_1_minus_cdf = np.log(stats.norm.cdf(-zg))

axes[2].plot(zg, log_1_minus_cdf, linewidth=2.5, color='#e74c3c')
axes[2].set_xlabel('Selection Index $Z_i\'\\gamma$', fontsize=11)
axes[2].set_ylabel('$\\log\\Phi(-Z_i\'\\gamma)$', fontsize=11)
axes[2].set_title('Non-Selection Contribution', fontsize=13, fontweight='bold')
axes[2].grid(True, alpha=0.3)
axes[2].set_ylim([-10, 0])

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_mle_likelihood_components.png', dpi=150, bbox_inches='tight')
plt.show()

print("The full log-likelihood combines these three components:")
print("  Selected obs:     log phi(residual) - log(sigma) + log Phi(z*)")
print("  Non-selected obs: log Phi(-Z'gamma)")

*Figure: The three components of the Heckman MLE log-likelihood. Left: the normal density contribution from observed outcomes. Center: the selection correction for selected observations, linking outcome and selection via the correlation parameter. Right: the probit contribution from non-selected observations.*

<a id='section3'></a>
## 3. Loading Data

We use the same Mroz (1987) dataset from Notebook 04. This classic dataset studies married women's labor force participation and wages.

- **Outcome**: `wage` (hourly wage, observed only for participants)
- **Selection**: `lfp` (labor force participation, 0/1)
- **Outcome regressors (X)**: constant, education, experience, experience squared
- **Selection regressors (Z)**: constant, education, experience, age, children under 6, children 6-18, husband's income

In [None]:
# Load the Mroz 1987 dataset
data = pd.read_csv(DATA_DIR / 'mroz_1987.csv')

print(f"Dataset shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print(f"\nParticipation rate: {data['lfp'].mean():.2%}")
print(f"Participants (lfp=1): {data['lfp'].sum()}")
print(f"Non-participants (lfp=0): {(1 - data['lfp']).sum():.0f}")
print(f"\nWage statistics (participants only):")
print(data.loc[data['lfp'] == 1, 'wage'].describe())

In [None]:
# Display summary statistics
print("=== Full Sample Summary Statistics ===")
display(data.describe().round(3))

In [None]:
# Prepare variables for estimation

# Selection indicator
selection = data['lfp'].values

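# Outcome variable: wage. Non-participants get a placeholder 0; the outcome
# equation only uses observations with lfp == 1, so these zeros are never fitted.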
wage = data['wage'].fillna(0).values

# Outcome equation regressors: const, education, experience, experience_sq
X = sm.add_constant(
    data[['education', 'experience', 'experience_sq']].values
)
x_names = ['const', 'education', 'experience', 'experience_sq']

# Selection equation regressors: const, education, experience, age,
#   children_lt6, children_6_18, husband_income
Z = sm.add_constant(
    data[['education', 'experience', 'age',
          'children_lt6', 'children_6_18', 'husband_income']].values
)
z_names = ['const', 'education', 'experience', 'age',
           'children_lt6', 'children_6_18', 'husband_income']

print(f"Outcome regressors (X): {x_names}")
print(f"Selection regressors (Z): {z_names}")
print(f"\nX shape: {X.shape}")
print(f"Z shape: {Z.shape}")
print(f"\nExclusion restrictions (in Z but not in X):")
print(f"  age, children_lt6, children_6_18, husband_income")
print(f"\nThese variables affect the participation decision but not wages directly.")

<a id='section4'></a>
## 4. Two-Step Estimation (Review)

Before turning to MLE, let us quickly re-estimate the two-step model as our baseline. This also provides the starting values that PanelBox will use internally for MLE optimization.

In [None]:
# Two-step estimation
model_2s = PanelHeckman(
    endog=wage,
    exog=X,
    selection=selection,
    exog_selection=Z,
    method='two_step'
)
result_2s = model_2s.fit()

print(result_2s.summary())

In [None]:
# Display two-step estimates in a structured table
print("=" * 65)
print("  TWO-STEP HECKMAN ESTIMATES")
print("=" * 65)

print("\nOutcome Equation (y = wage):")
print("-" * 45)
print(f"{'Variable':<20} {'Coefficient':>12}")
print("-" * 45)
for name, coef in zip(x_names, result_2s.outcome_params):
    print(f"{name:<20} {coef:>12.4f}")

print("\nSelection Equation (s = lfp):")
print("-" * 45)
print(f"{'Variable':<20} {'Coefficient':>12}")
print("-" * 45)
for name, coef in zip(z_names, result_2s.probit_params):
    print(f"{name:<20} {coef:>12.4f}")

print("\nSelection Parameters:")
print("-" * 45)
print(f"{'sigma':<20} {result_2s.sigma:>12.4f}")
print(f"{'rho':<20} {result_2s.rho:>12.4f}")
print(f"{'lambda (rho*sigma)':<20} {result_2s.rho * result_2s.sigma:>12.4f}")
print("=" * 65)

The two-step estimates give us a baseline. Note the estimated $\rho$ and $\sigma$ -- these characterize the selection mechanism. A non-zero $\rho$ indicates that the unobservables affecting wages are correlated with the unobservables affecting participation.

Now let us see whether MLE produces materially different estimates.

<a id='section5'></a>
## 5. MLE Estimation

### 5.1 Estimating with `method='mle'`

Switching to MLE in PanelBox requires only changing the `method` argument. Internally, PanelBox will:

1. Run two-step estimation to obtain initial parameter values (warm start)
2. Transform $\sigma$ and $\rho$ to unconstrained parameterizations
3. Maximize the joint log-likelihood using BFGS optimization
4. Transform parameters back and construct the results object

In [None]:
# MLE estimation
model_ml = PanelHeckman(
    endog=wage,
    exog=X,
    selection=selection,
    exog_selection=Z,
    method='mle'
)
result_ml = model_ml.fit()

print(result_ml.summary())

In [None]:
# Display MLE estimates in a structured table
print("=" * 65)
print("  MAXIMUM LIKELIHOOD HECKMAN ESTIMATES")
print("=" * 65)

print("\nOutcome Equation (y = wage):")
print("-" * 45)
print(f"{'Variable':<20} {'Coefficient':>12}")
print("-" * 45)
for name, coef in zip(x_names, result_ml.outcome_params):
    print(f"{name:<20} {coef:>12.4f}")

print("\nSelection Equation (s = lfp):")
print("-" * 45)
print(f"{'Variable':<20} {'Coefficient':>12}")
print("-" * 45)
for name, coef in zip(z_names, result_ml.probit_params):
    print(f"{name:<20} {coef:>12.4f}")

print("\nSelection Parameters:")
print("-" * 45)
print(f"{'sigma':<20} {result_ml.sigma:>12.4f}")
print(f"{'rho':<20} {result_ml.rho:>12.4f}")
print(f"{'lambda (rho*sigma)':<20} {result_ml.rho * result_ml.sigma:>12.4f}")

print("\nModel Diagnostics:")
print("-" * 45)
if result_ml.llf is not None:
    print(f"{'Log-likelihood':<20} {result_ml.llf:>12.4f}")
print(f"{'Converged':<20} {str(result_ml.converged):>12}")
print("=" * 65)

### 5.2 Understanding the MLE Output

The MLE results include several additional pieces of information compared to two-step:

- **Log-likelihood**: The maximized value of the log-likelihood function. This can be used for AIC, BIC, and likelihood ratio tests.
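- **Converged**: Whether the BFGS optimizer met its termination criteria. Convergence indicates a stationary point of the log-likelihood, typically a local maximum, but does not guarantee the global one. Non-convergence may indicate model misspecification, collinearity, or insufficient data.
- **Joint estimation**: All parameters are estimated simultaneously, so the sampling covariance between $\hat{\beta}$ and $\hat{\gamma}$ is accounted for in the estimation.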
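One likelihood ratio test that the log-likelihood makes possible is a test of $H_0: \rho = 0$ (no selection on unobservables). Under $H_0$ the joint likelihood separates into a probit on the full sample and a Gaussian linear regression (fitted by maximum likelihood) on the selected sample, so the restricted log-likelihood is the sum of the two. The helper below is a sketch with a hypothetical name, and the numbers in the usage line are made up purely to show the arithmetic.

```python
from scipy import stats


def lr_test_independence(llf_heckman, llf_probit, llf_outcome):
    """LR test of rho = 0.

    llf_heckman : maximized joint log-likelihood (unrestricted)
    llf_probit  : log-likelihood of a standalone probit for selection
    llf_outcome : Gaussian log-likelihood of the outcome regression
                  fitted on selected observations only
    """
    lr_stat = 2.0 * (llf_heckman - (llf_probit + llf_outcome))
    p_value = stats.chi2.sf(lr_stat, df=1)  # one restriction: rho = 0
    return lr_stat, p_value


# Hypothetical log-likelihood values, for illustration only
lr_stat, p_value = lr_test_independence(-1000.0, -400.0, -602.0)
print(f"LR = {lr_stat:.2f}, p-value = {p_value:.4f}")
```

In practice `llf_heckman` would come from `result_ml.llf`, while the two restricted pieces could be obtained from, for example, `statsmodels` `Probit` and `OLS` fits, both of which expose an `llf` attribute.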

In [None]:
# Check convergence and report diagnostics
print("=== MLE Convergence Diagnostics ===")
print(f"\nConverged: {result_ml.converged}")

if result_ml.llf is not None:
    n_params = len(result_ml.outcome_params) + len(result_ml.probit_params) + 2  # +2 for sigma, rho
    n_obs = result_ml.n_total
    
    aic = -2 * result_ml.llf + 2 * n_params
    bic = -2 * result_ml.llf + np.log(n_obs) * n_params
    
    print(f"\nLog-likelihood: {result_ml.llf:.4f}")
    print(f"Number of parameters: {n_params}")
    print(f"Number of observations: {n_obs}")
    print(f"AIC: {aic:.4f}")
    print(f"BIC: {bic:.4f}")

print(f"\nParameter bounds check:")
print(f"  sigma = {result_ml.sigma:.4f} (must be > 0: {'OK' if result_ml.sigma > 0 else 'PROBLEM'})")
print(f"  rho   = {result_ml.rho:.4f} (must be in (-1,1): {'OK' if -1 < result_ml.rho < 1 else 'PROBLEM'})")

<a id='section6'></a>
## 6. Comparing Two-Step and MLE

### 6.1 Side-by-Side Parameter Comparison

Now let us compare the estimates from both methods systematically.

In [None]:
# Use the comparison utility
import sys
sys.path.insert(0, str(BASE_DIR / 'utils'))
from comparison_tools import compare_heckman_methods

comparison = compare_heckman_methods(
    result_2s, result_ml,
    variable_names=x_names
)

print("=" * 70)
print("  TWO-STEP vs MLE: PARAMETER COMPARISON")
print("=" * 70)
print(comparison.to_string(float_format=lambda x: f"{x:.4f}"))
print("=" * 70)

In [None]:
# Build a more detailed comparison including selection equation
print("\n" + "=" * 75)
print("  COMPREHENSIVE COMPARISON: ALL PARAMETER ESTIMATES")
print("=" * 75)

# Outcome equation
print("\n--- Outcome Equation ---")
print(f"{'Variable':<18} {'Two-Step':>12} {'MLE':>12} {'Difference':>12} {'% Diff':>10}")
print("-" * 66)
for i, name in enumerate(x_names):
    ts_val = result_2s.outcome_params[i]
    ml_val = result_ml.outcome_params[i]
    diff = ts_val - ml_val
    pct = 100 * diff / (abs(ts_val) + 1e-10)
    print(f"{name:<18} {ts_val:>12.4f} {ml_val:>12.4f} {diff:>12.4f} {pct:>9.1f}%")

# Selection equation
print("\n--- Selection Equation ---")
print(f"{'Variable':<18} {'Two-Step':>12} {'MLE':>12} {'Difference':>12} {'% Diff':>10}")
print("-" * 66)
for i, name in enumerate(z_names):
    ts_val = result_2s.probit_params[i]
    ml_val = result_ml.probit_params[i]
    diff = ts_val - ml_val
    pct = 100 * diff / (abs(ts_val) + 1e-10) if abs(ts_val) > 1e-6 else 0
    print(f"{name:<18} {ts_val:>12.4f} {ml_val:>12.4f} {diff:>12.4f} {pct:>9.1f}%")

# Selection parameters
print("\n--- Selection Parameters ---")
print(f"{'Parameter':<18} {'Two-Step':>12} {'MLE':>12} {'Difference':>12}")
print("-" * 56)
print(f"{'sigma':<18} {result_2s.sigma:>12.4f} {result_ml.sigma:>12.4f} {result_2s.sigma - result_ml.sigma:>12.4f}")
print(f"{'rho':<18} {result_2s.rho:>12.4f} {result_ml.rho:>12.4f} {result_2s.rho - result_ml.rho:>12.4f}")

lambda_2s = result_2s.rho * result_2s.sigma
lambda_ml = result_ml.rho * result_ml.sigma
print(f"{'lambda (rho*sigma)':<18} {lambda_2s:>12.4f} {lambda_ml:>12.4f} {lambda_2s - lambda_ml:>12.4f}")

print("=" * 75)

In [None]:
# Visual comparison of outcome equation coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Outcome equation coefficients
x_pos = np.arange(len(x_names))
width = 0.35

bars1 = axes[0].bar(x_pos - width/2, result_2s.outcome_params, width,
                     label='Two-Step', color='#3498db', alpha=0.8, edgecolor='black')
bars2 = axes[0].bar(x_pos + width/2, result_ml.outcome_params, width,
                     label='MLE', color='#e74c3c', alpha=0.8, edgecolor='black')

axes[0].set_xlabel('Variable', fontsize=12)
axes[0].set_ylabel('Coefficient', fontsize=12)
axes[0].set_title('Outcome Equation: Two-Step vs MLE', fontsize=13, fontweight='bold')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(x_names, rotation=30, ha='right')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axhline(y=0, color='black', linewidth=0.8)

# Panel 2: Selection parameters (rho, sigma, lambda)
sel_names = ['sigma', 'rho', 'lambda']
sel_2s = [result_2s.sigma, result_2s.rho, result_2s.rho * result_2s.sigma]
sel_ml = [result_ml.sigma, result_ml.rho, result_ml.rho * result_ml.sigma]

x_pos2 = np.arange(len(sel_names))
bars3 = axes[1].bar(x_pos2 - width/2, sel_2s, width,
                     label='Two-Step', color='#3498db', alpha=0.8, edgecolor='black')
bars4 = axes[1].bar(x_pos2 + width/2, sel_ml, width,
                     label='MLE', color='#e74c3c', alpha=0.8, edgecolor='black')

axes[1].set_xlabel('Parameter', fontsize=12)
axes[1].set_ylabel('Value', fontsize=12)
axes[1].set_title('Selection Parameters: Two-Step vs MLE', fontsize=13, fontweight='bold')
axes[1].set_xticks(x_pos2)
axes[1].set_xticklabels(sel_names)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].axhline(y=0, color='black', linewidth=0.8)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_twostep_vs_mle_coefficients.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Side-by-side comparison of parameter estimates from two-step and MLE estimation. Left panel shows outcome equation coefficients. Right panel shows the selection parameters sigma, rho, and lambda. Differences between the two sets of estimates can arise from joint versus stage-wise estimation, from sampling variability, or from sensitivity to the distributional assumptions.*

In [None]:
# Selection equation comparison
fig, ax = plt.subplots(figsize=(12, 6))

x_pos3 = np.arange(len(z_names))
width = 0.35

bars_sel_2s = ax.bar(x_pos3 - width/2, result_2s.probit_params, width,
                      label='Two-Step', color='#3498db', alpha=0.8, edgecolor='black')
bars_sel_ml = ax.bar(x_pos3 + width/2, result_ml.probit_params, width,
                      label='MLE', color='#e74c3c', alpha=0.8, edgecolor='black')

ax.set_xlabel('Variable', fontsize=12)
ax.set_ylabel('Coefficient', fontsize=12)
ax.set_title('Selection Equation (Probit): Two-Step vs MLE', fontsize=13, fontweight='bold')
ax.set_xticks(x_pos3)
ax.set_xticklabels(z_names, rotation=30, ha='right')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')
ax.axhline(y=0, color='black', linewidth=0.8)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_selection_eq_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Comparison of selection equation (probit) coefficients between two-step and MLE. In two-step estimation, these come from a standalone probit; in MLE, they are estimated jointly with the outcome equation, incorporating information about the outcome process.*

### 6.2 Interpreting the Differences

Several patterns typically emerge when comparing two-step and MLE:

1. **Outcome coefficients ($\beta$)**: Usually similar, with MLE sometimes showing small refinements. Large differences may signal sensitivity to distributional assumptions.

2. **Selection coefficients ($\gamma$)**: In two-step, these come from a standalone probit. In MLE, the selection equation is informed by the outcome data, so estimates may shift slightly.

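3. **Correlation parameter ($\rho$)**: Can differ meaningfully between the two methods. MLE estimates $\rho$ directly as a parameter of the joint likelihood, whereas two-step recovers it from the IMR coefficient as $\hat{\rho} = \hat{\lambda}/\hat{\sigma}$, which is not guaranteed to lie in $[-1, 1]$.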

4. **Variance parameter ($\sigma$)**: MLE often produces a slightly different $\sigma$ because it accounts for the full error structure.

In [None]:
# Quantify the overall agreement
outcome_diff = result_2s.outcome_params - result_ml.outcome_params
selection_diff = result_2s.probit_params - result_ml.probit_params

print("=== Agreement Assessment ===")
print(f"\nOutcome equation:")
print(f"  Mean absolute difference: {np.mean(np.abs(outcome_diff)):.4f}")
print(f"  Max absolute difference:  {np.max(np.abs(outcome_diff)):.4f}")
print(f"  RMSD:                     {np.sqrt(np.mean(outcome_diff**2)):.4f}")

print(f"\nSelection equation:")
print(f"  Mean absolute difference: {np.mean(np.abs(selection_diff)):.4f}")
print(f"  Max absolute difference:  {np.max(np.abs(selection_diff)):.4f}")
print(f"  RMSD:                     {np.sqrt(np.mean(selection_diff**2)):.4f}")

print(f"\nSelection parameters:")
print(f"  sigma difference: {abs(result_2s.sigma - result_ml.sigma):.4f}")
print(f"  rho difference:   {abs(result_2s.rho - result_ml.rho):.4f}")

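# Overall verdict (0.5 is an arbitrary, illustrative threshold in raw coefficient
# units; it is a rough heuristic, not a formal test)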
max_diff = max(np.max(np.abs(outcome_diff)), np.max(np.abs(selection_diff)))
if max_diff < 0.5:
    print(f"\nVerdict: Estimates are broadly consistent (max diff = {max_diff:.4f}).")
    print("This suggests the model is well-identified and results are robust.")
else:
    print(f"\nVerdict: Non-trivial differences detected (max diff = {max_diff:.4f}).")
    print("This may reflect sensitivity to distributional assumptions or identification.")

<a id='section7'></a>
## 7. When to Use Which Method

### 7.1 Summary of Trade-Offs

| Criterion | Two-Step | MLE |
|-----------|----------|-----|
| **Consistency** | Yes (under correct probit) | Yes (under bivariate normality) |
| **Efficiency** | Less efficient | Asymptotically efficient |
| **Standard errors** | Require correction (Murphy-Topel) | Correct from information matrix |
| **Distributional reliance** | Less sensitive | Fully relies on bivariate normality |
| **Computational cost** | Very fast (probit followed by OLS) | Iterative optimization (slower) |
| **Convergence** | Rarely a problem (probit likelihood is globally concave) | Can fail to converge |
| **Model comparison** | No likelihood available | AIC, BIC, LR tests |
| **Small samples** | More robust | May be unreliable |
| **Large samples** | Fine, but less efficient | Preferred (efficiency gains) |

### 7.2 Practical Recommendations

In [None]:
# Create a visual decision guide
fig, ax = plt.subplots(figsize=(12, 7))

categories = [
    'Efficiency',
    'Robustness to\nMisspecification',
    'Computational\nSpeed',
    'Correct\nStandard Errors',
    'Model\nComparison Tools',
    'Small Sample\nReliability',
    'Convergence\nGuarantee'
]

# Scores (subjective, for illustration; 1-5 scale)
twostep_scores = [2, 4, 5, 2, 1, 4, 5]
mle_scores =     [5, 2, 2, 5, 5, 2, 3]

x_pos = np.arange(len(categories))
width = 0.35

bars1 = ax.barh(x_pos - width/2, twostep_scores, width,
                label='Two-Step', color='#3498db', alpha=0.8, edgecolor='black')
bars2 = ax.barh(x_pos + width/2, mle_scores, width,
                label='MLE', color='#e74c3c', alpha=0.8, edgecolor='black')

ax.set_yticks(x_pos)
ax.set_yticklabels(categories, fontsize=11)
ax.set_xlabel('Score (1 = Low, 5 = High)', fontsize=12)
ax.set_title('Two-Step vs MLE: Method Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=12, loc='lower right')
ax.set_xlim([0, 6])
ax.grid(True, alpha=0.3, axis='x')

# Add score labels
for bar, score in zip(bars1, twostep_scores):
    ax.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2,
            str(score), va='center', fontsize=10, fontweight='bold', color='#2c3e50')
for bar, score in zip(bars2, mle_scores):
    ax.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2,
            str(score), va='center', fontsize=10, fontweight='bold', color='#2c3e50')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_method_comparison_scores.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Qualitative comparison of two-step and MLE estimation on seven criteria. Two-step excels in robustness, speed, and convergence reliability, while MLE dominates in efficiency, standard error accuracy, and model comparison capabilities.*

### 7.3 Decision Framework

**Use Two-Step when:**
- You are in the exploratory phase and want quick, reliable estimates
- The sample size is small (< 200 selected observations)
- You are uncertain about the bivariate normality assumption
- Convergence of MLE is problematic
- You primarily need consistent estimates and are less concerned about efficiency

**Use MLE when:**
- You have a moderate to large sample (> 500 observations)
- You need correct standard errors without bootstrap or Murphy-Topel corrections
- You want to compare models using AIC, BIC, or likelihood ratio tests
- You are confident in the bivariate normality assumption
- Maximum efficiency is important (e.g., for policy analysis)

**Best practice:** Run both methods and compare. If estimates are similar, report MLE for its efficiency advantages. If they differ substantially, investigate why -- this often signals model misspecification or sensitivity to distributional assumptions.

In [None]:
# Demonstrate the computational cost difference
import time

# Time two-step estimation
# (perf_counter is a monotonic, high-resolution clock suited to timing code)
n_runs = 5
times_2s = []
for _ in range(n_runs):
    start = time.perf_counter()
    model_tmp = PanelHeckman(endog=wage, exog=X, selection=selection,
                             exog_selection=Z, method='two_step')
    _ = model_tmp.fit()
    times_2s.append(time.perf_counter() - start)

# Time MLE estimation
times_ml = []
for _ in range(n_runs):
    start = time.perf_counter()
    model_tmp = PanelHeckman(endog=wage, exog=X, selection=selection,
                             exog_selection=Z, method='mle')
    _ = model_tmp.fit()
    times_ml.append(time.perf_counter() - start)

print("=== Computational Cost Comparison ===")
print(f"\nTwo-Step (avg of {n_runs} runs): {np.mean(times_2s)*1000:.1f} ms")
print(f"MLE      (avg of {n_runs} runs): {np.mean(times_ml)*1000:.1f} ms")
print(f"\nMLE is approximately {np.mean(times_ml)/np.mean(times_2s):.1f}x slower than two-step.")
print(f"\nNote: MLE internally runs two-step first to obtain starting values,")
print(f"then runs iterative BFGS optimization on the joint log-likelihood.")

### 7.4 Sample Size Considerations

The efficiency advantage of MLE over two-step is an asymptotic result, so it is most dependable in large samples, where MLE typically provides tighter confidence intervals. In small samples the gain may be negligible, and MLE is more prone to convergence problems and unstable estimates of $\rho$.

Let us illustrate this by examining how parameter estimates behave as we vary the effective sample size through subsampling.

In [None]:
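# Subsampling exercise: compare two-step and MLE across different sample fractions
# (one random subsample per fraction, so this is illustrative rather than a full Monte Carlo study)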
fractions = [0.3, 0.5, 0.7, 1.0]
n_total = len(selection)

results_by_fraction = []
np.random.seed(42)

for frac in fractions:
    n_sub = int(frac * n_total)
    idx = np.random.choice(n_total, size=n_sub, replace=False)
    idx.sort()
    
    wage_sub = wage[idx]
    X_sub = X[idx]
    Z_sub = Z[idx]
    sel_sub = selection[idx]
    
    # Two-step
    m_2s = PanelHeckman(endog=wage_sub, exog=X_sub, selection=sel_sub,
                         exog_selection=Z_sub, method='two_step')
    r_2s = m_2s.fit()
    
    # MLE
    m_ml = PanelHeckman(endog=wage_sub, exog=X_sub, selection=sel_sub,
                         exog_selection=Z_sub, method='mle')
    r_ml = m_ml.fit()
    
    results_by_fraction.append({
        'fraction': frac,
        'n': n_sub,
        'n_selected': int(sel_sub.sum()),
        'beta_educ_2s': r_2s.outcome_params[1],
        'beta_educ_ml': r_ml.outcome_params[1],
        'rho_2s': r_2s.rho,
        'rho_ml': r_ml.rho,
        'sigma_2s': r_2s.sigma,
        'sigma_ml': r_ml.sigma,
        'converged_ml': r_ml.converged
    })

frac_df = pd.DataFrame(results_by_fraction)

print("=== Parameter Estimates by Sample Fraction ===")
print(f"\n{'Frac':>6} {'N':>6} {'N_sel':>6} | {'educ_2s':>9} {'educ_ml':>9} | {'rho_2s':>8} {'rho_ml':>8} | {'Conv':>5}")
print("-" * 75)
for _, row in frac_df.iterrows():
    print(f"{row['fraction']:>6.1f} {int(row['n']):>6} {int(row['n_selected']):>6} | "
          f"{row['beta_educ_2s']:>9.4f} {row['beta_educ_ml']:>9.4f} | "
          f"{row['rho_2s']:>8.4f} {row['rho_ml']:>8.4f} | "
          f"{str(row['converged_ml']):>5}")

<a id='section8'></a>
## 8. Prediction Comparison

### 8.1 Types of Predictions

The Heckman model supports two types of predictions:

1. **Unconditional prediction** $E[y_i^*] = X_i'\beta$: The expected value of the latent outcome variable, regardless of whether the individual is selected. This answers: "What would this person's wage be if everyone worked?"

2. **Conditional prediction** $E[y_i | s_i = 1] = X_i'\beta + \rho\sigma\lambda(Z_i'\gamma)$: The expected value of the observed outcome, conditional on being selected. This answers: "What do we expect this person's wage to be, given that they work?"

The difference between these two predictions is the **selection correction** term $\rho\sigma\lambda(Z_i'\gamma)$.
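Both formulas are simple to compute by hand once the parameters are known. The sketch below does so with NumPy; the function name and all parameter values are hypothetical, and it is not necessarily how PanelBox's `predict` is implemented internally.

```python
import numpy as np
from scipy import stats


def heckman_predictions(X, Z, beta, gamma, sigma, rho):
    """Return (unconditional, conditional) predictions."""
    xb = X @ beta
    zg = Z @ gamma
    # Inverse Mills ratio, computed in logs for stability in the tails
    imr = np.exp(stats.norm.logpdf(zg) - stats.norm.logcdf(zg))
    unconditional = xb                      # E[y*] = X'beta
    conditional = xb + rho * sigma * imr    # E[y | s = 1]
    return unconditional, conditional


# Two hypothetical individuals: columns of X are (const, education),
# columns of Z are (const, education, children_lt6)
X_new = np.array([[1.0, 12.0], [1.0, 16.0]])
Z_new = np.array([[1.0, 12.0, 1.0], [1.0, 16.0, 0.0]])

uncond, cond = heckman_predictions(
    X_new, Z_new,
    beta=np.array([-2.0, 0.5]),
    gamma=np.array([-1.0, 0.1, -0.5]),
    sigma=3.0, rho=0.4,
)
print("Unconditional:", uncond)
print("Conditional:  ", cond)
print("Correction:   ", cond - uncond)
```

With $\rho > 0$ the correction is positive for everyone and largest for individuals with a low selection index, since the IMR decreases in $Z_i'\gamma$.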

In [None]:
# Generate predictions from both methods
pred_2s_uncond = result_2s.predict(type='unconditional')
pred_2s_cond = result_2s.predict(type='conditional')

pred_ml_uncond = result_ml.predict(type='unconditional')
pred_ml_cond = result_ml.predict(type='conditional')

# Focus on selected sample for meaningful wage comparisons
selected_mask = selection == 1
actual_wages = wage[selected_mask]

print("=== Prediction Summary (Selected Sample Only) ===")
print(f"\n{'':>25} {'Two-Step':>12} {'MLE':>12} {'Actual':>12}")
print("-" * 63)
print(f"{'Unconditional mean':>25} {pred_2s_uncond[selected_mask].mean():>12.4f} "
      f"{pred_ml_uncond[selected_mask].mean():>12.4f} {actual_wages.mean():>12.4f}")
print(f"{'Conditional mean':>25} {pred_2s_cond[selected_mask].mean():>12.4f} "
      f"{pred_ml_cond[selected_mask].mean():>12.4f} {actual_wages.mean():>12.4f}")
print(f"{'Unconditional std':>25} {pred_2s_uncond[selected_mask].std():>12.4f} "
      f"{pred_ml_uncond[selected_mask].std():>12.4f} {actual_wages.std():>12.4f}")
print(f"{'Conditional std':>25} {pred_2s_cond[selected_mask].std():>12.4f} "
      f"{pred_ml_cond[selected_mask].std():>12.4f} {actual_wages.std():>12.4f}")

# Selection correction magnitude
correction_2s = pred_2s_cond[selected_mask] - pred_2s_uncond[selected_mask]
correction_ml = pred_ml_cond[selected_mask] - pred_ml_uncond[selected_mask]

print(f"\n{'':>25} {'Two-Step':>12} {'MLE':>12}")
print("-" * 51)
print(f"{'Selection correction mean':>25} {correction_2s.mean():>12.4f} {correction_ml.mean():>12.4f}")
print(f"{'Selection correction std':>25} {correction_2s.std():>12.4f} {correction_ml.std():>12.4f}")

In [None]:
# Plot 1: Unconditional vs Conditional predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Two-Step predictions
axes[0].scatter(pred_2s_uncond[selected_mask], pred_2s_cond[selected_mask],
                alpha=0.4, s=20, color='#3498db', label='Selected obs')
lims = [min(pred_2s_uncond[selected_mask].min(), pred_2s_cond[selected_mask].min()) - 1,
        max(pred_2s_uncond[selected_mask].max(), pred_2s_cond[selected_mask].max()) + 1]
axes[0].plot(lims, lims, 'r--', linewidth=2, label='45-degree line')
axes[0].set_xlabel('Unconditional Prediction $E[y^*]$', fontsize=11)
axes[0].set_ylabel('Conditional Prediction $E[y|s=1]$', fontsize=11)
axes[0].set_title('Two-Step: Unconditional vs Conditional', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Panel 2: MLE predictions
axes[1].scatter(pred_ml_uncond[selected_mask], pred_ml_cond[selected_mask],
                alpha=0.4, s=20, color='#e74c3c', label='Selected obs')
lims2 = [min(pred_ml_uncond[selected_mask].min(), pred_ml_cond[selected_mask].min()) - 1,
         max(pred_ml_uncond[selected_mask].max(), pred_ml_cond[selected_mask].max()) + 1]
axes[1].plot(lims2, lims2, 'r--', linewidth=2, label='45-degree line')
axes[1].set_xlabel('Unconditional Prediction $E[y^*]$', fontsize=11)
axes[1].set_ylabel('Conditional Prediction $E[y|s=1]$', fontsize=11)
axes[1].set_title('MLE: Unconditional vs Conditional', fontsize=12, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_unconditional_vs_conditional.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Unconditional versus conditional predictions for selected observations. The vertical distance from each point to the 45-degree line is the selection correction for that individual. Because the inverse Mills ratio $\lambda$ is always positive, every point falls on the same side of the line: above it when $\hat{\rho} > 0$ (positive selection, conditional wage exceeds unconditional) and below it when $\hat{\rho} < 0$ (negative selection).*

In [None]:
# Plot 2: Cross-method prediction comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Two-Step vs MLE unconditional
axes[0].scatter(pred_2s_uncond[selected_mask], pred_ml_uncond[selected_mask],
                alpha=0.4, s=20, color='#8e44ad')
r_uncond = np.corrcoef(pred_2s_uncond[selected_mask], pred_ml_uncond[selected_mask])[0, 1]
combined_range = [
    min(pred_2s_uncond[selected_mask].min(), pred_ml_uncond[selected_mask].min()) - 1,
    max(pred_2s_uncond[selected_mask].max(), pred_ml_uncond[selected_mask].max()) + 1
]
axes[0].plot(combined_range, combined_range, 'r--', linewidth=2)
axes[0].set_xlabel('Two-Step Prediction', fontsize=11)
axes[0].set_ylabel('MLE Prediction', fontsize=11)
axes[0].set_title(f'Unconditional Predictions\n(correlation = {r_uncond:.4f})',
                  fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Panel 2: Two-Step vs MLE conditional
axes[1].scatter(pred_2s_cond[selected_mask], pred_ml_cond[selected_mask],
                alpha=0.4, s=20, color='#16a085')
r_cond = np.corrcoef(pred_2s_cond[selected_mask], pred_ml_cond[selected_mask])[0, 1]
combined_range2 = [
    min(pred_2s_cond[selected_mask].min(), pred_ml_cond[selected_mask].min()) - 1,
    max(pred_2s_cond[selected_mask].max(), pred_ml_cond[selected_mask].max()) + 1
]
axes[1].plot(combined_range2, combined_range2, 'r--', linewidth=2)
axes[1].set_xlabel('Two-Step Prediction', fontsize=11)
axes[1].set_ylabel('MLE Prediction', fontsize=11)
axes[1].set_title(f'Conditional Predictions\n(correlation = {r_cond:.4f})',
                  fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_cross_method_predictions.png', dpi=150, bbox_inches='tight')
plt.show()

print("Correlation between two-step and MLE:")
print(f"  Unconditional predictions: {r_uncond:.6f}")
print(f"  Conditional predictions:   {r_cond:.6f}")

*Figure: Cross-method prediction comparison. Left: unconditional predictions from two-step versus MLE. Right: conditional predictions. Points lying close to the 45-degree line, together with a high correlation, indicate that both methods produce similar predictions, reinforcing confidence in the results.*

In [None]:
# Plot 3: Prediction residuals against actual wages
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

resid_configs = [
    (pred_2s_uncond[selected_mask], 'Two-Step Unconditional', '#3498db'),
    (pred_ml_uncond[selected_mask], 'MLE Unconditional', '#e74c3c'),
    (pred_2s_cond[selected_mask], 'Two-Step Conditional', '#27ae60'),
    (pred_ml_cond[selected_mask], 'MLE Conditional', '#f39c12'),
]

for ax, (pred, label, color) in zip(axes.flat, resid_configs):
    residuals = actual_wages - pred
    rmse = np.sqrt(np.mean(residuals**2))
    mae = np.mean(np.abs(residuals))
    
    ax.scatter(pred, residuals, alpha=0.3, s=15, color=color)
    ax.axhline(y=0, color='black', linewidth=1.5, linestyle='--')
    ax.set_xlabel('Predicted Wage', fontsize=10)
    ax.set_ylabel('Residual', fontsize=10)
    ax.set_title(f'{label}\nRMSE={rmse:.2f}, MAE={mae:.2f}', fontsize=11, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_prediction_residuals.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Prediction residuals (actual minus predicted wage) for all four prediction types. The conditional predictions typically have lower RMSE because they account for the selection correction. Comparing two-step and MLE residuals reveals whether joint estimation materially improves predictive accuracy.*

In [None]:
# Plot 4: Distribution of the selection correction
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of selection correction magnitudes
axes[0].hist(correction_2s, bins=40, alpha=0.6, color='#3498db',
             label='Two-Step', edgecolor='black', density=True)
axes[0].hist(correction_ml, bins=40, alpha=0.6, color='#e74c3c',
             label='MLE', edgecolor='black', density=True)
axes[0].axvline(x=0, color='black', linewidth=1.5, linestyle='--')
axes[0].set_xlabel('Selection Correction ($\\rho\\sigma\\lambda$)', fontsize=11)
axes[0].set_ylabel('Density', fontsize=11)
axes[0].set_title('Distribution of Selection Corrections', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Scatter: Two-Step vs MLE corrections
axes[1].scatter(correction_2s, correction_ml, alpha=0.4, s=15, color='#8e44ad')
combined_corr_range = [
    min(correction_2s.min(), correction_ml.min()) - 0.5,
    max(correction_2s.max(), correction_ml.max()) + 0.5
]
axes[1].plot(combined_corr_range, combined_corr_range, 'r--', linewidth=2, label='45-degree line')
r_correction = np.corrcoef(correction_2s, correction_ml)[0, 1]
axes[1].set_xlabel('Two-Step Correction', fontsize=11)
axes[1].set_ylabel('MLE Correction', fontsize=11)
axes[1].set_title(f'Selection Correction Comparison\n(correlation = {r_correction:.4f})',
                  fontsize=12, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_selection_correction_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

*Figure: Comparison of selection correction terms between two-step and MLE. Left: overlapping histograms showing the distribution of corrections. Right: scatter plot comparing individual-level corrections across methods. The sign and magnitude of corrections reflect the estimated rho and sigma parameters.*

In [None]:
# Plot 5: Predicted wages by education level
# Create education-wage profile for both methods

educ_range = np.arange(5, 20, 0.5)
mean_exp = data['experience'].mean()
mean_exp_sq = data['experience_sq'].mean()

# Build prediction matrices for outcome equation
X_pred = np.column_stack([
    np.ones(len(educ_range)),
    educ_range,
    np.full(len(educ_range), mean_exp),
    np.full(len(educ_range), mean_exp_sq)
])

# Unconditional predictions
wage_pred_2s = X_pred @ result_2s.outcome_params
wage_pred_ml = X_pred @ result_ml.outcome_params

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(educ_range, wage_pred_2s, linewidth=2.5, color='#3498db',
        label='Two-Step', linestyle='-')
ax.plot(educ_range, wage_pred_ml, linewidth=2.5, color='#e74c3c',
        label='MLE', linestyle='--')

# Overlay actual data (selected sample)
ax.scatter(data.loc[selected_mask, 'education'],
           actual_wages, alpha=0.15, s=10, color='gray', label='Actual wages')

ax.set_xlabel('Years of Education', fontsize=12)
ax.set_ylabel('Predicted Wage (Unconditional)', fontsize=12)
ax.set_title('Wage-Education Profile: Two-Step vs MLE\n(at mean experience)',
             fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'heckman_wage_education_profile.png', dpi=150, bbox_inches='tight')
plt.show()

print("Return to education (unconditional):")
print(f"  Two-Step: {result_2s.outcome_params[1]:.4f} per year")
print(f"  MLE:      {result_ml.outcome_params[1]:.4f} per year")
print(f"  Difference: {abs(result_2s.outcome_params[1] - result_ml.outcome_params[1]):.4f}")

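*Figure: Predicted wage-education profiles at mean experience levels. The two methods produce very similar profiles; small gaps between the lines reflect differences in the point estimates of $\beta$ from the two estimators. MLE's efficiency gain shows up in the standard errors, not in the profile itself. Actual wages from the selected sample are shown as background scatter points.*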

### 8.2 Prediction Accuracy Comparison

In [None]:
# Comprehensive prediction accuracy metrics
def prediction_metrics(actual, predicted, label):
    """Compute prediction accuracy metrics."""
    residuals = actual - predicted
    return {
        'Method': label,
        'RMSE': np.sqrt(np.mean(residuals**2)),
        'MAE': np.mean(np.abs(residuals)),
        'Mean Bias': np.mean(residuals),
        'Median Bias': np.median(residuals),
        'Correlation': np.corrcoef(actual, predicted)[0, 1]
    }

metrics_list = [
    prediction_metrics(actual_wages, pred_2s_uncond[selected_mask], 'Two-Step Uncond'),
    prediction_metrics(actual_wages, pred_2s_cond[selected_mask], 'Two-Step Cond'),
    prediction_metrics(actual_wages, pred_ml_uncond[selected_mask], 'MLE Uncond'),
    prediction_metrics(actual_wages, pred_ml_cond[selected_mask], 'MLE Cond'),
]

metrics_df = pd.DataFrame(metrics_list).set_index('Method')

print("=" * 75)
print("  PREDICTION ACCURACY COMPARISON")
print("=" * 75)
print(metrics_df.to_string(float_format=lambda x: f"{x:.4f}"))
print("=" * 75)

print("\nInterpretation:")
print("  - RMSE: Root mean squared error (lower is better)")
print("  - MAE: Mean absolute error (lower is better)")
print("  - Mean Bias: Average prediction error (closer to 0 is better)")
print("  - Conditional predictions include selection correction and should")
print("    better match observed wages for the selected sample.")

<a id='section9'></a>
## 9. Summary and Key Takeaways

### 9.1 What We Learned

1. **Full information MLE** estimates all parameters of the Heckman model simultaneously by maximizing the joint log-likelihood of the selection and outcome equations.

2. **The log-likelihood** has two components: selected observations contribute both an outcome density term and a conditional selection probability, while non-selected observations contribute only the probability of non-selection.

3. **MLE is asymptotically efficient** under correct model specification, producing tighter confidence intervals than two-step estimation. However, it relies more heavily on the bivariate normality assumption.

4. **PanelBox makes switching easy**: changing `method='two_step'` to `method='mle'` is all that is needed. Internally, MLE uses two-step estimates as starting values for reliable convergence.

5. **Parameter transformations** ($\sigma = e^{\alpha_\sigma}$, $\rho = \tanh(\alpha_\rho)$) ensure that the optimizer respects the constraints $\sigma > 0$ and $\rho \in (-1, 1)$.

6. **Comparing the two methods** is good practice. When estimates agree, we gain confidence in the results. When they disagree, we should investigate model specification and distributional assumptions.

7. **Predictions** come in two flavors: unconditional ($E[y^*]$) and conditional ($E[y|s=1]$). The difference is the selection correction $\rho\sigma\lambda$.
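
The reparameterization in point 5 can be sketched in a few lines. The helper names below are hypothetical and not part of PanelBox; they only illustrate how an unconstrained optimizer vector maps to valid values of $\sigma$ and $\rho$, and back.

```python
import numpy as np

def to_constrained(alpha_sigma, alpha_rho):
    """Map unconstrained optimizer parameters to sigma > 0 and rho in (-1, 1)."""
    return np.exp(alpha_sigma), np.tanh(alpha_rho)

def to_unconstrained(sigma, rho):
    """Inverse map, e.g. for turning two-step estimates into starting values."""
    return np.log(sigma), np.arctanh(rho)

# Any real-valued pair maps to a valid (sigma, rho)
for a_s, a_r in [(-3.0, -5.0), (0.0, 0.0), (2.0, 4.0)]:
    s, r = to_constrained(a_s, a_r)
    print(f"alpha_sigma={a_s:>5.1f}, alpha_rho={a_r:>5.1f} -> sigma={s:.4f}, rho={r:.4f}")

# Round trip: constrained -> unconstrained -> constrained
alpha_s, alpha_r = to_unconstrained(2.0, -0.5)
print(to_constrained(alpha_s, alpha_r))
```

Since the optimizer works on $\alpha_\sigma$ and $\alpha_\rho$, standard errors for $\sigma$ and $\rho$ themselves are typically obtained with the delta method.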

### 9.2 Key Formulas

| Component | Formula |
|-----------|--------|
| Selected contribution | $\log\phi\left(\frac{y_i - X'\beta}{\sigma}\right) - \log\sigma + \log\Phi(z^*)$ |
| $z^*$ | $\frac{Z'\gamma + \rho(y_i - X'\beta)/\sigma}{\sqrt{1-\rho^2}}$ |
| Non-selected contribution | $\log\Phi(-Z'\gamma)$ |
| Unconditional prediction | $E[y^*] = X'\beta$ |
| Conditional prediction | $E[y|s=1] = X'\beta + \rho\sigma\lambda(Z'\gamma)$ |
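
The formulas in the table translate almost line by line into code. The function below is a minimal NumPy/SciPy sketch written for this summary; it is not PanelBox's internal `_log_likelihood`, and the toy data, parameter values, and argument names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def heckman_loglik(beta, gamma, sigma, rho, y, X, Z, s):
    """Sum of the selected and non-selected contributions from the table above."""
    sel = s == 1
    zg = Z @ gamma
    # Selected: log phi(e) - log sigma + log Phi(z*), with e the standardized residual
    e = (y[sel] - X[sel] @ beta) / sigma
    z_star = (zg[sel] + rho * e) / np.sqrt(1.0 - rho**2)
    ll_sel = stats.norm.logpdf(e) - np.log(sigma) + stats.norm.logcdf(z_star)
    # Non-selected: log Phi(-Z'gamma)
    ll_non = stats.norm.logcdf(-zg[~sel])
    return ll_sel.sum() + ll_non.sum()

# Toy data: five observations, three of them selected (values are illustrative)
s_toy = np.array([1, 0, 1, 0, 1])
y_toy = np.array([3.0, np.nan, 5.5, np.nan, 2.0])   # outcome unobserved when s = 0
X_toy = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0], [1.0, -0.5]])
Z_toy = np.column_stack([X_toy, [1.0, -0.5, 0.3, -1.2, 0.8]])

beta_toy = np.array([1.0, 0.5])
gamma_toy = np.array([0.3, 0.8, -0.5])

ll_rho = heckman_loglik(beta_toy, gamma_toy, 2.0, -0.5, y_toy, X_toy, Z_toy, s_toy)
ll_zero = heckman_loglik(beta_toy, gamma_toy, 2.0, 0.0, y_toy, X_toy, Z_toy, s_toy)
print(f"log-likelihood with rho = -0.5: {ll_rho:.4f}")
print(f"log-likelihood with rho =  0.0: {ll_zero:.4f}")
```

With $\rho = 0$ the expression separates into a probit log-likelihood plus a normal regression log-likelihood on the selected sample, which is a convenient sanity check for any implementation.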

### 9.3 Practical Checklist

- [ ] Always start with two-step estimation as a baseline
- [ ] Check that MLE converges successfully
- [ ] Compare estimates across methods -- large discrepancies warrant investigation
- [ ] Verify that $\hat{\rho}$ is within a plausible range (estimates very close to $\pm 1$ often signal convergence or identification problems)
- [ ] Use conditional predictions when comparing to observed outcomes
- [ ] Report both methods in sensitivity analyses
- [ ] Use MLE-based AIC/BIC for model comparison when available
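
For the last checklist item, AIC and BIC follow directly from the maximized log-likelihood. The helper below is a generic sketch: the function name and the numbers passed to it are hypothetical, and the parameter count `k` must include every estimated parameter ($\beta$, $\gamma$, $\sigma$, and $\rho$).

```python
import numpy as np

def information_criteria(llf, k, n):
    """Return (AIC, BIC) for log-likelihood llf, k parameters, n observations."""
    aic = -2.0 * llf + 2.0 * k
    bic = -2.0 * llf + k * np.log(n)
    return aic, bic

# Hypothetical values: 4 outcome + 7 selection coefficients + sigma + rho = 13
aic_demo, bic_demo = information_criteria(llf=-1500.0, k=13, n=1000)
print(f"AIC = {aic_demo:.2f}, BIC = {bic_demo:.2f}")
```

With the objects from this notebook, the call would look something like `information_criteria(result_ml.llf, k, result_ml.n_total)` with `k = len(result_ml.outcome_params) + len(result_ml.probit_params) + 2`, assuming `llf` is populated. Lower values are better, and the criteria are only comparable across models fitted to the same sample.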

In [None]:
# Final summary table
print("=" * 70)
print("  FINAL RESULTS SUMMARY")
print("=" * 70)

summary_data = {
    'Metric': [
        'N (total)', 'N (selected)', 'Selection rate',
        'Return to education (beta_educ)',
        'Experience effect (beta_exp)',
        'sigma', 'rho', 'Mills ratio coef. (rho*sigma)',
        'Log-likelihood', 'Converged'
    ],
    'Two-Step': [
        result_2s.n_total, result_2s.n_selected,
        f"{result_2s.n_selected/result_2s.n_total:.2%}",
        f"{result_2s.outcome_params[1]:.4f}",
        f"{result_2s.outcome_params[2]:.4f}",
        f"{result_2s.sigma:.4f}",
        f"{result_2s.rho:.4f}",
        f"{result_2s.rho * result_2s.sigma:.4f}",
        'N/A',
        'N/A (always)'
    ],
    'MLE': [
        result_ml.n_total, result_ml.n_selected,
        f"{result_ml.n_selected/result_ml.n_total:.2%}",
        f"{result_ml.outcome_params[1]:.4f}",
        f"{result_ml.outcome_params[2]:.4f}",
        f"{result_ml.sigma:.4f}",
        f"{result_ml.rho:.4f}",
        f"{result_ml.rho * result_ml.sigma:.4f}",
        f"{result_ml.llf:.4f}" if result_ml.llf is not None else 'N/A',
        str(result_ml.converged)
    ]
}

summary_df = pd.DataFrame(summary_data).set_index('Metric')
print(summary_df.to_string())
print("=" * 70)

<a id='exercises'></a>
## 10. Exercises

Test your understanding with these exercises.

---

### Exercise 1: Log-Likelihood Computation (Easy)

**Task**: Manually compute the log-likelihood contribution for a single observation using the MLE estimates.

Pick the first selected observation in the dataset. Using the MLE estimates of $\beta$, $\gamma$, $\sigma$, and $\rho$, compute:

1. The standardized residual $(y_i - X_i'\beta)/\sigma$
2. The adjusted selection index $z_i^*$
3. The full log-likelihood contribution $\ell_i^{\text{sel}}$
4. For a non-selected observation, compute $\ell_i^{\text{non}}$

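**Hint**: Use `stats.norm.pdf()` and `stats.norm.cdf()` from scipy (or `stats.norm.logpdf()` and `stats.norm.logcdf()`, which are numerically more stable in the tails).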

In [None]:
# Exercise 1: Your solution here

# Step 1: Get MLE parameter estimates
beta_hat = result_ml.outcome_params
gamma_hat = result_ml.probit_params
sigma_hat = result_ml.sigma
rho_hat = result_ml.rho

# Step 2: Pick the first selected observation
first_selected_idx = np.where(selection == 1)[0][0]
y_i = wage[first_selected_idx]
X_i = X[first_selected_idx]
Z_i = Z[first_selected_idx]

# TODO: Compute the standardized residual
# residual = ...

# TODO: Compute z_i^*
# z_star = ...

# TODO: Compute the selected log-likelihood contribution
# ell_selected = ...

# Step 3: Pick the first non-selected observation
first_nonselected_idx = np.where(selection == 0)[0][0]
Z_j = Z[first_nonselected_idx]

# TODO: Compute the non-selected log-likelihood contribution
# ell_nonselected = ...

# Print results
# print(f"Selected observation: ell_i = {ell_selected:.6f}")
# print(f"Non-selected observation: ell_j = {ell_nonselected:.6f}")

---

### Exercise 2: Sensitivity to Exclusion Restrictions (Medium)

**Task**: Investigate how the choice of exclusion restrictions affects MLE estimates.

1. Estimate the model using only `husband_income` as the exclusion restriction: drop `children_lt6`, `children_6_18`, and `age` from Z, and do not add them to X.
2. Estimate the model using all four exclusion restrictions (the baseline specification).
3. Compare $\hat{\rho}$, $\hat{\sigma}$, and the outcome coefficients across specifications.
4. Discuss: How sensitive are the results to the choice of exclusion restrictions?

**Hint**: Build a new Z matrix with fewer columns.

In [None]:
# Exercise 2: Your solution here

# Specification 1: Minimal exclusion restriction (husband_income only)
# Z_minimal = sm.add_constant(
#     data[['education', 'experience', 'husband_income']].values
# )

# TODO: Estimate with minimal Z

# Specification 2: Full exclusion restrictions (baseline)
# Already estimated as result_ml

# TODO: Compare the results
# print("Comparison of exclusion restriction specifications:")

---

### Exercise 3: Convergence Diagnostics (Medium)

**Task**: Explore the sensitivity of MLE convergence to starting values.

1. Estimate the Heckman MLE model using the default warm start (from two-step).
2. Try perturbing the starting values by adding noise to the two-step estimates:
   - Small perturbation: multiply two-step params by `1 + 0.1 * np.random.randn()`
   - Large perturbation: multiply by `1 + 0.5 * np.random.randn()`
3. Do all three specifications converge to the same estimates?
4. Discuss: What does this tell us about the log-likelihood surface?

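**Note**: You will need to work with the internal `_log_likelihood` method and `scipy.optimize.minimize` directly. Because `minimize` minimizes its objective, first check whether `_log_likelihood` returns the log-likelihood or its negative, and negate it if necessary.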

In [None]:
# Exercise 3: Your solution here

from scipy.optimize import minimize

# Step 1: Get the default starting values (from two-step)
k_outcome = X.shape[1]
k_selection = Z.shape[1]

init_params_default = np.concatenate([
    result_2s.outcome_params,
    result_2s.probit_params,
    [np.log(result_2s.sigma)],
    [np.arctanh(np.clip(result_2s.rho, -0.99, 0.99))]
])

# TODO: Optimize with default start
# result_default = minimize(model_ml._log_likelihood, init_params_default, method='BFGS')

# TODO: Add small perturbation and re-optimize
# np.random.seed(123)
# init_params_small = init_params_default * (1 + 0.1 * np.random.randn(len(init_params_default)))
# result_small = minimize(model_ml._log_likelihood, init_params_small, method='BFGS')

# TODO: Add large perturbation and re-optimize
# init_params_large = init_params_default * (1 + 0.5 * np.random.randn(len(init_params_default)))
# result_large = minimize(model_ml._log_likelihood, init_params_large, method='BFGS')

# TODO: Compare final parameter values
# print("Convergence comparison:")

---

### Exercise 4: Monte Carlo Comparison (Hard)

**Task**: Conduct a small Monte Carlo simulation to compare the finite-sample properties of two-step and MLE.

1. Generate synthetic data from a known Heckman model:
   - True parameters: $\beta = [1, 0.5]$, $\gamma = [0.3, 0.8, -0.5]$, $\sigma = 2$, $\rho = -0.5$
   - $n = 500$ observations
2. Run 100 replications, estimating both two-step and MLE on each.
3. Compare:
   - Bias: $E[\hat{\theta}] - \theta_0$
   - RMSE: $\sqrt{E[(\hat{\theta} - \theta_0)^2]}$
   - Coverage of 95% confidence intervals (for MLE, using standard errors based on the Hessian of the log-likelihood)
4. Which estimator has lower RMSE? Is the MLE efficiency advantage apparent in this sample size?

**Hint**: Use `np.random.multivariate_normal` to generate correlated errors.
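
As one possible reading of the hint, the sketch below draws the errors and data for a single replication of the design in step 1. The regressor distributions and the variable names are assumptions (the exercise does not fix them), and it uses `np.random.default_rng`, whose `multivariate_normal` method takes the same mean and covariance arguments as `np.random.multivariate_normal`.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 500
beta_true = np.array([1.0, 0.5])
gamma_true = np.array([0.3, 0.8, -0.5])
sigma_true, rho_true = 2.0, -0.5

# Assumed design: one standard-normal regressor in X, plus one excluded variable in Z
x1 = rng.normal(size=n_obs)
z_excl = rng.normal(size=n_obs)
X_mc = np.column_stack([np.ones(n_obs), x1])
Z_mc = np.column_stack([np.ones(n_obs), x1, z_excl])

# Joint errors: Var(eps) = sigma^2, Var(u) = 1, Cov(eps, u) = rho * sigma
cov_mc = np.array([[sigma_true**2, rho_true * sigma_true],
                   [rho_true * sigma_true, 1.0]])
errors = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_mc, size=n_obs)
eps_mc, u_mc = errors[:, 0], errors[:, 1]

# Selection rule and outcome; the outcome is observed only for selected units
sel_mc = (Z_mc @ gamma_true + u_mc > 0).astype(int)
y_star = X_mc @ beta_true + eps_mc
wage_mc = np.where(sel_mc == 1, y_star, np.nan)

print(f"Selection rate: {sel_mc.mean():.2%}")
print(f"Sample corr(eps, u): {np.corrcoef(eps_mc, u_mc)[0, 1]:.3f}")
```

Wrapping this in a function of the replication index (or of a seed) keeps the Monte Carlo loop short and makes each replication reproducible.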

In [None]:
# Exercise 4: Your solution here

# True parameters
# beta_true = np.array([1.0, 0.5])
# gamma_true = np.array([0.3, 0.8, -0.5])
# sigma_true = 2.0
# rho_true = -0.5

# n_obs = 500
# n_reps = 100

# results_mc = []
# np.random.seed(42)

# for rep in range(n_reps):
#     # TODO: Generate X and Z
#     # TODO: Generate correlated errors
#     # TODO: Generate selection and outcome
#     # TODO: Estimate both methods
#     # TODO: Store estimates
#     pass

# TODO: Compute bias and RMSE
# TODO: Create comparison table

print("Monte Carlo exercise: implement the simulation above.")
print("This will reveal the finite-sample efficiency comparison.")

---

## References

### Essential Reading

1. **Heckman, J.J. (1979)**. "Sample Selection Bias as a Specification Error." *Econometrica*, 47(1), 153-161.

2. **Wooldridge, J.M. (2010)**. *Econometric Analysis of Cross Section and Panel Data* (2nd ed.). MIT Press. Chapter 19.

3. **Cameron, A.C. & Trivedi, P.K. (2005)**. *Microeconometrics: Methods and Applications*. Cambridge University Press. Chapter 16.

### Additional References

4. **Greene, W.H. (2018)**. *Econometric Analysis* (8th ed.). Pearson. Chapter 19 (Sample Selection).

5. **Mroz, T.A. (1987)**. "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions." *Econometrica*, 55(4), 765-799.

6. **Nawata, K. (1994)**. "Estimation of Sample Selection Bias Models by the Maximum Likelihood Estimator and Heckman's Two-Step Estimator." *Economics Letters*, 45(1), 33-40.

7. **Puhani, P.A. (2000)**. "The Heckman Correction for Sample Selection and Its Critique." *Journal of Economic Surveys*, 14(1), 53-68.

---

**Next notebook**: 06 - Advanced Topics in Selection Models (panel data extensions, correlated random effects)

**Thank you for completing this tutorial!**