# Complete Case Study: Health Expenditure and Labor Supply
## Integrating Tobit and Heckman Models with PanelBox

### Learning Objectives

1. Apply a complete censored data modeling workflow to a real research question
2. Systematically compare OLS, Pooled Tobit, and Random Effects Tobit
3. Compute and interpret McDonald-Moffitt marginal effects decomposition
4. Detect and correct sample selection bias with the Heckman two-step estimator
5. Conduct robustness and sensitivity analyses across multiple specifications
6. Produce publication-quality tables and figures summarizing results

### Duration

~90-120 minutes (advanced level)

### Prerequisites

- Notebooks 01-07 of this tutorial series (Tobit fundamentals through Heckman selection)
- Familiarity with maximum likelihood estimation and panel data concepts
- Working knowledge of NumPy, Pandas, and Matplotlib

### Notebook Structure

| Part | Topic | Duration |
|------|-------|----------|
| I | Case Study Overview & Data Loading | 10 min |
| II | Descriptive Analysis | 10 min |
| III | Naive OLS Analysis | 10 min |
| IV | Pooled Tobit | 10 min |
| V | Random Effects Tobit | 10 min |
| VI | Model Comparison | 10 min |
| VII | Marginal Effects (McDonald-Moffitt) | 15 min |
| VIII | Heckman Selection Analysis | 15 min |
| IX | Sensitivity Analysis | 10 min |
| X | Results Summary & Policy Implications | 10 min |
| -- | Exercises | 15 min |

In [None]:
# ============================================================
# Setup
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
import statsmodels.api as sm

from panelbox.models.censored import PooledTobit, RandomEffectsTobit
from panelbox.models.selection import PanelHeckman
from panelbox.marginal_effects.censored_me import compute_tobit_ame, compute_tobit_mem

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

np.random.seed(42)

BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
TABLES_DIR = OUTPUT_DIR / 'tables'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

import sys
sys.path.insert(0, str(BASE_DIR / 'utils'))
from comparison_tools import compare_tobit_ols

print('Setup complete!')

---

## Part I: Case Study Overview and Data Loading (10 min)

### Research Context

Health expenditure is a classic example of **censored data**: many individuals report zero
spending in a given period, either because they did not need medical care or because
they could not afford it. Simply discarding zeros or ignoring the censoring mechanism
biases regression estimates toward zero (**attenuation bias**).

This case study walks through the full analysis pipeline:

1. **OLS baseline** -- demonstrates the problem with ignoring censoring
2. **Pooled Tobit** -- corrects for censoring, but ignores individual heterogeneity
3. **Random Effects Tobit** -- accounts for both censoring *and* unobserved individual effects
4. **Marginal effects** -- translates Tobit coefficients into interpretable quantities
5. **Heckman selection** -- investigates whether *selecting into positive spending* itself introduces bias

### Research Questions

1. What are the key determinants of health expenditure?
2. How large is the attenuation bias from ignoring censoring?
3. Does accounting for individual heterogeneity change substantive conclusions?
4. Is there evidence of sample selection bias in observed spending?

In [None]:
# ============================================================
# Load the health expenditure panel
# ============================================================

df = pd.read_csv(DATA_DIR / 'health_expenditure_panel.csv')

print(f'Dataset shape: {df.shape}')
print(f'Individuals: {df["id"].nunique()}')
print(f'Time periods: {df["time"].nunique()} (t = {df["time"].min()} .. {df["time"].max()})')
print(f'\nColumns: {list(df.columns)}')
print(f'\nFirst 5 rows:')
df.head()

In [None]:
# ============================================================
# Quick variable overview
# ============================================================

variable_descriptions = {
    'id': 'Individual identifier',
    'time': 'Time period',
    'expenditure': 'Health expenditure (censored at 0)',
    'income': 'Household income (thousands)',
    'age': 'Age of individual (years)',
    'chronic': 'Number of chronic conditions',
    'insurance': 'Health insurance indicator (1 = insured)',
    'female': 'Female indicator (1 = female)',
    'bmi': 'Body mass index',
}

for var, desc in variable_descriptions.items():
    print(f'  {var:15s}  {desc}')

---

## Part II: Descriptive Analysis (10 min)

Before fitting any model, we need to understand the data thoroughly:

- What fraction of observations are censored at zero?
- How does spending vary across individuals and time?
- Are there clear patterns by insurance status or chronic conditions?

In [None]:
# ============================================================
# Table 01: Summary statistics
# ============================================================

summary_stats = df.describe().T
summary_stats['zeros'] = (df == 0).sum()
summary_stats['pct_zeros'] = (df == 0).mean() * 100
summary_stats['skewness'] = df.skew()

print('Table 01: Summary Statistics')
print('=' * 80)
display(summary_stats.round(3))

# Save
summary_stats.round(3).to_csv(TABLES_DIR / 'table01_summary_statistics.csv')

# Key censoring statistic
n_censored = (df['expenditure'] == 0).sum()
pct_censored = n_censored / len(df) * 100
print(f'\nCensored observations (expenditure = 0): {n_censored} ({pct_censored:.1f}%)')

In [None]:
# ============================================================
# Figure 01: Distribution of health expenditure
# ============================================================

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Panel A: Full histogram including zeros
axes[0].hist(df['expenditure'], bins=50, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].axvline(x=0, color='red', linestyle='--', linewidth=1.5, label='Censoring point')
axes[0].set_xlabel('Health Expenditure')
axes[0].set_ylabel('Frequency')
axes[0].set_title('A. Full Distribution (with zero pile-up)')
axes[0].legend()

# Annotate the zero pile
axes[0].annotate(
    f'{pct_censored:.1f}% at zero',
    xy=(0, n_censored * 0.6), xytext=(df['expenditure'].max() * 0.4, n_censored * 0.8),
    arrowprops=dict(arrowstyle='->', color='red'),
    fontsize=11, color='red', fontweight='bold'
)

# Panel B: Positive expenditure only
positive = df.loc[df['expenditure'] > 0, 'expenditure']
axes[1].hist(positive, bins=40, edgecolor='black', alpha=0.7, color='darkorange')
axes[1].set_xlabel('Health Expenditure')
axes[1].set_ylabel('Frequency')
axes[1].set_title('B. Positive Expenditure Only')

# Panel C: Log of positive expenditure
axes[2].hist(np.log1p(positive), bins=40, edgecolor='black', alpha=0.7, color='seagreen')
axes[2].set_xlabel('log(1 + Expenditure)')
axes[2].set_ylabel('Frequency')
axes[2].set_title('C. Log-Transformed (positive only)')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'fig01_expenditure_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print('*Figure 01: Distribution of health expenditure showing the characteristic '
      'zero pile-up (Panel A), the continuous positive mass (Panel B), '
      'and the approximately log-normal shape of positive expenditures (Panel C).*')

In [None]:
# ============================================================
# Figure 02: Panel structure -- spaghetti plot
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Select a random subset of individuals for readability
sample_ids = np.random.choice(df['id'].unique(), size=min(20, df['id'].nunique()), replace=False)
df_sample = df[df['id'].isin(sample_ids)]

# Panel A: Spaghetti plot of expenditure trajectories
for pid in sample_ids:
    person = df_sample[df_sample['id'] == pid]
    axes[0].plot(person['time'], person['expenditure'], marker='o', alpha=0.6, markersize=4)

axes[0].axhline(y=0, color='red', linestyle='--', linewidth=1, alpha=0.7)
axes[0].set_xlabel('Time Period')
axes[0].set_ylabel('Health Expenditure')
axes[0].set_title('A. Individual Expenditure Trajectories (n=20 sample)')

# Panel B: Fraction censored per period
censor_by_time = df['expenditure'].eq(0).groupby(df['time']).mean()
axes[1].bar(censor_by_time.index, censor_by_time.values, color='salmon', edgecolor='black')
axes[1].set_xlabel('Time Period')
axes[1].set_ylabel('Fraction Censored')
axes[1].set_title('B. Censoring Rate by Period')
axes[1].set_ylim(0, 1)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'fig02_panel_structure.png', dpi=150, bbox_inches='tight')
plt.show()

print('*Figure 02: Panel structure of the health expenditure data. Panel A shows individual '
      'expenditure trajectories over time with substantial within-individual variation and '
      'frequent censoring at zero. Panel B displays the censoring rate by time period.*')

In [None]:
# ============================================================
# Censoring patterns by covariates
# ============================================================

print('Censoring rate by subgroup:')
print('-' * 50)

# By insurance status
for ins_val in [0, 1]:
    mask = df['insurance'] == ins_val
    rate = (df.loc[mask, 'expenditure'] == 0).mean()
    label = 'Insured' if ins_val == 1 else 'Uninsured'
    print(f'  {label:20s}: {rate:.1%}')

print()

# By gender
for fem_val in [0, 1]:
    mask = df['female'] == fem_val
    rate = (df.loc[mask, 'expenditure'] == 0).mean()
    label = 'Female' if fem_val == 1 else 'Male'
    print(f'  {label:20s}: {rate:.1%}')

print()

# By chronic condition count (binned)
for lo, hi, label in [(0, 0, '0 chronic'), (1, 2, '1-2 chronic'), (3, 99, '3+ chronic')]:
    mask = (df['chronic'] >= lo) & (df['chronic'] <= hi)
    if mask.sum() > 0:
        rate = (df.loc[mask, 'expenditure'] == 0).mean()
        print(f'  {label:20s}: {rate:.1%}  (n={mask.sum()})')

---

## Part III: Naive OLS Analysis (10 min)

We begin with ordinary least squares as a **baseline**. This intentionally ignores the
censoring at zero and will produce **attenuated** (biased toward zero) coefficient estimates.
The purpose is to establish what happens when censoring is neglected.

**Key insight**: OLS treats zero-expenditure observations as informative about the
relationship between covariates and spending, when in fact those observations are
constrained by the lower bound, not by the covariates.
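
The attenuation can be seen in a small simulation before turning to the real data. The sketch below uses made-up parameters (`beta_true`, the noise scale, and the sample size are all assumptions chosen for illustration) and compares the OLS slope on the latent outcome with the OLS slope on the censored outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
beta_true = 2.0                      # illustrative slope, not an estimate from the case study

x = rng.normal(size=n)
y_star = 1.0 + beta_true * x + rng.normal(scale=2.0, size=n)   # latent outcome
y_cens = np.maximum(y_star, 0.0)                               # observed outcome, censored at 0

X_sim = np.column_stack([np.ones(n), x])
slope_latent = np.linalg.lstsq(X_sim, y_star, rcond=None)[0][1]
slope_censored = np.linalg.lstsq(X_sim, y_cens, rcond=None)[0][1]
frac_uncensored = (y_cens > 0).mean()

print(f'OLS slope on latent y*:   {slope_latent:.3f}')
print(f'OLS slope on censored y:  {slope_censored:.3f}')
print(f'Share uncensored:         {frac_uncensored:.3f}')
print(f'beta * share uncensored:  {beta_true * frac_uncensored:.3f}')
```

With a normally distributed regressor, the censored-data slope should land close to the true slope multiplied by the uncensored share, which is why heavier censoring tends to produce stronger attenuation.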

In [None]:
# ============================================================
# Prepare model matrices
# ============================================================

# Variable names for our models
depvar = 'expenditure'
covariates = ['income', 'age', 'chronic', 'insurance', 'female', 'bmi']

y = df[depvar].values
X_raw = df[covariates].values

# Add constant for OLS and Tobit
X = sm.add_constant(X_raw)
var_names = ['const'] + covariates

groups = df['id'].values

print(f'Dependent variable: {depvar}')
print(f'Covariates: {covariates}')
print(f'Design matrix shape: {X.shape}')
print(f'Number of groups: {len(np.unique(groups))}')

In [None]:
# ============================================================
# OLS estimation
# ============================================================

ols_model = sm.OLS(y, X)
ols_result = ols_model.fit(cov_type='cluster', cov_kwds={'groups': groups})

print('OLS Results (cluster-robust standard errors)')
print('=' * 60)

# Custom summary table
ols_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': ols_result.params,
    'Std. Error': ols_result.bse,
    't-stat': ols_result.tvalues,
    'p-value': ols_result.pvalues,
}).set_index('Variable')

display(ols_table.round(4))

print(f'\nR-squared:     {ols_result.rsquared:.4f}')
print(f'Observations:  {int(ols_result.nobs)}')

**Interpretation (OLS)**:

The OLS coefficients give a rough sense of the determinants of expenditure,
but they are attenuated (biased toward zero): OLS treats every zero as an ordinary
point on the regression line, rather than recognizing that spending
is *censored* at the lower bound.

We will now compare these to Tobit estimates to quantify the attenuation bias.

---

## Part IV: Pooled Tobit (10 min)

The Pooled Tobit model explicitly accounts for left-censoring at zero. It models a
**latent** variable $y^*_{it} = X_{it}'\beta + \varepsilon_{it}$ and assumes we
observe $y_{it} = \max(0, y^*_{it})$.

This corrects the attenuation bias but still ignores the panel structure: every
observation is treated as independent, as if each came from a different individual.
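
To make the likelihood concrete, here is a minimal sketch of a Tobit maximum likelihood fit written directly with SciPy. It is not PanelBox's implementation; the function `tobit_negloglik`, the simulated data, and the log-sigma parameterization are assumptions made for this illustration. Censored observations contribute $\log \Phi((c - X'\beta)/\sigma)$ and uncensored observations contribute the normal log-density.

```python
import numpy as np
from scipy import optimize, stats

def tobit_negloglik(params, y, X, c=0.0):
    """Average negative log-likelihood of a Tobit model left-censored at c."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)            # keeps sigma positive during optimization
    xb = X @ beta
    ll = np.where(
        y <= c,
        stats.norm.logcdf((c - xb) / sigma),               # P(y* <= c)
        stats.norm.logpdf((y - xb) / sigma) - log_sigma,   # density of y* at y
    )
    return -ll.mean()

# Simulated data with known parameters (illustrative only)
rng = np.random.default_rng(1)
n = 3000
X_sim = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_sim = np.array([1.0, 2.0])
y_sim = np.maximum(X_sim @ beta_sim + rng.normal(scale=2.0, size=n), 0.0)

# Start from OLS coefficients and sigma = 1
start = np.append(np.linalg.lstsq(X_sim, y_sim, rcond=None)[0], 0.0)
fit = optimize.minimize(tobit_negloglik, start, args=(y_sim, X_sim), method='BFGS')

beta_hat, sigma_hat = fit.x[:-1], np.exp(fit.x[-1])
print('beta_hat :', beta_hat.round(3))
print('sigma_hat:', round(float(sigma_hat), 3))
```

The estimates should recover the simulated slope and scale up to sampling error, whereas the OLS starting values are attenuated.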

In [None]:
# ============================================================
# Pooled Tobit estimation
# ============================================================

tobit_pooled = PooledTobit(
    endog=y,
    exog=X,
    groups=groups,
    censoring_point=0.0,
)
tobit_pooled.fit()

print(tobit_pooled.summary())

In [None]:
# ============================================================
# Store Pooled Tobit results in a DataFrame
# ============================================================

n_beta = len(var_names)

tobit_pooled_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': tobit_pooled.beta,
    'Std. Error': tobit_pooled.bse[:n_beta],
    't-stat': tobit_pooled.beta / tobit_pooled.bse[:n_beta],
    'p-value': 2 * stats.norm.sf(np.abs(tobit_pooled.beta / tobit_pooled.bse[:n_beta])),
}).set_index('Variable')

display(tobit_pooled_table.round(4))

print(f'\nsigma:          {tobit_pooled.sigma:.4f}')
print(f'Log-likelihood: {tobit_pooled.llf:.2f}')
print(f'Observations:   {tobit_pooled.n_obs}')

**Interpretation (Pooled Tobit)**:

Notice that the Tobit coefficients are **larger in absolute value** than OLS.
This is the classic pattern: OLS attenuates coefficients when censoring is present,
and the Tobit model corrects for this.

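However, the Pooled Tobit still ignores individual-specific unobserved heterogeneity.
Repeated observations on the same person are correlated, so the pooled likelihood is
inefficient, and the estimates are biased if the heterogeneity is correlated with the regressors.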

---

## Part V: Random Effects Tobit (10 min)

The Random Effects Tobit adds an individual-specific random effect $\alpha_i \sim N(0, \sigma^2_\alpha)$
to the latent equation:

$$y^*_{it} = X_{it}'\beta + \alpha_i + \varepsilon_{it}$$

The likelihood is integrated over the distribution of $\alpha_i$ using Gauss-Hermite
quadrature. This accounts for unobserved individual heterogeneity under the assumption
that $\alpha_i$ is uncorrelated with $X_{it}$.
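
The quadrature step can be sketched for a single individual. The code below is an illustration rather than PanelBox's internal routine: the helper names, the toy data for one person, and the variance values are all assumptions. It substitutes $\alpha = \sqrt{2}\,\sigma_\alpha x_k$ at the Gauss-Hermite nodes $x_k$ and compares the result with brute-force numerical integration.

```python
import numpy as np
from scipy import integrate, stats

def tobit_contrib(y, mu, sigma, c=0.0):
    """Per-observation Tobit likelihood: P(y* <= c) if censored, else the normal density."""
    return np.where(
        y <= c,
        stats.norm.cdf((c - mu) / sigma),
        stats.norm.pdf((y - mu) / sigma) / sigma,
    )

def re_tobit_lik_i(y_i, xb_i, sigma_eps, sigma_alpha, n_points=12):
    """Likelihood of one individual's history, integrating alpha_i out by Gauss-Hermite."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    alphas = np.sqrt(2.0) * sigma_alpha * nodes              # change of variables
    # Rows index quadrature nodes, columns index time periods
    contrib = tobit_contrib(y_i[None, :], xb_i[None, :] + alphas[:, None], sigma_eps)
    return (weights * contrib.prod(axis=1)).sum() / np.sqrt(np.pi)

# Toy history for one individual (hypothetical values)
y_i = np.array([0.0, 1.3, 0.0, 2.1])
xb_i = np.array([0.2, 0.5, -0.1, 0.8])
sigma_eps_demo, sigma_alpha_demo = 1.0, 0.5

approx = re_tobit_lik_i(y_i, xb_i, sigma_eps_demo, sigma_alpha_demo)

def integrand(a):
    return tobit_contrib(y_i, xb_i + a, sigma_eps_demo).prod() * stats.norm.pdf(a, scale=sigma_alpha_demo)

exact, _ = integrate.quad(integrand, -np.inf, np.inf)
print(f'Gauss-Hermite (12 points): {approx:.6f}')
print(f'Numerical integration:     {exact:.6f}')
```

Non-adaptive quadrature of this kind tends to lose accuracy when panels are long or $\sigma_\alpha$ is large, so it is worth re-estimating with more quadrature points to check that the results are stable.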

In [None]:
# ============================================================
# Random Effects Tobit estimation
# ============================================================

tobit_re = RandomEffectsTobit(
    endog=y,
    exog=X,
    groups=groups,
    censoring_point=0.0,
    quadrature_points=12,
)
tobit_re.fit(method='BFGS', maxiter=2000, options={'disp': False})

print(tobit_re.summary())

In [None]:
# ============================================================
# Store RE Tobit results in a DataFrame
# ============================================================

re_tobit_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': tobit_re.beta,
    'Std. Error': tobit_re.bse[:n_beta],
    't-stat': tobit_re.beta / tobit_re.bse[:n_beta],
    'p-value': 2 * stats.norm.sf(np.abs(tobit_re.beta / tobit_re.bse[:n_beta])),
}).set_index('Variable')

display(re_tobit_table.round(4))

print(f'\nsigma_eps:   {tobit_re.sigma_eps:.4f}')
print(f'sigma_alpha: {tobit_re.sigma_alpha:.4f}')
print(f'rho (ICC):   {tobit_re.sigma_alpha**2 / (tobit_re.sigma_alpha**2 + tobit_re.sigma_eps**2):.4f}')
print(f'Log-lik:     {tobit_re.llf:.2f}')

**Interpretation (RE Tobit)**:

The **intra-class correlation (ICC)** $\rho = \sigma^2_\alpha / (\sigma^2_\alpha + \sigma^2_\varepsilon)$
measures the fraction of total latent variance due to individual heterogeneity.
A large ICC indicates that unobserved individual characteristics account for much of
the latent variance, which supports using a panel model rather than the pooled specification.
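
Equivalently, the ICC is the correlation between the composite errors $\alpha_i + \varepsilon_{it}$ of two different periods for the same individual. The short simulation below checks this with assumed values for the two standard deviations (they are illustrative, not estimates from the model above).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_alpha_demo, sigma_eps_demo = 1.5, 2.0       # assumed values for illustration
n_individuals = 20000

icc = sigma_alpha_demo**2 / (sigma_alpha_demo**2 + sigma_eps_demo**2)

alpha = rng.normal(scale=sigma_alpha_demo, size=n_individuals)
err_t1 = alpha + rng.normal(scale=sigma_eps_demo, size=n_individuals)   # composite error, period 1
err_t2 = alpha + rng.normal(scale=sigma_eps_demo, size=n_individuals)   # composite error, period 2

empirical_corr = np.corrcoef(err_t1, err_t2)[0, 1]
print(f'ICC from formula:         {icc:.3f}')
print(f'Within-person correlation: {empirical_corr:.3f}')
```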

---

## Part VI: Model Comparison (10 min)

We now compare the three models side by side to understand how each estimation
choice affects the substantive conclusions.

In [None]:
# ============================================================
# Table 02: Full model comparison
# ============================================================

comparison_rows = []

for i, var in enumerate(var_names):
    row = {
        'Variable': var,
        'OLS_Coef': ols_result.params[i],
        'OLS_SE': ols_result.bse[i],
        'PooledTobit_Coef': tobit_pooled.beta[i],
        'PooledTobit_SE': tobit_pooled.bse[i],
        'RE_Tobit_Coef': tobit_re.beta[i],
        'RE_Tobit_SE': tobit_re.bse[i],
    }
    comparison_rows.append(row)

comparison_df = pd.DataFrame(comparison_rows).set_index('Variable')

# Add model-level statistics
model_stats = pd.DataFrame({
    'OLS_Coef': [ols_result.rsquared, np.nan, np.nan, ols_result.nobs],
    'OLS_SE': [np.nan] * 4,
    'PooledTobit_Coef': [np.nan, tobit_pooled.sigma, tobit_pooled.llf, tobit_pooled.n_obs],
    'PooledTobit_SE': [np.nan] * 4,
    'RE_Tobit_Coef': [np.nan, tobit_re.sigma_eps, tobit_re.llf, tobit_re.n_obs],
    'RE_Tobit_SE': [np.nan] * 4,
}, index=['R2 / sigma_alpha', 'sigma', 'Log-Likelihood', 'N'])

# Add sigma_alpha for RE model
model_stats.loc['R2 / sigma_alpha', 'RE_Tobit_Coef'] = tobit_re.sigma_alpha

full_comparison = pd.concat([comparison_df, model_stats])

print('Table 02: Model Comparison -- OLS vs Pooled Tobit vs RE Tobit')
print('=' * 90)
display(full_comparison.round(4))

# Save
full_comparison.round(4).to_csv(TABLES_DIR / 'table02_model_comparison.csv')

In [None]:
# ============================================================
# Figure 03: Coefficient comparison forest plot
# ============================================================

# Exclude the constant for visual clarity
plot_vars = covariates
idx = np.arange(len(plot_vars))
width = 0.25

fig, ax = plt.subplots(figsize=(12, 7))

ols_coefs = [ols_result.params[var_names.index(v)] for v in plot_vars]
ols_ses = [1.96 * ols_result.bse[var_names.index(v)] for v in plot_vars]

pt_coefs = [tobit_pooled.beta[var_names.index(v)] for v in plot_vars]
pt_ses = [1.96 * tobit_pooled.bse[var_names.index(v)] for v in plot_vars]

re_coefs = [tobit_re.beta[var_names.index(v)] for v in plot_vars]
re_ses = [1.96 * tobit_re.bse[var_names.index(v)] for v in plot_vars]

ax.barh(idx + width, ols_coefs, width, xerr=ols_ses, label='OLS',
        color='steelblue', alpha=0.8, capsize=3)
ax.barh(idx, pt_coefs, width, xerr=pt_ses, label='Pooled Tobit',
        color='darkorange', alpha=0.8, capsize=3)
ax.barh(idx - width, re_coefs, width, xerr=re_ses, label='RE Tobit',
        color='seagreen', alpha=0.8, capsize=3)

ax.set_yticks(idx)
ax.set_yticklabels(plot_vars, fontsize=12)
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
ax.set_xlabel('Coefficient Estimate', fontsize=12)
ax.set_title('Coefficient Comparison: OLS vs Pooled Tobit vs RE Tobit', fontsize=14)
ax.legend(fontsize=11, loc='lower right')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'fig03_coefficient_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print('*Figure 03: Forest plot comparing coefficient estimates across OLS, Pooled Tobit, '
      'and Random Effects Tobit. Error bars represent 95% confidence intervals. '
      'Notice the systematic attenuation of OLS coefficients relative to the Tobit models.*')

In [None]:
# ============================================================
# Attenuation ratios: Tobit / OLS
# ============================================================

print('Attenuation Ratios (Tobit / OLS):')
print('-' * 50)
print(f'{"Variable":15s} {"Pooled Tobit/OLS":>18s} {"RE Tobit/OLS":>15s}')
print('-' * 50)

for var in covariates:
    i = var_names.index(var)
    ols_c = ols_result.params[i]
    pt_c = tobit_pooled.beta[i]
    re_c = tobit_re.beta[i]
    ratio_pt = pt_c / ols_c if abs(ols_c) > 1e-6 else np.nan
    ratio_re = re_c / ols_c if abs(ols_c) > 1e-6 else np.nan
    print(f'{var:15s} {ratio_pt:18.3f} {ratio_re:15.3f}')

print('\nRatios > 1 indicate OLS was attenuated (biased toward zero).')

---

## Part VII: Marginal Effects -- McDonald-Moffitt Decomposition (15 min)

In the Tobit model, the raw coefficients $\beta$ are the marginal effects on the
**latent** variable $y^*$. To understand the effect on the **observed** variable $y$,
we compute three marginal effects, linked by the McDonald and Moffitt (1980) decomposition
(the unconditional effect combines the conditional and probability effects):

1. **Unconditional**: $\frac{\partial E[y|X]}{\partial x_k} = \beta_k \cdot \Phi(z)$
   -- Effect on overall expected value, accounting for censoring probability

2. **Conditional**: $\frac{\partial E[y|y>0, X]}{\partial x_k} = \beta_k \cdot [1 - \lambda(z)(z + \lambda(z))]$
   -- Effect among those with positive expenditure

3. **Probability**: $\frac{\partial P(y>0|X)}{\partial x_k} = \frac{\beta_k}{\sigma} \cdot \phi(z)$
   -- Effect on the probability of having positive expenditure

where $z = (X'\beta - c)/\sigma$, $c$ is the censoring point (here $c = 0$), and $\lambda(z) = \phi(z)/\Phi(z)$ is the inverse Mills ratio.
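
The three formulas can be checked numerically against finite differences of the corresponding expectations. The parameter values below are arbitrary illustrations (not estimates), and the helper functions are defined only for this check. The last lines verify the decomposition identity: the unconditional effect equals $\Phi(z)$ times the conditional effect plus $E[y|y>c, X] - c$ times the probability effect.

```python
import numpy as np
from scipy import stats

beta_k, sigma, c = 1.5, 2.0, 0.0      # illustrative values
other_terms = 0.7                      # contribution of the remaining regressors to X'beta
x_k = 0.4

def index(x):
    return other_terms + beta_k * x

def prob_uncensored(x):                # P(y > c | X)
    return stats.norm.cdf((index(x) - c) / sigma)

def cond_mean(x):                      # E[y | y > c, X]
    z = (index(x) - c) / sigma
    return index(x) + sigma * stats.norm.pdf(z) / stats.norm.cdf(z)

def uncond_mean(x):                    # E[y | X]
    p = prob_uncensored(x)
    return p * cond_mean(x) + (1 - p) * c

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

z = (index(x_k) - c) / sigma
lam = stats.norm.pdf(z) / stats.norm.cdf(z)

me_unconditional = beta_k * stats.norm.cdf(z)
me_conditional = beta_k * (1 - lam * (z + lam))
me_probability = beta_k / sigma * stats.norm.pdf(z)

print('unconditional:', me_unconditional, central_diff(uncond_mean, x_k))
print('conditional:  ', me_conditional, central_diff(cond_mean, x_k))
print('probability:  ', me_probability, central_diff(prob_uncensored, x_k))

# McDonald-Moffitt identity
decomposition = prob_uncensored(x_k) * me_conditional + (cond_mean(x_k) - c) * me_probability
print('decomposition:', decomposition)
```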

We compute both **Average Marginal Effects (AME)** and **Marginal Effects at Means (MEM)**.
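
The difference between the two summaries is where the scaling factor is evaluated: the AME averages $\Phi(z_i)$ over all observations, while the MEM evaluates $\Phi$ once at the covariate means. A minimal sketch for the unconditional effect, using made-up coefficients and a simulated design matrix (both are assumptions, and every regressor is treated as continuous):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_demo = 500
X_demo = np.column_stack([
    np.ones(n_demo),
    rng.normal(size=n_demo),
    rng.normal(loc=1.0, scale=2.0, size=n_demo),
])
beta_demo = np.array([0.5, 1.5, -1.0])    # illustrative coefficients
sigma_demo = 2.0                           # illustrative scale; censoring point is 0

# AME: average the observation-level scaling factor Phi(z_i)
z_i = X_demo @ beta_demo / sigma_demo
ame_scale = stats.norm.cdf(z_i).mean()
ame = beta_demo[1:] * ame_scale

# MEM: evaluate the scaling factor once, at the covariate means
z_bar = X_demo.mean(axis=0) @ beta_demo / sigma_demo
mem_scale = stats.norm.cdf(z_bar)
mem = beta_demo[1:] * mem_scale

print('AME scaling factor, mean of Phi(z_i):', round(float(ame_scale), 4))
print('MEM scaling factor, Phi(z_bar):      ', round(float(mem_scale), 4))
print('AME:', ame.round(4))
print('MEM:', mem.round(4))
```

Because $\Phi$ is nonlinear, the two scaling factors generally differ, and the gap grows with the dispersion of the index $X'\beta$ across observations.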

In [None]:
# ============================================================
# Assign variable names to the model for cleaner output
# ============================================================

tobit_pooled.exog_names = var_names

# ============================================================
# Compute AME for all three types
# ============================================================

ame_unconditional = tobit_pooled.marginal_effects(at='overall', which='unconditional')
ame_conditional = tobit_pooled.marginal_effects(at='overall', which='conditional')
ame_probability = tobit_pooled.marginal_effects(at='overall', which='probability')

print('Average Marginal Effects -- Unconditional E[y|X]:')
display(ame_unconditional.summary().round(4))

print('\nAverage Marginal Effects -- Conditional E[y|y>0, X]:')
display(ame_conditional.summary().round(4))

print('\nAverage Marginal Effects -- Probability P(y>0|X):')
display(ame_probability.summary().round(4))

In [None]:
# ============================================================
# Compute MEM for comparison
# ============================================================

mem_unconditional = tobit_pooled.marginal_effects(at='mean', which='unconditional')
mem_conditional = tobit_pooled.marginal_effects(at='mean', which='conditional')
mem_probability = tobit_pooled.marginal_effects(at='mean', which='probability')

print('Marginal Effects at Means -- Unconditional E[y|X_bar]:')
display(mem_unconditional.summary().round(4))

In [None]:
# ============================================================
# Table 03: Combined marginal effects table
# ============================================================

me_table = pd.DataFrame({
    'beta (latent)': pd.Series({v: tobit_pooled.beta[var_names.index(v)] for v in covariates}),
    'AME unconditional': ame_unconditional.marginal_effects[covariates],
    'AME conditional': ame_conditional.marginal_effects[covariates],
    'AME probability': ame_probability.marginal_effects[covariates],
    'MEM unconditional': mem_unconditional.marginal_effects[covariates],
})

print('Table 03: Marginal Effects Decomposition (Pooled Tobit)')
print('=' * 90)
display(me_table.round(4))

# Save
me_table.round(4).to_csv(TABLES_DIR / 'table03_marginal_effects.csv')

In [None]:
# ============================================================
# Figure 04: Marginal effects comparison bar chart
# ============================================================

fig, ax = plt.subplots(figsize=(14, 7))

idx = np.arange(len(covariates))
width = 0.2

beta_vals = [tobit_pooled.beta[var_names.index(v)] for v in covariates]
ame_uncond_vals = [ame_unconditional.marginal_effects[v] for v in covariates]
ame_cond_vals = [ame_conditional.marginal_effects[v] for v in covariates]
ame_prob_vals = [ame_probability.marginal_effects[v] for v in covariates]

bars1 = ax.bar(idx - 1.5*width, beta_vals, width, label=r'$\beta$ (latent)', color='navy', alpha=0.8)
bars2 = ax.bar(idx - 0.5*width, ame_uncond_vals, width, label='AME (unconditional)', color='steelblue', alpha=0.8)
bars3 = ax.bar(idx + 0.5*width, ame_cond_vals, width, label='AME (conditional)', color='darkorange', alpha=0.8)
bars4 = ax.bar(idx + 1.5*width, ame_prob_vals, width, label='AME (probability)', color='seagreen', alpha=0.8)

ax.set_xticks(idx)
ax.set_xticklabels(covariates, fontsize=12, rotation=15)
ax.axhline(y=0, color='black', linewidth=0.8)
ax.set_ylabel('Effect Size', fontsize=12)
ax.set_title('McDonald-Moffitt Decomposition of Marginal Effects', fontsize=14)
ax.legend(fontsize=10, loc='best')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'fig04_marginal_effects.png', dpi=150, bbox_inches='tight')
plt.show()

print('*Figure 04: McDonald-Moffitt decomposition comparing raw Tobit coefficients (latent) '
      'with unconditional, conditional, and probability marginal effects. The unconditional '
      'AME is always smaller in absolute value than the raw coefficient because it is scaled '
      'by the probability of being uncensored.*')

In [None]:
# ============================================================
# Scaling factor decomposition
# ============================================================

print('Scaling Factors (AME / beta):')
print('-' * 60)
print(f'{"Variable":15s} {"Unconditional":>15s} {"Conditional":>15s} {"Probability":>15s}')
print('-' * 60)

for var in covariates:
    i = var_names.index(var)
    b = tobit_pooled.beta[i]
    if abs(b) > 1e-8:
        s_unc = ame_unconditional.marginal_effects[var] / b
        s_con = ame_conditional.marginal_effects[var] / b
        s_prb = ame_probability.marginal_effects[var] / b
        print(f'{var:15s} {s_unc:15.4f} {s_con:15.4f} {s_prb:15.4f}')

print('\nNote: For AMEs, the unconditional scaling factor equals the sample mean of '
      'Phi(z_i), the average probability of being uncensored.')

---

## Part VIII: Heckman Selection Analysis (15 min)

The Tobit model assumes that the same process determines **whether** spending occurs
and **how much** is spent. The Heckman selection model relaxes this by specifying
**two separate equations**:

1. **Selection equation**: $s_{it} = \mathbf{1}[Z_{it}'\gamma + u_{it} > 0]$
   -- Determines participation (e.g., labor force participation)

2. **Outcome equation**: $y_{it} = X_{it}'\beta + \varepsilon_{it}$ if $s_{it} = 1$
   -- Determines the level (e.g., wages, observed only for workers)

If $\text{Corr}(u, \varepsilon) = \rho \neq 0$, OLS on the selected sample is biased.

We use the **Mroz (1987)** dataset -- a classic sample of married women's labor supply -- to
demonstrate this selection correction workflow.

In [None]:
# ============================================================
# Load Mroz (1987) data for Heckman analysis
# ============================================================

mroz = pd.read_csv(DATA_DIR / 'mroz_1987.csv')

print(f'Mroz dataset shape: {mroz.shape}')
print(f'\nVariables: {list(mroz.columns)}')
print(f'\nLabor force participation rate: {mroz["lfp"].mean():.1%}')
print(f'Observations with wages:         {mroz["lfp"].sum()} / {len(mroz)}')

mroz.head()

In [None]:
# ============================================================
# Prepare Heckman model matrices
# ============================================================

# Selection indicator
selection = mroz['lfp'].values.astype(int)

# Outcome variable: the `wage` column as stored (observed only for workers);
# no log transform is applied in this cell
# Fill missing wages with 0 (PanelHeckman uses the selection indicator to handle this)
wage = mroz['wage'].fillna(0).values

# Outcome equation regressors: education, experience, experience^2
outcome_vars = ['education', 'experience', 'experience_sq']
X_outcome = sm.add_constant(mroz[outcome_vars].values)
outcome_names = ['const'] + outcome_vars

# Selection equation regressors: same as outcome + exclusion restrictions
# Exclusion restrictions: age, children_lt6, children_6_18, husband_income
# These are assumed to affect LFP but not wages directly (conditional on working)
selection_vars = ['education', 'experience', 'experience_sq', 'age',
                  'children_lt6', 'children_6_18', 'husband_income']
Z_selection = sm.add_constant(mroz[selection_vars].values)
selection_names = ['const'] + selection_vars

print(f'Outcome equation variables:   {outcome_names}')
print(f'Selection equation variables:  {selection_names}')
print(f'Exclusion restrictions:        age, children_lt6, children_6_18, husband_income')

In [None]:
# ============================================================
# Heckman two-step estimation
# ============================================================

heckman_model = PanelHeckman(
    endog=wage,
    exog=X_outcome,
    selection=selection,
    exog_selection=Z_selection,
    method='two_step',
)

heckman_result = heckman_model.fit()

print(heckman_result.summary())

In [None]:
# ============================================================
# Selection bias test
# ============================================================

sel_test = heckman_result.selection_test()

print('Test for Selection Bias (H0: rho = 0):')
print('-' * 50)
for key, val in sel_test.items():
    if isinstance(val, float):
        print(f'  {key:20s}: {val:.4f}')
    else:
        print(f'  {key:20s}: {val}')

In [None]:
# ============================================================
# Compare OLS on selected sample vs Heckman-corrected estimates
# ============================================================

comparison = heckman_result.compare_ols_heckman()

heckman_comparison_df = pd.DataFrame({
    'Variable': outcome_names,
    'OLS (selected)': comparison['beta_ols'],
    'Heckman': comparison['beta_heckman'],
    'Difference': comparison['difference'],
    'Pct Diff (%)': comparison['pct_difference'],
}).set_index('Variable')

print('OLS (on selected sample) vs Heckman Two-Step:')
print('=' * 70)
display(heckman_comparison_df.round(4))

print(f'\n{comparison["interpretation"]}')
print(f'\nrho (selection correlation): {heckman_result.rho:.4f}')
print(f'sigma:                       {heckman_result.sigma:.4f}')

In [None]:
# ============================================================
# Figure 05: IMR diagnostics for Heckman model
# ============================================================

fig = heckman_result.plot_imr(figsize=(14, 5))

plt.savefig(FIGURES_DIR / 'fig05_imr_diagnostics.png', dpi=150, bbox_inches='tight')
plt.show()

print('*Figure 05: Inverse Mills Ratio (IMR) diagnostics for the Heckman two-step '
      'estimator. The left panel shows IMR versus the predicted selection probability, '
      'while the right panel shows the distribution of IMR values among the selected '
      'sample. Observations with very high IMR (above the red threshold) experience '
      'strong selection effects.*')

In [None]:
# ============================================================
# IMR summary diagnostics
# ============================================================

imr_diag = heckman_result.imr_diagnostics()

print('IMR Diagnostics:')
print('-' * 40)
for key, val in imr_diag.items():
    if isinstance(val, float):
        print(f'  {key:25s}: {val:.4f}')
    else:
        print(f'  {key:25s}: {val}')

---

## Part IX: Sensitivity Analysis (10 min)

Robust conclusions require checking that our results are not sensitive to particular
modeling choices. We examine:

1. **Subsample stability**: Do results change when restricting to particular groups?
2. **Covariate sensitivity**: Are key results robust to adding or dropping variables?
3. **Heckman vs Tobit framing**: Does the Heckman selection model on the health data
   yield different conclusions from the Tobit model?

In [None]:
# ============================================================
# Sensitivity 1: Pooled Tobit on subsamples
# ============================================================

subsamples = {
    'Full sample': df,
    'Males only': df[df['female'] == 0],
    'Females only': df[df['female'] == 1],
    'Insured only': df[df['insurance'] == 1],
    'Uninsured only': df[df['insurance'] == 0],
}

subsample_results = {}

for label, sub_df in subsamples.items():
    y_sub = sub_df[depvar].values
    X_sub = sm.add_constant(sub_df[covariates].values)
    g_sub = sub_df['id'].values
    
    try:
        model_sub = PooledTobit(endog=y_sub, exog=X_sub, groups=g_sub, censoring_point=0.0)
        model_sub.fit()
        subsample_results[label] = {
            'n': len(y_sub),
            'pct_censored': (y_sub == 0).mean() * 100,
            'beta': model_sub.beta.copy(),
            'sigma': model_sub.sigma,
            'llf': model_sub.llf,
        }
    except Exception as e:
        subsample_results[label] = {'error': str(e)}

# Display results
print('Subsample Sensitivity Analysis (Pooled Tobit):')
print('=' * 100)

header = f'{"Subsample":20s} {"N":>6s} {"% Cens":>7s}'
for var in covariates:
    header += f' {var:>10s}'
header += f' {"sigma":>8s}'
print(header)
print('-' * 100)

for label, res in subsample_results.items():
    if 'error' in res:
        print(f'{label:20s}  ERROR: {res["error"]}')
    else:
        line = f'{label:20s} {res["n"]:>6d} {res["pct_censored"]:>6.1f}%'
        for i, var in enumerate(covariates):
            # beta index is i+1 because of the constant at index 0
            line += f' {res["beta"][i+1]:>10.4f}'
        line += f' {res["sigma"]:>8.4f}'
        print(line)

In [None]:
# ============================================================
# Sensitivity 2: Alternative covariate specifications
# ============================================================

specifications = [
    ('Baseline', covariates),
    ('Parsimonious', ['income', 'chronic', 'insurance']),
    ('Demographics only', ['income', 'age', 'female', 'bmi']),
    ('With interaction', covariates),  # We will manually add an interaction
]

spec_results = {}

for label, vars_spec in specifications:
    if label == 'With interaction':
        # Add income * insurance interaction
        X_spec_raw = df[vars_spec].values.copy()
        interaction = (df['income'] * df['insurance']).values.reshape(-1, 1)
        X_spec = sm.add_constant(np.column_stack([X_spec_raw, interaction]))
        spec_varnames = ['const'] + vars_spec + ['income_x_insurance']
    else:
        X_spec = sm.add_constant(df[vars_spec].values)
        spec_varnames = ['const'] + vars_spec
    
    try:
        model_spec = PooledTobit(endog=y, exog=X_spec, groups=groups, censoring_point=0.0)
        model_spec.fit()
        spec_results[label] = {
            'vars': spec_varnames,
            'beta': model_spec.beta,
            'se': model_spec.bse[:len(spec_varnames)],
            'sigma': model_spec.sigma,
            'llf': model_spec.llf,
            'n_params': len(spec_varnames),
        }
    except Exception as e:
        spec_results[label] = {'error': str(e)}

# Display specification comparison
print('Specification Sensitivity Analysis (Pooled Tobit):')
print('=' * 80)

for label, res in spec_results.items():
    if 'error' in res:
        print(f'\n{label}: ERROR - {res["error"]}')
    else:
        print(f'\n{label} (Log-Lik: {res["llf"]:.2f}, sigma: {res["sigma"]:.4f})')
        print('-' * 60)
        for i, var in enumerate(res['vars']):
            t_stat = res['beta'][i] / res['se'][i] if res['se'][i] > 0 else np.nan
            # Two-sided normal critical values for p < 0.001, 0.01, and 0.05
            if abs(t_stat) > 3.29:
                sig = '***'
            elif abs(t_stat) > 2.58:
                sig = '**'
            elif abs(t_stat) > 1.96:
                sig = '*'
            else:
                sig = ''
            print(f'  {var:22s} {res["beta"][i]:>10.4f} ({res["se"][i]:.4f}) {sig}')

In [None]:
# ============================================================
# Sensitivity 3: Heckman framing on health expenditure data
# ============================================================

# Treat the health data as a selection problem:
# Selection equation: whether expenditure > 0
# Outcome equation: expenditure level given expenditure > 0

health_sel = (df['expenditure'] > 0).astype(int).values
health_y = df['expenditure'].values

# Outcome equation: income, chronic, insurance
health_X_outcome = sm.add_constant(df[['income', 'chronic', 'insurance']].values)

# Selection equation: add age, female, bmi as exclusion restrictions
health_Z_sel = sm.add_constant(
    df[['income', 'chronic', 'insurance', 'age', 'female', 'bmi']].values
)

try:
    heckman_health = PanelHeckman(
        endog=health_y,
        exog=health_X_outcome,
        selection=health_sel,
        exog_selection=health_Z_sel,
        method='two_step',
    )
    heckman_health_result = heckman_health.fit()

    print('Heckman Two-Step on Health Expenditure Data:')
    print(heckman_health_result.summary())

    print(f'\nrho: {heckman_health_result.rho:.4f}')
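    # |rho| > 0.1 is an informal rule of thumb, not a formal test; a t-test on the
    # inverse Mills ratio coefficient is the standard check for selection.
    if abs(heckman_health_result.rho) > 0.1:
        print('=> Suggestive evidence of selection bias: the Tobit single-equation assumption may not hold.')
    else:
        print('=> Little evidence of selection bias: the Tobit assumption appears reasonable.')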

except Exception as e:
    print(f'Heckman estimation on health data failed: {e}')
    print('This may occur if the selection rate is extreme or the model is not identified.')

---

## Part X: Results Summary and Policy Implications (10 min)

We now consolidate all findings into a comprehensive summary.

In [None]:
# ============================================================
# Table 04: Final comprehensive results
# ============================================================

def stars(coef, se):
    """Return significance stars."""
    if se == 0 or np.isnan(se):
        return ''
    t = abs(coef / se)
    if t > 3.29:
        return '***'
    elif t > 2.58:
        return '**'
    elif t > 1.96:
        return '*'
    return ''

print('Table 04: Final Results Summary')
print('=' * 95)
print(f'{"":20s} {"OLS":>14s} {"Pooled Tobit":>14s} {"RE Tobit":>14s} {"AME (uncond.)":>14s}')
print('-' * 95)

for var in covariates:
    i = var_names.index(var)
    
    ols_c = ols_result.params[i]
    ols_s = ols_result.bse[i]
    
    pt_c = tobit_pooled.beta[i]
    pt_s = tobit_pooled.bse[i]
    
    re_c = tobit_re.beta[i]
    re_s = tobit_re.bse[i]
    
    ame_val = ame_unconditional.marginal_effects[var]
    
    line = f'{var:20s}'
    line += f' {ols_c:>10.4f}{stars(ols_c, ols_s):3s}'
    line += f' {pt_c:>10.4f}{stars(pt_c, pt_s):3s}'
    line += f' {re_c:>10.4f}{stars(re_c, re_s):3s}'
    line += f' {ame_val:>14.4f}'
    print(line)
    
    # Standard errors on second line
    se_line = f'{"":20s}'
    se_line += f' ({ols_s:>9.4f})  '
    se_line += f' ({pt_s:>9.4f})  '
    se_line += f' ({re_s:>9.4f})  '
    se_line += f' {"":>14s}'
    print(se_line)

print('-' * 95)
print(f'{"sigma":20s} {"":>14s} {tobit_pooled.sigma:>14.4f} {tobit_re.sigma_eps:>14.4f}')
print(f'{"sigma_alpha":20s} {"":>14s} {"":>14s} {tobit_re.sigma_alpha:>14.4f}')
print(f'{"R2":20s} {ols_result.rsquared:>14.4f}')
print(f'{"Log-Likelihood":20s} {"":>14s} {tobit_pooled.llf:>14.2f} {tobit_re.llf:>14.2f}')
print(f'{"N":20s} {int(ols_result.nobs):>14d} {tobit_pooled.n_obs:>14d} {tobit_re.n_obs:>14d}')
print('=' * 95)
print('\nSignificance: *** p<0.001, ** p<0.01, * p<0.05')
print('Standard errors in parentheses (cluster-robust for OLS, MLE-based for Tobit).')

In [None]:
# ============================================================
# Summary of key findings
# ============================================================

print('KEY FINDINGS')
print('=' * 70)

print('\n1. ATTENUATION BIAS')
print('-' * 70)
print('   OLS coefficients are attenuated relative to the Tobit latent-variable')
print('   coefficients because OLS ignores the censoring of expenditure at zero.')
for var in covariates:
    i = var_names.index(var)
    ols_c = ols_result.params[i]
    pt_c = tobit_pooled.beta[i]
    if abs(ols_c) > 1e-6:
        ratio = pt_c / ols_c
        print(f'   {var:15s}: Tobit/OLS ratio = {ratio:.2f}')

print('\n2. INDIVIDUAL HETEROGENEITY')
print('-' * 70)
rho_icc = tobit_re.sigma_alpha**2 / (tobit_re.sigma_alpha**2 + tobit_re.sigma_eps**2)
print(f'   ICC (intra-class correlation): {rho_icc:.4f}')
print(f'   => {rho_icc*100:.1f}% of latent variance is due to individual effects.')
if rho_icc > 0.1:
    print('   The panel structure is important; RE Tobit is preferred over Pooled Tobit.')

print('\n3. MARGINAL EFFECTS DECOMPOSITION')
print('-' * 70)
print('   The unconditional AME is the policy-relevant quantity. For key variables:')
for var in ['income', 'chronic', 'insurance']:
    i = var_names.index(var)
    beta_val = tobit_pooled.beta[i]
    ame_val = ame_unconditional.marginal_effects[var]
    scale = ame_val / beta_val if abs(beta_val) > 1e-8 else np.nan
    print(f'   {var:15s}: beta={beta_val:.4f}, AME={ame_val:.4f} (scaling={scale:.3f})')

print('\n4. SELECTION BIAS (Heckman on Mroz data)')
print('-' * 70)
print(f'   rho = {heckman_result.rho:.4f}')
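# Informal rule of thumb on |rho|; not a substitute for a test on the IMR coefficient
if abs(heckman_result.rho) > 0.1:
    print('   |rho| > 0.1 suggests selection bias; the Heckman correction is warranted.')
else:
    print('   Minimal evidence of selection bias. OLS on the selected sample is adequate.')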

In [None]:
# ============================================================
# Policy implications
# ============================================================

print('POLICY IMPLICATIONS')
print('=' * 70)

print('''
Based on the Random Effects Tobit (our preferred model):

1. INCOME: Higher income is associated with higher health spending.
   The unconditional AME represents the total effect, including the
   increased probability of any spending.

2. CHRONIC CONDITIONS: Having more chronic conditions strongly increases
   both the probability of positive spending and the amount spent.
   This is the most important predictor in the model.

3. INSURANCE: Insurance coverage is associated with higher health
   expenditure, likely because lower out-of-pocket prices raise
   utilization. The effect operates through both the extensive margin
   (any spending) and the intensive margin (amount spent).

4. METHODOLOGICAL: Ignoring censoring (OLS) leads to underestimating
   the true effects by a substantial margin. The panel structure
   matters: the ICC indicates meaningful individual heterogeneity.
   Future work should consider fixed effects Tobit estimators such as
   Honoré (1992) if the random effects assumption is questionable.
''')

---

## Summary and Key Takeaways

### What We Learned

1. **OLS is biased for censored data**: Coefficients are attenuated toward zero when the
   dependent variable has a mass point at the censoring bound.

2. **Tobit corrects for censoring**: Both Pooled and RE Tobit produce larger (in absolute
   value) coefficients than OLS, reflecting the true latent-variable relationship.

3. **Panel structure matters**: The Random Effects Tobit captures individual heterogeneity
   that the Pooled Tobit ignores, as indicated by a non-trivial ICC.

4. **Raw Tobit coefficients are not marginal effects**: The McDonald-Moffitt decomposition
   is essential for correct interpretation. The unconditional AME is the policy-relevant
   quantity for most applications.

5. **Selection vs. censoring**: The Heckman model relaxes the Tobit assumption that the same
   process governs participation and level. When there are good exclusion restrictions,
   the Heckman model can detect and correct sample selection bias.

6. **Robustness matters**: Sensitivity analysis across subsamples and specifications gives
   confidence that the key conclusions are not artifacts of a particular modeling choice.
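
Takeaways 1 and 2 can be checked on simulated data where the true latent slope is known. The following is a minimal sketch, not PanelBox's implementation: it assumes a single regressor, censoring at zero, and made-up parameter values, and it fits the Tobit likelihood directly with `scipy.optimize.minimize`.

```python
import numpy as np
from scipy import optimize, stats

# Simulated data with made-up parameters: latent y* = 1 + 2x + e, e ~ N(0, 2^2)
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=n)
y = np.maximum(y_star, 0.0)  # observed outcome, left-censored at zero

X = np.column_stack([np.ones(n), x])

# Naive OLS on the censored outcome
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def neg_loglik(theta):
    """Negative Tobit log-likelihood; sigma is parameterized as exp(log_sigma)."""
    beta, log_sigma = theta[:-1], theta[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    ll = np.where(
        y <= 0.0,
        stats.norm.logcdf(-xb / sigma),                   # censored: P(y* <= 0)
        stats.norm.logpdf((y - xb) / sigma) - log_sigma,  # uncensored: density of y
    )
    return -ll.sum()

# Start from the OLS coefficients and the sample standard deviation of y
start = np.append(beta_ols, np.log(y.std()))
fit = optimize.minimize(neg_loglik, start, method="BFGS")
beta_tobit = fit.x[:-1]
sigma_tobit = np.exp(fit.x[-1])

print(f"true slope = 2.0, OLS slope = {beta_ols[1]:.3f}, Tobit slope = {beta_tobit[1]:.3f}")
```

With normally distributed regressors, the OLS slope shrinks roughly in proportion to the share of uncensored observations, while the Tobit estimate recovers the latent slope.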

### Modeling Decision Flowchart

```
Is the dependent variable censored?
    No  --> Standard panel models (FE, RE, GMM)
    Yes --> Is the censoring mechanism the same as the outcome process?
              Yes --> Tobit model
                        Is there panel structure?
                          No  --> Pooled Tobit
                          Yes --> Random Effects Tobit
                                    Concern about RE assumption?
                                      Yes --> Honoré Fixed Effects Tobit
              No  --> Heckman selection model
                        Have exclusion restrictions?
                          Yes --> Two-step or MLE
                          No  --> Identification is weak; proceed with caution
```
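
The same decision logic can be written as a small function, which is convenient when documenting the modeling choice in an analysis script. This is only a sketch: `recommend_model` and its argument names are hypothetical helpers, not part of PanelBox.

```python
def recommend_model(censored, same_process=True, panel=False,
                    doubt_re=False, has_exclusion=False):
    """Walk the decision flowchart and return the suggested model family."""
    if not censored:
        return "Standard panel models (FE, RE, GMM)"
    if same_process:
        # Censoring and outcome are governed by one process: Tobit branch
        if not panel:
            return "Pooled Tobit"
        if doubt_re:
            return "Honore fixed effects Tobit"
        return "Random Effects Tobit"
    # Participation and level follow different processes: selection branch
    if has_exclusion:
        return "Heckman selection model (two-step or MLE)"
    return "Heckman selection model (weak identification; proceed with caution)"

print(recommend_model(censored=True, same_process=True, panel=True))
```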

### Connection to Other Notebooks

| Notebook | Topic | Builds On |
|----------|-------|-----------|
| 01 | Tobit Fundamentals | -- |
| 02 | Pooled vs Panel Tobit | 01 |
| 03 | Random Effects Tobit | 01, 02 |
| 04 | Marginal Effects | 01-03 |
| 05 | Heckman Selection | 01 |
| 06 | Model Diagnostics | 01-05 |
| 07 | Advanced Topics | 01-06 |
| **08** | **Complete Case Study** | **01-07 (this notebook)** |

---

## Exercises

### Exercise 1: Extended Model Comparison (20 min)

Extend the analysis by:

a) Fitting a **log-transformed OLS** model (i.e., `log(1 + expenditure)` as dependent variable)
   and comparing its predicted values, converted back to the expenditure scale, with those
   from the Tobit model.

b) Computing the **AIC and BIC** for the Pooled Tobit and RE Tobit models.
   Recall: $\text{AIC} = -2 \ln L + 2k$ and $\text{BIC} = -2 \ln L + k \ln n$.

c) Which model does each criterion prefer?
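
The two formulas translate directly into a helper function. The sketch below uses hypothetical inputs purely for illustration; `information_criteria` is not a PanelBox function. When counting $k$, remember that variance parameters count too ($\sigma$ for the Pooled Tobit; $\sigma_\varepsilon$ and $\sigma_\alpha$ for the RE Tobit).

```python
import numpy as np

def information_criteria(llf, k, n):
    """Return (AIC, BIC) for log-likelihood llf, k parameters, and n observations."""
    aic = -2.0 * llf + 2.0 * k
    bic = -2.0 * llf + k * np.log(n)
    return aic, bic

# Hypothetical values, not results from this notebook
aic, bic = information_criteria(llf=-1000.0, k=8, n=500)
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Lower values are preferred under both criteria; BIC penalizes extra parameters more heavily once $\ln n > 2$.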

In [None]:
# Exercise 1: Your code here
# --------------------------

# (a) Log-transformed OLS
# y_log = np.log1p(y)
# ols_log = sm.OLS(y_log, X).fit(cov_type='cluster', cov_kwds={'groups': groups})
# ...

# (b) Information criteria
# k_pooled = len(tobit_pooled.params)
# k_re = len(tobit_re.params)
# n = tobit_pooled.n_obs
# aic_pooled = -2 * tobit_pooled.llf + 2 * k_pooled
# bic_pooled = -2 * tobit_pooled.llf + k_pooled * np.log(n)
# ...

# (c) Interpretation
# ...

### Exercise 2: Marginal Effects at Representative Values (15 min)

Compute the marginal effect of `insurance` for two profiles:

- **Profile A**: age=35, income=30, chronic=0, female=1, bmi=25 (young healthy woman)
- **Profile B**: age=65, income=50, chronic=3, female=0, bmi=30 (older man with chronic conditions)

Compute both the unconditional and probability marginal effects at each profile
using the Pooled Tobit coefficients.

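*Hint*: For a specific observation, the unconditional ME of variable $k$ is
$\beta_k \cdot \Phi((X'\beta - c)/\sigma)$, and the ME on the probability of a positive
outcome is $(\beta_k/\sigma) \cdot \phi((X'\beta - c)/\sigma)$, where $c = 0$ is the
censoring point. Evaluate both at each profile's covariate values. Because `insurance`
is binary, these derivative formulas are approximations; the exact effect is the
difference in the predicted quantity between `insurance=1` and `insurance=0`.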
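
As a starting point, the derivative-based formulas can be wrapped in a function and evaluated at any covariate vector. This is a sketch under stated assumptions: the coefficient vector and `sigma` below are made-up placeholders (not estimates from this notebook), and the column order is assumed to be `[const, income, age, chronic, insurance, female, bmi]`.

```python
import numpy as np
from scipy.stats import norm

def tobit_me_at_profile(x, beta, sigma, k, c=0.0):
    """Unconditional and probability marginal effects of regressor k at profile x."""
    z = (x @ beta - c) / sigma
    me_uncond = beta[k] * norm.cdf(z)        # dE[y | x] / dx_k
    me_prob = beta[k] / sigma * norm.pdf(z)  # dP(y > c | x) / dx_k
    return me_uncond, me_prob

# Placeholder coefficients for illustration only
beta_demo = np.array([-2.0, 0.05, 0.03, 1.2, 0.8, 0.4, 0.02])
sigma_demo = 3.0
k_insurance = 4  # position of `insurance` in the assumed column order

x_a = np.array([1, 30, 35, 0, 0, 1, 25], dtype=float)  # Profile A
x_b = np.array([1, 50, 65, 3, 0, 0, 30], dtype=float)  # Profile B

me_a = tobit_me_at_profile(x_a, beta_demo, sigma_demo, k_insurance)
me_b = tobit_me_at_profile(x_b, beta_demo, sigma_demo, k_insurance)
print("Profile A (unconditional, probability):", me_a)
print("Profile B (unconditional, probability):", me_b)
```

Replacing `beta_demo` and `sigma_demo` with the fitted Pooled Tobit values gives the quantities the exercise asks for.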

In [None]:
# Exercise 2: Your code here
# --------------------------

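# Profile A (entries must follow the column order of X: constant first, then `covariates`)
# x_a = np.array([1, 30, 35, 0, 0, 1, 25])  # [const, income, age, chronic, insurance, female, bmi]
# z_a = (x_a @ tobit_pooled.beta - 0) / tobit_pooled.sigma  # censoring point c = 0
# ...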

# Profile B
# x_b = np.array([1, 50, 65, 3, 0, 0, 30])
# ...

### Exercise 3: Heckman MLE Estimation (20 min)

a) Re-estimate the Heckman model on the Mroz data using **MLE** instead of two-step.

b) Compare the two-step and MLE estimates. How different are the outcome coefficients?
   How different are the estimated $\rho$ and $\sigma$?

c) Use the `compare_heckman_methods` utility (from `comparison_tools`) to generate
   a formatted comparison table.

*Hint*: Use `PanelHeckman(..., method='mle')` and note the performance warning
for larger samples.
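
For orientation when comparing the two estimators, the two-step procedure can be sketched from scratch on simulated data. This is a hedged illustration, not PanelBox's `PanelHeckman` implementation: the data-generating values (`rho`, `sigma`, the coefficients) and the variable names are made up, the probit is fit with a hand-written likelihood, and the second-step standard errors (which require a correction for the estimated IMR) are omitted.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=n)   # regressor in both equations
w = rng.normal(size=n)   # exclusion restriction: enters selection only
rho, sigma = 0.6, 1.5    # made-up error correlation and outcome error scale

u = rng.normal(size=n)   # selection error, unit variance
e = sigma * (rho * u + np.sqrt(1.0 - rho**2) * rng.normal(size=n))

Z = np.column_stack([np.ones(n), x, w])  # selection regressors
X = np.column_stack([np.ones(n), x])     # outcome regressors
selected = (Z @ np.array([0.3, 0.5, 1.0]) + u) > 0
y = X @ np.array([1.0, 2.0]) + e         # treated as observed only where selected

# Step 1: probit for selection, fit by maximum likelihood
def probit_nll(gamma):
    sign = 2 * selected - 1  # +1 if selected, -1 otherwise
    return -stats.norm.logcdf(sign * (Z @ gamma)).sum()

gamma = optimize.minimize(probit_nll, np.zeros(Z.shape[1]), method="BFGS").x
index = Z @ gamma
imr = stats.norm.pdf(index) / stats.norm.cdf(index)  # inverse Mills ratio

# Step 2: OLS on the selected sample, adding the IMR as a regressor
X_aug = np.column_stack([X[selected], imr[selected]])
coef_heckman, *_ = np.linalg.lstsq(X_aug, y[selected], rcond=None)
coef_naive, *_ = np.linalg.lstsq(X[selected], y[selected], rcond=None)

print(f"naive OLS slope: {coef_naive[1]:.3f}")
print(f"two-step slope:  {coef_heckman[1]:.3f}")
print(f"IMR coefficient: {coef_heckman[2]:.3f} (true rho*sigma = {rho * sigma:.2f})")
```

The coefficient on the IMR estimates $\rho\sigma$, which is why a test of that coefficient doubles as a test for selection bias. MLE estimates $\rho$ and $\sigma$ jointly instead of backing them out from the second step.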

In [None]:
# Exercise 3: Your code here
# --------------------------

# (a) Heckman MLE
# heckman_mle = PanelHeckman(
#     endog=wage,
#     exog=X_outcome,
#     selection=selection,
#     exog_selection=Z_selection,
#     method='mle',
# )
# heckman_mle_result = heckman_mle.fit()

# (b) Compare estimates
# print(heckman_mle_result.summary())

# (c) Use comparison utility
# from comparison_tools import compare_heckman_methods
# comparison = compare_heckman_methods(heckman_result, heckman_mle_result, outcome_names)
# display(comparison)

### Exercise 4: Prediction and Model Validation (15 min)

a) Generate **in-sample predictions** from the Pooled Tobit (censored predictions)
   and compare them to OLS fitted values. Plot predicted vs. actual values for both.

b) Compute the **RMSE** for both models (on the observed, not latent, scale).

c) Which model produces predictions that are more consistent with the observed
   distribution of expenditures? Why?
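
For part (a), a censored prediction is usually the unconditional expectation $E[y \mid X] = c\,\Phi(-z) + X'\beta\,\Phi(z) + \sigma\,\phi(z)$ with $z = (X'\beta - c)/\sigma$. The sketch below computes it by hand; whether `predict(pred_type='censored')` returns exactly this quantity is an assumption to verify against the PanelBox documentation, and the function names and inputs here are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def tobit_censored_mean(xb, sigma, c=0.0):
    """E[y | x] for y = max(y*, c) with y* ~ N(xb, sigma^2)."""
    xb = np.asarray(xb, dtype=float)
    z = (xb - c) / sigma
    return c * norm.cdf(-z) + xb * norm.cdf(z) + sigma * norm.pdf(z)

def rmse(actual, predicted):
    """Root mean squared error on the observed scale."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Illustrative linear indices, not values from this notebook
xb_demo = np.array([-3.0, 0.0, 5.0])
pred_demo = tobit_censored_mean(xb_demo, sigma=1.0)
print(pred_demo)
```

Unlike OLS fitted values, these predictions can never fall below the censoring point, which is relevant to part (c).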

In [None]:
# Exercise 4: Your code here
# --------------------------

# (a) In-sample predictions
# y_pred_tobit = tobit_pooled.predict(pred_type='censored')
# y_pred_ols = ols_result.fittedvalues

# (b) RMSE
# rmse_tobit = np.sqrt(np.mean((y - y_pred_tobit)**2))
# rmse_ols = np.sqrt(np.mean((y - y_pred_ols)**2))
# ...

# (c) Plot
# fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# ...

---

*This notebook is part of the PanelBox Censored Models Tutorial Series.*
*For questions or feedback, consult the PanelBox documentation.*