# Complete Case Study: Solutions
## Notebook 08 -- Exercise Solutions

This notebook provides complete, annotated solutions for all four exercises
from the **Complete Case Study** notebook (08). Each solution builds on
the models fitted in the main notebook.

| Exercise | Topic | Duration |
|----------|-------|----------|
| 1 | Extended Model Comparison (Log OLS, AIC/BIC) | 20 min |
| 2 | Marginal Effects at Representative Values | 15 min |
| 3 | Heckman MLE Estimation | 20 min |
| 4 | Prediction and Model Validation | 15 min |

---

## Setup

We reproduce the setup and model estimation from the main notebook so that
the exercise solutions are self-contained.

In [None]:
# ============================================================
# Imports and configuration
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import sys

warnings.filterwarnings('ignore')

from scipy import stats
import statsmodels.api as sm

from panelbox.models.censored import PooledTobit, RandomEffectsTobit
from panelbox.models.selection import PanelHeckman
from panelbox.marginal_effects.censored_me import compute_tobit_ame, compute_tobit_mem

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

np.random.seed(42)

BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures'
TABLES_DIR = OUTPUT_DIR / 'tables'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

sys.path.insert(0, str(BASE_DIR / 'utils'))
from comparison_tools import compare_tobit_ols, compare_heckman_methods

print('Setup complete!')

---

## Data Loading and Baseline Models

We load both datasets and fit the baseline models needed by the exercises.

In [None]:
# ============================================================
# Load health expenditure panel
# ============================================================

df = pd.read_csv(DATA_DIR / 'health_expenditure_panel.csv')

depvar = 'expenditure'
covariates = ['income', 'age', 'chronic', 'insurance', 'female', 'bmi']

y = df[depvar].values
X_raw = df[covariates].values
X = sm.add_constant(X_raw)
var_names = ['const'] + covariates
groups = df['id'].values

n_censored = (y == 0).sum()
pct_censored = n_censored / len(y) * 100

print(f'Dataset: {df.shape[0]} obs, {df["id"].nunique()} individuals')
print(f'Censored at zero: {n_censored} ({pct_censored:.1f}%)')

In [None]:
# ============================================================
# Fit baseline OLS
# ============================================================

ols_model = sm.OLS(y, X)
ols_result = ols_model.fit(cov_type='cluster', cov_kwds={'groups': groups})

print('OLS fitted.')
print(f'  R-squared: {ols_result.rsquared:.4f}')

In [None]:
# ============================================================
# Fit Pooled Tobit
# ============================================================

tobit_pooled = PooledTobit(
    endog=y,
    exog=X,
    groups=groups,
    censoring_point=0.0,
)
tobit_pooled.fit()
tobit_pooled.exog_names = var_names

n_beta = len(var_names)
print('Pooled Tobit fitted.')
print(f'  Log-likelihood: {tobit_pooled.llf:.2f}')
print(f'  sigma: {tobit_pooled.sigma:.4f}')

In [None]:
# ============================================================
# Fit Random Effects Tobit
# ============================================================

tobit_re = RandomEffectsTobit(
    endog=y,
    exog=X,
    groups=groups,
    censoring_point=0.0,
    quadrature_points=12,
)
tobit_re.fit(method='BFGS', maxiter=2000, options={'disp': False})

print('RE Tobit fitted.')
print(f'  Log-likelihood: {tobit_re.llf:.2f}')
print(f'  sigma_eps: {tobit_re.sigma_eps:.4f}')
print(f'  sigma_alpha: {tobit_re.sigma_alpha:.4f}')

In [None]:
# ============================================================
# Load Mroz data and fit Heckman two-step
# ============================================================

mroz = pd.read_csv(DATA_DIR / 'mroz_1987.csv')

selection = mroz['lfp'].values.astype(int)
wage = mroz['wage'].fillna(0).values

outcome_vars = ['education', 'experience', 'experience_sq']
X_outcome = sm.add_constant(mroz[outcome_vars].values)
outcome_names = ['const'] + outcome_vars

selection_vars = ['education', 'experience', 'experience_sq', 'age',
                  'children_lt6', 'children_6_18', 'husband_income']
Z_selection = sm.add_constant(mroz[selection_vars].values)
selection_names = ['const'] + selection_vars

heckman_model = PanelHeckman(
    endog=wage,
    exog=X_outcome,
    selection=selection,
    exog_selection=Z_selection,
    method='two_step',
)
heckman_result = heckman_model.fit()

print('Heckman two-step fitted.')
print(f'  rho: {heckman_result.rho:.4f}')
print(f'  sigma: {heckman_result.sigma:.4f}')
print(f'  Selected: {heckman_result.n_selected} / {heckman_result.n_total}')

---

## Exercise 1: Extended Model Comparison (20 min)

**Tasks:**

a) Fit a **log-transformed OLS** model using `log(1 + expenditure)` as the dependent
   variable and compare its predicted values with the Tobit model.

b) Compute **AIC** and **BIC** for the Pooled Tobit and RE Tobit models.
   Recall: $\text{AIC} = -2 \ln L + 2k$ and $\text{BIC} = -2 \ln L + k \ln n$.

c) Which model does each criterion prefer?

### Solution 1a: Log-Transformed OLS

In [None]:
# ============================================================
# Exercise 1a: Fit log-transformed OLS
# ============================================================

# Transform the dependent variable: log(1 + expenditure)
# This is a common "quick fix" for censored data. It reduces
# right-skewness by compressing large values, but the pile-up at zero
# remains (log1p(0) = 0) and the censoring mechanism is NOT modeled.

y_log = np.log1p(y)  # log(1 + y)

ols_log_model = sm.OLS(y_log, X)
ols_log_result = ols_log_model.fit(cov_type='cluster', cov_kwds={'groups': groups})

print('Log-Transformed OLS Results')
print('=' * 65)

log_ols_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': ols_log_result.params,
    'Std. Error': ols_log_result.bse,
    't-stat': ols_log_result.tvalues,
    'p-value': ols_log_result.pvalues,
}).set_index('Variable')

display(log_ols_table.round(4))

print(f'\nR-squared:    {ols_log_result.rsquared:.4f}')
print(f'Observations: {int(ols_log_result.nobs)}')

In [None]:
# ============================================================
# Compare predictions: Log OLS vs Tobit vs OLS (levels)
# ============================================================

# To compare predictions on the same scale, we transform the log-OLS
# fitted values back to levels: exp(y_hat_log) - 1.
# Note: the fitted values estimate E[log(1+y)|X], so the back-transformed
# value is not E[y|X]. By Jensen's inequality it understates E[y|X];
# we use it here only for a rough comparison.

y_pred_log_ols = np.expm1(ols_log_result.fittedvalues)  # exp(fitted) - 1
y_pred_ols = ols_result.fittedvalues
y_pred_tobit = tobit_pooled.predict(pred_type='censored')

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Panel A: OLS (levels) predicted vs actual
axes[0].scatter(y, y_pred_ols, alpha=0.3, s=8, color='steelblue')
max_val = max(y.max(), y_pred_ols.max())
axes[0].plot([0, max_val], [0, max_val], 'r--', linewidth=1.5, label='45-degree line')
axes[0].set_xlabel('Actual Expenditure')
axes[0].set_ylabel('Predicted Expenditure')
axes[0].set_title('A. OLS (Levels)')
axes[0].legend()

# Panel B: Log OLS predicted vs actual
axes[1].scatter(y, y_pred_log_ols, alpha=0.3, s=8, color='darkorange')
max_val_log = max(y.max(), y_pred_log_ols.max())
axes[1].plot([0, max_val_log], [0, max_val_log], 'r--', linewidth=1.5, label='45-degree line')
axes[1].set_xlabel('Actual Expenditure')
axes[1].set_ylabel('Predicted Expenditure')
axes[1].set_title('B. Log OLS (retransformed)')
axes[1].legend()

# Panel C: Tobit predicted vs actual
axes[2].scatter(y, y_pred_tobit, alpha=0.3, s=8, color='seagreen')
max_val_tobit = max(y.max(), y_pred_tobit.max())
axes[2].plot([0, max_val_tobit], [0, max_val_tobit], 'r--', linewidth=1.5, label='45-degree line')
axes[2].set_xlabel('Actual Expenditure')
axes[2].set_ylabel('Predicted Expenditure')
axes[2].set_title('C. Pooled Tobit (censored predictions)')
axes[2].legend()

plt.suptitle('Exercise 1a: Predicted vs Actual -- OLS, Log OLS, Tobit', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex1a_prediction_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

# RMSE comparison (all on original scale)
rmse_ols = np.sqrt(np.mean((y - y_pred_ols)**2))
rmse_log_ols = np.sqrt(np.mean((y - y_pred_log_ols)**2))
rmse_tobit = np.sqrt(np.mean((y - y_pred_tobit)**2))

print(f'RMSE (original scale):')
print(f'  OLS (levels):        {rmse_ols:.4f}')
print(f'  Log OLS (retransf.): {rmse_log_ols:.4f}')
print(f'  Pooled Tobit:        {rmse_tobit:.4f}')

**Discussion (1a):**

The log-transformed OLS is a commonly used ad-hoc approach to handle skewed and
censored data. However, it has several drawbacks:

1. The coefficients are in the log scale, making direct comparison with Tobit
   coefficients difficult.
2. Retransformation to levels introduces bias due to Jensen's inequality:
   $\exp(E[\log(1+y) \mid X]) \leq E[1 + y \mid X]$, so $\exp(\hat{y}_{\log}) - 1$
   understates $E[y \mid X]$. A smearing estimator (Duan, 1983) would partially
   correct this.
3. Unlike Tobit, log-OLS does not model the censoring mechanism explicitly and
   cannot decompose effects into the intensive and extensive margins.

The Tobit model is preferred here because it models left-censoring at zero
explicitly and yields interpretable marginal effects through the
McDonald-Moffitt decomposition. That preference rests on the Tobit assumptions
(normal, homoskedastic latent errors) being reasonable for these data.
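The retransformation point can be made concrete with a minimal sketch of Duan's smearing correction on synthetic data. Everything here is illustrative: the simulated variables (`x_demo`, `y_demo`) and the coefficient values are assumptions, not the health expenditure panel, and the correction assumes homoskedastic errors on the log scale.

```python
import numpy as np

# Synthetic, illustrative data (NOT the health expenditure panel)
rng = np.random.default_rng(0)
n_demo = 2000
x_demo = rng.normal(size=n_demo)
X_demo = np.column_stack([np.ones(n_demo), x_demo])

# Generate a non-negative, right-skewed outcome with some zeros
log_index = 1.0 + 0.5 * x_demo + rng.normal(scale=0.8, size=n_demo)
y_demo = np.clip(np.expm1(log_index), 0.0, None)

# OLS on log(1 + y)
y_log_demo = np.log1p(y_demo)
coef, *_ = np.linalg.lstsq(X_demo, y_log_demo, rcond=None)
fitted_log = X_demo @ coef
resid_log = y_log_demo - fitted_log

# Naive back-transformation: exp(fitted) - 1
pred_naive = np.expm1(fitted_log)

# Duan (1983) smearing factor: sample mean of exp(residuals)
smear = np.mean(np.exp(resid_log))
pred_smear = np.exp(fitted_log) * smear - 1.0

print(f'Smearing factor:       {smear:.4f}')
print(f'Mean actual y:         {y_demo.mean():.4f}')
print(f'Mean naive prediction: {pred_naive.mean():.4f}')
print(f'Mean smeared pred.:    {pred_smear.mean():.4f}')
```

The smearing factor is at least 1, so the corrected predictions are never below the naive ones. With heteroskedastic log-scale errors, which are likely when many observations sit at zero, a single factor is only a rough correction.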

### Solution 1b: AIC and BIC
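As a quick check of the two formulas before they are applied to the fitted models, the sketch below evaluates them on hypothetical log-likelihoods. The helper `information_criteria` and all numbers are assumptions chosen so that the two criteria disagree; they are not results from this notebook.

```python
import math


def information_criteria(llf, k, n):
    """Return (AIC, BIC) for log-likelihood llf, k parameters, n observations."""
    aic = -2.0 * llf + 2.0 * k
    bic = -2.0 * llf + k * math.log(n)
    return aic, bic


# Hypothetical values for illustration only
aic_small, bic_small = information_criteria(llf=-1050.0, k=8, n=2000)
aic_large, bic_large = information_criteria(llf=-1047.0, k=9, n=2000)

print(f'Smaller model: AIC={aic_small:.2f}, BIC={bic_small:.2f}')
print(f'Larger model:  AIC={aic_large:.2f}, BIC={bic_large:.2f}')
print('AIC prefers:', 'larger' if aic_large < aic_small else 'smaller')
print('BIC prefers:', 'larger' if bic_large < bic_small else 'smaller')
```

In this constructed case the extra parameter raises the log-likelihood by 3, which is enough to offset the AIC penalty of 2 but not the BIC penalty of $\ln(2000) \approx 7.6$.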

In [None]:
# ============================================================
# Exercise 1b: Compute AIC and BIC for Pooled and RE Tobit
# ============================================================

# Pooled Tobit parameters: K betas + 1 sigma = K + 1
k_pooled = len(var_names) + 1  # betas + sigma
n = tobit_pooled.n_obs

aic_pooled = -2 * tobit_pooled.llf + 2 * k_pooled
bic_pooled = -2 * tobit_pooled.llf + k_pooled * np.log(n)

# RE Tobit parameters: K betas + sigma_eps + sigma_alpha = K + 2
k_re = len(var_names) + 2  # betas + sigma_eps + sigma_alpha
n_re = tobit_re.n_obs

aic_re = -2 * tobit_re.llf + 2 * k_re
bic_re = -2 * tobit_re.llf + k_re * np.log(n_re)

# Display results
ic_table = pd.DataFrame({
    'Model': ['Pooled Tobit', 'RE Tobit'],
    'k (params)': [k_pooled, k_re],
    'Log-Lik': [tobit_pooled.llf, tobit_re.llf],
    'AIC': [aic_pooled, aic_re],
    'BIC': [bic_pooled, bic_re],
}).set_index('Model')

print('Information Criteria Comparison')
print('=' * 65)
display(ic_table.round(2))

print(f'\nAIC difference (Pooled - RE): {aic_pooled - aic_re:.2f}')
print(f'BIC difference (Pooled - RE): {bic_pooled - bic_re:.2f}')

### Solution 1c: Interpretation
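The likelihood ratio test of $\sigma_\alpha = 0$ places the null on the boundary of the parameter space, so the reference distribution is a 50:50 mixture of $\chi^2(0)$ and $\chi^2(1)$. A minimal, standard-library sketch of that p-value follows; the helper name `boundary_lr_pvalue` and the input log-likelihoods are hypothetical.

```python
import math


def boundary_lr_pvalue(ll_restricted, ll_unrestricted):
    """LR statistic and p-value under a 50:50 chi2(0)/chi2(1) mixture."""
    lr = max(2.0 * (ll_unrestricted - ll_restricted), 0.0)
    if lr == 0.0:
        # No improvement in fit: no evidence against H0
        return lr, 1.0
    # Survival function of chi2(1): P(Z^2 > lr) = erfc(sqrt(lr / 2))
    chi2_1_tail = math.erfc(math.sqrt(lr / 2.0))
    return lr, 0.5 * chi2_1_tail


# Hypothetical log-likelihoods for illustration only
lr_demo, p_demo = boundary_lr_pvalue(ll_restricted=-1050.0, ll_unrestricted=-1047.0)
print(f'LR = {lr_demo:.3f}, mixture p-value = {p_demo:.4f}')
```

Using the unhalved $\chi^2(1)$ tail instead would give a p-value twice as large, which is the conservative choice.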

In [None]:
# ============================================================
# Exercise 1c: Which model does each criterion prefer?
# ============================================================

# Lower AIC/BIC is preferred
aic_preferred = 'RE Tobit' if aic_re < aic_pooled else 'Pooled Tobit'
bic_preferred = 'RE Tobit' if bic_re < bic_pooled else 'Pooled Tobit'

print('Model Selection Summary')
print('=' * 65)
print(f'  AIC prefers: {aic_preferred}')
print(f'  BIC prefers: {bic_preferred}')
print()

# Additional context: likelihood ratio test
# The Pooled Tobit is nested within the RE Tobit (sigma_alpha = 0)
# LR = 2 * (ll_RE - ll_Pooled)
# Under H0, sigma_alpha sits on the boundary of the parameter space, so
# LR ~ 50:50 mixture of chi2(0) and chi2(1)
lr_stat = max(2 * (tobit_re.llf - tobit_pooled.llf), 0.0)
# Mixture p-value: half the chi2(1) tail probability. The unhalved
# chi2(1) p-value would be the conservative alternative.
lr_pvalue = 0.5 * stats.chi2.sf(lr_stat, df=1) if lr_stat > 0 else 1.0

print(f'Likelihood Ratio Test (H0: sigma_alpha = 0):')
print(f'  LR statistic:  {lr_stat:.4f}')
print(f'  p-value:       {lr_pvalue:.6f} (50:50 mixture of chi2(0) and chi2(1))')
if lr_pvalue < 0.05:
    print(f'  => Reject H0 at 5%. Individual heterogeneity is significant.')
    print(f'     The RE Tobit is preferred over the Pooled Tobit.')
else:
    print(f'  => Cannot reject H0. Pooled Tobit may be adequate.')

print()
print('Discussion:')
print('-' * 65)
print('For n >= 8, BIC penalizes each parameter more heavily than AIC')
print('(k*ln(n) vs 2k), so BIC is harder on the extra RE parameter.')
print('Under each criterion, the lower value marks the better fit-complexity trade-off.')
print('The LR test directly tests whether the random effect variance is zero.')

---

## Exercise 2: Marginal Effects at Representative Values (15 min)

**Task:** Compute the marginal effect of `insurance` for two profiles:

- **Profile A**: age=35, income=30, chronic=0, female=1, bmi=25
  (young healthy woman)
- **Profile B**: age=65, income=50, chronic=3, female=0, bmi=30
  (older man with chronic conditions)

Compute both the **unconditional** and **probability** marginal effects at
each profile using the Pooled Tobit coefficients.

**Formulas (left-censored at $c = 0$):**

- Unconditional ME: $\frac{\partial E[y|X]}{\partial x_k} = \beta_k \cdot \Phi\left(\frac{X'\beta}{\sigma}\right)$

- Probability ME: $\frac{\partial P(y > 0 | X)}{\partial x_k} = \frac{\beta_k}{\sigma} \cdot \phi\left(\frac{X'\beta}{\sigma}\right)$
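Both formulas follow from differentiating $E[y|X] = \Phi(z)\,X'\beta + \sigma\,\phi(z)$ and $P(y > 0|X) = \Phi(z)$ with $z = X'\beta/\sigma$. The sketch below checks them against central finite differences using only the standard library. The helper names and the values of `xb_demo`, `sigma_demo`, and `beta_k_demo` are hypothetical, not estimates from the Pooled Tobit.

```python
from statistics import NormalDist

_std_normal = NormalDist()


def tobit_expected_value(xb, sigma):
    """E[y | X] for a Tobit left-censored at zero."""
    z = xb / sigma
    return _std_normal.cdf(z) * xb + sigma * _std_normal.pdf(z)


def tobit_marginal_effects(xb, sigma, beta_k):
    """Unconditional and probability marginal effects of regressor k."""
    z = xb / sigma
    me_unconditional = beta_k * _std_normal.cdf(z)
    me_probability = (beta_k / sigma) * _std_normal.pdf(z)
    return me_unconditional, me_probability


# Hypothetical values for illustration only
xb_demo, sigma_demo, beta_k_demo = 0.4, 2.0, 1.5
me_u, me_p = tobit_marginal_effects(xb_demo, sigma_demo, beta_k_demo)

# Central finite differences: a step h in x_k shifts the index by beta_k * h
h = 1e-6
shift = beta_k_demo * h
fd_u = (tobit_expected_value(xb_demo + shift, sigma_demo)
        - tobit_expected_value(xb_demo - shift, sigma_demo)) / (2 * h)
fd_p = (_std_normal.cdf((xb_demo + shift) / sigma_demo)
        - _std_normal.cdf((xb_demo - shift) / sigma_demo)) / (2 * h)

print(f'Unconditional ME: formula={me_u:.6f}, finite diff={fd_u:.6f}')
print(f'Probability ME:   formula={me_p:.6f}, finite diff={fd_p:.6f}')
```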

In [None]:
# ============================================================
# Exercise 2: Define the two profiles
# ============================================================

# Variable order in X: [const, income, age, chronic, insurance, female, bmi]
# Insurance is set to 0 in both profiles. Because insurance is binary, the
# exact effect is the discrete change E[y | ins=1] - E[y | ins=0]. Here we
# follow the exercise and use the calculus-based marginal effect evaluated
# at the profile covariates, which approximates that change.

# Profile A: young healthy woman
x_a = np.array([1.0, 30.0, 35.0, 0.0, 0.0, 1.0, 25.0])

# Profile B: older man with chronic conditions
x_b = np.array([1.0, 50.0, 65.0, 3.0, 0.0, 0.0, 30.0])

beta = tobit_pooled.beta
sigma = tobit_pooled.sigma

print('Pooled Tobit coefficients:')
for i, name in enumerate(var_names):
    print(f'  {name:12s}: {beta[i]:.4f}')
print(f'  {"sigma":12s}: {sigma:.4f}')

In [None]:
# ============================================================
# Exercise 2: Compute marginal effects at each profile
# ============================================================

# Index of 'insurance' in the variable list
ins_idx = var_names.index('insurance')
beta_ins = beta[ins_idx]

def compute_me_at_profile(x_profile, beta, sigma, var_idx, var_names):
    """
    Compute unconditional and probability marginal effects
    at a given covariate profile.

    For left-censoring at c = 0:
      z = X'beta / sigma
      ME_unconditional = beta_k * Phi(z)
      ME_probability   = (beta_k / sigma) * phi(z)
    """
    # Linear prediction at the profile
    xb = x_profile @ beta

    # z = (X'beta - c) / sigma, with c = 0
    z = xb / sigma

    # CDF and PDF at z
    Phi_z = stats.norm.cdf(z)
    phi_z = stats.norm.pdf(z)

    beta_k = beta[var_idx]

    me_unconditional = beta_k * Phi_z
    me_probability = (beta_k / sigma) * phi_z

    return {
        'xb': xb,
        'z': z,
        'Phi_z': Phi_z,
        'phi_z': phi_z,
        'me_unconditional': me_unconditional,
        'me_probability': me_probability,
    }


# Profile A
me_a = compute_me_at_profile(x_a, beta, sigma, ins_idx, var_names)

# Profile B
me_b = compute_me_at_profile(x_b, beta, sigma, ins_idx, var_names)

# Display results
print('Marginal Effect of Insurance at Representative Profiles')
print('=' * 70)
print(f'{"":30s} {"Profile A":>15s} {"Profile B":>15s}')
print(f'{"":30s} {"(Young woman)":>15s} {"(Older man)":>15s}')
print('-' * 70)
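# Backslashes inside f-string expressions are a SyntaxError before
# Python 3.12, so the labels containing an apostrophe are defined first.
label_xb = "X'beta (linear prediction)"
label_z = "z = X'beta / sigma"
print(f'{label_xb:30s} {me_a["xb"]:>15.4f} {me_b["xb"]:>15.4f}')
print(f'{label_z:30s} {me_a["z"]:>15.4f} {me_b["z"]:>15.4f}')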
print(f'{"Phi(z) [P(y > 0 | X)]":30s} {me_a["Phi_z"]:>15.4f} {me_b["Phi_z"]:>15.4f}')
print(f'{"phi(z)":30s} {me_a["phi_z"]:>15.4f} {me_b["phi_z"]:>15.4f}')
print('-' * 70)
print(f'{"ME unconditional":30s} {me_a["me_unconditional"]:>15.4f} {me_b["me_unconditional"]:>15.4f}')
print(f'{"ME probability":30s} {me_a["me_probability"]:>15.4f} {me_b["me_probability"]:>15.4f}')
print('-' * 70)
print(f'{"beta (insurance, latent)":30s} {beta_ins:>15.4f} {beta_ins:>15.4f}')

In [None]:
# ============================================================
# Visualize the differences between profiles
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of marginal effects
labels = ['Unconditional ME', 'Probability ME']
vals_a = [me_a['me_unconditional'], me_a['me_probability']]
vals_b = [me_b['me_unconditional'], me_b['me_probability']]

idx = np.arange(len(labels))
width = 0.35

axes[0].bar(idx - width/2, vals_a, width, label='Profile A (Young woman)', color='steelblue', alpha=0.8)
axes[0].bar(idx + width/2, vals_b, width, label='Profile B (Older man)', color='darkorange', alpha=0.8)
axes[0].set_xticks(idx)
axes[0].set_xticklabels(labels, fontsize=11)
axes[0].set_ylabel('Marginal Effect of Insurance')
axes[0].set_title('Insurance Marginal Effects by Profile')
axes[0].legend(fontsize=10)
axes[0].axhline(y=0, color='black', linewidth=0.8)

# Show the probability of non-censoring, Phi(z), for each profile
# (phi_vals holds the normal CDF values, not the density)
profile_labels = ['Profile A\n(Young woman)', 'Profile B\n(Older man)']
phi_vals = [me_a['Phi_z'], me_b['Phi_z']]

axes[1].bar(profile_labels, phi_vals, color=['steelblue', 'darkorange'], alpha=0.8, edgecolor='black')
axes[1].set_ylabel('P(expenditure > 0 | X)')
axes[1].set_title('Probability of Positive Expenditure')
axes[1].set_ylim(0, 1)
for i, v in enumerate(phi_vals):
    axes[1].text(i, v + 0.02, f'{v:.3f}', ha='center', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex2_me_profiles.png', dpi=150, bbox_inches='tight')
plt.show()

**Discussion (Exercise 2):**

The key insight is that marginal effects in the Tobit model **vary across the
covariate space**. Two individuals with different characteristics experience
different marginal effects of insurance, even though the underlying latent
coefficient $\beta_{\text{insurance}}$ is constant.

- **Profile A** (young healthy woman): Lower baseline probability of positive
  expenditure ($\Phi(z)$ is smaller because fewer chronic conditions and younger
  age yield a lower latent prediction). The unconditional ME is therefore scaled
  down more, but the probability ME may be relatively larger because $\phi(z)$
  peaks at $z = 0$, i.e. at the censoring threshold.

- **Profile B** (older man, 3 chronic conditions): Higher baseline probability
  of positive expenditure ($\Phi(z)$ is closer to 1). The unconditional ME is
  closer to the raw $\beta$, but the probability ME is smaller because this
  person is already very likely to have positive expenditure.

This illustrates the nonlinearity of the Tobit model: insurance has a larger
effect on the **probability** of spending for people near the censoring
threshold, and a larger effect on the **level** for people who are likely
to spend already.

---

## Exercise 3: Heckman MLE Estimation (20 min)

**Tasks:**

a) Re-estimate the Heckman model on the Mroz data using **MLE** instead of two-step.

b) Compare the two-step and MLE estimates. How different are the outcome
   coefficients? How different are $\rho$ and $\sigma$?

c) Use the `compare_heckman_methods` utility to generate a formatted comparison.

### Solution 3a: Heckman MLE

In [None]:
# ============================================================
# Exercise 3a: Estimate Heckman with MLE
# ============================================================

heckman_mle_model = PanelHeckman(
    endog=wage,
    exog=X_outcome,
    selection=selection,
    exog_selection=Z_selection,
    method='mle',
)

heckman_mle_result = heckman_mle_model.fit()

print('Heckman MLE Results')
print('=' * 60)
print(heckman_mle_result.summary())

### Solution 3b: Compare Two-Step vs MLE

In [None]:
# ============================================================
# Exercise 3b: Side-by-side comparison
# ============================================================

print('Outcome Equation Coefficients')
print('=' * 65)
print(f'{"Variable":20s} {"Two-Step":>12s} {"MLE":>12s} {"Difference":>12s} {"% Diff":>10s}')
print('-' * 65)

for i, name in enumerate(outcome_names):
    ts_coef = heckman_result.outcome_params[i]
    ml_coef = heckman_mle_result.outcome_params[i]
    diff = ts_coef - ml_coef
    pct = 100 * diff / (abs(ml_coef) + 1e-10)
    print(f'{name:20s} {ts_coef:>12.4f} {ml_coef:>12.4f} {diff:>12.4f} {pct:>9.2f}%')

print('-' * 65)

# Selection parameters
print(f'\nSelection Parameters:')
print(f'{"rho":20s} {heckman_result.rho:>12.4f} {heckman_mle_result.rho:>12.4f} '
      f'{heckman_result.rho - heckman_mle_result.rho:>12.4f}')
print(f'{"sigma":20s} {heckman_result.sigma:>12.4f} {heckman_mle_result.sigma:>12.4f} '
      f'{heckman_result.sigma - heckman_mle_result.sigma:>12.4f}')

lambda_ts = heckman_result.rho * heckman_result.sigma
lambda_ml = heckman_mle_result.rho * heckman_mle_result.sigma
print(f'{"lambda (rho*sigma)":20s} {lambda_ts:>12.4f} {lambda_ml:>12.4f} '
      f'{lambda_ts - lambda_ml:>12.4f}')

### Solution 3c: Formatted Comparison Using Utility

In [None]:
# ============================================================
# Exercise 3c: Use compare_heckman_methods utility
# ============================================================

comparison_table = compare_heckman_methods(
    heckman_result,
    heckman_mle_result,
    variable_names=outcome_names,
)

print('Heckman Two-Step vs MLE Comparison (via compare_heckman_methods)')
print('=' * 65)
display(comparison_table.round(4))

In [None]:
# ============================================================
# Visualization: coefficient comparison
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel A: Outcome coefficients
idx = np.arange(len(outcome_names))
width = 0.35

ts_coefs = heckman_result.outcome_params
ml_coefs = heckman_mle_result.outcome_params

axes[0].barh(idx + width/2, ts_coefs, width, label='Two-Step', color='steelblue', alpha=0.8)
axes[0].barh(idx - width/2, ml_coefs, width, label='MLE', color='darkorange', alpha=0.8)
axes[0].set_yticks(idx)
axes[0].set_yticklabels(outcome_names, fontsize=11)
axes[0].axvline(x=0, color='black', linewidth=0.8)
axes[0].set_xlabel('Coefficient')
axes[0].set_title('A. Outcome Equation Coefficients')
axes[0].legend(fontsize=10)

# Panel B: Selection parameters (rho, sigma, lambda)
param_names_sel = ['rho', 'sigma', 'lambda']
ts_sel = [heckman_result.rho, heckman_result.sigma, lambda_ts]
ml_sel = [heckman_mle_result.rho, heckman_mle_result.sigma, lambda_ml]

idx2 = np.arange(len(param_names_sel))
axes[1].bar(idx2 - width/2, ts_sel, width, label='Two-Step', color='steelblue', alpha=0.8)
axes[1].bar(idx2 + width/2, ml_sel, width, label='MLE', color='darkorange', alpha=0.8)
axes[1].set_xticks(idx2)
axes[1].set_xticklabels(param_names_sel, fontsize=11)
axes[1].set_ylabel('Parameter Value')
axes[1].set_title('B. Selection Parameters')
axes[1].legend(fontsize=10)
axes[1].axhline(y=0, color='black', linewidth=0.8)

plt.suptitle('Exercise 3: Heckman Two-Step vs MLE Comparison', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex3_heckman_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

**Discussion (Exercise 3):**

Comparing two-step and MLE estimation for the Heckman model:

1. **Efficiency**: MLE is asymptotically more efficient than two-step because it
   uses all the information in the joint distribution of the selection and outcome
   errors. However, two-step is more robust to misspecification of the joint
   distribution.

2. **Outcome coefficients**: The differences are typically small when the model
   is well-specified. Large discrepancies would suggest misspecification or
   identification problems.

3. **Selection parameters ($\rho$, $\sigma$)**: MLE estimates both jointly, while
   two-step derives them sequentially from the inverse Mills ratio coefficient
   and the second-step residuals. The MLE estimate of $\rho$ is generally more
   reliable because the likelihood constrains it to $[-1, 1]$, whereas the
   two-step estimate is not constrained and can fall outside that interval.

4. **Practical guidance**: Use two-step for initial exploration and robustness
   checking. Use MLE for final estimates when the model is well-specified and
   the normality assumption is reasonable. If the two methods give very different
   results, investigate model specification carefully.
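To see why the two-step selection parameters can behave differently from MLE, here is a sketch of the textbook two-step recovery of $\sigma$ and $\rho$ from the inverse Mills ratio coefficient. It is not PanelBox's implementation: the function names are hypothetical and the inputs are made-up numbers.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()


def inverse_mills(z):
    """Inverse Mills ratio: lambda(z) = phi(z) / Phi(z)."""
    return _std_normal.pdf(z) / _std_normal.cdf(z)


def recover_sigma_rho(beta_lambda, residuals, selection_index):
    """Textbook two-step recovery of (sigma, rho).

    beta_lambda     : second-step coefficient on the inverse Mills ratio
    residuals       : second-step OLS residuals (selected sample)
    selection_index : fitted probit index z_i'gamma (selected sample)
    """
    n_sel = len(residuals)
    lam = [inverse_mills(z) for z in selection_index]
    delta = [l * (l + z) for l, z in zip(lam, selection_index)]
    sigma_sq = (sum(e * e for e in residuals) / n_sel
                + beta_lambda ** 2 * sum(delta) / n_sel)
    sigma = math.sqrt(sigma_sq)
    return sigma, beta_lambda / sigma


# Made-up inputs for illustration only
sigma_ok, rho_ok = recover_sigma_rho(0.5, [1.0], [0.0])
sigma_bad, rho_bad = recover_sigma_rho(3.0, [0.1, -0.1], [2.0, 2.0])
print(f'Case 1: sigma={sigma_ok:.4f}, rho={rho_ok:.4f}')
print(f'Case 2: sigma={sigma_bad:.4f}, rho={rho_bad:.4f}  (outside [-1, 1])')
```

The second case shows that nothing in this recovery keeps $\hat{\rho}$ inside $[-1, 1]$; many implementations truncate it after the fact.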

---

## Exercise 4: Prediction and Model Validation (15 min)

**Tasks:**

a) Generate **in-sample predictions** from the Pooled Tobit (censored predictions)
   and compare them to OLS fitted values. Plot predicted vs. actual for both.

b) Compute the **RMSE** for both models (on the observed scale).

c) Which model produces predictions more consistent with the observed distribution?

### Solution 4a: In-Sample Predictions
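For reference, the three standard Tobit prediction types for left-censoring at zero can be written in a few lines. The sketch assumes that a "censored" prediction means $E[y|X] = \Phi(z)\,(X'\beta + \sigma\lambda(z))$; whether `predict(pred_type='censored')` computes exactly this is an assumption. The helper name and the index values are hypothetical.

```python
from statistics import NormalDist

_std_normal = NormalDist()


def tobit_predictions(xb, sigma):
    """Return latent, truncated, and censored predictions (censoring at 0)."""
    z = xb / sigma
    prob_positive = _std_normal.cdf(z)               # P(y > 0 | X)
    inv_mills = _std_normal.pdf(z) / prob_positive   # lambda(z)
    latent = xb                                      # E[y* | X]
    truncated = xb + sigma * inv_mills               # E[y | y > 0, X]
    censored = prob_positive * truncated             # E[y | X]
    return latent, truncated, censored


# Hypothetical index values for illustration only
for xb_demo in (-1.0, 0.0, 2.0):
    latent, truncated, censored = tobit_predictions(xb_demo, sigma=1.0)
    print(f'xb={xb_demo:5.1f}  latent={latent:7.4f}  '
          f'truncated={truncated:7.4f}  censored={censored:7.4f}')
```

Even with a negative latent index the censored prediction stays strictly positive, which is why Tobit conditional means never fall below zero.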

In [None]:
# ============================================================
# Exercise 4a: Generate predictions
# ============================================================

# Tobit censored predictions: E[y | X] accounting for censoring
y_pred_tobit = tobit_pooled.predict(pred_type='censored')

# OLS fitted values (can be negative, which is unrealistic)
y_pred_ols = ols_result.fittedvalues

print('Prediction Summary Statistics')
print('=' * 60)
print(f'{"Statistic":20s} {"Actual":>12s} {"OLS":>12s} {"Tobit":>12s}')
print('-' * 60)
print(f'{"Mean":20s} {y.mean():>12.4f} {y_pred_ols.mean():>12.4f} {y_pred_tobit.mean():>12.4f}')
print(f'{"Std Dev":20s} {y.std():>12.4f} {y_pred_ols.std():>12.4f} {y_pred_tobit.std():>12.4f}')
print(f'{"Min":20s} {y.min():>12.4f} {y_pred_ols.min():>12.4f} {y_pred_tobit.min():>12.4f}')
print(f'{"Max":20s} {y.max():>12.4f} {y_pred_ols.max():>12.4f} {y_pred_tobit.max():>12.4f}')
print(f'{"% Negative":20s} {(y < 0).mean()*100:>11.1f}% {(y_pred_ols < 0).mean()*100:>11.1f}% {(y_pred_tobit < 0).mean()*100:>11.1f}%')
print(f'{"% Zero":20s} {(y == 0).mean()*100:>11.1f}% {(np.abs(y_pred_ols) < 1e-10).mean()*100:>11.1f}% {(np.abs(y_pred_tobit) < 1e-10).mean()*100:>11.1f}%')

In [None]:
# ============================================================
# Exercise 4a: Plot predicted vs actual
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel A: OLS predicted vs actual
axes[0].scatter(y, y_pred_ols, alpha=0.3, s=8, color='steelblue', label='Predictions')
max_val = max(y.max(), y_pred_ols.max()) * 1.05
min_val = min(y.min(), y_pred_ols.min()) * 1.05
axes[0].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=1.5, label='45-degree line')
axes[0].axhline(y=0, color='gray', linestyle=':', linewidth=0.8)
axes[0].axvline(x=0, color='gray', linestyle=':', linewidth=0.8)
axes[0].set_xlabel('Actual Expenditure', fontsize=12)
axes[0].set_ylabel('Predicted Expenditure', fontsize=12)
axes[0].set_title('A. OLS Predictions', fontsize=13)
axes[0].legend(fontsize=10)

# Highlight negative predictions
neg_mask = y_pred_ols < 0
if neg_mask.sum() > 0:
    axes[0].scatter(y[neg_mask], y_pred_ols[neg_mask], alpha=0.6, s=15,
                    color='red', label=f'Negative pred. (n={neg_mask.sum()})', zorder=5)
    axes[0].legend(fontsize=10)

# Panel B: Tobit predicted vs actual
axes[1].scatter(y, y_pred_tobit, alpha=0.3, s=8, color='seagreen', label='Predictions')
max_val_t = max(y.max(), y_pred_tobit.max()) * 1.05
axes[1].plot([0, max_val_t], [0, max_val_t], 'r--', linewidth=1.5, label='45-degree line')
axes[1].axhline(y=0, color='gray', linestyle=':', linewidth=0.8)
axes[1].axvline(x=0, color='gray', linestyle=':', linewidth=0.8)
axes[1].set_xlabel('Actual Expenditure', fontsize=12)
axes[1].set_ylabel('Predicted Expenditure', fontsize=12)
axes[1].set_title('B. Tobit Predictions (censored)', fontsize=13)
axes[1].legend(fontsize=10)

plt.suptitle('Exercise 4a: Predicted vs Actual Expenditure', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex4a_pred_vs_actual.png', dpi=150, bbox_inches='tight')
plt.show()

### Solution 4b: RMSE Comparison

In [None]:
# ============================================================
# Exercise 4b: Compute RMSE for both models
# ============================================================

rmse_ols = np.sqrt(np.mean((y - y_pred_ols)**2))
rmse_tobit = np.sqrt(np.mean((y - y_pred_tobit)**2))

# Also compute MAE for robustness
mae_ols = np.mean(np.abs(y - y_pred_ols))
mae_tobit = np.mean(np.abs(y - y_pred_tobit))

# Correlation between predicted and actual
corr_ols = np.corrcoef(y, y_pred_ols)[0, 1]
corr_tobit = np.corrcoef(y, y_pred_tobit)[0, 1]

# RMSE conditional on y > 0 (positive expenditure only)
pos_mask = y > 0
rmse_ols_pos = np.sqrt(np.mean((y[pos_mask] - y_pred_ols[pos_mask])**2))
rmse_tobit_pos = np.sqrt(np.mean((y[pos_mask] - y_pred_tobit[pos_mask])**2))

# RMSE for censored observations (y = 0)
cens_mask = y == 0
rmse_ols_cens = np.sqrt(np.mean((y[cens_mask] - y_pred_ols[cens_mask])**2))
rmse_tobit_cens = np.sqrt(np.mean((y[cens_mask] - y_pred_tobit[cens_mask])**2))

print('Model Fit Comparison')
print('=' * 55)
print(f'{"Metric":25s} {"OLS":>12s} {"Tobit":>12s}')
print('-' * 55)
print(f'{"RMSE (all obs)":25s} {rmse_ols:>12.4f} {rmse_tobit:>12.4f}')
print(f'{"MAE (all obs)":25s} {mae_ols:>12.4f} {mae_tobit:>12.4f}')
print(f'{"Correlation (pred, y)":25s} {corr_ols:>12.4f} {corr_tobit:>12.4f}')
print(f'{"RMSE (y > 0 only)":25s} {rmse_ols_pos:>12.4f} {rmse_tobit_pos:>12.4f}')
print(f'{"RMSE (y = 0 only)":25s} {rmse_ols_cens:>12.4f} {rmse_tobit_cens:>12.4f}')
print('-' * 55)

# Which is better?
if rmse_tobit < rmse_ols:
    pct_improvement = 100 * (rmse_ols - rmse_tobit) / rmse_ols
    print(f'\nTobit RMSE is {pct_improvement:.1f}% lower than OLS.')
else:
    pct_improvement = 100 * (rmse_tobit - rmse_ols) / rmse_tobit
    print(f'\nOLS RMSE is {pct_improvement:.1f}% lower than Tobit.')
    print('Note: This can happen because Tobit optimizes the censored likelihood,')
    print('not the mean squared error. The Tobit model is still preferred for')
    print('consistent estimation of the structural parameters.')

### Solution 4c: Distributional Consistency

In [None]:
# ============================================================
# Exercise 4c: Compare predicted distributions
# ============================================================

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Panel A: Distribution of actual values
axes[0, 0].hist(y, bins=50, edgecolor='black', alpha=0.7, color='gray', density=True)
axes[0, 0].axvline(x=0, color='red', linestyle='--', linewidth=1.5)
axes[0, 0].set_xlabel('Expenditure')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('A. Observed Distribution')

# Panel B: Distribution of OLS predictions
axes[0, 1].hist(y_pred_ols, bins=50, edgecolor='black', alpha=0.7, color='steelblue', density=True)
axes[0, 1].axvline(x=0, color='red', linestyle='--', linewidth=1.5)
axes[0, 1].set_xlabel('Predicted Expenditure')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title('B. OLS Predictions')

# Panel C: Distribution of Tobit predictions
axes[1, 0].hist(y_pred_tobit, bins=50, edgecolor='black', alpha=0.7, color='seagreen', density=True)
axes[1, 0].axvline(x=0, color='red', linestyle='--', linewidth=1.5)
axes[1, 0].set_xlabel('Predicted Expenditure')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title('C. Tobit Predictions (censored)')

# Panel D: Residual distributions
resid_ols = y - y_pred_ols
resid_tobit = y - y_pred_tobit

axes[1, 1].hist(resid_ols, bins=50, alpha=0.5, color='steelblue', density=True, label='OLS residuals')
axes[1, 1].hist(resid_tobit, bins=50, alpha=0.5, color='seagreen', density=True, label='Tobit residuals')
axes[1, 1].axvline(x=0, color='black', linestyle='-', linewidth=0.8)
axes[1, 1].set_xlabel('Residual (Actual - Predicted)')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('D. Residual Distributions')
axes[1, 1].legend(fontsize=10)

plt.suptitle('Exercise 4c: Distributional Comparison', fontsize=14, y=1.01)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'ex4c_distributional_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# ============================================================
# Quantitative assessment of distributional match
# ============================================================

# Key distributional features to compare
print('Distributional Consistency Assessment')
print('=' * 65)
print(f'{"Feature":30s} {"Actual":>10s} {"OLS":>10s} {"Tobit":>10s}')
print('-' * 65)

# Fraction of predictions at or below zero
print(f'{"% predictions <= 0":30s} '
      f'{(y <= 0).mean()*100:>9.1f}% '
      f'{(y_pred_ols <= 0).mean()*100:>9.1f}% '
      f'{(y_pred_tobit <= 0).mean()*100:>9.1f}%')

# Quartiles
for q_val in [25, 50, 75]:
    q_actual = np.percentile(y, q_val)
    q_ols = np.percentile(y_pred_ols, q_val)
    q_tobit = np.percentile(y_pred_tobit, q_val)
    print(f'{f"Q{q_val}":30s} {q_actual:>10.2f} {q_ols:>10.2f} {q_tobit:>10.2f}')

# Skewness
from scipy.stats import skew, kurtosis
print(f'{"Skewness":30s} {skew(y):>10.3f} {skew(y_pred_ols):>10.3f} {skew(y_pred_tobit):>10.3f}')
print(f'{"Kurtosis":30s} {kurtosis(y):>10.3f} {kurtosis(y_pred_ols):>10.3f} {kurtosis(y_pred_tobit):>10.3f}')

print('-' * 65)

print('\nInterpretation:')
print('-' * 65)
print('The Tobit model produces predictions that are more consistent with')
print('the observed distribution because:')
print('  1. All Tobit predictions are non-negative (respects the censoring')
print('     boundary), whereas OLS can produce negative predictions.')
print('  2. For low-index individuals the Tobit predictions approach zero')
print('     from above, in line with the pile-up at zero. Neither set of')
print('     conditional means reproduces the point mass at zero exactly.')
print('  3. The Tobit predictions are typically more right-skewed than the')
print('     OLS ones; check the skewness row above for these data.')
print('\nEven if OLS has a slightly lower RMSE in some cases, the Tobit model')
print('provides structurally consistent predictions that respect the data')
print('generating process. This is critical for policy simulations and')
print('counterfactual analysis, where out-of-sample prediction quality matters.')

---

## Summary of Exercise Solutions

| Exercise | Key Finding |
|----------|-------------|
| 1a | Log-OLS is an ad-hoc approach that does not model censoring. Retransformation introduces Jensen's inequality bias. |
| 1b | AIC and BIC can be computed from the log-likelihood and parameter counts. Both criteria compare fit-complexity trade-offs. |
| 1c | The RE Tobit is generally preferred when the intraclass correlation (ICC) is non-trivial; the boundary LR test of $\sigma_\alpha = 0$ checks this formally. |
| 2 | Marginal effects vary across the covariate space. Insurance has a larger probability effect for individuals near the censoring threshold, and a larger unconditional effect for those with higher baseline expenditure. |
| 3 | Heckman MLE and two-step generally agree when the model is well-specified. MLE is more efficient; two-step is more robust. Large discrepancies signal misspecification. |
| 4 | Tobit predictions respect the censoring boundary (non-negative) and better match the observed distribution. OLS can produce negative predictions, which are economically meaningless. |

---

*This solution notebook is part of the PanelBox Censored Models Tutorial Series.*