# Binary Choice Models: Introduction

**Tutorial Series**: Discrete Choice Econometrics with PanelBox

**Notebook**: 01 - Binary Choice Introduction

**Author**: PanelBox Contributors

**Date**: 2026-02-16

**Estimated Duration**: 60-90 minutes

**Difficulty Level**: Beginner

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand why specialized models are needed for binary outcomes
2. Recognize the limitations of Linear Probability Models (LPM)
3. Estimate and interpret Logit and Probit models using PanelBox
4. Compare model performance using pseudo-R², AIC, and BIC
5. Generate predictions and evaluate classification metrics
6. Interpret coefficients as odds ratios (Logit) and latent variable effects (Probit)
7. Apply binary choice models to labor force participation decisions

---

## Table of Contents

1. [Economic Motivation](#section1)
2. [Linear Probability Model (LPM)](#section2)
3. [Link Functions](#section3)
4. [Logit Model](#section4)
5. [Probit Model](#section5)
6. [Model Comparison](#section6)
7. [Predictions and Classification](#section7)
8. [Goodness-of-Fit Tests](#section8)
9. [Application: Labor Force Participation](#section9)
10. [Exercises](#exercises)

## Setup

Import all required libraries and configure the environment.

In [None]:
# Standard library imports
import warnings
from pathlib import Path

# Data manipulation and numerical computing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical functions
from scipy.stats import norm, logistic
from sklearn.metrics import roc_curve, auc

# PanelBox models
from panelbox import PooledOLS
from panelbox.models.discrete.binary import PooledLogit, PooledProbit

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Matplotlib configuration
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 10

print("✓ All libraries imported successfully")
print(f"✓ Random seed set to: 42")
print(f"✓ Working directory: {Path.cwd()}")

<a id='section1'></a>
## 1. Economic Motivation

### 1.1 What are Binary Outcomes?

Binary dependent variables take only two values: 0 or 1. They appear frequently in economics:

- **Labor Economics**: Employed vs Unemployed, Union membership, Job training participation
- **Consumer Behavior**: Purchase vs No purchase, Brand choice (A vs B)
- **Finance**: Default vs No default, Dividend payment (yes/no)
- **Development**: Technology adoption, Program participation
- **Health**: Insurance coverage, Treatment uptake

### 1.2 Why Not Use OLS?

When the dependent variable is binary, traditional OLS regression faces fundamental problems:

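1. **Predictions outside [0,1]**: OLS fitted values can fall below 0 or above 1, so they are not valid probabilities
2. **Constant marginal effects**: A one-unit change in $X_k$ shifts the probability by the same amount at every probability level, which is unrealistic near 0 and 1
3. **Heteroskedasticity**: The error variance equals $P(y=1|\mathbf{X})\,[1 - P(y=1|\mathbf{X})]$, so it depends on $\mathbf{X}$ by construction
4. **No microfoundations**: The linear form is not derived from a utility-maximization (latent variable) model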
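
The first and third problems are easy to reproduce in a small simulation. The sketch below is illustrative only: it uses synthetic data rather than the tutorial dataset, fits OLS with plain NumPy rather than PanelBox, and the parameter values (0.5 and 1.5) are assumptions chosen for the demonstration.

```python
import numpy as np

# Synthetic data: the true model is a logit with made-up parameters
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 2.0, size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p_true)

# Fit a linear probability model by ordinary least squares
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# Problem 1: some fitted "probabilities" fall outside [0, 1]
share_outside = np.mean((fitted < 0) | (fitted > 1))

# Problem 3: the variance implied by the LPM, p(1 - p), changes with x
# (and even turns negative where the fitted value leaves [0, 1])
implied_var = fitted * (1.0 - fitted)

print(f"Share of fitted values outside [0, 1]: {share_outside:.3f}")
print(f"Implied variance range: [{implied_var.min():.3f}, {implied_var.max():.3f}]")
```

Because the implied variance is not constant, robust standard errors are the usual minimum requirement when an LPM is reported.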

### 1.3 Load the Data

We'll use a dataset on **labor force participation** of married women.

In [None]:
# Load labor participation dataset
DATA_DIR = Path("..") / "data"
data = pd.read_csv(DATA_DIR / "labor_participation.csv")

print("Dataset loaded successfully!")
print(f"\nShape: {data.shape}")
print(f"Number of individuals: {data['id'].nunique()}")
print(f"Number of periods: {data['year'].nunique()}")
print(f"\nFirst 10 rows:")
data.head(10)

In [None]:
# Summary statistics
print("=== Summary Statistics ===")
data.describe()

In [None]:
# Visualize outcome variable distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
lfp_counts = data['lfp'].value_counts().sort_index()
axes[0].bar(['Not Working (0)', 'Working (1)'], lfp_counts.values, 
            color=['#e74c3c', '#27ae60'], alpha=0.7, edgecolor='black')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Labor Force Participation Distribution')
axes[0].grid(True, alpha=0.3)

# Add value labels on bars
for i, (label, value) in enumerate(zip(['Not Working', 'Working'], lfp_counts.values)):
    axes[0].text(i, value + 50, f'{value}\n({100*value/len(data):.1f}%)', 
                ha='center', va='bottom', fontweight='bold')

# Proportion over time
lfp_by_year = data.groupby('year')['lfp'].agg(['mean', 'count'])
axes[1].plot(lfp_by_year.index, lfp_by_year['mean'], marker='o', linewidth=2, markersize=8)
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Proportion Working')
axes[1].set_title('Labor Force Participation Over Time')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.show()

print(f"\nOverall participation rate: {100*data['lfp'].mean():.2f}%")

### 1.4 Discussion: Economic Questions

Binary choice models help us answer questions like:

- How does education affect the probability of working?
- What is the effect of having children on labor force participation?
- How do these effects vary across different age groups?
- Can we predict who will participate in the labor force?

**Key insight**: We're modeling **probabilities** (bounded between 0 and 1), not the binary outcome directly.

<a id='section2'></a>
## 2. Linear Probability Model (LPM)

### 2.1 The LPM Specification

The simplest approach: estimate using OLS

$$y_{it} = \mathbf{X}_{it}' \boldsymbol{\beta} + \varepsilon_{it}$$

where $y_{it} \in \{0, 1\}$.

**Interpretation**: $E[y|\mathbf{X}] = P(y=1|\mathbf{X}) = \mathbf{X}'\boldsymbol{\beta}$

**Marginal effect**: $\frac{\partial P(y=1)}{\partial X_k} = \beta_k$ (constant!)

### 2.2 Estimate LPM using PooledOLS

In [None]:
# Estimate Linear Probability Model
lpm = PooledOLS("lfp ~ age + educ + kids + married", data, "id", "year")
lpm_results = lpm.fit(cov_type='robust')

print("=== Linear Probability Model (LPM) ===")
print(lpm_results.summary())

### 2.3 The Problems with LPM

Let's examine predictions from the LPM:

In [None]:
# Generate predictions
data['lpm_pred'] = lpm_results.predict()

# Check for predictions outside [0,1]
outside_bounds = (data['lpm_pred'] < 0) | (data['lpm_pred'] > 1)
print("=== LPM Prediction Issues ===")
print(f"Predictions outside [0,1]: {outside_bounds.sum()} ({100*outside_bounds.mean():.2f}%)")
print(f"Min prediction: {data['lpm_pred'].min():.4f}")
print(f"Max prediction: {data['lpm_pred'].max():.4f}")
print(f"\nPredictions < 0: {(data['lpm_pred'] < 0).sum()}")
print(f"Predictions > 1: {(data['lpm_pred'] > 1).sum()}")

if outside_bounds.sum() > 0:
    print("\n⚠️  WARNING: LPM produces impossible probabilities!")
    print("This violates the fundamental requirement that probabilities ∈ [0,1]")

In [None]:
# Visualize LPM problems
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Predicted vs Actual by Age
axes[0].scatter(data['age'], data['lpm_pred'], alpha=0.3, label='LPM Predictions', s=20)
axes[0].scatter(data['age'], data['lfp'], alpha=0.05, label='Actual (0/1)', color='red', s=10)
axes[0].axhline(y=0, color='black', linestyle='--', linewidth=1.5, label='Bounds')
axes[0].axhline(y=1, color='black', linestyle='--', linewidth=1.5)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Probability / Outcome')
axes[0].set_title('LPM: Predictions Can Exceed [0,1]')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Highlight predictions outside bounds
if outside_bounds.sum() > 0:
    outside_data = data[outside_bounds]
    axes[0].scatter(outside_data['age'], outside_data['lpm_pred'], 
                   color='orange', s=50, marker='x', linewidths=2,
                   label=f'Outside [0,1] (n={len(outside_data)})', zorder=5)
    axes[0].legend()

# Plot 2: Distribution of predictions
axes[1].hist(data['lpm_pred'], bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2, label='Lower Bound (0)')
axes[1].axvline(x=1, color='red', linestyle='--', linewidth=2, label='Upper Bound (1)')
axes[1].set_xlabel('Predicted Probability')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of LPM Predictions')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 2.4 When is LPM Acceptable?

Despite its problems, LPM can be useful:

✓ **Quick approximation** when predicted probabilities are mostly between 0.3 and 0.7

✓ **Robustness check** alongside Logit/Probit (should have same signs)

✓ **Simple interpretation** when constant marginal effects are acceptable

✓ **Computational simplicity** for very large datasets

However, for serious analysis, we need models that:
- Guarantee $P \in [0,1]$
- Allow non-constant marginal effects
- Have proper statistical foundations

<a id='section3'></a>
## 3. Link Functions

### 3.1 The Solution: Transformation Functions

**Idea**: Transform the linear predictor $\mathbf{X}'\boldsymbol{\beta}$ to ensure $P \in [0,1]$

$$P(y_{it}=1|\mathbf{X}_{it}) = G(\mathbf{X}_{it}'\boldsymbol{\beta})$$

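where $G(\cdot)$ is a cumulative distribution function (CDF), called the **link function** in this tutorial (in GLM terminology, $G$ is strictly the inverse link), with properties: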
- $G: \mathbb{R} \rightarrow [0,1]$
- $G(z)$ is monotonically increasing
- $\lim_{z \to -\infty} G(z) = 0$ and $\lim_{z \to \infty} G(z) = 1$

### 3.2 Common Link Functions

**Logistic (Logit)**:
$$\Lambda(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)}$$

**Normal (Probit)**:
$$\Phi(z) = \int_{-\infty}^{z} \phi(t) dt = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt$$

**Linear (LPM)**:
$$L(z) = z \quad \text{(NOT bounded to [0,1]!)}$$

### 3.3 Visualize Link Functions

In [None]:
# Create range of values for linear predictor
z = np.linspace(-6, 6, 1000)

# Calculate link functions
logistic_cdf = 1 / (1 + np.exp(-z))
normal_cdf = norm.cdf(z)
lpm_approx = 0.5 + 0.15 * z  # Illustrative straight line through (0, 0.5); slope chosen for display only

# Plot comparison
plt.figure(figsize=(12, 7))
plt.plot(z, logistic_cdf, label='Logistic: Λ(z) [Logit]', linewidth=2.5, color='#3498db')
plt.plot(z, normal_cdf, label='Normal: Φ(z) [Probit]', linewidth=2.5, 
         linestyle='--', color='#e74c3c')
plt.plot(z, lpm_approx, label='Linear [LPM]', linewidth=2.5, 
         linestyle=':', color='#f39c12')

# Reference lines
plt.axhline(y=0, color='gray', linestyle='-', linewidth=0.8, alpha=0.5)
plt.axhline(y=1, color='gray', linestyle='-', linewidth=0.8, alpha=0.5)
plt.axhline(y=0.5, color='gray', linestyle='--', linewidth=0.8, alpha=0.3)
plt.axvline(x=0, color='gray', linestyle='-', linewidth=0.8, alpha=0.3)

# Formatting
plt.xlabel('Linear Predictor (Xβ)', fontsize=13)
plt.ylabel('Probability P(y=1|X)', fontsize=13)
plt.title('Comparison of Link Functions', fontsize=15, fontweight='bold')
plt.legend(fontsize=12, loc='upper left')
plt.grid(True, alpha=0.3)
plt.ylim([-0.1, 1.1])
plt.xlim([-6, 6])

# Add annotations
plt.text(3, 0.95, 'Both bounded to [0,1]', fontsize=10, 
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.text(-4, 0.3, 'LPM can exceed bounds!', fontsize=10, color='red',
         bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))

plt.tight_layout()
plt.show()

### 3.4 Key Differences

**Logistic vs Normal CDF**:
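- Very similar in shape once the difference in scale is accounted for: $\Lambda(1.6z) \approx \Phi(z)$, whereas at the same $z$ the two CDFs can differ by more than 0.1
- Logistic has slightly **heavier tails**
- In practice, the choice between Logit and Probit **rarely matters**, because the estimated coefficients absorb the scale difference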

**Why use Logit?**
- Closed-form for probabilities and odds ratios
- Easier coefficient interpretation (odds ratios)
- Slightly simpler computation

**Why use Probit?**
- Natural if assuming normal errors
- Connection to latent variable models
- Rule of thumb: marginal effect ≈ 0.4β when the predicted probability is near 0.5 (since $\phi(0) \approx 0.4$); it is smaller elsewhere
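
The 0.4β rule of thumb comes from the density evaluated at $z = 0$: the marginal effect is $\beta_k \, g(\mathbf{X}'\boldsymbol{\beta})$ with $g = G'$. The short check below uses only SciPy's standard distributions and prints the corresponding factors for both models; the grid of $z$ values is an arbitrary choice for illustration.

```python
from scipy.stats import norm, logistic

# Marginal effect = beta_k * g(z), where g is the density of the chosen CDF.
# At z = 0 (predicted probability 0.5) the density is at its maximum.
probit_scale = norm.pdf(0.0)       # 1 / sqrt(2 * pi)
logit_scale = logistic.pdf(0.0)    # Lambda(0) * (1 - Lambda(0))

print(f"Probit: marginal effect at z=0 is about {probit_scale:.4f} * beta")
print(f"Logit:  marginal effect at z=0 is about {logit_scale:.4f} * beta")

# Away from z = 0 the factor shrinks, so the values above are upper bounds
for z_val in (0.0, 1.0, 2.0):
    print(f"z={z_val:.0f}: probit factor {norm.pdf(z_val):.4f}, "
          f"logit factor {logistic.pdf(z_val):.4f}")
```

The analogous quick approximation for Logit is therefore 0.25β, again only near a predicted probability of 0.5.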

In [None]:
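# Quantify similarity between the logistic and normal CDFs
raw_diff = np.abs(logistic_cdf - normal_cdf)
# The two distributions have different variances, so also compare after
# rescaling the logistic argument by 1.6 (rule of thumb: β_logit ≈ 1.6 × β_probit)
scaled_diff = np.abs(1 / (1 + np.exp(-1.6 * z)) - normal_cdf)
mask = np.abs(z) < 2

print("=== Logistic vs Normal CDF ===")
print(f"Same z:   max |Λ(z) - Φ(z)|    = {raw_diff.max():.6f}")
print(f"Rescaled: max |Λ(1.6z) - Φ(z)| = {scaled_diff.max():.6f}")
print(f"\nFor |z| < 2 (where most predictions lie):")
print(f"Maximum rescaled difference: {scaled_diff[mask].max():.6f}")
print(f"Mean rescaled difference: {scaled_diff[mask].mean():.6f}")
print("\n→ Conclusion: once the scale difference is absorbed by the coefficients,")
print("  Logit and Probit imply very similar probabilities.")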

<a id='section4'></a>
## 4. Logit Model

### 4.1 Model Specification

**Probability**:
$$P(y_{it}=1|\mathbf{X}_{it}) = \Lambda(\mathbf{X}_{it}'\boldsymbol{\beta}) = \frac{\exp(\mathbf{X}_{it}'\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_{it}'\boldsymbol{\beta})}$$

**Log-likelihood**:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{N} \sum_{t=1}^{T} \left[ y_{it} \log(\Lambda(\mathbf{X}_{it}'\boldsymbol{\beta})) + (1-y_{it}) \log(1-\Lambda(\mathbf{X}_{it}'\boldsymbol{\beta})) \right]$$

**Estimation**: Maximum Likelihood Estimation (MLE)
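
To make the estimation step concrete, here is a minimal sketch of maximum likelihood for the logit model using Newton-Raphson on synthetic data. It is not PanelBox's implementation (this notebook does not document the optimizer PanelBox uses); the function names `logit_loglik` and `fit_logit_newton` and the true coefficients are hypothetical.

```python
import numpy as np

def logit_loglik(beta, X, y):
    """Logit log-likelihood, summed over all observations."""
    xb = X @ beta
    # log(Lambda(xb)) = -log(1 + exp(-xb)); logaddexp keeps this numerically stable
    return np.sum(-y * np.logaddexp(0.0, -xb) - (1 - y) * np.logaddexp(0.0, xb))

def fit_logit_newton(X, y, max_iter=25, tol=1e-10):
    """Maximize the logit log-likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (y - p)                        # gradient of the log-likelihood
        info = X.T @ (X * (p * (1 - p))[:, None])    # negative Hessian
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data with made-up true coefficients (intercept, slope)
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

beta_hat = fit_logit_newton(X, y)
print("Estimated beta:", np.round(beta_hat, 3))
print("Log-likelihood at estimate:", round(logit_loglik(beta_hat, X, y), 2))
```

The logit log-likelihood is globally concave, which is why a plain Newton iteration started from zero is usually sufficient here.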

### 4.2 Estimate Pooled Logit

In [None]:
# Estimate Pooled Logit model
logit = PooledLogit("lfp ~ age + educ + kids + married", data, "id", "year")
logit_results = logit.fit(cov_type='cluster')

print("=== Pooled Logit Model ===")
print(logit_results.summary())

### 4.3 Coefficient Interpretation

**IMPORTANT**: Logit coefficients are **log-odds ratios**, NOT marginal effects!

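**Log-odds interpretation**:
- $\beta_k$ is the change in the log-odds of $y=1$ for a one-unit increase in $X_k$, holding the other covariates fixed
- $\beta_k > 0$: Increasing $X_k$ increases probability of $y=1$
- $\beta_k < 0$: Increasing $X_k$ decreases probability of $y=1$

**Odds ratio interpretation**:
- $\text{OR} = \exp(\beta_k)$
- A one-unit increase in $X_k$ multiplies the odds $\frac{P(y=1|X)}{P(y=0|X)}$ by $\exp(\beta_k)$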
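
A small worked example may help separate the two scales. The coefficient value 0.2 below is hypothetical (not an estimate from the tutorial data): the odds ratio is the same at every baseline, while the implied change in probability is not.

```python
import numpy as np

beta_k = 0.2                      # hypothetical logit coefficient
odds_ratio = np.exp(beta_k)

changes = []
for p0 in (0.1, 0.5, 0.9):
    odds0 = p0 / (1 - p0)
    odds1 = odds0 * odds_ratio    # odds after a one-unit increase in X_k
    p1 = odds1 / (1 + odds1)
    changes.append(p1 - p0)
    print(f"p: {p0:.2f} -> {p1:.4f} (change {p1 - p0:+.4f}), "
          f"odds ratio {odds1 / odds0:.4f}")
```

The probability change is largest at the 0.5 baseline and shrinks toward 0 and 1, which is why coefficients and odds ratios should not be read as marginal effects.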

### 4.4 Calculate Odds Ratios

In [None]:
print("=== Coefficient Interpretation ===")
print("\n1. Coefficients (β) represent LOG-ODDS RATIOS:")
print(logit_results.params)

print("\n2. Odds Ratios (exp(β)):")
odds_ratios = np.exp(logit_results.params)
print(odds_ratios)

print("\n=== Detailed Interpretation ===")

# Education
beta_educ = logit_results.params['educ']
or_educ = odds_ratios['educ']
print(f"\nEducation (educ):")
print(f"  β = {beta_educ:.4f}")
print(f"  OR = exp({beta_educ:.4f}) = {or_educ:.4f}")
print(f"  Interpretation:")
print(f"    - One additional year of education changes log-odds by {beta_educ:.4f}")
print(f"    - Odds of working multiply by {or_educ:.4f}")
print(f"    - ≈ {100*(or_educ-1):.2f}% change in odds")

# Kids
beta_kids = logit_results.params['kids']
or_kids = odds_ratios['kids']
print(f"\nChildren (kids):")
print(f"  β = {beta_kids:.4f}")
print(f"  OR = exp({beta_kids:.4f}) = {or_kids:.4f}")
print(f"  Interpretation:")
print(f"    - One additional child changes log-odds by {beta_kids:.4f}")
print(f"    - Odds of working multiply by {or_kids:.4f}")
print(f"    - ≈ {100*(or_kids-1):.2f}% change in odds ({'increase' if or_kids > 1 else 'decrease'})")

# Married
beta_married = logit_results.params['married']
or_married = odds_ratios['married']
print(f"\nMarried:")
print(f"  β = {beta_married:.4f}")
print(f"  OR = exp({beta_married:.4f}) = {or_married:.4f}")
print(f"  Interpretation:")
print(f"    - Being married (vs not) changes log-odds by {beta_married:.4f}")
print(f"    - Odds of working multiply by {or_married:.4f}")
if or_married > 1:
    print(f"    - ≈ {100*(or_married-1):.2f}% increase in odds")
else:
    print(f"    - ≈ {100*(1-or_married):.2f}% decrease in odds")

In [None]:
# WARNING about marginal effects
print("\n" + "="*60)
print("⚠️  IMPORTANT WARNING ⚠️")
print("="*60)
print("\nCoefficients β ≠ Marginal Effects!")
print("\nMarginal effects are:")
print("  ∂P(y=1)/∂X_k = β_k × Λ(Xβ) × [1 - Λ(Xβ)]")
print("\nThey depend on:")
print("  1. The coefficient β_k")
print("  2. The values of ALL covariates (X)")
print("  3. The current probability level")
print("\nMarginal effects will be covered in detail in Notebook 04.")
print("="*60)

### 4.5 Model Diagnostics

In [None]:
print("=== Model Diagnostics ===")

# Convergence
print(f"\nConvergence: {logit_results.converged}")
if hasattr(logit_results, 'method'):
    print(f"Optimization method: {logit_results.method}")

# Likelihood and information criteria
print(f"\nLog-Likelihood: {logit_results.llf:.2f}")
print(f"AIC: {logit_results.aic:.2f}")
print(f"BIC: {logit_results.bic:.2f}")

# Pseudo-R² (basic)
if hasattr(logit_results, 'pseudo_r2'):
    pr2_dict = logit_results.pseudo_r2()
    print(f"\nPseudo-R² (McFadden): {pr2_dict.get('mcfadden', 'N/A')}")

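if logit_results.converged:
    print("\n✓ Model converged successfully")
else:
    print("\n⚠️  Model did NOT converge: interpret the results with caution")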

<a id='section5'></a>
## 5. Probit Model

### 5.1 Model Specification

**Probability**:
$$P(y_{it}=1|\mathbf{X}_{it}) = \Phi(\mathbf{X}_{it}'\boldsymbol{\beta}) = \int_{-\infty}^{\mathbf{X}_{it}'\boldsymbol{\beta}} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt$$

**Latent variable interpretation**:
$$y_{it}^* = \mathbf{X}_{it}'\boldsymbol{\beta} + \varepsilon_{it}, \quad \varepsilon_{it} \sim N(0,1)$$
$$y_{it} = \mathbb{1}[y_{it}^* > 0]$$
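
The latent variable formulation can be checked by simulation. In this sketch the index value `xb = 0.5` and the sample size are arbitrary assumptions; the point is only that thresholding $y^*$ at zero reproduces $\Phi(\mathbf{X}'\boldsymbol{\beta})$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 200_000
xb = 0.5                               # hypothetical value of the linear index

eps = rng.standard_normal(n)           # epsilon ~ N(0, 1)
y_star = xb + eps                      # latent variable
y = (y_star > 0).astype(int)           # observed binary outcome

simulated = y.mean()
theoretical = norm.cdf(xb)
print(f"Simulated P(y=1):   {simulated:.4f}")
print(f"Phi(xb) = Phi(0.5): {theoretical:.4f}")
```

Replacing the normal draws with standard logistic draws would give the Logit model by the same argument.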

### 5.2 Estimate Pooled Probit

In [None]:
# Estimate Pooled Probit model
probit = PooledProbit("lfp ~ age + educ + kids + married", data, "id", "year")
probit_results = probit.fit(cov_type='cluster')

print("=== Pooled Probit Model ===")
print(probit_results.summary())

### 5.3 Logit vs Probit Comparison

**Key point**: Coefficients are NOT directly comparable!

**Why?** Different variance scaling:
- Logit error variance: $\text{Var}(\varepsilon) = \pi^2/3 \approx 3.29$
- Probit error variance: $\text{Var}(\varepsilon) = 1$

**Rule of thumb**: $\beta_{\text{probit}} \approx 0.625 \times \beta_{\text{logit}}$
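
The arithmetic behind the scaling can be sketched numerically. The code below compares how closely the logistic CDF tracks the normal CDF under three candidate scale factors: none, the conventional 1.6 (whose reciprocal is 0.625), and the ratio of standard deviations $\pi/\sqrt{3}$. The grid and the helper `max_cdf_gap` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-6, 6, 2001)

def max_cdf_gap(scale):
    """Largest gap between the logistic CDF at scale*z and the normal CDF at z."""
    logistic_cdf = 1.0 / (1.0 + np.exp(-scale * z))
    return np.max(np.abs(logistic_cdf - norm.cdf(z)))

sd_ratio = np.pi / np.sqrt(3)          # logistic sd / standard normal sd
candidates = {
    "no rescaling (1.0)": 1.0,
    "rule of thumb (1.6)": 1.6,
    "sd ratio (pi/sqrt(3))": sd_ratio,
}
gaps = {name: max_cdf_gap(s) for name, s in candidates.items()}
for name, gap in gaps.items():
    print(f"{name:24s} max CDF gap = {gap:.4f}")
print(f"1 / 1.6 = {1 / 1.6:.3f}, 1 / sd ratio = {1 / sd_ratio:.3f}")
```

Either rescaling brings the two curves close together, so the exact factor matters less than remembering that raw Logit and Probit coefficients live on different scales.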

In [None]:
# Compare Logit and Probit coefficients
comparison = pd.DataFrame({
    'Logit_β': logit_results.params,
    'Probit_β': probit_results.params,
    'Probit_scaled_β': probit_results.params / 0.625,  # Scale to Logit scale
    'Ratio (P/L)': probit_results.params / logit_results.params
})

print("=== Logit vs Probit Coefficient Comparison ===")
print(comparison)

print("\nNotes:")
print("  1. Coefficients are NOT directly comparable (different scales)")
print("  2. Rule of thumb: β_probit ≈ 0.625 × β_logit")
print(f"  3. Average ratio observed: {(probit_results.params / logit_results.params).mean():.3f}")
print("  4. What MATTERS: Signs and relative magnitudes should be consistent")

### 5.4 When to Use Probit vs Logit

**Use Probit when**:
- You have theoretical reasons to assume normal errors
- Latent variable interpretation is natural for your application
- You want to use the 0.4β rule for quick marginal effect approximations
- Extending to multivariate probit (correlated errors across equations)

**Use Logit when**:
- You want easier coefficient interpretation (odds ratios)
- Computational simplicity is important
- Following conventions in your field (Logit more common in applied work)
- Extending to conditional logit or nested logit models

**In practice**: Choice rarely matters for final conclusions!

<a id='section6'></a>
## 6. Model Comparison

Let's systematically compare LPM, Logit, and Probit.

### 6.1 Coefficient Comparison

In [None]:
# Create comprehensive comparison table
coef_comparison = pd.DataFrame({
    'LPM': lpm_results.params,
    'Logit': logit_results.params,
    'Probit': probit_results.params
})

print("=== Coefficient Comparison Across Models ===")
print(coef_comparison)

print("\n" + "="*60)
print("INTERPRETATION GUIDE")
print("="*60)
print("\nMagnitudes are NOT directly comparable across models because:")
print("  • LPM: Coefficients = constant marginal effects on probability")
print("  • Logit: Coefficients = log-odds ratios")
print("  • Probit: Coefficients = latent variable effects (different scale)")
print("\nWhat you SHOULD compare:")
print("  ✓ Signs (positive/negative direction)")
print("  ✓ Statistical significance")
print("  ✓ Relative magnitudes within each model")
print("  ✓ Predicted probabilities (next section)")
print("="*60)

### 6.2 Prediction Comparison

While coefficients differ, **predicted probabilities** should be very similar.

In [None]:
# Generate predictions from all three models
data['lpm_prob'] = lpm_results.predict()
data['logit_prob'] = logit_results.predict(type='prob')
data['probit_prob'] = probit_results.predict(type='prob')

# Clip LPM predictions to [0,1] for fair comparison
data['lpm_prob_clipped'] = data['lpm_prob'].clip(0, 1)

# Correlation matrix
pred_corr = data[['lpm_prob_clipped', 'logit_prob', 'probit_prob']].corr()
print("=== Correlation of Predicted Probabilities ===")
print(pred_corr)
print("\n→ Expect Logit and Probit predictions to be nearly identical (r close to 1.0)")
print("→ LPM also highly correlated, but with systematic differences")

In [None]:
# Visualize prediction comparisons
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# LPM vs Logit
axes[0].scatter(data['logit_prob'], data['lpm_prob_clipped'], alpha=0.3, s=10)
axes[0].plot([0, 1], [0, 1], 'r--', label='45-degree line', linewidth=2)
axes[0].set_xlabel('Logit Prediction')
axes[0].set_ylabel('LPM Prediction (clipped)')
axes[0].set_title(f'LPM vs Logit\n(corr = {pred_corr.loc["lpm_prob_clipped", "logit_prob"]:.4f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim([0, 1])
axes[0].set_ylim([0, 1])

# Probit vs Logit
axes[1].scatter(data['logit_prob'], data['probit_prob'], alpha=0.3, s=10, color='green')
axes[1].plot([0, 1], [0, 1], 'r--', label='45-degree line', linewidth=2)
axes[1].set_xlabel('Logit Prediction')
axes[1].set_ylabel('Probit Prediction')
axes[1].set_title(f'Probit vs Logit\n(corr = {pred_corr.loc["logit_prob", "probit_prob"]:.4f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim([0, 1])
axes[1].set_ylim([0, 1])

# LPM vs Probit
axes[2].scatter(data['probit_prob'], data['lpm_prob_clipped'], alpha=0.3, s=10, color='orange')
axes[2].plot([0, 1], [0, 1], 'r--', label='45-degree line', linewidth=2)
axes[2].set_xlabel('Probit Prediction')
axes[2].set_ylabel('LPM Prediction (clipped)')
axes[2].set_title(f'LPM vs Probit\n(corr = {pred_corr.loc["lpm_prob_clipped", "probit_prob"]:.4f})')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
axes[2].set_xlim([0, 1])
axes[2].set_ylim([0, 1])

plt.tight_layout()
plt.show()

print("\nKey Observation: Logit and Probit predictions lie almost exactly on the 45° line!")

### 6.3 Model Fit Statistics

In [None]:
# Get pseudo-R² for Logit and Probit
logit_pr2 = logit_results.pseudo_r2() if hasattr(logit_results, 'pseudo_r2') else {}
probit_pr2 = probit_results.pseudo_r2() if hasattr(probit_results, 'pseudo_r2') else {}

# Create model comparison table
fit_stats = pd.DataFrame({
    'LPM': [
        lpm_results.rsquared,
        np.nan,
        np.nan
    ],
    'Logit': [
        logit_pr2.get('mcfadden', np.nan),
        logit_results.aic,
        logit_results.bic
    ],
    'Probit': [
        probit_pr2.get('mcfadden', np.nan),
        probit_results.aic,
        probit_results.bic
    ]
}, index=['R²/Pseudo-R²', 'AIC', 'BIC'])

print("=== Model Fit Comparison ===")
print(fit_stats)

print("\nInterpretation:")
print("  • Lower AIC/BIC = better fit (penalizes complexity)")
print("  • Pseudo-R² NOT comparable to OLS R² (different interpretation)")
print("  • Logit and Probit have nearly identical fit")

# Determine best model by AIC
best_aic = fit_stats.loc['AIC'].idxmin()
best_bic = fit_stats.loc['BIC'].idxmin()
print(f"\nBest model by AIC: {best_aic}")
print(f"Best model by BIC: {best_bic}")
print("\n✓ Recommendation: Use Logit or Probit (strongly preferred over LPM)")

### 6.4 Summary: Model Selection

**Conclusion**:
1. ✓ **Logit and Probit** produce virtually identical predictions
2. ✓ Both are **strongly preferred** over LPM for formal analysis
3. ✓ Choice between Logit/Probit is often **arbitrary**
4. ○ LPM can be used as a **robustness check** or quick approximation

**Practical advice**: Use **Logit** unless you have specific reasons to prefer Probit.

<a id='section7'></a>
## 7. Predictions and Classification

### 7.1 Types of Predictions

Binary choice models can generate three types of predictions:

1. **Probabilities**: $P(y=1|\mathbf{X})$ (default, most useful)
2. **Linear predictor**: $\mathbf{X}'\boldsymbol{\beta}$ (for advanced diagnostics)
3. **Classes**: $\hat{y} \in \{0, 1\}$ (for classification tasks)
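
The three prediction types are linked by simple transformations. The sketch below shows the chain for a logit model using made-up linear-predictor values; it does not call PanelBox and only illustrates the arithmetic.

```python
import numpy as np

# Hypothetical linear-predictor values X'beta for five observations
linear_pred = np.array([-2.0, -0.5, 0.0, 0.8, 3.0])

# Probabilities: apply the logistic CDF to the linear predictor
probs = 1.0 / (1.0 + np.exp(-linear_pred))

# Classes: compare the probabilities with a threshold
threshold = 0.5
classes = (probs > threshold).astype(int)

print("Linear predictor:", linear_pred)
print("Probabilities:   ", np.round(probs, 4))
print("Classes:         ", classes)
```

With a 0.5 threshold, classifying on the probability is equivalent to checking whether the linear predictor is positive; an observation exactly at the threshold is assigned to class 0 under the strict inequality used here.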

### 7.2 Generate Predictions

In [None]:
print("=== Types of Predictions from Logit Model ===")

# 1. Predicted probabilities (default)
probs = logit_results.predict(type='prob')
print("\n1. Predicted Probabilities P(y=1|X):")
print(f"   Shape: {probs.shape}")
print(f"   Range: [{probs.min():.4f}, {probs.max():.4f}]")
print(f"   First 10 values:\n{probs[:10]}")

# 2. Linear index (Xβ)
linear_pred = logit_results.predict(type='linear')
print("\n2. Linear Predictor (Xβ):")
print(f"   Shape: {linear_pred.shape}")
print(f"   Range: [{linear_pred.min():.4f}, {linear_pred.max():.4f}]")
print(f"   First 10 values:\n{linear_pred[:10]}")

# 3. Predicted classes (threshold=0.5)
classes = logit_results.predict(type='class', threshold=0.5)
print("\n3. Predicted Classes (threshold=0.5):")
print(f"   Shape: {classes.shape}")
print(f"   Unique values: {np.unique(classes)}")
print(f"   First 10 values:\n{classes[:10]}")
print(f"\n   Predicted proportion of y=1: {classes.mean():.4f}")
print(f"   Actual proportion of y=1: {data['lfp'].mean():.4f}")

### 7.3 Confusion Matrix

The **confusion matrix** shows how well classifications match actual outcomes:

|                | Predicted 0 | Predicted 1 |
|----------------|-------------|-------------|
| **Actual 0**   | TN          | FP          |
| **Actual 1**   | FN          | TP          |

- **TN**: True Negatives (correctly predicted 0)
- **TP**: True Positives (correctly predicted 1)  
- **FP**: False Positives (predicted 1, actually 0)
- **FN**: False Negatives (predicted 0, actually 1)
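
As a quick check of these definitions, the four cells can be counted directly with boolean masks. The labels and predictions below are toy values made up for illustration.

```python
import numpy as np

# Toy labels and predicted classes (made up for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tp = int(np.sum((y_true == 1) & (y_pred == 1)))

# Rows are actual classes, columns are predicted classes
confusion = np.array([[tn, fp],
                      [fn, tp]])
print(confusion)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The layout matches the table above: rows index the actual class and columns the predicted class.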

In [None]:
# Generate confusion matrix
if hasattr(logit_results, 'classification_table'):
    confusion = logit_results.classification_table(threshold=0.5)
    print("=== Confusion Matrix (threshold=0.5) ===")
    print(confusion)
else:
    # Manual calculation
    from sklearn.metrics import confusion_matrix
    classes = (logit_results.predict(type='prob') > 0.5).astype(int)
    cm = confusion_matrix(data['lfp'], classes)
    print("=== Confusion Matrix (threshold=0.5) ===")
    print(pd.DataFrame(cm, 
                      index=['Actual 0', 'Actual 1'],
                      columns=['Predicted 0', 'Predicted 1']))
    
    # Extract values
    tn, fp, fn, tp = cm.ravel()
    print(f"\nBreakdown:")
    print(f"  True Negatives (TN):  {tn}")
    print(f"  False Positives (FP): {fp}")
    print(f"  False Negatives (FN): {fn}")
    print(f"  True Positives (TP):  {tp}")
    print(f"\nTotal correct: {tn + tp} / {len(data)} = {(tn+tp)/len(data):.4f}")

### 7.4 Classification Metrics

Key performance metrics:

- **Accuracy**: $(TP + TN) / N$ — overall correctness
- **Precision**: $TP / (TP + FP)$ — of predicted positives, % actually positive
- **Recall (Sensitivity)**: $TP / (TP + FN)$ — of actual positives, % correctly predicted
- **F1-Score**: $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ — harmonic mean
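
The formulas can be evaluated by hand from the four confusion-matrix counts. The counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts
tn, fp, fn, tp = 3, 1, 2, 4
n = tn + fp + fn + tp

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

Precision and recall are undefined when their denominators are zero (for example, when no observation is predicted positive), so production code should guard against that case.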

In [None]:
# Calculate classification metrics
if hasattr(logit_results, 'classification_metrics'):
    metrics = logit_results.classification_metrics(threshold=0.5)
    print("=== Classification Metrics (threshold=0.5) ===")
    for metric, value in metrics.items():
        print(f"{metric:15s}: {value:.4f}")
else:
    # Manual calculation
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    
    y_true = data['lfp'].values
    y_pred = (logit_results.predict(type='prob') > 0.5).astype(int)
    
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1-Score': f1_score(y_true, y_pred)
    }
    
    print("=== Classification Metrics (threshold=0.5) ===")
    for metric, value in metrics.items():
        print(f"{metric:15s}: {value:.4f}")

print("\n=== Interpretation ===")
print(f"Accuracy:  {metrics.get('Accuracy', 0)*100:.2f}% of all predictions are correct")
print(f"Precision: {metrics.get('Precision', 0)*100:.2f}% of predicted 'working' are actually working")
print(f"Recall:    {metrics.get('Recall', 0)*100:.2f}% of actual 'working' are correctly predicted")
print(f"F1-Score:  Harmonic mean = {metrics.get('F1-Score', 0):.4f}")

### 7.5 ROC Curve and AUC

**ROC Curve**: Plots True Positive Rate vs False Positive Rate across all thresholds

**AUC**: Area Under the ROC Curve. It equals the probability that the model assigns a higher score to a randomly chosen positive than to a randomly chosen negative.

- AUC = 0.5: Random classifier
- AUC = 1.0: Perfect classifier
- AUC > 0.7: Generally considered good
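
The ranking interpretation of AUC can be verified directly by comparing every positive with every negative. The helper `auc_by_ranking` and the toy scores are hypothetical, and this pairwise approach is meant for small examples only, since it builds a matrix with one entry per (positive, negative) pair.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]       # every positive vs every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = auc_by_ranking(y_toy, scores_toy)
print(f"AUC = {auc_toy:.2f}")   # 3 of the 4 pairs are ranked correctly -> 0.75
```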

In [None]:
# Calculate ROC curve
y_true = data['lfp'].values
y_scores = logit_results.predict(type='prob')

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, label=f'Logit (AUC = {roc_auc:.4f})', linewidth=2.5, color='#2980b9')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)', linewidth=2)
plt.xlabel('False Positive Rate', fontsize=13)
plt.ylabel('True Positive Rate (Recall)', fontsize=13)
plt.title('ROC Curve - Logit Model', fontsize=15, fontweight='bold')
plt.legend(fontsize=12, loc='lower right')
plt.grid(True, alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1])

# Add text annotation
plt.text(0.6, 0.2, f'AUC = {roc_auc:.4f}\n(Excellent)' if roc_auc > 0.8 else f'AUC = {roc_auc:.4f}',
         fontsize=14, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

plt.tight_layout()
plt.show()

print(f"\n=== ROC Analysis ===")
print(f"AUC (Area Under ROC Curve): {roc_auc:.4f}")
print(f"\nInterpretation:")
print(f"  The model has a {roc_auc:.1%} probability of ranking a randomly chosen")
print(f"  working person higher than a randomly chosen non-working person.")
if roc_auc > 0.8:
    print(f"\n  ✓ AUC > 0.8: Excellent discrimination")
elif roc_auc > 0.7:
    print(f"\n  ✓ AUC > 0.7: Good discrimination")
else:
    print(f"\n  ○ AUC ≤ 0.7: Moderate discrimination")

### 7.6 Distribution of Predicted Probabilities

Visualize how well the model separates the two classes.

In [None]:
# Plot predicted probability distributions by actual class
fig, ax = plt.subplots(figsize=(12, 6))

data_class_0 = data[data['lfp'] == 0]['logit_prob']
data_class_1 = data[data['lfp'] == 1]['logit_prob']

ax.hist(data_class_0, bins=40, alpha=0.6, label=f'Actual = 0 (Not Working, n={len(data_class_0)})', 
        color='#e74c3c', edgecolor='black')
ax.hist(data_class_1, bins=40, alpha=0.6, label=f'Actual = 1 (Working, n={len(data_class_1)})', 
        color='#27ae60', edgecolor='black')
ax.axvline(x=0.5, color='black', linestyle='--', linewidth=2.5, 
           label='Classification Threshold (0.5)', zorder=5)

ax.set_xlabel('Predicted Probability of Working', fontsize=13)
ax.set_ylabel('Frequency', fontsize=13)
ax.set_title('Distribution of Predicted Probabilities by Actual Outcome', 
             fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nIdeal scenario: Two distributions are well-separated")
print("→ Model successfully distinguishes between the two groups")
print(f"\nMean predicted prob for y=0: {data_class_0.mean():.4f}")
print(f"Mean predicted prob for y=1: {data_class_1.mean():.4f}")
print(f"Separation (difference): {abs(data_class_1.mean() - data_class_0.mean()):.4f}")

<a id='section8'></a>
## 8. Goodness-of-Fit Tests

### 8.1 Pseudo-R² Measures

**IMPORTANT**: Pseudo-R² ≠ OLS R²

Common pseudo-R² measures:
- **McFadden**: $1 - \frac{\ell(\boldsymbol{\beta})}{\ell_0}$ where $\ell_0$ is log-likelihood of intercept-only model
- **Cox-Snell**: $1 - \exp\left(\frac{2(\ell_0 - \ell(\boldsymbol{\beta}))}{N}\right)$
- **Nagelkerke**: Cox-Snell divided by its maximum attainable value, $1 - \exp\left(\frac{2\ell_0}{N}\right)$, so that the measure can reach 1

**Interpretation**:
- NOT "proportion of variance explained"
- Use for model comparison, not absolute fit
- Values 0.2-0.4 considered good (much lower than OLS R²)
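
The three measures can be computed from two log-likelihoods and the sample size. The numbers below are made up for illustration and are not results from the tutorial data.

```python
import numpy as np

# Made-up values for illustration (not results from the tutorial data)
ll_full = -520.0    # log-likelihood of the fitted model
ll_null = -650.0    # log-likelihood of the intercept-only model
n_obs = 1000

mcfadden = 1 - ll_full / ll_null
cox_snell = 1 - np.exp(2 * (ll_null - ll_full) / n_obs)
cox_snell_max = 1 - np.exp(2 * ll_null / n_obs)   # upper bound of Cox-Snell
nagelkerke = cox_snell / cox_snell_max

print(f"McFadden:   {mcfadden:.4f}")
print(f"Cox-Snell:  {cox_snell:.4f}")
print(f"Nagelkerke: {nagelkerke:.4f}")
```

The same pair of log-likelihoods yields three different values, which is one reason to compare models using a single measure consistently.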

In [None]:
# Calculate all pseudo-R² measures
if hasattr(logit_results, 'pseudo_r2'):
    pseudo_r2_dict = logit_results.pseudo_r2()
    
    print("=== Pseudo-R² Measures ===")
    for name, value in pseudo_r2_dict.items():
        print(f"{name:20s}: {value:.4f}")
else:
    # Basic McFadden calculation
    ll_full = logit_results.llf
    # Estimate null model
    null_model = PooledLogit("lfp ~ 1", data, "id", "year")
    null_results = null_model.fit()
    ll_null = null_results.llf
    
    mcfadden = 1 - (ll_full / ll_null)
    print("=== Pseudo-R² Measures ===")
    print(f"McFadden R²: {mcfadden:.4f}")

print("\n" + "="*70)
print("⚠️  IMPORTANT NOTES ON PSEUDO-R²")
print("="*70)
print("1. Pseudo-R² is NOT the same as OLS R²")
print("2. Cannot be interpreted as 'proportion of variance explained'")
print("3. Values typically much lower than OLS R² (0.2-0.4 is considered GOOD)")
print("4. Use for model comparison, not absolute fit assessment")
print("5. Different pseudo-R² measures can give different values")
print("="*70)

### 8.2 Hosmer-Lemeshow Test

Tests whether observed and expected event frequencies match within groups formed from the predicted probabilities (typically deciles).

**Null hypothesis**: Model is correctly specified (good fit)

**Test statistic**: $H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g \times (1-E_g/N_g)}$

- $H \sim \chi^2_{G-2}$ under $H_0$
- Typically use $G=10$ groups
- **Large p-value** = fail to reject $H_0$ = good fit
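
If a built-in test is not available, the statistic can be computed by hand from the outcomes and the predicted probabilities. The following is a minimal sketch under stated assumptions: `hosmer_lemeshow` is a hypothetical helper (not a PanelBox function), groups are formed as equal-sized quantile bins of the predicted probability, and the data are simulated purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Hosmer-Lemeshow statistic, degrees of freedom and p-value."""
    df = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(y_prob)})
    # Rank first so that ties in predicted probabilities cannot collapse bins
    df["group"] = pd.qcut(df["p"].rank(method="first"), q=n_groups, labels=False)
    grouped = df.groupby("group")
    observed = grouped["y"].sum()    # O_g: observed number of ones
    expected = grouped["p"].sum()    # E_g: sum of predicted probabilities
    size = grouped["y"].count()      # N_g: group size
    statistic = float(
        ((observed - expected) ** 2 / (expected * (1.0 - expected / size))).sum()
    )
    dof = n_groups - 2
    p_value = float(chi2.sf(statistic, dof))
    return {"statistic": statistic, "df": dof, "p_value": p_value}

# Simulated check: outcomes are drawn from the stated probabilities,
# so the "predictions" are well calibrated by construction
rng = np.random.default_rng(0)
p_sim = rng.uniform(0.05, 0.95, size=5000)
y_sim = rng.binomial(1, p_sim)
hl_sim = hosmer_lemeshow(y_sim, p_sim)
print(hl_sim)
```

With fitted values from the tutorial model, the call would look like `hosmer_lemeshow(data['lfp'], data['logit_prob'])`, using the columns already plotted in Section 7.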

In [None]:
# Perform Hosmer-Lemeshow goodness-of-fit test
if hasattr(logit_results, 'hosmer_lemeshow_test'):
    hl_test = logit_results.hosmer_lemeshow_test(n_groups=10)
    
    print("=== Hosmer-Lemeshow Goodness-of-Fit Test ===")
    print(f"Chi-square statistic: {hl_test['statistic']:.4f}")
    print(f"p-value: {hl_test['p_value']:.4f}")
    print(f"Degrees of freedom: {hl_test['df']}")
    
    print("\nInterpretation:")
    print("  H0: Model is correctly specified (observed frequencies ~ expected frequencies)")
    if hl_test['p_value'] > 0.05:
        print(f"  Result: Fail to reject H0 (p={hl_test['p_value']:.4f} > 0.05)")
        print("  Conclusion: No evidence of model misspecification ✓")
    else:
        print(f"  Result: Reject H0 (p={hl_test['p_value']:.4f} ≤ 0.05)")
        print("  Conclusion: Evidence of model misspecification ✗")
        print("  Action: Consider adding interactions, polynomials, or other variables")
else:
    print("=== Hosmer-Lemeshow Test ===")
    print("Function not yet implemented in PanelBox.")
    print("This test will be available in future versions.")

### 8.3 Link Test

Tests if link function is correctly specified.

**Procedure**:
1. Calculate $\hat{y} = \mathbf{X}'\hat{\boldsymbol{\beta}}$ (linear predictor)
2. Re-estimate the same binary model (here, Logit) of $y$ on $\hat{y}$ and $\hat{y}^2$
3. Test if $\hat{y}^2$ is significant

**Interpretation**:
- $\hat{y}^2$ NOT significant → no evidence against the chosen link function
- $\hat{y}^2$ significant → may need different link or additional variables
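
The procedure above can be sketched with NumPy alone. This is an illustrative implementation under stated assumptions: `fit_logit` and `link_test` are hypothetical helpers (not PanelBox functions), the Logit is fitted by plain Newton-Raphson, standard errors are the classical (non-clustered) ones, and the data are simulated from a correctly specified Logit.

```python
import numpy as np
from scipy.stats import norm

def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Logit MLE by Newton-Raphson; returns coefficients and standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X  # negative Hessian
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = (X * (p * (1.0 - p))[:, None]).T @ X
    se = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return beta, se

def link_test(X, y):
    """Refit the Logit on the linear predictor and its square."""
    beta, _ = fit_logit(X, y)
    xb = X @ beta                                      # step 1: linear predictor
    Z = np.column_stack([np.ones_like(xb), xb, xb ** 2])
    gamma, se = fit_logit(Z, y)                        # step 2: auxiliary Logit
    z_stat = gamma[2] / se[2]                          # step 3: test squared term
    p_value = 2.0 * norm.sf(abs(z_stat))
    return {"coef": float(gamma[2]), "se": float(se[2]), "p_value": float(p_value)}

# Simulated data from a correctly specified Logit (illustration only)
rng = np.random.default_rng(1)
n = 3000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_sim = np.column_stack([np.ones(n), x1, x2])
y_sim = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x1 - 0.5 * x2))))
link_result = link_test(X_sim, y_sim)
print(link_result)
```

Because the simulated model is correctly specified, the coefficient on the squared term should be close to zero here. With panel data, clustered standard errors would be the more appropriate choice for the auxiliary regression.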

In [None]:
# Perform link test (specification test)
if hasattr(logit_results, 'link_test'):
    link_test = logit_results.link_test()
    
    print("=== Link Test (Specification Test) ===")
    print(f"Coefficient on (Xβ)²: {link_test['coef']:.4f}")
    print(f"Standard error: {link_test.get('se', 'N/A')}")
    print(f"p-value: {link_test['p_value']:.4f}")
    
    print("\nInterpretation:")
    print("  H0: Link function is correctly specified (coefficient on (Xβ)² = 0)")
    if link_test['p_value'] > 0.05:
        print(f"  Result: (Xβ)² is NOT significant (p={link_test['p_value']:.4f})")
        print("  Conclusion: No evidence against the Logit link function ✓")
    else:
        print(f"  Result: (Xβ)² IS significant (p={link_test['p_value']:.4f})")
        print("  Conclusion: May need different link function or additional variables ✗")
        print("  Suggestions: Try adding quadratic terms, interactions, or Probit")
else:
    print("=== Link Test ===")
    print("Function not yet implemented in PanelBox.")
    print("This test will be available in future versions.")

<a id='section9'></a>
## 9. Application: Labor Force Participation

### 9.1 Research Question

**"What factors influence women's labor force participation decisions?"**

We'll estimate a comprehensive model including:
- Age (with quadratic term to capture lifecycle effects)
- Education
- Number of children
- Marital status
- Work experience
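
The quadratic age term implies a turning point that can be derived directly. As a sketch, write $\Lambda(\cdot)$ for the logistic CDF and $\beta_1$, $\beta_2$ for the coefficients on age and age² (labels chosen here for illustration). The Logit probability is $P = \Lambda(\beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \dots)$, so

$$
\frac{\partial P}{\partial\,\text{age}} = \Lambda(\cdot)\,[1 - \Lambda(\cdot)]\,(\beta_1 + 2\beta_2\,\text{age}) = 0
\quad\Longrightarrow\quad
\text{age}^* = -\frac{\beta_1}{2\beta_2}.
$$

Since $\Lambda(\cdot)\,[1 - \Lambda(\cdot)] > 0$, the sign of the age effect is the sign of $\beta_1 + 2\beta_2\,\text{age}$. The turning point $\text{age}^*$ is a peak when $\beta_2 < 0$ (inverted U) and a trough when $\beta_2 > 0$ (U-shape).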

### 9.2 Full Model Estimation

In [None]:
# Estimate comprehensive model with quadratic age term
full_formula = "lfp ~ age + I(age**2) + educ + kids + married + exper"
logit_full = PooledLogit(full_formula, data, "id", "year")
results_full = logit_full.fit(cov_type='cluster')

print("="*80)
print(" " * 20 + "LABOR FORCE PARTICIPATION MODEL")
print("="*80)
print(results_full.summary())

### 9.3 Economic Interpretation

In [None]:
print("="*80)
print(" " * 25 + "ECONOMIC INTERPRETATION")
print("="*80)

# Age effects (quadratic)
beta_age = results_full.params['age']
beta_age2 = results_full.params['I(age ** 2)']
print("\n1. AGE EFFECTS (Quadratic Specification)")
print("   " + "-" * 60)
print(f"   Linear term (age):     β = {beta_age:.4f}")
print(f"   Quadratic term (age²): β = {beta_age2:.4f}")

if beta_age2 < 0:
    optimal_age = -beta_age / (2 * beta_age2)
    print(f"\n   → Inverted-U shape: Participation increases then decreases with age")
    print(f"   → Peak participation at age: {optimal_age:.1f} years")
elif beta_age2 > 0:
    print(f"\n   → U-shape: Participation decreases then increases with age")
else:
    print(f"\n   → Linear relationship with age")

# Education
beta_educ = results_full.params['educ']
or_educ = np.exp(beta_educ)
print("\n2. EDUCATION")
print("   " + "-" * 60)
print(f"   Coefficient: β = {beta_educ:.4f}")
print(f"   Odds Ratio: exp(β) = {or_educ:.4f}")
print(f"\n   → Each additional year of education MULTIPLIES odds of working by {or_educ:.4f}")
print(f"   → ≈ {100*(or_educ-1):.2f}% change in odds per year of schooling")
if beta_educ > 0:
    print(f"   → POSITIVE effect: More education → Higher labor force participation")

# Kids
beta_kids = results_full.params['kids']
or_kids = np.exp(beta_kids)
print("\n3. NUMBER OF CHILDREN")
print("   " + "-" * 60)
print(f"   Coefficient: β = {beta_kids:.4f}")
print(f"   Odds Ratio: exp(β) = {or_kids:.4f}")
print(f"\n   → Each additional child MULTIPLIES odds of working by {or_kids:.4f}")
if or_kids < 1:
    print(f"   → ≈ {100*(1-or_kids):.2f}% DECREASE in odds per child")
    print(f"   → NEGATIVE effect: More children → Lower labor force participation")
else:
    print(f"   → ≈ {100*(or_kids-1):.2f}% increase in odds per child")

# Marriage
beta_married = results_full.params['married']
or_married = np.exp(beta_married)
print("\n4. MARITAL STATUS")
print("   " + "-" * 60)
print(f"   Coefficient: β = {beta_married:.4f}")
print(f"   Odds Ratio: exp(β) = {or_married:.4f}")
print(f"\n   → Being married (vs single) MULTIPLIES odds of working by {or_married:.4f}")
if or_married < 1:
    print(f"   → ≈ {100*(1-or_married):.2f}% DECREASE in odds")
    print(f"   → NEGATIVE effect: Marriage associated with lower participation")
else:
    print(f"   → ≈ {100*(or_married-1):.2f}% increase in odds")
    print(f"   → POSITIVE effect: Marriage associated with higher participation")

# Experience
beta_exper = results_full.params['exper']
or_exper = np.exp(beta_exper)
print("\n5. WORK EXPERIENCE")
print("   " + "-" * 60)
print(f"   Coefficient: β = {beta_exper:.4f}")
print(f"   Odds Ratio: exp(β) = {or_exper:.4f}")
print(f"\n   → Each additional year of experience MULTIPLIES odds by {or_exper:.4f}")
print(f"   → ≈ {100*(or_exper-1):.2f}% change in odds per year")
if beta_exper > 0:
    print(f"   → POSITIVE effect: More experience → Higher participation (human capital)")

print("\n" + "="*80)

### 9.4 Predictions for Representative Profiles

Let's predict participation probabilities for different types of women.

In [None]:
# Create representative profiles
profiles = pd.DataFrame({
    'age': [30, 30, 40, 40],
    'educ': [12, 16, 12, 16],  # High school vs college
    'kids': [0, 0, 2, 2],
    'married': [0, 0, 1, 1],
    'exper': [5, 5, 10, 10]
})

# Predict probabilities: the I(age ** 2) term must be rebuilt for new data,
# so compute the linear predictor manually from the estimated coefficients
profiles_pred = profiles.copy()
xb = (results_full.params['Intercept'] +
      results_full.params['age'] * profiles_pred['age'] +
      results_full.params['I(age ** 2)'] * (profiles_pred['age'] ** 2) +
      results_full.params['educ'] * profiles_pred['educ'] +
      results_full.params['kids'] * profiles_pred['kids'] +
      results_full.params['married'] * profiles_pred['married'] +
      results_full.params['exper'] * profiles_pred['exper'])

# Apply logit transformation
profiles_pred['prob_working'] = 1 / (1 + np.exp(-xb))

print("="*80)
print(" " * 15 + "PREDICTED PROBABILITIES FOR REPRESENTATIVE WOMEN")
print("="*80)
print(profiles_pred.to_string(index=False))

print("\n" + "="*80)
print("INTERPRETATION:")
print("="*80)

for i, row in profiles_pred.iterrows():
    print(f"\nProfile {i+1}:")
    print(f"  • Age: {int(row['age'])} years")
    print(f"  • Education: {int(row['educ'])} years ({'College' if row['educ'] >= 16 else 'High School'})")
    print(f"  • Children: {int(row['kids'])}")
    print(f"  • Marital status: {'Married' if row['married'] else 'Single'}")
    print(f"  • Experience: {int(row['exper'])} years")
    print(f"  " + "-" * 60)
    print(f"  → Predicted probability of working: {100*row['prob_working']:.1f}%")

print("\n" + "="*80)

### 9.5 Visualization: Age Profile

How does the probability of working vary with age?

In [None]:
# Create age profile (holding other variables constant)
age_range = np.arange(20, 65, 1)
age_profile = pd.DataFrame({
    'age': age_range,
    'educ': 12,      # High school
    'kids': 1,       # 1 child (average)
    'married': 0,    # Single
    'exper': 5       # 5 years experience
})

# Calculate predictions
xb_age = (results_full.params['Intercept'] +
          results_full.params['age'] * age_profile['age'] +
          results_full.params['I(age ** 2)'] * (age_profile['age'] ** 2) +
          results_full.params['educ'] * age_profile['educ'] +
          results_full.params['kids'] * age_profile['kids'] +
          results_full.params['married'] * age_profile['married'] +
          results_full.params['exper'] * age_profile['exper'])

age_profile['prob'] = 1 / (1 + np.exp(-xb_age))

# Plot
plt.figure(figsize=(12, 7))
plt.plot(age_profile['age'], age_profile['prob'], linewidth=3, color='#2980b9')
plt.xlabel('Age (years)', fontsize=13)
plt.ylabel('Probability of Working', fontsize=13)
plt.title('Labor Force Participation by Age\n(12 years education, 1 child, single, 5 years experience)',
          fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, linewidth=2, label='50% threshold')

# Find peak
peak_idx = age_profile['prob'].idxmax()
peak_age = age_profile.loc[peak_idx, 'age']
peak_prob = age_profile.loc[peak_idx, 'prob']
plt.scatter([peak_age], [peak_prob], color='red', s=200, zorder=5, marker='o')
plt.annotate(f'Peak: Age {peak_age:.0f}\nP = {peak_prob:.2f}',
             xy=(peak_age, peak_prob), xytext=(peak_age+5, peak_prob-0.05),
             fontsize=11, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7),
             arrowprops=dict(arrowstyle='->', color='red', lw=2))

plt.ylim([0, 1])
plt.xlim([20, 65])
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()

print(f"\nKey Finding: Participation probability peaks at age {peak_age:.0f} ({peak_prob:.1%})")
print(f"This reflects the lifecycle pattern of female labor force participation.")

<a id='exercises'></a>
## 10. Exercises

Test your understanding with these exercises!

---

### Exercise 1: Home Purchase Decision (Easy)

**Scenario**: Predict whether households purchase a home based on income, age, and credit score.

**Task**:
1. Generate synthetic data with `n=2000` observations
2. Estimate LPM, Logit, and Probit models
3. Compare predicted probabilities
4. Which model would you prefer and why?

**Starter code**:

In [None]:
# Exercise 1: Your solution here

# Step 1: Generate synthetic data
np.random.seed(123)
n = 2000

# TODO: Create DataFrame with:
#   - income (normal, mean=50000, sd=15000)
#   - age (uniform, 25-65)
#   - credit_score (normal, mean=700, sd=50)
#   - bought_home (binary, based on logit model)

# Step 2: Estimate models
# TODO: Estimate LPM, Logit, Probit

# Step 3: Compare predictions
# TODO: Create scatter plots

# Step 4: Discussion
# TODO: Write your conclusion

**Hint**: Use formula: `bought_home ~ income + age + credit_score`

---

### Exercise 2: Odds Ratios Interpretation (Medium)

Using the labor participation Logit model from Section 9:

**Task**:
1. Calculate odds ratios for all variables
2. Write a one-paragraph interpretation of each coefficient
3. Which variables have the strongest effects?
4. Compare the effect sizes

**Questions to answer**:
- How do you interpret the odds ratio for education?
- What does a negative coefficient mean in terms of odds?
- Can you rank variables by importance?

In [None]:
# Exercise 2: Your solution here

# Step 1: Calculate odds ratios
# TODO: Use results_full from Section 9

# Step 2: Interpret each variable
# TODO: Write interpretations

# Step 3: Rank by effect size
# TODO: Compare absolute values of the coefficients (|β|), keeping in mind
#       that the variables are measured in different units

---

### Exercise 3: Model Comparison (Medium)

Compare Logit and Probit models systematically.

**Task**:
1. Estimate both models with identical specifications
2. Compare AIC, BIC, and pseudo-R²
3. Assess how closely the two sets of predictions agree using their correlation
4. Create visualization comparing predicted probabilities
5. Calculate mean absolute difference in predictions

**Bonus**: Estimate a model with interaction terms (e.g., `educ:married`) and repeat comparison.

In [None]:
# Exercise 3: Your solution here

# Step 1: Estimate both models
# TODO: Use full specification from Section 9

# Step 2: Compare fit statistics
# TODO: Create comparison table

# Step 3: Test prediction differences
# TODO: Calculate correlation and MAE

# Step 4: Visualize
# TODO: Scatter plot with 45-degree line

# Bonus: Interaction model
# TODO: Add interaction terms

---

### Exercise 4: Classification Threshold Analysis (Hard)

Explore how classification threshold affects performance.

**Task**:
1. Calculate classification metrics for thresholds: 0.3, 0.4, 0.5, 0.6, 0.7
2. Plot precision-recall trade-off curve
3. Which threshold maximizes F1-score?
4. Discuss when you might choose a threshold other than 0.5

**Advanced**:
- Calculate the cost-sensitive threshold if a false negative costs twice as much as a false positive
- Plot threshold vs accuracy/precision/recall on same graph

In [None]:
# Exercise 4: Your solution here

# Step 1: Loop over thresholds
thresholds_to_test = [0.3, 0.4, 0.5, 0.6, 0.7]
results_by_threshold = []

# TODO: Calculate metrics for each threshold

# Step 2: Plot precision-recall trade-off
# TODO: Create plot

# Step 3: Find optimal threshold
# TODO: Maximize F1-score

# Step 4: Discussion
# TODO: Write discussion on threshold selection

**Discussion prompts**:
- Medical diagnosis: Would you use threshold = 0.5?
- Spam detection: What threshold makes sense?
- Loan approval: How does cost asymmetry affect choice?

---

## Summary and Key Takeaways

### What We Learned

1. **Binary outcomes require specialized models** — LPM has fundamental problems

2. **Link functions** (logistic and standard normal CDFs) ensure predicted probabilities stay in $[0,1]$

3. **Logit and Probit** give nearly identical results in practice

4. **Coefficients** are NOT marginal effects:
   - Logit: log-odds ratios → interpret via $\exp(\beta)$
   - Probit: latent variable effects

5. **Pseudo-R²** is NOT like OLS R² — use for comparison only

6. **Classification metrics** (accuracy, precision, recall) assess prediction quality

7. **Applied interpretation** requires understanding odds ratios and probability predictions
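
Takeaway 4 can be checked numerically: a fixed odds ratio translates into different probability changes depending on the baseline. The sketch below uses a hypothetical odds ratio of 1.5 and a helper named `prob_after_odds_ratio`, both chosen for illustration and not taken from the estimates in this notebook.

```python
def prob_after_odds_ratio(p_baseline, odds_ratio):
    """Probability implied by multiplying the baseline odds by odds_ratio."""
    odds = p_baseline / (1.0 - p_baseline) * odds_ratio
    return odds / (1.0 + odds)

odds_ratio = 1.5  # hypothetical value, not an estimate from this notebook
for p0 in (0.10, 0.50, 0.90):
    p1 = prob_after_odds_ratio(p0, odds_ratio)
    print(f"baseline P = {p0:.2f} -> new P = {p1:.3f} (change = {p1 - p0:+.3f})")
```

The change in probability is largest near a baseline of 0.5 and shrinks toward the extremes, which is why marginal effects have to be computed separately from the coefficients.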

### Next Steps

**Notebook 02**: Fixed Effects Logit — control for unobserved heterogeneity

**Notebook 03**: Random Effects — model individual-specific effects

**Notebook 04**: Marginal Effects — calculate proper marginal effects and elasticities

---

## References

### Essential Reading

1. **Wooldridge, J. M. (2010)**. *Econometric Analysis of Cross Section and Panel Data* (2nd ed.). MIT Press. Chapter 15.

2. **Cameron, A. C., & Trivedi, P. K. (2005)**. *Microeconometrics: Methods and Applications*. Cambridge University Press. Chapter 14.

3. **Greene, W. H. (2018)**. *Econometric Analysis* (8th ed.). Pearson. Chapter 17.

### Classic References

4. **Hosmer, D. W., & Lemeshow, S. (2000)**. *Applied Logistic Regression* (2nd ed.). Wiley.

5. **Mroz, T. A. (1987)**. "The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions." *Econometrica*, 55(4), 765-799.

---

**Thank you for completing this tutorial!**

Questions or feedback? Visit: https://github.com/panelbox/panelbox/issues