# Tutorial: Panel Discrete Choice Models with PanelBox

## Introduction

This tutorial demonstrates how to use PanelBox for estimating discrete choice models with panel data. We'll work through a practical example of labor force participation, covering:

1. **Pooled models** (Logit and Probit)
2. **Fixed Effects Logit** (Conditional Maximum Likelihood)
3. **Random Effects Probit** (Butler & Moffitt quadrature)
4. **Marginal Effects** interpretation
5. **Model Selection** using specification tests
6. **Common pitfalls** and best practices


## 1. Setup and Data Loading

First, let's import the necessary libraries. The example data on labor force participation is generated in the next section.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import PanelBox discrete choice models
import panelbox as pb
from panelbox.models.discrete import (
    PooledLogit,
    PooledProbit,
    FixedEffectsLogit,
    RandomEffectsProbit
)

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print(f"PanelBox version: {pb.__version__}")

## 2. Generate Synthetic Panel Data

For this tutorial, we'll simulate data on labor force participation with the following variables:

- **labor_force**: Binary outcome (1 = in labor force, 0 = not)
- **age**: Age of individual
- **education**: Years of education
- **married**: Marital status (1 = married, 0 = not)
- **children**: Number of children under 6
- **health**: Self-reported health status (1-5 scale)

In [None]:
def generate_labor_data(n_individuals=500, n_periods=8, seed=42):
    """Generate synthetic panel data for labor force participation."""
    np.random.seed(seed)
    
    # Create panel structure
    person_id = np.repeat(range(1, n_individuals + 1), n_periods)
    year = np.tile(range(2015, 2015 + n_periods), n_individuals)
    
    # Individual-specific effects (unobserved heterogeneity)
    alpha_i = np.repeat(np.random.normal(0, 0.8, n_individuals), n_periods)
    
    # Time-invariant characteristics
    education_i = np.repeat(np.random.normal(12, 3, n_individuals), n_periods)
    
    # Time-varying characteristics
    age_base = np.repeat(np.random.uniform(25, 55, n_individuals), n_periods)
    age = age_base + np.tile(range(n_periods), n_individuals)
    
    # Marital status (can change over time)
    married = np.random.binomial(1, 0.6, len(person_id))
    
    # Number of young children (changes over time)
    children = np.random.poisson(0.8, len(person_id))
    children = np.minimum(children, 4)  # Cap at 4
    
    # Health status (1-5 scale)
    health = np.random.choice([1, 2, 3, 4, 5], len(person_id), p=[0.05, 0.15, 0.40, 0.30, 0.10])
    
    # Generate labor force participation (latent variable model)
    # True coefficients
    beta_age = -0.02
    beta_age2 = -0.0002
    beta_education = 0.15
    beta_married = -0.3  # Negative (stylized effect; the simulated data has no gender variable)
    beta_children = -0.5  # Strong negative effect
    beta_health = 0.3
    
    # Latent variable
    y_star = (alpha_i + 
              beta_age * age + 
              beta_age2 * age**2 +
              beta_education * education_i + 
              beta_married * married + 
              beta_children * children + 
              beta_health * health +
              np.random.normal(0, 1, len(person_id)))
    
    # Binary outcome
    labor_force = (y_star > 0).astype(int)
    
    # Create DataFrame
    data = pd.DataFrame({
        'person_id': person_id,
        'year': year,
        'labor_force': labor_force,
        'age': age,
        'age_squared': age**2,
        'education': education_i,
        'married': married,
        'children': children,
        'health': health
    })
    
    return data

# Generate the data
data = generate_labor_data(n_individuals=500, n_periods=8)
data = data.set_index(['person_id', 'year'])

print("Data shape:", data.shape)
print("\nFirst few observations:")
print(data.head(10))
print("\nLabor force participation rate:", data['labor_force'].mean())
print("\nSummary statistics:")
print(data.describe())

## 3. Data Exploration

Before modeling, let's explore the data to understand patterns.

In [None]:
# Visualize labor force participation by key variables
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# By age
age_bins = pd.cut(data['age'], bins=5)
age_participation = data.groupby(age_bins, observed=False)['labor_force'].mean()
axes[0, 0].bar(range(len(age_participation)), age_participation.values)
axes[0, 0].set_xticks(range(len(age_participation)))  # fix tick positions before labeling
axes[0, 0].set_xticklabels([str(i) for i in age_participation.index], rotation=45)
axes[0, 0].set_title('Participation by Age')
axes[0, 0].set_ylabel('Participation Rate')

# By education
edu_bins = pd.cut(data['education'], bins=5)
edu_participation = data.groupby(edu_bins, observed=False)['labor_force'].mean()
axes[0, 1].bar(range(len(edu_participation)), edu_participation.values)
axes[0, 1].set_xticks(range(len(edu_participation)))  # fix tick positions before labeling
axes[0, 1].set_xticklabels([str(i) for i in edu_participation.index], rotation=45)
axes[0, 1].set_title('Participation by Education')

# By marital status
marital_participation = data.groupby('married')['labor_force'].mean()
axes[0, 2].bar(['Not Married', 'Married'], marital_participation.values)
axes[0, 2].set_title('Participation by Marital Status')

# By number of children
children_participation = data.groupby('children')['labor_force'].mean()
axes[1, 0].bar(children_participation.index, children_participation.values)
axes[1, 0].set_title('Participation by Number of Children')
axes[1, 0].set_xlabel('Number of Children')
axes[1, 0].set_ylabel('Participation Rate')

# By health status
health_participation = data.groupby('health')['labor_force'].mean()
axes[1, 1].bar(health_participation.index, health_participation.values)
axes[1, 1].set_title('Participation by Health Status')
axes[1, 1].set_xlabel('Health (1=Poor, 5=Excellent)')

# Time trend
time_participation = data.groupby(data.index.get_level_values('year'))['labor_force'].mean()
axes[1, 2].plot(time_participation.index, time_participation.values, marker='o')
axes[1, 2].set_title('Participation Over Time')
axes[1, 2].set_xlabel('Year')

plt.tight_layout()
plt.show()

# Within-individual variation
within_variation = data.groupby(level='person_id')['labor_force'].agg(['mean', 'std'])
print(f"\nProportion with no within-variation: {(within_variation['std'] == 0).mean():.1%}")
print(f"Average within-person variation: {within_variation['std'].mean():.3f}")

## 4. Pooled Logit Model (Baseline)

We start with a pooled logit model that ignores the panel structure.

In [None]:
# Estimate Pooled Logit
pooled_logit = PooledLogit.from_formula(
    'labor_force ~ age + age_squared + education + married + children + health',
    data=data
)
pooled_logit_result = pooled_logit.fit()

# Display results
print("="*60)
print("POOLED LOGIT RESULTS")
print("="*60)
print(pooled_logit_result.summary())

# Key statistics
print(f"\nMcFadden Pseudo R²: {pooled_logit_result.pseudo_r2('mcfadden'):.4f}")
print(f"Log-likelihood: {pooled_logit_result.llf:.2f}")
print(f"AIC: {pooled_logit_result.aic:.2f}")
print(f"BIC: {pooled_logit_result.bic:.2f}")
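
As a rough guide to what these fit statistics measure, the sketch below recomputes McFadden's pseudo R², AIC, and BIC by hand from a log-likelihood. The outcomes and fitted probabilities are made-up illustrative numbers, not PanelBox output, and `k = 2` is an assumed parameter count.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y under probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

# Hypothetical outcomes and fitted probabilities (illustration only)
y = [1, 1, 1, 0, 1, 0, 1, 1]
p_model = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.8, 0.9]

# The null (intercept-only) model predicts the sample rate for everyone
p_null = [sum(y) / len(y)] * len(y)

ll_model = log_likelihood(y, p_model)
ll_null = log_likelihood(y, p_null)

k = 2        # assumed number of estimated parameters
n = len(y)   # number of observations

mcfadden_r2 = 1 - ll_model / ll_null   # improvement over the null model
aic = 2 * k - 2 * ll_model
bic = k * math.log(n) - 2 * ll_model

print(f"McFadden R2: {mcfadden_r2:.4f}, AIC: {aic:.2f}, BIC: {bic:.2f}")
```

Lower AIC and BIC indicate a better trade-off between fit and complexity; they are only comparable across models fitted to the same observations.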

## 5. Pooled Probit Model

For comparison, let's also estimate a pooled probit model.

In [None]:
# Estimate Pooled Probit
pooled_probit = PooledProbit.from_formula(
    'labor_force ~ age + age_squared + education + married + children + health',
    data=data
)
pooled_probit_result = pooled_probit.fit()

print("="*60)
print("POOLED PROBIT RESULTS")
print("="*60)
print(pooled_probit_result.summary())

# Compare with Logit
print("\n" + "="*60)
print("COMPARISON: LOGIT vs PROBIT")
print("="*60)

comparison = pd.DataFrame({
    'Logit Coef': pooled_logit_result.params,
    'Probit Coef': pooled_probit_result.params,
    'Ratio': pooled_logit_result.params / pooled_probit_result.params
})
print(comparison)
print(f"\nAverage ratio (Logit/Probit): {comparison['Ratio'].mean():.3f}")
print("(A common rule of thumb puts this ratio at about 1.6)")
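
The figure of about 1.6 can be motivated without fitting any model. The sketch below, which uses only the standard library, compares the standard normal and logistic distributions directly: the ratio of their densities at zero is close to 1.6, and rescaling the logistic index by 1.6 keeps the two CDFs within a couple of percentage points of each other. The grid range of [-4, 4] is an arbitrary choice for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

# Ratio of the densities at zero: phi(0) / lambda(0), where lambda(0) = 0.25
density_ratio = (1.0 / math.sqrt(2.0 * math.pi)) / 0.25

# Largest gap between Phi(z) and Lambda(1.6 * z) on a grid over [-4, 4]
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(normal_cdf(z) - logistic_cdf(1.6 * z)) for z in grid)

print(f"Density ratio at zero: {density_ratio:.3f}")
print(f"Max CDF gap after rescaling by 1.6: {max_gap:.4f}")
```

Because the two curves differ most in the tails, the ratio of individual coefficients can drift away from 1.6 when predicted probabilities are close to 0 or 1.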

## 6. Fixed Effects Logit

Now let's control for unobserved individual heterogeneity using Fixed Effects Logit.

**Important:** FE Logit drops individuals with no within-variation in the outcome.
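
To see why this happens, consider two periods. Conditional on an individual having exactly one period in the labor force, the probability of which period it was does not depend on the individual effect. A sequence of all zeros or all ones is certain given its sum, so it carries no information about the coefficients. The sketch below checks the cancellation numerically; `alpha`, `beta`, `x1`, and `x2` are hypothetical values chosen for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_01_given_one_success(alpha, beta, x1, x2):
    """P(y1=0, y2=1 | y1 + y2 = 1) in a two-period logit with effect alpha."""
    p1 = logistic(alpha + beta * x1)
    p2 = logistic(alpha + beta * x2)
    p01 = (1 - p1) * p2   # out in period 1, in during period 2
    p10 = p1 * (1 - p2)   # in during period 1, out in period 2
    return p01 / (p01 + p10)

beta, x1, x2 = 0.5, 0.0, 1.0   # hypothetical coefficient and regressor values

# The closed form depends only on beta and the change in x
closed_form = logistic(beta * (x2 - x1))

for alpha in (-2.0, 0.0, 3.0):
    conditional = prob_01_given_one_success(alpha, beta, x1, x2)
    print(f"alpha = {alpha:+.1f}: conditional prob = {conditional:.6f}")

print(f"Closed form without alpha:   {closed_form:.6f}")
```

The same logic extends to more than two periods by conditioning on the total number of ones per individual.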

In [None]:
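# Note: education is time-invariant, so FE Logit cannot identify its effect.
# Age does vary within individuals, but only as a deterministic one-year-per-period
# trend, so this specification leaves age and age_squared out as well.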
fe_logit = FixedEffectsLogit.from_formula(
    'labor_force ~ married + children + health',
    data=data
)
fe_logit_result = fe_logit.fit()

print("="*60)
print("FIXED EFFECTS LOGIT RESULTS")
print("="*60)
print(fe_logit_result.summary())

print(f"\nNumber of groups: {fe_logit_result.n_groups}")
print(f"Groups dropped (no within variation): {fe_logit_result.n_dropped_groups}")
print(f"Observations used: {fe_logit_result.nobs}")

## 7. Random Effects Probit

Random Effects models allow us to include time-invariant variables while still accounting for individual heterogeneity. The price is an extra assumption: the individual effects must be uncorrelated with the regressors.
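
The Butler & Moffitt approach integrates the normally distributed individual effect out of each person's likelihood using Gauss-Hermite quadrature. The following is a minimal sketch of that idea for a single individual, not PanelBox's implementation. The function name `re_probit_likelihood`, the linear indices in `xb`, and the choice of 12 quadrature points are assumptions made for illustration.

```python
import math
from itertools import product

from numpy.polynomial.hermite import hermgauss

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def re_probit_likelihood(y, xb, sigma_alpha, n_points=12):
    """Likelihood of one individual's outcome sequence, integrating out alpha_i."""
    nodes, weights = hermgauss(n_points)  # nodes/weights for weight exp(-x^2)
    total = 0.0
    for node, weight in zip(nodes, weights):
        # Change of variables so that alpha ~ N(0, sigma_alpha^2)
        alpha = math.sqrt(2.0) * sigma_alpha * node
        seq_prob = 1.0
        for y_t, xb_t in zip(y, xb):
            sign = 2 * y_t - 1  # +1 if y_t = 1, -1 if y_t = 0
            seq_prob *= normal_cdf(sign * (xb_t + alpha))
        total += weight * seq_prob
    return total / math.sqrt(math.pi)

xb = [0.2, -0.1, 0.5]  # hypothetical linear indices x_it'beta for T = 3
lik = re_probit_likelihood([1, 0, 1], xb, sigma_alpha=0.8)

# Sanity check: probabilities of all 2^T possible sequences sum to one
total_prob = sum(re_probit_likelihood(list(seq), xb, 0.8)
                 for seq in product([0, 1], repeat=3))

print(f"Likelihood of (1, 0, 1): {lik:.4f}")
print(f"Sum over all sequences:  {total_prob:.6f}")
```

More quadrature points improve accuracy when `sigma_alpha` is large or T is long, at a proportional cost in computation.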

In [None]:
# Random Effects Probit
re_probit = RandomEffectsProbit.from_formula(
    'labor_force ~ age + age_squared + education + married + children + health',
    data=data
)
re_probit_result = re_probit.fit()

print("="*60)
print("RANDOM EFFECTS PROBIT RESULTS")
print("="*60)
print(re_probit_result.summary())

# Variance decomposition
sigma_alpha = re_probit_result.sigma_alpha
rho = sigma_alpha**2 / (sigma_alpha**2 + 1)  # For probit, error variance = 1
print(f"\nSigma_alpha (RE std dev): {sigma_alpha:.4f}")
print(f"Rho (intraclass correlation): {rho:.4f}")
print(f"Proportion of variance due to individual effects: {rho:.1%}")
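
Because the data are simulated, these estimates can be compared with the values used in `generate_labor_data`, where the individual effect has standard deviation 0.8 and the idiosyncratic error has standard deviation 1. The sketch below computes the implied intraclass correlation. It also computes the factor by which a pooled probit is expected to shrink the coefficients when a normal individual effect, independent of the regressors, is ignored. That attenuation result is a textbook benchmark under those assumptions, not something PanelBox reports.

```python
import math

sigma_alpha_true = 0.8  # std. dev. of alpha_i in generate_labor_data
sigma_eps = 1.0         # std. dev. of the idiosyncratic error

# Share of the composite error variance due to the individual effect
rho_true = sigma_alpha_true**2 / (sigma_alpha_true**2 + sigma_eps**2)

# A pooled probit normalizes the composite error to variance 1,
# so it targets beta / sqrt(1 + sigma_alpha^2) rather than beta
attenuation = 1.0 / math.sqrt(1.0 + sigma_alpha_true**2)

beta_children = -0.5  # true coefficient used in the simulation
print(f"True rho: {rho_true:.4f}")
print(f"Pooled probit attenuation factor: {attenuation:.4f}")
print(f"Expected pooled probit coefficient on children: {beta_children * attenuation:.4f}")
```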

## 8. Marginal Effects

For nonlinear models, coefficients don't directly represent marginal effects. Let's calculate them.
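
For a logit with a continuous regressor, the marginal effect at a point is the coefficient times p(1 - p). The sketch below computes the two common summaries by hand: the average marginal effect (AME) averages that quantity over observations, while the marginal effect at the mean (MEM) evaluates it once at the average regressor value. The coefficients and data are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta1 = -0.5, 1.2                # hypothetical logit coefficients
x = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]    # hypothetical regressor values

def marginal_effect(xi):
    """Derivative of P(y=1 | x) with respect to x at x = xi."""
    p = logistic(beta0 + beta1 * xi)
    return beta1 * p * (1.0 - p)

# AME: average the observation-level effects
ame = sum(marginal_effect(xi) for xi in x) / len(x)

# MEM: evaluate the effect once, at the sample mean of x
x_bar = sum(x) / len(x)
mem = marginal_effect(x_bar)

print(f"AME: {ame:.4f}")
print(f"MEM: {mem:.4f}")
print(f"Upper bound beta1 / 4: {beta1 / 4:.4f}")
```

The two summaries differ here because the mean of `x` sits where the logistic curve is steepest, while several observations lie in the flat tails. Neither can exceed `beta1 / 4` in absolute value, since p(1 - p) is at most 0.25.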

In [None]:
# Average Marginal Effects (AME) for Pooled Logit
ame_logit = pooled_logit_result.marginal_effects(kind='average')

print("="*60)
print("AVERAGE MARGINAL EFFECTS - POOLED LOGIT")
print("="*60)
print(ame_logit.summary())

# Marginal Effects at Means (MEM)
mem_logit = pooled_logit_result.marginal_effects(kind='at_means')

print("\n" + "="*60)
print("MARGINAL EFFECTS AT MEANS - POOLED LOGIT")
print("="*60)
print(mem_logit.summary())

# Compare AME vs MEM
comparison = pd.DataFrame({
    'AME': ame_logit.effects,
    'MEM': mem_logit.effects,
    'Difference': ame_logit.effects - mem_logit.effects
})
print("\n" + "="*60)
print("COMPARISON: AME vs MEM")
print("="*60)
print(comparison)

# Interpretation
print("\n" + "="*60)
print("INTERPRETATION OF MARGINAL EFFECTS")
print("="*60)
print(f"One additional child reduces the probability of labor force participation by {abs(ame_logit.effects['children']*100):.1f} percentage points on average.")
print(f"Being married reduces the probability by {abs(ame_logit.effects['married']*100):.1f} percentage points.")
print(f"One year of additional education increases the probability by {ame_logit.effects['education']*100:.1f} percentage points.")

## 9. Model Comparison and Selection

Let's compare the different models and perform specification tests.

In [None]:
# Create comparison table
models = {
    'Pooled Logit': pooled_logit_result,
    'Pooled Probit': pooled_probit_result,
    'FE Logit': fe_logit_result,
    'RE Probit': re_probit_result
}

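# Extract common coefficients for comparison
# (logit and probit coefficients are on different scales, so compare signs and
# relative magnitudes across columns rather than raw values)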
common_vars = ['married', 'children', 'health']
comparison = pd.DataFrame()

for name, result in models.items():
    coefs = []
    ses = []
    for var in common_vars:
        if var in result.params.index:
            coefs.append(result.params[var])
            ses.append(result.bse[var])
        else:
            coefs.append(np.nan)
            ses.append(np.nan)
    
    comparison[name] = coefs
    comparison[name + ' SE'] = ses

comparison.index = common_vars

print("="*80)
print("MODEL COMPARISON - COEFFICIENTS")
print("="*80)
print(comparison)

# Model fit statistics
fit_stats = pd.DataFrame()
for name, result in models.items():
    stats = {
        'Log-Likelihood': result.llf,
        'AIC': result.aic if hasattr(result, 'aic') else np.nan,
        'BIC': result.bic if hasattr(result, 'bic') else np.nan,
        'N': result.nobs
    }
    fit_stats[name] = pd.Series(stats)

print("\n" + "="*80)
print("MODEL FIT STATISTICS")
print("="*80)
print(fit_stats.T)

# Hausman test (FE vs RE) - would typically be done here
print("\n" + "="*80)
print("SPECIFICATION TESTS")
print("="*80)
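print("Note: Hausman test for FE vs RE would be performed here.")
print("If p < 0.05, reject RE in favor of FE (evidence that the individual effects are correlated with the regressors).")
print("Compare like with like (e.g. FE Logit vs RE Logit): logit and probit coefficients are on different scales.")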
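
As a rough illustration of the mechanics, the sketch below computes a Hausman statistic for a single coefficient, where the statistic is chi-square with one degree of freedom. The helper `hausman_scalar` and all numbers are hypothetical, and the sketch assumes both estimates come from models on the same scale. It is not a PanelBox function.

```python
import math

def hausman_scalar(b_fe, se_fe, b_re, se_re):
    """Hausman statistic and p-value for one coefficient (chi-square, 1 df)."""
    var_diff = se_fe**2 - se_re**2
    if var_diff <= 0:
        raise ValueError("Var(FE) - Var(RE) is not positive; the test is undefined here.")
    h_stat = (b_fe - b_re) ** 2 / var_diff
    # Survival function of a chi-square(1) variable, written with erfc
    p_value = math.erfc(math.sqrt(h_stat / 2.0))
    return h_stat, p_value

# Hypothetical estimates of the same coefficient from an FE and an RE model
h_stat, p_value = hausman_scalar(b_fe=-0.60, se_fe=0.10, b_re=-0.50, se_re=0.08)
print(f"Hausman statistic: {h_stat:.3f}, p-value: {p_value:.3f}")
```

With several coefficients the same idea uses the vector of differences and the difference of the two covariance matrices, with degrees of freedom equal to the number of coefficients compared.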

## 10. Predictions and Classification

Let's evaluate model performance using predictions.

In [None]:
# Get predictions from Pooled Logit
predictions = pooled_logit_result.predict(data)

# Classification at different thresholds
thresholds = [0.3, 0.5, 0.7]
for threshold in thresholds:
    predicted_class = (predictions > threshold).astype(int)
    accuracy = (predicted_class == data['labor_force']).mean()
    sensitivity = ((predicted_class == 1) & (data['labor_force'] == 1)).sum() / (data['labor_force'] == 1).sum()
    specificity = ((predicted_class == 0) & (data['labor_force'] == 0)).sum() / (data['labor_force'] == 0).sum()
    
    print(f"\nThreshold = {threshold}")
    print(f"  Accuracy: {accuracy:.3f}")
    print(f"  Sensitivity: {sensitivity:.3f}")
    print(f"  Specificity: {specificity:.3f}")

# ROC Curve
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(data['labor_force'], predictions)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Pooled Logit Model')
plt.legend(loc="lower right")
plt.show()

# Classification table
print("\n" + "="*60)
print("CLASSIFICATION TABLE (Threshold = 0.5)")
print("="*60)
print(pooled_logit_result.classification_table(threshold=0.5))

## 11. Common Pitfalls and Best Practices

### Common Pitfalls to Avoid:

1. **Interpreting coefficients as marginal effects**
   - In nonlinear models, coefficients ≠ marginal effects
   - Always calculate AME or MEM for interpretation

2. **Ignoring panel structure**
   - Pooled models assume independence across observations
   - Use cluster-robust standard errors at minimum

3. **Fixed Effects with time-invariant variables**
   - FE models cannot identify effects of time-invariant variables
   - Use RE or Mundlak correction if these are important

4. **Ignoring the incidental parameters problem**
   - Unconditional FE Probit (one dummy per individual) is inconsistent when T is fixed
   - Use FE Logit (conditional MLE) instead

5. **Not checking within-variation**
   - FE Logit drops units with no within-variation in the outcome
   - Check how many observations are actually used
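
The Mundlak correction mentioned in pitfall 3 amounts to adding each individual's time averages of the time-varying regressors to a Random Effects specification. The sketch below builds those columns with pandas on a tiny hypothetical panel; the `_mean` suffix is a naming choice made here, not a PanelBox convention.

```python
import pandas as pd

# Tiny hypothetical panel: 2 individuals, 3 years each
panel = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 2],
    "year": [2015, 2016, 2017] * 2,
    "children": [0, 1, 1, 2, 2, 3],
    "married": [0, 0, 1, 1, 1, 1],
}).set_index(["person_id", "year"])

time_varying = ["children", "married"]

# Individual-level means, broadcast back to every row of that individual
means = panel.groupby(level="person_id")[time_varying].transform("mean")
means.columns = [f"{col}_mean" for col in time_varying]

# Augmented data: original regressors plus their individual means
augmented = pd.concat([panel, means], axis=1)
print(augmented)
```

Including the `_mean` columns as extra regressors lets the individual effect be correlated with the regressors through those averages, while time-invariant variables such as education can stay in the model.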

### Best Practices:

1. **Start with pooled models as baseline**
2. **Use Hausman test to choose between FE and RE**
3. **Report marginal effects for interpretation**
4. **Check robustness with different link functions**
5. **Use cluster-robust standard errors**
6. **Validate with out-of-sample predictions**
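
For out-of-sample validation with panel data, it is usually safer to hold out whole individuals rather than random rows, so that the same person's unobserved effect does not appear on both sides of the split. The sketch below shows one way to do this with the standard library; the helper name `split_by_individual` and the 20% test share are assumptions for illustration.

```python
import random

def split_by_individual(person_ids, test_share=0.2, seed=42):
    """Return (train_ids, test_ids) so each individual falls entirely on one side."""
    unique_ids = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(test_share * len(unique_ids)))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids

# Hypothetical panel index: 10 individuals observed for 3 periods each
person_ids = [pid for pid in range(1, 11) for _ in range(3)]

train_ids, test_ids = split_by_individual(person_ids)

# Row positions belonging to each side of the split
train_rows = [i for i, pid in enumerate(person_ids) if pid in train_ids]
test_rows = [i for i, pid in enumerate(person_ids) if pid in test_ids]

print(f"Train individuals: {len(train_ids)}, rows: {len(train_rows)}")
print(f"Test individuals:  {len(test_ids)}, rows: {len(test_rows)}")
```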

In [None]:
# Example: Cluster-robust standard errors for pooled model
pooled_logit_robust = PooledLogit.from_formula(
    'labor_force ~ age + age_squared + education + married + children + health',
    data=data
)
pooled_logit_robust_result = pooled_logit_robust.fit(cov_type='cluster', cov_kwds={'groups': data.index.get_level_values('person_id')})

# Compare standard errors
se_comparison = pd.DataFrame({
    'Default SE': pooled_logit_result.bse,
    'Cluster-Robust SE': pooled_logit_robust_result.bse,
    'Ratio': pooled_logit_robust_result.bse / pooled_logit_result.bse
})

print("="*60)
print("STANDARD ERRORS COMPARISON")
print("="*60)
print(se_comparison)
print("\nCluster-robust SEs are typically larger, accounting for within-person correlation.")
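
The reason clustering matters can be seen in the simplest possible case, the standard error of a sample mean. The sketch below uses made-up data in which outcomes are perfectly correlated within each person. It compares the usual standard error with a cluster-robust one that sums residuals within each cluster before squaring. No small-sample correction is applied, which is a simplification.

```python
import math

# Hypothetical outcomes for 4 individuals over 4 periods, constant within person
clusters = {
    1: [1, 1, 1, 1],
    2: [0, 0, 0, 0],
    3: [1, 1, 1, 1],
    4: [0, 0, 0, 0],
}

values = [v for obs in clusters.values() for v in obs]
n = len(values)
mean = sum(values) / n

# Naive SE: treats all 16 observations as independent
naive_var = sum((v - mean) ** 2 for v in values) / (n - 1) / n
naive_se = math.sqrt(naive_var)

# Cluster-robust SE: sum residuals within each cluster, then square
cluster_var = sum(sum(v - mean for v in obs) ** 2
                  for obs in clusters.values()) / n**2
cluster_se = math.sqrt(cluster_var)

print(f"Naive SE:          {naive_se:.4f}")
print(f"Cluster-robust SE: {cluster_se:.4f}")
print(f"Ratio:             {cluster_se / naive_se:.2f}")
```

With perfect within-person correlation, the data contain roughly 4 independent observations rather than 16, which is why the cluster-robust standard error is about twice the naive one.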

## 12. Conclusion

This tutorial covered the main discrete choice models available in PanelBox:

- **Pooled Logit/Probit**: Simple but ignores panel structure
- **Fixed Effects Logit**: Controls for unobserved heterogeneity
- **Random Effects Probit**: Allows time-invariant variables while accounting for heterogeneity

Key takeaways:
1. Choice of model depends on assumptions about correlation between individual effects and regressors
2. Marginal effects are essential for interpretation
3. Panel structure matters - ignoring it can lead to biased inference
4. Different models may lead to different conclusions - robustness checks are important

### Next Steps

- Try ordered choice models for ordinal outcomes (`OrderedLogit`, `OrderedProbit`)
- Explore count models for count data (`PoissonFixedEffects`, `NegativeBinomial`)
- Use `PanelExperiment` for systematic model comparison
- Apply these methods to your own data!

In [None]:
# Export results for reporting (uncomment the lines you need; nothing is saved by default)

# Export to LaTeX (for papers)
# pooled_logit_result.to_latex('pooled_logit_results.tex')

# Export to HTML (for presentations)
# pooled_logit_result.to_html('pooled_logit_results.html')

print("Tutorial complete!")