# Tutorial: Panel Count Models with PanelBox

## Introduction

This tutorial demonstrates how to use PanelBox for estimating count models with panel data. We'll work through a practical example of patent applications, covering:

1. **Pooled Poisson** regression
2. **Overdispersion** testing and diagnostics
3. **Negative Binomial** models for overdispersed data
4. **Fixed Effects Poisson** (conditional MLE)
5. **Random Effects** count models
6. **Zero-inflated** and **Hurdle** models
7. **Model selection** and interpretation


## 1. Setup

Let's start by importing the libraries and the PanelBox count models used throughout the tutorial.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Import PanelBox count models
import panelbox as pb
from panelbox.models.count import (
    PooledPoisson,
    PoissonFixedEffects,
    RandomEffectsPoisson,
    NegativeBinomial,
    ZeroInflatedPoisson,
    HurdlePoisson
)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print(f"PanelBox version: {pb.__version__}")

## 2. Generate Panel Data for Patent Applications

We'll simulate data on firm patent applications with:
- **patents**: Number of patent applications (count outcome)
- **rd_intensity**: R&D spending as % of revenue
- **firm_size**: Log of firm employees
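- **competition**: Market competition index (varies by year, common to all firms)
- **industry**: Time-invariant industry category, encoded as dummy variables (`Other` is the reference group)
- **year**: Calendar year, 2010 to 2019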
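
The generator below draws counts with `np.random.negative_binomial(lambda_i / dispersion, 1 / (1 + dispersion))`. As a quick check of what that parameterization implies, this sketch plugs the same `n` and `p` into the standard negative binomial moment formulas (mean `n(1-p)/p`, variance `n(1-p)/p**2`). The helper name `nb_moments` is illustrative and not part of PanelBox.

```python
def nb_moments(lam, dispersion):
    """Mean and variance of NB(n, p) with n = lam / dispersion, p = 1 / (1 + dispersion)."""
    n = lam / dispersion
    p = 1.0 / (1.0 + dispersion)
    mean = n * (1.0 - p) / p
    var = n * (1.0 - p) / p ** 2
    return mean, var

# Illustrative values: target mean 4 with the tutorial's dispersion of 0.5
mean, var = nb_moments(lam=4.0, dispersion=0.5)
print(f"mean={mean:.2f}, variance={var:.2f}, ratio={var / mean:.2f}")
```

Under this parameterization the mean stays at `lambda_i` and the variance is `(1 + dispersion)` times the mean, so the simulated counts are overdispersed by construction.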

In [None]:
def generate_patent_data(n_firms=300, n_years=10, seed=42):
    """Generate synthetic panel data for patent applications."""
    np.random.seed(seed)
    
    # Panel structure
    firm_id = np.repeat(range(1, n_firms + 1), n_years)
    year = np.tile(range(2010, 2010 + n_years), n_firms)
    
    # Firm-specific effects (unobserved innovation capability)
    alpha_i = np.repeat(np.random.gamma(2, 0.5, n_firms), n_years)
    
    # Industry categories (time-invariant)
    industries = ['Tech', 'Pharma', 'Manufacturing', 'Other']
    industry_probs = [0.3, 0.2, 0.3, 0.2]
    firm_industries = np.random.choice(industries, n_firms, p=industry_probs)
    industry = np.repeat(firm_industries, n_years)
    
    # Create industry dummies
    industry_tech = (industry == 'Tech').astype(int)
    industry_pharma = (industry == 'Pharma').astype(int)
    industry_manuf = (industry == 'Manufacturing').astype(int)
    
    # R&D intensity (% of revenue, time-varying)
    rd_base = np.repeat(np.random.uniform(0, 10, n_firms), n_years)
    rd_intensity = rd_base + np.random.normal(0, 1, len(firm_id))
    rd_intensity = np.maximum(0, rd_intensity)  # Cannot be negative
    
    # Firm size (log employees, slowly changing)
    size_base = np.repeat(np.random.normal(6, 1.5, n_firms), n_years)
    size_trend = np.tile(np.arange(n_years) * 0.02, n_firms)
    firm_size = size_base + size_trend + np.random.normal(0, 0.2, len(firm_id))
    
    # Market competition (time-varying, affects all firms)
    competition = np.tile(np.random.uniform(0.3, 0.7, n_years), n_firms)
    
    # Generate patent counts (Negative Binomial for overdispersion)
    # True parameters
    beta_rd = 0.15
    beta_size = 0.3
    beta_competition = -0.5
    beta_tech = 0.8
    beta_pharma = 0.6
    beta_manuf = 0.2
    
    # Linear predictor
    lambda_log = (alpha_i + 
                  beta_rd * rd_intensity + 
                  beta_size * firm_size +
                  beta_competition * competition +
                  beta_tech * industry_tech +
                  beta_pharma * industry_pharma +
                  beta_manuf * industry_manuf)
    
    lambda_i = np.exp(lambda_log)
    
    # Add overdispersion using Negative Binomial
    dispersion = 0.5  # Higher = more overdispersion
    patents = np.random.negative_binomial(lambda_i / dispersion, 1 / (1 + dispersion))
    
    # Create DataFrame
    data = pd.DataFrame({
        'firm_id': firm_id,
        'year': year,
        'patents': patents,
        'rd_intensity': rd_intensity,
        'firm_size': firm_size,
        'competition': competition,
        'industry': industry,
        'industry_tech': industry_tech,
        'industry_pharma': industry_pharma,
        'industry_manuf': industry_manuf
    })
    
    return data

# Generate data
data = generate_patent_data(n_firms=300, n_years=10)
data = data.set_index(['firm_id', 'year'])

print("Data shape:", data.shape)
print("\nFirst few observations:")
print(data.head(10))
print("\nSummary statistics:")
print(data.describe())
print(f"\nMean patents: {data['patents'].mean():.2f}")
print(f"Variance of patents: {data['patents'].var():.2f}")
print(f"Variance/Mean ratio: {data['patents'].var()/data['patents'].mean():.2f}")
print("(Ratio > 1 hints at overdispersion; covariates can also inflate the raw ratio)")

## 3. Exploratory Data Analysis

Let's explore the distribution of patents and check for overdispersion.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Distribution of patents
axes[0, 0].hist(data['patents'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Number of Patents')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Patent Applications')
axes[0, 0].axvline(data['patents'].mean(), color='red', linestyle='--', label=f'Mean={data["patents"].mean():.1f}')
axes[0, 0].legend()

# Patents by R&D intensity
rd_bins = pd.qcut(data['rd_intensity'], q=4)
patents_by_rd = data.groupby(rd_bins)['patents'].mean()
axes[0, 1].bar(range(len(patents_by_rd)), patents_by_rd.values)
axes[0, 1].set_xticks(range(len(patents_by_rd)))  # pin ticks to the bars before labeling
axes[0, 1].set_xticklabels([f'Q{i+1}' for i in range(4)])
axes[0, 1].set_xlabel('R&D Intensity Quartile')
axes[0, 1].set_ylabel('Mean Patents')
axes[0, 1].set_title('Patents by R&D Intensity')

# Patents by firm size
size_bins = pd.qcut(data['firm_size'], q=4)
patents_by_size = data.groupby(size_bins)['patents'].mean()
axes[0, 2].bar(range(len(patents_by_size)), patents_by_size.values)
axes[0, 2].set_xticks(range(len(patents_by_size)))  # pin ticks to the bars before labeling
axes[0, 2].set_xticklabels([f'Q{i+1}' for i in range(4)])
axes[0, 2].set_xlabel('Firm Size Quartile')
axes[0, 2].set_ylabel('Mean Patents')
axes[0, 2].set_title('Patents by Firm Size')

# Patents by industry
patents_by_industry = data.groupby('industry')['patents'].mean().sort_values()
axes[1, 0].bar(range(len(patents_by_industry)), patents_by_industry.values)
axes[1, 0].set_xticks(range(len(patents_by_industry)))  # pin ticks to the bars before labeling
axes[1, 0].set_xticklabels(patents_by_industry.index, rotation=45)
axes[1, 0].set_xlabel('Industry')
axes[1, 0].set_ylabel('Mean Patents')
axes[1, 0].set_title('Patents by Industry')

# Time trend
patents_by_year = data.groupby(data.index.get_level_values('year'))['patents'].mean()
axes[1, 1].plot(patents_by_year.index, patents_by_year.values, marker='o')
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Mean Patents')
axes[1, 1].set_title('Patents Over Time')

# Variance-Mean relationship (check for overdispersion)
# Group data and compute mean and variance
grouped = data.groupby('firm_id')['patents'].agg(['mean', 'var'])
axes[1, 2].scatter(grouped['mean'], grouped['var'], alpha=0.5)
axes[1, 2].plot([0, grouped['mean'].max()], [0, grouped['mean'].max()], 'r--', label='Var=Mean (Poisson)')
axes[1, 2].set_xlabel('Mean Patents per Firm')
axes[1, 2].set_ylabel('Variance of Patents per Firm')
axes[1, 2].set_title('Variance-Mean Relationship')
axes[1, 2].legend()

plt.tight_layout()
plt.show()

# Check for excess zeros
zero_prop = (data['patents'] == 0).mean()
print(f"\nProportion of zeros: {zero_prop:.1%}")
if zero_prop > 0.3:
    print("High proportion of zeros - consider zero-inflated models")

## 4. Pooled Poisson Model (Baseline)

Start with the simplest count model, pooled Poisson, which ignores the panel structure and treats every firm-year as an independent observation.
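
Because the model uses a log link, a coefficient `beta` translates into a `(exp(beta) - 1) * 100` percent change in the expected count. The sketch below applies that formula to the true coefficients from `generate_patent_data`, which gives a benchmark for the estimates; `pct_change` is an illustrative helper, not a PanelBox function.

```python
import math

# True coefficients used by generate_patent_data in this tutorial
true_betas = {"rd_intensity": 0.15, "firm_size": 0.3, "competition": -0.5}

def pct_change(beta, delta=1.0):
    """Percent change in the expected count when x rises by `delta` under a log link."""
    return (math.exp(beta * delta) - 1.0) * 100.0

for name, beta in true_betas.items():
    print(f"{name}: {pct_change(beta):+.1f}% per unit")
```

Since `competition` only ranges from 0.3 to 0.7 in the simulated data, a smaller step such as `pct_change(-0.5, delta=0.1)` is the more realistic comparison for that variable.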

In [None]:
# Estimate Pooled Poisson
poisson_model = PooledPoisson.from_formula(
    'patents ~ rd_intensity + firm_size + competition + industry_tech + industry_pharma + industry_manuf',
    data=data
)
poisson_result = poisson_model.fit()

print("="*60)
print("POOLED POISSON RESULTS")
print("="*60)
print(poisson_result.summary())

# Interpretation
print("\n" + "="*60)
print("INTERPRETATION (Multiplicative Effects)")
print("="*60)
for var in ['rd_intensity', 'firm_size', 'competition']:
    coef = poisson_result.params[var]
    effect = (np.exp(coef) - 1) * 100
    print(f"{var}: 1-unit increase → {effect:.1f}% change in expected patents")

# Industry effects
for var in ['industry_tech', 'industry_pharma', 'industry_manuf']:
    coef = poisson_result.params[var]
    effect = (np.exp(coef) - 1) * 100
    industry_name = var.replace('industry_', '').title()
    print(f"{industry_name} vs Other: {effect:.1f}% more patents")

## 5. Testing for Overdispersion

Poisson assumes the conditional variance equals the conditional mean. Let's test this assumption.
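
One common summary of this assumption is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom, which should be close to 1 under Poisson. The sketch below is a generic version on toy data; PanelBox's `overdispersion_test()` may use a different statistic, and `pearson_dispersion` is an illustrative name.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-squared divided by residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Toy counts with an intercept-only model, so every fitted mean is the sample mean
y = [0, 0, 1, 2, 3, 4, 9, 13]
mu = [sum(y) / len(y)] * len(y)
print(f"dispersion = {pearson_dispersion(y, mu, n_params=1):.2f}")
```

Values well above 1, as in this toy sample, indicate more spread than a Poisson model with those fitted means can explain.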

In [None]:
# Overdispersion test
od_test = poisson_result.overdispersion_test()

print("="*60)
print("OVERDISPERSION TEST")
print("="*60)
print(f"Dispersion parameter: {od_test['dispersion']:.3f}")
print(f"Chi-squared statistic: {od_test['statistic']:.2f}")
print(f"P-value: {od_test['p_value']:.4f}")

if od_test['p_value'] < 0.05:
    print("\n*** Significant overdispersion detected ***")
    print("Consider using Negative Binomial or Quasi-Poisson models")
else:
    print("\nNo significant overdispersion - Poisson is appropriate")

# Visual check: Residuals vs Fitted
residuals = poisson_result.resid_pearson
fitted = poisson_result.fittedvalues

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Pearson Residuals')
plt.title('Residuals vs Fitted Values')

# Q-Q plot
plt.subplot(1, 2, 2)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')

plt.tight_layout()
plt.show()

## 6. Negative Binomial Model

If overdispersion is present, Negative Binomial is more appropriate.
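
Poisson is the special case of Negative Binomial with `alpha = 0`, so the two can be compared with a likelihood ratio test. Because `alpha = 0` sits on the boundary of the parameter space, the usual chi-squared(1) p-value is halved. The sketch below shows the arithmetic with only the standard library; the log-likelihood values are hypothetical and `lr_test_boundary` is an illustrative helper.

```python
import math

def lr_test_boundary(llf_poisson, llf_nb):
    """LR test of alpha = 0 with the boundary correction (half the chi2(1) tail)."""
    lr_stat = max(0.0, 2.0 * (llf_nb - llf_poisson))
    chi2_1_sf = math.erfc(math.sqrt(lr_stat / 2.0))  # survival function of chi2(1)
    return lr_stat, 0.5 * chi2_1_sf

# Hypothetical log-likelihoods, for illustration only
lr_stat, p_value = lr_test_boundary(llf_poisson=-1050.0, llf_nb=-1048.0)
print(f"LR = {lr_stat:.2f}, p = {p_value:.4f}")
```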

In [None]:
# Estimate Negative Binomial
nb_model = NegativeBinomial.from_formula(
    'patents ~ rd_intensity + firm_size + competition + industry_tech + industry_pharma + industry_manuf',
    data=data
)
nb_result = nb_model.fit()

print("="*60)
print("NEGATIVE BINOMIAL RESULTS")
print("="*60)
print(nb_result.summary())

# Dispersion parameter
print(f"\nDispersion parameter (alpha): {nb_result.alpha:.4f}")
print("(Smaller alpha = less overdispersion)")

# Compare with Poisson
comparison = pd.DataFrame({
    'Poisson Coef': poisson_result.params,
    'NB Coef': nb_result.params[:-1],  # Exclude alpha
    'Poisson SE': poisson_result.bse,
    'NB SE': nb_result.bse[:-1]
})

print("\n" + "="*60)
print("COMPARISON: POISSON vs NEGATIVE BINOMIAL")
print("="*60)
print(comparison)

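# Likelihood ratio test of alpha = 0. The null lies on the boundary of the
# parameter space, so the p-value is half the chi2(1) tail probability.
lr_stat = 2 * (nb_result.llf - poisson_result.llf)
lr_pvalue = 0.5 * stats.chi2.sf(lr_stat, df=1)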
print(f"\nLikelihood Ratio Test (NB vs Poisson):")
print(f"  LR statistic: {lr_stat:.2f}")
print(f"  P-value: {lr_pvalue:.4f}")
if lr_pvalue < 0.05:
    print("  → Negative Binomial is significantly better")

## 7. Fixed Effects Poisson

Control for firm-specific unobserved heterogeneity.
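
The conditional MLE works by conditioning on each firm's total count. Given that total, the firm's yearly counts are multinomial with shares `exp(x_it'b) / sum_s exp(x_is'b)`, and anything constant within a firm cancels out of those shares. The sketch below illustrates the cancellation with made-up numbers; it is not PanelBox's implementation, and `conditional_shares` is an illustrative name.

```python
import math

def conditional_shares(linear_predictor):
    """Multinomial probabilities exp(xb_t) / sum_s exp(xb_s) for one firm's periods."""
    m = max(linear_predictor)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in linear_predictor]
    total = sum(weights)
    return [w / total for w in weights]

xb = [0.2, 0.5, 0.9]               # time-varying part x_it'b for one firm (made up)
shifted = [v + 1.7 for v in xb]    # same firm with a fixed effect or industry dummy added
print(conditional_shares(xb))
print(conditional_shares(shifted))
```

Both calls print the same shares, which is why the firm effects never need to be estimated and why the time-invariant industry dummies cannot be identified in this model.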

In [None]:
# Fixed Effects Poisson (industry dummies drop out as time-invariant)
fe_poisson = PoissonFixedEffects.from_formula(
    'patents ~ rd_intensity + firm_size + competition',
    data=data
)
fe_result = fe_poisson.fit()

print("="*60)
print("FIXED EFFECTS POISSON RESULTS")
print("="*60)
print(fe_result.summary())

print(f"\nNumber of firms: {fe_result.n_groups}")
print(f"Firms dropped (all zeros): {fe_result.n_dropped_groups}")
print(f"Observations used: {fe_result.nobs}")

# Compare time-varying coefficients
time_varying_vars = ['rd_intensity', 'firm_size', 'competition']
fe_comparison = pd.DataFrame({
    'Pooled': [poisson_result.params[v] for v in time_varying_vars],
    'Fixed Effects': [fe_result.params[v] for v in time_varying_vars]
}, index=time_varying_vars)

print("\n" + "="*60)
print("COMPARISON: POOLED vs FIXED EFFECTS")
print("="*60)
print(fe_comparison)
print("\nNote: Differences suggest correlation between firm effects and regressors")

## 8. Random Effects Poisson

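Random effects keep time-invariant regressors such as the industry dummies while still modeling unobserved firm heterogeneity, at the cost of assuming the firm effects are uncorrelated with the regressors.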

In [None]:
# Random Effects Poisson
re_poisson = RandomEffectsPoisson.from_formula(
    'patents ~ rd_intensity + firm_size + competition + industry_tech + industry_pharma + industry_manuf',
    data=data
)
re_result = re_poisson.fit()

print("="*60)
print("RANDOM EFFECTS POISSON RESULTS")
print("="*60)
print(re_result.summary())

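# Random-effect spread. The share below fixes the idiosyncratic variance at 1,
# which is a rough heuristic for count models rather than an exact decomposition.
sigma_alpha = re_result.sigma_alpha
print(f"\nSigma_alpha (RE std dev): {sigma_alpha:.4f}")
print(f"Approximate share of variance due to firm effects: {sigma_alpha**2 / (sigma_alpha**2 + 1):.1%}")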

# Compare all models
all_models_comparison = pd.DataFrame({
    'Pooled': poisson_result.params,
    'Negative Binomial': nb_result.params[:-1],
    'Random Effects': re_result.params[:-1]  # Exclude sigma
})

print("\n" + "="*60)
print("ALL MODELS COMPARISON")
print("="*60)
print(all_models_comparison)

## 9. Zero-Inflated and Hurdle Models

For data with excess zeros, specialized models may be needed.
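
Under a Poisson model each observation has probability `exp(-mu_i)` of being zero, so the expected number of zeros is the sum of `exp(-mu_i)` over observations. Plugging the average fitted value into `exp()` instead understates that number whenever the fitted means differ. The sketch below shows the gap on illustrative fitted means; `expected_zeros` is a hypothetical helper.

```python
import math

def expected_zeros(mu):
    """Expected number of zeros when y_i ~ Poisson(mu_i)."""
    return sum(math.exp(-m) for m in mu)

mu = [0.2, 0.5, 1.0, 4.0, 12.0]    # illustrative fitted means
mean_mu = sum(mu) / len(mu)
print(f"sum of exp(-mu_i): {expected_zeros(mu):.3f}")
print(f"n * exp(-mean mu): {len(mu) * math.exp(-mean_mu):.3f}")
```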

In [None]:
# Check if we need zero-inflated models
observed_zeros = (data['patents'] == 0).sum()
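# Expected zeros: sum P(y=0 | mu_i) = exp(-mu_i) over observations
# (plugging the average fitted value into exp() would understate the count)
expected_zeros_poisson = np.exp(-poisson_result.fittedvalues).sum()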

print("="*60)
print("ZERO INFLATION ANALYSIS")
print("="*60)
print(f"Observed zeros: {observed_zeros} ({observed_zeros/len(data):.1%})")
print(f"Expected zeros (Poisson): {expected_zeros_poisson:.0f} ({expected_zeros_poisson/len(data):.1%})")

if observed_zeros > expected_zeros_poisson * 1.5:
    print("\n*** Excess zeros detected - Zero-inflated model recommended ***")
    
    # Fit Zero-Inflated Poisson
    zip_model = ZeroInflatedPoisson.from_formula(
        'patents ~ rd_intensity + firm_size + competition',
        inflate='rd_intensity + firm_size',  # Zero-inflation equation
        data=data
    )
    zip_result = zip_model.fit()
    
    print("\n" + "="*60)
    print("ZERO-INFLATED POISSON RESULTS")
    print("="*60)
    print(zip_result.summary())
    
    # Hurdle Model alternative
    hurdle_model = HurdlePoisson.from_formula(
        'patents ~ rd_intensity + firm_size + competition',
        data=data
    )
    hurdle_result = hurdle_model.fit()
    
    print("\n" + "="*60)
    print("HURDLE POISSON RESULTS")
    print("="*60)
    print(hurdle_result.summary())
    
    # Model comparison
    print("\n" + "="*60)
    print("MODEL COMPARISON - AIC")
    print("="*60)
    print(f"Poisson AIC: {poisson_result.aic:.2f}")
    print(f"Zero-Inflated Poisson AIC: {zip_result.aic:.2f}")
    print(f"Hurdle Poisson AIC: {hurdle_result.aic:.2f}")
else:
    print("\nNo significant excess zeros - standard count models are appropriate")

## 10. Marginal Effects and Interpretation

For count models, we often want to know the marginal effect on the expected count.
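
With a log link, `E[y|x] = exp(x'b)`, so the derivative with respect to a continuous regressor is `b_k * exp(x'b)`. The average marginal effect is therefore `b_k` times the mean fitted count, and the elasticity for a regressor in levels is `b_k * x`. The sketch below works through both on made-up numbers; the helper names are illustrative, and PanelBox's `marginal_effects` may treat dummy variables differently.

```python
def average_marginal_effect(beta, mu):
    """AME of a continuous regressor in a log-link model: beta times the mean fitted count."""
    return beta * sum(mu) / len(mu)

def elasticity_at_mean(beta, x):
    """Elasticity of the expected count w.r.t. a regressor in levels, evaluated at mean(x)."""
    return beta * sum(x) / len(x)

mu = [2.0, 5.0, 11.0]    # illustrative fitted counts
rd = [1.0, 4.0, 7.0]     # illustrative R&D intensities
print(average_marginal_effect(0.15, mu))  # extra patents per unit of R&D intensity
print(elasticity_at_mean(0.15, rd))       # % change in patents per 1% change in R&D intensity
```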

In [None]:
# Calculate marginal effects for Poisson model
me_poisson = poisson_result.marginal_effects(kind='average')

print("="*60)
print("MARGINAL EFFECTS - POOLED POISSON")
print("="*60)
print(me_poisson.summary())

# Interpretation
print("\n" + "="*60)
print("ECONOMIC INTERPRETATION")
print("="*60)

for var in ['rd_intensity', 'firm_size', 'competition']:
    effect = me_poisson.effects[var]
    print(f"{var}: 1-unit increase → {effect:.3f} additional patents on average")

# Industry effects
for var in ['industry_tech', 'industry_pharma', 'industry_manuf']:
    effect = me_poisson.effects[var]
    industry_name = var.replace('industry_', '').title()
    print(f"{industry_name} firms: {effect:.2f} more patents than Other industry")

# Calculate elasticities
print("\n" + "="*60)
print("ELASTICITIES (% change in patents for 1% change in X)")
print("="*60)

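# With a log link the coefficient is a semi-elasticity; for a regressor in
# levels the elasticity is coefficient * x, evaluated here at the sample mean.
rd_elasticity = poisson_result.params['rd_intensity'] * data['rd_intensity'].mean()
print(f"rd_intensity (at mean): {rd_elasticity:.3f}")

# firm_size is already log(employees), so its coefficient is the elasticity
# with respect to the number of employees.
print(f"employees (via firm_size): {poisson_result.params['firm_size']:.3f}")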

## 11. Predictions and Model Validation

Evaluate model performance using predictions.
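
MAE and RMSE weight every error equally, whereas for counts a miss of 2 matters more when the expected count is 1 than when it is 50. The mean Poisson deviance is a complementary metric that accounts for this. The sketch below is a plain-Python version on toy numbers; scikit-learn also provides `sklearn.metrics.mean_poisson_deviance`.

```python
import math

def mean_poisson_deviance(y, mu):
    """Average Poisson deviance; the y*log(y/mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total / len(y)

y = [0, 1, 3, 8]
close = mean_poisson_deviance(y, [0.5, 1.0, 3.0, 8.0])   # predictions near the data
flat = mean_poisson_deviance(y, [3.0, 3.0, 3.0, 3.0])    # constant prediction
print(f"close: {close:.3f}, flat: {flat:.3f}")
```

Lower is better, and predictions must be strictly positive for the metric to be defined.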

In [None]:
# Generate predictions from different models
pred_poisson = poisson_result.predict(data)
pred_nb = nb_result.predict(data)

# Calculate prediction metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

models_predictions = {
    'Poisson': pred_poisson,
    'Negative Binomial': pred_nb
}

print("="*60)
print("PREDICTION ACCURACY")
print("="*60)

for name, predictions in models_predictions.items():
    mae = mean_absolute_error(data['patents'], predictions)
    rmse = np.sqrt(mean_squared_error(data['patents'], predictions))
    print(f"\n{name}:")
    print(f"  MAE: {mae:.3f}")
    print(f"  RMSE: {rmse:.3f}")

# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for idx, (name, predictions) in enumerate(models_predictions.items()):
    axes[idx].scatter(data['patents'], predictions, alpha=0.5)
    axes[idx].plot([0, data['patents'].max()], [0, data['patents'].max()], 'r--')
    axes[idx].set_xlabel('Actual Patents')
    axes[idx].set_ylabel('Predicted Patents')
    axes[idx].set_title(f'{name} Model')
    
    # Add correlation
    corr = np.corrcoef(data['patents'], predictions)[0, 1]
    axes[idx].text(0.05, 0.95, f'Correlation: {corr:.3f}', 
                   transform=axes[idx].transAxes, verticalalignment='top')

plt.tight_layout()
plt.show()

# Distribution of predictions
plt.figure(figsize=(10, 5))
plt.hist(data['patents'], bins=30, alpha=0.5, label='Actual', density=True)
plt.hist(pred_poisson, bins=30, alpha=0.5, label='Poisson Predicted', density=True)
plt.hist(pred_nb, bins=30, alpha=0.5, label='NB Predicted', density=True)
plt.xlabel('Number of Patents')
plt.ylabel('Density')
plt.title('Distribution: Actual vs Predicted')
plt.legend()
plt.show()

## 12. Best Practices and Common Pitfalls

### Common Pitfalls:

1. **Ignoring overdispersion**
   - Always test variance-mean relationship
   - Use Negative Binomial if overdispersed

2. **Not checking for excess zeros**
   - Compare observed vs expected zeros
   - Consider ZIP or Hurdle models

3. **Interpreting coefficients directly**
   - Coefficients are on log scale
   - Calculate marginal effects or IRRs

4. **Ignoring panel structure**
   - Use FE/RE models for panel data
   - Account for within-unit correlation

5. **Not checking model assumptions**
   - Examine residual plots
   - Test for autocorrelation in panels

### Best Practices:

1. **Start with Poisson, test assumptions**
2. **Use robust standard errors**
3. **Consider exposure/offset variables if needed**
4. **Report Incidence Rate Ratios (IRR) for interpretability**
5. **Validate with out-of-sample predictions**
6. **Compare multiple model specifications**
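
To make the exposure/offset practice concrete: an offset adds `log(exposure)` to the linear predictor with its coefficient fixed at 1, so the model describes a rate per unit of exposure. The patent data in this tutorial has no exposure variable, so the numbers below are purely hypothetical, and whether PanelBox accepts an offset argument is not shown here.

```python
import math

def expected_count(xb, exposure):
    """Expected count with an offset: exp(xb + log(exposure)) = exposure * exp(xb)."""
    return math.exp(xb + math.log(exposure))

base = expected_count(0.4, exposure=1.0)
tripled = expected_count(0.4, exposure=3.0)   # three times the exposure
print(f"base: {base:.3f}, tripled exposure: {tripled:.3f}")
```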

In [None]:
# Example: Incidence Rate Ratios
print("="*60)
print("INCIDENCE RATE RATIOS (IRR)")
print("="*60)

irr = np.exp(poisson_result.params)
irr_ci_lower = np.exp(poisson_result.params - 1.96 * poisson_result.bse)
irr_ci_upper = np.exp(poisson_result.params + 1.96 * poisson_result.bse)

irr_table = pd.DataFrame({
    'IRR': irr,
    '95% CI Lower': irr_ci_lower,
    '95% CI Upper': irr_ci_upper
})

print(irr_table)
print("\nInterpretation: IRR > 1 means factor increases count")
print("                IRR < 1 means factor decreases count")
print("                IRR = 1.5 means 50% increase in expected count")

## 13. Model Selection Summary

Let's create a final comparison to choose the best model.
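
The comparison relies on information criteria, which trade fit against the number of estimated parameters: `AIC = 2k - 2*llf` and `BIC = k*ln(n) - 2*llf`. The sketch below shows the arithmetic on hypothetical log-likelihoods; the helper names and numbers are illustrative and do not come from the fitted models.

```python
import math

def aic(llf, k):
    """Akaike information criterion from a log-likelihood and parameter count."""
    return 2 * k - 2 * llf

def bic(llf, k, n):
    """Bayesian information criterion; n is the number of observations."""
    return k * math.log(n) - 2 * llf

# Hypothetical log-likelihoods; k counts every estimated parameter, n = 300 firms x 10 years
n_obs = 3000
candidates = {"Poisson": (-9500.0, 7), "Negative Binomial": (-8200.0, 8)}
for name, (llf, k) in candidates.items():
    print(f"{name}: AIC={aic(llf, k):.1f}, BIC={bic(llf, k, n_obs):.1f}")
```

Lower values are preferred. These criteria are only comparable across models fit by full likelihood on the same observations; a conditional likelihood, as in Fixed Effects Poisson, is on a different scale.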

In [None]:
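# Create comprehensive model comparison
# Note: FE Poisson is fit by conditional likelihood on a smaller regressor set,
# so its log-likelihood (and any AIC/BIC) is not comparable with the other rows.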
model_comparison = pd.DataFrame({
    'Model': ['Poisson', 'Negative Binomial', 'FE Poisson', 'RE Poisson'],
    'Log-Likelihood': [
        poisson_result.llf,
        nb_result.llf,
        fe_result.llf,
        re_result.llf
    ],
    'AIC': [
        poisson_result.aic,
        nb_result.aic,
        getattr(fe_result, 'aic', np.nan),
        re_result.aic
    ],
    'BIC': [
        poisson_result.bic,
        nb_result.bic,
        getattr(fe_result, 'bic', np.nan),
        re_result.bic
    ]
})

print("="*60)
print("FINAL MODEL COMPARISON")
print("="*60)
print(model_comparison)
print("\nBest model by AIC:", model_comparison.loc[model_comparison['AIC'].idxmin(), 'Model'])
print("Best model by BIC:", model_comparison.loc[model_comparison['BIC'].idxmin(), 'Model'])

# Recommendations
print("\n" + "="*60)
print("RECOMMENDATIONS")
print("="*60)

if od_test['p_value'] < 0.05:
    print("✓ Overdispersion detected → Use Negative Binomial")
else:
    print("✓ No overdispersion → Poisson is appropriate")

if zero_prop > 0.3:
    print("✓ High zero proportion → Consider Zero-Inflated models")

print("✓ Firm effects likely correlated with regressors → Use Fixed Effects")
print("✓ Time-invariant regressors of interest, uncorrelated firm effects → Use Random Effects")

## 14. Conclusion

This tutorial covered the main count models in PanelBox:

- **Poisson**: Baseline model, assumes variance = mean
- **Negative Binomial**: Handles overdispersion
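- **Fixed Effects Poisson**: Controls for time-invariant unobserved heterogeneity; time-invariant regressors drop out
- **Random Effects Poisson**: Keeps time-invariant regressors and is more efficient, but assumes firm effects are uncorrelated with the regressors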
- **Zero-Inflated/Hurdle**: For excess zeros

Key takeaways:
1. Always test for overdispersion and excess zeros
2. Panel structure matters - use appropriate FE/RE models
3. Interpret results using marginal effects or IRRs
4. Validate models with predictions and diagnostics
5. Choose models based on data characteristics and research goals

### Next Steps

- Apply to your own count data
- Explore Quasi-Poisson models for mild overdispersion
- Try panel-specific extensions (dynamic count models)
- Use bootstrap for robust inference


In [None]:
print("Tutorial complete! Ready for count data analysis with PanelBox.")