# Notebook 02: Overdispersion and Negative Binomial Regression

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Recognize and diagnose** overdispersion in count data
2. **Understand** the Negative Binomial distribution and NB2 parametrization
3. **Estimate** Negative Binomial regression models using PanelBox
4. **Perform** likelihood ratio tests comparing Poisson vs NB
5. **Interpret** the dispersion parameter $\alpha$
6. **Choose** between Poisson and NB models appropriately

### Prerequisites
- Completed Notebook 01 (Poisson Introduction)
- Understanding of variance-mean relationship
- Familiarity with likelihood ratio tests

### Duration
- **Estimated time**: 60 minutes
- **Sections**: 7 main sections

### Dataset
- **File**: `firm_patents.csv` — Patent counts for 1,500 manufacturing firms over 5 years (7,500 obs)
- **Key feature**: Severe overdispersion (Var/Mean $\approx$ 18)

### References
- Cameron, A. C., & Trivedi, P. K. (2013). *Regression Analysis of Count Data*. Cambridge University Press.
- Hilbe, J. M. (2011). *Negative Binomial Regression*. Cambridge University Press.

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy import stats
from scipy.special import gammaln
import warnings
warnings.filterwarnings('ignore')

# statsmodels (used for add_constant) and PanelBox imports
import statsmodels.api as sm
from panelbox.models.count import PooledPoisson, NegativeBinomial

# Visualization configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

# Set random seed for reproducibility
np.random.seed(42)

# Define paths
BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures' / '02_negbin'
TABLES_DIR = OUTPUT_DIR / 'tables' / '02_negbin'

# Create output directories
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete!')
print(f'Figures will be saved to: {FIGURES_DIR}')
print(f'Tables will be saved to: {TABLES_DIR}')

## Section 0: Data Loading and Exploration

We use a dataset of patent counts for manufacturing firms. This data exhibits **severe overdispersion** — a violation of the Poisson equidispersion assumption that motivates the Negative Binomial model.

In [None]:
# Load dataset
df = pd.read_csv(DATA_DIR / 'firm_patents.csv')

print('Dataset shape:', df.shape)
print(f'N firms: {df["firm_id"].nunique()}')
print(f'T years: {df["year"].nunique()} ({df["year"].min()}-{df["year"].max()})')
print()
print('First rows:')
display(df.head(10))

print('\nVariable types:')
print(df.dtypes)

print('\nSummary statistics:')
display(df.describe().round(2))

## Section 1: The Problem of Equidispersion

### Poisson Equidispersion Assumption

Recall from Notebook 01 that the Poisson model assumes:

$$\text{Var}[Y|\mathbf{X}] = E[Y|\mathbf{X}] = \mu$$

This is called **equidispersion**: the conditional variance equals the conditional mean.

### Why Is This a Problem?

In practice, count data often violates this assumption. **Overdispersion** occurs when:

$$\text{Var}[Y|\mathbf{X}] > E[Y|\mathbf{X}]$$

#### Common causes of overdispersion:
1. **Unobserved heterogeneity**: Firms differ in unmeasured ways that affect patenting
2. **Clustering**: Patents may come in "bunches" (e.g., from a single research program)
3. **Omitted variables**: Important predictors missing from the model
4. **Contagion**: One patent may lead to others (cumulative innovation)

#### Consequences of ignoring overdispersion:
- Model-based Poisson standard errors are **too small** (they assume Var = Mean)
- Test statistics are **inflated**
- Confidence intervals are **too narrow**
- Leads to **spurious significance** — finding effects that don't exist
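
A small simulation can make these points concrete. The sketch below is not part of the original analysis: it draws plain Poisson counts and gamma-mixed Poisson (NB2) counts with the same mean and compares their variance-to-mean ratios. The values `mu = 3.0` and `alpha = 2.0`, the sample size, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, alpha = 100_000, 3.0, 2.0   # illustrative values, not estimates

# Equidispersed benchmark: plain Poisson counts
y_pois = rng.poisson(mu, size=n)

# Unobserved heterogeneity: scale the mean by a gamma factor with
# mean 1 and variance alpha, which yields NB2-distributed counts
nu = rng.gamma(shape=1 / alpha, scale=alpha, size=n)
y_nb = rng.poisson(mu * nu)

ratio_pois = y_pois.var() / y_pois.mean()
ratio_nb = y_nb.var() / y_nb.mean()
print(f'Poisson Var/Mean: {ratio_pois:.2f}')   # theory: 1
print(f'NB2 Var/Mean:     {ratio_nb:.2f}')     # theory: 1 + alpha * mu = 7
```

In an intercept-only model, a variance-to-mean ratio near 7 would mean that standard errors computed under the Poisson variance assumption are too small by a factor of roughly $\sqrt{7} \approx 2.6$.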

In [None]:
# ============================================================
# Explore the distribution of patents
# ============================================================

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 1. Histogram of patent counts
ax = axes[0]
max_display = int(min(df['patents'].quantile(0.99), 40))
bins = np.arange(0, max_display + 2) - 0.5
ax.hist(df['patents'].clip(upper=max_display), bins=bins,
        color='steelblue', edgecolor='white', alpha=0.8)
ax.set_xlabel('Number of Patents')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Patent Counts')

# Add mean and variance annotations
mean_pat = df['patents'].mean()
var_pat = df['patents'].var()
ax.axvline(mean_pat, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_pat:.1f}')
ax.legend(fontsize=10)

textstr = f'Mean = {mean_pat:.2f}\nVar = {var_pat:.2f}\nVar/Mean = {var_pat/mean_pat:.1f}'
ax.text(0.95, 0.95, textstr, transform=ax.transAxes, fontsize=10,
        verticalalignment='top', horizontalalignment='right',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# 2. Boxplot by year
ax = axes[1]
df.boxplot(column='patents', by='year', ax=ax,
           flierprops=dict(markersize=2, alpha=0.3))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Patents')
ax.set_title('Patents by Year')
plt.sca(ax)
plt.title('Patents by Year')

# 3. Log-scale histogram
ax = axes[2]
ax.hist(np.log1p(df['patents']), bins=30, color='darkgreen',
        edgecolor='white', alpha=0.8)
ax.set_xlabel('log(1 + Patents)')
ax.set_ylabel('Frequency')
ax.set_title('Log-transformed Distribution')

fig.suptitle('')  # Remove auto-generated suptitle from boxplot
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'patent_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

# Summary statistics
print(f'Mean patents: {mean_pat:.2f}')
print(f'Variance: {var_pat:.2f}')
print(f'Overdispersion index (Var/Mean): {var_pat/mean_pat:.1f}')
print(f'Zero count: {(df["patents"] == 0).sum()} ({100*(df["patents"]==0).mean():.1f}%)')
print(f'Max patents: {df["patents"].max()}')
print(f'Median patents: {df["patents"].median():.0f}')

In [None]:
# ============================================================
# Variance-Mean Relationship by Firm
# ============================================================

# Compute within-firm variance and mean
firm_stats = df.groupby('firm_id')['patents'].agg(['mean', 'var']).dropna()
firm_stats.columns = ['mean', 'variance']
firm_stats = firm_stats[firm_stats['variance'] > 0]  # Need positive variance

fig, ax = plt.subplots(figsize=(10, 7))

# Scatter plot
ax.scatter(firm_stats['mean'], firm_stats['variance'],
           alpha=0.3, s=20, color='steelblue', label='Firm-level data')

# 45-degree line (Poisson reference)
ref_range = np.linspace(0, firm_stats['mean'].max() * 1.1, 100)
ax.plot(ref_range, ref_range, 'r--', linewidth=2,
        label='Poisson: Var = Mean (45-degree line)')

# NB2 reference: Var = mu + alpha*mu^2 (alpha = 2.0 is an illustrative value, not an estimate)
alpha_approx = 2.0
ax.plot(ref_range, ref_range + alpha_approx * ref_range**2, 'g-', linewidth=2,
        label=f'NB2: Var = mu + {alpha_approx}*mu^2', alpha=0.7)

ax.set_xlabel('Within-firm Mean', fontsize=12)
ax.set_ylabel('Within-firm Variance', fontsize=12)
ax.set_title('Variance vs Mean by Firm\n(Points above 45-degree line indicate overdispersion)', fontsize=14)
ax.legend(fontsize=11)
ax.grid(alpha=0.3)

# Set sensible axis limits
ax.set_xlim(0, firm_stats['mean'].quantile(0.98) * 1.1)
ax.set_ylim(0, firm_stats['variance'].quantile(0.98) * 1.1)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'variance_mean_firms.png', dpi=300, bbox_inches='tight')
plt.show()

# Percentage of firms with Var > Mean
pct_over = (firm_stats['variance'] > firm_stats['mean']).mean() * 100
print(f'\nPercentage of firms with Var > Mean: {pct_over:.1f}%')
print(f'Median within-firm Var/Mean ratio: {(firm_stats["variance"]/firm_stats["mean"]).median():.1f}')

In [None]:
# ============================================================
# Demonstrate Consequences: Fit Poisson Model
# ============================================================

# Prepare data
df['log_rd'] = np.log(df['rd_spending'])
df['log_emp'] = np.log(df['employees'])

y = df['patents'].values
X_vars = df[['log_rd', 'log_emp', 'firm_age', 'tech_sector',
             'public_funding', 'international']].values
X = sm.add_constant(X_vars)

var_names = ['const', 'log_rd', 'log_emp', 'firm_age',
             'tech_sector', 'public_funding', 'international']

# Fit Poisson model
print('Fitting Poisson model...')
print('=' * 60)
poisson_model = PooledPoisson(
    endog=y,
    exog=X,
    entity_id=df['firm_id'].values,
    time_id=df['year'].values
)
poisson_result = poisson_model.fit(se_type='cluster')

# Store log-likelihood for later
poisson_llf = poisson_model.llf

# Display basic results
print(f'Log-likelihood: {poisson_llf:.2f}')
print(f'Number of observations: {len(y)}')
print()

# Create coefficient table
poisson_se = np.sqrt(np.diag(poisson_result.vcov))
poisson_t = poisson_result.params / poisson_se
# norm.sf keeps precision for large |z|, where 1 - cdf rounds to exactly 0
poisson_p = 2 * stats.norm.sf(np.abs(poisson_t))

poisson_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': poisson_result.params,
    'Std. Error': poisson_se,
    'z-statistic': poisson_t,
    'p-value': poisson_p
})

def add_stars(p):
    if p < 0.001: return '***'
    elif p < 0.01: return '**'
    elif p < 0.05: return '*'
    else: return ''

poisson_table['Sig'] = poisson_table['p-value'].apply(add_stars)

print('Poisson Regression Results')
print('=' * 70)
display(poisson_table.round(4))
print('\nSignificance: *** p<0.001, ** p<0.01, * p<0.05')

# Compute overdispersion index
fitted_poisson = np.exp(X @ poisson_result.params)
pearson_resid = (y - fitted_poisson) / np.sqrt(fitted_poisson)
disp_index = np.sum(pearson_resid**2) / (len(y) - len(var_names))
print(f'\nPearson dispersion statistic: {disp_index:.2f}')
print(f'(Should be ~1 under Poisson; values >> 1 indicate overdispersion)')

## Section 2: Detecting Overdispersion

Before fitting a Negative Binomial model, we should formally test for overdispersion. Several diagnostic methods are available:

1. **Overdispersion Index**: Simple ratio Var(y)/E(y)
2. **Pearson Dispersion**: Sum of squared Pearson residuals / (n - k)
3. **Cameron-Trivedi Test**: Regression-based auxiliary test
4. **Visual Diagnostics**: Rootogram and residual plots
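
The first two diagnostics in this list reduce to a few lines of NumPy. The helper below is a hypothetical convenience function (`pearson_dispersion` is not a PanelBox API), demonstrated on simulated NB2 data with an intercept-only fit so that it runs on its own.

```python
import numpy as np

def pearson_dispersion(y, mu, n_params):
    """Sum of squared Pearson residuals divided by residual degrees of freedom."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    pearson_resid = (y - mu) / np.sqrt(mu)
    return np.sum(pearson_resid**2) / (len(y) - n_params)

rng = np.random.default_rng(1)
# Simulated NB2 counts with mean 4 and alpha = 2 (shape r = 0.5); illustrative only
y_sim = rng.negative_binomial(n=0.5, p=0.5 / (0.5 + 4.0), size=20_000)

# Intercept-only Poisson fit: the MLE of mu is the sample mean
mu_sim = np.full(y_sim.shape, y_sim.mean())

index = y_sim.var(ddof=1) / y_sim.mean()
disp = pearson_dispersion(y_sim, mu_sim, n_params=1)
print(f'Var/Mean index:     {index:.2f}')
print(f'Pearson dispersion: {disp:.2f}')
```

With only an intercept the two numbers coincide. Once covariates enter the model, the Pearson statistic measures dispersion conditional on $\mathbf{X}$, which is the quantity the Poisson assumption actually restricts.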

In [None]:
# ============================================================
# Method 1: Cameron-Trivedi Test (manual implementation)
# ============================================================

# The Cameron-Trivedi test is based on an auxiliary regression:
#   (y - mu)^2 - y = alpha * mu^2 + error
# If alpha > 0, overdispersion is present.
#
# Alternatively (simpler form):
#   [(y - mu)^2 - y] / mu = alpha * mu + error

mu_hat = fitted_poisson

# Auxiliary dependent variable
aux_y = ((y - mu_hat)**2 - y) / mu_hat

# Auxiliary regressor
aux_x = mu_hat.reshape(-1, 1)

# OLS regression (no intercept)
from numpy.linalg import lstsq
slope, _, _, _ = lstsq(aux_x, aux_y, rcond=None)
alpha_ct = slope[0]

# Standard error via OLS
residuals_ct = aux_y - alpha_ct * mu_hat
se_ct = np.sqrt(np.sum(residuals_ct**2) / (len(y) - 1)) / np.sqrt(np.sum(mu_hat**2))
t_stat_ct = alpha_ct / se_ct
# Two-sided p-value; norm.sf avoids rounding to 0 for very large statistics
p_value_ct = 2 * stats.norm.sf(np.abs(t_stat_ct))

print('Cameron-Trivedi Test for Overdispersion')
print('=' * 50)
print(f'H0: Var(Y|X) = E(Y|X)  (equidispersion)')
print(f'H1: Var(Y|X) = E(Y|X) + alpha * E(Y|X)^2')
print()
print(f'Estimated alpha: {alpha_ct:.4f}')
print(f'Standard error:  {se_ct:.4f}')
print(f't-statistic:     {t_stat_ct:.2f}')
print(f'p-value:         {p_value_ct:.6f}')
print()
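if p_value_ct < 0.05 and alpha_ct < 0:
    # The p-value is two-sided, so check the sign before claiming overdispersion
    print('RESULT: alpha is significantly negative, which suggests underdispersion')
elif p_value_ct < 0.001:
    print('RESULT: Strong evidence of overdispersion (p < 0.001)')
    print('=> Poisson model is inadequate. Consider Negative Binomial.')
elif p_value_ct < 0.05:
    print('RESULT: Evidence of overdispersion (p < 0.05)')
else:
    print('RESULT: No significant evidence of overdispersion')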

In [None]:
# ============================================================
# Method 2: Multiple Overdispersion Diagnostics
# ============================================================

# 1. Unconditional overdispersion index
unconditional_oi = df['patents'].var() / df['patents'].mean()

# 2. Pearson dispersion (conditional)
pearson_disp = np.sum(pearson_resid**2) / (len(y) - len(var_names))

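# 3. Deviance-based
# xlogy(0, 0) returns 0, so observations with y = 0 contribute 2 * mu_hat
# without evaluating log(0)
from scipy.special import xlogy
deviance_contribs = 2 * (xlogy(y, y / mu_hat) - (y - mu_hat))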
deviance = np.sum(deviance_contribs)
deviance_disp = deviance / (len(y) - len(var_names))

# Create diagnostic summary table
diagnostics = pd.DataFrame({
    'Diagnostic': [
        'Unconditional Var/Mean',
        'Pearson Dispersion',
        'Deviance Dispersion',
        'Cameron-Trivedi alpha',
        'Cameron-Trivedi p-value'
    ],
    'Value': [
        f'{unconditional_oi:.2f}',
        f'{pearson_disp:.2f}',
        f'{deviance_disp:.2f}',
        f'{alpha_ct:.4f}',
        f'{p_value_ct:.2e}'
    ],
    'Expected (Poisson)': [
        '~1.0',
        '~1.0',
        '~1.0',
        '~0.0',
        '> 0.05'
    ],
    'Conclusion': [
        'Severe overdispersion' if unconditional_oi > 2 else 'Mild',
        'Severe overdispersion' if pearson_disp > 2 else 'Mild',
        'Severe overdispersion' if deviance_disp > 2 else 'Mild',
        'Significant' if p_value_ct < 0.05 else 'Not significant',
        'Reject Poisson' if p_value_ct < 0.05 else 'Fail to reject'
    ]
})

print('Overdispersion Diagnostic Summary')
print('=' * 80)
display(diagnostics)

# Save table
diagnostics.to_csv(TABLES_DIR / 'table_01_overdispersion_tests.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_01_overdispersion_tests.csv"}')

In [None]:
# ============================================================
# Visual Diagnostics for Overdispersion
# ============================================================

fig, axes = plt.subplots(1, 3, figsize=(17, 5))

# 1. Rootogram: Observed vs Poisson-predicted frequencies
ax = axes[0]
max_count = min(int(df['patents'].quantile(0.95)), 30)
observed_freq = np.bincount(np.minimum(y, max_count))
counts = np.arange(len(observed_freq))

# Expected under Poisson
expected_freq = np.zeros(len(observed_freq))
for k in range(len(observed_freq)):
    if k < max_count:
        expected_freq[k] = len(y) * np.mean(stats.poisson.pmf(k, mu_hat))
    else:
        expected_freq[k] = len(y) * np.mean(1 - stats.poisson.cdf(k - 1, mu_hat))

bar_width = 0.35
ax.bar(counts - bar_width/2, np.sqrt(observed_freq), bar_width,
       label='Observed', color='steelblue', alpha=0.8)
ax.bar(counts + bar_width/2, np.sqrt(expected_freq), bar_width,
       label='Poisson predicted', color='coral', alpha=0.8)
ax.set_xlabel('Patent Count')
ax.set_ylabel('sqrt(Frequency)')
ax.set_title('Rootogram: Observed vs Poisson')
ax.legend(fontsize=9)
ax.set_xlim(-0.5, max_count + 0.5)

# 2. Pearson residuals vs fitted
ax = axes[1]
ax.scatter(np.log(mu_hat + 0.5), pearson_resid, alpha=0.1, s=5, color='steelblue')
ax.axhline(y=0, color='red', linestyle='--', linewidth=1)
ax.axhline(y=2, color='orange', linestyle=':', linewidth=1)
ax.axhline(y=-2, color='orange', linestyle=':', linewidth=1)
ax.set_xlabel('log(Fitted Values + 0.5)')
ax.set_ylabel('Pearson Residuals')
ax.set_title('Pearson Residuals vs Fitted Values')
pct_outside = (np.abs(pearson_resid) > 2).mean() * 100
ax.text(0.05, 0.95, f'{pct_outside:.1f}% outside +/-2',
        transform=ax.transAxes, fontsize=10,
        verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# 3. Binned variance-mean diagnostic
ax = axes[2]
n_bins = 15
bin_edges = np.percentile(mu_hat, np.linspace(0, 100, n_bins + 1))
bin_edges = np.unique(bin_edges)
bin_idx = np.digitize(mu_hat, bin_edges) - 1
bin_idx = np.clip(bin_idx, 0, len(bin_edges) - 2)

bin_means = []
bin_vars = []
for b in range(len(bin_edges) - 1):
    mask = bin_idx == b
    if mask.sum() > 10:
        bin_means.append(np.mean(y[mask]))
        bin_vars.append(np.var(y[mask]))

bin_means = np.array(bin_means)
bin_vars = np.array(bin_vars)

ax.scatter(bin_means, bin_vars, s=60, color='steelblue',
           edgecolor='navy', zorder=5, label='Binned data')
ref = np.linspace(0, max(bin_means) * 1.1, 100)
ax.plot(ref, ref, 'r--', linewidth=2, label='Poisson: Var = Mean')
ax.set_xlabel('Binned Mean')
ax.set_ylabel('Binned Variance')
ax.set_title('Binned Variance vs Mean')
ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'overdispersion_diagnostic.png', dpi=300, bbox_inches='tight')
plt.show()

print('All diagnostics point to severe overdispersion.')
print('The Poisson model is inadequate for this data.')

## Section 3: The Negative Binomial Model

### NB2 Parametrization

The **Negative Binomial (NB2)** model generalizes the Poisson by adding a dispersion parameter $\alpha$:

$$E[Y|\mathbf{X}] = \mu = \exp(\mathbf{X}'\boldsymbol{\beta})$$

$$\text{Var}[Y|\mathbf{X}] = \mu + \alpha \cdot \mu^2$$

### Key Properties:

| Value of $\alpha$ | Interpretation |
|-----------|---------------|
| $\alpha = 0$ | Reduces to Poisson (equidispersion) |
| $\alpha > 0$ | Overdispersion present |
| Small $\alpha$ (~0.1) | Mild overdispersion (rule of thumb) |
| Large $\alpha$ (~2+) | Severe overdispersion (rule of thumb) |

### NB1 vs NB2

- **NB1**: $\text{Var} = \mu(1 + \alpha)$ — variance is linear in mean
- **NB2**: $\text{Var} = \mu + \alpha\mu^2$ — variance is quadratic in mean (more common)

PanelBox implements the **NB2** parametrization, which is the standard in econometrics.
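
To see how the two variance functions diverge, the following sketch evaluates them at a few means. The value `alpha = 0.5` and the grid of means are illustrative assumptions, not estimates from the patent data.

```python
import numpy as np

alpha = 0.5                      # illustrative dispersion value
mu = np.array([1.0, 5.0, 20.0])  # illustrative conditional means

var_poisson = mu
var_nb1 = mu * (1 + alpha)       # linear in the mean
var_nb2 = mu + alpha * mu**2     # quadratic in the mean

for m, vp, v1, v2 in zip(mu, var_poisson, var_nb1, var_nb2):
    print(f'mu={m:5.1f}  Poisson={vp:6.1f}  NB1={v1:6.1f}  NB2={v2:6.1f}')
```

The two parametrizations agree at $\mu = 1$. Under NB1 the ratio Var/Mean stays fixed at $1 + \alpha$, while under NB2 it grows with the mean as $1 + \alpha\mu$.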

### Log-likelihood

The NB2 log-likelihood for observation $i$ is:

$$\ell_i = \ln\Gamma(y_i + r) - \ln\Gamma(y_i + 1) - \ln\Gamma(r) + r\ln\left(\frac{r}{r + \mu_i}\right) + y_i\ln\left(\frac{\mu_i}{r + \mu_i}\right)$$

where $r = 1/\alpha$ is the shape parameter and $\Gamma(\cdot)$ is the gamma function.
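
The formula can be checked numerically. The sketch below codes it with `scipy.special.gammaln` and compares it with `scipy.stats.nbinom.logpmf`, using SciPy's $(n, p)$ parametrization with $n = r$ and $p = r/(r + \mu)$. The function name `nb2_loglik_obs` and the demo values are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def nb2_loglik_obs(y, mu, alpha):
    """Per-observation NB2 log-likelihood with shape r = 1/alpha."""
    r = 1.0 / alpha
    return (gammaln(y + r) - gammaln(y + 1) - gammaln(r)
            + r * np.log(r / (r + mu))
            + y * np.log(mu / (r + mu)))

y_demo = np.array([0, 1, 3, 10])
mu_demo = np.array([0.5, 2.0, 2.0, 6.0])
alpha_demo = 1.5

ll_manual = nb2_loglik_obs(y_demo, mu_demo, alpha_demo)

# Same quantity via SciPy: n = r, p = r / (r + mu)
r_demo = 1.0 / alpha_demo
ll_scipy = stats.nbinom.logpmf(y_demo, r_demo, r_demo / (r_demo + mu_demo))

print('Manual and SciPy log-likelihoods agree:', np.allclose(ll_manual, ll_scipy))
```

Working with `gammaln` rather than the gamma function itself avoids overflow when $y_i$ or $r$ is large.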

In [None]:
# ============================================================
# Estimate Negative Binomial Model
# ============================================================

print('Fitting Negative Binomial (NB2) model...')
print('=' * 60)

nb_model = NegativeBinomial(
    endog=y,
    exog=X,
    entity_id=df['firm_id'].values,
    time_id=df['year'].values
)
nb_result = nb_model.fit()

print(f'Converged: {nb_result.converged}')
print(f'Log-likelihood: {nb_result.llf:.2f}')
print(f'Number of observations: {len(y)}')
print()

# Extract and display alpha
print('Dispersion Parameter')
print('-' * 40)
print(f'Alpha (a):     {nb_result.alpha:.4f}')
print(f'log(Alpha):    {np.log(nb_result.alpha):.4f}')

# SE of alpha via the delta method; this assumes the last vcov entry
# is the variance of log(alpha)
alpha_se = np.sqrt(nb_result.vcov[-1, -1]) * nb_result.alpha  # Delta method
alpha_ci_low = nb_result.alpha * np.exp(-1.96 * np.sqrt(nb_result.vcov[-1, -1]))
alpha_ci_high = nb_result.alpha * np.exp(1.96 * np.sqrt(nb_result.vcov[-1, -1]))

print(f'SE(Alpha):     {alpha_se:.4f}')
print(f'95% CI:        [{alpha_ci_low:.4f}, {alpha_ci_high:.4f}]')
print()

# Interpretation
if nb_result.alpha < 0.5:
    severity = 'mild'
elif nb_result.alpha < 2.0:
    severity = 'moderate'
else:
    severity = 'severe'
print(f'Interpretation: Alpha = {nb_result.alpha:.2f} indicates {severity} overdispersion.')
print(f'At mean count = {mean_pat:.1f}: Var/Mean = 1 + alpha*mean = {1 + nb_result.alpha*mean_pat:.1f}')

In [None]:
# ============================================================
# NB Results Table
# ============================================================

# Extract coefficients (excluding alpha)
nb_params = nb_result.params_exog
nb_se = np.sqrt(np.diag(nb_result.vcov))[:-1]  # Exclude alpha SE
nb_t = nb_params / nb_se
# norm.sf keeps precision for large |z|, where 1 - cdf rounds to exactly 0
nb_p = 2 * stats.norm.sf(np.abs(nb_t))

nb_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': nb_params,
    'Std. Error': nb_se,
    'z-statistic': nb_t,
    'p-value': nb_p
})

nb_table['Sig'] = nb_table['p-value'].apply(add_stars)

print('Negative Binomial (NB2) Regression Results')
print('=' * 70)
display(nb_table.round(4))
print(f'\nAlpha (dispersion): {nb_result.alpha:.4f}')
print(f'Log-likelihood: {nb_result.llf:.2f}')
print('Significance: *** p<0.001, ** p<0.01, * p<0.05')

# Save
nb_table.to_csv(TABLES_DIR / 'table_02_nb_estimates.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_02_nb_estimates.csv"}')

In [None]:
# ============================================================
# Compare Poisson vs NB Coefficients
# ============================================================

comparison = pd.DataFrame({
    'Variable': var_names,
    'Poisson_Coef': poisson_result.params,
    'Poisson_SE': poisson_se,
    'NB_Coef': nb_params,
    'NB_SE': nb_se,
    'SE_Ratio': nb_se / poisson_se
})

print('Coefficient Comparison: Poisson vs Negative Binomial')
print('=' * 80)
display(comparison.round(4))
print()
print('Key observations:')
print(f'  - Average SE ratio (NB/Poisson): {comparison["SE_Ratio"].mean():.2f}')
print('  - SE ratio > 1 means the NB standard error is larger; the Poisson SEs here are cluster-robust')
print(f'  - Coefficients may differ slightly due to different likelihood functions')

comparison.to_csv(TABLES_DIR / 'table_03_poisson_vs_nb_coefs.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_03_poisson_vs_nb_coefs.csv"}')

In [None]:
# ============================================================
# Visualization: Fit Comparison and Coefficient Forest Plot
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Observed vs Fitted distributions
ax = axes[0]
max_count = min(int(df['patents'].quantile(0.95)), 25)
observed_freq = np.bincount(np.minimum(y, max_count))
counts_range = np.arange(len(observed_freq))

# Poisson predicted frequencies
mu_pois = np.exp(X @ poisson_result.params)
pois_freq = np.array([
    len(y) * np.mean(stats.poisson.pmf(k, mu_pois)) if k < max_count
    else len(y) * np.mean(1 - stats.poisson.cdf(k - 1, mu_pois))
    for k in range(len(observed_freq))
])

# NB predicted frequencies
mu_nb = np.exp(X @ nb_result.params_exog)
r_nb = 1.0 / nb_result.alpha
nb_freq = np.array([
    len(y) * np.mean(stats.nbinom.pmf(k, r_nb, r_nb / (r_nb + mu_nb))) if k < max_count
    else len(y) * np.mean(1 - stats.nbinom.cdf(k - 1, r_nb, r_nb / (r_nb + mu_nb)))
    for k in range(len(observed_freq))
])

width = 0.25
ax.bar(counts_range - width, observed_freq, width, label='Observed',
       color='steelblue', alpha=0.8)
ax.bar(counts_range, pois_freq, width, label='Poisson',
       color='coral', alpha=0.8)
ax.bar(counts_range + width, nb_freq, width, label='Neg. Binomial',
       color='seagreen', alpha=0.8)
ax.set_xlabel('Patent Count')
ax.set_ylabel('Frequency')
ax.set_title('Observed vs Predicted Distributions')
ax.legend(fontsize=10)
ax.set_xlim(-0.5, max_count + 0.5)

# 2. Forest plot comparing coefficients
ax = axes[1]
n_vars = len(var_names) - 1  # Exclude constant
y_pos = np.arange(n_vars)
var_labels = var_names[1:]  # Exclude constant

# Plot Poisson
ax.errorbar(poisson_result.params[1:], y_pos + 0.1,
            xerr=1.96 * poisson_se[1:],
            fmt='o', color='coral', markersize=8, capsize=4,
            label='Poisson', linewidth=2)

# Plot NB
ax.errorbar(nb_params[1:], y_pos - 0.1,
            xerr=1.96 * nb_se[1:],
            fmt='s', color='seagreen', markersize=8, capsize=4,
            label='Neg. Binomial', linewidth=2)

ax.axvline(x=0, color='gray', linestyle='--', linewidth=1)
ax.set_yticks(y_pos)
ax.set_yticklabels(var_labels)
ax.set_xlabel('Coefficient (95% CI)')
ax.set_title('Coefficient Comparison: Poisson vs NB')
ax.legend(fontsize=10, loc='lower right')
ax.grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'nb_fit_comparison.png', dpi=300, bbox_inches='tight')
plt.savefig(FIGURES_DIR / 'coefficient_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print('NB model captures the dispersion much better than Poisson.')
print('Note: NB confidence intervals are wider (more honest about uncertainty).')

## Section 4: Likelihood Ratio Test — Poisson vs Negative Binomial

### Nested Models

The Poisson model is a **special case** of the NB model when $\alpha = 0$. This nesting allows us to use a **likelihood ratio (LR) test**.

### LR Test Statistic

$$LR = 2 \left( \ell_{\text{NB}} - \ell_{\text{Poisson}} \right)$$

Under $H_0: \alpha = 0$:

$$LR \sim \chi^2(1)$$

### Interpretation

- **Large LR** (small p-value): Reject Poisson in favor of NB
- **Small LR** (large p-value): Poisson is adequate

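**Note**: Because $H_0: \alpha = 0$ lies on the boundary of the parameter space ($\alpha \geq 0$), the $\chi^2(1)$ reference distribution is conservative. Under $H_0$ the LR statistic follows a 50:50 mixture of $\chi^2(0)$ (a point mass at zero) and $\chi^2(1)$, so the $\chi^2(1)$ p-value is an upper bound: the boundary-corrected p-value is half of it when $LR > 0$.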
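
The boundary adjustment amounts to halving the $\chi^2(1)$ tail probability whenever the LR statistic is positive. The following is a minimal sketch, assuming the standard 50:50 mixture result. `lr_boundary_pvalue` is a hypothetical helper, not a PanelBox function, and `lr_demo` is an illustrative value.

```python
from scipy import stats

def lr_boundary_pvalue(lr_stat):
    """p-value under a 50:50 mixture of chi2(0) and chi2(1)."""
    if lr_stat <= 0:
        return 1.0
    return 0.5 * stats.chi2.sf(lr_stat, df=1)

lr_demo = 2.71   # illustrative LR statistic
p_naive = stats.chi2.sf(lr_demo, df=1)
p_boundary = lr_boundary_pvalue(lr_demo)

print(f'chi2(1) p-value:            {p_naive:.4f}')
print(f'Boundary-corrected p-value: {p_boundary:.4f}')
```

For a statistic this close to the 10% critical value of $\chi^2(1)$, the correction moves the p-value from about 0.10 to about 0.05. For very large LR statistics both p-values are effectively zero and the conclusion does not change.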

In [None]:
# ============================================================
# Likelihood Ratio Test: PanelBox Built-in
# ============================================================

print('Likelihood Ratio Test: Poisson vs Negative Binomial')
print('=' * 60)

# Use PanelBox's built-in LR test
lr_test = nb_result.lr_test_poisson()

print(f'H0: alpha = 0 (Poisson is adequate)')
print(f'H1: alpha > 0 (NB is needed)')
print()
print(f'LR statistic:    {lr_test["statistic"]:.2f}')
print(f'Degrees of freedom: {lr_test["df"]}')
print(f'p-value:         {lr_test["pvalue"]:.2e}')
print(f'Conclusion:      {lr_test["conclusion"]}')
print()

# Log-likelihoods
print(f'Log-likelihood (Poisson): {lr_test["llf_restricted"]:.2f}')
print(f'Log-likelihood (NB):      {lr_test["llf_unrestricted"]:.2f}')
print(f'Difference:               {lr_test["llf_unrestricted"] - lr_test["llf_restricted"]:.2f}')

In [None]:
# ============================================================
# Manual LR Test Computation (Pedagogical)
# ============================================================

# For understanding, let's compute the LR test step by step

# Step 1: Get log-likelihoods
ll_poisson = lr_test['llf_restricted']
ll_nb = lr_test['llf_unrestricted']

print('Step-by-step LR Test Computation')
print('=' * 50)
print(f'Step 1: Log-likelihoods')
print(f'  LL(Poisson) = {ll_poisson:.2f}')
print(f'  LL(NB)      = {ll_nb:.2f}')

# Step 2: Compute LR statistic
LR = 2 * (ll_nb - ll_poisson)
print(f'\nStep 2: LR statistic = 2 * ({ll_nb:.2f} - ({ll_poisson:.2f}))')
print(f'  LR = {LR:.2f}')

# Step 3: Compare to chi-squared distribution
# (chi2.sf keeps precision in the far tail, where 1 - cdf rounds to exactly 0)
p_value = stats.chi2.sf(LR, df=1)
print(f'\nStep 3: Compare to chi-squared(1)')
print(f'  chi-squared(1) critical value at 5%: {stats.chi2.ppf(0.95, 1):.2f}')
print(f'  LR statistic: {LR:.2f}')
print(f'  p-value: {p_value:.2e}')

# Step 4: Decision
print(f'\nStep 4: Decision')
if p_value < 0.001:
    print(f'  LR = {LR:.2f} >> {stats.chi2.ppf(0.95, 1):.2f} (critical value)')
    print(f'  p < 0.001: STRONGLY reject Poisson in favor of NB')
elif p_value < 0.05:
    print(f'  p < 0.05: Reject Poisson in favor of NB')
else:
    print(f'  p = {p_value:.4f}: Fail to reject Poisson')

In [None]:
# ============================================================
# AIC/BIC Comparison
# ============================================================

# Poisson: k parameters (betas only)
k_pois = len(var_names)
aic_pois = -2 * ll_poisson + 2 * k_pois
bic_pois = -2 * ll_poisson + np.log(len(y)) * k_pois

# NB: k+1 parameters (betas + alpha)
k_nb = len(var_names) + 1
aic_nb = -2 * ll_nb + 2 * k_nb
bic_nb = -2 * ll_nb + np.log(len(y)) * k_nb

model_comparison = pd.DataFrame({
    'Metric': ['Log-Likelihood', 'Parameters', 'AIC', 'BIC', 'LR Statistic', 'LR p-value'],
    'Poisson': [f'{ll_poisson:.2f}', k_pois, f'{aic_pois:.2f}', f'{bic_pois:.2f}', '--', '--'],
    'Neg. Binomial': [f'{ll_nb:.2f}', k_nb, f'{aic_nb:.2f}', f'{bic_nb:.2f}',
                      f'{LR:.2f}', f'{p_value:.2e}']
})

print('Model Comparison: Poisson vs Negative Binomial')
print('=' * 70)
display(model_comparison)

print(f'\nAIC difference: {aic_pois - aic_nb:.1f} (lower is better -> NB wins)')
print(f'BIC difference: {bic_pois - bic_nb:.1f} (lower is better -> NB wins)')

# Save table
model_comparison.to_csv(TABLES_DIR / 'table_04_model_comparison.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_04_model_comparison.csv"}')

In [None]:
# ============================================================
# AIC/BIC Visual Comparison
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# AIC comparison
ax = axes[0]
models = ['Poisson', 'Negative\nBinomial']
aic_values = [aic_pois, aic_nb]
colors = ['coral', 'seagreen']
bars = ax.bar(models, aic_values, color=colors, alpha=0.8, edgecolor='black')
ax.set_ylabel('AIC')
ax.set_title('AIC Comparison\n(Lower is Better)')

# Annotate values
for bar, val in zip(bars, aic_values):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + max(aic_values)*0.01,
            f'{val:.0f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

# BIC comparison
ax = axes[1]
bic_values = [bic_pois, bic_nb]
bars = ax.bar(models, bic_values, color=colors, alpha=0.8, edgecolor='black')
ax.set_ylabel('BIC')
ax.set_title('BIC Comparison\n(Lower is Better)')

for bar, val in zip(bars, bic_values):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + max(bic_values)*0.01,
            f'{val:.0f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'aic_bic_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print(f'NB model is strongly preferred on both AIC and BIC.')

## Section 5: Model Interpretation and Application

### Incidence Rate Ratios (IRRs)

For the NB model, as with Poisson, we interpret coefficients through **Incidence Rate Ratios**:

$$IRR_j = \exp(\beta_j)$$

**Interpretation**:
- $IRR = 1.0$: No effect
- $IRR = 1.5$: A one-unit increase in $x_j$ raises the expected count by 50%
- $IRR = 0.8$: A one-unit increase in $x_j$ lowers the expected count by 20%

For **log-transformed** covariates (like log R&D), $\beta$ is an **elasticity**: a 1% increase in R&D is associated with approximately a $\beta$% change in expected patents.
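
The arithmetic behind both interpretations can be sketched as follows. The coefficients `beta_dummy` and `beta_log` are hypothetical numbers chosen for illustration, not estimates from this notebook.

```python
import numpy as np

beta_dummy = 0.405    # hypothetical coefficient on a 0/1 covariate
beta_log = 0.50       # hypothetical coefficient on a logged covariate

# IRR for a one-unit change in the dummy
irr = np.exp(beta_dummy)
pct_dummy = (irr - 1) * 100

# Exact effect of a 10% increase in the underlying (unlogged) covariate
pct_10 = (1.10**beta_log - 1) * 100
# Elasticity approximation: beta times the percentage change
pct_10_approx = beta_log * 10

print(f'IRR = {irr:.3f} -> {pct_dummy:.1f}% change in expected count')
print(f'10% increase: exact {pct_10:.2f}%, approximation {pct_10_approx:.2f}%')
```

The elasticity reading is a first-order approximation: it is accurate for small percentage changes and drifts from the exact figure as the change grows.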

In [None]:
# ============================================================
# Incidence Rate Ratios for NB Model
# ============================================================

irr = np.exp(nb_params)
irr_ci_low = np.exp(nb_params - 1.96 * nb_se)
irr_ci_high = np.exp(nb_params + 1.96 * nb_se)
pct_change = (irr - 1) * 100

irr_table = pd.DataFrame({
    'Variable': var_names,
    'Coefficient': nb_params,
    'IRR': irr,
    'IRR_CI_Low': irr_ci_low,
    'IRR_CI_High': irr_ci_high,
    'Pct_Change': pct_change,
    'p-value': nb_p,
    'Sig': nb_table['Sig'].values
})

print('Incidence Rate Ratios (NB Model)')
print('=' * 80)
display(irr_table.round(4))
print('\nSignificance: *** p<0.001, ** p<0.01, * p<0.05')

irr_table.to_csv(TABLES_DIR / 'table_05_irr_nb.csv', index=False)
print(f'\nSaved to {TABLES_DIR / "table_05_irr_nb.csv"}')

In [None]:
# ============================================================
# Substantive Interpretation
# ============================================================

print('Substantive Interpretation of NB Results')
print('=' * 60)
print()

# R&D spending (log)
rd_coef = nb_params[var_names.index('log_rd')]
rd_irr = np.exp(rd_coef)
print(f'1. R&D Spending (elasticity = {rd_coef:.3f}):')
print(f'   A 10% increase in R&D spending is associated with a')
print(f'   {((1.10**rd_coef - 1) * 100):.1f}% increase in expected patents.')
print(f'   Doubling R&D: +{((2**rd_coef - 1) * 100):.1f}% more patents.')
print()

# Employees (log)
emp_coef = nb_params[var_names.index('log_emp')]
print(f'2. Firm Size (elasticity = {emp_coef:.3f}):')
print(f'   A 10% increase in employees is associated with a')
print(f'   {((1.10**emp_coef - 1) * 100):.1f}% increase in expected patents.')
print()

# Tech sector
tech_coef = nb_params[var_names.index('tech_sector')]
tech_irr = np.exp(tech_coef)
print(f'3. Tech Sector (IRR = {tech_irr:.3f}):')
print(f'   High-tech firms produce {(tech_irr - 1) * 100:.1f}% more patents')
print(f'   than traditional manufacturing firms, all else equal.')
print()

# Public funding
fund_coef = nb_params[var_names.index('public_funding')]
fund_irr = np.exp(fund_coef)
print(f'4. Public Funding (IRR = {fund_irr:.3f}):')
print(f'   Firms receiving public R&D funding produce {(fund_irr - 1) * 100:.1f}%')
print(f'   more patents than those without.')
print()

# International
intl_coef = nb_params[var_names.index('international')]
intl_irr = np.exp(intl_coef)
print(f'5. International Collaboration (IRR = {intl_irr:.3f}):')
print(f'   Firms with international R&D collaborations produce')
print(f'   {(intl_irr - 1) * 100:.1f}% more patents.')
print()

# Firm age
age_coef = nb_params[var_names.index('firm_age')]
print(f'6. Firm Age (coef = {age_coef:.4f}):')
print(f'   Each additional year of firm age is associated with a')
print(f'   {(np.exp(age_coef) - 1) * 100:.2f}% change in expected patents.')

In [None]:
# ============================================================
# Predictions: Poisson vs NB
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Add small jitter for visualization
jitter = np.random.uniform(-0.3, 0.3, len(y))

# 1. Poisson predictions
ax = axes[0]
mu_pois = np.exp(X @ poisson_result.params)
ax.scatter(mu_pois, y + jitter, alpha=0.05, s=5, color='coral')
# Cap the axes near the 99th percentile so extreme counts do not dominate the plot
max_val = min(np.percentile(y, 99), np.percentile(mu_pois, 99))
ax.plot([0, max_val], [0, max_val], 'k--', linewidth=2, label='Perfect fit')
ax.set_xlabel('Poisson Predicted', fontsize=12)
ax.set_ylabel('Observed Patents', fontsize=12)
ax.set_title('Poisson: Predicted vs Observed')
ax.set_xlim(0, max_val * 1.1)
ax.set_ylim(-1, min(df['patents'].quantile(0.99), max_val * 2))
ax.legend()
corr_pois = np.corrcoef(mu_pois, y)[0, 1]
ax.text(0.05, 0.95, f'Corr = {corr_pois:.3f}', transform=ax.transAxes,
        fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# 2. NB predictions (NB2 keeps the mean function exp(X @ beta); only the estimates differ)
ax = axes[1]
mu_nb = np.exp(X @ nb_result.params_exog)
ax.scatter(mu_nb, y + jitter, alpha=0.05, s=5, color='seagreen')
ax.plot([0, max_val], [0, max_val], 'k--', linewidth=2, label='Perfect fit')
ax.set_xlabel('NB Predicted', fontsize=12)
ax.set_ylabel('Observed Patents', fontsize=12)
ax.set_title('Negative Binomial: Predicted vs Observed')
ax.set_xlim(0, max_val * 1.1)
ax.set_ylim(-1, min(df['patents'].quantile(0.99), max_val * 2))
ax.legend()
corr_nb = np.corrcoef(mu_nb, y)[0, 1]
ax.text(0.05, 0.95, f'Corr = {corr_nb:.3f}', transform=ax.transAxes,
        fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'predictions_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print('Correlation with observed:')
print(f'  Poisson: {corr_pois:.4f}')
print(f'  NB:      {corr_nb:.4f}')

## Section 6: When to Use Each Model

### Decision Framework

| Criterion | Use Poisson | Use Negative Binomial |
|-----------|------------|----------------------|
| Overdispersion index | $\approx$ 1 | $\gg$ 1 |
| Cameron-Trivedi test | Not significant | Significant |
| LR test | Fail to reject Poisson | Reject Poisson |
| AIC/BIC | Lower for Poisson | Lower for NB |
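
The first two rows of the table can be sketched in a few lines of NumPy. This is a minimal illustration, not PanelBox's implementation: the function name `dispersion_diagnostics` is hypothetical, the data are simulated rather than taken from `firm_patents.csv`, and the true means stand in for fitted Poisson means. The test shown is the NB2 form of the Cameron-Trivedi auxiliary regression.

```python
import numpy as np


def dispersion_diagnostics(y, mu):
    """Return (Var/Mean index, alpha_hat, t-statistic) for count data.

    y  : observed counts
    mu : fitted Poisson means, e.g. np.exp(X @ beta_hat) (hypothetical input)
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)

    # Raw index: sample variance over sample mean (ignores covariates)
    index = y.var(ddof=1) / y.mean()

    # Auxiliary OLS regression without intercept (NB2 alternative):
    #   ((y - mu)^2 - y) / mu = alpha * mu + error
    z = ((y - mu) ** 2 - y) / mu
    alpha_hat = np.sum(mu * z) / np.sum(mu ** 2)

    # Conventional OLS t-statistic for H0: alpha = 0
    resid = z - alpha_hat * mu
    sigma2 = np.sum(resid ** 2) / (len(y) - 1)
    t_stat = alpha_hat / np.sqrt(sigma2 / np.sum(mu ** 2))
    return index, alpha_hat, t_stat


# Simulated illustration (not the firm_patents data); true means stand in for fitted ones
rng = np.random.default_rng(0)
n_sim = 5000
mu_sim = np.exp(1.0 + 0.5 * rng.standard_normal(n_sim))
y_equi = rng.poisson(mu_sim)                               # equidispersed: alpha = 0
y_over = rng.negative_binomial(1.0, 1.0 / (1.0 + mu_sim))  # NB2 with alpha = 1

for label, y_sim in [("Poisson data", y_equi), ("NB2 data", y_over)]:
    index, alpha_hat, t_stat = dispersion_diagnostics(y_sim, mu_sim)
    print(f"{label}: index={index:.2f}, alpha_hat={alpha_hat:.3f}, t={t_stat:.1f}")
```

Because the alternative is one-sided ($\alpha > 0$), equidispersion is rejected at the 5% level when the t-statistic exceeds 1.645. Note that the raw index can exceed 1 even for Poisson data when the conditional mean varies across observations, which is why the regression-based test conditions on $\mu$.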

### Practical Recommendations

1. **Always check overdispersion first** — compute Var/Mean ratio and run diagnostic tests
2. **If in doubt, use NB**: it nests Poisson as the limiting case $\alpha \to 0$, so the estimated $\alpha$ will be close to zero when the data are in fact equidispersed
3. **Use robust standard errors regardless**: they provide additional protection against misspecification
4. **Consider the data generating process** — think about why overdispersion might occur
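
One common data generating process behind overdispersion is unobserved multiplicative heterogeneity. The sketch below simulates it under assumed values (`mu = 4`, `alpha = 0.5`, both chosen only for illustration): each unit draws a Gamma-distributed factor with mean 1 and variance $\alpha$, and counts are Poisson conditional on that factor. The resulting counts should have variance close to $\mu + \alpha\mu^2$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
mu = 4.0      # conditional mean shared by all simulated units (assumed value)
alpha = 0.5   # NB2 dispersion parameter (assumed value)

# Unobserved multiplicative heterogeneity with mean 1 and variance alpha
nu = rng.gamma(shape=1.0 / alpha, scale=alpha, size=n)

# Conditional on nu, counts are Poisson with mean mu * nu
y_mix = rng.poisson(mu * nu)

# Benchmark: Poisson counts with no heterogeneity
y_pois = rng.poisson(mu, size=n)

print(f"Theoretical NB2 variance: {mu + alpha * mu**2:.2f}")
print(f"Mixture: mean={y_mix.mean():.2f}, var={y_mix.var():.2f}")
print(f"Poisson: mean={y_pois.mean():.2f}, var={y_pois.var():.2f}")
```

Both samples share the same mean, but only the mixture shows the extra quadratic variance term that NB2 is designed to capture.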

### When Poisson is Sufficient
- Equidispersion approximately holds (index close to 1)
- Diagnostic tests do not reject equidispersion
- Robust or cluster-robust standard errors are reported as an additional safeguard
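
To see why robust standard errors matter for Poisson, the following sketch fits a Poisson regression by Newton's method in plain NumPy and compares model-based with heteroskedasticity-robust (sandwich) standard errors. It is an illustration under stated assumptions, not PanelBox's estimator: the function name `poisson_fit_with_robust_se` is hypothetical, the data are simulated with NB2 overdispersion ($\alpha = 1$), and the sandwich shown is not clustered by firm.

```python
import numpy as np


def poisson_fit_with_robust_se(X, y, n_iter=50, tol=1e-10):
    """Poisson MLE via Newton's method, with model-based and sandwich SEs.

    Assumes the first column of X is a constant.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start at the intercept-only solution

    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        info = X.T @ (X * mu[:, None])  # negative Hessian of the log-likelihood
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break

    mu = np.exp(X @ beta)
    bread = np.linalg.inv(X.T @ (X * mu[:, None]))
    meat = X.T @ (X * ((y - mu) ** 2)[:, None])
    se_model = np.sqrt(np.diag(bread))                   # valid only under equidispersion
    se_robust = np.sqrt(np.diag(bread @ meat @ bread))   # sandwich estimator
    return beta, se_model, se_robust


# Simulated overdispersed counts: Poisson mixed with Gamma heterogeneity (alpha = 1)
rng = np.random.default_rng(1)
n_sim = 5000
X_sim = np.column_stack([np.ones(n_sim), rng.standard_normal(n_sim)])
mu_true = np.exp(X_sim @ np.array([1.0, 0.5]))
y_sim = rng.poisson(mu_true * rng.gamma(shape=1.0, scale=1.0, size=n_sim))

beta_hat, se_model, se_robust = poisson_fit_with_robust_se(X_sim, y_sim)
print("beta_hat :", np.round(beta_hat, 3))
print("model SE :", np.round(se_model, 4))
print("robust SE:", np.round(se_robust, 4))
```

The point estimates remain consistent because the conditional mean is correctly specified, but the model-based standard errors understate the sampling variability when the data are overdispersed.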

### When NB is Preferred
- Clear overdispersion (index $\gg$ 1)
- LR test rejects Poisson
- Better AIC/BIC for NB
- Theory suggests unobserved heterogeneity
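
The AIC/BIC comparison only needs each model's log-likelihood and parameter count. In the sketch below the helper `information_criteria` is hypothetical and the log-likelihood values are made-up placeholders, not results from `firm_patents.csv`. The point to notice is that NB carries one parameter more than Poisson, the dispersion parameter $\alpha$.

```python
import math


def information_criteria(llf, n_params, n_obs):
    """Return (AIC, BIC) from a maximized log-likelihood."""
    aic = 2 * n_params - 2 * llf
    bic = n_params * math.log(n_obs) - 2 * llf
    return aic, bic


# Placeholder values for illustration only
n_obs = 500
llf_poisson, k_poisson = -1250.0, 3
llf_nb, k_nb = -1100.0, 4  # same regressors plus alpha

aic_p, bic_p = information_criteria(llf_poisson, k_poisson, n_obs)
aic_nb, bic_nb = information_criteria(llf_nb, k_nb, n_obs)

print(f"Poisson: AIC={aic_p:.1f}, BIC={bic_p:.1f}")
print(f"NB:      AIC={aic_nb:.1f}, BIC={bic_nb:.1f}")
print("Preferred by AIC:", "NB" if aic_nb < aic_p else "Poisson")
print("Preferred by BIC:", "NB" if bic_nb < bic_p else "Poisson")
```

Lower values are better for both criteria; BIC penalizes the extra parameter more heavily as the sample grows.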

In [None]:
# ============================================================
# Model Selection Flowchart
# ============================================================

fig, ax = plt.subplots(figsize=(12, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Title
ax.text(5, 9.5, 'Count Data Model Selection', fontsize=16, fontweight='bold',
        ha='center', va='center')

# Decision boxes
box_props = dict(boxstyle='round,pad=0.5', facecolor='lightblue', edgecolor='navy', alpha=0.8)
decision_props = dict(boxstyle='round,pad=0.4', facecolor='lightyellow', edgecolor='orange', alpha=0.8)
result_props_green = dict(boxstyle='round,pad=0.4', facecolor='lightgreen', edgecolor='green', alpha=0.8)
result_props_red = dict(boxstyle='round,pad=0.4', facecolor='mistyrose', edgecolor='red', alpha=0.8)

# Start
ax.text(5, 8.5, 'Count outcome variable Y >= 0', fontsize=11,
        ha='center', va='center', bbox=box_props)

# Arrow
ax.annotate('', xy=(5, 7.6), xytext=(5, 8.1),
            arrowprops=dict(arrowstyle='->', lw=2))

# Decision 1: Check overdispersion
ax.text(5, 7.2, 'Compute Var(Y)/Mean(Y)\nand Cameron-Trivedi test', fontsize=10,
        ha='center', va='center', bbox=decision_props)

# Branches
ax.annotate('', xy=(2.5, 6.0), xytext=(4.0, 6.8),
            arrowprops=dict(arrowstyle='->', lw=2))
ax.text(3.0, 6.5, 'Index ~ 1', fontsize=9, ha='center', color='green')

ax.annotate('', xy=(7.5, 6.0), xytext=(6.0, 6.8),
            arrowprops=dict(arrowstyle='->', lw=2))
ax.text(7.0, 6.5, 'Index >> 1', fontsize=9, ha='center', color='red')

# Left branch: Poisson okay
ax.text(2.5, 5.5, 'Poisson model\n(with robust SEs)', fontsize=10,
        ha='center', va='center', bbox=result_props_green)

# Right branch: Fit NB
ax.text(7.5, 5.5, 'Fit Negative Binomial\n(NB2 model)', fontsize=10,
        ha='center', va='center', bbox=box_props)

# Arrow from NB
ax.annotate('', xy=(7.5, 4.4), xytext=(7.5, 5.0),
            arrowprops=dict(arrowstyle='->', lw=2))

# Decision 2: LR test
ax.text(7.5, 4.0, 'LR test: Poisson vs NB\nCompare AIC/BIC', fontsize=10,
        ha='center', va='center', bbox=decision_props)

# Branches from LR test
ax.annotate('', xy=(5.5, 2.8), xytext=(6.5, 3.5),
            arrowprops=dict(arrowstyle='->', lw=2))
ax.text(5.7, 3.3, 'Fail to\nreject', fontsize=9, ha='center', color='green')

ax.annotate('', xy=(9.0, 2.8), xytext=(8.5, 3.5),
            arrowprops=dict(arrowstyle='->', lw=2))
ax.text(9.0, 3.3, 'Reject\nPoisson', fontsize=9, ha='center', color='red')

# Results
ax.text(5.5, 2.3, 'Use Poisson\n(with robust SEs)', fontsize=10,
        ha='center', va='center', bbox=result_props_green)

ax.text(9.0, 2.3, 'Use NB model', fontsize=10,
        ha='center', va='center', bbox=result_props_red)

# Additional note
ax.text(5, 0.8, 'Note: Consider zero-inflated models if excess zeros are structural\n'
        'and panel FE/RE models for unobserved heterogeneity (see Notebooks 03, 05)',
        fontsize=9, ha='center', va='center', style='italic',
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.5))

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'model_selection_flowchart.png', dpi=300, bbox_inches='tight')
plt.show()

## Section 7: Summary

### Key Takeaways

1. **Overdispersion is common** in real count data — always check before trusting Poisson results
2. **NB2 relaxes equidispersion**: $\text{Var}[Y|\mathbf{X}] = \mu + \alpha\mu^2$
3. **$\alpha$ measures overdispersion**: as $\alpha \to 0$, NB2 converges to Poisson
4. **LR test** provides a formal comparison: Poisson (restricted) vs NB (unrestricted)
5. **NB is typically preferred** for highly dispersed count data like patent counts
6. **Coefficients have the same interpretation** as in Poisson (IRRs via $\exp(\beta)$), but NB standard errors account for overdispersion and are typically larger than Poisson MLE standard errors
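
One detail of the LR test in takeaway 4 is worth sketching. Under the null, $\alpha = 0$ lies on the boundary of the parameter space, so the usual $\chi^2(1)$ p-value is conservative. A common adjustment, discussed in Cameron and Trivedi (2013), treats the null distribution as a 50:50 mixture of a point mass at zero and $\chi^2(1)$, which halves the p-value. The helper below is hypothetical and uses only the standard library; whether PanelBox's `lr_test_poisson()` applies this adjustment is not assumed here.

```python
import math


def lr_test_poisson_vs_nb(llf_poisson, llf_nb):
    """LR statistic with naive and boundary-adjusted p-values for H0: alpha = 0."""
    # The statistic cannot be negative when Poisson is nested in NB
    lr_stat = max(0.0, 2.0 * (llf_nb - llf_poisson))

    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2))
    p_chi2_1 = math.erfc(math.sqrt(lr_stat / 2.0))

    # Boundary adjustment: half the chi-square(1) tail probability
    p_boundary = 0.5 * p_chi2_1 if lr_stat > 0 else 1.0
    return lr_stat, p_chi2_1, p_boundary


# Placeholder log-likelihoods chosen so that the LR statistic is about 3.84
lr_stat, p_naive, p_adj = lr_test_poisson_vs_nb(-100.0, -100.0 + 3.8414588 / 2)
print(f"LR = {lr_stat:.3f}, chi2(1) p = {p_naive:.4f}, boundary-adjusted p = {p_adj:.4f}")
```

With severe overdispersion, as in the patent data, both p-values will be essentially zero and the adjustment makes no practical difference; it matters mainly in borderline cases.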

### PanelBox Workflow

```python
# Step 1: Check overdispersion
poisson = PooledPoisson(y, X, entity_id=firms, time_id=years)
pois_result = poisson.fit(se_type='cluster')
# Check overdispersion index, run Cameron-Trivedi test

# Step 2: Fit NB model
nb = NegativeBinomial(y, X, entity_id=firms, time_id=years)
nb_result = nb.fit()

# Step 3: Compare models
lr_test = nb_result.lr_test_poisson()
# Compare AIC, BIC

# Step 4: Interpret
irr = np.exp(nb_result.params_exog)  # Incidence Rate Ratios
```

### What's Next?

- **Notebook 03**: Fixed and Random Effects for panel count data — addressing unobserved firm heterogeneity
- **Notebook 05**: Zero-inflated models — when excess zeros have a structural explanation

## Exercises

### Exercise 1: Sensitivity to Specification
Re-estimate the NB model using only `log_rd` and `log_emp` as regressors. Does the dispersion parameter $\alpha$ change? What does this tell you about omitted variable bias vs. inherent overdispersion?

### Exercise 2: Predicted Probabilities
Using the NB model, compute $P(Y = 0)$ for two types of firms: (a) a small non-tech firm with low R&D, and (b) a large tech firm with high R&D. Compare with the Poisson predictions. Which model gives more realistic zero probabilities?

### Exercise 3: Subset Analysis
Split the data into tech-sector and non-tech firms. Estimate separate NB models for each group. Do the overdispersion parameters differ? What does this suggest about the sources of overdispersion?

### Exercise 4: Robust Standard Errors
The NB model uses MLE standard errors by default. Compute cluster-robust standard errors (by firm) and compare. Are the conclusions affected?

In [None]:
# ============================================================
# Exercise Solutions (template)
# ============================================================

# Exercise 1: Sensitivity to specification
# Hint:
# X_reduced = sm.add_constant(df[['log_rd', 'log_emp']].values)
# nb_reduced = NegativeBinomial(endog=y, exog=X_reduced)
# nb_reduced_result = nb_reduced.fit()
# print(f'Alpha (full model): {nb_result.alpha:.4f}')
# print(f'Alpha (reduced model): {nb_reduced_result.alpha:.4f}')

print('Complete the exercises above to deepen your understanding!')
print('Solutions are available in the solutions notebook.')