# Complete Case Study: Innovation and Patents
## Integrating All Count Model Techniques

### Learning Objectives

1. Apply a complete count data modeling workflow to a real research question
2. Systematically compare multiple model specifications
3. Conduct rigorous model selection using statistical tests
4. Compute and interpret marginal effects for policy analysis
5. Perform robustness checks and sensitivity analysis
6. Present results in publication-quality format
7. Draw substantive conclusions for innovation policy

### Duration
90-120 minutes (comprehensive case study)

### Prerequisites
**ALL previous notebooks (01-06) must be completed:**
- 01: Poisson Introduction
- 02: Negative Binomial
- 03: Fixed/Random Effects Count
- 04: PPML Gravity
- 05: Zero-Inflated Models
- 06: Marginal Effects

### Dataset
**Firm Innovation** (`firm_innovation_full.csv`): Manufacturing firms' patent activity.
- N = 500 firms x T = 8 years (2012-2019) = 4,000 observations
- Outcome: `patents` (count, 0-35)
- Key predictors: `rd_intensity`, `firm_size`, `capital_intensity`, `industry`, `year`
- Characteristics: ~39% zeros, severe overdispersion (Var/Mean ~ 14.4)

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Statistical libraries
from scipy import stats
import statsmodels.api as sm

# PanelBox imports
from panelbox.models.count import (
    PooledPoisson,
    PoissonFixedEffects,
    NegativeBinomial,
    ZeroInflatedPoisson,
    ZeroInflatedNegativeBinomial,
)
from panelbox.marginal_effects.count_me import (
    compute_poisson_ame,
    compute_negbin_ame,
)

# Visualization configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

np.random.seed(303)

# Paths (relative to notebook location in examples/count/notebooks/)
BASE_DIR = Path('..')
DATA_DIR = BASE_DIR / 'data'
OUTPUT_DIR = BASE_DIR / 'outputs'
FIGURES_DIR = OUTPUT_DIR / 'figures' / '07_case_study'
TABLES_DIR = OUTPUT_DIR / 'tables' / '07_case_study'

FIGURES_DIR.mkdir(parents=True, exist_ok=True)
TABLES_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete!')
print(f'Data directory: {DATA_DIR}')
print(f'Figures directory: {FIGURES_DIR}')
print(f'Tables directory: {TABLES_DIR}')

---

## Section 1: Research Question and Context (10 min)

### Central Question

*What are the determinants of firm-level innovation, and how effective are R&D investments in generating patents?*

### Policy Relevance
- **R&D tax credits**: Are they justified by the R&D-patent relationship?
- **Firm size and innovation**: Do small firms need special support?
- **Industry differences**: Should policies be targeted or broad-based?

### Literature Context
- Griliches (1990): Knowledge production functions
- Hall, Griliches & Hausman (1986): Patents and R&D relationship
- Aghion et al. (2005): Competition and innovation

### Economic Framework

Knowledge Production Function:
$$\text{Patents}_{it} = f(\text{R\&D}_{it}, \text{Size}_{it}, \text{Industry}_i, \text{Time}_t, \text{Firm Effect}_i)$$
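
One common way to make this function estimable, and the form implied by the exponential conditional mean used by the count models in this notebook, is a log-linear specification. The coefficient symbols ($\beta$, $\gamma$, $\delta$, $\alpha_i$) are notational assumptions introduced here for illustration:

$$E[\text{Patents}_{it} \mid \mathbf{x}_{it}, \alpha_i] = \exp\left(\beta_1\,\text{R\&D}_{it} + \beta_2\,\text{Size}_{it} + \gamma_{\text{Industry}_i} + \delta_t + \alpha_i\right)$$

Under this form, a one-unit increase in R&D intensity multiplies expected patents by $\exp(\beta_1)$, which is approximately a $100\,\beta_1$ percent change when $\beta_1$ is small.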

In [None]:
# Load the dataset
df = pd.read_csv(DATA_DIR / 'firm_innovation_full.csv')

print(f'Dataset shape: {df.shape}')
print(f'Firms: {df["firm_id"].nunique()}')
print(f'Years: {df["year"].min()}-{df["year"].max()}')
print(f'\nFirst few rows:')
display(df.head(10))

print(f'\nVariable types:')
print(df.dtypes)

In [None]:
# Table 01: Descriptive statistics
desc_stats = df.describe().T
desc_stats['zeros'] = (df == 0).sum()
desc_stats['pct_zeros'] = (df == 0).mean() * 100

print('Table 01: Descriptive Statistics')
print('=' * 80)
display(desc_stats[['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']])

# Key statistics
print(f'\nKey Facts:')
print(f'  Mean patents: {df["patents"].mean():.2f}')
print(f'  Variance patents: {df["patents"].var():.2f}')
print(f'  Zero patents: {(df["patents"]==0).mean():.1%}')
print(f'  Overdispersion index (Var/Mean): {df["patents"].var()/df["patents"].mean():.2f}')
print(f'  Mean R&D intensity: {df["rd_intensity"].mean():.2f}%')

desc_stats.to_csv(TABLES_DIR / 'table_01_descriptive_stats.csv')

In [None]:
# Figure: Patent distribution (highly skewed)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Panel 1: Patent distribution
max_val = min(df['patents'].max(), 30)
axes[0].hist(df['patents'], bins=np.arange(-0.5, max_val + 1.5, 1),
             alpha=0.7, color='steelblue', edgecolor='black')
axes[0].axvline(df['patents'].mean(), color='red', linestyle='--', linewidth=2,
                label=f'Mean = {df["patents"].mean():.1f}')
axes[0].set_xlabel('Number of Patents', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Patents\n(Highly Right-Skewed)', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(alpha=0.3)

# Panel 2: R&D distribution
axes[1].hist(df['rd_intensity'], bins=40, alpha=0.7, color='coral', edgecolor='black')
axes[1].axvline(df['rd_intensity'].mean(), color='red', linestyle='--', linewidth=2,
                label=f'Mean = {df["rd_intensity"].mean():.1f}%')
axes[1].set_xlabel('R&D Intensity (%)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of R&D Intensity', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)

# Panel 3: Correlation heatmap
corr_vars = ['patents', 'rd_intensity', 'firm_size', 'capital_intensity', 'export_share', 'hhi']
corr_matrix = df[corr_vars].corr()
im = axes[2].imshow(corr_matrix, cmap='RdBu_r', vmin=-1, vmax=1, aspect='auto')
axes[2].set_xticks(range(len(corr_vars)))
axes[2].set_xticklabels(corr_vars, rotation=45, ha='right', fontsize=9)
axes[2].set_yticks(range(len(corr_vars)))
axes[2].set_yticklabels(corr_vars, fontsize=9)
for i in range(len(corr_vars)):
    for j in range(len(corr_vars)):
        axes[2].text(j, i, f'{corr_matrix.iloc[i,j]:.2f}', ha='center', va='center', fontsize=8)
axes[2].set_title('Correlation Matrix', fontsize=13, fontweight='bold')
plt.colorbar(im, ax=axes[2], shrink=0.8)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'patents_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Figure: Patents vs R&D and Zeros by industry
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Patents vs R&D scatter
axes[0].scatter(df['rd_intensity'], df['patents'], alpha=0.15, s=15, color='steelblue')
rd_bins = pd.qcut(df['rd_intensity'], 20, duplicates='drop')
rd_means = df.groupby(rd_bins, observed=True)['patents'].mean()
rd_centers = df.groupby(rd_bins, observed=True)['rd_intensity'].mean()
axes[0].plot(rd_centers.values, rd_means.values, 'r-o', linewidth=2, markersize=5,
             label='Binned mean')
axes[0].set_xlabel('R&D Intensity (%)', fontsize=12)
axes[0].set_ylabel('Patents', fontsize=12)
axes[0].set_title('Patents vs R&D Intensity', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(alpha=0.3)

# Panel 2: Zeros by industry
industry_names = {1: 'Chem/Pharma', 2: 'Electronics', 3: 'Automotive',
                  4: 'Machinery', 5: 'Food/Bev', 6: 'Textiles',
                  7: 'Metals', 8: 'Other Mfg'}
df['industry_name'] = df['industry'].map(industry_names)

zeros_by_ind = df.groupby('industry_name')['patents'].apply(
    lambda x: (x == 0).mean() * 100
).sort_values()

bars = axes[1].barh(range(len(zeros_by_ind)), zeros_by_ind.values,
                    color='coral', edgecolor='black', alpha=0.7)
axes[1].set_yticks(range(len(zeros_by_ind)))
axes[1].set_yticklabels(zeros_by_ind.index)
axes[1].set_xlabel('% Zero Patents', fontsize=12)
axes[1].set_title('Zero Patents by Industry', fontsize=13, fontweight='bold')
axes[1].axvline(df['patents'].eq(0).mean() * 100, color='red', linestyle='--',
                linewidth=2, label=f'Overall: {(df["patents"]==0).mean():.0%}')
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'patents_vs_rd.png', dpi=300, bbox_inches='tight')
plt.savefig(FIGURES_DIR / 'zeros_by_industry.png', dpi=300, bbox_inches='tight')
plt.show()

---

## Section 2: Exploratory Data Analysis (15 min)

### Goals
1. Quantify overdispersion
2. Detect excess zeros
3. Examine panel structure
4. Check for outliers
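
As a rough illustration of goals 1 and 2, the sketch below simulates an overdispersed count variable and compares its variance-to-mean ratio and share of zeros against the Poisson benchmark $P(y=0)=e^{-\lambda}$. The simulated data and parameter values are made up for illustration; they are not the firm dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical overdispersed counts: Poisson draws whose rate follows a gamma
# distribution (a gamma-Poisson mixture is negative binomial)
rate = rng.gamma(shape=0.5, scale=4.0, size=20_000)  # mean rate = 2.0
y_sim = rng.poisson(rate)

mean_sim = y_sim.mean()
var_sim = y_sim.var(ddof=1)
dispersion_index = var_sim / mean_sim        # equals 1 under a Poisson model

observed_zero_share = (y_sim == 0).mean()
poisson_zero_share = np.exp(-mean_sim)       # P(y = 0) for a Poisson with the same mean

print(f"Var/Mean: {dispersion_index:.2f}")
print(f"Observed zeros: {observed_zero_share:.1%}, "
      f"Poisson benchmark: {poisson_zero_share:.1%}")
```

Note that overdispersion alone already produces more zeros than the Poisson benchmark, so this check does not by itself separate zero inflation from unobserved heterogeneity.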

In [None]:
# 2.1 Overdispersion Analysis
mean_patents = df['patents'].mean()
var_patents = df['patents'].var()
overdispersion_index = var_patents / mean_patents

print('2.1 Overdispersion Analysis')
print('=' * 50)
print(f'Mean(patents):          {mean_patents:.2f}')
print(f'Variance(patents):      {var_patents:.2f}')
print(f'Overdispersion index:   {overdispersion_index:.2f}')
print(f'\nConclusion: Severe overdispersion (Var/Mean = {overdispersion_index:.1f} >> 1)')
print(f'=> Poisson is likely misspecified. Negative Binomial needed.')

# By industry
print(f'\nOverdispersion by Industry:')
for ind_name, group in df.groupby('industry_name'):
    m = group['patents'].mean()
    v = group['patents'].var()
    if m > 0:
        print(f'  {ind_name:20s}: Mean={m:.2f}, Var={v:.2f}, Var/Mean={v/m:.2f}')

In [None]:
# 2.2 Zero-Inflation Check
observed_zero_pct = (df['patents'] == 0).mean()

# Predicted P(y=0) under Poisson with unconditional mean
lambda_prelim = mean_patents
poisson_predicted_zero = np.exp(-lambda_prelim)

print('2.2 Zero-Inflation Check')
print('=' * 50)
print(f'Observed % zeros:     {observed_zero_pct:.1%}')
print(f'Poisson predicted:    {poisson_predicted_zero:.1%} (using overall mean)')
print(f'Excess zeros:         {observed_zero_pct - poisson_predicted_zero:.1%}')
print(f'\nConclusion: Substantial excess zeros ({observed_zero_pct:.0%} vs {poisson_predicted_zero:.0%})')
print(f'=> Zero-inflated models (ZIP/ZINB) likely appropriate.')

In [None]:
# Figure: Overdispersion and zero-inflation diagnostics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Variance-mean by industry
ind_stats = df.groupby('industry_name')['patents'].agg(['mean', 'var'])
axes[0].scatter(ind_stats['mean'], ind_stats['var'], s=100, zorder=5,
                color='steelblue', edgecolors='black')
max_val = max(ind_stats['mean'].max(), ind_stats['var'].max()) * 1.1
axes[0].plot([0, max_val], [0, max_val], 'r--', linewidth=2, label='Var = Mean (Poisson)')
for idx, row in ind_stats.iterrows():
    axes[0].annotate(idx, (row['mean'], row['var']), fontsize=8,
                     xytext=(5, 5), textcoords='offset points')
axes[0].set_xlabel('Mean', fontsize=12)
axes[0].set_ylabel('Variance', fontsize=12)
axes[0].set_title('Variance-Mean Plot by Industry\n(All above Poisson line)', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(alpha=0.3)

# Panel 2: Observed vs Poisson-predicted distribution
max_count = min(int(df['patents'].quantile(0.98)), 25)
count_range = np.arange(0, max_count + 1)
observed_freq = np.array([(df['patents'] == k).sum() for k in count_range])
poisson_freq = np.array([len(df) * stats.poisson.pmf(k, mean_patents) for k in count_range])

width = 0.35
axes[1].bar(count_range - width/2, observed_freq, width, label='Observed',
            color='steelblue', edgecolor='black', alpha=0.7)
axes[1].bar(count_range + width/2, poisson_freq, width, label='Poisson Predicted',
            color='coral', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Patent Count', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Observed vs Poisson-Predicted\n(Excess zeros visible)', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'overdispersion_check.png', dpi=300, bbox_inches='tight')
plt.savefig(FIGURES_DIR / 'zero_inflation_check.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# 2.3 Panel Structure
print('2.3 Panel Structure')
print('=' * 50)
print(f'Firms: {df["firm_id"].nunique()}')
print(f'Years: {df["year"].nunique()} ({df["year"].min()}-{df["year"].max()})')
print(f'Obs per firm: {df.groupby("firm_id").size().describe()[["min","max","mean"]]}')
print(f'\nPanel: {"Balanced" if df.groupby("firm_id").size().nunique() == 1 else "Unbalanced"}')

# Within vs between variance
overall_var = df['patents'].var()
between_var = df.groupby('firm_id')['patents'].mean().var()
within_var = df.groupby('firm_id')['patents'].apply(lambda x: x.var()).mean()

print(f'\nVariance Decomposition:')
print(f'  Overall variance:  {overall_var:.2f}')
print(f'  Between variance:  {between_var:.2f} ({between_var/overall_var:.0%})')
print(f'  Within variance:   {within_var:.2f} ({within_var/overall_var:.0%})')
print(f'\n=> Most variation is between firms, suggesting firm fixed effects are important.')

In [None]:
# Figure: Panel structure
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Patents over time
yearly = df.groupby('year')['patents'].agg(['mean', 'std', 'sum', 'count'])
yearly_se = yearly['std'] / np.sqrt(yearly['count'])  # standard error of each yearly mean
axes[0].plot(yearly.index, yearly['mean'], 'o-', linewidth=2, markersize=8, color='steelblue')
axes[0].fill_between(yearly.index,
                     yearly['mean'] - 1.96 * yearly_se,  # normal-approximation 95% CI
                     yearly['mean'] + 1.96 * yearly_se,
                     alpha=0.2, color='steelblue')
axes[0].set_xlabel('Year', fontsize=12)
axes[0].set_ylabel('Mean Patents', fontsize=12)
axes[0].set_title('Average Patents Over Time\n(with 95% CI)', fontsize=13, fontweight='bold')
axes[0].grid(alpha=0.3)

# Panel 2: Firm-level time series (10 random firms)
np.random.seed(42)
sample_firms = np.random.choice(df['firm_id'].unique(), 10, replace=False)
for fid in sample_firms:
    firm_data = df[df['firm_id'] == fid].sort_values('year')
    axes[1].plot(firm_data['year'], firm_data['patents'], marker='o', alpha=0.6,
                 label=f'Firm {fid}')
axes[1].set_xlabel('Year', fontsize=12)
axes[1].set_ylabel('Patents', fontsize=12)
axes[1].set_title('Patent Trends for Selected Firms', fontsize=13, fontweight='bold')
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'panel_structure_plot.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# 2.4 Outlier Detection
print('2.4 Outlier Detection')
print('=' * 50)

top_firms = df.groupby('firm_id')['patents'].mean().nlargest(10)
print('Top 10 Patent Firms (by mean patents):')
for fid, mean_pat in top_firms.items():
    firm_data = df[df['firm_id'] == fid]
    print(f'  Firm {fid}: mean={mean_pat:.1f}, max={firm_data["patents"].max()}, '
          f'R&D={firm_data["rd_intensity"].mean():.1f}%')

p99 = df['patents'].quantile(0.99)
n_outliers = (df['patents'] > p99).sum()
mean_without = df[df['patents'] <= p99]['patents'].mean()
print(f'\nTop 1% threshold: {p99:.0f} patents')
print(f'N above threshold: {n_outliers}')
print(f'Mean with outliers: {mean_patents:.2f}')
print(f'Mean without top 1%: {mean_without:.2f}')

# Save EDA summary
eda_summary = pd.DataFrame({
    'Metric': ['Mean', 'Variance', 'Var/Mean', '% Zeros', 'Between Var', 'Within Var', 'P99'],
    'Value': [mean_patents, var_patents, overdispersion_index,
              observed_zero_pct * 100, between_var, within_var, p99]
})
eda_summary.to_csv(TABLES_DIR / 'table_02_eda_summary.csv', index=False)
print('\nEDA summary saved.')

---

## Section 3: Model Estimation - Full Comparison (30 min)

We estimate five count models and compare them systematically:

1. **Pooled Poisson** (baseline, likely misspecified)
2. **Pooled Negative Binomial** (handles overdispersion)
3. **FE Poisson** (controls firm heterogeneity)
4. **Zero-Inflated Poisson** (handles excess zeros)
5. **Zero-Inflated Negative Binomial** (both issues)

### Common Specification

- **Count variables**: `rd_intensity`, `firm_size`, `capital_intensity`
- **Controls**: year dummies
- **Inflate variables** (ZIP/ZINB): `industry`, `firm_size`
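
To make the two-part structure of the zero-inflated models concrete, here is a minimal sketch of the zero-inflated NB2 probability mass function, using SciPy's `nbinom` with the conversion $r = 1/\alpha$, $p = r/(r+\mu)$. The function name `zinb_pmf` and the parameter values are illustrative assumptions, not part of PanelBox.

```python
import numpy as np
from scipy import stats

def zinb_pmf(k, mu, alpha, pi):
    """P(Y = k) for a zero-inflated NB2: point mass at zero with prob. pi, else NB(mu, alpha)."""
    k = np.asarray(k)
    r = 1.0 / alpha              # NB "size" parameter
    p = r / (r + mu)             # SciPy's success probability
    nb_part = stats.nbinom.pmf(k, r, p)
    return pi * (k == 0) + (1.0 - pi) * nb_part

# Hypothetical parameter values
mu, alpha, pi = 3.0, 1.5, 0.25
support = np.arange(0, 2000)
probs = zinb_pmf(support, mu, alpha, pi)

total_prob = probs.sum()                   # should be close to 1
mean_from_pmf = (support * probs).sum()    # should be close to (1 - pi) * mu

print(f"Sum of pmf: {total_prob:.6f}")
print(f"Mean from pmf: {mean_from_pmf:.4f} vs (1 - pi) * mu = {(1 - pi) * mu:.4f}")
```

Setting `pi = 0` recovers the plain negative binomial, which is why the inflation part can be read as an add-on to the count part rather than a separate model.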

In [None]:
# Prepare variables for estimation
y = df['patents'].values
n_obs = len(y)

# Continuous regressors
X_vars = ['rd_intensity', 'firm_size', 'capital_intensity']

# Year dummies (first year is reference)
year_dummies = pd.get_dummies(df['year'], prefix='year', drop_first=True, dtype=float)
year_dummy_names = list(year_dummies.columns)

# Industry dummies (industry 1 is reference)
industry_dummies = pd.get_dummies(df['industry'], prefix='ind', drop_first=True, dtype=float)
industry_dummy_names = list(industry_dummies.columns)

# For pooled models: const + X_vars + industry dummies + year dummies
X_pooled_raw = pd.concat([df[X_vars], industry_dummies, year_dummies], axis=1)
X_pooled = sm.add_constant(X_pooled_raw.values)
pooled_names = ['const'] + X_vars + industry_dummy_names + year_dummy_names

# For FE model: const + X_vars + year dummies (no industry, absorbed by FE)
X_fe_raw = pd.concat([df[X_vars], year_dummies], axis=1)
X_fe = sm.add_constant(X_fe_raw.values)
fe_names = ['const'] + X_vars + year_dummy_names

# For zero-inflated count part: const + X_vars + year dummies
X_count = sm.add_constant(pd.concat([df[X_vars], year_dummies], axis=1).values)
count_names = ['const'] + X_vars + year_dummy_names

# For zero-inflated inflate part: const + industry dummies + firm_size
X_inflate = sm.add_constant(pd.concat([industry_dummies, df[['firm_size']]], axis=1).values)
inflate_names = ['const_z'] + [f'{n}_z' for n in industry_dummy_names] + ['firm_size_z']

# Helper: compute AIC/BIC from log-likelihood
def compute_ic(llf, k, n):
    """Compute AIC and BIC from log-likelihood."""
    aic = -2 * llf + 2 * k
    bic = -2 * llf + k * np.log(n)
    return aic, bic

# Helper: get llf/aic/bic for any model result
def get_model_stats(result, model=None, n=None):
    """Extract llf, aic, bic from any count model result."""
    # Get llf
    if hasattr(result, 'llf'):
        llf = result.llf
    elif model is not None and hasattr(model, 'llf'):
        llf = model.llf
    else:
        llf = np.nan

    # Get aic/bic
    if hasattr(result, 'aic') and hasattr(result, 'bic'):
        aic, bic = result.aic, result.bic
    else:
        # Compute manually - use len(result.params) for total param count
        k = len(result.params)
        nn = n if n else (model.n_obs if model and hasattr(model, 'n_obs') else k)
        if not np.isnan(llf):
            aic, bic = compute_ic(llf, k, nn)
        else:
            aic, bic = np.nan, np.nan

    return llf, aic, bic

print('Design Matrix Dimensions:')
print(f'  Pooled models:    {X_pooled.shape} ({len(pooled_names)} vars)')
print(f'  FE model:         {X_fe.shape} ({len(fe_names)} vars)')
print(f'  ZI count part:    {X_count.shape} ({len(count_names)} vars)')
print(f'  ZI inflate part:  {X_inflate.shape} ({len(inflate_names)} vars)')

In [None]:
# 3.1 Pooled Poisson
print('3.1 Pooled Poisson')
print('=' * 60)

pois_pool = PooledPoisson(endog=y, exog=X_pooled)
pois_pool_result = pois_pool.fit(se_type='robust')
pois_pool.exog_names = pooled_names

pois_table = pd.DataFrame({
    'Variable': pooled_names,
    'Coef': pois_pool_result.params,
    'SE': pois_pool_result.se,
    'p-value': pois_pool_result.pvalues,
})
display(pois_table)
pois_table.to_csv(TABLES_DIR / 'table_03_pooled_poisson.csv', index=False)

pois_llf, pois_aic, pois_bic = get_model_stats(pois_pool_result, model=pois_pool, n=n_obs)
print(f'\nLog-Lik: {pois_llf:.2f}, AIC: {pois_aic:.2f}, BIC: {pois_bic:.2f}')

In [None]:
# 3.2 Pooled Negative Binomial
print('3.2 Pooled Negative Binomial')
print('=' * 60)

nb_start = np.append(pois_pool_result.params, np.log(0.5))
nb_pool = NegativeBinomial(endog=y, exog=X_pooled, entity_id=df['firm_id'].values)
nb_pool_result = nb_pool.fit(start_params=nb_start)
nb_pool.exog_names = pooled_names

nb_table = pd.DataFrame({
    'Variable': pooled_names + ['log_alpha'],
    'Coef': nb_pool_result.params,
    'SE': nb_pool_result.se,
    'p-value': nb_pool_result.pvalues,
})
display(nb_table)
nb_table.to_csv(TABLES_DIR / 'table_04_pooled_nb.csv', index=False)

nb_llf, nb_aic, nb_bic = get_model_stats(nb_pool_result, model=nb_pool, n=n_obs)
print(f'\nAlpha (dispersion): {nb_pool_result.alpha:.4f}')
print(f'Log-Lik: {nb_llf:.2f}, AIC: {nb_aic:.2f}, BIC: {nb_bic:.2f}')

In [None]:
# 3.3 Fixed Effects Poisson (Subsample)
print('3.3 Fixed Effects Poisson')
print('=' * 60)
print()
print('NOTE: FE Poisson uses conditional MLE which is computationally')
print('expensive for firms with high patent counts. We estimate on a')
print('subsample of firms with max annual patents <= 5.')
print()

# Select firms with low patent counts for tractable FE estimation
firm_max_patents = df.groupby('firm_id')['patents'].max()
low_count_firms = firm_max_patents[firm_max_patents <= 5].index.values
df_fe = df[df['firm_id'].isin(low_count_firms)].copy()
y_fe = df_fe['patents'].values

# Rebuild design matrix for FE subsample
year_dum_fe = pd.get_dummies(df_fe['year'], prefix='year', drop_first=True, dtype=float)
X_fe_sub = sm.add_constant(pd.concat([df_fe[X_vars], year_dum_fe], axis=1).values)

print(f'Full sample: {df["firm_id"].nunique()} firms')
print(f'FE subsample: {len(low_count_firms)} firms (max patents/year <= 5)')
print(f'Observations: {len(df_fe)}')
print()

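import signal

# Custom exception, named to avoid shadowing Python's built-in TimeoutError
class EstimationTimeout(Exception):
    pass

# Note: signal.SIGALRM is only available on Unix-like systems
def timeout_handler(signum, frame):
    raise EstimationTimeout("FE Poisson estimation timed out")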

fe_pois = PoissonFixedEffects(
    endog=y_fe,
    exog=X_fe_sub,
    entity_id=df_fe['firm_id'].values,
    time_id=df_fe['year'].values
)

try:
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(90)  # 90 second timeout
    fe_pois_result = fe_pois.fit()
    signal.alarm(0)  # Cancel alarm
    fe_pois.exog_names = fe_names
    
    fe_table = pd.DataFrame({
        'Variable': fe_names,
        'Coef': fe_pois_result.params,
        'SE': fe_pois_result.se,
        'p-value': fe_pois_result.pvalues,
    })
    display(fe_table)
    fe_table.to_csv(TABLES_DIR / 'table_05_fe_poisson.csv', index=False)
    
    fe_llf = getattr(fe_pois_result, 'llf', None) or getattr(fe_pois, 'llf', None) or np.nan
    fe_aic, fe_bic = compute_ic(fe_llf, len(fe_pois_result.params), len(y_fe)) if not np.isnan(fe_llf) else (np.nan, np.nan)
    print(f'\nLog-Lik: {fe_llf}')
    print(f'Entities dropped (all zeros): {getattr(fe_pois_result, "n_dropped", "N/A")}')
    fe_estimated = True
    
except Exception as e:
    signal.alarm(0)
    print(f'FE Poisson estimation failed: {e}')
    print('Continuing without FE Poisson results.')
    fe_pois_result = None
    fe_llf, fe_aic, fe_bic = np.nan, np.nan, np.nan
    fe_estimated = False

In [None]:
# 3.4 Zero-Inflated Poisson (ZIP)
print('3.4 Zero-Inflated Poisson (ZIP)')
print('=' * 60)

zip_model = ZeroInflatedPoisson(
    endog=y,
    exog_count=X_count,
    exog_inflate=X_inflate,
    exog_count_names=count_names,
    exog_inflate_names=inflate_names,
)
zip_result = zip_model.fit()

print(zip_result.summary(
    count_names=count_names,
    inflate_names=inflate_names,
))

zip_table = pd.DataFrame({
    'Component': ['Count'] * len(count_names) + ['Inflate'] * len(inflate_names),
    'Variable': count_names + inflate_names,
    'Coef': np.concatenate([zip_result.params_count, zip_result.params_inflate]),
    'SE': np.concatenate([zip_result.bse_count, zip_result.bse_inflate]),
})
zip_table.to_csv(TABLES_DIR / 'table_06_zip.csv', index=False)
print(f'\nAIC: {zip_result.aic:.2f}, BIC: {zip_result.bic:.2f}')
print(f'Actual zeros: {zip_result.actual_zeros}, Predicted zeros: {zip_result.predicted_zeros:.0f}')

In [None]:
# 3.5 Zero-Inflated Negative Binomial (ZINB)
print('3.5 Zero-Inflated Negative Binomial (ZINB)')
print('=' * 60)

zinb_model = ZeroInflatedNegativeBinomial(
    endog=y,
    exog_count=X_count,
    exog_inflate=X_inflate,
    exog_count_names=count_names,
    exog_inflate_names=inflate_names,
)
zinb_result = zinb_model.fit()

print(zinb_result.summary(
    count_names=count_names,
    inflate_names=inflate_names,
))

zinb_table = pd.DataFrame({
    'Component': ['Count'] * len(count_names) + ['Inflate'] * len(inflate_names) + ['Dispersion'],
    'Variable': count_names + inflate_names + ['alpha'],
    'Coef': np.concatenate([zinb_result.params_count, zinb_result.params_inflate, [zinb_result.alpha]]),
    'SE': np.concatenate([zinb_result.bse_count, zinb_result.bse_inflate, [zinb_result.bse_alpha]]),
})
zinb_table.to_csv(TABLES_DIR / 'table_07_zinb.csv', index=False)
print(f'\nAlpha (dispersion): {zinb_result.alpha:.4f}')
print(f'AIC: {zinb_result.aic:.2f}, BIC: {zinb_result.bic:.2f}')
print(f'Actual zeros: {zinb_result.actual_zeros}, Predicted zeros: {zinb_result.predicted_zeros:.0f}')

In [None]:
# 3.6 Store all results for comparison
results_dict = {
    'Pooled Poisson': pois_pool_result,
    'Pooled NB': nb_pool_result,
    'ZIP': zip_result,
    'ZINB': zinb_result,
}
models_dict = {
    'Pooled Poisson': pois_pool,
    'Pooled NB': nb_pool,
    'ZIP': None,
    'ZINB': None,
}

if fe_estimated:
    results_dict['FE Poisson'] = fe_pois_result
    models_dict['FE Poisson'] = fe_pois

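# Compute stats for all models
model_stats = {}
for name, result in results_dict.items():
    # FE Poisson is fit on a subsample, so its BIC must use that sample size
    n_model = len(y_fe) if name == 'FE Poisson' else n_obs
    llf, aic, bic = get_model_stats(result, model=models_dict.get(name), n=n_model)
    model_stats[name] = {'llf': llf, 'aic': aic, 'bic': bic}

print(f'{len(model_stats)} models stored for comparison:')
for name, stats_d in model_stats.items():
    print(f'  {name:20s}: Log-Lik = {stats_d["llf"]:>10.2f}, AIC = {stats_d["aic"]:>10.2f}, BIC = {stats_d["bic"]:>10.2f}')
if fe_estimated:
    print('Note: FE Poisson is estimated on a subsample, so its Log-Lik/AIC/BIC')
    print('are not directly comparable with the full-sample models.')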

---

## Section 4: Model Selection and Testing (15 min)

### Testing Strategy
1. **Overdispersion test**: Poisson vs NB (Cameron-Trivedi + LR test)
2. **Vuong test / AIC**: ZIP vs Poisson (Vuong statistic where available, otherwise AIC), ZINB vs NB (AIC)
3. **AIC/BIC comparison**: Across all models
4. **Model fit**: Predicted vs actual distributions
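
The Vuong statistic for non-nested models is built from pointwise log-likelihood differences. The sketch below is a generic, uncorrected version run on simulated data; `vuong_test` is a hypothetical helper, not a PanelBox function, and the ZIP log-likelihood is evaluated at the simulated parameters rather than at fitted values to keep the example short.

```python
import numpy as np
from scipy import stats

def vuong_test(ll_model1, ll_model2):
    """Uncorrected Vuong z-statistic from pointwise log-likelihoods.

    Positive z favors model 1; negative z favors model 2.
    """
    m = np.asarray(ll_model1) - np.asarray(ll_model2)
    z = np.sqrt(len(m)) * m.mean() / m.std(ddof=1)
    p_two_sided = 2 * stats.norm.sf(abs(z))
    return z, p_two_sided

# Simulated zero-inflated Poisson data (hypothetical parameters)
rng = np.random.default_rng(1)
n, lam, pi = 5_000, 3.0, 0.3
structural_zero = rng.random(n) < pi
y_sim = np.where(structural_zero, 0, rng.poisson(lam, size=n))

# Pointwise log-likelihoods: ZIP at the simulated parameters,
# Poisson at its MLE (the sample mean)
zip_pmf = pi * (y_sim == 0) + (1 - pi) * stats.poisson.pmf(y_sim, lam)
ll_zip = np.log(zip_pmf)
ll_pois = stats.poisson.logpmf(y_sim, y_sim.mean())

z_stat, p_value = vuong_test(ll_zip, ll_pois)
print(f"Vuong z = {z_stat:.2f}, two-sided p = {p_value:.4g}")
```

In applied work both models would be fitted by maximum likelihood first, and a correction for the difference in parameter counts is often applied to the statistic.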

In [None]:
# 4.1 Overdispersion Test (Poisson vs NB)
print('4.1 Overdispersion Test')
print('=' * 60)

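# Cameron-Trivedi-style auxiliary regression (unweighted OLS with intercept):
# under NB2, E[(y - mu)^2 - y] = alpha * mu^2, so the slope on mu^2 estimates alpha
mu_pois = np.exp(X_pooled @ pois_pool_result.params)
aux_y = ((y - mu_pois) ** 2 - y)
aux_x = mu_pois ** 2
slope, intercept, r_val, _, std_err = stats.linregress(aux_x, aux_y)
t_stat_ct = slope / std_err
p_ct = stats.norm.sf(t_stat_ct)  # one-sided p-value for H1: alpha > 0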

print('Cameron-Trivedi Overdispersion Test:')
print(f'  H0: alpha = 0 (Poisson adequate)')
print(f'  H1: alpha > 0 (Overdispersion)')
print(f'  Alpha estimate: {slope:.4f}')
print(f'  t-statistic:    {t_stat_ct:.4f}')
print(f'  p-value:        {p_ct:.6f}')
print(f'  => {"Reject H0: overdispersion detected" if p_ct < 0.05 else "Cannot reject H0"}')

# LR test (Poisson vs NB)
print('\nLR Test (Poisson vs NB):')
lr_stat = 2 * (model_stats['Pooled NB']['llf'] - model_stats['Pooled Poisson']['llf'])
lr_pval = 0.5 * stats.chi2.sf(lr_stat, 1)  # alpha = 0 is on the boundary, so halve the chi2(1) p-value
print(f'  LR statistic:   {lr_stat:.2f}')
print(f'  p-value:        {lr_pval:.6f}')
print(f'  => {"Strong evidence for NB over Poisson" if lr_pval < 0.01 else "Poisson may be adequate"}')

In [None]:
# 4.2 Vuong Tests
print('4.2 Vuong Tests')
print('=' * 60)

# ZIP vs Poisson
if hasattr(zip_result, 'vuong_stat') and zip_result.vuong_stat is not None:
    print(f'ZIP vs Poisson:')
    print(f'  Vuong statistic: {zip_result.vuong_stat:.4f}')
    print(f'  p-value:         {zip_result.vuong_pvalue:.6f}')
    vuong_conclusion = 'ZIP preferred' if zip_result.vuong_pvalue < 0.05 and zip_result.vuong_stat > 0 else 'Standard model may be adequate'
    print(f'  => {vuong_conclusion}')
else:
    # Compute comparable Poisson AIC for same specification
    pois_comp = PooledPoisson(endog=y, exog=X_count)
    pois_comp_result = pois_comp.fit(se_type='robust')
    _, pois_comp_aic, _ = get_model_stats(pois_comp_result, model=pois_comp, n=n_obs)
    print('ZIP vs Poisson (AIC comparison):')
    print(f'  ZIP AIC:     {model_stats["ZIP"]["aic"]:.2f}')
    print(f'  Poisson AIC: {pois_comp_aic:.2f}')
    print(f'  => {"ZIP preferred" if model_stats["ZIP"]["aic"] < pois_comp_aic else "Poisson preferred"}')

print(f'\nZINB vs NB (AIC comparison):')
print(f'  ZINB AIC: {model_stats["ZINB"]["aic"]:.2f}')
print(f'  NB AIC:   {model_stats["Pooled NB"]["aic"]:.2f}')
print(f'  => {"ZINB preferred" if model_stats["ZINB"]["aic"] < model_stats["Pooled NB"]["aic"] else "NB preferred"}')

In [None]:
# 4.3 Information Criteria Comparison
print('4.3 Information Criteria Comparison')
print('=' * 60)

comparison = pd.DataFrame({
    'Model': list(model_stats.keys()),
    'Log-Lik': [s['llf'] for s in model_stats.values()],
    'AIC': [s['aic'] for s in model_stats.values()],
    'BIC': [s['bic'] for s in model_stats.values()],
})
comparison = comparison.sort_values('AIC')

display(comparison)
best_aic = comparison.dropna(subset=['AIC']).iloc[0]['Model']
best_bic = comparison.dropna(subset=['BIC']).sort_values('BIC').iloc[0]['Model']
print(f'\nBest model by AIC: {best_aic}')
print(f'Best model by BIC: {best_bic}')

comparison.to_csv(TABLES_DIR / 'table_09_aic_bic_comparison.csv', index=False)

In [None]:
# Figure: AIC/BIC bar plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

comp_plot = comparison.dropna(subset=['AIC', 'BIC']).sort_values('AIC')
x_pos = np.arange(len(comp_plot))

# AIC
bars_aic = axes[0].bar(x_pos, comp_plot['AIC'].values, alpha=0.7,
                       color='steelblue', edgecolor='black')
bars_aic[0].set_color('green')
bars_aic[0].set_alpha(0.9)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(comp_plot['Model'].values, rotation=45, ha='right')
axes[0].set_ylabel('AIC (lower is better)', fontsize=12)
axes[0].set_title('AIC Comparison', fontsize=13, fontweight='bold')
axes[0].grid(alpha=0.3, axis='y')

# BIC
comp_bic = comp_plot.sort_values('BIC')
bars_bic = axes[1].bar(x_pos, comp_bic['BIC'].values, alpha=0.7,
                       color='coral', edgecolor='black')
bars_bic[0].set_color('green')
bars_bic[0].set_alpha(0.9)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(comp_bic['Model'].values, rotation=45, ha='right')
axes[1].set_ylabel('BIC (lower is better)', fontsize=12)
axes[1].set_title('BIC Comparison', fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'aic_bic_barplot.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# 4.4 Model Fit: Rootogram comparison
max_count_plot = min(int(df['patents'].quantile(0.95)), 20)
count_range = np.arange(0, max_count_plot + 1)
observed_freq = np.array([(y == k).sum() for k in count_range])

# Poisson predicted
mu_pois_pred = np.exp(X_pooled @ pois_pool_result.params)
pois_pred_freq = np.array([np.mean(stats.poisson.pmf(k, mu_pois_pred)) * len(y) for k in count_range])

# NB predicted
mu_nb_pred = np.exp(X_pooled @ nb_pool_result.params_exog)
alpha_nb = nb_pool_result.alpha
r_nb = 1 / alpha_nb
p_nb = r_nb / (r_nb + mu_nb_pred)
nb_pred_freq = np.array([np.mean(stats.nbinom.pmf(k, r_nb, p_nb)) * len(y) for k in count_range])

# ZIP predicted
lambda_zip = np.exp(X_count @ zip_result.params_count)
pi_zip = 1 / (1 + np.exp(-(X_inflate @ zip_result.params_inflate)))
zip_pred_freq = np.array([
    np.mean(pi_zip * (k == 0) + (1 - pi_zip) * stats.poisson.pmf(k, lambda_zip)) * len(y)
    for k in count_range
])

# ZINB predicted
lambda_zinb = np.exp(X_count @ zinb_result.params_count)
pi_zinb = 1 / (1 + np.exp(-(X_inflate @ zinb_result.params_inflate)))
alpha_zinb = zinb_result.alpha
r_zinb = 1 / alpha_zinb
p_zinb = r_zinb / (r_zinb + lambda_zinb)
zinb_pred_freq = np.array([
    np.mean(pi_zinb * (k == 0) + (1 - pi_zinb) * stats.nbinom.pmf(k, r_zinb, p_zinb)) * len(y)
    for k in count_range
])

# Figure: Rootogram comparison (4 panels)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
models_pred = [
    ('Pooled Poisson', pois_pred_freq),
    ('Pooled NB', nb_pred_freq),
    ('ZIP', zip_pred_freq),
    ('ZINB', zinb_pred_freq),
]

for ax, (name, pred_freq) in zip(axes.flat, models_pred):
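    obs_sqrt = np.sqrt(observed_freq)
    exp_sqrt = np.sqrt(pred_freq)
    residual = obs_sqrt - exp_sqrt
    # Red: the model overpredicts this count; blue: the model underpredicts it
    colors = ['red' if r < -0.5 else ('blue' if r > 0.5 else 'gray') for r in residual]
    # Hanging rootogram: bars of height sqrt(observed) hang from the sqrt(expected) curve,
    # so each bar spans [sqrt(expected) - sqrt(observed), sqrt(expected)]
    ax.bar(count_range, obs_sqrt, bottom=exp_sqrt - obs_sqrt, color=colors, alpha=0.6, edgecolor='black')
    ax.plot(count_range, exp_sqrt, 'o-', color='darkred', linewidth=2, markersize=4)
    # A bar ending below the zero line signals underprediction; above it, overprediction
    ax.axhline(0, color='black', linestyle='--', linewidth=1)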
    ax.set_title(name, fontsize=13, fontweight='bold')
    ax.set_xlabel('Count')
    ax.set_ylabel('Sqrt(Frequency)')
    ax.grid(alpha=0.3)

plt.suptitle('Hanging Rootograms: Model Fit Comparison', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'rootogram_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Save test statistics
test_stats = pd.DataFrame([
    {'Test': 'Cameron-Trivedi (Overdispersion)', 'Statistic': t_stat_ct, 'p-value': p_ct,
     'Conclusion': 'Overdispersion detected' if p_ct < 0.05 else 'No overdispersion'},
    {'Test': 'LR Test (Poisson vs NB)', 'Statistic': lr_stat, 'p-value': lr_pval,
     'Conclusion': 'NB preferred' if lr_pval < 0.05 else 'Poisson adequate'},
    {'Test': 'AIC: ZINB vs NB',
     'Statistic': model_stats['Pooled NB']['aic'] - model_stats['ZINB']['aic'],
     'p-value': np.nan,
     'Conclusion': 'ZINB preferred' if model_stats['ZINB']['aic'] < model_stats['Pooled NB']['aic'] else 'NB preferred'},
])
test_stats.to_csv(TABLES_DIR / 'table_08_test_statistics.csv', index=False)
display(test_stats)

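# Report the selection evidence from the computed statistics instead of hard-coding it
zero_share_obs = (y == 0).mean()
zero_share_pois = np.exp(-mu_pois_pred).mean()  # average Poisson P(y = 0)
zinb_preferred = model_stats['ZINB']['aic'] < model_stats['Pooled NB']['aic']

print('\n' + '=' * 60)
print(f"MODEL SELECTION DECISION: {'ZINB' if zinb_preferred else 'Pooled NB'} has the lower AIC")
print(f"  - Overdispersion: LR test p-value = {lr_pval:.4g} "
      f"({'NB preferred' if lr_pval < 0.05 else 'Poisson adequate'})")
print(f'  - Zeros: {zero_share_obs:.1%} observed vs {zero_share_pois:.1%} predicted by pooled Poisson')
print('  - ZINB models overdispersion and excess zeros jointly')
print('=' * 60)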

---

## Section 5: Interpreting the Preferred Model (ZINB) (20 min)

The ZINB model has two components:
1. **Count component**: Determines patent production among potential innovators
2. **Inflation component**: Determines the probability of being a structural non-innovator

We interpret both components and compute Incidence Rate Ratios (IRRs).
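
As a minimal sketch of the IRR calculation, assuming a hypothetical coefficient and standard error (`beta` and `se` below are illustrative values, not estimates from this dataset): the IRR is the exponentiated coefficient, and a 95% interval is obtained by exponentiating the endpoints of the Wald interval on the coefficient scale.

```python
import numpy as np
from scipy import stats

# Hypothetical values for illustration only (not estimates from this notebook)
beta, se = 0.08, 0.02

z_crit = stats.norm.ppf(0.975)        # two-sided 95% critical value
irr = np.exp(beta)                    # multiplicative effect on the expected count
ci_low = np.exp(beta - z_crit * se)   # exponentiate the Wald interval endpoints
ci_high = np.exp(beta + z_crit * se)
pct_change = (irr - 1) * 100          # percent change per one-unit increase in x

print(f'IRR = {irr:.4f}, 95% CI [{ci_low:.4f}, {ci_high:.4f}], {pct_change:.1f}% change')
```

Because the interval is built on the log scale and then exponentiated, it is asymmetric around the IRR and can never cross zero.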

In [None]:
# 5.1 Count Component Results
print('5.1 ZINB Count Component (Patent Production Among Innovators)')
print('=' * 70)

count_coefs = zinb_result.params_count
count_se = zinb_result.bse_count
count_z = count_coefs / count_se
count_pvals = 2 * (1 - stats.norm.cdf(np.abs(count_z)))
count_irr = np.exp(count_coefs)

count_results = pd.DataFrame({
    'Variable': count_names,
    'Coef': count_coefs,
    'SE': count_se,
    'z': count_z,
    'p-value': count_pvals,
    'IRR': count_irr,
    '% Change': (count_irr - 1) * 100,
})

display(count_results)
count_results.to_csv(TABLES_DIR / 'table_10_zinb_count_results.csv', index=False)

rd_idx = count_names.index('rd_intensity')
size_idx = count_names.index('firm_size')
cap_idx = count_names.index('capital_intensity')

print(f'\nKey Interpretations (Count Component):')
print(f'  R&D intensity: IRR = {count_irr[rd_idx]:.4f}')
print(f'    => 1 p.p. increase in R&D/sales => {(count_irr[rd_idx]-1)*100:.1f}% more patents')
print(f'  Firm size (log): IRR = {count_irr[size_idx]:.4f}')
print(f'    => 1 unit increase in log(employees) => {(count_irr[size_idx]-1)*100:.1f}% more patents')
print(f'  Capital intensity: IRR = {count_irr[cap_idx]:.4f}')
print(f'    => 1 unit increase => {(count_irr[cap_idx]-1)*100:.2f}% more patents')

In [None]:
# 5.2 Inflation Component Results
print('5.2 ZINB Inflation Component (Non-Innovator Probability)')
print('=' * 70)

inflate_coefs = zinb_result.params_inflate
inflate_se = zinb_result.bse_inflate
inflate_z = inflate_coefs / inflate_se
inflate_pvals = 2 * (1 - stats.norm.cdf(np.abs(inflate_z)))
inflate_or = np.exp(inflate_coefs)

inflate_results = pd.DataFrame({
    'Variable': inflate_names,
    'Coef': inflate_coefs,
    'SE': inflate_se,
    'z': inflate_z,
    'p-value': inflate_pvals,
    'Odds Ratio': inflate_or,
})

display(inflate_results)
inflate_results.to_csv(TABLES_DIR / 'table_11_zinb_inflate_results.csv', index=False)

pi_hat = 1 / (1 + np.exp(-(X_inflate @ inflate_coefs)))
print(f'\nEstimated % structural non-innovators: {pi_hat.mean():.1%}')
print(f'  Range: {pi_hat.min():.1%} - {pi_hat.max():.1%}')

In [None]:
# 5.3 Dispersion Parameter
print('5.3 Dispersion Parameter')
print('=' * 50)
print(f'Alpha: {zinb_result.alpha:.4f}')
print(f'SE(alpha): {zinb_result.bse_alpha:.4f}')
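print('\nInterpretation:')
print('  An alpha clearly above 0 (relative to its SE) indicates overdispersion in the')
print('  count component beyond what the zero-inflation part accounts for.')
print('  Conditional variance of the count component = mu + alpha * mu^2')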
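
The rootogram and fitted-distribution cells convert the NB2 parameters to SciPy's `nbinom(n, p)` form via `n = 1/alpha` and `p = n / (n + mu)`. The sketch below checks that this mapping reproduces the NB2 mean and variance; `mu` and `alpha` are hypothetical values chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical mean and dispersion (illustrative, not estimates from this notebook)
mu, alpha = 3.0, 0.8

# NB2 -> SciPy parametrization
n = 1.0 / alpha
p = n / (n + mu)

mean_scipy, var_scipy = stats.nbinom.stats(n, p, moments='mv')
var_nb2 = mu + alpha * mu**2   # NB2 conditional variance

print(f'mean = {float(mean_scipy):.4f} (target {mu})')
print(f'var  = {float(var_scipy):.4f} (NB2 formula {var_nb2:.4f})')
```

If the two variances disagreed, the predicted frequencies in the rootograms would come from a different distribution than the one that was estimated.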

In [None]:
# Figure: Forest plot of IRRs (count component)
fig, ax = plt.subplots(figsize=(10, 8))

key_vars_idx = [i for i, n in enumerate(count_names) if n in X_vars or n.startswith('year_')]
key_vars_names = [count_names[i] for i in key_vars_idx]
key_irrs = count_irr[key_vars_idx]
key_ci_low = np.exp(count_coefs[key_vars_idx] - 1.96 * count_se[key_vars_idx])
key_ci_high = np.exp(count_coefs[key_vars_idx] + 1.96 * count_se[key_vars_idx])

y_pos = np.arange(len(key_vars_names))
ax.errorbar(key_irrs, y_pos,
            xerr=[key_irrs - key_ci_low, key_ci_high - key_irrs],
            fmt='o', markersize=8, capsize=5, capthick=2, color='steelblue')
ax.axvline(1, color='red', linestyle='--', linewidth=2, label='IRR = 1 (no effect)')
ax.set_yticks(y_pos)
ax.set_yticklabels(key_vars_names)
ax.set_xlabel('Incidence Rate Ratio (IRR)', fontsize=12)
ax.set_title('ZINB Count Component: IRRs with 95% CIs', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'zinb_irr_plot.png', dpi=300, bbox_inches='tight')
plt.show()

---

## Section 6: Marginal Effects and Policy Implications (20 min)

### Goals
1. Compute AME for R&D intensity (key policy variable)
2. Evaluate effect heterogeneity by firm size
3. Quantify policy impacts via counterfactual simulation
4. Decompose into extensive vs intensive margins
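
When a covariate enters both equations, the marginal effect on E[y] = (1 - pi) * lambda has a count term and an inflation term. The sketch below, which assumes a single covariate and hypothetical parameters (`beta0`, `beta1`, `gamma0`, `gamma1` are not estimates from this notebook), checks the analytic derivative against a finite-difference approximation.

```python
import numpy as np

def zi_mean(x, beta0, beta1, gamma0, gamma1):
    """E[y] = (1 - pi) * lambda with one covariate x in both equations."""
    lam = np.exp(beta0 + beta1 * x)
    pi = 1.0 / (1.0 + np.exp(-(gamma0 + gamma1 * x)))
    return (1.0 - pi) * lam

# Hypothetical parameters for illustration
beta0, beta1 = 0.5, 0.3      # count equation
gamma0, gamma1 = 0.2, -0.6   # inflation equation (logit of structural zero)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

lam = np.exp(beta0 + beta1 * x)
pi = 1.0 / (1.0 + np.exp(-(gamma0 + gamma1 * x)))

# Count-only formula, valid when x is absent from the inflation equation
me_count_only = (1 - pi) * beta1 * lam
# Full formula: subtract the effect of x on the structural-zero probability
me_full = me_count_only - pi * (1 - pi) * gamma1 * lam

# Central finite difference as an independent check
h = 1e-5
me_numeric = (zi_mean(x + h, beta0, beta1, gamma0, gamma1)
              - zi_mean(x - h, beta0, beta1, gamma0, gamma1)) / (2 * h)

print(f'AME (count-only formula): {me_count_only.mean():.4f}')
print(f'AME (full formula):       {me_full.mean():.4f}')
print(f'AME (finite difference):  {me_numeric.mean():.4f}')
```

With a negative inflation coefficient, the count-only formula understates the effect, because it ignores the drop in the structural-zero probability.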

In [None]:
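# 6.1 Average Marginal Effects for ZINB
# For ZINB: E[y] = (1 - pi) * lambda
# Variable only in the count part:  dE[y]/dx_k = (1 - pi) * beta_k * lambda
# Variable also in the inflate part: subtract pi * (1 - pi) * gamma_k * lambda
# Year dummies are treated as continuous here (derivative, not discrete change).
print('6.1 Average Marginal Effects (ZINB)')
print('=' * 60)

lambda_zinb_pred = np.exp(X_count @ zinb_result.params_count)
pi_zinb_pred = 1 / (1 + np.exp(-(X_inflate @ zinb_result.params_inflate)))

ame_zinb_data = {}
for i, var_name in enumerate(count_names):
    if var_name == 'const':
        continue
    beta_k = zinb_result.params_count[i]
    me_k = (1 - pi_zinb_pred) * beta_k * lambda_zinb_pred
    # Variables that also enter the inflation equation (named '<var>_z' there,
    # e.g. firm_size) change P(non-innovator) as well, so add that term
    if f'{var_name}_z' in inflate_names:
        gamma_k = zinb_result.params_inflate[inflate_names.index(f'{var_name}_z')]
        me_k = me_k - pi_zinb_pred * (1 - pi_zinb_pred) * gamma_k * lambda_zinb_pred
    ame_zinb_data[var_name] = {
        'AME': me_k.mean(),
        # Caution: this is the standard error of the mean of the observation-level
        # effects. It ignores estimation uncertainty in the parameters, so treat the
        # z, p-value, and CI columns as descriptive; use the delta method or a
        # bootstrap for inference.
        'SE': me_k.std() / np.sqrt(len(me_k)),
        'Min ME': me_k.min(),
        'Max ME': me_k.max(),
    }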

ame_zinb_df = pd.DataFrame(ame_zinb_data).T
ame_zinb_df['z'] = ame_zinb_df['AME'] / ame_zinb_df['SE']
ame_zinb_df['p-value'] = 2 * (1 - stats.norm.cdf(np.abs(ame_zinb_df['z'])))
ame_zinb_df['CI Lower'] = ame_zinb_df['AME'] - 1.96 * ame_zinb_df['SE']
ame_zinb_df['CI Upper'] = ame_zinb_df['AME'] + 1.96 * ame_zinb_df['SE']

display(ame_zinb_df[['AME', 'SE', 'z', 'p-value', 'CI Lower', 'CI Upper']])
ame_zinb_df.to_csv(TABLES_DIR / 'table_12_ame_zinb.csv')

rd_ame = ame_zinb_data['rd_intensity']['AME']
print(f'\nKey Result: AME of R&D intensity = {rd_ame:.4f}')
print(f'  => 1 p.p. increase in R&D/sales increases expected patents by {rd_ame:.2f}')
print(f'  Accounting for both the count and zero-inflation components.')
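
For inference on an AME, the usual approach is the delta method, which propagates the coefficient covariance matrix through the gradient of the AME. The sketch below illustrates it for a plain exponential-mean model with synthetic regressors; `beta` and `V` are hypothetical values rather than output from any estimator in this notebook, and a full ZINB version would also need the inflation parameters and their covariances in the gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])

# Hypothetical estimates and covariance matrix (illustrative values only)
beta = np.array([0.2, 0.4, -0.1])
V = np.diag([0.010, 0.004, 0.006])

def ame(b, k):
    """AME of column k under an exponential mean: mean(b_k * exp(X b))."""
    return np.mean(b[k] * np.exp(X @ b))

def ame_gradient(b, k):
    """Analytic gradient of the AME with respect to the coefficient vector."""
    mu = np.exp(X @ b)
    grad = b[k] * (X * mu[:, None]).mean(axis=0)  # through mu's dependence on b
    grad[k] += mu.mean()                          # direct dependence on b_k
    return grad

k = 1
g = ame_gradient(beta, k)
se_delta = np.sqrt(g @ V @ g)   # delta-method standard error

# For contrast: dispersion of observation-level effects divided by sqrt(n)
me_i = beta[k] * np.exp(X @ beta)
se_descriptive = me_i.std() / np.sqrt(n)

print(f'AME = {ame(beta, k):.4f}')
print(f'Delta-method SE:           {se_delta:.4f}')
print(f'SE of mean of obs effects: {se_descriptive:.4f}')
```

The two numbers answer different questions: the first reflects uncertainty in the estimated coefficients, the second only how much the effect varies across observations.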

In [None]:
# 6.2 Marginal Effects by Firm Size
print('6.2 Marginal Effects by Firm Size')
print('=' * 60)

df['size_quartile'] = pd.qcut(df['firm_size'], 4, labels=['Small', 'Medium-Small', 'Medium-Large', 'Large'])

me_by_size = []
rd_idx_count = count_names.index('rd_intensity')
beta_rd = zinb_result.params_count[rd_idx_count]

for q in ['Small', 'Medium-Small', 'Medium-Large', 'Large']:
    mask = (df['size_quartile'] == q).values
    me_rd = (1 - pi_zinb_pred[mask]) * beta_rd * lambda_zinb_pred[mask]
    me_by_size.append({
        'Size Group': q,
        'N': mask.sum(),
        'Mean E[y]': ((1 - pi_zinb_pred[mask]) * lambda_zinb_pred[mask]).mean(),
        'AME(R&D)': me_rd.mean(),
        'SE': me_rd.std() / np.sqrt(mask.sum()),
        'Mean P(non-innovator)': pi_zinb_pred[mask].mean(),
    })

me_size_df = pd.DataFrame(me_by_size)
me_size_df['CI Lower'] = me_size_df['AME(R&D)'] - 1.96 * me_size_df['SE']
me_size_df['CI Upper'] = me_size_df['AME(R&D)'] + 1.96 * me_size_df['SE']

display(me_size_df)
me_size_df.to_csv(TABLES_DIR / 'table_13_me_by_firm_size.csv', index=False)

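ame_large = me_size_df.loc[me_size_df['Size Group'] == 'Large', 'AME(R&D)'].iloc[0]
ame_small = me_size_df.loc[me_size_df['Size Group'] == 'Small', 'AME(R&D)'].iloc[0]
print(f'\nLarge firms: AME = {ame_large:.3f}')
print(f'Small firms: AME = {ame_small:.3f}')
# State the comparison from the estimates instead of assuming its direction
comparison = 'larger' if abs(ame_large) > abs(ame_small) else 'smaller'
print(f'=> Large firms show a {comparison} absolute response to R&D intensity than small firms.')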

In [None]:
# Figure: ME heterogeneity
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: AME by firm size
x_pos = np.arange(len(me_size_df))
bars = axes[0].bar(x_pos, me_size_df['AME(R&D)'],
                   yerr=1.96 * me_size_df['SE'], capsize=5,
                   alpha=0.7, color=['#4393C3', '#92C5DE', '#F4A582', '#D6604D'],
                   edgecolor='black')
axes[0].axhline(ame_zinb_data['rd_intensity']['AME'], color='red', linestyle='--',
                linewidth=2, label=f'Overall AME = {ame_zinb_data["rd_intensity"]["AME"]:.3f}')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(me_size_df['Size Group'])
axes[0].set_ylabel('AME of R&D Intensity', fontsize=12)
axes[0].set_title('R&D Effect by Firm Size\n(Larger firms: greater absolute impact)', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(alpha=0.3, axis='y')

# Panel 2: P(non-innovator) by firm size
axes[1].bar(x_pos, me_size_df['Mean P(non-innovator)'] * 100,
            alpha=0.7, color='coral', edgecolor='black')
axes[1].axhline(pi_zinb_pred.mean() * 100, color='red', linestyle='--',
                linewidth=2, label=f'Overall = {pi_zinb_pred.mean():.0%}')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(me_size_df['Size Group'])
axes[1].set_ylabel('P(Non-Innovator) %', fontsize=12)
axes[1].set_title('Non-Innovator Probability by Size\n(Smaller firms more likely non-innovators)', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'me_heterogeneity_plot.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# 6.3 Policy Simulation: R&D Tax Credit
print('6.3 Policy Simulation: R&D Tax Credit (+2 p.p. R&D intensity)')
print('=' * 60)

# Baseline predictions
E_y_baseline = (1 - pi_zinb_pred) * lambda_zinb_pred

# Counterfactual: increase R&D by 2 p.p.
X_count_cf = X_count.copy()
rd_col_idx = count_names.index('rd_intensity')
X_count_cf[:, rd_col_idx] += 2

lambda_cf = np.exp(X_count_cf @ zinb_result.params_count)
E_y_cf = (1 - pi_zinb_pred) * lambda_cf

impact = E_y_cf - E_y_baseline

print(f'Baseline mean patents: {E_y_baseline.mean():.2f}')
print(f'Counterfactual mean:   {E_y_cf.mean():.2f}')
print(f'\nPolicy Impact:')
print(f'  Mean increase:   {impact.mean():.2f} patents per firm-year')
print(f'  Total increase:  {impact.sum():.0f} patents across all firm-years')
print(f'  % increase:      {impact.mean() / E_y_baseline.mean() * 100:.1f}%')

print(f'\nImpact by Firm Size:')
for q in ['Small', 'Medium-Small', 'Medium-Large', 'Large']:
    mask = (df['size_quartile'] == q).values
    print(f'  {q:15s}: +{impact[mask].mean():.2f} patents (from {E_y_baseline[mask].mean():.2f})')

policy_sim = pd.DataFrame({
    'Metric': ['Baseline mean', 'Counterfactual mean', 'Mean impact', 'Total impact', '% increase'],
    'Value': [E_y_baseline.mean(), E_y_cf.mean(), impact.mean(),
              impact.sum(), impact.mean() / E_y_baseline.mean() * 100]
})
policy_sim.to_csv(TABLES_DIR / 'table_14_policy_simulation.csv', index=False)
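
Because R&D intensity enters only the count equation, the counterfactual scales every firm-year's expected count by the same factor, exp(beta * delta). The aggregate percentage increase therefore has a closed form that can serve as a sanity check on the simulation. The sketch below uses hypothetical inputs (`beta_rd`, `lam_base`, and `pi` are simulated, not taken from the fitted model).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

beta_rd, delta = 0.08, 2.0                        # hypothetical coefficient and policy shift
lam_base = np.exp(rng.normal(0.5, 0.7, size=n))   # hypothetical baseline lambda
pi = rng.uniform(0.1, 0.6, size=n)                # hypothetical P(structural zero)

ey_base = (1 - pi) * lam_base
ey_cf = (1 - pi) * lam_base * np.exp(beta_rd * delta)   # shift x by delta in the count part

pct_simulated = (ey_cf.mean() / ey_base.mean() - 1) * 100
pct_closed_form = (np.exp(beta_rd * delta) - 1) * 100

print(f'Simulated % increase:   {pct_simulated:.2f}')
print(f'Closed-form % increase: {pct_closed_form:.2f}')
```

The absolute gains still differ across firms because they are proportional to each firm-year's baseline expectation; only the relative gain is constant.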

In [None]:
# 6.4 Extensive vs Intensive Margin Decomposition
print('6.4 Extensive vs Intensive Margin')
print('=' * 60)

E_y_given_innov_base = lambda_zinb_pred.mean()
E_y_given_innov_cf = lambda_cf.mean()
intensive = E_y_given_innov_cf - E_y_given_innov_base

print('With R&D only in the count equation:')
print('  Extensive margin (change in P(innovator)): 0 by construction (R&D is not in the inflate model)')
print(f'  Intensive margin (more patents | potential innovator): +{intensive:.2f}')
print('\nIn this specification the R&D effect operates entirely through the intensive margin:')
print('  More R&D => more patents among firms that are already potential innovators.')
print('  This is a modelling restriction, not a finding: testing for an extensive-margin')
print('  effect would require adding R&D to the inflation equation.')
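
For a covariate that appears in both equations, such as firm size, the change in E[y] can be split into an intensive and an extensive part. The sketch below shows one such decomposition under hypothetical parameters and simulated data; the split is exact but not unique, since it depends on which component is shifted first.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(5.0, 1.0, size=n)   # hypothetical covariate (e.g. log firm size)

# Hypothetical parameters for illustration
beta0, beta1 = -1.0, 0.35          # count equation
gamma0, gamma1 = 2.0, -0.5         # inflation equation
delta = 0.5                        # size of the counterfactual shift in x

def components(xv):
    """Return (pi, lambda) at covariate values xv."""
    lam = np.exp(beta0 + beta1 * xv)
    pi = 1.0 / (1.0 + np.exp(-(gamma0 + gamma1 * xv)))
    return pi, lam

pi0, lam0 = components(x)
pi1, lam1 = components(x + delta)

total = ((1 - pi1) * lam1 - (1 - pi0) * lam0).mean()
# Intensive: change in lambda, holding participation at its baseline level
intensive = ((1 - pi0) * (lam1 - lam0)).mean()
# Extensive: change in participation, valued at the new lambda
extensive = ((pi0 - pi1) * lam1).mean()

print(f'Total change:     {total:.4f}')
print(f'Intensive margin: {intensive:.4f}')
print(f'Extensive margin: {extensive:.4f}')
```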

In [None]:
# Figure: Policy simulation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Baseline vs counterfactual
max_val_plot = 20
bins = np.arange(-0.5, max_val_plot + 1.5, 1)
axes[0].hist(E_y_baseline, bins=bins, alpha=0.5, color='steelblue',
             label=f'Baseline (mean={E_y_baseline.mean():.2f})', edgecolor='black', density=True)
axes[0].hist(E_y_cf, bins=bins, alpha=0.5, color='coral',
             label=f'With Tax Credit (mean={E_y_cf.mean():.2f})', edgecolor='black', density=True)
axes[0].set_xlabel('Expected Patents', fontsize=12)
axes[0].set_ylabel('Density', fontsize=12)
axes[0].set_title('Predicted Patent Distribution\nBaseline vs R&D Tax Credit', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(alpha=0.3)

# Panel 2: Impact by firm size
impact_by_size = []
for q in ['Small', 'Medium-Small', 'Medium-Large', 'Large']:
    mask = (df['size_quartile'] == q).values
    impact_by_size.append(impact[mask].mean())

x_pos = np.arange(4)
axes[1].bar(x_pos, impact_by_size, alpha=0.7,
            color=['#4393C3', '#92C5DE', '#F4A582', '#D6604D'],
            edgecolor='black')
axes[1].axhline(impact.mean(), color='red', linestyle='--', linewidth=2,
                label=f'Overall: +{impact.mean():.2f}')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(['Small', 'Med-Small', 'Med-Large', 'Large'])
axes[1].set_ylabel('Additional Patents per Firm-Year', fontsize=12)
axes[1].set_title('Policy Impact by Firm Size\n(+2 p.p. R&D intensity)', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'policy_impact_plot.png', dpi=300, bbox_inches='tight')
plt.show()

---

## Section 7: Robustness and Sensitivity Analysis (15 min)

### Checks
1. **Alternative specifications**: lagged R&D, nonlinear R&D, interactions
2. **Sample sensitivity**: excluding outliers, time periods
3. **Result stability across specifications**
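
One caveat when comparing specifications: once a squared term or an interaction is added, the coefficient on `rd_intensity` alone is the log-scale slope at zero (zero R&D, or zero firm size), so it is not directly comparable to the baseline coefficient. A comparable quantity is the slope evaluated at a reference point such as the sample mean. The sketch below uses hypothetical coefficients and simulated covariates to show the calculation.

```python
import numpy as np

rng = np.random.default_rng(4)
rd = rng.uniform(0, 10, size=500)    # hypothetical R&D intensity
size = rng.normal(5, 1, size=500)    # hypothetical log firm size

# Hypothetical coefficients (not estimates from this notebook)
b_rd_nl, b_rd_sq = 0.12, -0.004      # quadratic specification
b_rd_int, b_rdxsize = 0.03, 0.010    # interaction specification

# Log-scale slope with respect to R&D, evaluated at sample means
slope_nl = b_rd_nl + 2 * b_rd_sq * rd.mean()
slope_int = b_rd_int + b_rdxsize * size.mean()

print(f'Quadratic spec:   slope at mean R&D  = {slope_nl:.4f} (raw coef {b_rd_nl})')
print(f'Interaction spec: slope at mean size = {slope_int:.4f} (raw coef {b_rd_int})')
```

Standard errors for these combined slopes require the covariance between the two coefficients involved, not just their individual standard errors.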

In [None]:
# 7.1 Alternative Specifications
print('7.1 Robustness: Alternative Specifications')
print('=' * 60)

robustness_results = {}

# Baseline
rd_idx_zinb = count_names.index('rd_intensity')
robustness_results['Baseline'] = {
    'coef_rd': zinb_result.params_count[rd_idx_zinb],
    'se_rd': zinb_result.bse_count[rd_idx_zinb],
}

# Spec 1: Lagged R&D
df_sorted = df.sort_values(['firm_id', 'year'])
df_sorted['rd_lag'] = df_sorted.groupby('firm_id')['rd_intensity'].shift(1)
df_lag = df_sorted.dropna(subset=['rd_lag']).copy()

y_lag = df_lag['patents'].values
X_lag_vars = ['rd_lag', 'firm_size', 'capital_intensity']
year_dum_lag = pd.get_dummies(df_lag['year'], prefix='year', drop_first=True, dtype=float)
X_count_lag = sm.add_constant(pd.concat([df_lag[X_lag_vars], year_dum_lag], axis=1).values)
count_names_lag = ['const'] + X_lag_vars + list(year_dum_lag.columns)

ind_dum_lag = pd.get_dummies(df_lag['industry'], prefix='ind', drop_first=True, dtype=float)
X_inflate_lag = sm.add_constant(pd.concat([ind_dum_lag, df_lag[['firm_size']]], axis=1).values)
inflate_names_lag = ['const_z'] + [f'{n}_z' for n in ind_dum_lag.columns] + ['firm_size_z']

zinb_lag = ZeroInflatedNegativeBinomial(
    endog=y_lag, exog_count=X_count_lag, exog_inflate=X_inflate_lag,
    exog_count_names=count_names_lag, exog_inflate_names=inflate_names_lag,
)
zinb_lag_result = zinb_lag.fit()

rd_lag_idx = count_names_lag.index('rd_lag')
robustness_results['Lagged R&D'] = {
    'coef_rd': zinb_lag_result.params_count[rd_lag_idx],
    'se_rd': zinb_lag_result.bse_count[rd_lag_idx],
}
print(f'  Lagged R&D: coef = {zinb_lag_result.params_count[rd_lag_idx]:.4f}')

lag_table = pd.DataFrame({
    'Variable': count_names_lag,
    'Coef': zinb_lag_result.params_count,
    'SE': zinb_lag_result.bse_count,
})
lag_table.to_csv(TABLES_DIR / 'table_15_robustness_lag.csv', index=False)

In [None]:
# Spec 2: Add R&D squared (nonlinearity)
df['rd_sq'] = df['rd_intensity'] ** 2

X_vars_nl = ['rd_intensity', 'rd_sq', 'firm_size', 'capital_intensity']
X_count_nl = sm.add_constant(pd.concat([df[X_vars_nl], year_dummies], axis=1).values)
count_names_nl = ['const'] + X_vars_nl + year_dummy_names

zinb_nl = ZeroInflatedNegativeBinomial(
    endog=y, exog_count=X_count_nl, exog_inflate=X_inflate,
    exog_count_names=count_names_nl, exog_inflate_names=inflate_names,
)
zinb_nl_result = zinb_nl.fit()

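rd_nl_idx = count_names_nl.index('rd_intensity')
rd_sq_idx = count_names_nl.index('rd_sq')
# Caution: with rd_sq in the model, the rd_intensity coefficient is the log-scale
# slope at rd_intensity = 0, so it is not directly comparable to the baseline.
# The slope at a given level is coef_rd + 2 * coef_rd_sq * rd_intensity.
robustness_results['Nonlinear R&D'] = {
    'coef_rd': zinb_nl_result.params_count[rd_nl_idx],
    'se_rd': zinb_nl_result.bse_count[rd_nl_idx],
}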
print(f'Nonlinear R&D: rd = {zinb_nl_result.params_count[rd_nl_idx]:.4f}, '
      f'rd_sq = {zinb_nl_result.params_count[rd_sq_idx]:.6f}')

nl_table = pd.DataFrame({
    'Variable': count_names_nl,
    'Coef': zinb_nl_result.params_count,
    'SE': zinb_nl_result.bse_count,
})
nl_table.to_csv(TABLES_DIR / 'table_16_robustness_nonlinear.csv', index=False)

In [None]:
# Spec 3: R&D x Size interaction
df['rd_x_size'] = df['rd_intensity'] * df['firm_size']

X_vars_int = ['rd_intensity', 'firm_size', 'capital_intensity', 'rd_x_size']
X_count_int = sm.add_constant(pd.concat([df[X_vars_int], year_dummies], axis=1).values)
count_names_int = ['const'] + X_vars_int + year_dummy_names

zinb_int = ZeroInflatedNegativeBinomial(
    endog=y, exog_count=X_count_int, exog_inflate=X_inflate,
    exog_count_names=count_names_int, exog_inflate_names=inflate_names,
)
zinb_int_result = zinb_int.fit()

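rd_int_idx = count_names_int.index('rd_intensity')
rdxsize_idx = count_names_int.index('rd_x_size')
# Caution: with the interaction in the model, the rd_intensity coefficient is the
# log-scale slope at firm_size = 0, so it is not directly comparable to the baseline.
# The slope at a given size is coef_rd + coef_rd_x_size * firm_size.
robustness_results['R&D x Size'] = {
    'coef_rd': zinb_int_result.params_count[rd_int_idx],
    'se_rd': zinb_int_result.bse_count[rd_int_idx],
}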
print(f'Interaction: rd = {zinb_int_result.params_count[rd_int_idx]:.4f}, '
      f'rd_x_size = {zinb_int_result.params_count[rdxsize_idx]:.6f}')

int_table = pd.DataFrame({
    'Variable': count_names_int,
    'Coef': zinb_int_result.params_count,
    'SE': zinb_int_result.bse_count,
})
int_table.to_csv(TABLES_DIR / 'table_17_robustness_interactions.csv', index=False)

In [None]:
# 7.2 Sample Sensitivity
print('7.2 Sample Sensitivity')
print('=' * 60)

# Exclude top 1%
p99 = df['patents'].quantile(0.99)
df_no_outliers = df[df['patents'] <= p99].copy()
y_no = df_no_outliers['patents'].values

year_dum_no = pd.get_dummies(df_no_outliers['year'], prefix='year', drop_first=True, dtype=float)
ind_dum_no = pd.get_dummies(df_no_outliers['industry'], prefix='ind', drop_first=True, dtype=float)
X_count_no = sm.add_constant(pd.concat([df_no_outliers[X_vars], year_dum_no], axis=1).values)
X_inflate_no = sm.add_constant(pd.concat([ind_dum_no, df_no_outliers[['firm_size']]], axis=1).values)

count_names_no = ['const'] + X_vars + list(year_dum_no.columns)
inflate_names_no = ['const_z'] + [f'{n}_z' for n in ind_dum_no.columns] + ['firm_size_z']

zinb_no = ZeroInflatedNegativeBinomial(
    endog=y_no, exog_count=X_count_no, exog_inflate=X_inflate_no,
    exog_count_names=count_names_no, exog_inflate_names=inflate_names_no,
)
zinb_no_result = zinb_no.fit()

rd_no_idx = count_names_no.index('rd_intensity')
robustness_results['Excl. Outliers'] = {
    'coef_rd': zinb_no_result.params_count[rd_no_idx],
    'se_rd': zinb_no_result.bse_count[rd_no_idx],
}
print(f'  Excluding top 1%: N = {len(df_no_outliers)}, '
      f'coef_rd = {zinb_no_result.params_count[rd_no_idx]:.4f}')

# Time periods
for period_name, year_range in [('Early (2012-2015)', range(2012, 2016)),
                                 ('Late (2016-2019)', range(2016, 2020))]:
    df_sub = df[df['year'].isin(year_range)].copy()
    y_sub = df_sub['patents'].values
    year_dum_sub = pd.get_dummies(df_sub['year'], prefix='year', drop_first=True, dtype=float)
    ind_dum_sub = pd.get_dummies(df_sub['industry'], prefix='ind', drop_first=True, dtype=float)

    count_names_sub = ['const'] + X_vars + list(year_dum_sub.columns)
    inflate_names_sub = ['const_z'] + [f'{n}_z' for n in ind_dum_sub.columns] + ['firm_size_z']

    X_count_sub = sm.add_constant(pd.concat([df_sub[X_vars], year_dum_sub], axis=1).values)
    X_inflate_sub = sm.add_constant(pd.concat([ind_dum_sub, df_sub[['firm_size']]], axis=1).values)

    zinb_sub = ZeroInflatedNegativeBinomial(
        endog=y_sub, exog_count=X_count_sub, exog_inflate=X_inflate_sub,
        exog_count_names=count_names_sub, exog_inflate_names=inflate_names_sub,
    )
    zinb_sub_result = zinb_sub.fit()

    rd_sub_idx = count_names_sub.index('rd_intensity')
    robustness_results[period_name] = {
        'coef_rd': zinb_sub_result.params_count[rd_sub_idx],
        'se_rd': zinb_sub_result.bse_count[rd_sub_idx],
    }
    print(f'  {period_name}: N = {len(df_sub)}, '
          f'coef_rd = {zinb_sub_result.params_count[rd_sub_idx]:.4f}')

In [None]:
# Robustness summary
robustness_df = pd.DataFrame(robustness_results).T
robustness_df.columns = ['R&D Coef', 'SE']
robustness_df['z'] = robustness_df['R&D Coef'] / robustness_df['SE']
robustness_df['CI Lower'] = robustness_df['R&D Coef'] - 1.96 * robustness_df['SE']
robustness_df['CI Upper'] = robustness_df['R&D Coef'] + 1.96 * robustness_df['SE']

print('Robustness Summary: R&D Intensity Coefficient Across Specifications')
print('=' * 80)
display(robustness_df)
robustness_df.to_csv(TABLES_DIR / 'table_18_robustness_subsample.csv')

In [None]:
# Figure: Robustness forest plot
fig, ax = plt.subplots(figsize=(10, 6))

y_pos = np.arange(len(robustness_df))
ax.errorbar(robustness_df['R&D Coef'], y_pos,
            xerr=[robustness_df['R&D Coef'] - robustness_df['CI Lower'],
                  robustness_df['CI Upper'] - robustness_df['R&D Coef']],
            fmt='o', markersize=8, capsize=5, capthick=2, color='steelblue')
ax.plot(robustness_df.iloc[0]['R&D Coef'], 0, 'o', markersize=10, color='green', zorder=5)

ax.axvline(robustness_df.iloc[0]['R&D Coef'], color='green', linestyle='--',
           linewidth=1.5, alpha=0.5, label='Baseline estimate')
ax.set_yticks(y_pos)
ax.set_yticklabels(robustness_df.index)
ax.set_xlabel('R&D Intensity Coefficient', fontsize=12)
ax.set_title('Robustness of R&D Effect Across Specifications\n(Point estimates with 95% CIs)',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'robustness_forest_plot.png', dpi=300, bbox_inches='tight')
plt.show()

---

## Section 8: Publication-Quality Results (10 min)

Create final tables and figures suitable for academic publication.
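
Publication tables often report each coefficient with significance stars and its standard error in parentheses. The helper below is a hypothetical sketch of that convention (the function name `format_coef` and the 1%/5%/10% thresholds are assumptions, not part of PanelBox); journals differ on star conventions, so adjust to the target outlet.

```python
def format_coef(coef, se, pval, digits=3):
    """Format a coefficient as 'coef*** (se)' using 1%, 5%, and 10% thresholds."""
    if pval < 0.01:
        stars = '***'
    elif pval < 0.05:
        stars = '**'
    elif pval < 0.10:
        stars = '*'
    else:
        stars = ''
    return f'{coef:.{digits}f}{stars} ({se:.{digits}f})'

# Illustrative inputs only
print(format_coef(0.0812, 0.0203, 0.0001))
print(format_coef(-0.0150, 0.0120, 0.2110))
```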

In [None]:
# Table Final 01: Descriptive Statistics
print('Publication Table 1: Descriptive Statistics')
print('=' * 80)

desc_vars = ['patents', 'rd_intensity', 'firm_size', 'capital_intensity',
             'export_share', 'hhi', 'firm_age', 'subsidy']

overall = df[desc_vars].describe().T[['mean', 'std', 'min', 'max']]
overall.columns = ['Mean (All)', 'SD (All)', 'Min', 'Max']

innovators = df[df['patents'] > 0][desc_vars].describe().T[['mean', 'std']]
innovators.columns = ['Mean (Innovators)', 'SD (Innovators)']

non_innovators = df[df['patents'] == 0][desc_vars].describe().T[['mean', 'std']]
non_innovators.columns = ['Mean (Non-Innov.)', 'SD (Non-Innov.)']

table_final_01 = pd.concat([overall, innovators, non_innovators], axis=1)
table_final_01['N (All)'] = len(df)
table_final_01['N (Innov.)'] = (df['patents'] > 0).sum()
table_final_01['N (Non-Innov.)'] = (df['patents'] == 0).sum()

display(table_final_01)
table_final_01.to_csv(TABLES_DIR / 'table_final_01_descriptives.csv')

In [None]:
# Table Final 02: Model Comparison
print('Publication Table 2: Model Comparison')
print('=' * 80)

key_vars_compare = ['rd_intensity', 'firm_size', 'capital_intensity']
comparison_rows = []
for var in key_vars_compare:
    row = {'Variable': var}
    idx_p = pooled_names.index(var)
    row['Pooled Poisson'] = f'{pois_pool_result.params[idx_p]:.4f} ({pois_pool_result.se[idx_p]:.4f})'
    row['Pooled NB'] = f'{nb_pool_result.params_exog[idx_p]:.4f} ({nb_pool_result.se[idx_p]:.4f})'
    if fe_estimated:
        fe_idx = fe_names.index(var)
        row['FE Poisson'] = f'{fe_pois_result.params[fe_idx]:.4f} ({fe_pois_result.se[fe_idx]:.4f})'
    zip_idx = count_names.index(var)
    row['ZIP'] = f'{zip_result.params_count[zip_idx]:.4f} ({zip_result.bse_count[zip_idx]:.4f})'
    row['ZINB'] = f'{zinb_result.params_count[zip_idx]:.4f} ({zinb_result.bse_count[zip_idx]:.4f})'
    comparison_rows.append(row)

sep_cols = ['Variable', 'Pooled Poisson', 'Pooled NB', 'ZIP', 'ZINB']
if fe_estimated:
    sep_cols.insert(3, 'FE Poisson')
comparison_rows.append({k: '---' for k in sep_cols})

stat_models = ['Pooled Poisson', 'Pooled NB', 'ZIP', 'ZINB']
if fe_estimated:
    stat_models.insert(2, 'FE Poisson')

comparison_rows.append(dict(Variable='Log-Lik', **{k: f'{model_stats[k]["llf"]:.1f}' for k in stat_models}))
comparison_rows.append(dict(Variable='AIC', **{k: f'{model_stats[k]["aic"]:.1f}' if not np.isnan(model_stats[k]["aic"]) else 'N/A' for k in stat_models}))
comparison_rows.append(dict(Variable='N', **{k: str(len(y)) for k in stat_models}))

table_final_02 = pd.DataFrame(comparison_rows)
display(table_final_02)
table_final_02.to_csv(TABLES_DIR / 'table_final_02_model_comparison.csv', index=False)

In [None]:
# Table Final 03: ZINB Full Results
print('Publication Table 3: ZINB Full Results')
print('=' * 80)

# Use a neutral 'exp(Coef)' column: it holds IRRs for the count component and
# odds ratios for the inflation component, so neither label fits both.
count_pub = pd.DataFrame({
    'Component': 'Count',
    'Variable': count_names,
    'Coef': zinb_result.params_count,
    'SE': zinb_result.bse_count,
    'exp(Coef)': np.exp(zinb_result.params_count),  # incidence rate ratio
    'p-value': 2 * (1 - stats.norm.cdf(np.abs(zinb_result.params_count / zinb_result.bse_count))),
})

inflate_pub = pd.DataFrame({
    'Component': 'Inflate',
    'Variable': inflate_names,
    'Coef': zinb_result.params_inflate,
    'SE': zinb_result.bse_inflate,
    'exp(Coef)': np.exp(zinb_result.params_inflate),  # odds ratio
    'p-value': 2 * (1 - stats.norm.cdf(np.abs(zinb_result.params_inflate / zinb_result.bse_inflate))),
})

disp_pub = pd.DataFrame({
    'Component': ['Dispersion'],
    'Variable': ['alpha'],
    'Coef': [zinb_result.alpha],
    'SE': [zinb_result.bse_alpha],
    'exp(Coef)': [np.nan],
    'p-value': [np.nan],
})

table_final_03 = pd.concat([count_pub, inflate_pub, disp_pub], ignore_index=True)
display(table_final_03)
table_final_03.to_csv(TABLES_DIR / 'table_final_03_zinb_full.csv', index=False)

In [None]:
# Table Final 04: Marginal Effects
print('Publication Table 4: Average Marginal Effects (ZINB)')
print('=' * 80)

table_final_04 = ame_zinb_df[['AME', 'SE', 'CI Lower', 'CI Upper', 'p-value']].copy()

interpretations = []
for var_name in table_final_04.index:
    ame_val = table_final_04.loc[var_name, 'AME']
    if 'year' in var_name:
        interpretations.append(f'{ame_val:+.2f} patents vs base year')
    elif var_name == 'rd_intensity':
        interpretations.append(f'+1 p.p. R&D => {ame_val:+.2f} patents')
    elif var_name == 'firm_size':
        interpretations.append(f'+1 log(emp) => {ame_val:+.2f} patents')
    elif var_name == 'capital_intensity':
        interpretations.append(f'+1 unit K/L => {ame_val:+.4f} patents')
    else:
        interpretations.append(f'{ame_val:+.4f} patents')

table_final_04['Interpretation'] = interpretations
display(table_final_04)
table_final_04.to_csv(TABLES_DIR / 'table_final_04_marginal_effects.csv')

In [None]:
# Figure Final 01: Observed vs ZINB Fitted
fig, ax = plt.subplots(figsize=(10, 6))

max_count_final = min(int(df['patents'].quantile(0.98)), 25)
count_range_final = np.arange(0, max_count_final + 1)
observed_final = np.array([(y == k).sum() for k in count_range_final])

width = 0.35
ax.bar(count_range_final - width/2, observed_final, width,
       label='Observed', color='steelblue', edgecolor='black', alpha=0.7)

zinb_final_freq = np.array([
    np.mean(pi_zinb * (k == 0) + (1 - pi_zinb) * stats.nbinom.pmf(k, r_zinb, p_zinb)) * len(y)
    for k in count_range_final
])
ax.bar(count_range_final + width/2, zinb_final_freq, width,
       label='ZINB Predicted', color='coral', edgecolor='black', alpha=0.7)

ax.set_xlabel('Number of Patents', fontsize=13)
ax.set_ylabel('Frequency', fontsize=13)
ax.set_title('Figure 1: Observed vs ZINB-Predicted Patent Distribution',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'figure_final_01_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Figure Final 02: ME of R&D by Firm Size
fig, ax = plt.subplots(figsize=(10, 6))

x_pos = np.arange(len(me_size_df))
bars = ax.bar(x_pos, me_size_df['AME(R&D)'],
              yerr=1.96 * me_size_df['SE'], capsize=6,
              alpha=0.7, color=['#4393C3', '#92C5DE', '#F4A582', '#D6604D'],
              edgecolor='black', linewidth=1.5)
ax.axhline(ame_zinb_data['rd_intensity']['AME'], color='black', linestyle='--',
           linewidth=2, label=f'Overall AME = {ame_zinb_data["rd_intensity"]["AME"]:.3f}')
ax.set_xticks(x_pos)
ax.set_xticklabels(me_size_df['Size Group'], fontsize=12)
ax.set_ylabel('AME of R&D Intensity\n(Additional Patents per 1 p.p. Increase)', fontsize=12)
ax.set_title('Figure 2: Marginal Effect of R&D by Firm Size',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'figure_final_02_me_rd.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Figure Final 03: Policy Simulation CDF
fig, ax = plt.subplots(figsize=(10, 6))

ey_sorted_base = np.sort(E_y_baseline)
ey_sorted_cf = np.sort(E_y_cf)
cdf = np.arange(1, len(ey_sorted_base) + 1) / len(ey_sorted_base)

ax.plot(ey_sorted_base, cdf, 'b-', linewidth=2.5,
        label=f'Baseline (mean = {E_y_baseline.mean():.2f})')
ax.plot(ey_sorted_cf, cdf, 'r-', linewidth=2.5,
        label=f'With Tax Credit (mean = {E_y_cf.mean():.2f})')
ax.fill_betweenx(cdf, ey_sorted_base, ey_sorted_cf, alpha=0.15, color='red')

ax.set_xlabel('Expected Patents per Firm-Year', fontsize=13)
ax.set_ylabel('Cumulative Proportion', fontsize=13)
ax.set_title('Figure 3: Policy Simulation\nR&D Tax Credit (+2 p.p. R&D Intensity)',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(alpha=0.3)
ax.annotate(f'Mean impact: +{impact.mean():.2f} patents',
            xy=(E_y_cf.mean(), 0.5), fontsize=12,
            bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'figure_final_03_policy_simulation.png', dpi=300, bbox_inches='tight')
plt.show()

---

## Section 9: Economic Interpretation and Conclusions (10 min)

### Key Findings

In [None]:
# Summary of key findings
print('KEY FINDINGS')
print('=' * 70)

rd_irr = np.exp(zinb_result.params_count[rd_idx_zinb])
rd_ame_val = ame_zinb_data['rd_intensity']['AME']
print(f'\n1. R&D-Patent Semi-Elasticity:')
print(f'   IRR = {rd_irr:.4f} => a 1 p.p. increase in R&D intensity')
print(f'   changes expected patents by {(rd_irr-1)*100:+.1f}% (among potential innovators)')
print(f'   AME = {rd_ame_val:.3f} additional patents per firm-year')
print(f'   Compare with the elasticity estimates surveyed in Griliches (1990).')

size_irr = np.exp(zinb_result.params_count[count_names.index('firm_size')])
print(f'\n2. Firm Size Effect:')
print(f'   IRR = {size_irr:.4f} => Larger firms patent more')
print(f'   Larger firms also less likely to be non-innovators')
print(f'   => Innovation is concentrated in larger firms.')

print('\n3. Structural Non-Innovators:')
print(f'   An estimated {pi_zinb_pred.mean():.0%} of firm-years are structural non-innovators')
print('   (mean predicted probability from the inflate component)')
print('   Smaller firms overrepresented among non-innovators')

print('\n4. Policy Implications:')
print(f'   R&D Tax Credit (+2 p.p.): {impact.mean():+.2f} patents per firm-year')
print(f'   Total impact: {impact.sum():+.0f} patents summed over all firm-years in the sample')
print('   Effect heterogeneous: larger absolute gains for large firms')
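
The IRR, the expected count, and the policy impact reported above all follow from the ZINB mean function, E[y | x] = (1 - pi) * mu. The sketch below illustrates that arithmetic for a single firm-year. It assumes a log link for the count component and a logit link for the inflate component, and it shifts R&D intensity in both components; the helper names and every coefficient value are hypothetical and are not taken from the fitted model.

```python
import math

def irr_to_percent(beta):
    """Percent change in the count-component mean for a one-unit increase in x."""
    return (math.exp(beta) - 1.0) * 100.0

def zinb_expected_count(x_count, beta_count, x_inflate, gamma_inflate):
    """E[y | x] = (1 - pi) * mu, with mu = exp(x'beta) and pi = logistic(z'gamma)."""
    mu = math.exp(sum(b * x for b, x in zip(beta_count, x_count)))
    eta = sum(g * z for g, z in zip(gamma_inflate, x_inflate))
    pi = 1.0 / (1.0 + math.exp(-eta))
    return (1.0 - pi) * mu

# Hypothetical coefficients, ordered as [const, rd_intensity, firm_size]
beta_count_demo = [0.5, 0.08, 0.30]
gamma_inflate_demo = [0.2, -0.05, -0.40]

# One hypothetical firm-year: R&D intensity of 5 (percent), firm size of 2
x_base = [1.0, 5.0, 2.0]
x_cf = [1.0, 7.0, 2.0]  # counterfactual: +2 p.p. R&D intensity

ey_base = zinb_expected_count(x_base, beta_count_demo, x_base, gamma_inflate_demo)
ey_cf = zinb_expected_count(x_cf, beta_count_demo, x_cf, gamma_inflate_demo)
impact_demo = ey_cf - ey_base

print(f"R&D effect on the count mean: {irr_to_percent(beta_count_demo[1]):+.1f}% per 1 p.p.")
print(f"E[y] baseline = {ey_base:.3f}, counterfactual = {ey_cf:.3f}")
print(f"Impact for this firm-year = {impact_demo:+.3f} patents")
```

Because the impact depends on both `mu` and `pi`, the same +2 p.p. shift produces different absolute gains for firms with different covariates, which is the source of the heterogeneity noted in finding 4.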

In [None]:
# Limitations
print('LIMITATIONS')
print('=' * 70)
print()
print('1. Patents != Innovation')
print('   Patents are just one measure of innovation output.')
print('   Many innovations are not patented (trade secrets, process improvements).')
print()
print('2. R&D Endogeneity')
print('   Firms that expect to patent more may invest more in R&D.')
print('   Would need instrumental variables or dynamic GMM for causal claims.')
print()
print('3. Sample Limitations')
print('   Manufacturing firms only. Results may not generalize to services.')
print('   Simulated data -- real-world relationships may differ.')
print()
print('4. Correlation vs Causation')
print('   Our ZINB estimates are associations, not causal effects.')
print('   Policy simulations assume constant structural parameters.')

---

## Section 10: Summary and Extensions

### What We Did

1. Conducted comprehensive EDA (overdispersion, excess zeros, panel structure)
2. Estimated 5 count models (Poisson, NB, FE Poisson, ZIP, ZINB)
3. Performed rigorous model selection (tests, AIC/BIC, rootograms)
4. Interpreted preferred model (ZINB: count + inflate components)
5. Computed marginal effects with heterogeneity analysis
6. Conducted policy simulations (R&D tax credit counterfactual)
7. Performed robustness checks (alternative specs, subsamples, outliers)
8. Created publication-quality tables and figures

### Complete Workflow Summary

```python
# 1. EDA: Check overdispersion, zeros, panel structure
# 2. Estimate models: Poisson, NB, FE, ZIP, ZINB
# 3. Model selection: Tests, AIC/BIC -> Choose ZINB
# 4. Interpret: IRRs, count + inflate components
# 5. Marginal effects: AME with heterogeneity
# 6. Policy simulation: Counterfactual analysis
# 7. Robustness: Alternative specs, subsamples
# 8. Report: Publication-quality tables and figures
```
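
Step 1 of the workflow can be sketched with the standard library alone. The example below computes the variance-to-mean ratio and compares the observed share of zeros with the share a Poisson model with the same mean would imply. The function name `count_diagnostics` and the toy counts are made up for illustration; they are not the case-study data.

```python
import math
from statistics import mean, variance

def count_diagnostics(y):
    """Return the mean, the variance-to-mean ratio, and observed vs Poisson-implied zero shares."""
    m = mean(y)
    return {
        "mean": m,
        "dispersion": variance(y) / m,        # close to 1 under a Poisson model
        "zero_share": sum(v == 0 for v in y) / len(y),
        "poisson_zero_share": math.exp(-m),   # P(Y = 0) for a Poisson with the same mean
    }

# Toy patent counts (hypothetical values)
toy_patents = [0, 0, 0, 0, 1, 0, 2, 0, 5, 12, 0, 3, 0, 1, 20, 0, 7, 0, 0, 2]
diag = count_diagnostics(toy_patents)
for name, value in diag.items():
    print(f"{name:>20s}: {value:.3f}")
```

A dispersion ratio well above 1 points toward a negative binomial specification, and an observed zero share far above the Poisson-implied share motivates the zero-inflated models in steps 2 and 3.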

### Extensions (Future Work)

1. **Dynamic models**: Include lagged patents (knowledge accumulation)
2. **GMM**: Address R&D endogeneity with instrumental variables
3. **Spatial models**: Innovation spillovers from nearby firms
4. **Recent data**: Apply to post-2019 data
5. **Other outcomes**: Patent citations, product launches, revenue

### Lessons for Research

1. Always compare multiple models systematically
2. Use formal tests, not just eyeballing
3. Compute marginal effects for interpretation (not just coefficients)
4. Check robustness extensively
5. Connect results to economic theory and policy
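
As an illustration of lesson 2, the sketch below computes AIC and BIC from log-likelihoods and runs a likelihood-ratio test of Poisson against negative binomial. The log-likelihood values and parameter counts are hypothetical placeholders, not results from this case study. The test assumes the null (dispersion parameter equal to zero) lies on the boundary of the parameter space, so the chi-square(1) p-value is halved, as discussed in Cameron and Trivedi (2013).

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 log L."""
    return k * math.log(n) - 2 * loglik

def lr_test_boundary(ll_restricted, ll_unrestricted):
    """LR test with one restriction on the boundary (e.g. Poisson vs NB)."""
    stat = max(2.0 * (ll_unrestricted - ll_restricted), 0.0)
    if stat == 0.0:
        return stat, 1.0
    # Survival function of chi-square(1): P(X > s) = erfc(sqrt(s / 2))
    p_chi2 = math.erfc(math.sqrt(stat / 2.0))
    # Null distribution is a 50:50 mixture of chi-square(0) and chi-square(1)
    return stat, 0.5 * p_chi2

# Hypothetical fits: (log-likelihood, number of parameters)
n_obs = 4000
fits = {"Poisson": (-12000.0, 6), "NegBin": (-8000.0, 7)}

for name, (ll, k) in fits.items():
    print(f"{name:8s} AIC = {aic(ll, k):10.1f}  BIC = {bic(ll, k, n_obs):10.1f}")

lr_stat, lr_p = lr_test_boundary(fits["Poisson"][0], fits["NegBin"][0])
print(f"LR statistic = {lr_stat:.1f}, p-value = {lr_p:.4g}")
```

Lower AIC and BIC favor a model, and a small LR p-value rejects the Poisson restriction. Non-nested comparisons such as NB against ZINB need a different test, for example the Vuong test.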

### References

- Griliches, Z. (1990). Patent statistics as economic indicators: A survey. *Journal of Economic Literature*, 28(4), 1661-1707.
- Hall, B. H., Griliches, Z., & Hausman, J. A. (1986). Patents and R and D: Is there a lag? *International Economic Review*, 27(2), 265-283.
- Aghion, P., Bloom, N., Blundell, R., Griffith, R., & Howitt, P. (2005). Competition and innovation: An inverted-U relationship. *Quarterly Journal of Economics*, 120(2), 701-728.
- Cameron, A. C., & Trivedi, P. K. (2013). *Regression Analysis of Count Data* (2nd ed.). Cambridge University Press.
- Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. *Technometrics*, 34(1), 1-14.

In [None]:
# Final summary: list all output files
print('OUTPUT FILES GENERATED')
print('=' * 60)

print('\nFigures:')
for f in sorted(FIGURES_DIR.glob('*.png')):
    print(f'  {f.name}')

print('\nTables:')
for f in sorted(TABLES_DIR.glob('*.csv')):
    print(f'  {f.name}')

print(f'\nTotal figures: {len(list(FIGURES_DIR.glob("*.png")))}')
print(f'Total tables: {len(list(TABLES_DIR.glob("*.csv")))}')
print('\nCase study complete!')