# Tutorial 03: Estimation and Results Interpretation

**Series**: PanelBox - Fundamentals  
**Level**: Intermediate  
**Estimated Time**: 60-75 minutes  
**Prerequisites**: Tutorials 01 (Panel Data Structures) and 02 (Formulas)

## Learning Objectives

By the end of this tutorial, you will be able to:
- Estimate panel data models using PanelBox
- Interpret regression coefficients in economic terms
- Understand standard errors, t-statistics, and p-values
- Compare classical, robust, and clustered standard errors
- Compute and interpret confidence intervals
- Perform hypothesis tests
- Export results to multiple formats (LaTeX, Markdown, JSON)
- Validate and diagnose model fit

## Table of Contents
1. [Introduction to Estimation](#1-introduction-to-estimation)
2. [Your First Model: Pooled OLS](#2-your-first-model-pooled-ols)
3. [Understanding Results Tables](#3-understanding-results-tables)
4. [Standard Errors and Inference](#4-standard-errors-and-inference)
5. [Hypothesis Testing](#5-hypothesis-testing)
6. [Model Diagnostics](#6-model-diagnostics)
7. [Exporting Results](#7-exporting-results)
8. [Practical Exercises](#8-practical-exercises)
9. [Summary and Next Steps](#9-summary-and-next-steps)

---

In [None]:
# Notebook metadata
__version__ = "1.0.0"
__last_updated__ = "2026-02-16"
__compatible_with__ = "PanelBox >= 0.1.0"

# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display, Markdown, HTML
import warnings
warnings.filterwarnings('ignore')

# PanelBox library (development mode)
import sys
sys.path.insert(0, '/home/guhaase/projetos/panelbox')
import panelbox as pb
from panelbox.models import PooledOLS

# Try to import other estimators (may not all be implemented yet)
try:
    from panelbox.models import FixedEffects, RandomEffects
    FE_AVAILABLE = True
except ImportError:
    FE_AVAILABLE = False
    print("Note: FixedEffects/RandomEffects not available yet")

# Plotting configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)
pd.set_option('display.width', 100)

# Display library version
print(f"PanelBox version: {pb.__version__}")
print(f"Notebook version: {__version__}")
print("Setup complete!")

In [None]:
# Load Grunfeld dataset
try:
    from panelbox.datasets import load_grunfeld
    data = load_grunfeld()
    print("✓ Loaded from panelbox.datasets.load_grunfeld()")
except (ImportError, AttributeError):
    import os
    data_path = '/home/guhaase/projetos/panelbox/examples/datasets/grunfeld.csv'
    if os.path.exists(data_path):
        data = pd.read_csv(data_path)
        print(f"✓ Loaded from {data_path}")
    else:
        data_path = '/home/guhaase/projetos/panelbox/panelbox/datasets/grunfeld.csv'
        if os.path.exists(data_path):
            data = pd.read_csv(data_path)
            print(f"✓ Loaded from {data_path}")
        else:
            raise FileNotFoundError("Grunfeld dataset not found. Please check path.")

# Quick recap
print(f"\nGrunfeld Investment Data")
print(f"Observations: {data.shape[0]}")
print(f"Variables: {list(data.columns)}")
display(data.head())

---
## 1. Introduction to Estimation

### The Econometric Workflow

```
1. THEORY          → What determines investment?
2. MODEL           → invest = β₀ + β₁·value + β₂·capital + ε
3. SPECIFICATION   → Formula: "invest ~ value + capital"
4. ESTIMATION      → Find β̂₀, β̂₁, β̂₂ that best fit the data
5. INFERENCE       → Are β̂ statistically significant?
6. INTERPRETATION  → What do the numbers mean economically?
```

So far, you've learned steps 1-3. This tutorial focuses on **4-6**.

---

### Pooled OLS: The Simplest Estimator

**Pooled Ordinary Least Squares** (Pooled OLS) treats panel data as a large cross-section:
- Ignores panel structure (entities and time)
- Estimates by minimizing sum of squared residuals:
$$
\min_{\beta} \sum_{i=1}^N \sum_{t=1}^T (Y_{it} - X_{it}'\beta)^2
$$
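
As a rough illustration of the minimization above, the sketch below fits Pooled OLS by least squares with NumPy on a synthetic stacked panel. The data, the coefficient values, and the variable names are assumptions made for illustration; this is not PanelBox's internal implementation.

```python
import numpy as np

# Synthetic stacked panel: 5 hypothetical firms x 4 years, no firm effects.
rng = np.random.default_rng(0)
n_firms, n_years = 5, 4
n = n_firms * n_years
value = rng.uniform(100, 1000, size=n)
capital = rng.uniform(10, 500, size=n)
true_beta = np.array([-40.0, 0.11, 0.23])  # illustrative values only

X = np.column_stack([np.ones(n), value, capital])  # intercept + regressors
y = X @ true_beta + rng.normal(0, 5, size=n)

# Pooled OLS: treat all (i, t) rows as one sample and minimize the SSR.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ssr = float(np.sum((y - X @ beta_hat) ** 2))

# Moving away from beta_hat can only increase the sum of squared residuals.
ssr_perturbed = float(np.sum((y - X @ (beta_hat + 0.01)) ** 2))
print(f"beta_hat = {beta_hat}")
print(f"SSR at beta_hat: {ssr:.2f}, SSR after perturbation: {ssr_perturbed:.2f}")
```

Note that the firm and year labels never enter the calculation, which is exactly what "ignores panel structure" means.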

**Advantages**:
- ✅ Simple, fast, interpretable
- ✅ Efficient if there is no unobserved heterogeneity and errors are homoskedastic

**Disadvantages**:
- ❌ Biased if entity-specific effects exist and are correlated with the regressors
- ❌ Default standard errors understate uncertainty when observations within an entity are not independent

**When to use**:
- Exploratory analysis
- Benchmark before fixed effects
- When entities are truly homogeneous (rare!)

---

### What We'll Estimate

**Model**: Grunfeld investment equation
$$
\text{Investment}_{it} = \beta_0 + \beta_1 \text{Value}_{it} + \beta_2 \text{Capital}_{it} + \varepsilon_{it}
$$

**Research questions**:
1. How does market value affect investment? (β₁)
2. How does existing capital stock affect investment? (β₂)
3. Are these effects statistically significant?
4. How much variation do we explain? (R²)

---

## 2. Your First Model: Pooled OLS

### Step 1: Specify the Formula

We'll estimate:
```python
formula = "invest ~ value + capital"
```

This expands to:
$$
\text{invest}_{it} = \beta_0 + \beta_1 \cdot \text{value}_{it} + \beta_2 \cdot \text{capital}_{it} + \varepsilon_{it}
$$

---

In [None]:
# Fit Pooled OLS model
print("="*70)
print("ESTIMATING POOLED OLS MODEL")
print("="*70)

# Specify formula
formula = "invest ~ value + capital"
print(f"\nFormula: {formula}")
print(f"Model: invest = β₀ + β₁·value + β₂·capital + ε")

# Create model instance with entity and time columns
model = PooledOLS(formula, data=data, entity_col='firm', time_col='year')

# Fit the model
results = model.fit()

print("\n✓ Model estimated successfully!")
print(f"  Estimator: {model.__class__.__name__}")
print(f"  Observations: {results.nobs}")
print(f"  Parameters: {len(results.params)}")

In [None]:
# Display results summary
print("\n" + "="*70)
print("ESTIMATION RESULTS")
print("="*70)

# Print summary table
print(results.summary)

---
## 3. Understanding Results Tables

### Key Components of Results

A typical econometrics results table contains:

#### 1. Model Information
- **Estimator**: Pooled OLS, Fixed Effects, etc.
- **Formula**: Model specification
- **Observations**: Number of data points (N×T)
- **Entities/Time**: Panel dimensions

#### 2. Coefficient Estimates
- **Parameter** (β̂): Estimated coefficient
- **Std. Error** (SE): Uncertainty in estimate
- **t-statistic**: β̂ / SE (test H₀: β = 0)
- **p-value**: Probability of a t-statistic at least this extreme if H₀ is true
- **Confidence Interval**: Range of β values consistent with the data at the chosen confidence level
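
To make the relationship between these columns concrete, here is a minimal sketch that derives the t-statistic and an approximate 95% confidence interval from a coefficient and its standard error. The helper `summarize_coefficient` and the numbers are hypothetical, and the 1.96 critical value assumes a large-sample normal approximation.

```python
def summarize_coefficient(beta_hat, se, crit=1.96):
    """Return the t-statistic and an approximate 95% CI for one coefficient."""
    t_stat = beta_hat / se
    ci = (beta_hat - crit * se, beta_hat + crit * se)
    return t_stat, ci

# Made-up numbers, not Grunfeld estimates
t_stat, (ci_low, ci_high) = summarize_coefficient(beta_hat=0.12, se=0.03)

# The coefficient is significant at roughly the 5% level when the CI excludes 0.
significant = not (ci_low <= 0 <= ci_high)
print(f"t = {t_stat:.2f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}], significant: {significant}")
```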

#### 3. Model Fit Statistics
- **R²**: Fraction of variance explained (0 to 1)
- **Adjusted R²**: R² penalized for # of parameters
- **F-statistic**: Test of overall model significance
- **Log-likelihood**: Goodness of fit (higher = better)
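
The two R² measures can be computed directly from actual and fitted values. The sketch below uses a hypothetical helper, `r_squared`, and toy numbers; `n_params` counts all estimated parameters including the intercept.

```python
def r_squared(actual, fitted, n_params):
    """Return (R^2, adjusted R^2) from actual and fitted values."""
    n = len(actual)
    mean_y = sum(actual) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, fitted))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in actual)             # total variation
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_params)
    return r2, r2_adj

# Toy data: 4 observations, intercept + 1 slope
r2_demo, r2_adj_demo = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_params=2)
print(f"R2 = {r2_demo:.4f}, adjusted R2 = {r2_adj_demo:.4f}")
```

Adjusted R² is never larger than R², and the gap widens as parameters are added relative to the sample size.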

---

### Interpreting Coefficients

For our model: `invest = β₀ + β₁·value + β₂·capital + ε`

**β₁ (value coefficient)**:
- **Meaning**: Change in investment per unit change in firm value, holding capital constant
- **Units**: If value increases by 1 (million $), investment increases by β₁ (million $)
- **Ceteris paribus**: "All else equal"

**β₂ (capital coefficient)**:
- **Meaning**: Change in investment per unit change in capital stock, holding value constant

**β₀ (intercept)**:
- **Meaning**: Expected investment when value = capital = 0
- **Interpretation**: Often not economically meaningful (extrapolation)

---

In [None]:
# Extract and display coefficients
print("="*70)
print("COEFFICIENT ESTIMATES")
print("="*70)

# Access coefficients
coefs = results.params
print("\nEstimated coefficients (β̂):")
for name, value in coefs.items():
    print(f"  {name:15s}: {value:10.4f}")

# Interpret each coefficient
print("\n" + "-"*70)
print("ECONOMIC INTERPRETATION")
print("-"*70)

beta_value = coefs['value']
beta_capital = coefs['capital']

print(f"\n1. Value coefficient (β̂₁ = {beta_value:.4f}):")
print(f"   Interpretation: A $1 million increase in firm value is associated")
print(f"   with a ${beta_value:.4f} million change in investment,")
print(f"   holding capital constant.")

print(f"\n2. Capital coefficient (β̂₂ = {beta_capital:.4f}):")
print(f"   Interpretation: A $1 million increase in capital stock is associated")
print(f"   with a ${beta_capital:.4f} million change in investment,")
print(f"   holding firm value constant.")

# Sign interpretation
if beta_value > 0:
    print(f"\n   β̂₁ > 0: Higher value → Higher investment (positive relationship)")
else:
    print(f"\n   β̂₁ < 0: Higher value → Lower investment (negative relationship)")

if beta_capital > 0:
    print(f"   β̂₂ > 0: More capital → More investment (positive relationship)")
else:
    print(f"   β̂₂ < 0: More capital → Less investment (capital substitution?)")

In [None]:
# Model fit statistics
print("\n" + "="*70)
print("MODEL FIT STATISTICS")
print("="*70)

r2 = results.rsquared
r2_adj = results.rsquared_adj
nobs = results.nobs
k = len(results.params)

print(f"\nR²: {r2:.4f}")
print(f"  Interpretation: The model explains {100*r2:.2f}% of the variance in investment")

print(f"\nAdjusted R²: {r2_adj:.4f}")
print(f"  Interpretation: R² adjusted for # of parameters ({k})")
print(f"  Formula: 1 - (1-R²)·(n-1)/(n-k)")

print(f"\nObservations: {nobs}")
print(f"Parameters: {k}")
print(f"Degrees of freedom: {nobs - k}")

# Visualize fit
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Actual vs Fitted
fitted = results.fittedvalues
actual = results.resid + fitted  # Reconstruct actual

axes[0].scatter(fitted, actual, alpha=0.6, edgecolors='k', linewidth=0.5)
axes[0].plot([actual.min(), actual.max()],
             [actual.min(), actual.max()],
             'r--', lw=2, label='Perfect fit')
axes[0].set_xlabel('Fitted Values', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Actual Values', fontsize=12, fontweight='bold')
axes[0].set_title(f'Actual vs Fitted (R² = {r2:.3f})', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals
residuals = results.resid
axes[1].scatter(fitted, residuals, alpha=0.6, edgecolors='k', linewidth=0.5)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Fitted Values', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Residuals', fontsize=12, fontweight='bold')
axes[1].set_title('Residual Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNote: Residuals should be randomly scattered around zero (no pattern)")

---
## 4. Standard Errors and Inference

### What are Standard Errors?

**Standard Error (SE)**: Measure of uncertainty in coefficient estimate
- Small SE → Precise estimate (low sampling variability)
- Large SE → Imprecise estimate (high sampling variability)

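**Formula** (classical case, for coefficient $j$):
$$
SE(\hat{\beta}_j) = \sqrt{\widehat{\text{Var}}(\hat{\beta}_j)} = \sqrt{\hat{\sigma}^2 \left[(X'X)^{-1}\right]_{jj}}
$$

Where $\hat{\sigma}^2$ = estimated variance of errors and $[\cdot]_{jj}$ = the $j$-th diagonal element.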
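
A minimal NumPy sketch of the classical formula, assuming synthetic data with IID errors. All names and values here are illustrative, and the standard error of each coefficient is read off the diagonal of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

sigma2_hat = resid @ resid / (n - k)   # unbiased estimate of the error variance
vcov_classical = sigma2_hat * XtX_inv  # k x k covariance matrix of beta_hat
se_classical = np.sqrt(np.diag(vcov_classical))
print("Classical SEs:", se_classical)
```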

---

### Types of Standard Errors

#### 1. Classical (IID) Standard Errors
**Assumption**: Errors are independent and identically distributed
$$
\text{Var}(\varepsilon) = \sigma^2 I
$$

**Problem**: Violated in panel data!
- ❌ Entity-specific shocks (within-firm correlation)
- ❌ Time-specific shocks (common year effects)

---

#### 2. Robust (Heteroskedasticity-Consistent) Standard Errors
**Allows**: Errors have different variances across observations
$$
\text{Var}(\varepsilon_i) = \sigma_i^2 \text{ (can vary)}
$$

**Estimator**: Huber-White sandwich estimator
- ✅ Valid under heteroskedasticity
- ❌ Still assumes independence (not ideal for panels)
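
The sandwich idea can be sketched in a few lines of NumPy. This example computes the basic HC0 variant (no small-sample correction) on synthetic heteroskedastic data; PanelBox's `cov_type='robust'` may apply a different correction, so treat this as an illustration of the technique rather than a reproduction of the library's output.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
# Error standard deviation grows with x, so the errors are heteroskedastic.
y = 1.0 + 0.5 * x + rng.normal(scale=x, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# "Bread" is (X'X)^{-1}; "meat" is sum_i e_i^2 * x_i x_i'.
meat = (X * resid[:, None] ** 2).T @ X
vcov_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(vcov_hc0))

# Classical SEs for comparison
se_iid_demo = np.sqrt(np.diag(resid @ resid / (n - X.shape[1]) * XtX_inv))
print("HC0 SEs:      ", se_hc0)
print("Classical SEs:", se_iid_demo)
```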

---

#### 3. Clustered Standard Errors
**Allows**: Errors correlated within clusters (e.g., within firms)
$$
\text{Cov}(\varepsilon_{it}, \varepsilon_{is}) \neq 0 \text{ for same firm } i
$$

**Best for panel data**:
- ✅ Accounts for within-entity correlation
- ✅ Standard in panel econometrics

**Cluster by**: Entity (firm), time, or both
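
Clustering changes only the "meat" of the sandwich: residual scores are summed within each cluster before taking outer products. The sketch below clusters by a hypothetical `firm` identifier on synthetic data and applies no small-sample correction, so its numbers need not match any particular library's defaults.

```python
import numpy as np

rng = np.random.default_rng(3)
n_firms, n_years = 10, 20
firm = np.repeat(np.arange(n_firms), n_years)
x = rng.normal(size=n_firms * n_years)
# A firm-level shock shared across a firm's years induces within-firm correlation.
firm_shock = rng.normal(size=n_firms)[firm]
y = 1.0 + 0.5 * x + firm_shock + rng.normal(size=x.size)

X = np.column_stack([np.ones(x.size), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Meat: sum over clusters g of (X_g' e_g)(X_g' e_g)'.
k = X.shape[1]
meat = np.zeros((k, k))
for g in np.unique(firm):
    score_g = X[firm == g].T @ resid[firm == g]
    meat += np.outer(score_g, score_g)

vcov_clustered = XtX_inv @ meat @ XtX_inv
clustered_se = np.sqrt(np.diag(vcov_clustered))
print("Cluster-robust SEs:", clustered_se)
```

With only a handful of clusters, cluster-robust standard errors can themselves be unreliable, which is worth keeping in mind for small panels.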

---

### Statistical Significance

**t-statistic**: How many standard errors is β̂ away from zero?
$$
t = \frac{\hat{\beta}}{SE(\hat{\beta})}
$$

**p-value**: Probability of observing |t| at least this large if true β = 0
- p < 0.01 → *** (highly significant)
- p < 0.05 → ** (significant)
- p < 0.10 → * (weakly significant)
- p ≥ 0.10 → Not significant

**Rule of thumb**: |t| > 2 → Usually significant (p < 0.05)
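
The rule of thumb can be checked numerically with SciPy. The helper `two_sided_pvalue` is hypothetical, and the 197 residual degrees of freedom are an assumed value chosen only for illustration.

```python
from scipy import stats

def two_sided_pvalue(t_stat, df_resid):
    """Two-sided p-value for H0: beta = 0 under a Student-t reference distribution."""
    return 2 * stats.t.sf(abs(t_stat), df_resid)

df_resid = 197  # assumed: e.g. 200 observations minus 3 parameters
p_at_2 = two_sided_pvalue(2.0, df_resid)
p_at_1_5 = two_sided_pvalue(1.5, df_resid)
print(f"|t| = 2.0 -> p = {p_at_2:.4f}")
print(f"|t| = 1.5 -> p = {p_at_1_5:.4f}")
```

With few degrees of freedom the critical value rises above 2, so the rule of thumb is best treated as a large-sample shortcut.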

---

In [None]:
# Compare different standard error types
print("="*70)
print("COMPARING STANDARD ERROR TYPES")
print("="*70)

# Re-estimate with different SE types
results_iid = model.fit(cov_type='nonrobust')  # Classical
results_robust = model.fit(cov_type='robust')   # Heteroskedasticity-robust
results_cluster = model.fit(cov_type='clustered')  # Clustered by entity

# Extract SEs
se_iid = results_iid.std_errors
se_robust = results_robust.std_errors
se_cluster = results_cluster.std_errors

# Create comparison table
comparison = pd.DataFrame({
    'Classical': se_iid,
    'Robust': se_robust,
    'Clustered': se_cluster
})

print("\nStandard Errors Comparison:")
print("(Rows = Variables, Columns = SE Type)")
display(comparison)

# Calculate ratios
print("\n" + "-"*70)
print("RELATIVE MAGNITUDE")
print("-"*70)

for var in comparison.index:
    ratio_robust = comparison.loc[var, 'Robust'] / comparison.loc[var, 'Classical']
    ratio_cluster = comparison.loc[var, 'Clustered'] / comparison.loc[var, 'Classical']
    
    print(f"\n{var}:")
    print(f"  Robust / Classical: {ratio_robust:.3f} ({100*(ratio_robust-1):+.1f}%)")
    print(f"  Clustered / Classical: {ratio_cluster:.3f} ({100*(ratio_cluster-1):+.1f}%)")

print("\n" + "-"*70)
print("INTERPRETATION")
print("-"*70)
print("• Clustered SEs > Classical SEs → Within-firm correlation present")
print("• Use CLUSTERED SEs for panel data (accounts for correlation)")

In [None]:
# Confidence intervals
print("\n" + "="*70)
print("CONFIDENCE INTERVALS")
print("="*70)

# 95% confidence intervals
ci_95 = results_cluster.conf_int(level=0.95)
print("\n95% Confidence Intervals (Clustered SEs):")
display(ci_95)

# Visualize CIs
fig, ax = plt.subplots(figsize=(10, 6))

vars_to_plot = [v for v in coefs.index if v != 'Intercept']
y_pos = np.arange(len(vars_to_plot))

for i, var in enumerate(vars_to_plot):
    coef = coefs[var]
    ci_low = ci_95.loc[var, 'lower']
    ci_high = ci_95.loc[var, 'upper']
    
    # Plot point estimate
    ax.plot(coef, i, 'o', markersize=10, color='darkblue', zorder=3)
    
    # Plot CI
    ax.plot([ci_low, ci_high], [i, i], 'o-', linewidth=2, markersize=5,
            color='steelblue', zorder=2, alpha=0.7)

# Reference line at zero
ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='H₀: β = 0', zorder=1)

ax.set_yticks(y_pos)
ax.set_yticklabels(vars_to_plot)
ax.set_xlabel('Coefficient Value', fontsize=12, fontweight='bold')
ax.set_title('Coefficient Estimates with 95% Confidence Intervals',
             fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("• If CI includes 0 → Cannot reject H₀: β = 0 (not significant)")
print("• If CI excludes 0 → Reject H₀: β = 0 (significant)")

for var in vars_to_plot:
    ci_low = ci_95.loc[var, 'lower']
    ci_high = ci_95.loc[var, 'upper']
    includes_zero = (ci_low < 0 < ci_high)

    if includes_zero:
        print(f"  {var}: CI includes 0 → Not significant at 5% level")
    else:
        print(f"  {var}: CI excludes 0 → Significant at 5% level")

---
## 5. Hypothesis Testing

### Types of Hypothesis Tests

#### 1. Individual Coefficient Test (t-test)
**Null hypothesis**: $H_0: \beta_j = 0$ (no effect)  
**Alternative**: $H_1: \beta_j \neq 0$ (has effect)

**Test statistic**: $t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$

**Decision rule**:
- If |t| > critical value → Reject H₀
- If p-value < α → Reject H₀ (α = significance level, usually 0.05)

---

#### 2. Joint Hypothesis Test (F-test)
**Null hypothesis**: $H_0: \beta_1 = \beta_2 = 0$ (all coefficients zero)  
**Alternative**: At least one β ≠ 0

**Test statistic**:
$$
F = \frac{(R^2_{\text{full}} - R^2_{\text{restricted}}) / q}{(1 - R^2_{\text{full}}) / (n-k)}
$$

Where q = # of restrictions, k = # of parameters.
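
As a small worked example of this formula, the hypothetical helper below computes F from the two R² values. When the restricted model contains only an intercept, its R² is 0 and the statistic reduces to the overall F-test. The inputs are made-up numbers.

```python
def f_statistic(r2_full, r2_restricted, q, n, k):
    """F-statistic for q linear restrictions; k counts all parameters in the full model."""
    return ((r2_full - r2_restricted) / q) / ((1 - r2_full) / (n - k))

# Overall significance: restricted model is intercept-only, so its R^2 is 0.
f_overall = f_statistic(r2_full=0.8, r2_restricted=0.0, q=2, n=200, k=3)
print(f"F = {f_overall:.1f}")
```

This R²-based form relies on the classical assumptions (homoskedastic, uncorrelated errors); with robust or clustered covariance matrices a Wald-type test is the usual replacement.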

---

#### 3. Custom Linear Restrictions
Test hypotheses like:
- $H_0: \beta_1 = \beta_2$ (coefficients equal)
- $H_0: \beta_1 + \beta_2 = 1$ (constant returns to scale, in a log-log specification)
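
Restrictions of this kind can all be written as $R\beta = r$ and tested with a Wald statistic, $W = (R\hat{\beta} - r)'[R\hat{V}R']^{-1}(R\hat{\beta} - r)$. The sketch below is one way to compute it with NumPy; `wald_statistic` is a hypothetical helper, and the coefficient vector and covariance matrix are made-up values ordered as intercept, value, capital.

```python
import numpy as np

def wald_statistic(beta_hat, vcov, R, r):
    """Wald statistic for H0: R @ beta = r (asymptotically chi-square, df = rows of R)."""
    R = np.atleast_2d(np.asarray(R, dtype=float))
    diff = R @ np.asarray(beta_hat, dtype=float) - np.atleast_1d(r)
    middle = R @ np.asarray(vcov, dtype=float) @ R.T
    return float(diff @ np.linalg.solve(middle, diff))

# Made-up estimates: [Intercept, value, capital]
beta_demo = np.array([-40.0, 0.11, 0.23])
vcov_demo = np.diag([90.0, 0.0001, 0.0009])

# H0: beta_value - beta_capital = 0
W = wald_statistic(beta_demo, vcov_demo, R=[[0, 1, -1]], r=[0])
print(f"W = {W:.2f}")
```

For a single restriction, W equals the square of the corresponding t-statistic.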

---

In [None]:
# Individual t-tests
print("="*70)
print("INDIVIDUAL COEFFICIENT TESTS (t-tests)")
print("="*70)

# Extract t-stats and p-values
tvalues = results_cluster.tvalues
pvalues = results_cluster.pvalues

# Create test results table
test_results = pd.DataFrame({
    'Coefficient': coefs,
    'Std. Error': se_cluster,
    't-statistic': tvalues,
    'p-value': pvalues
})

# Add significance stars
def add_stars(p):
    if p < 0.01:
        return '***'
    elif p < 0.05:
        return '**'
    elif p < 0.10:
        return '*'
    else:
        return ''

test_results['Sig.'] = pvalues.apply(add_stars)

print("\nHypothesis: H₀: β = 0 vs H₁: β ≠ 0")
print("Significance: *** p<0.01, ** p<0.05, * p<0.10\n")
display(test_results)

# Interpret results
print("\n" + "-"*70)
print("INTERPRETATION")
print("-"*70)

for var in coefs.index:
    coef = coefs[var]
    pval = pvalues[var]
    tval = tvalues[var]
    
    if pval < 0.05:
        print(f"\n{var}:")
        print(f"  β̂ = {coef:.4f}, t = {tval:.3f}, p = {pval:.4f}")
        print(f"  → Reject H₀: β = 0 (significant at 5% level)")
        print(f"  → {var} has a statistically significant effect on investment")
    else:
        print(f"\n{var}:")
        print(f"  β̂ = {coef:.4f}, t = {tval:.3f}, p = {pval:.4f}")
        print(f"  → Cannot reject H₀: β = 0 (not significant)")

In [None]:
# Joint F-test
print("\n" + "="*70)
print("JOINT HYPOTHESIS TEST (F-test)")
print("="*70)

# Overall F-test: H₀: all slope coefficients = 0
# Manual calculation: F = (R²/k) / ((1-R²)/(n-k-1))
# Note: this classical formula assumes homoskedastic, uncorrelated errors
r2 = results_cluster.rsquared
n = results_cluster.nobs
k = len(results_cluster.params) - 1  # Exclude intercept

fstat = (r2 / k) / ((1 - r2) / (n - k - 1))
from scipy.stats import f as f_dist
f_pvalue = f_dist.sf(fstat, k, n - k - 1)  # sf = 1 - cdf, but accurate for tiny p-values

print(f"\nNull hypothesis: H₀: β₁ = β₂ = 0")
print(f"(All slope coefficients jointly equal zero)")

print(f"\nF-statistic: {fstat:.4f}")
print(f"p-value: {f_pvalue:.6f}")

if f_pvalue < 0.01:
    print(f"\n→ Reject H₀ at 1% level (p < 0.01)")
    print(f"→ The model is statistically significant overall")
    print(f"→ At least one predictor has a non-zero effect")
else:
    print(f"\n→ Cannot reject H₀ at 1% level")
    print(f"→ The model is not statistically significant at this level")

In [None]:
# Test custom hypothesis: β₁ = β₂
print("\n" + "="*70)
print("CUSTOM HYPOTHESIS TEST")
print("="*70)

print("\nHypothesis: H₀: β_value = β_capital")
print("(The effects of value and capital are equal)")

# Wald test for linear restriction
# R·β = r, where R = [0, 1, -1], r = 0
# This tests: β_value - β_capital = 0

try:
    # Manual Wald test
    beta_vec = coefs.values
    beta_diff = coefs['value'] - coefs['capital']
    
    # Variance of difference: Var(β₁) + Var(β₂) - 2·Cov(β₁, β₂)
    vcov = results_cluster.cov_params
    var_diff = vcov.loc['value', 'value'] + vcov.loc['capital', 'capital'] - \
               2 * vcov.loc['value', 'capital']
    se_diff = np.sqrt(var_diff)
    
    t_stat_diff = beta_diff / se_diff
    p_value_diff = 2 * (1 - stats.t.cdf(abs(t_stat_diff), results_cluster.df_resid))
    
    print(f"\nβ̂_value - β̂_capital = {beta_diff:.6f}")
    print(f"SE(difference) = {se_diff:.6f}")
    print(f"t-statistic = {t_stat_diff:.4f}")
    print(f"p-value = {p_value_diff:.4f}")
    
    if p_value_diff < 0.05:
        print(f"\n→ Reject H₀ (p < 0.05)")
        print(f"→ The effects of value and capital are significantly different")
    else:
        print(f"\n→ Cannot reject H₀ (p ≥ 0.05)")
        print(f"→ The effects of value and capital are not significantly different")

except Exception as e:
    print(f"\nNote: Custom Wald test calculation issue: {e}")
    print(f"This is for pedagogical illustration")

---
## 6. Model Diagnostics

### What Can Go Wrong?

Even with significant coefficients, the model may have issues:

1. **Heteroskedasticity**: Error variance not constant
2. **Serial correlation**: Errors correlated over time
3. **Non-normality**: Errors not normally distributed
4. **Outliers**: Influential observations distorting results
5. **Multicollinearity**: Predictors highly correlated

### Diagnostic Tools

- **Residual plots**: Check for patterns
- **Q-Q plots**: Check normality
- **VIF**: Check multicollinearity
- **Formal tests**: Breusch-Pagan, Durbin-Watson, etc.
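
As one example of a formal test, the sketch below implements Koenker's studentized version of the Breusch-Pagan test with NumPy and SciPy: squared residuals are regressed on the regressors, and $n \cdot R^2$ from that auxiliary regression is compared with a chi-square distribution. The helper `breusch_pagan_lm` and the synthetic data are assumptions for illustration; whether PanelBox ships its own version of this test is not assumed here.

```python
import numpy as np
from scipy import stats

def breusch_pagan_lm(resid, X):
    """Koenker's studentized Breusch-Pagan test.

    Regresses squared residuals on X (which must include a constant) and
    returns LM = n * R^2 with its chi-square p-value (df = columns - 1).
    """
    e2 = np.asarray(resid, dtype=float) ** 2
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    ss_res = np.sum((e2 - X @ gamma) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    lm = len(e2) * (1 - ss_res / ss_tot)
    return float(lm), float(stats.chi2.sf(lm, X.shape[1] - 1))

# Synthetic data whose error spread grows with x (heteroskedastic by construction)
rng = np.random.default_rng(4)
n = 500
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=x, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
lm_stat, lm_pvalue = breusch_pagan_lm(y - X @ beta_hat, X)
print(f"LM = {lm_stat:.2f}, p-value = {lm_pvalue:.4g}")
```

A small p-value is evidence against constant error variance, which argues for robust or clustered standard errors.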

---

In [None]:
# Residual diagnostics
print("="*70)
print("RESIDUAL DIAGNOSTICS")
print("="*70)

residuals = results_cluster.resid
fitted = results_cluster.fittedvalues

# Summary statistics
print("\nResidual summary:")
print(f"  Mean: {residuals.mean():.6f} (should be ≈ 0)")
print(f"  Std: {residuals.std():.4f}")
print(f"  Min: {residuals.min():.4f}")
print(f"  Max: {residuals.max():.4f}")

# Create diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Residuals vs Fitted
axes[0, 0].scatter(fitted, residuals, alpha=0.6, edgecolors='k', linewidth=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 0].set_xlabel('Fitted Values', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Residuals', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Residuals vs Fitted', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# 2. Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# 3. Scale-Location (sqrt of standardized residuals vs fitted)
standardized_resid = residuals / residuals.std()
axes[1, 0].scatter(fitted, np.sqrt(np.abs(standardized_resid)),
                   alpha=0.6, edgecolors='k', linewidth=0.5)
axes[1, 0].set_xlabel('Fitted Values', fontsize=11, fontweight='bold')
axes[1, 0].set_ylabel('√|Standardized Residuals|', fontsize=11, fontweight='bold')
axes[1, 0].set_title('Scale-Location Plot', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# 4. Histogram of residuals
axes[1, 1].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=0, color='r', linestyle='--', lw=2)
axes[1, 1].set_xlabel('Residuals', fontsize=11, fontweight='bold')
axes[1, 1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[1, 1].set_title('Histogram of Residuals', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nDiagnostic Interpretation:")
print("✓ Residuals vs Fitted: No clear pattern → Good")
print("✓ Q-Q Plot: Points near line → Normality")
print("✓ Scale-Location: Random scatter → Homoskedasticity")
print("✓ Histogram: Bell-shaped → Normality")

In [None]:
# Variance Inflation Factor (VIF)
print("\n" + "="*70)
print("MULTICOLLINEARITY CHECK (VIF)")
print("="*70)

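# Calculate VIF for each predictor
# VIF = 1 / (1 - R²_j), where R²_j is from regressing X_j on other X's

from patsy import dmatrix

# Create design matrix from the right-hand side of the formula.
# dmatrix() expects a one-sided formula, and the intercept is kept because
# variance_inflation_factor() does not add a constant to the auxiliary regressions.
rhs = formula.split("~", 1)[1].strip()
X = dmatrix(rhs, data=data, return_type='dataframe')

# Calculate VIF (the intercept column is used in the regressions but not reported)
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = [c for c in X.columns if c != 'Intercept']
vif_data = pd.DataFrame()
vif_data["Variable"] = predictors
vif_data["VIF"] = [variance_inflation_factor(X.values, X.columns.get_loc(c)) for c in predictors]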

print("\nVariance Inflation Factors:")
display(vif_data)

print("\nInterpretation:")
print("  VIF = 1: No correlation with other predictors")
print("  VIF < 5: Low multicollinearity (acceptable)")
print("  VIF 5-10: Moderate multicollinearity (caution)")
print("  VIF > 10: High multicollinearity (problematic)")

for idx, row in vif_data.iterrows():
    var = row['Variable']
    vif = row['VIF']
    if vif < 5:
        print(f"  {var}: VIF = {vif:.2f} → Low multicollinearity")
    elif vif < 10:
        print(f"  {var}: VIF = {vif:.2f} → Moderate multicollinearity")
    else:
        print(f"  {var}: VIF = {vif:.2f} → High multicollinearity!")

---
## 7. Exporting Results

### Why Export?

- **Papers**: LaTeX tables for publications
- **Reports**: Markdown/HTML for reproducible documents
- **Collaboration**: JSON/CSV for data sharing
- **Presentations**: Formatted tables for slides

### PanelBox Export Options

```python
results.summary.as_latex()   # LaTeX table
results.summary.as_text()    # Plain text
results.summary.as_html()    # HTML table
# Manual JSON/CSV exports also available
```
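
For Markdown reports, one option that does not depend on any particular PanelBox method is to format the coefficient table by hand. The helper below, `to_markdown_table`, is hypothetical, uses only the standard library, and is shown with made-up numbers.

```python
def to_markdown_table(rows, headers, float_fmt="{:.4f}"):
    """Render rows as a GitHub-style Markdown table (hypothetical helper)."""
    def fmt(cell):
        return float_fmt.format(cell) if isinstance(cell, float) else str(cell)

    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(["---"] * len(headers)) + "|",
    ]
    lines += ["| " + " | ".join(fmt(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

# Made-up numbers, not Grunfeld estimates
rows = [("value", 0.11, 0.01), ("capital", 0.23, 0.03)]
md_table = to_markdown_table(rows, ["Variable", "Coefficient", "Std. Error"])
print(md_table)
```

The resulting string can be written to a `.md` file or passed to `IPython.display.Markdown` for rendering in a notebook.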

---

In [None]:
# Export results to different formats
print("="*70)
print("EXPORTING RESULTS")
print("="*70)

# Create output directory
output_dir = '/home/guhaase/projetos/panelbox/examples/tutorials/01_fundamentals/output'
import os
os.makedirs(output_dir, exist_ok=True)

# 1. LaTeX
try:
    latex_output = results_cluster.summary.as_latex()
    print("\n1. LaTeX Output (for academic papers):")
    print("-" * 70)
    print(latex_output[:500] + "\n... (truncated)")
    
    with open(f'{output_dir}/pooled_ols_results.tex', 'w') as f:
        f.write(latex_output)
    print(f"✓ Saved to: {output_dir}/pooled_ols_results.tex")
except Exception as e:
    print(f"\nNote: LaTeX export not available - {e}")

# 2. Plain text
print("\n2. Plain Text Output:")
print("-" * 70)
print(results_cluster.summary)

# 3. HTML
try:
    html_output = results_cluster.summary.as_html()
    with open(f'{output_dir}/pooled_ols_results.html', 'w') as f:
        f.write(html_output)
    print(f"\n✓ Saved to: {output_dir}/pooled_ols_results.html")
    
    # Display HTML in notebook
    print("\n3. HTML Output (rendered):")
    display(HTML(html_output))
except Exception as e:
    print(f"\nNote: HTML export not available - {e}")

# 4. JSON (all results)
results_dict = {
    'coefficients': coefs.to_dict(),
    'std_errors': se_cluster.to_dict(),
    'tvalues': tvalues.to_dict(),
    'pvalues': pvalues.to_dict(),
    'rsquared': float(r2),
    'rsquared_adj': float(r2_adj),
    'nobs': int(nobs)
}

import json
with open(f'{output_dir}/pooled_ols_results.json', 'w') as f:
    json.dump(results_dict, f, indent=2)
print(f"\n✓ Saved to: {output_dir}/pooled_ols_results.json")

# 5. CSV (coefficient table)
results_table = pd.DataFrame({
    'Coefficient': coefs,
    'Std_Error': se_cluster,
    't_statistic': tvalues,
    'p_value': pvalues
})
results_table.to_csv(f'{output_dir}/pooled_ols_results.csv')
print(f"✓ Saved to: {output_dir}/pooled_ols_results.csv")

print("\n" + "="*70)
print("All results exported successfully!")
print("="*70)

---
## 8. Practical Exercises

Solutions available in `/examples/solutions/01_fundamentals/03_estimation_solutions.ipynb`.

### Exercise 1: Log-Log Model

**Task**: Estimate a log-log model to obtain elasticities:
$$
\log(\text{Investment}) = \beta_0 + \beta_1 \log(\text{Value}) + \beta_2 \log(\text{Capital}) + \varepsilon
$$

1. Specify the formula using `np.log()`
2. Estimate with clustered SEs
3. Interpret β₁ as an elasticity
4. Is the elasticity significantly different from 1?

---

### Exercise 2: Compare SE Types

**Task**: Re-estimate the original model with all three SE types:
1. Classical (IID)
2. Robust (heteroskedasticity-consistent)
3. Clustered (by entity)

Create a table comparing:
- Coefficient estimates (should be identical)
- Standard errors (should differ)
- t-statistics (should differ)
- p-values (should differ)

Which variables remain significant under all SE types?

---

### Exercise 3: Model with Interaction

**Task**: Add an interaction between value and a post-war dummy:
$$
\text{Investment} = \beta_0 + \beta_1 \text{Value} + \beta_2 \text{Post1945} + \beta_3 (\text{Value} \times \text{Post1945}) + \beta_4 \text{Capital} + \varepsilon
$$

1. Create `post_1945` dummy (year > 1945)
2. Estimate the model
3. Interpret β₃: Did the effect of value change after 1945?
4. Test H₀: β₃ = 0

---

### Exercise 4: Diagnostics

**Task**: Check if residuals exhibit heteroskedasticity.
1. Plot residuals vs fitted values (visual check)
2. Perform Breusch-Pagan test (if available in PanelBox)
3. If heteroskedasticity detected, which SE type should you use?

---

In [None]:
# Exercise 1: Your code here
# -------------------------



In [None]:
# Exercise 2: Your code here
# -------------------------



In [None]:
# Exercise 3: Your code here
# -------------------------



In [None]:
# Exercise 4: Your code here
# -------------------------



---
## 9. Summary and Next Steps

### What You Learned

In this tutorial, you mastered:

✅ **Estimating models**: Using `PooledOLS().fit()`  
✅ **Interpreting coefficients**: Economic meaning of β̂  
✅ **Standard errors**: Classical, robust, clustered  
✅ **Statistical inference**: t-tests, p-values, confidence intervals  
✅ **Hypothesis testing**: Individual and joint tests  
✅ **Model diagnostics**: Residual plots, VIF, normality checks  
✅ **Exporting results**: LaTeX, HTML, JSON, CSV

---

### Key Concepts

| Concept | Formula | Interpretation |
|---------|---------|----------------|
| Coefficient | β̂ | Estimated effect of X on Y |
| Standard Error | SE(β̂) | Uncertainty in β̂ |
| t-statistic | t = β̂ / SE(β̂) | Distance from zero (in SEs) |
| p-value | P(\|t\| > observed \| H₀) | Probability under null |
| Confidence Interval | [β̂ - 1.96·SE, β̂ + 1.96·SE] | 95% range for true β |
| R² | 1 - SS_res / SS_tot | Fraction of variance explained |

---

### Best Practices Learned

1. **Always use clustered SEs for panel data** (accounts for within-entity correlation)
2. **Report robust SEs at minimum** (protects against heteroskedasticity)
3. **Check diagnostics** (residual plots, VIF) before trusting results
4. **Interpret economically**, not just statistically (β̂ = 0.05 may be significant, but is it meaningful?)
5. **Export results** for reproducibility

---

### Next Steps

**Tutorial 04: Spatial Fundamentals** (Optional)
- Learn about spatial weight matrices
- Visualize spatial connections
- Prepare for spatial panel models

**Or skip to Module 2: Classical Estimators**
- Fixed Effects (FE)
- Random Effects (RE)
- First Differences (FD)
- Between estimator

**Recommended path**: Module 2 (Classical Estimators) next

---

### Further Reading

- **Wooldridge (2010)**: Chapter 10 (Basic Linear Unobserved Effects Panel Data Models)
- **Cameron & Trivedi (2005)**: Chapter 21 (Linear Panel Models: Basics)
- **Angrist & Pischke (2009)**: "Mostly Harmless Econometrics" (practical inference)

---

In [None]:
# Session information
print("="*70)
print("SESSION INFORMATION")
print("="*70)
print(f"\nNotebook: 03_estimation_interpretation.ipynb")
print(f"Version: {__version__}")
print(f"Last updated: {__last_updated__}")
print(f"\nLibrary versions:")
print(f"  PanelBox: {pb.__version__}")
print(f"  NumPy: {np.__version__}")
print(f"  Pandas: {pd.__version__}")
import scipy
print(f"  SciPy: {scipy.__version__}")
print("\nTutorial completed successfully! 🎉")
print("You are now ready for advanced panel models!")