# Solutions: Tutorial 03 - Estimation and Results Interpretation

**Series**: PanelBox - Fundamentals (Solutions)
**Level**: Intermediate
**Tutorial**: 03_estimation_interpretation.ipynb

This notebook contains complete solutions to the exercises in Tutorial 03.

---

## Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display, HTML

# PanelBox library
import sys
sys.path.append('/home/guhaase/projetos/panelbox')
import panelbox as pb
from panelbox.core.panel_data import PanelData
from panelbox.models import PooledOLS

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print(f"PanelBox version: {pb.__version__}")
print("Setup complete!")

## Load Data

In [None]:
# Load Grunfeld dataset
import os
data_path = '/home/guhaase/projetos/panelbox/examples/datasets/grunfeld.csv'
data = pd.read_csv(data_path)

print(f"Dataset loaded: {data.shape[0]} observations")
display(data.head())

---

## Exercise 1: Log-Log Model

**Task**: Estimate a log-log model to obtain elasticities:
$$
\log(\text{Investment}) = \beta_0 + \beta_1 \log(\text{Value}) + \beta_2 \log(\text{Capital}) + \varepsilon
$$
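
Why the slope coefficients can be read as elasticities follows from differentiating the model. A short derivation, holding capital fixed and using the same symbols as the equation above:

$$
\beta_1 = \frac{\partial \log(\text{Investment})}{\partial \log(\text{Value})} = \frac{\partial \text{Investment} / \text{Investment}}{\partial \text{Value} / \text{Value}} \approx \frac{\%\Delta \text{Investment}}{\%\Delta \text{Value}}
$$

The last step is an approximation that is accurate for small percentage changes. The same argument applies to $\beta_2$ and Capital.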

In [None]:
print("="*70)
print("SOLUTION 1: LOG-LOG MODEL FOR ELASTICITIES")
print("="*70)

# Step 1: Specify formula
formula_log = "np.log(invest) ~ np.log(value) + np.log(capital)"
print(f"\nFormula: {formula_log}")
print(f"Model: log(invest) = β₀ + β₁·log(value) + β₂·log(capital) + ε")

In [None]:
# Step 2: Estimate with clustered standard errors
model_log = PooledOLS(formula_log, data=data)
results_log = model_log.fit(cov_type='clustered', cluster_entity=True)

print("\nModel estimated successfully!")
print(f"Observations: {results_log.nobs}")
print(f"R²: {results_log.rsquared:.4f}")

print("\n" + "="*70)
print("ESTIMATION RESULTS (Clustered SE)")
print("="*70)
print(results_log.summary())

In [None]:
# Step 3: Interpret β₁ as elasticity
print("\n" + "-"*70)
print("ELASTICITY INTERPRETATION")
print("-"*70)

beta_value = results_log.params['np.log(value)']
beta_capital = results_log.params['np.log(capital)']
se_value = results_log.std_errors['np.log(value)']
pval_value = results_log.pvalues['np.log(value)']

print(f"\nβ₁ (value elasticity) = {beta_value:.4f}")
print(f"Standard error: {se_value:.4f}")
print(f"p-value: {pval_value:.4f}")

print(f"\nInterpretation:")
print(f"  A 1% increase in firm value is associated with a {beta_value:.4f}% increase in investment")
print(f"  (holding capital constant)")

if pval_value < 0.01:
    print(f"  ✓ Highly significant (p < 0.01)")
elif pval_value < 0.05:
    print(f"  ✓ Significant (p < 0.05)")
else:
    print(f"  ✗ Not significant (p ≥ 0.05)")

print(f"\nβ₂ (capital elasticity) = {beta_capital:.4f}")
print(f"Interpretation:")
print(f"  A 1% increase in capital stock is associated with a {beta_capital:.4f}% change in investment")

# Example calculation
print(f"\nExample:")
print(f"  If firm value increases by 10%:")
print(f"    → Investment increases by approximately {10 * beta_value:.2f}%")
print(f"  If capital increases by 10%:")
print(f"    → Investment changes by approximately {10 * beta_capital:.2f}%")
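
The "10 × β" figures in the example are a first-order approximation. As a small sketch of how far it can drift for larger changes, the helper below (`exact_pct_change`, a name introduced here for illustration) compares it with the exact change implied by a log-log model, $(1 + g)^{\beta} - 1$. The elasticities are made-up values, not estimates from the Grunfeld data.

```python
def exact_pct_change(elasticity, pct_change_x):
    """Exact % change in y implied by a log-log coefficient.

    In a log-log model, y scales by (1 + g)**elasticity when x grows by g.
    """
    growth = pct_change_x / 100.0
    return 100.0 * ((1.0 + growth) ** elasticity - 1.0)


# Illustrative elasticities only (hypothetical, not fitted values)
for beta in (0.5, 1.0, 1.5):
    approx = beta * 10                      # first-order approximation
    exact = exact_pct_change(beta, 10)      # exact implied change
    print(f"beta = {beta:.1f}: approx = {approx:.2f}%, exact = {exact:.2f}%")
```

The two agree exactly at unit elasticity; below 1 the approximation overstates the change, above 1 it understates it.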

In [None]:
# Step 4: Test if elasticity is significantly different from 1
print("\n" + "-"*70)
print("HYPOTHESIS TEST: H₀: β₁ = 1 (Unit Elasticity)")
print("-"*70)

# Manual t-test for β₁ = 1
null_value = 1.0
t_stat = (beta_value - null_value) / se_value
df = results_log.df_resid
# Survival function avoids the precision loss of 1 - cdf for large |t|
p_value = 2 * stats.t.sf(abs(t_stat), df)

print(f"\nNull hypothesis: β₁ = 1 (unit elasticity)")
print(f"Alternative: β₁ ≠ 1")
print(f"\nTest statistic: t = (β̂₁ - 1) / SE(β̂₁)")
print(f"  β̂₁ = {beta_value:.4f}")
print(f"  SE(β̂₁) = {se_value:.4f}")
print(f"  t = ({beta_value:.4f} - 1.0) / {se_value:.4f} = {t_stat:.4f}")
print(f"\np-value: {p_value:.4f}")

if p_value < 0.05:
    print(f"\n→ Reject H₀ (p < 0.05)")
    print(f"→ The value elasticity is significantly different from 1")
    if beta_value < 1:
        print(f"→ Elasticity < 1: Investment is INELASTIC with respect to value")
        print(f"  (% change in investment < % change in value)")
    else:
        print(f"→ Elasticity > 1: Investment is ELASTIC with respect to value")
        print(f"  (% change in investment > % change in value)")
else:
    print(f"\n→ Cannot reject H₀ (p ≥ 0.05)")
    print(f"→ The value elasticity is not significantly different from 1")
    print(f"→ Consistent with unit elasticity (proportional response)")
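
The manual test generalizes to any null value. The sketch below wraps it in a hypothetical helper, `t_test_against`, and runs it on made-up numbers (not Grunfeld estimates). It also shows how the p-value moves if the reference distribution uses G − 1 degrees of freedom (G = number of clusters), a more conservative convention that some software applies with clustered standard errors. Whether PanelBox does so is not assumed here.

```python
from scipy import stats


def t_test_against(beta_hat, se, null_value, df):
    """Two-sided t-test of H0: beta = null_value."""
    t_stat = (beta_hat - null_value) / se
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, p_value


# Illustrative inputs only (hypothetical estimate and standard error)
beta_hat, se = 0.80, 0.10

# Residual degrees of freedom, e.g. 200 observations minus 3 parameters
t_resid, p_resid = t_test_against(beta_hat, se, 1.0, df=197)

# G - 1 degrees of freedom, assuming 10 clusters
t_clust, p_clust = t_test_against(beta_hat, se, 1.0, df=9)

print(f"t = {t_resid:.3f}")
print(f"p-value with df = 197: {p_resid:.4f}")
print(f"p-value with df = 9:   {p_clust:.4f}")
```

The t-statistic is the same in both cases; only the reference distribution changes, and with few clusters that change can move a result across the 5% threshold.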

In [None]:
# Visualize the log-log relationship
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Log(invest) vs log(value)
axes[0].scatter(np.log(data['value']), np.log(data['invest']), alpha=0.5, s=40)
axes[0].set_xlabel('log(Value)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('log(Investment)', fontsize=12, fontweight='bold')
axes[0].set_title(f'Log-Log Relationship: Value\n(Slope = Elasticity = {beta_value:.3f})', 
                 fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Fitted vs actual (in logs)
fitted_log = results_log.fittedvalues
actual_log = np.log(data['invest'])
axes[1].scatter(fitted_log, actual_log, alpha=0.5, s=40)
axes[1].plot([actual_log.min(), actual_log.max()],
            [actual_log.min(), actual_log.max()],
            'r--', lw=2, label='Perfect fit')
axes[1].set_xlabel('Fitted log(Investment)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Actual log(Investment)', fontsize=12, fontweight='bold')
axes[1].set_title(f'Model Fit (R² = {results_log.rsquared:.3f})', 
                 fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Exercise 2: Compare SE Types

**Task**: Re-estimate the original model with all three SE types and compare results.

In [None]:
print("="*70)
print("SOLUTION 2: COMPARING STANDARD ERROR TYPES")
print("="*70)

# Original model
formula = "invest ~ value + capital"
model = PooledOLS(formula, data=data)

# Estimate with different SE types
results_iid = model.fit(cov_type='unadjusted')
results_robust = model.fit(cov_type='robust')
results_cluster = model.fit(cov_type='clustered', cluster_entity=True)

print(f"\nModel: {formula}")
print(f"Estimated with 3 different standard error types")

In [None]:
# Create comparison table
print("\n" + "="*70)
print("COEFFICIENT ESTIMATES (SHOULD BE IDENTICAL)")
print("="*70)

coef_comparison = pd.DataFrame({
    'Classical': results_iid.params,
    'Robust': results_robust.params,
    'Clustered': results_cluster.params
})

print("\nCoefficient estimates:")
display(coef_comparison)

print("\nVerification: All coefficients identical?")
for var in coef_comparison.index:
    # np.allclose needs two arguments: compare each column with the classical estimate
    all_close = np.allclose(coef_comparison.loc[var].to_numpy(),
                            coef_comparison.loc[var, 'Classical'])
    print(f"  {var}: {all_close}")

In [None]:
# Compare standard errors
print("\n" + "="*70)
print("STANDARD ERRORS (SHOULD DIFFER)")
print("="*70)

se_comparison = pd.DataFrame({
    'Classical': results_iid.std_errors,
    'Robust': results_robust.std_errors,
    'Clustered': results_cluster.std_errors
})

print("\nStandard errors:")
display(se_comparison)

# Calculate percentage differences
print("\n" + "-"*70)
print("PERCENTAGE DIFFERENCES (relative to Classical)")
print("-"*70)

pct_diff = pd.DataFrame({
    'Robust vs Classical': 100 * (se_comparison['Robust'] / se_comparison['Classical'] - 1),
    'Clustered vs Classical': 100 * (se_comparison['Clustered'] / se_comparison['Classical'] - 1)
})

display(pct_diff)

print("\nInterpretation:")
print("  Positive % → SE is larger than classical (more conservative)")
print("  Negative % → SE is smaller than classical (less common)")
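
Why the coefficients coincide while the standard errors differ can be seen from the covariance formulas: all three share the same OLS point estimate and differ only in the "meat" of the sandwich. The following is a minimal NumPy sketch on synthetic data (not the Grunfeld panel), without the finite-sample corrections that libraries usually apply, so its numbers would not match PanelBox output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 10, 20                      # synthetic panel, not Grunfeld
entity = np.repeat(np.arange(n_entities), n_periods)
n = entity.size

# Entity-level components induce correlation within each entity
x = rng.normal(size=n) + rng.normal(size=n_entities)[entity]
u = rng.normal(size=n) + rng.normal(size=n_entities)[entity]
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y          # the same point estimate for every SE type
e = y - X @ beta

# Classical: s^2 (X'X)^-1
V_classical = (e @ e) / (n - k) * XtX_inv

# Robust (HC0): (X'X)^-1 [sum_i e_i^2 x_i x_i'] (X'X)^-1
V_robust = XtX_inv @ ((X * e[:, None] ** 2).T @ X) @ XtX_inv

# Clustered: (X'X)^-1 [sum_g (X_g'e_g)(X_g'e_g)'] (X'X)^-1
scores = np.zeros((n_entities, k))
np.add.at(scores, entity, X * e[:, None])   # sum x_i * e_i within each entity
V_cluster = XtX_inv @ (scores.T @ scores) @ XtX_inv

se = {name: np.sqrt(np.diag(V)) for name, V in
      [("classical", V_classical), ("robust", V_robust), ("clustered", V_cluster)]}

print("beta:", beta)
for name, values in se.items():
    print(f"{name:>9}: {values}")
```

Only the middle matrix changes between the robust and clustered versions: the clustered one sums the scores within each entity before taking outer products, which is what lets errors be correlated inside an entity.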

In [None]:
# Compare t-statistics
print("\n" + "="*70)
print("T-STATISTICS")
print("="*70)

t_comparison = pd.DataFrame({
    'Classical': results_iid.tstats,
    'Robust': results_robust.tstats,
    'Clustered': results_cluster.tstats
})

print("\nt-statistics:")
display(t_comparison)

print("\nNote: t = β̂ / SE, so larger SE → smaller |t|")

In [None]:
# Compare p-values
print("\n" + "="*70)
print("P-VALUES AND SIGNIFICANCE")
print("="*70)

p_comparison = pd.DataFrame({
    'Classical': results_iid.pvalues,
    'Robust': results_robust.pvalues,
    'Clustered': results_cluster.pvalues
})

print("\np-values:")
display(p_comparison)

# Check significance at 5% level
print("\n" + "-"*70)
print("SIGNIFICANCE AT 5% LEVEL (p < 0.05)")
print("-"*70)

sig_comparison = pd.DataFrame({
    'Classical': (results_iid.pvalues < 0.05),
    'Robust': (results_robust.pvalues < 0.05),
    'Clustered': (results_cluster.pvalues < 0.05)
})

display(sig_comparison)

print("\nVariables significant under ALL SE types:")
for var in sig_comparison.index:
    if sig_comparison.loc[var].all():
        print(f"  ✓ {var}")

In [None]:
# Visualize SE comparison
fig, ax = plt.subplots(figsize=(10, 6))

vars_to_plot = [v for v in se_comparison.index if v != 'Intercept']
x = np.arange(len(vars_to_plot))
width = 0.25

ax.bar(x - width, se_comparison.loc[vars_to_plot, 'Classical'], 
       width, label='Classical', alpha=0.8)
ax.bar(x, se_comparison.loc[vars_to_plot, 'Robust'], 
       width, label='Robust', alpha=0.8)
ax.bar(x + width, se_comparison.loc[vars_to_plot, 'Clustered'], 
       width, label='Clustered', alpha=0.8)

ax.set_xlabel('Variable', fontsize=12, fontweight='bold')
ax.set_ylabel('Standard Error', fontsize=12, fontweight='bold')
ax.set_title('Standard Errors by Type', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(vars_to_plot)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nConclusion: Use CLUSTERED SEs for panel data!")

---

## Exercise 3: Model with Interaction

**Task**: Add interaction between value and post-war dummy.

In [None]:
print("="*70)
print("SOLUTION 3: INTERACTION WITH POST-WAR DUMMY")
print("="*70)

# Step 1: Create post_1945 dummy
data['post_1945'] = (data['year'] > 1945).astype(int)
print(f"\nCreated post_1945 dummy variable")
print(f"Distribution: {data['post_1945'].value_counts().to_dict()}")

In [None]:
# Step 2: Estimate model with interaction
formula_int = "invest ~ value * post_1945 + capital"
model_int = PooledOLS(formula_int, data=data)
results_int = model_int.fit(cov_type='clustered', cluster_entity=True)

print(f"\nFormula: {formula_int}")
print(f"Expands to: invest ~ value + post_1945 + value:post_1945 + capital")

print("\n" + "="*70)
print("ESTIMATION RESULTS")
print("="*70)
print(results_int.summary())

In [None]:
# Step 3: Interpret β₃ (interaction coefficient)
print("\n" + "-"*70)
print("INTERPRETATION OF INTERACTION COEFFICIENT")
print("-"*70)

beta_int = results_int.params['value:post_1945']
se_int = results_int.std_errors['value:post_1945']
pval_int = results_int.pvalues['value:post_1945']
beta_value = results_int.params['value']

print(f"\nCoefficients:")
print(f"  β₁ (value): {beta_value:.6f}")
print(f"  β₃ (interaction): {beta_int:.6f}")

print(f"\nMarginal effect of value on investment:")
print(f"\nThrough 1945 (post_1945 = 0, i.e. year ≤ 1945):")
print(f"  ∂invest/∂value = β₁ = {beta_value:.6f}")

print(f"\nAfter 1945 (post_1945 = 1):")
print(f"  ∂invest/∂value = β₁ + β₃ = {beta_value:.6f} + {beta_int:.6f} = {beta_value + beta_int:.6f}")

print(f"\nInterpretation of β₃:")
if beta_int > 0:
    print(f"  β₃ = {beta_int:.6f} > 0")
    print(f"  → The estimated effect of value on investment is LARGER after 1945")
    print(f"  → Consistent with investment responding more strongly to market value post-war")
elif beta_int < 0:
    print(f"  β₃ = {beta_int:.6f} < 0")
    print(f"  → The estimated effect of value on investment is SMALLER after 1945")
    print(f"  → Consistent with investment responding less strongly to market value post-war")
else:
    print(f"  β₃ ≈ 0")
    print(f"  → No change in the estimated effect of value after 1945")
print(f"  (Sign only; Step 4 tests whether the difference is statistically significant)")
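
The post-1945 slope is the sum β₁ + β₃, so its standard error depends on the covariance between the two estimates, not only on their individual standard errors. Below is a minimal sketch with a made-up 2×2 covariance block. How to extract the covariance matrix from a PanelBox results object is not shown in this notebook, so no attribute name is assumed here, and `linear_combination_se` is a hypothetical helper.

```python
import numpy as np


def linear_combination_se(cov, weights):
    """Standard error of w'b given Cov(b): sqrt(w' Cov w)."""
    w = np.asarray(weights, dtype=float)
    return float(np.sqrt(w @ np.asarray(cov, dtype=float) @ w))


# Hypothetical covariance of (β₁, β₃); order: value, value:post_1945
cov_b1_b3 = np.array([[0.04, -0.01],
                      [-0.01, 0.09]])

# SE of β₁ + β₃: Var(β₁) + Var(β₃) + 2·Cov(β₁, β₃) under the square root
se_sum = linear_combination_se(cov_b1_b3, [1.0, 1.0])

# Ignoring the covariance term gives a different (here larger) answer
naive_se = float(np.sqrt(cov_b1_b3[0, 0] + cov_b1_b3[1, 1]))

print(f"SE(β₁ + β₃) with covariance:    {se_sum:.4f}")
print(f"SE(β₁ + β₃) ignoring covariance: {naive_se:.4f}")
```

Main-effect and interaction estimates are often negatively correlated, in which case ignoring the covariance overstates the uncertainty of the post-1945 slope.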

In [None]:
# Step 4: Test H₀: β₃ = 0
print("\n" + "-"*70)
print("HYPOTHESIS TEST: H₀: β₃ = 0 (No Structural Change)")
print("-"*70)

t_stat = beta_int / se_int

print(f"\nNull hypothesis: β₃ = 0 (no change in value effect after 1945)")
print(f"Alternative: β₃ ≠ 0 (structural change occurred)")

print(f"\nTest results:")
print(f"  β̂₃ = {beta_int:.6f}")
print(f"  SE(β̂₃) = {se_int:.6f}")
print(f"  t = {t_stat:.4f}")
print(f"  p-value = {pval_int:.4f}")

if pval_int < 0.01:
    print(f"\n→ Reject H₀ at 1% level (p < 0.01)")
    print(f"→ Strong evidence of structural change after 1945")
elif pval_int < 0.05:
    print(f"\n→ Reject H₀ at 5% level (p < 0.05)")
    print(f"→ Significant structural change after 1945")
elif pval_int < 0.10:
    print(f"\n→ Weak evidence (p < 0.10)")
    print(f"→ Marginal structural change after 1945")
else:
    print(f"\n→ Cannot reject H₀ (p ≥ 0.10)")
    print(f"→ No significant structural change after 1945")

In [None]:
# Visualize interaction
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Separate plots for pre/post 1945
pre_data = data[data['post_1945'] == 0]
post_data = data[data['post_1945'] == 1]

axes[0].scatter(pre_data['value'], pre_data['invest'], alpha=0.6, s=40, label='1945 and earlier')
axes[0].set_xlabel('Value', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Investment', fontsize=12, fontweight='bold')
axes[0].set_title(f'1945 and earlier (Slope = {beta_value:.4f})', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].scatter(post_data['value'], post_data['invest'], alpha=0.6, s=40, 
               label='Post-1945', color='red')
axes[1].set_xlabel('Value', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Investment', fontsize=12, fontweight='bold')
axes[1].set_title(f'Post-1945 (Slope = {beta_value + beta_int:.4f})', 
                 fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Exercise 4: Diagnostics

**Task**: Check for heteroskedasticity in residuals.

In [None]:
print("="*70)
print("SOLUTION 4: HETEROSKEDASTICITY DIAGNOSTICS")
print("="*70)

# Use original model
formula = "invest ~ value + capital"
model = PooledOLS(formula, data=data)
results = model.fit(cov_type='unadjusted')  # Classical SE for diagnostics

residuals = results.resids
fitted = results.fittedvalues

print(f"\nModel: {formula}")
print(f"Observations: {len(residuals)}")

In [None]:
# Step 1: Visual check - Residuals vs Fitted
print("\n" + "-"*70)
print("VISUAL INSPECTION")
print("-"*70)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Residuals vs Fitted
axes[0, 0].scatter(fitted, residuals, alpha=0.6, s=40)
axes[0, 0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 0].set_xlabel('Fitted Values', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Residuals', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Residuals vs Fitted Values', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Scale-Location (sqrt of |standardized residuals| vs fitted)
standardized_resid = residuals / residuals.std()
axes[0, 1].scatter(fitted, np.sqrt(np.abs(standardized_resid)), alpha=0.6, s=40)
axes[0, 1].set_xlabel('Fitted Values', fontsize=11, fontweight='bold')
axes[0, 1].set_ylabel('√|Standardized Residuals|', fontsize=11, fontweight='bold')
axes[0, 1].set_title('Scale-Location Plot', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Residuals vs Value
axes[1, 0].scatter(data['value'], residuals, alpha=0.6, s=40)
axes[1, 0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1, 0].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[1, 0].set_ylabel('Residuals', fontsize=11, fontweight='bold')
axes[1, 0].set_title('Residuals vs Value', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Residuals vs Capital
axes[1, 1].scatter(data['capital'], residuals, alpha=0.6, s=40)
axes[1, 1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1, 1].set_xlabel('Capital', fontsize=11, fontweight='bold')
axes[1, 1].set_ylabel('Residuals', fontsize=11, fontweight='bold')
axes[1, 1].set_title('Residuals vs Capital', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nVisual assessment:")
print("  - Look for funnel/cone shape → indicates heteroskedasticity")
print("  - Constant scatter around zero → homoskedasticity")
print("  - Increasing variance with fitted values → heteroskedasticity")

In [None]:
# Step 2: Breusch-Pagan test (manual implementation)
print("\n" + "-"*70)
print("BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY")
print("-"*70)

# Regress squared residuals on predictors
from patsy import dmatrix

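# Get design matrix of the regressors only: dmatrix expects a right-hand-side
# formula (no "invest ~"), and LinearRegression adds its own intercept
X = dmatrix("value + capital - 1", data=data, return_type='dataframe')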
resid_squared = residuals ** 2

# Auxiliary regression: e² ~ X
from sklearn.linear_model import LinearRegression
aux_model = LinearRegression()
aux_model.fit(X, resid_squared)
r2_aux = aux_model.score(X, resid_squared)

# BP test statistic: n * R²_aux ~ χ²(k)
n = len(residuals)
k = X.shape[1]
bp_stat = n * r2_aux
bp_pvalue = stats.chi2.sf(bp_stat, k)  # upper-tail probability, stable for large statistics

print(f"\nBreusch-Pagan test:")
print(f"  H₀: Homoskedasticity (constant variance)")
print(f"  H₁: Heteroskedasticity (non-constant variance)")
print(f"\nTest statistic: LM = n × R²_aux = {n} × {r2_aux:.4f} = {bp_stat:.4f}")
print(f"Degrees of freedom: {k}")
print(f"p-value: {bp_pvalue:.4f}")

if bp_pvalue < 0.05:
    print(f"\n→ Reject H₀ (p < 0.05)")
    print(f"→ Evidence of HETEROSKEDASTICITY detected")
else:
    print(f"\n→ Cannot reject H₀ (p ≥ 0.05)")
    print(f"→ No strong evidence of heteroskedasticity")
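
As a sanity check on the n × R² statistic used above, the following sketch applies the same auxiliary regression to synthetic data where the answer is known: one series with constant error variance and one whose error standard deviation grows with the regressor. The helper names (`ols_residuals`, `breusch_pagan_lm`) and all data are illustrative assumptions, not part of PanelBox.

```python
import numpy as np
from scipy import stats


def ols_residuals(y, X):
    """Residuals from an OLS fit of y on X (X already contains a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def breusch_pagan_lm(resid, regressors):
    """LM = n * R² from regressing squared residuals on [1, regressors]."""
    resid = np.asarray(resid, dtype=float)
    Z = np.column_stack([np.ones(resid.size), np.asarray(regressors, dtype=float)])
    e2 = resid ** 2
    gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    fitted = Z @ gamma
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = resid.size * r2
    dof = Z.shape[1] - 1                      # regressors excluding the constant
    return lm, dof, stats.chi2.sf(lm, dof)


rng = np.random.default_rng(42)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])

y_homo = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=n)   # constant variance
y_hetero = 1.0 + 0.5 * x + rng.normal(size=n) * x        # sd proportional to x

lm_homo, dof, p_homo = breusch_pagan_lm(ols_residuals(y_homo, X), x)
lm_hetero, _, p_hetero = breusch_pagan_lm(ols_residuals(y_hetero, X), x)

print(f"Homoskedastic errors:   LM = {lm_homo:.2f}, df = {dof}, p = {p_homo:.4f}")
print(f"Heteroskedastic errors: LM = {lm_hetero:.2f}, df = {dof}, p = {p_hetero:.4g}")
```

The degrees of freedom count only the regressors in the auxiliary regression, not its constant, which matches `k = X.shape[1]` in the cell above because that design matrix is built without an intercept column.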

In [None]:
# Step 3: Recommendation
print("\n" + "="*70)
print("RECOMMENDATION")
print("="*70)

if bp_pvalue < 0.05:
    print("\nHeteroskedasticity detected!")
    print("\nRecommended actions:")
    print("  1. Use ROBUST standard errors (White/Huber)")
    print("     → Corrects SEs for heteroskedasticity")
    print("  2. For panel data, use CLUSTERED standard errors")
    print("     → Corrects for both heteroskedasticity AND within-cluster correlation")
    print("  3. Consider transforming variables (e.g., logs)")
    print("     → May stabilize variance")
    print("  4. Use WLS (Weighted Least Squares) if variance structure known")
    
    print("\nFor this panel dataset, ALWAYS use:")
    print("  model.fit(cov_type='clustered', cluster_entity=True)")
else:
    print("\nNo strong heteroskedasticity detected.")
    print("\nHowever, for panel data:")
    print("  Still use CLUSTERED standard errors!")
    print("  Reason: Accounts for within-entity correlation over time")
    print("  This is independent of heteroskedasticity")

---

## Summary

In these exercises, you practiced:

- ✅ **Exercise 1**: Estimating log-log models and interpreting elasticities
- ✅ **Exercise 2**: Comparing classical, robust, and clustered standard errors
- ✅ **Exercise 3**: Including and interpreting interaction effects
- ✅ **Exercise 4**: Diagnosing heteroskedasticity visually and with formal tests

### Key Skills Acquired

1. **Elasticities**: Interpreting log-log coefficients as % changes
2. **Standard errors**: Understanding why choice matters for inference
3. **Interactions**: Calculating and interpreting marginal effects
4. **Diagnostics**: Detecting violations of classical assumptions

### Best Practices Learned

| Situation | Recommended SE Type | Reason |
|-----------|---------------------|--------|
| Cross-sectional data | Robust | Accounts for heteroskedasticity |
| Panel data | Clustered (by entity) | Accounts for within-entity correlation |
| Time series | HAC (Newey-West) | Accounts for autocorrelation |
| Heteroskedasticity detected | Robust or WLS | Corrects inference or efficiency |

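**Golden rule for panel data**: Default to standard errors clustered by entity. With only a handful of entities, interpret them with some caution, since cluster-robust inference relies on a large number of clusters.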

---

### Next Steps

You are now ready for:

**Module 2: Classical Panel Estimators**
- Fixed Effects (FE)
- Random Effects (RE)
- First Differences (FD)
- Hausman test for choosing between FE and RE

---

In [None]:
print("="*70)
print("SOLUTIONS COMPLETED!")
print("="*70)
print("\nYou've successfully completed all exercises in Tutorial 03.")
print("You now have a solid foundation in panel data estimation!")
print("\nNext: Module 2 - Classical Panel Estimators")
print("\nExcellent work!")