In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
# pip install linearmodels
from linearmodels.panel import PanelOLS, RandomEffects
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 12, 'figure.figsize': (11, 7), 'figure.dpi': 130,
                     'axes.titlesize': 'large', 'axes.labelsize': 'medium',
                     'xtick.labelsize': 'small', 'ytick.labelsize': 'small'})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title): print(f"\n{80*'='}\n| {title.upper()} |\n{80*'='}")

note("Environment initialized for Panel Data Methods.")

# Part 6: Econometrics
## Chapter 6.7: Panel Data Methods

### Introduction: The Power of Observing Over Time

**Panel data**, also known as longitudinal data, follows the same set of individuals (e.g., people, firms, countries) over multiple time periods. This structure is valuable because it offers a way to address one of the most pervasive problems in econometrics: **omitted variable bias** arising from unobserved, time-invariant heterogeneity.

Consider estimating the effect of education on wages. In a simple cross-section of individuals, we might worry that an unobserved factor like a person's innate "ability" or "drive" is correlated with both their education level and their future wages. Since we cannot measure ability, it becomes part of the error term, violating the OLS exogeneity assumption and biasing our estimate of the return to education.

Panel data offers a way forward. By observing the same person over time, we can use econometric techniques to difference out *all* time-invariant characteristics, whether observed or unobserved. The effect of variables that change over time is then identified from within-person variation alone. This gives a more credible path to causal inference, provided no unobserved *time-varying* confounders remain.

This notebook explores the two workhorse models of panel data econometrics: **Fixed Effects** and **Random Effects**, and the crucial decision of when to use each.

## 1. The Panel Data Model

A standard panel data regression model can be written as:
$$ y_{it} = \mathbf{x}_{it}'\beta + c_i + u_{it} $$ 
Here, $y_{it}$ is the dependent variable for individual $i$ at time $t$. The error term is decomposed into two parts:

- $c_i$: The **unobserved individual-specific effect** (or fixed effect). It captures all factors that are constant over time for a given individual, such as a firm's management quality, a person's innate ability, or a country's geography and institutions. This is the primary source of omitted variable bias if it's correlated with the regressors $\mathbf{x}_{it}$.
- $u_{it}$: The **idiosyncratic error term**, which represents unobserved factors that vary over both individuals and time.

The central question in panel data analysis is how to deal with $c_i$. The choice between Fixed Effects and Random Effects models hinges on our assumption about the relationship between the unobserved effect $c_i$ and the regressors $\mathbf{x}_{it}$.
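
To make the source of the bias concrete, consider the scalar case with a single regressor $x_{it}$, and assume (for this illustration only) that $u_{it}$ is uncorrelated with $x_{it}$. Pooled OLS of $y_{it}$ on $x_{it}$ leaves $c_i$ in the error term, so its probability limit is
$$ \operatorname{plim}\, \hat{\beta}_{\text{pooled}} = \beta + \frac{\text{Cov}(x_{it}, c_i)}{\text{Var}(x_{it})} $$
The second term vanishes only when $c_i$ is uncorrelated with the regressor, which is exactly the condition that separates the two modelling assumptions.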

## 2. The Fixed Effects (FE) Estimator

The Fixed Effects model is generally the more robust and credible approach. It explicitly allows the unobserved individual effect $c_i$ to be **correlated** with the explanatory variables $\mathbf{x}_{it}$. This is a realistic assumption in many economic settings (e.g., more able people tend to get more education).

To eliminate the problematic $c_i$ term, we can use the **within-transformation** (also known as demeaning). For each individual $i$, we take the average of the regression equation over all time periods they are in the sample:
$$ \bar{y}_i = \bar{\mathbf{x}}_i'\beta + c_i + \bar{u}_i $$ 
where $\bar{y}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} y_{it}$. Subtracting this time-averaged equation from the original equation gives the "demeaned" or "within-transformed" equation:
$$ (y_{it} - \bar{y}_i) = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\beta + (u_{it} - \bar{u}_i) $$ 
The fixed effect $c_i$ is eliminated. We can now run OLS on this transformed data to get a consistent estimate of $\beta$, provided the regressors are strictly exogenous, i.e. uncorrelated with the idiosyncratic error $u_{is}$ in every period $s$. This is why the FE estimator is often called the **within estimator**: it identifies $\beta$ purely from the variation *within* each individual over time.

The major drawback of this method is that it cannot estimate the effect of any time-invariant variables (like gender, race, or a firm's industry), as they are completely wiped out by the demeaning process.
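
The within-transformation can be sketched on simulated data. The example below is a minimal illustration, not the notebook's estimation code: the data-generating process, the helper `ols_slope`, and the 0/1 `trait` variable are all assumptions made for the demonstration. It builds a panel in which $c_i$ is correlated with the regressor, compares the pooled and within slopes, and shows what demeaning does to a time-invariant column.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ind, n_time, beta_true = 500, 5, 2.0

# Individual effect c_i, deliberately correlated with the regressor x_it
c = rng.normal(size=n_ind)
x = 0.8 * c[:, None] + rng.normal(size=(n_ind, n_time))
u = rng.normal(size=(n_ind, n_time))
y = beta_true * x + c[:, None] + u

def ols_slope(x_arr, y_arr):
    """Slope from a bivariate OLS regression with an intercept."""
    xc = x_arr.ravel() - x_arr.mean()
    yc = y_arr.ravel() - y_arr.mean()
    return float(xc @ yc / (xc @ xc))

# Pooled OLS ignores c_i, so the slope absorbs part of its effect
beta_pooled = ols_slope(x, y)

# Within transformation: subtract each individual's own time average
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
beta_within = ols_slope(x_within, y_within)

# A time-invariant regressor (hypothetical 0/1 trait) is demeaned to all zeros
trait = np.repeat(rng.integers(0, 2, size=n_ind)[:, None], n_time, axis=1).astype(float)
trait_within = trait - trait.mean(axis=1, keepdims=True)

print(f"true beta: {beta_true}, pooled: {beta_pooled:.3f}, within: {beta_within:.3f}")
print("time-invariant column is all zero after demeaning:", np.allclose(trait_within, 0.0))
```

Because the demeaned `trait` column contains no variation, no regression can attach a coefficient to it. This is the drawback described above.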

In [None]:
sec("Loading and Preparing Panel Data")

# Load the classic Grunfeld investment panel (shipped with statsmodels)
try:
    from statsmodels.datasets import grunfeld
    df = grunfeld.load_pandas().data
    df['year'] = df['year'].astype(int)
    note("Grunfeld investment dataset loaded.")
except (ImportError, AttributeError):
    note("Could not load Grunfeld dataset. Creating dummy data.")
    firms = [f"Firm_{i}" for i in range(10)]
    years = range(1935, 1955)
    index = pd.MultiIndex.from_product([firms, years], names=['firm', 'year'])
    # Keep 'firm' and 'year' as columns so the set_index call below works in both branches
    df = pd.DataFrame(np.random.rand(200, 3) * 1000, index=index,
                      columns=['invest', 'value', 'capital']).reset_index()

# For use with linearmodels, we need to set a MultiIndex of (entity, time)
df_lm = df.set_index(['firm', 'year'])
df_lm = sm.add_constant(df_lm)
dependent = df_lm['invest']
exog = df_lm[['const', 'value', 'capital']]

# --- Visualize the raw data and a biased Pooled OLS fit ---
fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(data=df.reset_index(), x='value', y='invest', hue='firm', palette='viridis', alpha=0.7, ax=ax)
# Add a pooled OLS regression line
b, a = np.polyfit(df['value'], df['invest'], 1)
ax.plot(df['value'], a + b * df['value'], color='red', lw=2.5, ls='--', label='Pooled OLS Fit')
ax.set_title('Raw Data: Investment vs. Firm Value')
ax.legend()
plt.show()
note("A simple OLS regression on this pooled data is likely to be biased. It ignores the fact that observations from the same firm are not independent and, more importantly, that firms may have different baseline investment levels (fixed effects). If those baseline levels are correlated with value and capital stock, the pooled slope picks up part of their effect.")

In [None]:
sec("Fixed Effects vs. Pooled OLS")

# --- Model 1: Pooled OLS (ignores panel structure, likely biased) ---
pooled_ols = PanelOLS(dependent, exog).fit(cov_type='clustered', cluster_entity=True)
print("--- Pooled OLS Results (likely biased) ---")
print(pooled_ols)

# --- Model 2: One-Way Fixed Effects (controls for firm characteristics) ---
fe_model = PanelOLS(dependent, exog, entity_effects=True).fit(cov_type='clustered', cluster_entity=True)
print("\n--- Fixed Effects (Within) Estimator Results ---")
print(fe_model)

# --- Model 3: Two-Way Fixed Effects (controls for firm and year effects) ---
# This is often the preferred specification as it also controls for common time shocks.
twfe_model = PanelOLS(dependent, exog, entity_effects=True, time_effects=True).fit(cov_type='clustered', cluster_entity=True)
print("\n--- Two-Way Fixed Effects (TWFE) Estimator Results ---")
print(twfe_model)
note("Compare the coefficients on 'value' and 'capital' across the Pooled OLS and Fixed Effects models. A sizeable shift once firm effects are included is consistent with unobserved firm-specific characteristics biasing the pooled estimates, although it is not a formal test.")

## 3. The Random Effects (RE) Estimator

The Random Effects model makes a much stronger assumption than the FE model. It assumes that the unobserved individual effect $c_i$ is **uncorrelated** with the explanatory variables $\mathbf{x}_{it}$ for all time periods. Under this assumption, $c_i$ is not a source of omitted variable bias, but simply another random component of the error term.

The RE model is estimated using **Generalized Least Squares (GLS)**. It accounts for the serial correlation within each individual's errors (since the composite error $v_{it} = c_i + u_{it}$ means all errors for individual $i$ share the common component $c_i$) via a **quasi-demeaning** process. The RE estimator is a matrix-weighted average of the *within* estimator (from FE) and the *between* estimator (from a regression on individual time-averages).

**Why use RE if its assumption is so strong?**
1.  **Efficiency:** If the no-correlation assumption holds, the RE estimator is more efficient (has lower variance) than FE.
2.  **Time-Invariant Variables:** Crucially, because RE does not wipe out all the between-individual variation, it *can* estimate the effects of time-invariant regressors (like gender or race), which FE cannot.
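
The quasi-demeaning step can be illustrated numerically. For a balanced panel with $T$ periods, the GLS transformation subtracts a fraction $\theta = 1 - \sqrt{\sigma_u^2 / (\sigma_u^2 + T\sigma_c^2)}$ of each individual's time average. The sketch below is a simplified illustration under stated assumptions: the variance components `sigma_c` and `sigma_u` are treated as known (in practice they are estimated), the data are simulated so that the RE assumption holds, and `quasi_demean` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_time, beta_true = 1000, 5, 1.0
sigma_c, sigma_u = 1.5, 1.0  # assumed known here; in practice they are estimated

# The RE assumption holds by construction: c_i is drawn independently of x_it
c = rng.normal(scale=sigma_c, size=n_ind)
x = rng.normal(size=(n_ind, n_time))
y = 0.5 + beta_true * x + c[:, None] + rng.normal(scale=sigma_u, size=(n_ind, n_time))

# Quasi-demeaning weight for a balanced panel
theta = 1.0 - np.sqrt(sigma_u**2 / (sigma_u**2 + n_time * sigma_c**2))

def quasi_demean(a, theta):
    """Subtract a fraction theta of each individual's time average."""
    return a - theta * a.mean(axis=1, keepdims=True)

# The intercept column is transformed too: it equals 1 - theta for every observation
X = np.column_stack([np.full(n_ind * n_time, 1.0 - theta),
                     quasi_demean(x, theta).ravel()])
coef, *_ = np.linalg.lstsq(X, quasi_demean(y, theta).ravel(), rcond=None)

print(f"theta = {theta:.3f}, RE intercept = {coef[0]:.3f}, RE slope = {coef[1]:.3f}")
```

Setting $\theta = 0$ reproduces pooled OLS and $\theta = 1$ reproduces the within transformation, which is one way to see RE as a compromise between the two.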

In [None]:
sec("Random Effects Model")

# The setup is the same, but we use the RandomEffects model class
re_model = RandomEffects(dependent, exog).fit(cov_type='clustered', cluster_entity=True)

print("--- Random Effects (GLS) Estimator Results ---")
print(re_model)

## 4. Fixed Effects vs. Random Effects: The Hausman Test

We face a trade-off between robustness and efficiency: FE remains consistent when $c_i$ is correlated with $\mathbf{x}_{it}$ but is less efficient and can't estimate time-invariant effects. RE is more efficient and can handle time-invariant variables, but produces inconsistent estimates if its core assumption is violated. The **Hausman Test** provides a formal way to choose between the two.

- **Null Hypothesis ($H_0$):** The Random Effects model is the correct model. The unobserved effects $c_i$ are uncorrelated with the regressors. Both FE and RE are consistent, but RE is more efficient.
- **Alternative Hypothesis ($H_A$):** The Fixed Effects model is the correct model. The effects $c_i$ are correlated with the regressors. In this case, FE is consistent, but RE is **inconsistent and biased**.

The test statistic is based on the difference between the FE and RE coefficient estimates. If the difference is large and statistically significant, we reject the null hypothesis that RE is consistent.

**Practical Rule:** If the p-value from the Hausman test is small (e.g., < 0.05), we reject the null hypothesis and conclude that the Fixed Effects model is the more appropriate choice.
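
Written out, the classical form of the statistic compares the $k$ coefficients that both models estimate (the time-varying regressors):
$$ H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[ \widehat{\text{Var}}(\hat{\beta}_{FE}) - \widehat{\text{Var}}(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \overset{a}{\sim} \chi^2_k \quad \text{under } H_0 $$
The simple difference of the two covariance matrices is the variance of $\hat{\beta}_{FE} - \hat{\beta}_{RE}$ only when RE is efficient under $H_0$. With robust or clustered covariance estimates this need not hold, and the difference matrix may fail to be positive definite.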

In [None]:
sec("Hausman Test: FE vs. RE")

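from scipy import stats

def hausman_test(fe, re):
    """Classical Hausman test for fixed vs. random effects.

    Assumes both fits use conventional (unadjusted) covariances, so that RE is
    efficient under H0 and Var(b_fe - b_re) = Var(b_fe) - Var(b_re).
    """
    # Get coefficients and covariance matrices, dropping the constant
    b_fe = fe.params.drop('const')
    b_re = re.params.drop('const')
    cov_fe = fe.cov.loc[b_fe.index, b_fe.index]
    cov_re = re.cov.loc[b_fe.index, b_fe.index]

    # Quadratic form in the coefficient differences
    b_diff = (b_fe - b_re[b_fe.index]).to_numpy()
    cov_diff_inv = np.linalg.inv((cov_fe - cov_re).to_numpy())

    stat = float(b_diff @ cov_diff_inv @ b_diff)
    dof = len(b_diff)
    pval = stats.chi2.sf(stat, dof)  # upper-tail chi-squared probability

    return stat, dof, pval

# Refit both models with conventional covariances: the clustered fits above
# do not satisfy the efficiency assumption behind the classical statistic.
fe_hausman = PanelOLS(dependent, exog, entity_effects=True).fit()
re_hausman = RandomEffects(dependent, exog).fit()
stat, dof, pval = hausman_test(fe_hausman, re_hausman)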

print("--- Hausman Test Results ---")
print(f"Chi-squared statistic: {stat:.4f}")
print(f"Degrees of freedom:    {dof}")
print(f"p-value:               {pval:.4f}")

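if pval < 0.05:
    note(f"The p-value ({pval:.4f}) is below 0.05, so we reject the null hypothesis. This suggests that the unobserved firm-specific effects are correlated with the regressors ('value' and 'capital'), making the Random Effects estimates inconsistent. The Fixed Effects model is the more appropriate choice for this dataset.")
else:
    note(f"The p-value ({pval:.4f}) is not below 0.05, so we fail to reject the null hypothesis: the FE and RE estimates do not differ systematically, and the more efficient Random Effects model is not rejected for this dataset.")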

## 5. Dynamic Panel Data: The Problem of Nickell Bias

A major limitation of the standard Fixed Effects model is that it produces biased estimates when the model includes a **lagged dependent variable** as a regressor. A model of the form:
$$ y_{it} = \alpha y_{i,t-1} + \mathbf{x}_{it}'\beta + c_i + u_{it} $$ 
is known as a **dynamic panel data model**. These are common in economics, where current outcomes often depend on past outcomes (e.g., state dependence in unemployment, habit formation in consumption).

Applying the within-transformation to this model creates a mechanical correlation between the transformed lagged dependent variable and the transformed error term. The demeaned lagged outcome is $\tilde{y}_{i,t-1} = y_{i,t-1} - \bar{y}_{i,-1}$, where $\bar{y}_{i,-1}$ is individual $i$'s time average of the lagged outcome, and the demeaned error is $\tilde{u}_{it} = u_{it} - \bar{u}_i$. Because $y_{i,t-1}$ depends on $u_{i,t-1}, u_{i,t-2}, \dots$, and $\bar{u}_i$ averages the errors from every period (including $u_{i,t-1}$), the two are correlated. Hence $\text{Cov}(\tilde{y}_{i,t-1}, \tilde{u}_{it}) \neq 0$, which violates the strict exogeneity assumption and makes the standard FE estimator inconsistent as $N \to \infty$ with $T$ fixed. This is known as **Nickell bias**. The bias is of order $1/T$, so it is most severe in short panels and fades only as the time dimension $T$ grows.
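
A small simulation makes the bias visible. The sketch below is illustrative only: the pure AR(1) design without extra regressors, the sample sizes, and the burn-in length are assumptions chosen for the demonstration. It applies the within estimator to a short dynamic panel whose true autoregressive coefficient is known.

```python
import numpy as np

rng = np.random.default_rng(42)
n_ind, n_time, burn_in, alpha_true = 2000, 6, 50, 0.6

# Simulate y_it = alpha * y_i,t-1 + c_i + u_it, discarding a burn-in period
c = rng.normal(size=n_ind)
y_prev = np.zeros(n_ind)
path = []
for t in range(burn_in + n_time):
    y_prev = alpha_true * y_prev + c + rng.normal(size=n_ind)
    path.append(y_prev)
y = np.column_stack(path[burn_in:])   # shape (n_ind, n_time)

# n_time - 1 usable periods per individual once the lag is formed
y_cur, y_lag = y[:, 1:], y[:, :-1]

# Within transformation of both the outcome and its lag
y_cur_w = y_cur - y_cur.mean(axis=1, keepdims=True)
y_lag_w = y_lag - y_lag.mean(axis=1, keepdims=True)

# OLS slope on the demeaned data (the FE / within estimator)
alpha_fe = float((y_lag_w * y_cur_w).sum() / (y_lag_w**2).sum())
print(f"true alpha = {alpha_true}, within (FE) estimate = {alpha_fe:.3f}")
```

Even with 2,000 individuals the estimate stays well below the true value, since adding individuals does not remove the bias. Only a longer time dimension does.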

### 5.1 The Arellano-Bond (1991) Solution: Difference GMM
The **Arellano-Bond (1991) estimator** provides a solution using the Generalized Method of Moments (GMM). The key steps are:
1.  **First-Difference the Equation:** This removes the fixed effect $c_i$, leaving: 
    $$ \Delta y_{it} = \alpha \Delta y_{i,t-1} + \Delta \mathbf{x}_{it}'\beta + \Delta u_{it} $$ 
    However, $\Delta y_{i,t-1} = y_{i,t-1} - y_{i,t-2}$ is still correlated with the new error term $\Delta u_{it} = u_{it} - u_{i,t-1}$ because $y_{i,t-1}$ depends on $u_{i,t-1}$.
2.  **Instrument the Endogenous Variable:** The crucial insight is to use *lagged levels* of the variables as instruments for the first-differenced equation. For the equation at time $t$, the variable $y_{i,t-2}$ is correlated with the endogenous regressor $\Delta y_{i,t-1}$ but is **not** correlated with the error term $\Delta u_{it}$ (assuming the original errors $u_{it}$ are not serially correlated). We can also use $y_{i,t-3}$, $y_{i,t-4}$, etc., as additional valid instruments. This creates a set of moment conditions that can be used to form a GMM estimator.
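
Stated as moment conditions, and indexing periods $t = 1, \dots, T$, the instrument validity argument in step 2 reads:
$$ E\left[ y_{i,t-s}\, \Delta u_{it} \right] = 0, \qquad s \geq 2, \quad t = 3, \dots, T $$
The differenced equation for period $t$ therefore has $t-2$ available instruments ($y_{i1}, \dots, y_{i,t-2}$), for a total of $(T-1)(T-2)/2$ moment conditions on the lagged dependent variable alone. These conditions rely on the assumption, noted above, that $u_{it}$ is not serially correlated.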

### 5.2 System GMM: Improving Efficiency
A drawback of Difference GMM is that lagged levels can be weak instruments for first differences when the variables are highly persistent (close to a random walk). **System GMM** (Arellano & Bover, 1995; Blundell & Bond, 1998) addresses this by adding a second set of equations to the system: the original equations in *levels*, instrumented with *lagged differences*. These extra moment conditions are valid only under an additional assumption, namely that the lagged differences are uncorrelated with the individual effect $c_i$. The combination of the differenced equations (instrumented by levels) and the level equations (instrumented by differences) constitutes the System GMM estimator, which is widely used in applied work.
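
The additional moment conditions for the level equations can be written as:
$$ E\left[ \Delta y_{i,t-1}\, (c_i + u_{it}) \right] = 0, \qquad t = 3, \dots, T $$
They hold when changes in $y$ are uncorrelated with the individual effect, which is usually justified by assuming that the initial observations do not deviate systematically from their long-run, individual-specific means. This assumption is worth checking in any application rather than taking for granted.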

In [None]:
sec("Difference GMM for Dynamic Panel Data")
# Note: to our knowledge linearmodels has no dedicated Arellano-Bond / System GMM
# class, so this cell builds the differenced equation by hand and estimates it
# with the general IVGMM estimator. It illustrates the idea; it is not System GMM.
try:
    from linearmodels.iv import IVGMM
    # Simulate data for a dynamic panel model
    rng = np.random.default_rng(123)
    n_ind, n_time = 500, 10
    alpha_true = 0.6; beta_true = 1.5
    individual_effects = rng.normal(size=n_ind)
    y = np.zeros((n_ind, n_time)); x = rng.normal(size=(n_ind, n_time))
    for t in range(1, n_time):
        y[:, t] = alpha_true * y[:, t-1] + beta_true * x[:, t] + individual_effects + rng.normal(size=n_ind)

    # Reshape into a long DataFrame (one row per entity-period)
    df_dyn = pd.DataFrame({'y': y.flatten(), 'x': x.flatten(),
                           'entity': np.repeat(np.arange(n_ind), n_time),
                           'time': np.tile(np.arange(n_time), n_ind)})

    # First differences remove c_i; lagged levels of y serve as instruments
    by = lambda col: df_dyn.groupby('entity')[col]
    df_dyn['dy'] = by('y').diff()
    df_dyn['dx'] = by('x').diff()
    df_dyn['dy_lag'] = by('dy').shift(1)
    df_dyn['y_lag2'] = by('y').shift(2)
    df_dyn['y_lag3'] = by('y').shift(3)
    df_gmm = df_dyn.dropna().reset_index(drop=True)

    # --- Estimate the differenced equation by two-step GMM ---
    # dx is treated as exogenous; dy_lag is endogenous and instrumented by
    # y_lag2 and y_lag3. No constant is included: differencing removes it.
    model_gmm = IVGMM(df_gmm['dy'], df_gmm[['dx']], df_gmm[['dy_lag']],
                      df_gmm[['y_lag2', 'y_lag3']])
    res_gmm = model_gmm.fit()

    note(f"Difference GMM Results (True α={alpha_true}, β={beta_true})")
    display(res_gmm)
    j = res_gmm.j_stat
    note(f"Compare the coefficients on dy_lag and dx with the true α and β. Hansen's J statistic is {j.stat:.3f} (p-value {j.pval:.3f}). A large p-value means the overidentifying restrictions are not rejected, which is consistent with valid instruments but does not prove it.")
except ImportError:
    note("linearmodels not installed. Skipping GMM example.")

## 6. Exercises

1.  **FE Intuition:** Explain in your own words why the Fixed Effects estimator cannot estimate the effect of time-invariant variables like a person's gender or a firm's industry. What happens to such a variable during the within-transformation?

2.  **Manual Within-Transformation:** To prove you understand the FE transformation, perform it by hand. Using the `grunfeld` dataset, use `pandas` to group the data by firm, calculate the firm-specific mean for `invest`, `value`, and `capital`, and then subtract these means from the original values to create new demeaned variables. Finally, run a simple OLS (using `smf.ols`) of demeaned `invest` on demeaned `value` and `capital`. Verify that the coefficients you get are identical to those from the `PanelOLS` Fixed Effects model.

3.  **First-Differencing:** An alternative to the within-transformation for removing fixed effects is the **first-difference (FD) estimator**. This involves subtracting the equation at time `t-1` from the equation at time `t` for each individual.
    a. Write down the first-differenced equation and show that the fixed effect $c_i$ is eliminated.
    b. When might the FD estimator be preferred to the FE estimator? (Hint: Think about the assumptions on the idiosyncratic error term $u_{it}$. What if $u_{it}$ follows a random walk?)

4.  **Data Application:** Find a panel dataset of your choice (e.g., from the `piketty` R dataset collection, or a dataset from a previous course). Formulate a research question, estimate Pooled OLS, Fixed Effects (one-way and two-way), and Random Effects models. Perform a Hausman test to determine which model is most appropriate for your data and justify your choice.