In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150,
                     'axes.titlesize': 'large', 'axes.labelsize': 'medium',
                     'xtick.labelsize': 'small', 'ytick.labelsize': 'small'})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg, **kwargs):
    display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title):
    print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

note("Environment initialized for Advanced Difference-in-Differences.")

# Part 6: Econometrics
## Chapter 6.6: Difference-in-Differences: Theory and Modern Practice

### Introduction: Constructing a Counterfactual

The **Difference-in-Differences (DiD)** method is a cornerstone of modern policy evaluation. It provides an intuitive and powerful way to estimate the causal effect of a policy by using data from a **control group** to construct a credible estimate of the **counterfactual**—what *would have happened* to the **treatment group** in the absence of the treatment.

By comparing the change in outcomes over time for a group that receives a treatment to the change for a control group, DiD differences away two potential sources of bias:
1.  **Time-invariant, unobservable differences** between the groups.
2.  **Common time trends** that affect both groups.

This chapter provides a PhD-level treatment of DiD, covering its formal derivation, its connection to fixed effects models, and the modern estimators required to handle the complexities of real-world policy adoption.

## 1. The DiD Estimator and the Parallel Trends Assumption

The DiD estimate is the simple difference in the average change over time between the treatment and control groups:
$$ \hat{\delta}_{DiD} = (E[Y_{T, post}] - E[Y_{T, pre}]) - (E[Y_{C, post}] - E[Y_{C, pre}]) $$ 

### 1.1 Formal Derivation with Potential Outcomes
The validity of this estimator rests on the **Parallel Trends Assumption**. Formally, this assumes that the average change in the *no-treatment potential outcome* is the same for both groups:
$$ E[Y_i(0)_{post} - Y_i(0)_{pre} | \text{Group = Treat}] = E[Y_i(0)_{post} - Y_i(0)_{pre} | \text{Group = Control}] $$ 

Let's see how this assumption identifies the Average Treatment Effect on the Treated (ATT). The parameter we want is $\text{ATT} = E[Y_i(1)_{post} - Y_i(0)_{post} | \text{Group = Treat}]$.
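We can write the observed change for the treatment group as follows, abbreviating "Group = Treat" as $T$ and using the fact that nobody is treated in the pre-period, so the pre-period outcome of the treated group is its no-treatment potential outcome: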
$$ E[Y_{T, post}] - E[Y_{T, pre}] = E[Y_i(1)_{post} | T] - E[Y_i(0)_{pre} | T] $$
Adding and subtracting the unobserved counterfactual $E[Y_i(0)_{post} | T]$ gives:
$$ = \underbrace{(E[Y_i(1)_{post} | T] - E[Y_i(0)_{post} | T])}_{\text{ATT}} + \underbrace{(E[Y_i(0)_{post} | T] - E[Y_i(0)_{pre} | T])}_{\text{Counterfactual Trend for Treated}} $$ 
The parallel trends assumption allows us to substitute the *observed* trend from the control group for the unobserved counterfactual trend for the treated group. Rearranging gives the DiD estimator:
$$ \text{ATT} = (E[Y_{T, post}] - E[Y_{T, pre}]) - (E[Y_{C, post}] - E[Y_{C, pre}]) $$
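
The derivation can be checked numerically. The following is a minimal sketch under assumed values (a sample of 200,000, a true ATT of -10, a common trend of 15, and a hypothetical `trend_gap` of 3 used to break parallel trends); none of these numbers come from the text above. It simulates both potential outcomes, so the ATT is known, and compares the DiD estimate with and without parallel trends.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large sample so that sample means approximate expectations

treat = rng.integers(0, 2, size=n)               # 1 = treatment group, 0 = control group
y0_pre = 50 + 20 * treat + rng.normal(0, 5, n)   # groups may differ in levels
y0_post = y0_pre + 15 + rng.normal(0, 5, n)      # same expected Y(0) change in both groups
att_true = -10.0
y1_post = y0_post + att_true                     # treated potential outcome, post period

# Observed outcomes: nobody is treated pre; only the treatment group is treated post
y_pre = y0_pre
y_post = np.where(treat == 1, y1_post, y0_post)


def did(y_pre, y_post, treat):
    """Difference in mean changes between the treatment and control groups."""
    change = y_post - y_pre
    return change[treat == 1].mean() - change[treat == 0].mean()


did_parallel = did(y_pre, y_post, treat)

# Break parallel trends: the treated group's Y(0) would have grown 3 units faster
trend_gap = 3.0
did_violated = did(y_pre, y_post + trend_gap * treat, treat)

print(f"True ATT:                     {att_true:.2f}")
print(f"DiD under parallel trends:    {did_parallel:.2f}")
print(f"DiD with a trend gap of {trend_gap:.0f}:    {did_violated:.2f}")
```

The level difference between the groups (the `20 * treat` term) does not bias the estimate, but any gap in the no-treatment trends passes one-for-one into the DiD estimate.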

## 2. Estimation: The Two-Way Fixed Effects (TWFE) Model

When we have panel data (multiple units observed over multiple time periods), the DiD estimate can be obtained from a **Two-Way Fixed Effects (TWFE)** regression:
$$ Y_{it} = \alpha_i + \gamma_t + \delta D_{it} + \epsilon_{it} $$ 
Where:
- $Y_{it}$ is the outcome for unit $i$ at time $t$.
- $\alpha_i$ are **unit fixed effects**, which absorb all time-invariant characteristics of each unit (treatment or control).
- $\gamma_t$ are **time fixed effects**, which absorb all common shocks or trends that affect all units in a given period.
- $D_{it}$ is a dummy variable that is 1 if unit $i$ is treated at time $t$, and 0 otherwise.
- $\delta$ is the coefficient of interest; its estimate is the DiD estimate of the treatment effect.

In the simple 2x2 case (two groups and two periods, with a balanced panel), this regression is numerically identical to the manual difference-in-differences calculation. The fixed effects framework is powerful because it easily accommodates multiple time periods and control variables.
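
To make the TWFE specification concrete, here is a minimal sketch on a simulated two-period balanced panel. The panel size, effect sizes, and the names `panel`, `twfe_did`, and `manual_2x2` are illustrative assumptions. Unit and time fixed effects are entered as dummies through the `C()` operator of the statsmodels formula interface.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_units = 60

# Balanced panel: every unit is observed in period 0 (pre) and period 1 (post)
panel = pd.DataFrame([(i, t) for i in range(n_units) for t in (0, 1)],
                     columns=['unit', 'time'])
panel['treat'] = (panel['unit'] < n_units // 2).astype(int)  # first half is treated
panel['D'] = panel['treat'] * panel['time']                  # treated unit in the post period

alpha = rng.normal(0, 10, n_units)                           # unit fixed effects
panel['y'] = (alpha[panel['unit'].to_numpy()] + 15 * panel['time'] - 10 * panel['D']
              + rng.normal(0, 5, len(panel)))

# TWFE regression: unit dummies and time dummies plus the treatment indicator
twfe = smf.ols('y ~ D + C(unit) + C(time)', data=panel).fit()
twfe_did = twfe.params['D']

# Manual 2x2 DiD from the four cell means
cell_means = panel.groupby(['treat', 'time'])['y'].mean()
manual_2x2 = ((cell_means.loc[(1, 1)] - cell_means.loc[(1, 0)])
              - (cell_means.loc[(0, 1)] - cell_means.loc[(0, 0)]))

print(f"TWFE coefficient on D: {twfe_did:.4f}")
print(f"Manual 2x2 DiD:        {manual_2x2:.4f}")
```

With many units, dummy-variable estimation becomes slow; dedicated fixed-effects routines that demean the data instead are the usual choice in practice.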

In [None]:
sec("DiD Estimation: Manual vs. Regression")

# 1. Generate Synthetic Data (from previous notebook)
rng = np.random.default_rng(seed=42)
n_per_group = 100
df = pd.DataFrame({'group': ['Control']*n_per_group*2 + ['Treatment']*n_per_group*2,
                   'time': ['Pre']*n_per_group + ['Post']*n_per_group + ['Pre']*n_per_group + ['Post']*n_per_group})
df['treat'] = (df['group'] == 'Treatment').astype(int)
df['post'] = (df['time'] == 'Post').astype(int)
true_effect = -10
group_fe = 20*df['treat']
time_trend = 15*df['post']
treat_effect = true_effect * (df['treat'] * df['post'])
df['outcome'] = 100 + group_fe + time_trend + treat_effect + rng.normal(0, 5, size=df.shape[0])

# 2. Manual Calculation
means = df.groupby(['treat', 'post'])['outcome'].mean()
manual_did = (means.loc[1,1] - means.loc[1,0]) - (means.loc[0,1] - means.loc[0,0])
note(f"Manual DiD Estimate: {manual_did:.4f}")

# 3. Regression Calculation
did_model = smf.ols('outcome ~ treat * post', data=df).fit()
regression_did = did_model.params['treat:post']
note(f"Regression DiD Estimate (Interaction Term): {regression_did:.4f}")
note(f"Estimates agree to numerical precision: {np.isclose(manual_did, regression_did)}")

## 3. Modern DiD: Staggered Adoption and Heterogeneous Effects

A major recent development in econometrics is the recognition that the standard TWFE estimator can be severely biased in the common setting of **staggered treatment adoption** (when different units get treated at different times) if the treatment effects are heterogeneous.

**The Problem:** The TWFE estimator implicitly uses already-treated units as controls for later-treated units. If the treatment effect changes over time (e.g., it grows), the outcome path of an already-treated unit mixes the common trend with changes in its own treatment effect, so it no longer provides a valid counterfactual for that comparison. The TWFE estimate is a weighted average of all possible 2x2 DiDs in the data (Goodman-Bacon, 2021). The weights on the 2x2 comparisons are positive, but because some of those comparisons are contaminated in this way, the implied weights on the underlying treatment effects can be negative, leading to biased and hard-to-interpret results.
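
The problem described above can be reproduced in a few lines. This is an illustrative sketch with an assumed design that is not from the text: two equally sized cohorts first treated in periods 2 and 4, six periods, no never-treated units, no noise, and a treatment effect that grows by one unit per period since adoption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

n_per_cohort = 20
periods = range(6)
first_treated = {'early': 2, 'late': 4}   # cohort -> first treated period

rows = [(f"{cohort}_{u}", g, t)
        for cohort, g in first_treated.items()
        for u in range(n_per_cohort)
        for t in periods]
stag = pd.DataFrame(rows, columns=['unit', 'g', 'time'])

stag['D'] = (stag['time'] >= stag['g']).astype(int)
# Dynamic effect: 1 in the first treated period, 2 in the second, and so on
stag['tau'] = np.where(stag['D'] == 1, stag['time'] - stag['g'] + 1, 0)

# Outcome = unit effect + common time trend + treatment effect (deliberately no noise)
unit_fe = stag['unit'].map({u: i % 5 for i, u in enumerate(stag['unit'].unique())})
stag['y'] = unit_fe + 2.0 * stag['time'] + stag['tau']

twfe_est = smf.ols('y ~ D + C(unit) + C(time)', data=stag).fit().params['D']
true_att = stag.loc[stag['D'] == 1, 'tau'].mean()

print(f"True average effect over treated observations: {true_att:.3f}")
print(f"TWFE estimate:                                 {twfe_est:.3f}")
```

In this design every individual treatment effect lies between 1 and 4 and their average is 13/6, yet the TWFE coefficient should come out at 0.5. The shortfall arises because the early cohort's later, larger effects enter with negative weight when that cohort serves as the control for the late cohort.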

**The Solution: Modern DiD Estimators**
A new generation of estimators has been developed to address this problem. Key examples include:
- **Callaway and Sant'Anna (2021):** This method avoids forbidden comparisons (those that use already-treated units as controls) by estimating group-time average treatment effects ($ATT(g,t)$) for each group $g$ (defined by when they were first treated) at each time $t$. It does this by using only never-treated or not-yet-treated units as the control group for that specific comparison. These group-time effects can then be aggregated to form various summary measures.
- **Sun and Abraham (2021):** Proposes an interaction-weighted estimator that correctly estimates the dynamic treatment effects (the event study coefficients) in a staggered setting. It explicitly estimates the cohort-specific event-study effects and then averages them.
- **Borusyak, Jaravel, and Spiess (2021):** Provides an "imputation"-based approach that is computationally efficient and robust.

The key takeaway is that with staggered treatment adoption, the simple TWFE regression should not be relied on unless treatment effects are plausibly homogeneous across cohorts and over time. Otherwise, one of these modern, robust estimators is required.
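
The group-time logic can be sketched by hand. The code below is a simplified illustration in the spirit of Callaway and Sant'Anna (2021), not their implementation: it has no covariates, no inference, and no noise, and the helper `att_gt` and the simulated design (cohorts first treated in periods 2 and 4 plus a never-treated cohort) are assumptions made here. Each $ATT(g,t)$ compares the change since period $g-1$ for cohort $g$ with the same change for units not yet treated at $t$.

```python
import numpy as np
import pandas as pd

n_per_cohort = 20
periods = range(6)
first_treated = {'early': 2, 'late': 4, 'never': np.inf}  # inf = never treated

rows = [(f"{cohort}_{u}", g, t)
        for cohort, g in first_treated.items()
        for u in range(n_per_cohort)
        for t in periods]
cs_panel = pd.DataFrame(rows, columns=['unit', 'g', 'time'])

# Effect grows with time since adoption; never-treated units always get 0
cs_panel['tau'] = np.where(cs_panel['time'] >= cs_panel['g'],
                           cs_panel['time'] - cs_panel['g'] + 1, 0.0)
unit_fe = cs_panel['unit'].map({u: i % 5 for i, u in enumerate(cs_panel['unit'].unique())})
cs_panel['y'] = unit_fe + 2.0 * cs_panel['time'] + cs_panel['tau']

# One row per unit, one column per period
wide = cs_panel.pivot(index='unit', columns='time', values='y')
g_of_unit = cs_panel.drop_duplicates('unit').set_index('unit')['g'].reindex(wide.index)


def att_gt(g, t):
    """ATT(g, t): change since period g-1 for cohort g minus the same change
    for units that are not yet treated at t (including never-treated units)."""
    change = wide[t] - wide[g - 1]
    treated = g_of_unit == g
    controls = g_of_unit > t
    return change[treated].mean() - change[controls].mean()


results = pd.DataFrame([(g, t, att_gt(g, t)) for g in (2, 4) for t in periods if t >= g],
                       columns=['g', 't', 'att'])
results['true'] = results['t'] - results['g'] + 1
print(results)
print(f"Simple average of ATT(g,t): {results['att'].mean():.3f}")
```

Because no already-treated unit ever serves as a control, each estimate lines up with the effect built into the simulation. The simple average is only one possible aggregation; weighting by cohort size or by event time are common alternatives.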

## 4. The Synthetic Control Method

What if you have only one treated unit (e.g., a single state implements a policy) and many potential control units? DiD requires choosing a control group, but any choice might be criticized as arbitrary. The **Synthetic Control Method (SCM)**, developed by Abadie and Gardeazabal (2003), provides a data-driven way to construct an optimal control group.

**The Idea:**
Instead of choosing a single control state, SCM creates an ideal "synthetic" control group by taking a weighted average of multiple untreated units. The weights are chosen such that the synthetic control group's pre-treatment outcomes and other important predictors match the treated unit's pre-treatment characteristics as closely as possible.

**The Procedure:**
1.  Identify a single treated unit and a "donor pool" of untreated units.
2.  Find the vector of weights $W$, constrained to be non-negative and to sum to one, that minimizes the pre-treatment difference between the treated unit and the weighted average of the donor pool units.
3.  Construct the synthetic control by applying these weights to the donor pool outcomes for the entire sample period.
4.  The estimated treatment effect is the difference between the treated unit's outcome and the synthetic control's outcome in the post-treatment period.

SCM formalizes the comparative case study and is now a standard tool for policy evaluation with a small number of treated units.
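
Steps 2 to 4 can be written compactly. The following is a sketch of the simplest, outcomes-only version of the problem. The indexing is notation assumed here rather than taken from the text: unit 1 is the treated unit, units $2, \dots, J+1$ form the donor pool, and $T_0$ is the first treated period. The full method also matches on other predictors, each with its own importance weight.

$$ W^* = \arg\min_{W} \sum_{t < T_0} \Big( Y_{1t} - \sum_{j=2}^{J+1} w_j Y_{jt} \Big)^2 \quad \text{s.t.} \quad w_j \ge 0, \;\; \sum_{j=2}^{J+1} w_j = 1 $$

$$ \hat{\tau}_{1t} = Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt}, \qquad t \ge T_0 $$

The non-negativity and adding-up constraints keep the synthetic control inside the range spanned by the donors, which avoids extrapolation.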

In [None]:
sec("Code Lab: A Simple Synthetic Control Example")

# Generate synthetic panel data
rng = np.random.default_rng(seed=101)
units = ['Treated'] + [f'Control_{i}' for i in range(10)]
time_periods = range(2000, 2021)
synth_df = pd.DataFrame([(u, t) for u in units for t in time_periods], columns=['unit', 'year'])

# Create outcomes with unit-specific levels and trends, plus a treatment effect
unit_effects = pd.Series(rng.normal(10, 5, len(units)), index=units)
unit_trends = pd.Series(rng.uniform(0.5, 2.5, len(units)), index=units)  # one slope per unit, not per row
synth_df['outcome'] = (synth_df['unit'].map(unit_effects)
                       + (synth_df['year'] - 2000) * synth_df['unit'].map(unit_trends))
synth_df.loc[(synth_df['unit'] == 'Treated') & (synth_df['year'] >= 2015), 'outcome'] += 15

pre_treatment_df = synth_df[synth_df['year'] < 2015].pivot(index='year', columns='unit', values='outcome')
X1 = pre_treatment_df.drop('Treated', axis=1) # Donor pool
y1 = pre_treatment_df['Treated'] # Treated unit

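# Find the optimal weights: non-negative, summing to one, minimizing pre-treatment squared error
from scipy.optimize import minimize  # not imported in the setup cell

weights_solver = minimize(lambda w: np.sum((y1 - X1 @ w)**2),
                          x0=np.ones(X1.shape[1])/X1.shape[1],
                          method='SLSQP',
                          constraints=({'type': 'eq', 'fun': lambda w: np.sum(w) - 1}),
                          bounds=[(0,1) for _ in range(X1.shape[1])])
weights = weights_solver.x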

# Create the synthetic control
full_df_pivoted = synth_df.pivot(index='year', columns='unit', values='outcome')
synthetic_control = full_df_pivoted.drop('Treated', axis=1) @ weights

# Plotting
plt.figure(figsize=(12, 8))
plt.plot(full_df_pivoted.index, full_df_pivoted['Treated'], 'b-', lw=3, label='Treated Unit')
plt.plot(synthetic_control.index, synthetic_control, 'r--', lw=3, label='Synthetic Control')
plt.axvline(2014.5, color='k', linestyle=':', label='Treatment')
plt.ylabel('Outcome'); plt.title('Synthetic Control Method'); plt.legend(); plt.show()
note("If the weights fit well, the synthetic control (red dashed line) tracks the treated unit closely in the pre-treatment period. The gap that opens from 2015 onward is the estimated effect of the treatment.")

## 5. Triple Differences (DDD)

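The **Difference-in-Difference-in-Differences (DDD)** estimator is an extension used when the parallel trends assumption might not hold between the main treatment and control groups, but it is plausible that the resulting bias is the same for an affected and an unaffected *sub-group* within those groups. The identifying assumption then becomes that, absent treatment, the gap in trends between the two sub-groups would have been the same in the treatment and control groups.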

**Example:** A state ($T$) implements a job training program, and a neighboring state ($C$) does not. We worry that the two states have different underlying economic trends. However, suppose the training is for manufacturing workers. We can use non-manufacturing workers in both states as an additional layer of control. The DDD logic is:
1.  Calculate the DiD effect for manufacturing workers: $\text{DiD}_{mfg} = (Y_{T,mfg,post} - Y_{T,mfg,pre}) - (Y_{C,mfg,post} - Y_{C,mfg,pre})$.
2.  Calculate a "placebo" DiD for non-manufacturing workers: $\text{DiD}_{non} = (Y_{T,non,post} - Y_{T,non,pre}) - (Y_{C,non,post} - Y_{C,non,pre})$. This captures the differential trend between the two states that is unrelated to the manufacturing-specific program.
3.  The DDD estimate is the difference between these two DiDs: $\hat{\delta}_{DDD} = \text{DiD}_{mfg} - \text{DiD}_{non}$. This differences out the state-level violation of parallel trends, isolating the effect specific to the treated group.

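Estimation is done via a regression with a three-way interaction term, $Y \sim \text{Treat} * \text{Post} * \text{Group}$, where $\text{Group}$ here indicates the affected sub-group (manufacturing workers in the example). The coefficient on the triple interaction $\text{Treat} \times \text{Post} \times \text{Group}$ is the DDD estimate.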
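
The job training example can be simulated to show both routes to the DDD estimate. This is a minimal sketch with assumed numbers: a true effect of 5 for manufacturing workers in the treated state, and a hypothetical `state_trend_gap` of 3 that affects all workers in the treated state and therefore biases the plain DiD. The column `mfg` plays the role of $\text{Group}$.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_cell = 500  # observations per (state, sector, period) cell

cells = [(s, m, p) for s in (0, 1) for m in (0, 1) for p in (0, 1)]
ddd = pd.DataFrame([c for c in cells for _ in range(n_cell)],
                   columns=['treat', 'mfg', 'post'])

true_effect = 5.0
state_trend_gap = 3.0  # treated state grows faster for ALL workers
ddd['y'] = (40 + 4 * ddd['treat'] + 6 * ddd['mfg'] + 2 * ddd['post']
            + state_trend_gap * ddd['treat'] * ddd['post']
            + 1.5 * ddd['mfg'] * ddd['post']
            + true_effect * ddd['treat'] * ddd['mfg'] * ddd['post']
            + rng.normal(0, 2, len(ddd)))

cell_means = ddd.groupby(['treat', 'mfg', 'post'])['y'].mean()


def did_within(mfg):
    """2x2 DiD across states within one sector (1 = manufacturing)."""
    return ((cell_means.loc[(1, mfg, 1)] - cell_means.loc[(1, mfg, 0)])
            - (cell_means.loc[(0, mfg, 1)] - cell_means.loc[(0, mfg, 0)]))


did_mfg = did_within(1)   # contaminated by the state-level trend gap
did_non = did_within(0)   # placebo DiD: picks up only the trend gap
ddd_manual = did_mfg - did_non

ddd_reg = smf.ols('y ~ treat * mfg * post', data=ddd).fit().params['treat:mfg:post']

print(f"DiD, manufacturing only: {did_mfg:.3f}")
print(f"DiD, non-manufacturing:  {did_non:.3f}")
print(f"DDD (manual):            {ddd_manual:.3f}")
print(f"DDD (regression):        {ddd_reg:.3f}")
```

The manufacturing-only DiD absorbs the state-level trend gap, while the triple difference removes it. Because the regression is fully saturated in the three indicators, its triple-interaction coefficient coincides with the manual calculation.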