# Chapter 6.9: Causal Inference: Synthetic Control Methods

---

### Table of Contents

1.  [**The Counterfactual Problem in Case Studies**](#counterfactual-problem)
2.  [**The Idea of Synthetic Control**](#idea-synthetic-control)
3.  [**The Synthetic Control Algorithm**](#algorithm)
    - [Mathematical Formulation](#math-formulation)
4.  [**Case Study: The Economic Costs of Conflict in the Basque Country**](#case-study)
    - [Data and Setup](#data-setup)
    - [Implementation with `pySinc`](#implementation)
    - [Visualizing the Results](#visualizing-results)
5.  [**Inference and Robustness Checks**](#inference)
6.  [**Advantages and Limitations**](#advantages-limitations)
7.  [**Exercises**](#exercises)
8.  [**Summary and Key Takeaways**](#summary)

<a id='counterfactual-problem'></a>
## 1. The Counterfactual Problem in Case Studies

A common challenge in economics and political science is to estimate the causal effect of a large-scale event or policy intervention on a single aggregate unit (e.g., a country, state, or city). For example:
- What was the effect of German reunification on West Germany's GDP?
- What was the impact of California's Proposition 99 (a tobacco tax) on smoking rates?

The fundamental problem is that, after the intervention, we only observe the treated unit's outcome *with* the intervention in place. We do not observe the **counterfactual**: what would have happened to that same unit in the absence of the event? This is the core challenge of causal inference.

Traditional methods like Difference-in-Differences (DiD) compare the treated unit with a single control unit, or an average of control units, and assume that the two would have followed parallel trends in the absence of the intervention. But what if no single unit, or simple average of units, provides a good comparison? This is where the Synthetic Control Method comes in.

<a id='idea-synthetic-control'></a>
## 2. The Idea of Synthetic Control

The Synthetic Control Method, developed by Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010), provides a systematic way to construct a better counterfactual. 

The core idea is to create a **"synthetic" control unit** by taking a weighted average of multiple untreated units (the "donor pool"). The weights are chosen algorithmically to create a synthetic unit that best matches the treated unit's characteristics *before* the intervention. 

This synthetic control then serves as the estimated counterfactual. The causal effect of the intervention is estimated as the difference between the outcome of the treated unit and the outcome of its synthetic counterpart after the intervention.
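The estimator can be written out in symbols. The notation here is introduced for illustration only: $Y_{jt}$ is the outcome of unit $j$ in period $t$, unit 1 is the treated unit, units $2, \dots, J+1$ are the donors, $w_j^*$ are the chosen weights, and $T_0$ is the last pre-intervention period.

$$ \hat{\tau}_{1t} = Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt}, \qquad t > T_0 $$

As a small worked example with made-up numbers: if two donors receive weights $0.6$ and $0.4$ and have outcomes $10$ and $15$ in some post-intervention year, the synthetic outcome is $0.6 \cdot 10 + 0.4 \cdot 15 = 12$. If the treated unit's observed outcome that year is $11$, the estimated effect is $11 - 12 = -1$.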

<a id='algorithm'></a>
## 3. The Synthetic Control Algorithm

<a id='math-formulation'></a>
### Mathematical Formulation

Suppose we have $J+1$ units. Unit 1 is the treated unit, and units $j=2, ..., J+1$ form the donor pool of potential controls. We want to find a set of weights $W = (w_2, ..., w_{J+1})'$ such that:
- $w_j \ge 0$ for all $j$ (no extrapolation)
- $\sum_{j=2}^{J+1} w_j = 1$ (weights sum to one)

The weights are chosen to minimize the distance between the pre-treatment characteristics of the treated unit and the synthetic control. Let $X_1$ be a vector of pre-treatment characteristics for the treated unit, and $X_0$ be a matrix containing the same characteristics for the units in the donor pool. The optimization problem is:
$$ \min_{W} || X_1 - X_0 W ||_V = \sqrt{(X_1 - X_0 W)' V (X_1 - X_0 W)} $$
where $V$ is a positive semidefinite weighting matrix (typically diagonal) that reflects the relative importance of the different pre-treatment characteristics. The choice of $V$ matters: a common data-driven approach chooses $V$ so that the resulting synthetic control minimizes the mean squared prediction error of the outcome variable in the pre-treatment period.
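To make the optimization concrete, here is a minimal sketch of the inner problem (finding $W$ for a fixed $V$) using SciPy's general-purpose SLSQP solver. It is not the implementation used by any particular synthetic control library. The function name `solve_sc_weights`, the default of an identity $V$, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def solve_sc_weights(X1, X0, v=None):
    """Weights on the simplex minimizing (X1 - X0 W)' V (X1 - X0 W).

    X1: (k,) pre-treatment characteristics of the treated unit.
    X0: (k, J) the same characteristics for the J donor units.
    v:  (k,) diagonal of V; defaults to equal importance (an assumption).
    """
    X1 = np.asarray(X1, dtype=float)
    X0 = np.asarray(X0, dtype=float)
    k, J = X0.shape
    v = np.ones(k) if v is None else np.asarray(v, dtype=float)

    def loss(w):
        resid = X1 - X0 @ w
        return float(resid @ (v * resid))

    result = minimize(
        loss,
        x0=np.full(J, 1.0 / J),                 # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,                # w_j >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum to one
        options={"ftol": 1e-12, "maxiter": 1000},
    )
    return result.x


# Toy data: the treated unit is an exact 70/30 mix of the first two donors.
rng = np.random.default_rng(0)
X0_demo = rng.normal(size=(6, 4))
true_w = np.array([0.7, 0.3, 0.0, 0.0])
X1_demo = X0_demo @ true_w

w_hat = solve_sc_weights(X1_demo, X0_demo)
print(np.round(w_hat, 3))
```

In practice the rows of $X_0$ and $X_1$ are usually rescaled first (for example, divided by each predictor's standard deviation) so that no single characteristic dominates the distance simply because of its units.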

<a id='case-study'></a>
## 4. Case Study: The Economic Costs of Conflict in the Basque Country

We will replicate the seminal study by Abadie and Gardeazabal (2003), which estimated the economic impact of terrorism and political conflict in the Basque Country.

<a id='data-setup'></a>
### Data and Setup
- **Treated Unit:** Basque Country
- **Donor Pool:** The other Spanish regions (e.g., Catalonia, Madrid). Donor units should be similar to the treated unit and not themselves directly affected by the intervention, which is why other countries are not used here.
- **Outcome Variable:** Per capita GDP.
- **Intervention:** The onset of major terrorist conflict, roughly in the mid-1970s.
- **Pre-treatment Predictors:** Per capita GDP in previous years, investment rates, population density, sector shares (industry, agriculture, services).

<a id='implementation'></a>
### Implementation with `pySinc`
We will use the `pySinc` library, which is a Python implementation of the synthetic control method. First, let's install it.

In [None]:
!pip install pysinc

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pysinc.Sinc import Sinc

# Load the dataset from the original paper's replication files
# This data is widely available online, e.g., from the Harvard Dataverse
try:
    basque_df = pd.read_csv('https://raw.githubusercontent.com/matteopack/pysinc/master/pysinc/datasets/basque_data.csv')
    DATA_LOADED = True
except Exception as e:
    print(f"Could not load data: {e}")
    DATA_LOADED = False

if DATA_LOADED:
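    # --- Data Preparation ---
    # Reshape to wide format: one row per year, one column per region code
    pivot_df = basque_df.pivot_table(index='year', columns='regionno', values='gdpcap')
    # Name the two codes we refer to directly; the other regions keep their numeric codes
    pivot_df = pivot_df.rename(columns={1: 'Spain', 17: 'Basque Country'})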
    
    # Define treated unit and donor pool
    treated_unit = 'Basque Country'
    control_units = [col for col in pivot_df.columns if col != treated_unit and col != 'Spain'] # Exclude total Spain
    
    # Keep only the pre-intervention years; the weights are fitted on this window
    pre_treatment_df = pivot_df[pivot_df.index <= 1975]
    
    # --- Run Synthetic Control ---
    synth = Sinc(pre_treatment_df, treated_unit, control_units)
    weights = synth.get_weights()
    
    # Apply the weights to the full sample (pre- and post-intervention years) to build the synthetic series
    synthetic_basque = pivot_df[control_units].dot(weights)
    
    # Combine results into a new DataFrame
    results_df = pd.DataFrame({
        'Real GDP per Capita (Basque Country)': pivot_df[treated_unit],
        'Synthetic Basque Country': synthetic_basque
    })
    
    print("Optimal Weights for Synthetic Control:")
    # Display weights that are greater than a small threshold
    print(weights[weights > 0.01].round(3))

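In Abadie and Gardeazabal (2003), the synthetic Basque Country is a combination of Catalonia and Madrid. This is intuitive, as they were the other two major industrial regions of Spain. If the fit above reproduces that result, those two regions should carry most of the weight. Note that only Spain and the Basque Country were renamed, so the donor regions in `control_units` are identified by their numeric codes.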

<a id='visualizing-results'></a>
### Visualizing the Results
The most powerful output of a synthetic control analysis is the plot showing the trajectory of the treated unit versus its synthetic counterfactual.

In [None]:
if DATA_LOADED:
    plt.figure(figsize=(14, 9))
    plt.plot(results_df.index, results_df['Real GDP per Capita (Basque Country)'], 'b-', label='Basque Country')
    plt.plot(results_df.index, results_df['Synthetic Basque Country'], 'r--', label='Synthetic Basque Country')
    
    plt.axvline(x=1975, linestyle=':', color='k', label='Onset of Terrorism (1975)')
    plt.ylabel('GDP per Capita')
    plt.title('The Economic Costs of Conflict in the Basque Country')
    plt.legend()
    plt.grid(True)
    plt.show()
    
    # Plot the gap (the estimated treatment effect)
    gap = results_df['Real GDP per Capita (Basque Country)'] - results_df['Synthetic Basque Country']
    plt.figure(figsize=(14, 9))
    plt.plot(gap.index, gap, 'g-')
    plt.axvline(x=1975, linestyle=':', color='k')
    plt.axhline(y=0, linestyle='-', color='k')
    plt.ylabel('Gap in GDP per Capita')
    plt.title('Difference between Basque Country and Synthetic Control')
    plt.grid(True)
    plt.show()

The plots are striking. Before the mid-1970s, the synthetic Basque Country tracks the real one closely, which gives us some confidence that it is a reasonable counterfactual. After the onset of terrorism, a gap appears and widens over time, suggesting that the conflict had a sizable and persistent negative effect on the region's economic growth. Abadie and Gardeazabal (2003) report that per capita GDP in the Basque Country declined by about 10 percentage points relative to the synthetic control. Whether a gap of this size could have arisen by chance is the subject of the next section.
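The visual impression can be backed up with two numbers: the root mean squared prediction error (RMSPE) before the intervention, which measures fit quality, and the average gap afterwards. The sketch below uses a hypothetical helper, `fit_summary`, and invented series (not the Basque data) to show the calculation. With the real data you would pass the two columns of `results_df` instead.

```python
import numpy as np
import pandas as pd


def fit_summary(actual, synthetic, last_pre_year):
    """Pre-treatment RMSPE and average post-treatment gap (level and percent)."""
    gap = actual - synthetic
    pre = gap.loc[:last_pre_year]           # label-based slice, inclusive
    post = gap.loc[last_pre_year + 1:]
    return {
        "pre_rmspe": float(np.sqrt((pre ** 2).mean())),
        "mean_post_gap": float(post.mean()),
        "mean_post_gap_pct": float((post / synthetic.loc[post.index]).mean() * 100),
    }


# Invented illustration data: close fit through 1974, widening gap afterwards.
years = pd.Index(range(1970, 1980), name="year")
actual = pd.Series([10.0, 10.5, 11.0, 11.5, 12.0, 12.0, 12.2, 12.4, 12.6, 12.8], index=years)
synthetic = pd.Series([10.1, 10.4, 11.0, 11.6, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5], index=years)

summary = fit_summary(actual, synthetic, last_pre_year=1974)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A small pre-treatment RMSPE relative to the post-treatment gap is what the plots above show graphically.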

<a id='inference'></a>
## 5. Inference and Robustness Checks

How do we know if this gap is statistically significant? The most common method is a **placebo test** or **permutation test**.

The procedure is as follows:
1.  Run the synthetic control analysis on every unit in the donor pool, pretending each one is the treated unit.
2.  This gives us a distribution of estimated "gaps" for all the untreated units.
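3.  Compare the gap for the actual treated unit (Basque Country) to this distribution. If the gap for the Basque Country is much larger than the gaps for the placebo units, the effect is unlikely to be due to chance. The share of units whose gap is at least as extreme as the treated unit's can be read as a permutation p-value.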

This is a powerful, non-parametric way to conduct inference.
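The procedure above can be sketched on simulated data. Everything in this example is an assumption made for illustration: the panel is generated from a single common factor, the weights are fitted on pre-treatment outcomes only, and the test statistic is the ratio of post- to pre-treatment RMSPE, which is one common choice because it discounts placebo units with a poor pre-treatment fit. The truly treated unit is left out of the placebo donor pools so that its effect does not contaminate them.

```python
import numpy as np
from scipy.optimize import minimize


def fit_weights(y_pre, donors_pre):
    """Simplex-constrained least squares on pre-treatment outcomes."""
    J = donors_pre.shape[1]
    res = minimize(
        lambda w: float(np.sum((y_pre - donors_pre @ w) ** 2)),
        x0=np.full(J, 1.0 / J),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x


def rmspe_ratio(Y, unit, n_pre, exclude=()):
    """Post/pre RMSPE ratio when column `unit` of Y (T x N) plays the treated unit."""
    donors = np.delete(Y, [unit, *exclude], axis=1)
    w = fit_weights(Y[:n_pre, unit], donors[:n_pre])
    gap = Y[:, unit] - donors @ w
    pre_rmspe = np.sqrt(np.mean(gap[:n_pre] ** 2))
    post_rmspe = np.sqrt(np.mean(gap[n_pre:] ** 2))
    return post_rmspe / pre_rmspe


# Simulated panel: 30 periods, 12 units, intervention after period 20.
rng = np.random.default_rng(42)
T, N, n_pre = 30, 12, 20
factor = np.cumsum(rng.normal(0.5, 0.2, size=T))            # common trend
loadings = np.concatenate([[1.0], rng.uniform(0.5, 1.5, size=N - 1)])
Y = np.outer(factor, loadings) + rng.normal(0.0, 0.1, size=(T, N))
Y[n_pre:, 0] -= 3.0                                          # true effect on unit 0

# Unit 0 is the treated unit; every other unit takes a turn as a placebo.
ratios = np.array(
    [rmspe_ratio(Y, 0, n_pre)]
    + [rmspe_ratio(Y, j, n_pre, exclude=(0,)) for j in range(1, N)]
)
p_value = np.mean(ratios >= ratios[0])   # share of units at least as extreme
print(f"treated ratio: {ratios[0]:.1f}, largest placebo ratio: {ratios[1:].max():.1f}")
print(f"permutation p-value: {p_value:.3f}")
```

With $N$ units in total, the smallest attainable p-value is $1/N$, so a small donor pool limits how strong the evidence from a placebo test can be.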

<a id='advantages-limitations'></a>
## 6. Advantages and Limitations

**Advantages:**
- **Transparent:** It's easy to see which units are contributing to the synthetic control.
- **Data-Driven:** The weights are chosen by an explicit optimization, reducing researcher discretion in picking control groups. The researcher still chooses the donor pool and the predictors.
- **No Extrapolation:** By construction, the synthetic control is a convex combination of donor units.
- **Falsifiable:** The quality of the pre-treatment fit is a clear diagnostic test.

**Limitations:**
- **Requires Good Pre-treatment Fit:** If no combination of donor units can match the treated unit before the intervention, the method is not credible.
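- **Convex Hull:** Because the weights are non-negative and sum to one, the treated unit's characteristics must lie within (or close to) the "convex hull" of the donor units' characteristics. A treated unit with the highest or lowest value of a predictor cannot be matched exactly.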
- **Interpolation, not Extrapolation:** It cannot estimate the effect of an intervention that is entirely outside the historical experience of the donor pool.
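The convex hull condition can be checked directly before fitting. The sketch below poses it as a linear feasibility problem with `scipy.optimize.linprog`: does a weight vector exist with $w_j \ge 0$, $\sum_j w_j = 1$, and $X_0 W = X_1$? The helper name `in_convex_hull` and the toy points are assumptions for illustration. With real, noisy data an exact match is rare, so this is best read as a diagnostic rather than a pass/fail rule.

```python
import numpy as np
from scipy.optimize import linprog


def in_convex_hull(x1, X0):
    """True if x1 (k,) equals X0 @ w for some w >= 0 with sum(w) == 1.

    X0 has shape (k, J): one column of characteristics per donor unit.
    """
    k, J = X0.shape
    A_eq = np.vstack([X0, np.ones((1, J))])   # match characteristics and sum to one
    b_eq = np.append(x1, 1.0)
    res = linprog(
        c=np.zeros(J),                         # pure feasibility problem, no objective
        A_eq=A_eq,
        b_eq=b_eq,
        bounds=[(0, None)] * J,
        method="highs",
    )
    return bool(res.success)


# Four donors at the corners of the unit square (two characteristics each).
X0_demo = np.array([[0.0, 1.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 1.0]])

inside = in_convex_hull(np.array([0.5, 0.5]), X0_demo)   # centre of the square
outside = in_convex_hull(np.array([1.5, 0.5]), X0_demo)  # beyond every donor on axis 1
print(inside, outside)
```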

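### Case Study 2: California's Proposition 99

As a second example, we apply the same workflow to California's Proposition 99, the tobacco tax mentioned in Section 1. The outcome is per-capita cigarette sales, the donor pool consists of the other states in the dataset, and the years up to and including 1988 form the pre-treatment period.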

In [None]:
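# Minimal stand-ins for the `sec` and `note` display helpers, which are not defined in this chapter
def sec(title):
    print(f"\n=== {title} ===")

def note(message):
    print(message)

sec("Data Loading and Preparation for Prop 99")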
try:
    # Attempt to load the data from a URL
    url = 'https://raw.githubusercontent.com/matheusfacure/python-causality-handbook/master/causality/data/smoking.csv'
    df_smoking = pd.read_csv(url)
except Exception as e:
    # Fallback or error message if download fails
    note(f"Could not download data. Error: {e}. Please check your internet connection.")
    df_smoking = pd.DataFrame() # Create an empty DataFrame to avoid further errors

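if not df_smoking.empty:
    # Wide format: one row per year, one column per state
    # Assumption: some copies of this dataset code states as integers, with California as state 3.
    # Keys that are not present in the columns are ignored by rename, so both cases are handled.
    pivoted_smoking_df = (
        df_smoking.pivot_table(index='year', columns='state', values='cigsale')
        .rename(columns={'California': 'CA', 3: 'CA'})
    )
    note("Smoking data loaded and prepared.")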

In [None]:
if not df_smoking.empty:
    sec("Solving for Synthetic California")
    donor_pool_ca = pivoted_smoking_df.drop(columns='CA')
    synth_ca_model = Sinc(pivoted_smoking_df[pivoted_smoking_df.index <= 1988], 'CA', list(donor_pool_ca.columns))
    weights_ca = synth_ca_model.get_weights()
    synthetic_california = donor_pool_ca.dot(weights_ca)
    
    note("Optimal weights for Synthetic California. Top 5 contributing states:")
    top_states_ca = pd.Series(weights_ca, index=donor_pool_ca.columns).sort_values(ascending=False).head()
    display(top_states_ca)

In [None]:
if not df_smoking.empty:
    sec("Visualizing the Prop 99 Treatment Effect")
    plt.figure(figsize=(12, 8))
    plt.plot(pivoted_smoking_df.index, pivoted_smoking_df['CA'], label='California', color='black')
    plt.plot(pivoted_smoking_df.index, synthetic_california, label='Synthetic California', color='red', linestyle='--')
    
    plt.axvline(x=1988, color='gray', linestyle=':', label='Proposition 99 (1988)')
    
    plt.title('Per-Capita Cigarette Sales: California vs. Synthetic California')
    plt.ylabel('Cigarette Sales (per capita)')
    plt.xlabel('Year')
    plt.legend()
    plt.show()

<a id='exercises'></a>
## 7. Exercises

1.  **Examine the Weights:** In the Basque Country example, which two regions make up the vast majority of the synthetic control? Why is this economically intuitive?

2.  **Changing the Donor Pool:** Rerun the analysis, but exclude Catalonia (region 10 in the standard replication data) from the donor pool. How do the weights and the pre-treatment fit change? Is the new synthetic control as good as the original?

3.  **Placebo Test:** Pick a different region from the donor pool (e.g., Madrid, region 14 in the standard replication data) and pretend it was the treated unit. Run the synthetic control analysis for this placebo unit. Plot the gap for the placebo unit. Is it as large as the gap for the Basque Country?

4.  **Read the Original Paper:** Read Abadie, Diamond, and Hainmueller (2010), "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program." What are the key predictors they use to create a synthetic California? How do they perform inference?

<a id='summary'></a>
## 8. Summary and Key Takeaways

The Synthetic Control Method is a powerful and increasingly popular tool for causal inference in settings with a small number of aggregate units.

**Key Takeaways:**
- It addresses the fundamental problem of finding a good counterfactual in comparative case studies.
- It creates a **synthetic control** as a weighted average of untreated units from a donor pool.
- The weights are chosen to match the treated unit's characteristics **before** the intervention.
- The treatment effect is the difference between the treated unit and its synthetic version **after** the intervention.
- The quality of the pre-treatment fit is a crucial diagnostic for the method's validity.