<a href="https://colab.research.google.com/github/MLMario/mariogj1987/blob/main/A_Clear_Primer_on_Synthetic_Controls_Part_1_%5BWIP%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

The Synthetic Control Method (SCM) is a statistical methodology used primarily for evaluating the impact of interventions, particularly in economics, policy evaluation, and the social sciences. It provides a data-driven way to construct an 'imaginary' or synthetic control unit that closely approximates the treated unit before the intervention. This allows us to ask the question: in the absence of treatment, what would the outcome variable have looked like for the treated unit?

### **But why should you care as a data scientist?**

Understanding and implementing the Synthetic Control Method is valuable for data scientists in the tech industry, particularly those involved in experimentation and product development. In many situations, conventional A/B testing is impractical or problematic because of network effects, externalities, or ethical constraints. This is where the Synthetic Control Method shines. It enables the design of quasi-experiments in which a synthetic unit, constructed as a convex combination of untreated units, serves as the control, and the treatment unit is an average of treated units that are representative of a broader group. In short, it is a great option for data scientists who need to estimate causal impact in constrained environments.


# A real-world application

[In the study by Abadie and Gardeazabal (2003)](https://economics.mit.edu/sites/default/files/publications/The%20Economic%20Costs%20of%20Conflict.pdf), the Synthetic Control Method (SCM) was employed to assess the economic impacts of the conflict in the Basque Country, linked primarily to the terrorist activities of the ETA group.

The authors constructed a synthetic control, using a weighted combination of other Spanish regions as potential controls, to represent what the economic trajectory of the Basque Country might have been in the absence of conflict.
The study focused on GDP per capita as the outcome variable and used several economic and demographic characteristics as predictor variables. It successfully found a 'synthetic' Basque Country that, before the onset of terrorism (the "intervention"), closely resembles the real one both in GDP per capita (the outcome variable) and in demographic characteristics (the predictor variables).

<div>
<img src="https://drive.google.com/uc?export=view&id=1kZhCAZvAoFNBj4f6rdEzzQeguePtZAKF" width="650"/>
</div>

The results indicated a substantial and persistent divergence in GDP per capita between the Basque Country and its synthetic counterpart after the onset of terrorism in the 1970s. By 1998, it was estimated that the GDP per capita in the Basque Country was about 10% lower than it would have been without the terrorist conflict.

<div>
<img src="https://drive.google.com/uc?export=view&id=1G9wVXXWUfESvLyHbZS3NX498IuO-rf8G" width="650"/>
</div>


This study underscored the utility of SCM in quantifying the economic repercussions of conflicts, offering empirical insights into the prolonged economic damages stemming from political unrest and terrorism, while also highlighting the need for cautious interpretation due to potential limitations and unobserved confounders.

# The Objective

Mathematically, we are trying to understand the average treatment effect for a given outcome $Y$ by comparing:

$ATE = Y^{(1)} _ {t, post} - Y^{(0)} _ {t, post}$

Where the superscript denotes whether the treatment unit $t$ received treatment (1) or not (0).

The challenge lies in the fact that typically, the counterfactual outcome for treated units remains unobserved, meaning, we lack insights into what would have occurred to them in the absence of treatment.

For the outcome variable, we can observe four scenarios, given treated and control units and the pre- and post-treatment periods:

$$
Y =
\begin{bmatrix}
Y^{(1)} _ {t, post} \ & Y^{(0)} _ {c, post} \newline
Y^{(0)} _ {t, pre} \ & Y^{(0)} _ {c, pre}
\end{bmatrix}
$$

The good news is that if we have access to similar control units that have not received the treatment, we can, as mentioned above, use them to construct a synthetic control that estimates $Y^{(0)} _ {t, post}$.
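To make that last idea concrete, here is a minimal sketch with made-up numbers. The weights are placeholders assumed purely for illustration (how to find them is a separate question), and the variable names are my own, not part of any library.

```python
import numpy as np

# Hypothetical post-treatment outcomes for three control units (never treated)
y_controls_post = np.array([4.0, 5.0, 3.0])

# Observed post-treatment outcome of the treated unit, i.e. Y^(1)_{t,post}
y_treated_post = 5.0

# Placeholder weights, assumed here for illustration only
weights = np.array([0.5, 0.5, 0.0])

# Synthetic control: weighted combination of the controls,
# used as an estimate of the unobserved Y^(0)_{t,post}
y_synthetic_post = y_controls_post @ weights

# Estimated effect: observed outcome minus estimated counterfactual
estimated_effect = y_treated_post - y_synthetic_post
print(estimated_effect)  # 0.5
```

With these numbers the synthetic control predicts 4.5, so the estimated effect is 5 - 4.5 = 0.5.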


# How to calculate the SC. An example without time dimension

Assuming you are familiar with basic matrix algebra and algorithms, I will explain step by step how to get a synthetic control, using a simplified SCM scenario that leaves out the time dimension.

## Step 1: Define the Problem and Assemble Data

Identify the treated unit (the entity undergoing intervention) and potential control units (entities without intervention). Accumulate outcome and predictor variables for these units. Predictor variables are essential covariates believed to influence the outcome.

**Note:** The treatment units here are taken as given. This is an assumption that will be present in all extensions of this method, up to the case where we actually want to design an experiment.
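As a rough sketch of what "assembling the data" can look like, the snippet below arranges predictor values into the two arrays used in the next step. The unit names, predictor names, and values are all invented for illustration; the only point is the layout, with one row per predictor and one column per control unit.

```python
import numpy as np

# Hypothetical units and predictor values, invented for illustration
predictor_names = ["predictor_1", "predictor_2"]
units = {
    "treated_unit": [2.5, 10.0],
    "control_a": [2.0, 9.0],
    "control_b": [3.0, 11.0],
    "control_c": [5.0, 14.0],
}
treated_name = "treated_unit"
control_names = [name for name in units if name != treated_name]

# X_1: one value per predictor for the treated unit, shape (k,)
X_1 = np.array(units[treated_name])

# X_0: one row per predictor, one column per control unit, shape (k, J)
X_0 = np.column_stack([units[name] for name in control_names])

print(X_1.shape, X_0.shape)  # (2,) (2, 3)
```

Keeping the controls as columns means the product `X_0 @ W` returns one synthetic value per predictor, which is the shape needed to compare against `X_1`.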

## Step 2: Calculate an initial set of weights $W$

The crux is to find a vector of weights, $W$ , ensuring that the weighted combination of control units closely resembles the treated unit regarding predictor and outcome variables before the intervention. To get to this, we will start by first decreasing the distance between predictor variables.

For two entities, treated and control, with predictor variables denoted by
$X$ matrices, we formulate the objective as:

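$\min_W \ (X_1 - X_0W)^T V (X_1 - X_0W)$

subject to $W_i \ge 0$ for every control unit $i$ and $\sum_{i=1}^{J} W_i = 1$

This objective is the squared length of $V^{1/2}(X_1 - X_0W)$.

Where:

$X_1$ is the vector of predictor variables for the treated unit.

$X_0$ is the matrix of predictor variables for the control units, with one column per control unit.

$V$ is a diagonal matrix determining the importance of each predictor variable to estimate output $Y$.

$W$ is the vector of weights, one for each of the $J$ control units.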


The constraints on the weights are there to help avoid overfitting and to maintain interpretability. For example, "Country A = 2 × Country B − 1 × Country C" does not quite make sense in that context. Having said that, this restriction can be relaxed depending on the Synthetic Control Method we are using.
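Here is one possible way to solve this constrained problem, as a sketch rather than the definitive implementation. I use projected gradient descent with a projection onto the simplex, which is my own choice of solver for the illustration; a general-purpose constrained optimizer would work as well. The function names `project_to_simplex` and `solve_weights` are hypothetical helpers, and the toy values are chosen so that the treated unit is exactly the average of the first two controls.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def solve_weights(X1, X0, V, n_iter=5000):
    """Approximately minimize (X1 - X0 w)^T V (X1 - X0 w) over the simplex."""
    A = X0.T @ V @ X0                          # quadratic part of the objective
    b = X0.T @ V @ X1                          # linear part of the objective
    step = 1.0 / (2.0 * np.linalg.norm(A, 2))  # 1 / Lipschitz constant of the gradient
    w = np.full(X0.shape[1], 1.0 / X0.shape[1])  # start from equal weights
    for _ in range(n_iter):
        grad = 2.0 * (A @ w - b)
        w = project_to_simplex(w - step * grad)
    return w

# Toy predictors: 4 predictors (rows) for 3 control units (columns)
X0 = np.array([
    [2.0, 3.0, 5.0],
    [2.5, 3.5, 4.5],
    [3.0, 4.0, 4.0],
    [3.5, 4.5, 3.5],
])
X1 = np.array([2.5, 3.0, 3.5, 4.0])  # treated unit
V = np.eye(4)                        # equal importance for every predictor

w = solve_weights(X1, X0, V)
print(np.round(w, 3))
```

On this toy data an exact fit exists at weights (0.5, 0.5, 0), so the solver should land very close to it. The weights stay non-negative and sum to one by construction, because every iterate is projected back onto the simplex.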

## Step 3: Iterative Adjustment of $V$ and re-calculation of $W$

Minimizing this distance brings the synthetic control close to the treated unit in terms of pre-treatment predictor variables, but in itself it does not guarantee a close match on the treated unit's outcome variable $Y$ during the pre-treatment period.

To get the synthetic control to match the treated unit in terms of both predictor variables and the outcome variable during the pre-treatment period, we take an iterative approach.

*Here's the procedure:*

1) Start by finding weights $W$ that minimize the aforementioned distance, so the synthetic control matches the predictor variables of the treated unit.
Check how well the synthetic control matches the treated unit's outcome variable $Y$ during the pre-treatment period.

2) If the match is not within an acceptable error range, adjust the diagonal matrix $V$ to give more weight to predictor variables that seem to be influential in improving the match in the outcome variable. Then, recompute $W$.

3) Repeat the above steps until you achieve a satisfactory match in the outcome variable $Y$ during the pre-treatment period.

So, while the distance formula does not explicitly contain $Y$, the iterative procedure ensures that the synthetic control matches $Y$ closely during the pre-treatment period by adjusting the importance of the predictor variables (through $V$) based on how they impact the match in $Y$.
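The procedure above can be sketched as a nested search. To keep the example short, I replace the manual "adjust $V$" step with a plain random search over candidate diagonals, which is a simplification of my own and not necessarily how a production implementation tunes $V$. All data values and the helper names (`solve_weights`, `outcome_mse`, `fit_synthetic_control`) are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1), 0.0)

def solve_weights(X1, X0, V, n_iter=1000):
    """Inner step: approximate W for a fixed V (projected gradient descent)."""
    A, b = X0.T @ V @ X0, X0.T @ V @ X1
    step = 1.0 / (2.0 * np.linalg.norm(A, 2))
    w = np.full(X0.shape[1], 1.0 / X0.shape[1])
    for _ in range(n_iter):
        w = project_to_simplex(w - step * 2.0 * (A @ w - b))
    return w

def outcome_mse(w, Y1_pre, Y0_pre):
    """How well the synthetic control tracks Y before treatment."""
    return float(np.mean((Y1_pre - Y0_pre @ w) ** 2))

def fit_synthetic_control(X1, X0, Y1_pre, Y0_pre, n_candidates=50, seed=0):
    """Outer step: try several V, keep the one whose W best matches Y."""
    rng = np.random.default_rng(seed)
    k = X0.shape[0]
    candidates = [np.ones(k) / k]  # equal importance is always tried first
    candidates += list(rng.dirichlet(np.ones(k), size=n_candidates))
    best = None
    for v_diag in candidates:
        w = solve_weights(X1, X0, np.diag(v_diag))
        err = outcome_mse(w, Y1_pre, Y0_pre)
        if best is None or err < best[0]:
            best = (err, v_diag, w)
    return best

# Made-up predictors: 3 predictors (rows) for 3 control units (columns)
X0 = np.array([[2.0, 3.0, 5.0], [10.0, 12.0, 9.0], [0.3, 0.5, 0.4]])
X1 = np.array([2.6, 11.0, 0.45])
# Made-up pre-treatment outcomes: 4 periods (rows) for 3 control units (columns)
Y0_pre = np.array([[2.0, 3.0, 5.0], [2.5, 3.5, 4.5], [3.0, 4.0, 4.0], [3.5, 4.5, 3.5]])
Y1_pre = np.array([2.5, 3.0, 3.5, 4.0])

baseline_w = solve_weights(X1, X0, np.diag(np.ones(3) / 3))
baseline_err = outcome_mse(baseline_w, Y1_pre, Y0_pre)
best_err, best_v, best_w = fit_synthetic_control(X1, X0, Y1_pre, Y0_pre)
print(baseline_err, best_err)
```

Because equal importance is always among the candidates, the selected $V$ can only match or improve on the pre-treatment fit obtained with equal importance. Rescaling $V$ by a constant does not change the optimal $W$, which is why the candidates are normalized to sum to one.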


In [None]:
import numpy as np

# Sample data
# Rows correspond to time, columns to different control units
# Last column is the treated unit
data = np.array([
    [2, 3, 5, 2.5],
    [2.5, 3.5, 4.5, 3],
    [3, 4, 4, 3.5],
    [3.5, 4.5, 3.5, 4],  # <-- Treatment happens after this time
    [4, 5, 3, 5]
])

# Splitting data into pre-treatment and post-treatment
pre_treatment_data = data[:-1]
post_treatment_data = data[-1]

# X_1: predictor variables for treated unit (using pre-treatment data)
X_1 = pre_treatment_data[:, -1]

# X_0: predictor variables for control units
X_0 = pre_treatment_data[:, :-1]

# V: diagonal matrix assigning importance to each predictor variable
# Initially, we'll start with the identity matrix (equal importance)
V = np.eye(X_1.shape[0])
