# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find the method useful for your problems of interest.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Control Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin by generating artificial data for the demonstration using the function random_rct. All the data can be captured in a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
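
To make these terms concrete, here is a minimal sketch of how a noisy rank-$r$ tensor of shape $N \times T \times I$ could be built with NumPy. It is not the implementation of random_rct: the helper name `low_rank_tensor`, the factor model (a sum of $r$ outer products), and the use of $\sigma$ as a standard deviation are all assumptions made for illustration.

```python
import numpy as np

def low_rank_tensor(N, T, I, r, sigma, seed=0):
    """Hypothetical helper: N x T x I tensor of rank at most r, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(N, r))  # latent factors for units
    V = rng.normal(size=(T, r))  # latent factors for time steps
    W = rng.normal(size=(I, r))  # latent factors for interventions
    # M[n, t, i] = sum_k U[n, k] * V[t, k] * W[i, k]
    M = np.einsum("nk,tk,ik->nti", U, V, W)
    # Assumption: sigma is passed to NumPy as a standard deviation;
    # if sigma denotes a variance, use np.sqrt(sigma) instead.
    noise = rng.normal(scale=sigma, size=M.shape)
    return M + noise

M = low_rank_tensor(N=10, T=1000, I=4, r=3, sigma=0.0)
print(M.shape)  # (10, 1000, 4)
# With sigma = 0, flattening the time and intervention axes gives a
# matrix whose rank is at most r.
print(np.linalg.matrix_rank(M.reshape(10, -1)))
```

With $\sigma = 0$ the unit-by-(time, intervention) unfolding has rank at most $r$, which is the structure that lets measurements of one unit be expressed as a linear combination of the others.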

In [2]:
# Generate Artificial Data

# Number of Units
N = 10
# Number of Interventions
I = 4
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 1000
# Number of Pre-Intervention Time Steps
T0 = 800
# Model Complexity
rank = 3
# Noise in System
sigma = 0

rct_data = random_rct(N, I, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by calling the function $\textbf{random_rct}$ is comprised of two dataframes: pre_df and post_df.

pre_df holds a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It contains the measurements of all units before any intervention is applied.

post_df holds a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It contains the measurements of each unit $n \in [N]$ under the intervention it actually received (i.e., what was observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df has to have experienced an intervention. Further, a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, we illustrate on artificial data the case where each unit $n \in [N]$ receives exactly one intervention in the post-intervention phase.)
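
As the printed outputs in this notebook show, each dataframe has one row per unit, two label columns (unit and intervention), and one column per time step named t_0, t_1, and so on. This explains why the reported shapes are (10, 802) and (10, 202) rather than (10, 800) and (10, 200). The sketch below uses a small hand-made dataframe with the same layout (the values and the name `toy_pre` are made up for illustration) to show one way to pull out the numeric measurement matrix.

```python
import pandas as pd

# Toy dataframe mimicking the layout of pre_df: label columns plus time columns.
toy_pre = pd.DataFrame({
    "unit": ["id_0", "id_1"],
    "intervention": ["inter_0", "inter_0"],
    "t_0": [0.1, 0.2],
    "t_1": [0.3, 0.4],
})

# Keep only the measurement columns, i.e. those named t_<index>.
time_cols = [c for c in toy_pre.columns if c.startswith("t_")]
values = toy_pre[time_cols].to_numpy()

print(time_cols)     # ['t_0', 't_1']
print(values.shape)  # (2, 2)
```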

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(10, 802)


Unnamed: 0,unit,intervention,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,...,t_790,t_791,t_792,t_793,t_794,t_795,t_796,t_797,t_798,t_799
0,id_0,inter_0,-0.028001,-1.672346,0.785791,0.41254,0.256327,-5.130559,-3.293334,-4.731656,...,-3.717539,0.714784,1.50788,-5.747853,-1.291651,-3.34471,2.484742,-1.592308,-3.99208,2.341859
1,id_1,inter_0,0.63125,-4.05984,-0.547637,-1.06521,-2.412615,-9.984155,-4.979013,-5.168403,...,-2.682743,-0.286003,0.286701,-8.004713,-1.518615,-7.918738,-3.157394,-4.351016,-7.607643,2.559859
2,id_2,inter_0,-0.025216,-2.937749,-0.795601,-0.772372,-1.003184,-5.047185,-3.976838,-3.934493,...,-2.260701,-0.005648,0.951212,-4.385417,-1.740216,-4.292551,-0.843093,-2.017682,-3.441745,1.534022
3,id_3,inter_0,0.189137,-1.407847,0.239735,-0.059737,-0.497705,-4.170614,-2.11999,-2.707938,...,-1.830125,0.172131,0.488921,-3.885115,-0.669021,-2.999539,0.056772,-1.611893,-3.270764,1.405059
4,id_4,inter_0,-0.358377,1.0648,2.366081,1.902956,2.466591,-0.92985,-0.785466,-3.460734,...,-4.039066,1.513007,2.023903,-3.610032,-0.39731,0.857071,6.97697,0.686879,-1.164723,2.017232
5,id_5,inter_0,-0.230486,-6.8694,-3.979237,-3.154395,-3.356711,-7.359712,-7.820925,-5.33734,...,-1.57043,-1.068007,0.993948,-4.549061,-3.850093,-8.036383,-6.082234,-3.563262,-4.00657,1.137046
6,id_6,inter_0,-0.214989,4.290266,4.342199,3.437759,4.039212,2.221214,2.850022,-1.096719,...,-3.430587,2.044565,1.562361,-1.783573,1.439477,4.466408,9.911432,2.247578,0.434555,1.599441
7,id_7,inter_0,-0.157444,-1.353691,-0.651702,-0.468389,-0.343566,-1.362143,-1.811727,-1.545073,...,-0.820099,-0.035313,0.539086,-1.162566,-0.958731,-1.419274,-0.336606,-0.525103,-0.686867,0.416362
8,id_8,inter_0,0.087208,-1.01726,0.383658,0.128087,-0.121245,-3.256535,-1.775125,-2.476702,...,-1.842198,0.288293,0.619722,-3.324843,-0.606105,-2.209286,0.777006,-1.144356,-2.572241,1.27977
9,id_9,inter_0,0.605894,-2.133175,0.064287,-0.473676,-1.609608,-6.716429,-2.505154,-2.8533,...,-1.469741,-0.193656,-0.178066,-5.335076,-0.461323,-5.072677,-2.187124,-2.976994,-5.37999,1.67372


In [5]:
print(post_df.shape)
post_df.head(10)

(10, 202)


Unnamed: 0,unit,intervention,t_800,t_801,t_802,t_803,t_804,t_805,t_806,t_807,...,t_990,t_991,t_992,t_993,t_994,t_995,t_996,t_997,t_998,t_999
0,id_0,inter_0,-0.340655,3.218854,-4.649053,0.669246,-0.563896,1.129696,1.456061,-2.156759,...,-2.281156,-3.906005,-2.275219,-2.878416,-1.334093,-1.473696,-4.364125,-2.276766,-3.577582,0.325634
1,id_1,inter_2,-2.200089,-3.791027,-14.777304,-11.609694,-12.393693,-9.57686,-9.601843,-14.121227,...,-9.525813,-14.481275,-16.761161,-11.700043,-2.468019,-6.346053,-8.58004,-6.261735,-17.124726,-12.772743
2,id_2,inter_3,4.880946,-1.089231,1.006642,0.553514,-0.78522,-1.698557,1.480665,0.64361,...,0.006048,3.633549,0.175606,1.173266,6.800956,1.529371,3.781292,1.85546,2.092107,-4.583523
3,id_3,inter_3,3.601492,-0.103262,-1.300959,0.048853,-1.393599,-1.47408,1.101197,-0.827246,...,-1.132391,0.980282,-1.35837,-0.503082,4.798126,0.449637,1.256114,0.474943,-0.25903,-4.118802
4,id_4,inter_3,8.416527,3.983248,3.220128,7.342559,4.117467,3.327216,8.78129,5.316471,...,1.914118,8.140761,5.717785,4.110887,10.403998,3.753319,4.948484,3.265832,7.375337,-0.662926
5,id_5,inter_1,10.329899,-1.814334,11.356604,6.308153,4.81234,0.897482,6.669617,9.110477,...,5.76318,15.734971,9.574302,9.289482,14.536365,6.658563,13.387354,7.628992,13.975711,-2.52906
6,id_6,inter_0,3.506869,7.459592,-1.715134,7.092348,4.138402,5.603666,8.322725,2.700235,...,-0.098655,1.325461,3.3903,0.481377,2.899342,1.013369,-2.236283,-0.41159,2.210734,3.624268
7,id_7,inter_1,3.660904,-0.742204,3.019879,1.632132,0.9653,-0.214988,1.93381,2.361398,...,1.408409,4.695619,2.359672,2.542994,5.146744,1.982615,4.177259,2.304665,3.895378,-1.725532
8,id_8,inter_1,4.462731,0.519789,6.874678,4.974272,4.378737,2.493396,4.809136,6.202732,...,3.919535,8.634669,6.9154,5.67121,6.031385,3.723953,6.426939,3.975999,8.537552,1.699418
9,id_9,inter_2,-0.573327,-3.169679,-10.526881,-8.49912,-9.313688,-7.494127,-6.848408,-10.221596,...,-6.952417,-9.778562,-12.269643,-8.287621,-0.321352,-4.308436,-5.362115,-4.143442,-12.076438,-10.440413


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter of the method is $\textit{cum_energy} \in [0, 100]$, which determines the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components required such that the percentage of the spectral energy retained is above the given parameter. 
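
The selection rule described above can be sketched as follows. This is an illustration rather than the code inside fill_tensor: the helper name `num_components` is hypothetical, spectral energy is assumed to mean the sum of squared singular values, and the threshold is assumed to be expressed as a fraction in $[0, 1]$ (a percentage would first be divided by 100).

```python
import numpy as np

def num_components(X, cum_energy):
    """Hypothetical helper: smallest k such that the top-k squared singular
    values hold at least `cum_energy` (a fraction) of the total energy."""
    s = np.linalg.svd(X, compute_uv=False)       # singular values, descending
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)  # cumulative energy fraction
    # First index whose cumulative fraction reaches the threshold.
    k = int(np.searchsorted(energy, cum_energy, side="left")) + 1
    return min(k, len(s))

# Singular values 3, 2, 1 give cumulative fractions 9/14, 13/14, 14/14.
X = np.diag([3.0, 2.0, 1.0])
print(num_components(X, 0.5))   # 1
print(num_components(X, 0.9))   # 2
print(num_components(X, 0.95))  # 3
```

A higher threshold keeps more components and fits the pre-intervention data more closely, while a lower threshold discards more of the spectrum, which acts as denoising when $\sigma > 0$.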

The output of $\textbf{fill_tensor}$ is an order-3 tensor (flattened), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. This contains the counterfactual estimates for every unit $n \in [N]$ and for each intervention $i \in [I]$, over the entire post-intervention period, $T - T_0$. 

This dataframe is the desired counterfactual output!
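
For intuition, the following is a minimal sketch of how Principal Component Regression can produce one such counterfactual trajectory: learn linear weights that express a target unit in terms of "donor" units over the pre-intervention period, then apply those weights to the donors' post-intervention measurements under the intervention of interest. It is not the implementation of fill_tensor; the function name `pcr_counterfactual`, its arguments, and the toy data are assumptions made for illustration.

```python
import numpy as np

def pcr_counterfactual(target_pre, donors_pre, donors_post, k):
    """Hypothetical sketch of Principal Component Regression.

    target_pre:  (T0,)   pre-intervention measurements of the target unit
    donors_pre:  (D, T0) pre-intervention measurements of the donor units
    donors_post: (D, T1) post-intervention measurements of the same donors
    k:           number of principal components to retain
    """
    # Rank-k truncated SVD of the donor pre-intervention matrix.
    U, s, Vt = np.linalg.svd(donors_pre, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Minimum-norm least-squares weights w solving A_k^T w ~ target_pre,
    # where A_k = U_k diag(s_k) Vt_k is the rank-k approximation.
    w = U_k @ ((Vt_k @ target_pre) / s_k)
    # Apply the learned weights to the donors' post-intervention data.
    return w @ donors_post

# Noiseless rank-2 toy data: 1 target unit and 5 donor units.
rng = np.random.default_rng(0)
unit_factors = rng.normal(size=(6, 2))
pre = unit_factors @ rng.normal(size=(2, 50))   # shape (6, 50)
post = unit_factors @ rng.normal(size=(2, 20))  # shape (6, 20)

estimate = pcr_counterfactual(pre[0], pre[1:], post[1:], k=2)
print(np.allclose(estimate, post[0]))  # True
```

In this noiseless toy setting the weights learned before the intervention carry over exactly to the post-intervention period. With noisy data the match is only approximate, which is why a reliability check is useful.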

In [6]:
df_output = fill_tensor(pre_df, post_df, cum_energy=0.9, full_matrix_denoise=True)
df_output.head(10)

Unnamed: 0,unit,intervention,t_800,t_801,t_802,t_803,t_804,t_805,t_806,t_807,...,t_990,t_991,t_992,t_993,t_994,t_995,t_996,t_997,t_998,t_999
0,id_0,inter_0,-0.340655,3.218854,-4.649053,0.669246,-0.563896,1.129696,1.456061,-2.156759,...,-2.281156,-3.906005,-2.275219,-2.878416,-1.334093,-1.473696,-4.364125,-2.276766,-3.577582,0.325634
1,id_0,inter_1,3.400304,-0.473592,3.866657,2.260287,1.763659,0.469711,2.358964,3.166523,...,1.996861,5.303186,3.359121,3.172599,4.759746,2.25085,4.432415,2.552072,4.778349,-0.609743
2,id_0,inter_2,-0.892386,-2.056559,-7.581566,-6.010809,-6.473942,-5.073239,-4.92826,-7.28351,...,-4.926868,-7.301874,-8.677761,-5.991557,-0.924533,-3.205386,-4.223658,-3.137172,-8.756749,-6.872156
3,id_0,inter_3,7.88781,1.063818,1.455314,3.52404,0.781203,-0.117276,5.099879,2.346424,...,0.382363,6.007077,2.040477,2.255951,10.328322,2.685064,4.816068,2.670384,4.296924,-4.558063
4,id_1,inter_0,-2.391648,0.504493,-5.658917,-2.910988,-3.051914,-1.447859,-2.46491,-4.527882,...,-3.180993,-6.260104,-5.071425,-4.34359,-3.469702,-2.641227,-4.971254,-3.004455,-6.276205,-1.510742
5,id_1,inter_1,7.978874,-1.111293,9.07318,5.303803,4.138457,1.102185,5.535351,7.430304,...,4.68567,12.44402,7.882236,7.444561,11.168828,5.28166,10.400741,5.988482,11.212481,-1.430772
6,id_1,inter_2,-1.784178,-4.111749,-15.158084,-12.017614,-12.943573,-10.1431,-9.853238,-14.562169,...,-9.850456,-14.598885,-17.349744,-11.979125,-1.84845,-6.408638,-8.444503,-6.272255,-17.507666,-13.739736
7,id_1,inter_3,6.447683,-2.309158,-0.618186,-1.09072,-2.9915,-3.925872,0.459076,-1.178186,...,-1.335846,3.024293,-2.232645,-0.018803,9.132303,1.222828,4.181406,1.739568,0.451868,-8.37593
8,id_2,inter_0,-1.43043,0.431766,-3.543355,-1.69626,-1.83067,-0.810757,-1.399578,-2.770567,...,-1.978067,-3.868681,-3.097352,-2.691246,-2.109912,-1.62505,-3.124342,-1.873339,-3.864646,-0.88122
9,id_2,inter_1,5.299573,-0.738121,6.026412,3.522789,2.748766,0.732072,3.676584,4.935213,...,3.112225,8.265326,5.235387,4.944682,7.418343,3.508082,6.908179,3.977554,7.447337,-0.95032


## Section 3 - Diagnostic: Which Interventions Can We Reliably Produce Counterfactuals For?

In this section we show how to use our diagnostic tool method, termed $\textbf{diagnostic}$. 

$\textbf{diagnostic}$ is a function to assess whether the counterfactual estimates produced are reliable. Recall that, in reality, we do not get access to the true counterfactual outcomes. Hence, we need a test to see whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the (unit, intervention) pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). For each intervention, we report the average $R^2$ value over all units which received that particular intervention.
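
As a rough illustration of the kind of score involved, the sketch below computes the standard coefficient of determination between an observed trajectory and its estimate. Whether diagnostic uses exactly this definition is an assumption, and the helper name `r2_score` is hypothetical.

```python
import numpy as np

def r2_score(observed, estimated):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot for one trajectory."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ss_res = np.sum((observed - estimated) ** 2)        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# A perfect reconstruction scores 1; predicting the mean everywhere scores 0.
print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```

Under this definition, a value close to 1 for an intervention suggests that trajectories under it are well reconstructed, while a low value suggests that counterfactual estimates for that intervention should be treated with caution.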

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,1.0
1,inter_1,0.544225
2,inter_2,0.988776
3,inter_3,0.897528
