# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

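In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). In Section 3, we show how to use the $\textbf{diagnostic}$ method to check whether those counterfactual estimates are reliable.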

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your problems of interest too.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Controlled Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin by generating artificial data for the demonstration with the function random_rct. All the data can be captured in a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
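
To make these terms concrete, here is a minimal sketch of how a noisy rank-$r$ tensor of shape $N \times T \times I$ could be built with NumPy. It is not the implementation of random_rct: the function name `low_rank_tensor`, the Gaussian factor matrices, and the choice to treat `sigma` as a standard deviation are all assumptions made for illustration.

```python
import numpy as np

def low_rank_tensor(N, T, I, rank, sigma, seed=0):
    """Hypothetical helper: build a rank-`rank` tensor plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    unit_factors = rng.normal(size=(N, rank))          # one latent vector per unit
    time_factors = rng.normal(size=(T, rank))          # one latent vector per time step
    intervention_factors = rng.normal(size=(I, rank))  # one per intervention
    # Sum of `rank` outer products: signal[n, t, i] = sum_r u[n,r] * v[t,r] * w[i,r]
    signal = np.einsum("nr,tr,ir->nti",
                       unit_factors, time_factors, intervention_factors)
    # Assumption: sigma is used as the standard deviation of the noise.
    noise = rng.normal(scale=sigma, size=signal.shape)
    return signal, signal + noise

signal, noisy = low_rank_tensor(N=20, T=100, I=4, rank=2, sigma=0.3)
print(signal.shape)  # (units, time steps, interventions)
```

In this sketch every unit has a trajectory under every intervention; an experiment only reveals one slice of that tensor per unit, which is what MA-SC is meant to fill in.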

In [2]:
# Generate Artificial Data

# Number of Units
N = 20
# Number of Interventions
I = 4
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 100
# Number of Pre-Intervention Time Steps
T0 = 80
# Model Complexity
rank = 2
# Noise in System
sigma = 0.3

rct_data = random_rct(N, I, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by the function $\textbf{random_rct}$ consists of two dataframes: pre_df and post_df.

pre_df is a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It contains the measurements of all units before any intervention is applied.

post_df is a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It contains the measurements of each unit $n \in [N]$ under the intervention it actually experienced (i.e., what was observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df needs to have experienced an intervention, and a unit may experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, the artificial data here covers the case where each unit $n \in [N]$ in the pre-intervention phase receives exactly one intervention in the post-intervention phase.)

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(100, 82)


Unnamed: 0,unit,intervention,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,...,t_70,t_71,t_72,t_73,t_74,t_75,t_76,t_77,t_78,t_79
0,id_0,inter_0,-0.789528,0.280044,-0.329701,-0.028155,-1.313386,-0.279037,-0.491527,-0.151177,...,0.636804,0.139924,-0.313516,-0.03245,-0.350565,-0.511724,-0.788354,-0.4473,0.083829,0.59183
1,id_1,inter_0,1.249301,0.907947,1.064869,1.641366,4.410802,2.36303,0.559894,2.556653,...,0.473655,1.203368,0.85321,1.459725,1.639142,1.231661,1.500575,0.640481,1.269764,0.381554
2,id_2,inter_0,0.996006,0.532616,0.538236,0.21803,2.259897,1.18462,0.575292,0.423723,...,-0.083158,0.511706,0.011673,0.660297,0.949552,0.232956,0.77167,0.610421,0.358509,-0.249937
3,id_3,inter_0,1.364945,0.35514,1.034978,2.284616,5.287527,3.449649,0.476375,3.834121,...,0.290863,1.619372,1.318218,2.132439,2.230829,2.519731,1.70768,0.99461,0.997742,0.227848
4,id_4,inter_0,-0.086028,-0.45161,0.142951,-0.054346,-0.485491,-0.326445,-0.161189,-0.449799,...,-0.237793,0.237131,-0.113752,-0.499663,-0.930425,-0.10957,-0.291826,-0.416512,-0.293138,0.035914
5,id_5,inter_0,-0.890137,0.317176,-0.504737,0.164722,-1.075117,-0.446164,-0.929754,-0.408411,...,1.215774,0.416226,-0.354196,0.204679,-0.275972,0.365137,-1.149439,-0.239917,0.388232,0.224078
6,id_6,inter_0,1.444802,0.533437,0.814573,1.717059,5.58909,3.445957,0.494462,3.092546,...,-0.322242,1.437809,0.623785,1.655346,2.518665,2.456437,2.381338,0.940965,1.011812,0.016081
7,id_7,inter_0,-1.178016,0.131493,0.01476,0.284677,-1.882314,-0.422142,-0.850936,-1.27553,...,0.510728,-0.224116,-0.009411,-0.457961,-0.816893,-0.357148,-0.548266,-0.621464,-0.375156,0.338086
8,id_8,inter_0,2.030421,0.718487,2.091024,3.289887,7.219802,4.752026,0.996763,4.486035,...,0.785659,2.19564,1.471219,2.471191,3.281465,3.500404,2.802366,1.258251,2.005923,0.815365
9,id_9,inter_0,0.373184,0.575373,0.32522,1.340815,1.903034,2.071,0.198981,1.529563,...,0.02551,0.690814,-0.106863,0.182188,0.561533,1.07373,0.618406,0.455797,0.901694,0.973933


In [5]:
print(post_df.shape)
post_df.head(10)

(100, 22)


Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_1,-0.520422,-0.245714,0.744938,-0.431805,0.3219,0.023617,1.247237,0.322584,...,-0.782588,0.262367,-0.461601,-1.745441,0.21868,-0.461133,-0.335455,0.444786,-0.870666,-1.154311
1,id_1,inter_0,4.916485,4.489747,-0.263786,1.63926,1.645509,1.686372,-4.705482,-1.681542,...,5.855587,1.386051,2.640822,6.427938,-0.484736,2.428062,2.033177,-0.857274,3.18051,3.787359
2,id_2,inter_0,3.060916,3.286056,-0.053997,1.080914,1.508272,1.730823,-3.474812,-0.847682,...,4.063639,1.120669,2.647396,5.554945,-0.371105,0.8644,2.088277,-0.36852,2.324676,2.867696
3,id_3,inter_1,3.55169,2.56649,1.029876,0.702921,0.766896,1.369691,-0.550551,-0.546775,...,3.747622,2.161093,2.831017,2.889828,-0.391413,1.043753,0.726182,0.828903,1.081866,2.578714
4,id_4,inter_3,-0.328015,0.328104,0.633021,-0.29121,0.191115,-0.067427,0.868745,-0.552619,...,0.113021,0.059578,0.303211,-0.344653,0.511637,0.229045,-0.424237,0.621207,-0.543522,-0.64573
5,id_5,inter_1,-0.065963,-0.999886,0.10057,-0.415618,-0.363962,-0.018691,1.372176,0.243055,...,-0.681887,0.847501,0.125094,-0.867209,-0.264321,-0.556769,-0.354036,0.673864,-0.942807,-0.520193
6,id_6,inter_0,7.691158,6.078563,-0.106463,2.402394,1.637922,3.207122,-6.19649,-1.939482,...,8.669936,1.646095,4.232755,9.792372,-0.811414,2.317009,3.367996,-1.069026,4.412187,5.986297
7,id_7,inter_3,1.129017,0.033812,0.739529,0.435886,1.116429,0.903671,1.262601,-0.74892,...,0.311196,1.805376,0.571653,-0.777659,-0.303089,-0.177741,-0.298116,1.26356,-0.777739,-0.211576
8,id_8,inter_0,8.872267,8.58888,-0.161581,3.588833,2.290117,3.78373,-8.039088,-3.356608,...,10.966524,2.188101,5.420691,12.712689,-0.696878,3.038589,4.296644,-1.73705,5.516621,7.696065
9,id_9,inter_3,2.067766,1.381507,2.016932,0.251803,0.642864,0.922156,1.345103,-0.426533,...,1.49893,3.340347,2.609947,0.362676,-0.131409,-0.15939,-0.277974,1.989962,-0.190884,0.139873


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the two dataframes of pre- and post-intervention data. 

The key parameter of the method is $\textit{cum_energy} \in [0, 100]$, which determines the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components required such that the percentage of spectral energy retained is above the given parameter. 
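
The rank-selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code inside $\textbf{fill_tensor}$: the helper names `num_components_for_energy` and `pcr_weights` are hypothetical, spectral energy is taken to be the sum of squared singular values, and the threshold is written as a fraction in $[0, 1]$ (a percentage would be divided by 100 first).

```python
import numpy as np

def num_components_for_energy(singular_values, energy):
    """Smallest k whose top-k squared singular values reach `energy` (a fraction)."""
    squared = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(squared) / squared.sum()
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(squared))  # guard against floating-point round-off at energy=1

def pcr_weights(X, y, energy):
    """Principal Component Regression: least squares on the top-k components of X.

    X has one row per pre-intervention time step and one column per donor unit;
    y is the target unit's pre-intervention trajectory.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = num_components_for_energy(s, energy)
    # Pseudo-inverse of the rank-k approximation of X, applied to y.
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

# Squared singular values 9, 4, 1 carry 64.3%, 28.6% and 7.1% of the energy.
print(num_components_for_energy([3.0, 2.0, 1.0], energy=0.9))  # 2
```

The learned weights would then be applied to the donors' post-intervention measurements to predict the target unit's trajectory under the donors' intervention.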

The output of $\textbf{fill_tensor}$ is an order-3 tensor (flattened), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. It contains the counterfactual estimates for every unit $n \in [N]$ and every intervention $i \in [I]$ over the entire post-intervention period, $T - T_0$. 

This dataframe is the desired counterfactual output!
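
Because the output is stored flattened, with one row per (unit, intervention) pair, it can be handy to reshape it back into an $N \times (T - T_0) \times I$ array. The sketch below shows one way to do that with pandas. The helper `to_tensor` is hypothetical, and it assumes the column layout shown in this notebook (a `unit` column, an `intervention` column, and time columns prefixed with `t_`).

```python
import numpy as np
import pandas as pd

def to_tensor(df):
    """Hypothetical helper: reshape a flattened dataframe to (units, time, interventions)."""
    time_cols = [c for c in df.columns if str(c).startswith("t_")]
    units = df["unit"].unique()                  # order of first appearance
    interventions = df["intervention"].unique()
    full_index = pd.MultiIndex.from_product(
        [units, interventions], names=["unit", "intervention"])
    # Rows ordered unit-major; any missing (unit, intervention) pair becomes NaN.
    values = (df.set_index(["unit", "intervention"])[time_cols]
                .reindex(full_index)
                .to_numpy())
    tensor = values.reshape(len(units), len(interventions), len(time_cols))
    return tensor.transpose(0, 2, 1), units, interventions

# Tiny made-up example: 2 units, 2 interventions, 3 post-intervention steps.
toy = pd.DataFrame({
    "unit": ["id_0", "id_0", "id_1", "id_1"],
    "intervention": ["inter_0", "inter_1", "inter_0", "inter_1"],
    "t_80": [1.0, 2.0, 3.0, 4.0],
    "t_81": [1.5, 2.5, 3.5, 4.5],
    "t_82": [1.9, 2.9, 3.9, 4.9],
})
tensor, units, interventions = to_tensor(toy)
print(tensor.shape)  # (2, 3, 2)
```

With this layout, `tensor[n, :, i]` is the estimated post-intervention trajectory of unit `units[n]` under intervention `interventions[i]`.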

In [6]:
df_output = fill_tensor(pre_df, post_df, cum_energy=0.9, full_matrix_denoise=True)
df_output.head(10)

Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_0,-3.253674,-2.895876,0.027754,-0.95805,-0.761263,-1.063057,2.636451,1.050806,...,-3.800409,-0.855803,-1.723396,-4.368479,0.236484,-0.912773,-1.576029,0.39957,-1.91716,-2.509142
1,id_0,inter_1,-0.311416,-0.635565,0.70562,-0.294053,0.035351,0.067495,1.347663,0.240878,...,-0.748985,0.517063,-0.15881,-1.27565,0.166656,-0.163339,-0.518224,0.519927,-0.745434,-0.7231
2,id_0,inter_2,3.771412,2.077791,3.346092,0.523706,1.607261,1.896589,2.105561,-1.338222,...,2.807836,4.200947,3.226594,0.733148,-0.333678,0.787729,-0.708412,2.495573,-0.705113,0.510849
3,id_0,inter_3,1.354275,0.564719,1.450554,0.171521,0.687957,0.88301,1.197243,-0.485348,...,0.767073,1.811148,1.181088,-0.214715,-0.084063,0.254422,-0.460335,1.175043,-0.576505,-0.08389
4,id_1,inter_0,4.871598,4.466262,-0.171129,1.474869,1.194074,1.909015,-4.206327,-1.667838,...,5.780097,1.218025,2.737718,6.637517,-0.370305,1.423508,2.218421,-0.732801,2.93306,3.995719
5,id_1,inter_1,2.663753,1.972179,0.922364,0.621826,0.839574,1.170995,-0.644708,-0.924453,...,2.725393,1.634366,1.698603,2.383228,-0.124162,0.746835,0.5131,0.526191,0.7833,1.444267
6,id_1,inter_2,5.198936,2.537157,5.287277,0.750196,2.360592,3.006794,3.845621,-1.993964,...,3.649085,6.46632,4.596154,-0.006688,-0.294284,1.213354,-1.367393,4.062527,-1.554145,0.346543
7,id_1,inter_3,2.924893,1.712708,2.440595,0.516433,1.223873,1.564148,1.351534,-0.994512,...,2.303783,3.060513,2.367239,0.73989,-0.206798,0.671121,-0.418237,1.811767,-0.513479,0.604473
8,id_2,inter_0,3.756372,3.389944,-0.078404,1.120533,0.898296,1.340847,-3.136413,-1.246972,...,4.419742,0.965365,2.045962,5.078039,-0.278827,1.074139,1.768969,-0.509444,2.235753,2.982281
9,id_2,inter_1,1.166154,1.093898,-0.059178,0.398211,0.291569,0.396314,-1.006368,-0.480297,...,1.438079,0.312468,0.720989,1.610803,-0.140947,0.370225,0.485434,-0.100569,0.715062,0.947732


## Section 3 - Diagnostic: For Which Interventions Can We Reliably Produce Counterfactuals?

In this section, we show how to use our diagnostic tool, the $\textbf{diagnostic}$ method. 

$\textbf{diagnostic}$ is a function that assesses whether the counterfactual estimates produced are reliable. Recall that, in reality, we never observe the true counterfactual outcomes. Hence, we need a test of whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the (unit, intervention) pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). For each intervention, we report the average $R^2$ value over all units that received that particular intervention.
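
One plausible way to compute such a per-intervention score is sketched below. It is an assumption about the kind of calculation involved, not the implementation of $\textbf{diagnostic}$: the helper `average_r2_by_intervention` is hypothetical, and $R^2$ is taken to be $1 - SS_{\text{res}} / SS_{\text{tot}}$ computed per observed trajectory and then averaged.

```python
import numpy as np
import pandas as pd

def average_r2_by_intervention(observed, estimated):
    """Hypothetical helper: mean R^2 per intervention over the observed pairs."""
    time_cols = [c for c in observed.columns if str(c).startswith("t_")]
    # Keep only the (unit, intervention) pairs that were actually observed.
    merged = observed.merge(estimated, on=["unit", "intervention"],
                            suffixes=("_obs", "_est"))
    obs = merged[[f"{c}_obs" for c in time_cols]].to_numpy()
    est = merged[[f"{c}_est" for c in time_cols]].to_numpy()
    ss_res = ((obs - est) ** 2).sum(axis=1)
    ss_tot = ((obs - obs.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    merged["r2"] = 1.0 - ss_res / ss_tot
    return merged.groupby("intervention")["r2"].mean()

# Made-up data: the inter_0 estimate is exact, the inter_1 estimate is a flat line.
observed = pd.DataFrame({
    "unit": ["id_0", "id_1"],
    "intervention": ["inter_0", "inter_1"],
    "t_80": [1.0, 2.0], "t_81": [2.0, 4.0], "t_82": [3.0, 6.0],
})
estimated = pd.DataFrame({
    "unit": ["id_0", "id_0", "id_1", "id_1"],
    "intervention": ["inter_0", "inter_1", "inter_0", "inter_1"],
    "t_80": [1.0, 0.0, 0.0, 4.0],
    "t_81": [2.0, 0.0, 0.0, 4.0],
    "t_82": [3.0, 0.0, 0.0, 4.0],
})
scores = average_r2_by_intervention(observed, estimated)
print(scores)
```

A score near 1 indicates that the estimates track the observed trajectories closely, while a flat prediction at the trajectory's mean scores 0.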

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,0.995441
1,inter_1,0.951441
2,inter_2,0.978519
3,inter_3,0.94584
