# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your own problems of interest.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Controlled Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin with generating artificial data for the purposes of the demonstration through the function random_rct. All the data can be captured through a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
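To make these terms concrete, here is a minimal sketch of how a noisy rank-$r$ tensor of shape $N \times T \times I$ could be generated. It is not the implementation of random_rct: the helper name `toy_low_rank_tensor`, the Gaussian factor matrices, and the use of $\sigma$ as the standard deviation of the noise are all assumptions made for illustration.

```python
import numpy as np

def toy_low_rank_tensor(N, T, I, r, sigma, seed=0):
    """Hypothetical helper: returns (noiseless, noisy) tensors of shape (N, T, I)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(N, r))  # latent factors for units
    V = rng.normal(size=(T, r))  # latent factors for time periods
    W = rng.normal(size=(I, r))  # latent factors for interventions
    # Rank-r (CP) form: M[n, t, i] = sum_k U[n, k] * V[t, k] * W[i, k]
    M = np.einsum("nk,tk,ik->nti", U, V, W)
    # Assumption: sigma is treated as the noise standard deviation here
    noise = rng.normal(scale=sigma, size=M.shape)
    return M, M + noise

M_true, M_obs = toy_low_rank_tensor(N=20, T=30, I=4, r=5, sigma=0.1)
print(M_obs.shape)
```

Each unit would then reveal only one slice `M_obs[n, T0:, i]` in the post-intervention phase, which is what makes the remaining entries counterfactual.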

In [2]:
# Generate Artificial Data

# Number of Units
N = 100
# Number of Interventions
I = 4
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 100
# Number of Pre-Intervention Time Steps
T0 = 80
# Model Complexity
rank = 5
# Noise in System
sigma = 0.1

rct_data = random_rct(N, I, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by the function $\textbf{random_rct}$ consists of two dataframes: pre_df and post_df.

pre_df is a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It contains the measurements of all units before any intervention is applied.

post_df is a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It contains the measurements of each unit $n \in [N]$ under the intervention it actually experienced (i.e., what was observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df has to have experienced an intervention. Further, a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, we illustrate on artificial data the case where each unit $n \in [N]$ in the pre-intervention phase receives exactly one intervention in the post-intervention phase.)
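Both dataframes carry two label columns (unit, intervention) in addition to the time columns, which is why the printed shapes below have two more columns than $T_0$ and $T - T_0$. The following sketch uses a toy stand-in for pre_df (the values and the name `toy_pre` are made up for illustration) to show how the numeric matrix can be separated from the labels.

```python
import pandas as pd

# Toy stand-in for pre_df: 3 units, T0 = 4 pre-intervention periods
toy_pre = pd.DataFrame({
    "unit": ["id_0", "id_1", "id_2"],
    "intervention": ["inter_0", "inter_0", "inter_0"],
    "t_0": [1.0, 2.0, 3.0],
    "t_1": [1.5, 2.5, 3.5],
    "t_2": [0.5, 1.0, 1.5],
    "t_3": [2.0, 4.0, 6.0],
})

# Time columns follow the t_<index> naming seen in the notebook output
time_cols = [c for c in toy_pre.columns if c.startswith("t_")]
M_pre = toy_pre[time_cols].to_numpy()  # numeric N x T0 matrix

print(toy_pre.shape, M_pre.shape)  # (3, 6) (3, 4)
```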

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(100, 82)


Unnamed: 0,unit,intervention,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,...,t_70,t_71,t_72,t_73,t_74,t_75,t_76,t_77,t_78,t_79
0,id_0,inter_0,7.518073,0.573226,6.444611,6.73086,1.803391,8.370819,7.078284,10.159361,...,4.488761,1.973263,3.260745,4.884699,3.451529,1.187989,3.654816,4.721364,7.728562,3.869161
1,id_1,inter_0,8.432419,1.337571,1.562022,9.505007,1.909434,3.585163,1.447324,7.849175,...,5.069941,5.361269,3.626209,-0.675668,3.082126,2.007632,3.991262,6.228829,5.60047,11.046689
2,id_2,inter_0,14.983068,0.556331,0.047254,5.461952,-4.880725,-0.231059,6.568635,8.853108,...,5.106373,-0.366656,-3.789649,-3.923276,-4.75423,2.34038,7.778306,-1.758915,5.267913,10.205042
3,id_3,inter_0,12.073114,0.719206,9.091367,11.259144,6.888799,15.187119,8.298371,19.913028,...,10.741413,8.2934,11.851142,11.24946,9.372493,1.739144,10.391815,13.620383,12.025311,7.172701
4,id_4,inter_0,11.024917,0.967515,4.894992,10.786132,7.724337,11.354896,3.394517,17.555842,...,10.780139,10.921376,12.748392,7.974385,9.169765,1.969674,11.729847,14.407023,8.73901,10.112849
5,id_5,inter_0,7.021452,3.976506,4.772047,8.204582,6.157116,9.412974,0.939329,8.204495,...,5.062787,9.687016,5.106316,4.969893,6.228696,1.398867,6.810615,7.194452,7.400282,9.516328
6,id_6,inter_0,16.22954,4.419142,6.14087,13.759445,10.266443,15.096499,2.552698,20.177186,...,12.812255,16.552669,12.297097,9.462962,10.255675,2.647155,17.214731,15.275365,12.105834,16.853877
7,id_7,inter_0,2.502224,-1.961869,7.746381,5.180861,10.089248,14.449991,4.304884,17.52298,...,9.262726,8.075315,17.860535,15.48518,12.441966,-0.041055,8.713397,16.687956,7.027166,-2.82446
8,id_8,inter_0,11.52962,2.09643,7.282865,11.65243,8.328113,14.074935,4.9032,18.08697,...,10.42508,11.209543,12.023796,9.752189,9.680542,1.964015,11.405427,13.649524,10.854566,10.201093
9,id_9,inter_0,16.532442,-1.564678,7.209014,14.540882,4.868634,12.653117,9.732867,24.743336,...,13.934646,7.451522,14.161698,8.563595,8.79353,2.853186,12.5867,16.225286,12.471476,10.689566


In [5]:
print(post_df.shape)
post_df.head(10)

(100, 22)


Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_2,9.457158,4.907149,-1.026764,7.033237,5.121226,11.232577,6.764185,6.91284,...,2.895925,6.426895,0.049731,5.459739,6.647303,4.542335,7.970613,7.652899,7.498554,9.836908
1,id_1,inter_2,0.723184,4.058025,-1.774982,8.506259,5.158311,7.215651,5.395917,4.591215,...,-3.998681,-1.685363,0.899202,-0.347043,4.968717,-1.377615,4.916343,0.844843,10.029885,0.424224
2,id_2,inter_1,3.98588,1.196708,-2.321857,0.857628,-0.063115,5.221232,0.058156,5.241159,...,-2.446785,1.39783,0.021591,5.41441,0.774263,-0.100347,3.121653,2.680204,2.304647,4.864759
3,id_3,inter_3,16.784732,21.755579,-2.970612,20.044837,15.339921,22.058574,16.261098,9.509602,...,9.678821,17.775062,13.156638,4.112339,14.018671,13.346228,19.309067,19.428623,22.032671,16.100707
4,id_4,inter_2,8.04699,15.785518,-2.961644,14.989024,11.693705,13.719386,10.942745,5.350687,...,4.257447,9.590178,10.880756,0.026405,9.744378,7.604947,12.637026,11.076919,16.625219,7.363764
5,id_5,inter_2,3.457706,8.440817,-0.104224,12.324841,5.042978,10.69854,9.84905,4.739515,...,2.392756,4.111492,2.125671,0.086161,6.602453,1.944288,6.902431,7.860286,14.576302,3.155535
6,id_6,inter_0,3.164621,5.615667,-0.017584,6.198264,3.55389,5.494907,5.285586,1.839401,...,3.348017,3.980552,2.521907,-0.413551,3.810138,2.874336,4.44048,5.225344,6.887015,2.657762
7,id_7,inter_2,14.970561,18.3872,-1.110413,13.520684,15.476996,12.803178,11.525953,1.966079,...,12.845072,17.488613,13.976067,-0.389203,11.644514,15.285608,14.845461,14.609419,11.506758,12.697384
8,id_8,inter_1,5.650121,6.680807,-1.216379,7.107903,5.334613,8.083708,5.622203,3.87651,...,2.188741,5.073794,3.399626,2.053846,5.174127,3.745973,6.553539,6.059275,8.090464,5.562822
9,id_9,inter_1,8.842016,7.350923,-2.728053,7.01822,7.632724,10.484631,5.296064,6.105555,...,0.870715,6.288476,4.81299,3.977678,6.307114,5.098843,8.92122,6.112652,7.578704,8.416874


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter of the method is $\textit{cum_energy} \in [0, 100]$, which determines the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components required such that the percentage of the spectral energy retained is above the given value. (The call below passes rank=5 instead, which matches the rank used to generate the data.) 
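As an illustration of the energy criterion, the sketch below picks the smallest number of components whose cumulative share of spectral energy reaches a threshold. It assumes that "spectral energy" means the sum of squared singular values and that the threshold is non-strict; the helper name `components_for_energy` is hypothetical and is not part of the package.

```python
import numpy as np

def components_for_energy(X, cum_energy):
    """Smallest k such that the top-k squared singular values hold
    at least cum_energy percent of the total (assumed definition)."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    energy = 100.0 * np.cumsum(s**2) / np.sum(s**2)
    # First index whose cumulative energy is >= cum_energy, as a 1-based count
    k = int(np.searchsorted(energy, cum_energy)) + 1
    return min(k, len(s))  # guard against floating-point shortfall at 100

# Singular values 3, 2, 1 give cumulative energies of about 64.3, 92.9, 100
X = np.diag([3.0, 2.0, 1.0])
print(components_for_energy(X, 90))  # 2
```

A lower threshold keeps fewer components and denoises more aggressively, while a threshold near 100 keeps almost everything, including noise.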

The output of $\textbf{fill_tensor}$ is an order-3 tensor, $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. It is returned in flattened form, with one row per (unit, intervention) pair, and contains the counterfactual estimates for every unit $n \in [N]$ under each intervention $i \in [I]$ over the entire post-intervention period of $T - T_0$ time steps. 

This dataframe is the desired counterfactual output.
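If a dense array is more convenient than the flattened dataframe, the rows can be reshaped into an $N \times (T - T_0) \times I$ array. The sketch below does this on a toy stand-in for df_output (made-up values, hypothetical name `toy_output`), assuming the same unit, intervention, t_* column layout as shown in this notebook.

```python
import pandas as pd

# Toy stand-in for df_output: 2 units x 3 interventions, 2 post-intervention periods
rows = []
for n in range(2):
    for i in range(3):
        rows.append({"unit": f"id_{n}", "intervention": f"inter_{i}",
                     "t_80": 10.0 * n + i, "t_81": 10.0 * n + i + 0.5})
toy_output = pd.DataFrame(rows)

time_cols = [c for c in toy_output.columns if c.startswith("t_")]
units = sorted(toy_output["unit"].unique())          # axis-0 labels
inters = sorted(toy_output["intervention"].unique())  # axis-2 labels

# Sort rows by unit, then intervention, so the reshape groups them correctly
ordered = toy_output.sort_values(["unit", "intervention"])
tensor = ordered[time_cols].to_numpy().reshape(len(units), len(inters), len(time_cols))
tensor = tensor.transpose(0, 2, 1)  # -> (N, T - T0, I)

print(tensor.shape)  # (2, 2, 3)
```

Note that the labels are sorted as strings, so with many units `id_10` comes before `id_2`; the `units` and `inters` lists give the label for each index.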

In [6]:
df_output = fill_tensor(pre_df, post_df, rank=5, full_matrix_denoise=True)
df_output.head(10)

Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_0,2.224674,1.520913,0.479892,2.304231,0.899704,2.741343,2.372225,1.291214,...,1.828112,1.928639,-0.434797,0.801787,1.653688,1.243907,1.693075,2.548574,2.509359,2.068374
1,id_0,inter_1,6.34282,2.710233,-1.143971,3.726474,3.269626,7.132429,3.319466,4.875767,...,0.576651,3.575468,0.08704,4.0928,3.852709,2.547669,5.111051,4.16637,4.19269,6.410995
2,id_0,inter_2,9.692562,5.13472,-0.927111,6.977176,5.17134,11.32248,6.673829,6.987796,...,2.739387,6.508052,0.046604,5.506294,6.445321,4.502715,7.980827,7.657581,7.646984,9.766555
3,id_0,inter_3,9.489807,5.873934,-0.462993,7.361682,5.027957,11.000494,7.087163,6.219335,...,3.977283,7.124525,0.480746,4.815074,6.322254,4.967507,7.829825,8.277389,8.136821,9.268481
4,id_1,inter_0,0.675594,1.231645,-0.462818,3.121099,3.253348,2.156211,2.094986,1.000422,...,-1.067986,-0.435342,0.389331,-0.900113,2.483186,0.2587,2.063888,-0.313061,2.677952,0.481106
5,id_1,inter_1,0.705918,1.547565,-0.828353,3.400224,2.214377,3.230261,2.328547,2.113028,...,-1.848663,-0.333945,0.175145,0.19166,2.092471,-0.506491,2.307508,0.568334,4.05299,0.594262
6,id_1,inter_2,0.551568,3.996313,-1.867621,8.488786,5.176169,7.145501,5.413029,4.443651,...,-4.005586,-1.541664,0.999787,-0.287028,4.847347,-1.323922,4.963053,0.996725,9.93193,0.566425
7,id_1,inter_3,-0.184351,8.339484,-3.511446,14.937161,8.132154,12.089141,9.04674,7.153059,...,-6.909235,-2.32071,3.679279,-1.112586,7.526968,-2.506351,8.53605,2.847375,17.978722,0.088138
8,id_2,inter_0,-2.352408,-3.678346,-0.232476,-5.801887,-8.252312,-2.057778,-4.499195,1.304975,...,-1.959288,-1.701079,-3.104275,3.840782,-5.727593,-3.793658,-4.186463,0.333384,-2.972051,-1.21242
9,id_2,inter_1,4.012393,1.132116,-2.282767,0.699634,0.085114,5.404554,0.178485,5.234974,...,-2.506764,1.441567,0.014157,5.356079,0.804215,-0.092847,3.150296,2.692073,2.412516,4.693419


## Section 3 - Diagnostic: For Which Interventions Can We Reliably Produce Counterfactuals?

In this section, we show how to use our diagnostic tool, the $\textbf{diagnostic}$ function. 

$\textbf{diagnostic}$ assesses whether the counterfactual estimates produced are reliable. Recall that, in reality, we never observe the true counterfactual outcomes. Hence, we need a test to see whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). For each intervention, we report the average $R^2$ value over all units that received that particular intervention.
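The idea behind such a check can be sketched as follows. This is not the code of the diagnostic function: the per-unit $R^2$ definition, the averaging over units, and the toy numbers are assumptions chosen to illustrate the comparison between observed and estimated post-intervention trajectories.

```python
import numpy as np

def r2_score(observed, estimated):
    """Coefficient of determination for one trajectory (assumed definition)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ss_res = np.sum((observed - estimated) ** 2)         # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy data: two units that actually received inter_0, three post-intervention periods
observed = {"inter_0": np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])}
estimated = {"inter_0": np.array([[1.1, 1.9, 3.0], [2.0, 4.2, 5.8]])}

# Average the per-unit R^2 over all units that received each intervention
avg_r2 = {
    k: float(np.mean([r2_score(o, e) for o, e in zip(observed[k], estimated[k])]))
    for k in observed
}
print(avg_r2)
```

Values close to 1 indicate that the estimates track the observed trajectories closely, as in the table produced below.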

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,0.999152
1,inter_1,0.998987
2,inter_2,0.999478
3,inter_3,0.999771
