# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your problems of interest too.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Control Trial

### Explanation of Terms $N, I, M, T, T_0, r, \sigma$ 

We begin with generating artificial data for the purposes of the demonstration through the function random_rct. All the data can be captured through a 4-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times I \times M \times T}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$M$ denotes the number of metrics we are interested in measuring. For each unit $n \in [N]$ we measure all $M$ metrics.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
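To make the roles of $N, I, M, T, r, \sigma$ concrete, here is a minimal sketch of one way to build a noisy rank-$r$ order-4 tensor with NumPy. It is an illustration only, not the implementation of random_rct: the helper name toy_low_rank_tensor, the CP-style factor construction, and the use of $\sigma$ as the standard deviation of the Gaussian noise are all assumptions made for this example.

```python
import numpy as np

def toy_low_rank_tensor(N, I, M, T, r, sigma, seed=0):
    """Hypothetical helper: a rank-r order-4 tensor plus mean-zero Gaussian noise."""
    rng = np.random.default_rng(seed)
    # One latent factor matrix per tensor mode (units, interventions, metrics, time).
    U = rng.normal(size=(N, r))
    V = rng.normal(size=(I, r))
    W = rng.normal(size=(M, r))
    Z = rng.normal(size=(T, r))
    # CP form: entry (n, i, m, t) equals sum_k U[n,k] * V[i,k] * W[m,k] * Z[t,k].
    signal = np.einsum("nk,ik,mk,tk->nimt", U, V, W, Z)
    # Assumption: sigma is used here as the noise standard deviation.
    noise = rng.normal(scale=sigma, size=signal.shape)
    return signal, signal + noise

signal, observed = toy_low_rank_tensor(N=20, I=3, M=2, T=30, r=5, sigma=0.1)
print(observed.shape)  # (20, 3, 2, 30)

# Unfolding the noiseless tensor along the unit mode gives a matrix of rank at most r.
unit_unfolding = signal.reshape(20, -1)
print(np.linalg.matrix_rank(unit_unfolding))
```

The low rank of the unit-mode unfolding is what lets a unit's trajectory be expressed as a linear combination of other units' trajectories.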

In [23]:
# Generate Artificial Data

# Number of Units
N = 100
# Number of Interventions
I = 3
# Number of Metrics
M = 2
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 100
# Number of Pre-Intervention Time Steps
T0 = 80
# Model Complexity
rank = 5
# Noise in System
sigma = 0.1

rct_data = random_rct(N, I, M, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by calling the function $\textbf{random_rct}$ consists of two dataframes: pre_df and post_df.

pre_df is an order-3 tensor, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times M\times T_0}$, stored as a flattened dataframe with one row per (unit, metric) pair. It holds the measurements of all units $n \in [N]$ for all metrics $m \in [M]$ before any intervention is performed.

post_df is an order-3 tensor, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times M \times (T-T_0)}$, stored in the same flattened layout. It holds the measurements actually observed for each unit $n \in [N]$ and all metrics $m \in [M]$ in the post-intervention phase, $\textit{after an intervention is performed}$. 

(Note that not every unit in pre_df has to have experienced an intervention. Further, a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, we illustrate on artificial data the case where each unit $n \in [N]$ in the pre-intervention phase receives exactly one intervention in the post-intervention phase.)
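The relationship between the flattened dataframe layout and the order-3 tensor can be sketched as follows. This is a toy example, not part of the library: the frame toy_pre is hypothetical and only mimics the column layout seen in pre_df (unit, intervention, metric, then one column per time step), and the reshape assumes every unit has a row for every metric.

```python
import pandas as pd

# Hypothetical toy frame mimicking the layout of pre_df: 3 units, 2 metrics, 4 time steps.
rows = []
for u in range(3):
    for m in range(2):
        row = {"unit": f"id_{u:02d}", "intervention": "inter_0", "metric": f"m_{m}"}
        # Value encodes its own position so the reshape is easy to check by eye.
        row.update({f"t_{t:02d}": 10 * u + m + 0.1 * t for t in range(4)})
        rows.append(row)
toy_pre = pd.DataFrame(rows)

# Time columns are everything named t_XX; identifier columns are left out.
time_cols = [c for c in toy_pre.columns if c.startswith("t_")]

# Sort so that rows are grouped by unit, then metric, before reshaping.
toy_sorted = toy_pre.sort_values(["unit", "metric"])
n_units = toy_sorted["unit"].nunique()
n_metrics = toy_sorted["metric"].nunique()

# Shape (units, metrics, time): this assumes a balanced frame (all metrics for all units).
tensor = toy_sorted[time_cols].to_numpy().reshape(n_units, n_metrics, len(time_cols))
print(tensor.shape)  # (3, 2, 4)
```

With this layout, tensor[n, m, :] is the trajectory of metric m for unit n.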

In [24]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [25]:
print(pre_df.shape)
pre_df.head(10)

(200, 83)


Unnamed: 0,unit,intervention,metric,t_00,t_01,t_02,t_03,t_04,t_05,t_06,...,t_70,t_71,t_72,t_73,t_74,t_75,t_76,t_77,t_78,t_79
0,id_00,inter_0,m_0,3.630292,16.501714,18.564145,25.618941,22.994202,12.600344,31.803856,...,-11.533399,10.790337,7.236771,-1.816708,11.874544,33.328426,29.319231,-0.513339,19.617034,19.179276
1,id_00,inter_0,m_1,-0.474842,11.605762,12.870925,18.47354,16.997505,9.419333,24.251532,...,-10.929883,10.53468,4.919166,0.365642,9.99217,25.866276,22.690426,-1.184535,14.210777,14.921771
2,id_01,inter_0,m_0,15.21035,20.836119,22.784413,29.756306,24.134568,14.345467,30.51808,...,-2.617249,3.928155,7.923501,-5.041399,7.803533,32.029234,27.603117,2.152871,22.212301,18.741369
3,id_01,inter_0,m_1,4.102215,12.228095,13.12638,18.223527,16.198621,8.630758,21.559494,...,-6.727145,6.410348,5.095627,-1.392421,7.631663,22.557775,19.83548,-0.240402,13.823262,13.154702
4,id_02,inter_0,m_0,6.40337,18.519143,20.584074,28.071691,24.212722,13.628664,33.576151,...,-10.265028,10.038924,7.823924,-2.401652,11.526998,35.025304,30.526674,0.050278,20.828661,20.301902
5,id_02,inter_0,m_1,0.567535,12.577523,13.809427,19.583511,17.835356,10.039556,25.193532,...,-10.78176,10.321675,5.171729,0.007721,10.06183,26.804999,23.542172,-0.875552,14.62166,15.306444
6,id_03,inter_0,m_0,16.236672,32.415481,35.401552,47.663903,41.11528,23.098387,54.165815,...,-12.066487,12.634039,13.382872,-5.121198,17.140157,56.711794,49.029087,1.385698,36.355235,32.789784
7,id_03,inter_0,m_1,3.138897,20.276016,22.56688,31.387492,28.961455,15.425563,39.756316,...,-15.591731,14.280381,8.831478,-0.738318,15.462841,41.86683,36.940889,-1.620859,24.428108,24.160076
8,id_04,inter_0,m_0,19.260055,20.273837,22.223362,28.476257,21.125335,13.468284,26.250065,...,2.48852,-0.302855,7.504816,-6.820916,4.655153,27.348283,23.078876,3.521254,20.390725,16.166707
9,id_04,inter_0,m_1,6.030439,11.189941,12.093548,16.189537,13.215034,7.972456,17.476467,...,-3.488609,4.332673,4.038821,-1.833095,5.500579,18.483268,15.796191,0.800187,11.795404,10.827901


In [5]:
print(post_df.shape)
post_df.head(10)

(200, 23)


Unnamed: 0,unit,intervention,metric,t_80,t_81,t_82,t_83,t_84,t_85,t_86,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_00,inter_2,m_0,1.046375,1.120336,3.40908,0.579015,-0.226254,1.096655,-0.252373,...,1.333761,0.842051,0.553218,0.837121,1.225249,2.190254,1.777892,0.880357,1.410196,2.424399
1,id_00,inter_2,m_1,8.079868,6.509029,20.250711,5.937675,3.050818,7.65085,4.862199,...,7.802074,8.680566,1.587778,5.654865,6.452734,16.055911,12.550554,4.433613,9.84301,15.622067
2,id_01,inter_2,m_0,1.327595,8.012399,14.368356,3.229842,8.889535,8.630499,3.465975,...,4.600228,9.907329,6.34555,7.563202,8.045412,12.983559,10.181944,3.340502,8.756818,5.844087
3,id_01,inter_2,m_1,19.405544,21.850598,63.670029,13.84249,7.327605,23.634149,7.224157,...,22.600771,24.550803,10.505924,20.546585,23.542425,46.135326,39.010355,13.281347,30.058138,44.908708
4,id_02,inter_2,m_0,4.256829,7.236064,17.801059,3.615696,3.844498,7.484236,2.405484,...,6.143103,8.143006,4.709958,7.054875,7.974016,13.245126,11.21848,3.521969,9.033784,11.388154
5,id_02,inter_2,m_1,29.90147,34.616857,97.069076,21.008687,8.831341,35.242273,12.827985,...,35.777876,37.373118,17.662571,33.742601,38.441941,68.736694,59.928457,18.543991,45.226235,70.032089
6,id_03,inter_0,m_0,4.461698,4.745338,6.897251,1.345363,3.034831,5.351586,3.900153,...,5.951928,3.7265,2.542456,6.328317,5.465327,3.797218,7.647887,3.792824,6.189486,8.360353
7,id_03,inter_0,m_1,18.789796,23.612601,61.177186,15.022222,5.400588,21.466571,13.38593,...,23.343442,26.162445,12.436878,22.990667,26.420119,44.65815,37.252261,8.304883,27.726749,44.272594
8,id_04,inter_1,m_0,1.45,-1.204029,0.400429,-0.151656,-4.193691,-1.90574,-0.944652,...,0.620415,-2.020342,-1.16474,-0.747498,-0.673726,-1.450385,-0.498538,-0.708035,-0.995319,2.438665
9,id_04,inter_1,m_1,5.61461,6.190326,11.313809,4.425668,2.095826,4.651983,8.0497,...,6.818732,7.036578,2.676783,6.802338,6.983314,8.761364,8.589859,0.882602,6.227651,10.93124


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter to the method is $\textit{cum_energy} \in [0, 100]$, which determines the number of principal components retained by the Principal Component Regression step used to learn the linear coefficients. In essence, we find the minimum number of principal components required such that the percentage of the spectral energy retained is above the given parameter. 
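The component-selection rule and the regression step described above can be sketched in a few lines of NumPy. This is a sketch under stated assumptions rather than the code inside fill_tensor: the helper names rank_from_energy and pcr_fit are hypothetical, spectral energy is taken to mean the sum of squared singular values, and the threshold is treated as inclusive.

```python
import numpy as np

def rank_from_energy(singular_values, cum_energy):
    """Smallest k whose top-k squared singular values hold at least cum_energy percent."""
    energy = singular_values ** 2
    cumulative = 100.0 * np.cumsum(energy) / energy.sum()
    # searchsorted gives the first index whose cumulative share reaches cum_energy.
    k = int(np.searchsorted(cumulative, cum_energy)) + 1
    return min(k, len(singular_values))

def pcr_fit(X, y, cum_energy):
    """Principal Component Regression: least squares on the top-k components of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = rank_from_energy(s, cum_energy)
    # Pseudo-inverse of X restricted to the k retained components, applied to y.
    beta = Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])
    return beta, k

print(rank_from_energy(np.array([10.0, 5.0, 1.0, 0.1]), 95))    # 2
print(rank_from_energy(np.array([10.0, 5.0, 1.0, 0.1]), 99.5))  # 3

# Toy use: learn weights on pre-intervention rows, apply them to post-intervention rows.
rng = np.random.default_rng(0)
T0, T, n_donors = 40, 50, 8
latent = rng.normal(size=(T, 3)) @ rng.normal(size=(3, n_donors))  # rank-3 donor signal
donors = latent + 0.05 * rng.normal(size=latent.shape)             # noisy donor units
target = latent @ rng.normal(size=n_donors)                        # target unit trajectory

beta, k = pcr_fit(donors[:T0], target[:T0], cum_energy=99)
counterfactual = donors[T0:] @ beta
print(k, counterfactual.shape)
```

A lower cum_energy keeps fewer components, which discards more noise but risks dropping signal; a value near 100 approaches ordinary least squares.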

The output of $\textbf{fill_tensor}$ is an order-4 tensor (flattened), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times I \times M \times (T - T_0)}$, termed $\textit{df_output}$. It contains the counterfactual estimates for every unit $n \in [N]$, each intervention $i \in [I]$, and each metric $m \in [M]$ over the entire post-intervention period, $T - T_0$. 

This dataframe is the desired counterfactual output!

In [6]:
df_output,_ = fill_tensor(pre_df, post_df, rank=5, full_matrix_denoise=True)
df_output.head(15)

Unnamed: 0,unit,intervention,metric,t_80,t_81,t_82,t_83,t_84,t_85,t_86,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_00,inter_0,m_0,1.041259,1.243706,0.629683,0.050392,0.86808,1.251146,1.431885,...,1.687278,0.76822,0.809699,2.047595,1.596857,-0.04513,1.704866,0.897511,1.468553,1.79405
1,id_00,inter_0,m_1,4.476678,5.786466,11.569171,4.206805,2.086507,4.32764,6.668685,...,5.62363,6.950059,2.864769,5.924216,6.407983,9.385955,7.731021,0.469239,5.610016,9.487208
2,id_00,inter_1,m_0,-0.207438,0.551814,0.84465,-0.257384,-0.201926,0.201155,-0.339666,...,0.315476,0.277644,0.735143,0.688966,0.766936,0.269901,0.506737,0.027649,0.277304,0.32641
3,id_00,inter_1,m_1,6.432164,2.561685,4.241834,6.02701,5.248634,3.350927,11.525413,...,4.280987,6.293842,-2.230761,1.679874,1.368679,7.121242,4.618851,0.584336,3.732123,7.02888
4,id_00,inter_2,m_0,0.895248,0.970277,3.345463,0.399224,-0.205973,1.043833,-0.207274,...,1.074145,0.847149,0.497137,0.941457,1.184262,2.127806,1.9239,0.760257,1.387436,2.387926
5,id_00,inter_2,m_1,8.075119,6.487576,20.313199,5.997235,3.153768,7.714541,4.821663,...,7.670427,8.42458,1.54644,5.585451,6.368781,15.881467,12.633861,4.338772,9.794173,15.667751
6,id_01,inter_0,m_0,2.739323,4.045182,2.999232,0.754381,3.554753,4.159921,4.588905,...,4.73223,3.166138,2.509467,5.722103,4.842993,1.454663,5.386812,2.520107,4.613421,5.057479
7,id_01,inter_0,m_1,9.691223,16.568773,34.193405,8.654499,3.202106,12.01407,12.088171,...,14.850958,17.454916,10.68795,17.683052,19.612906,24.794998,21.787018,2.00906,15.739714,25.212471
8,id_01,inter_1,m_0,1.352181,-0.994019,0.591382,-0.299433,-4.340116,-1.798111,-1.131705,...,0.618212,-2.218581,-1.263271,-0.65375,-0.369455,-1.401625,-0.47044,-0.527352,-0.951883,2.52191
9,id_01,inter_1,m_1,7.389189,6.725106,12.409572,6.070943,3.41838,5.497536,10.932096,...,7.816152,8.501599,1.863327,6.910751,7.150024,10.709796,9.565117,1.089798,7.049749,12.781059


## Section 3 - Diagnostic: For Which Interventions Can We Reliably Produce Counterfactuals?

In this section, we show how to use our diagnostic tool, termed $\textbf{diagnostic}$. 

$\textbf{diagnostic}$ is a function to assess whether the counterfactual estimates produced are reliable. Recall that, in reality, we do not get access to the true counterfactual outcomes. Hence, we need a test to see whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the (unit, intervention) pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). 

For each intervention, we report the average $R^2$ value over all units which received that particular intervention. Note that what is considered a "good enough" $R^2$ depends greatly on the application itself. For example, if the post-intervention trajectory is very stable, it has little variance for $R^2$ to explain, so an $R^2$ close to zero should be considered excellent (note we recreate the post-intervention trajectory using only pre-intervention data).
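The per-intervention average $R^2$ described above can be sketched as follows. This is an illustration under assumptions, not the implementation of diagnostic: the helper names are hypothetical, the toy frames only mimic the column layout of post_df and df_output, and the average is taken over all (unit, metric) rows that received a given intervention.

```python
import numpy as np
import pandas as pd

def r2_score_1d(actual, predicted):
    """Coefficient of determination for one trajectory: 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def average_r2_by_intervention(observed, estimated,
                               id_cols=("unit", "intervention", "metric")):
    """Hypothetical helper: mean R^2 per intervention over the observed rows."""
    id_cols = list(id_cols)
    time_cols = [c for c in observed.columns if c not in id_cols]
    # Inner merge keeps only the (unit, intervention, metric) rows actually observed.
    merged = observed.merge(estimated, on=id_cols, suffixes=("_obs", "_est"))
    obs = merged[[c + "_obs" for c in time_cols]].to_numpy()
    est = merged[[c + "_est" for c in time_cols]].to_numpy()
    merged["r2"] = [r2_score_1d(a, p) for a, p in zip(obs, est)]
    return merged.groupby("intervention")["r2"].mean()

cols = ["unit", "intervention", "metric", "t_0", "t_1", "t_2", "t_3"]
toy_observed = pd.DataFrame([
    ["a", "inter_0", "m_0", 1.0, 2.0, 3.0, 4.0],
    ["b", "inter_1", "m_0", 2.0, 4.0, 6.0, 8.0],
], columns=cols)
toy_estimated = pd.DataFrame([
    ["a", "inter_0", "m_0", 1.0, 2.0, 3.0, 4.0],   # perfect reconstruction
    ["a", "inter_1", "m_0", 0.0, 0.0, 0.0, 0.0],   # unobserved pair, ignored
    ["b", "inter_0", "m_0", 0.0, 0.0, 0.0, 0.0],   # unobserved pair, ignored
    ["b", "inter_1", "m_0", 2.0, 4.0, 6.0, 6.0],   # last point is off by 2
], columns=cols)

avg_r2 = average_r2_by_intervention(toy_observed, toy_estimated)
print(avg_r2)  # inter_0 -> 1.0, inter_1 -> 0.8

# A very stable trajectory: the prediction is off by at most 0.01, yet R^2 is about 0.
stable_r2 = r2_score_1d(np.array([5.0, 5.01, 4.99, 5.0]), np.full(4, 5.0))
print(stable_r2)
```

The last example shows why $R^2$ must be read in context: when the observed trajectory barely moves, even a nearly exact reconstruction explains almost none of its (tiny) variance.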

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,0.975448
1,inter_1,0.999048
2,inter_2,0.99993
