# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your problems of interest too.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Control Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin by generating artificial data for the purposes of the demonstration, using the function $\textbf{random_rct}$. All the data can be captured in a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of the mean-zero Gaussian noise.

In [2]:
# Generate Artificial Data

# Number of Units
N = 100
# Number of Interventions
I = 4
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 100
# Number of Pre-Intervention Time Steps
T0 = 80
# Model Complexity
rank = 5
# Noise in System
sigma = 0.1

rct_data = random_rct(N, I, T, T0, rank, sigma)
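
To make the roles of $N$, $T$, $I$, $r$ and $\sigma$ concrete, here is a minimal sketch of how a noisy low-rank tensor could be generated with NumPy. It is not the implementation of random_rct, whose internals are not shown in this notebook; the function name `low_rank_noisy_tensor`, the factor construction, and the use of $\sigma$ as a standard deviation are all assumptions made for illustration.

```python
import numpy as np

def low_rank_noisy_tensor(N, T, I, r, sigma, seed=0):
    """Hypothetical generator: a tensor of shape (N, T, I) built from r factors, plus noise."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(N, r))  # one latent vector per unit
    V = rng.normal(size=(T, r))  # one latent vector per time step
    W = rng.normal(size=(I, r))  # one latent vector per intervention
    # M[n, t, i] = sum_k U[n, k] * V[t, k] * W[i, k]
    M = np.einsum("nk,tk,ik->nti", U, V, W)
    # Assumption: sigma is used as the standard deviation of the Gaussian noise
    noisy = M + rng.normal(scale=sigma, size=M.shape)
    return M, noisy

M, noisy = low_rank_noisy_tensor(N=100, T=100, I=4, r=5, sigma=0.1)
print(M.shape)
# Each N x T slice of the noiseless tensor has matrix rank at most r
print(np.linalg.matrix_rank(M[:, :, 0], tol=1e-8))
```

In this construction every intervention slice shares the same unit and time factors, which is what allows information from one intervention to be transferred to another.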

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by the function $\textbf{random_rct}$ consists of two dataframes: pre_df and post_df.

pre_df is a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It contains the measurements of all units before any experiments are performed.

post_df is a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It contains the measurements of each unit $n \in [N]$ under the intervention it experienced (i.e., what was actually observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df has to have experienced an intervention. Further, a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, we illustrate on artificial data the case where each unit $n \in [N]$ in the pre-intervention phase receives exactly one intervention in the post-intervention phase.)

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(100, 82)


Unnamed: 0,unit,intervention,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,...,t_70,t_71,t_72,t_73,t_74,t_75,t_76,t_77,t_78,t_79
0,id_0,inter_0,1.495768,-0.201,0.358933,-0.813556,-1.017978,-0.091478,-0.857365,-0.87112,...,1.720859,0.02354,-0.481961,1.739294,0.339026,-1.042633,-0.40779,-0.497457,-0.846573,2.151636
1,id_1,inter_0,7.645983,3.276851,12.716469,-1.640672,5.632779,4.15915,4.787322,1.325641,...,4.497292,2.951749,5.576608,3.417784,-1.395884,0.401565,2.206797,3.891796,9.702537,4.317075
2,id_2,inter_0,6.250636,5.334878,15.001003,-3.110473,10.5731,4.101422,5.871269,1.648434,...,2.31044,3.68294,6.258958,-0.381378,-5.302877,0.697352,1.332026,4.079616,12.999016,-0.363536
3,id_3,inter_0,2.965146,3.627545,10.592168,-1.146299,7.062997,2.63071,4.473882,1.2496,...,0.303025,2.173588,4.052843,-0.191769,-3.83322,1.381974,2.870841,3.267359,7.901464,-1.296438
4,id_4,inter_0,3.58694,2.419157,5.537741,-3.52313,5.68227,0.918837,1.034554,0.264543,...,2.325516,1.761057,2.386158,-0.905853,-3.709152,-0.511922,-2.290847,0.740472,5.592706,-1.123005
5,id_5,inter_0,2.934123,3.979882,8.748919,0.517217,7.285343,2.209436,5.211275,3.888715,...,0.853956,2.95015,5.268225,-0.460827,-2.144184,3.398092,1.775538,3.259329,7.29412,-1.698023
6,id_6,inter_0,0.202602,0.475556,2.011259,0.064333,0.465473,0.53632,0.692774,-0.413128,...,-0.289493,0.136281,0.42279,0.38851,-0.434467,-0.215599,1.341997,0.739662,1.196416,0.528271
7,id_7,inter_0,6.74623,3.915533,12.240166,3.446429,4.517724,4.509876,7.451822,4.801271,...,3.792699,3.697285,7.137552,4.959602,2.206557,4.496673,5.683479,5.149363,7.843637,4.556599
8,id_8,inter_0,1.238783,3.508645,8.782042,-4.169939,7.489431,0.963326,1.950305,-1.264053,...,-0.802083,1.102085,2.108998,-2.709494,-6.962797,-0.86868,0.225855,1.484388,6.956697,-3.243793
9,id_9,inter_0,-1.568134,3.409323,7.463871,-1.800337,7.124271,0.448494,2.922008,-0.104015,...,-3.32414,0.726493,1.793202,-3.919237,-6.1157,0.492083,2.411598,1.76581,5.732004,-5.184324


In [5]:
print(post_df.shape)
post_df.head(10)

(100, 22)


Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_2,0.983968,-1.95172,-0.569973,-2.458336,-1.017542,-1.17648,-0.953885,-2.473073,...,1.212167,0.59453,-1.061536,1.502074,0.623971,-0.663589,-0.727676,-0.55038,0.856172,-0.830004
1,id_1,inter_0,4.016003,2.596847,3.148551,3.539943,5.896531,1.767036,5.249763,3.244558,...,8.402391,5.756726,6.38138,4.987453,3.68525,0.740382,5.350743,2.357233,3.98332,4.356562
2,id_2,inter_1,9.095868,12.282542,3.581835,7.838116,11.470001,1.946698,16.197268,8.337869,...,24.189286,11.220028,13.853375,10.39804,12.822395,1.177943,13.542937,8.365693,11.72,12.465511
3,id_3,inter_2,5.537269,5.111228,2.194567,8.397156,4.843064,4.796494,6.990172,4.419326,...,10.899573,7.325989,7.532501,1.103891,-1.129539,2.722186,6.040072,4.12213,1.834581,3.41269
4,id_4,inter_2,4.607762,2.590153,-1.785891,4.086367,-1.71021,3.022361,3.403258,-0.742629,...,7.783094,3.778261,1.101649,-1.620874,-3.700477,1.552719,1.103118,3.206045,0.034241,-0.364486
5,id_5,inter_0,4.859507,20.18555,4.934996,12.511263,11.942193,7.64636,17.804421,10.796009,...,15.119053,5.683825,15.038932,2.662803,12.143129,12.210309,8.779997,7.564848,5.317473,8.950923
6,id_6,inter_0,0.227199,-2.783603,0.438575,-0.270844,0.135825,-0.383912,-1.403781,-0.150263,...,-0.429032,1.089029,-0.137151,0.758385,-1.611071,-1.899289,0.558817,-0.726686,-0.234575,-0.077441
7,id_7,inter_1,7.574542,6.536469,0.846613,5.081048,4.084102,2.627264,9.027687,2.002496,...,15.582704,7.849066,7.099459,4.460504,4.015745,2.309328,5.988338,5.001742,5.405906,4.628218
8,id_8,inter_2,4.563519,1.743871,1.141119,6.542345,1.862858,4.288691,3.350062,2.092774,...,7.048116,6.029259,4.329107,-0.509904,-4.547064,1.614227,3.360762,2.859448,-0.228995,0.681712
9,id_9,inter_2,1.88452,2.641337,3.207508,7.697096,4.41979,4.882033,3.312312,5.28447,...,3.168178,4.575382,5.829953,-1.210245,-3.487288,2.510774,3.797634,1.726964,-1.228089,1.743555
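
The shapes printed above, (100, 82) and (100, 22), are consistent with two metadata columns (unit, intervention) followed by one column per time step: $T_0 + 2 = 82$ and $(T - T_0) + 2 = 22$. The sketch below builds a small toy dataframe in the same layout and separates the metadata from the numeric measurements. The toy values and the names `toy_pre`, `meta_cols` and `time_cols` are hypothetical and not part of the package.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above: two metadata columns + one column per time step
T0 = 3
toy_pre = pd.DataFrame({
    "unit": ["id_0", "id_1"],
    "intervention": ["inter_0", "inter_0"],
    **{f"t_{t}": np.arange(2, dtype=float) + t for t in range(T0)},
})

meta_cols = ["unit", "intervention"]
time_cols = [c for c in toy_pre.columns if c not in meta_cols]

# Numeric measurement matrix of shape (number of units, T0)
measurements = toy_pre[time_cols].to_numpy()
print(toy_pre.shape, measurements.shape)  # (2, 5) (2, 3)
```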


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter of the method is $\textit{cum_energy} \in [0, 100]$, which determines the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components such that the percentage of spectral energy retained is above the given parameter. Note that the call in the cell below passes $\textit{rank}=5$ rather than $\textit{cum_energy}$.
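
As an illustration of the spectral energy rule, the sketch below picks the smallest number of components whose cumulative share of squared singular values reaches a threshold given in percent. This is a generic computation, not the package's code: the helper name `num_components_for_energy`, the definition of spectral energy as the sum of squared singular values, and the use of "at least" rather than "strictly above" are assumptions.

```python
import numpy as np

def num_components_for_energy(X, cum_energy):
    """Smallest k such that the top-k squared singular values hold >= cum_energy percent."""
    s = np.linalg.svd(X, compute_uv=False)
    # Cumulative percentage of spectral energy captured by the top-k components
    energy = 100.0 * np.cumsum(s**2) / np.sum(s**2)
    # First position where the cumulative percentage reaches the threshold
    k = int(np.searchsorted(energy, cum_energy, side="left")) + 1
    # Guard against round-off when cum_energy is 100
    return min(k, len(s))

# Singular values 3, 2, 1 give cumulative energies of about 64.3%, 92.9%, 100%
X = np.diag([3.0, 2.0, 1.0])
print(num_components_for_energy(X, 50))  # 1
print(num_components_for_energy(X, 90))  # 2
```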

The output of $\textbf{fill_tensor}$ is an order-3 tensor (flattened), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. It contains the counterfactual estimates for every unit $n \in [N]$ and each intervention $i \in [I]$ over the entire post-intervention period of length $T - T_0$. 

This dataframe is the desired counterfactual output!
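
The notebook does not show the internals of $\textbf{fill_tensor}$, so the following is only a generic sketch of Principal Component Regression for a single target unit and a single intervention, under stated assumptions: linear weights over "donor" units are learned from pre-intervention data using the top $k$ principal components, and are then applied to the donors' post-intervention outcomes. The function name `pcr_counterfactual`, its arguments, and the toy data are hypothetical.

```python
import numpy as np

def pcr_counterfactual(donor_pre, target_pre, donor_post, k):
    """
    donor_pre:  (num_donors, T0) pre-intervention data of the donor units
    target_pre: (T0,)            pre-intervention data of the target unit
    donor_post: (num_donors, T1) donors' post-intervention data under one intervention
    k:          number of principal components to keep
    """
    # Truncated SVD of the (time x donors) pre-intervention matrix
    U, s, Vt = np.linalg.svd(donor_pre.T, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Least squares against the rank-k approximation gives the PCR weights
    beta = Vt_k.T @ ((U_k.T @ target_pre) / s_k)
    # Apply the learned weights to the donors' post-intervention outcomes
    return donor_post.T @ beta, beta

# Noise-free toy check: all units share the same rank-2 latent structure
rng = np.random.default_rng(0)
r, T0, T1, num_donors = 2, 30, 10, 8
V_pre, V_post = rng.normal(size=(T0, r)), rng.normal(size=(T1, r))
U_d, u = rng.normal(size=(num_donors, r)), rng.normal(size=r)
donor_pre, donor_post = U_d @ V_pre.T, U_d @ V_post.T
target_pre, target_post_true = V_pre @ u, V_post @ u

estimate, beta = pcr_counterfactual(donor_pre, target_pre, donor_post, k=r)
# Maximum absolute error between the estimate and the true post-intervention trajectory
print(np.max(np.abs(estimate - target_post_true)))
```

In this noise-free toy setting the weights learned before the intervention transfer exactly to the post-intervention period; with noisy data, truncating to $k$ components acts as a denoising step.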

In [6]:
df_output = fill_tensor(pre_df, post_df, rank=5, full_matrix_denoise=True)
df_output.head(10)

Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_0,2.845067,-7.130692,2.045857,-5.465059,-0.45141,-1.189335,-2.947219,-6.422289,...,-1.396254,1.903245,-0.377648,4.322258,1.019737,2.083115,-3.606877,-3.195472,-0.051842,-4.117016
1,id_0,inter_1,2.38768,0.262365,-0.084562,-0.120248,-0.044926,0.105434,1.231263,-1.307589,...,3.634724,1.953482,0.887023,1.335061,0.688278,0.73314,0.298067,0.786351,1.228518,-0.204191
2,id_0,inter_2,1.232797,-2.008893,-0.574576,-2.407596,-1.097493,-1.174884,-0.84045,-2.511288,...,1.032837,0.543548,-0.998869,1.503089,0.623558,-0.696774,-0.711162,-0.392888,0.912138,-0.88555
3,id_0,inter_3,6.527703,-8.498401,4.483266,-6.280941,2.10555,-0.78975,-1.267327,-8.270074,...,1.992038,5.651355,2.827454,8.094337,3.454876,4.852443,-3.094606,-3.514993,1.605602,-4.512737
4,id_1,inter_0,4.149227,2.633119,3.257599,3.554752,5.915678,1.70945,5.456518,3.282728,...,8.515752,5.834665,6.632764,4.989461,3.768072,1.111488,5.441176,2.351176,3.821338,4.251443
5,id_1,inter_1,10.019953,9.344356,1.072881,6.508103,6.79303,1.973225,13.366316,4.079923,...,23.456041,10.932655,10.038526,7.809947,7.786049,0.581952,10.483316,7.877064,9.731775,8.690102
6,id_1,inter_2,9.405295,5.770491,0.178679,7.486747,3.845857,3.521285,9.433582,2.575998,...,18.712118,10.158283,7.276622,3.687583,0.369721,0.600386,7.654905,6.525392,5.27849,4.907884
7,id_1,inter_3,9.631833,5.437766,1.731016,5.355224,5.792565,2.185474,10.429077,2.188853,...,19.495726,10.703315,8.769955,7.352712,4.954057,1.125612,8.074118,5.79455,7.406237,6.137581
8,id_2,inter_0,3.941466,15.394012,4.599364,12.193521,12.238508,5.106976,14.778533,12.945577,...,15.583367,7.156548,13.681068,4.235016,9.014818,3.895439,12.475018,7.419999,6.787462,11.736528
9,id_2,inter_1,8.969045,12.334332,3.614116,7.888337,11.572115,2.081784,16.261236,8.177596,...,24.135041,11.257206,13.709911,10.400897,12.666766,1.21609,13.636263,8.316344,11.69771,12.506922


## Section 3 - Diagnostic: For Which Interventions Can We Reliably Produce Counterfactuals?

In this section we show how to use our diagnostic tool method, termed $\textbf{diagnostic}$. 

$\textbf{diagnostic}$ is a function to assess whether the counterfactual estimates produced are reliable. Recall that, in reality, we never observe the true counterfactuals. Hence, we need a test to see whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the (unit, intervention) pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). 

For each intervention, we report the average $R^2$ value over all units which received that particular intervention. Note that what is considered a "good enough" $R^2$ will depend greatly on the application itself. For example, if the post-intervention trajectory is very stable, then an $R^2$ close to zero should be considered excellent (note that we recreate the post-intervention trajectory using only pre-intervention data).
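
To make the $R^2$ discussion concrete, here is a short sketch using the standard coefficient of determination, $R^2 = 1 - \text{SS}_{\text{res}} / \text{SS}_{\text{tot}}$. Whether $\textbf{diagnostic}$ computes exactly this quantity is an assumption; the helper `r2_score` and the toy trajectories are hypothetical.

```python
import numpy as np

def r2_score(observed, predicted):
    """Standard coefficient of determination for one trajectory."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # variation around the mean
    return 1.0 - ss_res / ss_tot

# A trajectory with real variation: small errors yield an R^2 close to 1
r2_varying = r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])

# A very stable trajectory: errors of only 0.01 still yield an R^2 close to 0,
# because the trajectory barely varies around its own mean
r2_stable = r2_score([5.01, 4.99, 5.01, 4.99], [5.0, 5.0, 5.0, 5.0])

print(round(r2_varying, 4), round(abs(r2_stable), 4))
```

The second case shows why a low $R^2$ is not necessarily a bad sign when the observed trajectory is nearly flat.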

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,0.996913
1,inter_1,0.998815
2,inter_2,0.998795
3,inter_3,0.998451
