# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your own problems of interest.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Controlled Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin with generating artificial data for the purposes of the demonstration through the function random_rct. All the data can be captured through a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in N$ will receive exactly one intervention, $i \in I$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
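To make these terms concrete, here is a minimal NumPy sketch of a noisy rank-$r$ tensor with the dimensions defined above. It is not the implementation of random_rct: the helper name `low_rank_tensor`, the factor construction, and the choice to treat $\sigma$ as the standard deviation of the noise are all assumptions made for illustration.

```python
import numpy as np

def low_rank_tensor(N, T, I, r, sigma, seed=0):
    """Hypothetical helper: return (signal, noisy) tensors of shape (N, T, I)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(N, r))  # latent factors for units
    V = rng.normal(size=(T, r))  # latent factors for time periods
    W = rng.normal(size=(I, r))  # latent factors for interventions
    # Sum of r rank-one terms: signal[n, t, i] = sum_k U[n, k] * V[t, k] * W[i, k]
    signal = np.einsum("nk,tk,ik->nti", U, V, W)
    # Assumption: sigma is used as the standard deviation of mean-zero Gaussian noise
    noisy = signal + rng.normal(scale=sigma, size=signal.shape)
    return signal, noisy

signal, observed = low_rank_tensor(N=100, T=100, I=3, r=5, sigma=0.1)
print(observed.shape)

# Unfolding the noiseless tensor along the unit axis gives a matrix of rank at most r
unfolded_rank = np.linalg.matrix_rank(signal.reshape(100, -1))
print(unfolded_rank)
```

The low rank is what lets a small number of units explain the rest: each unit's trajectory is a combination of only $r$ latent factors.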

In [2]:
# Generate Artificial Data

# Number of Units
N = 100
# Number of Interventions
I = 3
# Number of Metrics
M = 2
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 100
# Number of Pre-Intervention Time Steps
T0 = 80
# Model Complexity
rank = 5
# Noise in System
sigma = 0.1

rct_data = random_rct(N, I, M, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by calling the function $\textbf{random_rct}$ is comprised of two dataframes: pre_df and post_df.

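pre_df is a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It contains the measurements of all units before any experiments are performed. In the dataframe itself, each row is a (unit, metric) pair: the identifier columns unit, intervention, and metric are followed by one column per time period.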

post_df is a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It contains the measurements of each unit $n \in N$ under the intervention it actually received (i.e., what was observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df has to have experienced an intervention, and a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, the artificial data here covers the case where each unit $n \in N$ in the pre-intervention phase receives exactly one intervention in the post-intervention phase.)

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(200, 83)


Unnamed: 0,unit,intervention,metric,t_00,t_01,t_02,t_03,t_04,t_05,t_06,...,t_70,t_71,t_72,t_73,t_74,t_75,t_76,t_77,t_78,t_79
0,id_00,inter_0,m_0,5.818466,14.263106,0.541721,4.459587,5.055252,1.920807,2.749669,...,5.545229,18.400446,0.901564,7.529213,2.560323,6.197569,4.240513,0.618799,9.727151,-5.930602
1,id_00,inter_0,m_1,1.191598,0.788922,3.934973,-1.811036,-1.655812,1.447808,2.373006,...,1.761042,-3.778158,-2.351568,-0.715229,-0.658271,-0.333546,0.485045,-3.176048,-2.600677,-0.885073
2,id_01,inter_0,m_0,7.449716,18.755338,-14.187302,16.392558,12.871086,-1.127206,-4.768315,...,3.216043,39.193222,8.826758,17.846203,4.79043,4.060449,1.522768,6.559596,21.04829,-8.374143
3,id_01,inter_0,m_1,-6.021998,-8.483948,3.401298,-8.352823,-12.790641,-5.319588,-5.943281,...,-6.815952,-18.756057,-7.250521,-8.725113,-3.965,-8.876095,-6.219541,-9.639573,-10.312372,2.20278
4,id_02,inter_0,m_0,1.949914,7.050053,-6.875846,4.407592,4.27034,-2.675864,-0.721439,...,-0.861139,13.997236,2.995105,4.979798,0.568155,2.954014,0.027271,4.161853,8.666283,-4.600576
5,id_02,inter_0,m_1,3.001377,-0.140994,5.569534,1.237464,3.03532,6.748306,4.762573,...,6.211063,-1.744985,1.022461,1.552115,2.01791,2.123185,3.61925,0.009753,-2.64997,2.38808
6,id_03,inter_0,m_0,-0.056116,6.166479,-5.691396,0.402563,0.033445,-5.729733,-1.651186,...,-3.570501,9.150889,-0.539044,1.498003,-1.184369,1.571347,-1.635834,1.530896,6.900166,-5.808246
7,id_03,inter_0,m_1,2.630851,-0.46614,5.370951,4.079705,0.664586,6.857062,-0.631439,...,6.345577,-0.915093,0.504235,4.034672,2.46529,-3.34725,1.366659,-5.429471,-4.011089,3.473634
8,id_04,inter_0,m_0,1.237674,4.814156,9.907347,-6.394761,-2.785975,2.347703,5.426976,...,3.934235,-2.533467,-4.768991,-3.635199,0.491474,6.02817,4.914971,-2.950912,-0.968105,-1.182726
9,id_04,inter_0,m_1,3.45401,4.117443,2.292614,3.877848,4.114358,4.783518,2.159354,...,5.431876,7.150293,2.014754,4.449441,2.692944,2.769945,3.556884,0.801057,2.637639,0.231265


In [5]:
print(post_df.shape)
post_df.head(10)

(200, 23)


Unnamed: 0,unit,intervention,metric,t_80,t_81,t_82,t_83,t_84,t_85,t_86,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_00,inter_1,m_0,10.437166,-5.364024,11.785204,15.20426,15.904065,18.125549,9.356111,...,10.049368,-0.375724,7.564144,12.324207,-1.604767,2.981076,5.944685,7.710025,10.091646,4.590837
1,id_00,inter_1,m_1,1.259687,0.719926,-0.207112,-0.966389,-2.500328,-2.602924,-4.254558,...,-0.318527,0.884053,-0.931733,-1.710366,-3.372587,-0.581951,0.052982,-0.696896,2.103891,-3.720143
2,id_01,inter_0,m_0,2.150661,0.555151,6.196134,9.393962,10.239029,19.171072,18.810847,...,7.248465,3.274003,10.255656,19.103953,10.476299,6.958766,10.865873,-1.300963,7.438331,18.422033
3,id_01,inter_0,m_1,-0.809275,14.136039,3.405656,-8.194293,-13.7888,0.734532,12.911311,...,6.54472,21.125109,9.423461,21.826451,10.094682,14.046422,30.803173,-21.915524,19.87009,29.274243
4,id_02,inter_0,m_0,1.785837,-1.223674,0.515665,4.710289,7.439506,7.045982,3.55551,...,1.414565,-2.560563,1.730991,1.887425,1.471387,0.062218,-2.477668,5.232277,-1.430864,0.510628
5,id_02,inter_0,m_1,-2.145145,-6.720906,-2.996366,-0.549793,0.105084,-7.763923,-10.0111,...,-5.428011,-9.643489,-7.34156,-14.14837,-6.338864,-8.007871,-14.735962,6.644899,-10.41824,-16.270272
6,id_03,inter_1,m_0,1.389037,-7.297165,7.119737,8.556439,6.204093,8.114115,6.149568,...,3.040091,-3.538684,2.979941,5.736189,0.215361,-0.355633,0.553095,1.62396,2.514185,0.889048
7,id_03,inter_1,m_1,4.526204,5.861987,-4.073447,-0.853909,3.389862,1.419314,-4.336397,...,1.050797,2.109788,-0.041182,-2.566897,-1.793804,0.701421,-1.732149,5.307852,-0.04605,-2.370198
8,id_04,inter_0,m_0,-5.949948,3.134058,-7.367061,-7.107861,-7.404086,-7.84914,-4.344003,...,-5.970607,-0.833602,-3.67509,-6.191092,1.615584,-2.003868,-4.609218,-3.036107,-6.432823,-2.981399
9,id_04,inter_0,m_1,-2.032615,-2.710731,-0.672282,0.129538,-0.286115,-1.460539,-1.04044,...,-1.914428,-3.074267,-1.501952,-2.726924,-0.183921,-2.173804,-4.15444,0.751741,-3.572616,-3.435741


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter of the method is $\textit{cum_energy} \in [0, 100]$, which determines the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components required such that the percentage of the spectral energy retained is above the given parameter. 
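The energy rule described above can be sketched in a few lines of NumPy. This is an illustration rather than the code inside $\textbf{fill_tensor}$: the function name `components_for_energy` is hypothetical, spectral energy is assumed to mean the squared singular values, and the threshold is treated as "at least" so that a value of 100 remains attainable.

```python
import numpy as np

def components_for_energy(X, cum_energy):
    """Smallest k such that the top-k squared singular values of X
    hold at least cum_energy percent of the total (hypothetical helper)."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    energy = 100.0 * np.cumsum(s**2) / np.sum(s**2)
    # First position where the cumulative percentage reaches the threshold
    k = int(np.searchsorted(energy, cum_energy)) + 1
    # Guard against floating-point round-off when cum_energy is 100
    return min(k, len(s))

# Singular values 3, 2, 1 give cumulative energies of about 64.3%, 92.9%, 100%
X = np.diag([3.0, 2.0, 1.0])
print(components_for_energy(X, 60))  # 1
print(components_for_energy(X, 90))  # 2
print(components_for_energy(X, 95))  # 3
```

A higher threshold keeps more components, which fits the pre-intervention data more closely but also retains more of the noise.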

The output of $\textbf{fill_tensor}$ is an order-3 tensor (flattened), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. This contains the counterfactual estimates for every unit $n \in [N]$ and for each intervention $i \in [I]$, over the entire post-intervention period, $T - T_0$. 

This dataframe is the desired counterfactual output.
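Because the output is flattened, it can help to reshape it back into an array with one axis per unit, time period, and intervention. The sketch below does this with pandas on a small made-up frame that mimics the column layout shown in this notebook. The helper `to_tensor` is hypothetical, and it assumes every unit has a row for every intervention.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column layout as the dataframes here (values are made up)
units = ["id_00", "id_01"]
interventions = ["inter_0", "inter_1", "inter_2"]
time_cols = ["t_80", "t_81"]
index = pd.MultiIndex.from_product(
    [units, interventions, ["m_0"]], names=["unit", "intervention", "metric"]
)
values = np.arange(len(index) * len(time_cols), dtype=float)
toy = pd.DataFrame(
    values.reshape(len(index), len(time_cols)), index=index, columns=time_cols
).reset_index()

def to_tensor(df, metric):
    """Return an array of shape (units, time periods, interventions) for one metric."""
    sub = df[df["metric"] == metric].sort_values(["unit", "intervention"])
    t_cols = [c for c in sub.columns if c.startswith("t_")]
    n_units = sub["unit"].nunique()
    n_inter = sub["intervention"].nunique()
    # Rows are unit-major, so reshape to (unit, intervention, time) and swap the last two axes
    arr = sub[t_cols].to_numpy().reshape(n_units, n_inter, len(t_cols))
    return arr.transpose(0, 2, 1)

tensor = to_tensor(toy, "m_0")
print(tensor.shape)  # (2, 2, 3)
```

With this layout, `tensor[n, :, i]` is the estimated post-intervention trajectory of unit `n` under intervention `i`.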

In [6]:
df_output,_ = fill_tensor(pre_df, post_df, rank=5, full_matrix_denoise=True)
df_output.head(15)

Unnamed: 0,unit,intervention,metric,t_80,t_81,t_82,t_83,t_84,t_85,t_86,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_00,inter_0,m_0,-3.291934,0.481849,-2.633862,-0.037838,1.485204,2.849143,4.40344,...,-1.637284,-1.591953,0.803015,1.630118,5.089593,0.184813,-2.078802,0.0826,-4.146803,3.449633
1,id_00,inter_0,m_1,-1.754151,-0.561884,-1.953406,-2.53783,-1.768498,-4.369149,-2.880129,...,-2.254225,-1.795054,-2.576726,-4.70544,-0.861882,-1.779427,-3.481456,0.632921,-3.814479,-3.121752
2,id_00,inter_1,m_0,10.433215,-5.45238,11.83314,15.264136,15.816237,18.19908,9.278848,...,10.275293,-0.380381,7.687341,12.241812,-1.469268,3.045876,5.89124,7.519768,10.266778,4.573103
3,id_00,inter_1,m_1,1.321379,0.763378,-0.351702,-0.860056,-2.488196,-2.613963,-4.357056,...,-0.229134,0.992704,-0.832863,-1.952683,-3.371736,-0.637246,0.018821,-0.636853,2.009404,-3.716429
4,id_00,inter_2,m_0,-1.326398,-1.265278,2.250922,2.383375,7.032708,7.058358,11.275781,...,2.423368,-1.022994,2.755494,5.756785,7.588047,2.740308,2.465984,2.276477,-2.995242,10.739708
5,id_00,inter_2,m_1,0.834059,3.494568,-1.250303,-4.66114,2.513471,-2.873066,1.423253,...,1.141519,1.96142,-1.705147,-3.122648,2.149938,2.226915,1.417744,3.588673,-3.731581,5.335092
6,id_01,inter_0,m_0,2.109308,0.318715,6.285966,9.511592,10.298038,19.219674,18.819356,...,7.28206,3.386085,10.260167,19.227856,10.567017,6.874074,10.884178,-1.262127,7.438146,18.459863
7,id_01,inter_0,m_1,-0.768932,14.019544,3.458699,-8.102858,-13.843339,0.768346,12.824771,...,6.574244,21.025947,9.365959,21.789447,10.170427,14.058303,30.9226,-21.932432,19.829703,29.269093
8,id_01,inter_1,m_0,9.616436,-7.414058,12.840185,24.849338,29.672621,36.795346,25.156345,...,12.995063,-4.338132,14.420224,23.737958,9.502603,5.435661,4.546157,12.96581,7.233014,15.531712
9,id_01,inter_1,m_1,-2.765887,-1.296852,-4.925995,-7.212266,-9.108004,-15.150993,-14.714159,...,-6.372255,-3.344632,-8.14877,-14.929882,-8.138068,-5.951488,-9.176321,-0.010492,-6.234528,-14.89285


## Section 3 - Diagnostic: For Which Interventions Can We Reliably Produce Counterfactuals?

In this section, we show how to use our diagnostic tool, $\textbf{diagnostic}$. 

$\textbf{diagnostic}$ is a function to assess whether the counterfactual estimates produced are reliable. Recall that, in reality, we never observe the true counterfactual outcomes. Hence, we need a test to see whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the unit, intervention pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). 
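The idea behind this check can be sketched on toy data: learn linear weights over other units from pre-intervention periods only, then see how well those weights reproduce the target unit's observed post-intervention trajectory. The code below is a generic principal component regression sketch under assumed low-rank data. It is not the internals of $\textbf{diagnostic}$ or $\textbf{fill_tensor}$, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_donors, T0, T, rank = 20, 80, 100, 3

# Hypothetical low-rank data: rows are units, columns are time periods
unit_factors = rng.normal(size=(n_donors + 1, rank))
time_factors = rng.normal(size=(rank, T))
data = unit_factors @ time_factors + rng.normal(scale=0.1, size=(n_donors + 1, T))
target, donors = data[0], data[1:]

def pcr_weights(X, y, k):
    """Regress y on X after truncating X to its top-k singular components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Apply the pseudo-inverse of the rank-k approximation of X to y
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

# Learn weights from pre-intervention periods only (design matrix is time x donors)
w = pcr_weights(donors[:, :T0].T, target[:T0], k=rank)

# Apply the same weights to the donors' post-intervention outcomes
prediction = donors[:, T0:].T @ w
rmse = np.sqrt(np.mean((prediction - target[T0:]) ** 2))
print(f"post-period RMSE: {rmse:.3f}")
```

If the post-period error is small relative to the scale of the data, the relationship learned before the intervention has carried over, which is the property the diagnostic is meant to verify.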

For each intervention, we report the average $R^2$ score over all units that received that particular intervention. Note that what counts as a "good enough" $R^2$ depends greatly on the application itself. For example, if the post-intervention trajectory is very stable, there is little variance to explain, so even an $R^2$ close to zero should be considered excellent (note that we recreate the post-intervention trajectory using only pre-intervention data).
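The dependence of $R^2$ on the stability of the trajectory is easy to see numerically. The sketch below uses the standard definition $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$. Whether $\textbf{diagnostic}$ computes exactly this quantity is an assumption, and the trajectories are made up.

```python
import numpy as np

def r2_score(observed, estimated):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ss_res = np.sum((observed - estimated) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# The same estimation errors are applied to a stable and a volatile trajectory
errors = np.array([0.1, -0.1, 0.1, -0.1])
stable = np.array([10.0, 10.1, 9.9, 10.0])
volatile = np.array([0.0, 10.0, -10.0, 5.0])

r2_stable = r2_score(stable, stable + errors)
r2_volatile = r2_score(volatile, volatile + errors)
print(f"stable trajectory:   R^2 = {r2_stable:.3f}")
print(f"volatile trajectory: R^2 = {r2_volatile:.3f}")
```

Both estimates are off by only 0.1 at every time step, yet the stable trajectory scores far lower because its total variance is tiny. For this reason, the absolute size of the errors is worth checking alongside $R^2$.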

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,0.999447
1,inter_1,0.99975
2,inter_2,0.999702
