# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your problems of interest too.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Control Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin by generating artificial data for the purposes of the demonstration, using the function random_rct. All the data can be captured in a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
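
To make these terms concrete, here is a minimal sketch of how a noisy low-rank tensor of this shape could be built with NumPy. It is not the implementation of random_rct: the function name `toy_low_rank_tensor`, the factor construction, and the seed are assumptions made for illustration, and "rank" is taken to mean that the noiseless tensor is a sum of $r$ products of unit, time, and intervention factors.

```python
import numpy as np

def toy_low_rank_tensor(N, T, I, r, sigma, seed=0):
    """Build a noisy N x T x I tensor whose noiseless part is a sum of r factor products."""
    rng = np.random.default_rng(seed)
    unit_factors = rng.normal(size=(N, r))          # one latent vector per unit
    time_factors = rng.normal(size=(T, r))          # one latent vector per time period
    intervention_factors = rng.normal(size=(I, r))  # one latent vector per intervention
    # Entry (n, t, i) sums the product of the three factors over the r latent dimensions.
    signal = np.einsum("nk,tk,ik->nti", unit_factors, time_factors, intervention_factors)
    # The text describes sigma as a variance, so the standard deviation is sqrt(sigma).
    noise = rng.normal(scale=np.sqrt(sigma), size=signal.shape)
    return signal, signal + noise

signal, observed = toy_low_rank_tensor(N=20, T=100, I=4, r=2, sigma=0.3)
print(observed.shape)  # (20, 100, 4)
# Every N x T slice of the noiseless tensor has matrix rank at most r.
print(np.linalg.matrix_rank(signal[:, :, 0]))
```

The low rank is what lets measurements of some units under some interventions inform estimates for the others.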

In [2]:
# Generate Artificial Data

# Number of Units
N = 20
# Number of Interventions
I = 4
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 100
# Number of Pre-Intervention Time Steps
T0 = 80
# Model Complexity
rank = 2
# Noise in System
sigma = 0.3

rct_data = random_rct(N, I, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by calling the function $\textbf{random_rct}$ is comprised of two dataframes: pre_df and post_df.

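pre_df is a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It contains the measurements of all units before any intervention is applied. The dataframe stores these measurements alongside two label columns (unit and intervention), so it has $T_0 + 2$ columns.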

post_df is a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It contains the measurements of each unit $n \in [N]$ under the intervention it actually received (i.e., what was observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df needs to have experienced an intervention. Further, a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, we illustrate on artificial data the case where each unit $n \in [N]$ receives exactly one intervention in the post-intervention phase.)
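
As a small illustration of this layout, the sketch below builds two toy dataframes with the same column pattern used in this notebook (a unit column, an intervention column, then t_* measurement columns) and separates the measurements from the labels. The toy values and the helper name `measurements` are hypothetical and are not part of the package.

```python
import numpy as np
import pandas as pd

# Toy frames mimicking the column layout of pre_df and post_df (hypothetical values).
rng = np.random.default_rng(0)
units = [f"id_{n}" for n in range(3)]
T0, T = 4, 6
pre_cols = [f"t_{t}" for t in range(T0)]
post_cols = [f"t_{t}" for t in range(T0, T)]

toy_pre = pd.DataFrame(rng.normal(size=(3, T0)), columns=pre_cols)
toy_pre.insert(0, "intervention", "inter_0")
toy_pre.insert(0, "unit", units)

toy_post = pd.DataFrame(rng.normal(size=(3, T - T0)), columns=post_cols)
toy_post.insert(0, "intervention", ["inter_1", "inter_0", "inter_2"])
toy_post.insert(0, "unit", units)

def measurements(df):
    """Return only the t_* measurement columns as a NumPy array."""
    return df.filter(regex=r"^t_\d+$").to_numpy()

print(toy_pre.shape)                 # (3, 6): T0 measurements plus 2 label columns
print(measurements(toy_pre).shape)   # (3, 4): N x T0
print(measurements(toy_post).shape)  # (3, 2): N x (T - T0)
```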

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(20, 82)


Unnamed: 0,unit,intervention,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,...,t_70,t_71,t_72,t_73,t_74,t_75,t_76,t_77,t_78,t_79
0,id_0,inter_0,-0.543969,1.035281,0.335285,-2.555682,2.6036,3.330548,2.308211,2.249504,...,3.620165,2.946774,1.338134,0.392858,-0.257526,3.627959,1.997629,0.274973,-1.435348,-1.528083
1,id_1,inter_0,-0.78148,1.299026,0.384495,-2.011859,2.731967,3.353626,2.059981,2.405144,...,2.992642,2.666685,1.205059,-0.095806,-0.337704,3.56108,1.745255,0.167762,-1.21473,-1.695198
2,id_2,inter_0,-0.999271,1.830557,0.319194,-2.878327,2.382048,4.472951,2.954833,2.956211,...,4.359038,3.227194,2.439225,-0.160956,-0.045101,5.036835,1.830849,0.349336,-1.995426,-1.871206
3,id_3,inter_0,-0.87987,1.475915,0.277543,-2.47665,2.257743,3.626444,2.449501,2.904338,...,3.475637,2.767593,1.90891,0.185513,0.07233,3.583201,1.772059,0.316373,-1.159375,-2.481678
4,id_4,inter_0,-0.461549,0.573069,-0.256319,-1.054201,1.006761,1.014361,1.252979,1.043385,...,0.508701,1.160099,0.788002,0.034908,-0.438641,1.499891,0.459484,0.488051,-0.053759,-0.911183
5,id_5,inter_0,0.081441,-0.317876,-0.200582,-0.10055,0.192017,0.287962,0.304536,0.025017,...,0.796826,0.57056,-0.160258,0.137187,-0.23196,0.229496,0.260576,-0.360562,-0.414196,0.078419
6,id_6,inter_0,-0.367902,1.1145,0.094659,-0.743477,1.993186,1.28892,0.89646,0.704465,...,1.356249,0.572616,0.643817,0.130636,-0.151405,1.161551,0.801257,0.362274,-0.888404,-0.582009
7,id_7,inter_0,-0.016738,-0.071919,-0.2096,-0.225777,-0.002399,-0.781138,0.387208,-0.34627,...,-0.531161,-0.144301,-0.305539,0.048936,-0.198559,-0.024179,-0.828766,0.117565,-0.579308,-0.391557
8,id_8,inter_0,-0.646265,1.064724,0.146258,-1.60339,1.196794,2.038146,1.870229,2.247856,...,2.498607,2.17402,1.222932,0.041469,-0.077439,2.752732,1.341387,0.19858,-1.295207,-1.909178
9,id_9,inter_0,0.395085,0.52745,-0.216767,-0.842049,0.060176,0.246993,0.444907,0.497983,...,0.754923,0.092108,0.117535,-0.161976,0.14903,0.798693,0.417954,-0.178185,-0.010077,-0.893116


In [5]:
print(post_df.shape)
post_df.head(10)

(20, 22)


Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_2,2.607595,2.214482,3.410503,2.40136,0.515895,3.974431,2.835563,1.890971,...,-1.13471,3.398453,4.042698,2.281217,-1.063723,4.77112,0.061871,1.733043,1.901109,1.051834
1,id_1,inter_1,4.507972,3.984884,5.921716,3.752092,1.326133,7.440213,4.377454,2.762327,...,-2.241122,5.376661,6.688149,3.037655,-1.774281,7.713218,0.285415,3.70621,2.856715,1.432137
2,id_2,inter_0,5.203045,4.352889,3.900527,3.752476,2.314443,3.931999,3.000291,1.358707,...,-1.179044,5.060241,4.239721,-1.699311,1.338123,1.641116,1.649327,2.215679,2.899751,1.830919
3,id_3,inter_2,1.890438,1.680996,3.961021,2.308179,0.108864,4.266891,2.829423,1.974879,...,-1.617563,4.09118,4.025862,2.854966,-1.427293,5.761943,-0.094998,2.312598,1.706564,1.593751
4,id_4,inter_2,0.837066,0.555694,1.361343,0.779339,0.178458,1.640946,1.456444,1.159635,...,-0.353331,1.252278,1.974292,0.901775,-0.482753,1.82153,0.03565,0.812631,-0.080801,0.801606
5,id_5,inter_3,-1.737264,-0.797654,-0.556009,-0.716026,-0.243252,-0.24091,-0.597805,0.291241,...,0.882054,-0.997513,-0.652875,1.165549,-0.872476,0.016401,-1.132255,-0.037819,-0.476483,-0.13742
6,id_6,inter_0,2.287006,2.02987,2.058629,1.560432,1.209136,1.840112,1.058653,1.394925,...,-0.694764,1.770524,2.284348,-0.784267,1.166863,0.622704,1.111246,0.893623,1.366451,0.987251
7,id_7,inter_3,7.721763,6.734099,2.968177,5.595367,3.242239,4.72378,2.974887,2.03788,...,-2.427206,5.892734,4.058872,-5.909078,4.532702,-1.204515,3.471222,0.741829,3.908765,1.30665
8,id_8,inter_0,7.953523,6.232548,4.878187,6.155962,2.86563,5.48543,3.72584,1.625014,...,-2.67049,6.818784,4.760449,-4.916583,3.812741,0.607303,3.626891,1.7846,4.087518,1.842733
9,id_9,inter_0,11.493636,9.278815,5.023916,8.322555,4.287219,6.487012,5.204248,2.517271,...,-4.041629,8.976961,6.425834,-8.494813,5.359257,-0.587271,6.255579,1.703841,5.55032,2.066769


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter to the method is $\textit{cum_energy} \in [0, 100]$, which decides the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components required such that the percentage of the spectral energy retained is above the given parameter. 
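
The component-selection rule can be sketched as follows. This is an illustration under stated assumptions, not the code inside fill_tensor: we assume spectral energy means squared singular values, that the threshold is given as a percentage, and that the regression uses time periods as rows and donor units as columns. The names `components_for_energy` and `pcr_weights` are hypothetical.

```python
import numpy as np

def components_for_energy(singular_values, energy_pct):
    """Smallest k whose top-k singular values hold at least energy_pct percent of the energy."""
    energy = np.asarray(singular_values, dtype=float) ** 2  # assumption: energy = squared singular values
    cumulative = 100.0 * np.cumsum(energy) / energy.sum()
    # First index at which the cumulative share reaches the target, converted to a count.
    k = int(np.searchsorted(cumulative, energy_pct, side="left")) + 1
    return min(k, len(energy))

def pcr_weights(X, y, k):
    """Regress y on X using only the top-k singular directions of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Pseudo-inverse of X restricted to its top-k components, applied to y.
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

rng = np.random.default_rng(0)
donors = rng.normal(size=(80, 2)) @ rng.normal(size=(2, 10))  # rank-2 signal: 80 periods x 10 donor units
noisy_donors = donors + 0.1 * rng.normal(size=donors.shape)
target = donors @ rng.normal(size=10)                         # target unit, pre-intervention

s = np.linalg.svd(noisy_donors, compute_uv=False)
k = components_for_energy(s, energy_pct=90)
weights = pcr_weights(noisy_donors, target, k)
print(k, weights.shape)
```

A higher threshold keeps more components, which fits the pre-intervention data more closely but also passes more noise into the learned weights.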

The output of $\textbf{fill_tensor}$ is an order-3 tensor (flattened into a dataframe), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. This contains the counterfactual estimates for every unit $n \in [N]$ and for each intervention $i \in [I]$, over the entire post-intervention period of $T - T_0$ time steps. Each row holds the estimated trajectory for one (unit, intervention) pair. 

This dataframe is the desired counterfactual output.
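
If an explicit $N \times (T - T_0) \times I$ array is more convenient than the flattened dataframe, it can be recovered with a reshape. The sketch below uses a toy stand-in for df_output with the same column layout; the values are hypothetical, and it assumes every (unit, intervention) pair appears exactly once.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_output: one row per (unit, intervention) pair (hypothetical values).
units = ["id_0", "id_1"]
interventions = ["inter_0", "inter_1", "inter_2"]
time_cols = ["t_4", "t_5"]
rows = [(u, i) for u in units for i in interventions]

values = np.arange(len(rows) * len(time_cols), dtype=float).reshape(len(rows), -1)
toy_output = pd.DataFrame(values, columns=time_cols)
toy_output.insert(0, "intervention", [i for _, i in rows])
toy_output.insert(0, "unit", [u for u, _ in rows])

# Sort so rows are grouped by unit, then by intervention, before reshaping.
ordered = toy_output.sort_values(["unit", "intervention"])
flat = ordered[time_cols].to_numpy()                        # (N * I, T - T0)
tensor = flat.reshape(len(units), len(interventions), -1)   # (N, I, T - T0)
tensor = tensor.transpose(0, 2, 1)                          # (N, T - T0, I)

print(tensor.shape)     # (2, 2, 3)
# Estimated trajectory of id_1 under inter_2:
print(tensor[1, :, 2])  # [10. 11.]
```

Note that sorting string labels is lexicographic (for example, id_10 sorts before id_2), so the position of a unit along the first axis follows the sorted label order.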

In [6]:
df_output = fill_tensor(pre_df, post_df, cum_energy=0.9, full_matrix_denoise=True)
df_output.head(10)

Unnamed: 0,unit,intervention,t_80,t_81,t_82,t_83,t_84,t_85,t_86,t_87,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_0,inter_0,5.173443,4.241125,3.779841,3.828172,2.16984,3.866264,2.745258,1.374177,...,-1.392907,4.776774,3.913475,-2.153551,1.839406,1.247859,1.903299,1.858531,2.834851,1.659575
1,id_0,inter_1,4.932515,4.6306,6.870495,4.240079,1.151542,7.646912,5.092045,3.264524,...,-2.359087,5.834595,7.525852,3.243655,-1.700217,8.70811,0.364339,4.107433,3.34839,1.965604
2,id_0,inter_2,2.118667,1.936523,3.47266,2.348593,0.407097,3.964339,2.675745,1.890612,...,-1.407859,3.49534,3.712617,2.423576,-1.135886,4.92223,-0.181403,1.77754,1.731668,1.025525
3,id_0,inter_3,7.630172,6.399182,9.812575,6.067397,2.50472,10.327683,6.348006,4.186725,...,-3.098673,8.901193,10.285619,2.840421,-1.58587,10.22097,1.768239,6.013476,5.470491,3.090733
4,id_1,inter_0,5.541825,4.528834,3.789341,4.104208,2.289945,3.991606,2.875814,1.439998,...,-1.568882,4.996968,4.017624,-2.618311,2.07988,1.06808,2.219242,1.805383,2.980999,1.65213
5,id_1,inter_1,4.636277,4.160581,6.237412,3.920059,1.049604,7.187942,4.600726,2.945576,...,-2.18374,5.282847,6.85767,3.026479,-1.756829,7.825777,0.360425,3.814714,2.969552,1.67271
6,id_1,inter_2,1.915329,1.733963,3.188863,2.044147,0.353422,3.620266,2.375159,1.71301,...,-1.267051,3.146897,3.437561,2.125434,-1.11594,4.424622,-0.061488,1.825731,1.539763,1.136187
7,id_1,inter_3,9.44436,7.825567,9.98721,7.346319,3.245638,10.990911,6.736895,4.425454,...,-3.633721,9.964046,10.765029,0.843856,-0.258533,9.166578,2.792208,5.808953,6.212111,3.32547
8,id_2,inter_0,4.483845,3.712034,3.934503,3.309157,1.967854,3.731228,2.54398,1.272235,...,-1.012487,4.44424,3.834419,-1.076711,1.316525,1.762815,1.192042,2.081214,2.598269,1.75694
9,id_2,inter_1,6.286241,5.802277,8.64216,5.369982,1.450641,9.745751,6.393681,4.096958,...,-2.989138,7.33184,9.479581,4.122314,-2.248874,10.912326,0.473618,5.210924,4.175492,2.414698


## Section 3 - Diagnostic: Which Interventions Can We Reliably Produce Counterfactuals For?

In this section we show how to use our diagnostic tool method, termed $\textbf{diagnostic}$. 

$\textbf{diagnostic}$ is a function to assess whether the counterfactual estimates produced are reliable. Recall that, in reality, we never observe the true counterfactual outcomes. Hence, we need a test of whether the relationship we learn in the pre-intervention phase continues to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). For each intervention, we report the average $R^2$ value over all units which received that particular intervention.
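
The idea behind this check can be sketched as below. This is not the implementation of diagnostic: it assumes the score is the standard coefficient of determination computed per observed trajectory and then averaged per intervention, and the function names and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def r2_score(observed, predicted):
    """Coefficient of determination for a single trajectory."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def average_r2_by_intervention(observed_df, estimated_df):
    """Average R^2 over the (unit, intervention) pairs that appear in observed_df."""
    keys = ["unit", "intervention"]
    time_cols = [c for c in observed_df.columns if c not in keys]
    # Keep only the estimated rows that match an observed (unit, intervention) pair.
    merged = observed_df.merge(estimated_df, on=keys, suffixes=("_obs", "_est"))
    obs = merged[[c + "_obs" for c in time_cols]].to_numpy()
    est = merged[[c + "_est" for c in time_cols]].to_numpy()
    merged["r2"] = [r2_score(o, e) for o, e in zip(obs, est)]
    return merged.groupby("intervention")["r2"].mean()

toy_observed = pd.DataFrame(
    [["id_0", "inter_0", 1.0, 2.0, 3.0],
     ["id_1", "inter_1", 2.0, 4.0, 6.0]],
    columns=["unit", "intervention", "t_4", "t_5", "t_6"],
)
toy_estimated = pd.DataFrame(
    [["id_0", "inter_0", 1.0, 2.0, 3.0],   # perfect reconstruction
     ["id_0", "inter_1", 0.0, 0.0, 0.0],   # never observed, ignored by the check
     ["id_1", "inter_0", 0.0, 0.0, 0.0],   # never observed, ignored by the check
     ["id_1", "inter_1", 4.0, 4.0, 4.0]],  # predicts the mean only
    columns=["unit", "intervention", "t_4", "t_5", "t_6"],
)

scores = average_r2_by_intervention(toy_observed, toy_estimated)
print(scores["inter_0"], scores["inter_1"])  # 1.0 0.0
```

An $R^2$ close to 1 for an intervention suggests the learned relationship transfers well to the post-intervention period for that intervention; a low value is a warning sign.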

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,0.872532
1,inter_1,0.990901
2,inter_2,0.969848
3,inter_3,0.974364
