# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your own problems of interest.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Control Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin by generating artificial data for the demonstration with the function random_rct. All the data can be captured in a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
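To make the roles of $N$, $T$, $I$, $r$ and $\sigma$ concrete, here is a minimal NumPy sketch of a noisy rank-$r$ tensor. It is not the implementation of random_rct: the helper name `low_rank_tensor`, the Gaussian factors, and the choice to pass $\sigma$ as the standard deviation of the noise are all assumptions made for illustration.

```python
import numpy as np

def low_rank_tensor(N, T, I, r, sigma, seed=0):
    """Hypothetical helper: noisy rank-r tensor of shape (N, T, I)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(N, r))  # latent factors for units
    V = rng.normal(size=(T, r))  # latent factors for time periods
    W = rng.normal(size=(I, r))  # latent factors for interventions
    # Sum of r rank-one terms: clean[n, t, i] = sum_k U[n, k] * V[t, k] * W[i, k]
    clean = np.einsum("nk,tk,ik->nti", U, V, W)
    # Assumption: sigma is used here as the standard deviation of the noise.
    noisy = clean + rng.normal(scale=sigma, size=clean.shape)
    return clean, noisy

clean, noisy = low_rank_tensor(N=100, T=100, I=3, r=5, sigma=0.1)
# Unfolding the noiseless tensor along the unit axis gives a matrix of rank at most r.
unfolded_rank = np.linalg.matrix_rank(clean.reshape(100, -1), tol=1e-8)
print(noisy.shape, unfolded_rank)
```

The low rank of the noiseless tensor is what lets measurements of some units inform estimates for others.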

In [2]:
# Generate Artificial Data

# Number of Units
N = 100
# Number of Interventions
I = 3
# Number of Metrics
M = 2
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 100
# Number of Pre-Intervention Time Steps
T0 = 80
# Model Complexity
rank = 5
# Noise in System
sigma = 0.1

rct_data = random_rct(N, I, M, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by the function $\textbf{random_rct}$ consists of two dataframes: pre_df and post_df.

pre_df is a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It holds the measurements of all units before any intervention is applied.

post_df is a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It holds the measurements of each unit $n \in [N]$ under the intervention it actually received (i.e., what was observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df has to have experienced an intervention. Further, a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, we illustrate on artificial data the case where each unit $n \in [N]$ in the pre-intervention phase receives exactly one intervention in the post-intervention phase.)
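The shapes printed for these dataframes, (200, 83) and (200, 23), are consistent with a layout of one row per (unit, metric) pair, with three identifier columns (unit, intervention, metric) followed by one column per time step: $N \cdot M = 200$ rows, and $T_0 + 3 = 83$ or $(T - T_0) + 3 = 23$ columns. The sketch below builds a small stand-in with that layout; `toy_pre_df` is a hypothetical helper, not part of the package.

```python
import numpy as np
import pandas as pd

def toy_pre_df(N=4, M=2, T0=5, seed=0):
    """Hypothetical stand-in for pre_df: one row per (unit, metric) pair."""
    rng = np.random.default_rng(seed)
    rows = []
    for n in range(N):
        for m in range(M):
            rows.append([f"id_{n:02d}", "inter_0", f"m_{m}", *rng.normal(size=T0)])
    time_cols = [f"t_{t:02d}" for t in range(T0)]
    return pd.DataFrame(rows, columns=["unit", "intervention", "metric", *time_cols])

toy = toy_pre_df()
print(toy.shape)  # (N * M, 3 + T0)

# Selecting a single metric recovers an N x T0 matrix of measurements.
time_cols = [c for c in toy.columns if c.startswith("t_")]
wide = toy.loc[toy["metric"] == "m_0"].set_index("unit")[time_cols]
print(wide.shape)  # (N, T0)
```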

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(200, 83)


Unnamed: 0,unit,intervention,metric,t_00,t_01,t_02,t_03,t_04,t_05,t_06,...,t_70,t_71,t_72,t_73,t_74,t_75,t_76,t_77,t_78,t_79
0,id_00,inter_0,m_0,6.388753,6.820474,9.123697,2.498073,3.734856,10.246023,7.513735,...,1.722871,3.693135,8.466526,0.122262,6.576052,3.409777,3.434377,8.202383,-0.909462,0.944842
1,id_00,inter_0,m_1,15.289484,16.265434,10.681133,6.965092,5.347154,11.023223,15.584687,...,8.280221,12.717758,9.059146,-2.68379,7.490578,4.431099,13.718972,16.230941,9.26226,17.403825
2,id_01,inter_0,m_0,-0.480536,-1.820704,-0.297821,-0.320197,1.55866,-1.165727,-1.363582,...,-0.577148,-1.640323,-1.32937,2.651451,-0.027416,0.917458,-2.008707,-3.189735,0.145435,-1.816298
3,id_01,inter_0,m_1,-2.95421,-11.24179,-2.930215,-2.072608,3.03752,0.07191,-5.790998,...,-5.209299,-10.300024,-3.672517,8.920791,0.440263,4.082741,-10.304088,-6.968497,-7.263908,-16.072075
4,id_02,inter_0,m_0,7.295926,5.926615,11.34244,2.497655,5.838766,12.219371,7.889643,...,0.968065,2.704989,9.497028,3.342419,8.354507,5.658013,2.06035,7.693139,-2.404185,-1.906037
5,id_02,inter_0,m_1,1.217337,0.950111,6.522115,-0.004258,3.577017,6.445765,1.445073,...,-1.220743,-1.322868,5.167149,3.762687,4.57404,2.507089,-2.335096,0.474711,-4.150515,-6.042695
6,id_03,inter_0,m_0,9.552661,9.327569,13.787883,3.592331,6.928026,15.006404,10.487259,...,2.408576,4.987813,12.021242,2.374586,9.787977,5.946004,4.268031,10.594172,-1.393737,0.598317
7,id_03,inter_0,m_1,12.663927,16.840026,13.131998,5.295842,6.697065,10.975377,12.988957,...,6.786135,10.998308,10.596059,-0.525551,8.621992,2.827712,11.08774,11.800448,6.494439,12.953824
8,id_04,inter_0,m_0,11.690479,10.13806,15.216143,4.572263,8.941246,15.032345,12.148677,...,3.419616,5.66247,11.432881,4.88027,11.274342,7.893563,5.219409,10.157551,1.316205,3.1022
9,id_04,inter_0,m_1,25.606229,23.965997,18.101795,11.297584,12.54597,16.916345,24.327077,...,12.718937,18.01186,13.055889,1.680777,13.510366,9.919035,19.541723,22.086456,15.832348,25.489364


In [5]:
print(post_df.shape)
post_df.head(10)

(200, 23)


Unnamed: 0,unit,intervention,metric,t_80,t_81,t_82,t_83,t_84,t_85,t_86,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_00,inter_0,m_0,3.917147,2.621306,0.531534,13.895646,3.546773,4.572873,1.387589,...,7.278724,6.590304,-0.671572,4.03602,9.499931,8.311591,7.022132,4.938964,0.635259,3.635917
1,id_00,inter_0,m_1,5.70695,5.675684,4.882291,24.230089,3.496186,5.875853,5.526356,...,19.80209,1.28794,-1.65559,19.229008,11.234779,11.681415,9.031143,18.701325,4.892218,7.196081
2,id_01,inter_2,m_0,-0.531676,-0.597756,0.227066,-2.326779,-1.15377,-1.515091,-0.874168,...,-2.10333,-1.609852,-0.920673,-1.308699,-2.510578,-1.660506,-1.169406,-1.557804,-0.758334,-0.802098
3,id_01,inter_2,m_1,3.680442,2.446566,9.827968,-1.683601,-2.798766,-0.021563,4.319395,...,-0.589587,0.646444,0.121328,-2.130799,0.579524,1.880327,5.59956,-3.866677,-0.476673,4.915651
4,id_02,inter_1,m_0,3.317314,2.733881,-0.304978,11.593464,3.677374,5.445239,2.14683,...,4.798839,7.616093,1.310756,1.893499,9.430019,7.326896,5.826256,2.6054,1.311938,3.287406
5,id_02,inter_1,m_1,0.814734,0.93153,-2.490186,4.767877,3.292527,3.548168,0.980601,...,2.099755,5.837641,1.944801,-0.183144,6.960227,4.195026,2.565609,1.021803,1.019706,1.058259
6,id_03,inter_2,m_0,5.377517,3.140504,0.443323,17.677643,4.539362,5.903992,0.966594,...,6.899288,9.922463,-1.313685,2.007754,12.070264,11.006197,9.154014,3.787189,-0.479916,4.524292
7,id_03,inter_2,m_1,3.328359,1.836671,-2.273875,14.942905,2.122681,3.690156,-1.865685,...,3.929682,4.369393,-2.997017,2.455625,4.299941,4.914434,3.430587,3.549768,-0.292391,0.844417
8,id_04,inter_2,m_0,4.70545,2.470453,0.690546,15.458786,3.769081,4.205576,0.396532,...,6.316472,8.550725,-1.887128,1.869834,10.476361,9.872165,8.401345,3.786272,-0.9843,3.89753
9,id_04,inter_2,m_1,7.502924,5.285291,9.47414,16.790945,1.345381,4.354372,5.343355,...,11.489863,4.906891,-1.792628,7.802125,10.238104,10.983319,12.087964,7.295598,1.06075,8.356827


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter of the method is $\textit{cum_energy} \in [0, 100]$, which determines the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components such that the percentage of spectral energy retained is above the given parameter. (The example call below passes $\textit{rank}=5$ rather than $\textit{cum_energy}$.)
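As an illustration of this selection rule, the sketch below picks the smallest number of components whose cumulative share of spectral energy reaches a threshold. It is not taken from fill_tensor: the helper name `components_for_energy`, the definition of spectral energy as the sum of squared singular values, and the use of "at least" rather than "strictly above" are assumptions.

```python
import numpy as np

def components_for_energy(X, cum_energy):
    """Smallest k whose top-k squared singular values hold >= cum_energy percent."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, largest first
    energy = 100.0 * np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, cum_energy)) + 1
    return min(k, len(s))  # guard against round-off when cum_energy is 100

# Singular values 3, 2, 1 give cumulative energies of about 64.3%, 92.9%, 100%.
X = np.diag([3.0, 2.0, 1.0])
k50 = components_for_energy(X, 50)
k90 = components_for_energy(X, 90)
k100 = components_for_energy(X, 100)
print(k50, k90, k100)
```

A higher threshold keeps more components, which fits the pre-intervention data more closely but also retains more noise.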

The output of $\textbf{fill_tensor}$ is an order-3 tensor (flattened), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. It contains the counterfactual estimates for every unit $n \in [N]$ and for each intervention $i \in [I]$, over the entire post-intervention period, $T - T_0$. 

This dataframe is exactly the desired counterfactual output!
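If an explicit $N \times (T - T_0) \times I$ array is more convenient than the flattened dataframe, one way to rebuild it is sketched below. `to_tensor` is a hypothetical helper, and it assumes the column layout shown in this notebook (unit, intervention, metric, then `t_*` columns) with every (unit, intervention) pair present for the chosen metric.

```python
import numpy as np
import pandas as pd

def to_tensor(df, metric):
    """Hypothetical helper: stack one metric's rows into an (N, T_post, I) array."""
    sub = df[df["metric"] == metric].sort_values(["unit", "intervention"])
    time_cols = [c for c in sub.columns if c.startswith("t_")]
    n_units = sub["unit"].nunique()
    n_inter = sub["intervention"].nunique()
    # Rows are ordered unit-major, so reshape to (N, I, T_post) and then swap axes.
    values = sub[time_cols].to_numpy().reshape(n_units, n_inter, len(time_cols))
    return values.transpose(0, 2, 1)

# Toy flattened output: 2 units x 3 interventions x 1 metric x 4 time steps.
rows = [
    [f"id_{n:02d}", f"inter_{i}", "m_0", *(10 * n + i + 0.1 * t for t in range(4))]
    for n in range(2)
    for i in range(3)
]
toy_output = pd.DataFrame(
    rows, columns=["unit", "intervention", "metric", "t_80", "t_81", "t_82", "t_83"]
)
tensor = to_tensor(toy_output, "m_0")
print(tensor.shape)  # (units, post-intervention steps, interventions)
```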

In [8]:
df_output,_ = fill_tensor(pre_df, post_df, rank=5, full_matrix_denoise=True)
df_output.head(15)

Unnamed: 0,unit,intervention,metric,t_80,t_81,t_82,t_83,t_84,t_85,t_86,...,t_90,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99
0,id_00,inter_0,m_0,-1.137911,-1.912488,0.978209,-4.782258,-1.74234,-4.109608,-1.966957,...,-1.151919,-3.703159,-2.199994,-0.314962,-3.703639,-1.89746,-1.490465,-0.067854,-1.888505,-1.324043
1,id_00,inter_0,m_1,2.014358,1.138906,6.56272,3.13701,-1.196779,-1.963492,2.337636,...,7.2753,-3.748809,-2.271229,7.667529,1.017085,3.257334,3.845605,6.921717,0.347793,3.309223
2,id_00,inter_1,m_0,-0.871667,-1.853334,1.114471,-4.223107,-1.890175,-4.111998,-2.458063,...,-1.764461,-3.374895,-2.790393,-1.406244,-3.966197,-1.726034,-1.183343,-0.837614,-2.533684,-1.321018
3,id_00,inter_1,m_1,2.671894,0.988872,7.659431,1.629048,-2.869046,-2.452306,1.179477,...,2.640318,-2.986667,-3.351218,2.183063,-1.711752,1.611679,3.734811,1.500442,-1.330796,2.9325
4,id_00,inter_2,m_0,-0.693361,0.325776,-0.388392,-1.612587,-0.267529,0.6378,0.999181,...,-0.436714,-1.070154,1.406401,0.687261,-1.182386,-1.870446,-1.519336,-0.10107,1.38778,-0.381975
5,id_00,inter_2,m_1,5.21197,5.921804,14.502656,3.520379,-0.103976,3.806476,11.847124,...,10.89184,2.056287,4.396119,8.535819,9.861426,8.250114,11.037329,5.419964,4.451945,10.495066
6,id_01,inter_0,m_0,0.9563,1.791645,-1.058195,3.927885,1.540301,3.924296,1.956665,...,0.702945,3.352058,2.369927,0.126682,3.206462,1.365462,1.058637,-0.249711,1.945415,1.084937
7,id_01,inter_0,m_1,-1.41127,-0.60973,-4.411374,-4.521467,0.971504,2.116606,-0.820781,...,-7.124602,4.295207,3.127279,-7.994898,-0.007233,-2.522266,-2.454426,-7.673449,-0.159631,-2.074516
8,id_01,inter_1,m_0,0.69733,1.785186,-1.13109,3.555987,1.709685,4.017228,2.478092,...,1.468371,3.005554,2.933906,1.295095,3.50494,1.28012,0.807018,0.668701,2.564611,1.131275
9,id_01,inter_1,m_1,-1.786482,-0.113934,-4.825982,-1.781469,2.657833,2.829967,0.795158,...,-1.29569,3.433829,4.082814,-1.422199,3.20826,-0.26005,-1.744442,-1.230058,1.851139,-1.148783


## Section 3 - Diagnostic: Which Interventions Can We Reliably Produce Counterfactuals For?

In this section, we show how to use our diagnostic tool, termed $\textbf{diagnostic}$. 

$\textbf{diagnostic}$ is a function that assesses whether the counterfactual estimates produced are reliable. Recall that, in reality, we do not get to observe the true counterfactuals. Hence, we need a test to check whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the (unit, intervention) pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). 

For each intervention, we report the average $R^2$ value over all units which received that particular intervention. Note that what is considered a "good enough" $R^2$ will depend greatly on the application itself. For example, if the post-intervention trajectory is very stable, there is little variation around its mean left to explain, so an $R^2$ close to zero should be considered excellent (note we recreate the post-intervention trajectory using only pre-intervention data).
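For reference, the sketch below computes the standard coefficient of determination, $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$, for a single trajectory. Whether $\textbf{diagnostic}$ uses exactly this formula is an assumption; the function name `r2_score` and the toy arrays are illustrative only.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the estimate
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variation around the mean
    return 1.0 - ss_res / ss_tot

observed = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score(observed, observed)            # exact reconstruction
mean_only = r2_score(observed, np.full(4, 2.5))   # always predicting the mean
worse = r2_score(observed, -observed)             # worse than predicting the mean
print(perfect, mean_only, worse)
```

Under this definition, a value of 1 is a perfect reconstruction, 0 matches always predicting the observed mean, and a negative value means the estimate does worse than that baseline.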

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,-11.789631
1,inter_1,-19.851117
2,inter_2,-21.394464
