# Multi-Action Synthetic Control Example

This Jupyter notebook is designed to be a simple, "user-friendly" tool to demonstrate the Multi-Action Synthetic Control (MA-SC) algorithm. 

The MA-SC algorithm is implemented in the $\textbf{fill_tensor}$ method below. 

In Sections 1 and 2, using artificially generated data, we illustrate how to use the $\textbf{fill_tensor}$ method to generate counterfactuals for $\textit{each unit}$ under $\textit{each intervention}$ of interest (i.e., personalized interventions). 

We have found MA-SC to produce accurate counterfactual estimates across a wide variety of fields, including econometric policy evaluation, web-scale A/B testing, sports, and genetics. We hope you find it useful for your problems of interest too.

In [1]:
from multi_action_synthetic_control import random_rct, diagnostic, fill_tensor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Section 1 - Generating Artificial Data from a Randomized Control Trial

### Explanation of Terms $N, I, T, T_0, r, \sigma$ 

We begin by generating artificial data for the demonstration using the function random_rct. All the data can be captured in a 3-dimensional tensor, $\mathcal{M} \in \mathbb{R}^{N \times T \times I}$.

$N$ denotes the number of units we perform the experiments on. 

$I$ denotes the total number of interventions. Each unit $n \in [N]$ will receive exactly one intervention, $i \in [I]$.

$T$ is the total number of time periods (i.e., total number of measurements) we perform the experiment for. 

$T_0$ is the number of pre-intervention periods. Note $1 < T_0 < T$.

$r$ denotes the "model complexity", i.e., the rank of the tensor $\mathcal{M}$. 

$\sigma$ is the level of noise added to each measurement, i.e., the variance parameter of mean zero Gaussian noise.
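To make these terms concrete, here is a minimal sketch of how a noisy tensor of rank at most $r$ could be generated with NumPy. It is not the implementation of random_rct. The helper name `low_rank_tensor`, the factor distributions, and the use of $\sigma$ as the standard deviation of the noise are all assumptions made for illustration.

```python
import numpy as np

def low_rank_tensor(N, T, I, r, sigma, seed=0):
    """Hypothetical helper: N x T x I tensor whose noiseless part has rank at most r."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(1.0, 2.0, size=(N, r))  # latent factors for units
    V = rng.uniform(1.0, 2.0, size=(T, r))  # latent factors for time steps
    W = rng.uniform(1.0, 2.0, size=(I, r))  # latent factors for interventions
    # signal[n, t, i] = sum_k U[n, k] * V[t, k] * W[i, k]
    signal = np.einsum("nk,tk,ik->nti", U, V, W)
    # Assumption: sigma is used as the standard deviation of mean-zero Gaussian noise.
    noise = rng.normal(0.0, sigma, size=signal.shape)
    return signal, signal + noise

signal, M = low_rank_tensor(N=10, T=50, I=4, r=3, sigma=0.2)
print(M.shape)
# Flattening the (time, intervention) axes gives an N x (T*I) matrix of rank at most r.
print(np.linalg.matrix_rank(signal.reshape(10, -1), tol=1e-8))
```

In this sketch every unit has a value under every intervention. In an experiment we only observe one slice per unit, which is what makes the counterfactual estimation problem non-trivial.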

In [2]:
# Generate Artificial Data

# Number of Units
N = 10
# Number of Interventions
I = 4
# Number of Total Time Steps (Pre- and Post-Intervention)
T = 1000
# Number of Pre-Intervention Time Steps
T0 = 800
# Model Complexity
rank = 3
# Noise in System
sigma = 0.2

rct_data = random_rct(N, I, T, T0, rank, sigma)

### Pre-Intervention & Post-Intervention Data (pre_df, post_df)

The rct_data object returned by calling the function $\textbf{random_rct}$ consists of two dataframes: pre_df and post_df.

pre_df is a 2-dimensional matrix, $\mathcal{M}^{\text{pre}} \in \mathbb{R}^{N \times T_0}$. It holds the measurements of all units before any intervention is applied.

post_df is a 2-dimensional matrix, $\mathcal{M}^{\text{post}} \in \mathbb{R}^{N \times (T-T_0)}$. It holds the measurements of each unit $n \in [N]$ under the intervention it actually received (i.e., what was observed in reality) in the post-intervention phase. 

(Note that not every unit in pre_df needs to have experienced an intervention. Further, a unit can experience multiple interventions. The function $\textbf{fill_tensor}$ (the MA-SC algorithm) works as is in both cases. For simplicity, we illustrate on artificial data the case where each unit $n \in [N]$ receives exactly one intervention in the post-intervention phase.)

In [3]:
# Pre- and Post- Intervention Data
pre_df, post_df = rct_data

In [4]:
print(pre_df.shape)
pre_df.head(10)

(10, 802)


Unnamed: 0,unit,intervention,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,...,t_790,t_791,t_792,t_793,t_794,t_795,t_796,t_797,t_798,t_799
0,id_0,inter_0,18.390939,17.414586,24.420175,15.388604,14.45951,21.067868,22.315071,17.318389,...,33.895646,9.10079,23.176535,19.09345,22.415192,27.262924,16.883022,13.939976,13.353294,18.088637
1,id_1,inter_0,28.256047,25.012658,36.541766,22.97128,26.667925,31.047622,30.72387,32.795573,...,49.787876,10.697397,43.067694,27.137417,32.678638,46.954542,29.396566,26.545232,21.161961,29.842384
2,id_2,inter_0,17.993483,17.737655,22.265801,15.033922,17.213898,20.659733,19.062033,20.722456,...,32.224703,7.284688,26.345388,17.905705,20.116081,29.461305,18.498285,16.802239,13.689951,18.982347
3,id_3,inter_0,25.017628,18.903139,30.657185,19.633607,26.685984,25.681443,23.019978,33.35495,...,41.163663,6.868668,42.946324,21.654007,26.285522,45.058046,28.000411,26.792415,19.150112,26.604546
4,id_4,inter_0,8.291309,13.685864,8.766852,8.575155,10.349891,14.269462,9.888843,9.172132,...,17.99761,5.567872,11.083694,10.873708,9.539388,13.73272,9.882074,8.52375,8.80837,10.63418
5,id_5,inter_0,17.119842,9.066417,25.41407,12.036468,9.709901,13.962131,20.059667,16.043932,...,27.368875,6.161487,22.371121,14.692841,21.64395,24.862469,14.096725,11.364872,8.65321,14.416047
6,id_6,inter_0,15.419657,9.724543,20.372721,11.028003,12.38259,13.927283,15.972063,16.68626,...,24.728995,4.990772,22.431765,13.201925,17.544281,24.316311,14.060787,12.637352,9.673173,14.189088
7,id_7,inter_0,18.818233,11.556413,26.303003,13.855517,13.073317,16.793396,21.634776,19.238861,...,31.014289,6.678877,25.631826,17.007314,22.980013,28.742373,16.693463,14.23703,11.265951,17.170466
8,id_8,inter_0,12.074104,8.677061,14.926348,8.873785,14.191587,11.834667,10.080135,18.086355,...,18.9751,1.966857,22.620699,9.881771,12.375903,23.201121,14.296126,14.069123,9.446654,13.216025
9,id_9,inter_0,19.044013,17.791279,24.604756,16.31586,18.167521,21.483044,20.892989,22.299758,...,34.390715,7.624373,28.678637,19.084,22.162271,31.764541,19.986981,17.996587,14.701053,20.246519


In [5]:
print(post_df.shape)
post_df.head(10)

(10, 202)


Unnamed: 0,unit,intervention,t_800,t_801,t_802,t_803,t_804,t_805,t_806,t_807,...,t_990,t_991,t_992,t_993,t_994,t_995,t_996,t_997,t_998,t_999
0,id_0,inter_0,21.233767,28.812589,36.155845,19.564092,23.059475,15.856408,14.064438,27.106618,...,15.229554,19.834934,11.964136,11.236575,25.334552,31.77062,16.951546,21.677281,9.227889,15.34543
1,id_1,inter_1,53.952439,49.837341,71.943607,38.024795,47.725998,43.891439,40.602492,59.814336,...,35.223562,38.177012,36.497245,15.897954,48.871962,58.502756,45.441634,36.935761,27.128321,37.775427
2,id_2,inter_0,23.02946,25.068776,33.571355,18.62299,22.770537,17.794287,16.942617,27.548751,...,15.260647,17.24686,14.114224,9.506482,23.826584,28.288949,20.161776,20.261381,10.01889,15.50602
3,id_3,inter_2,48.448953,36.570051,52.813223,34.964266,39.881909,38.464133,40.414327,51.851636,...,27.063279,17.264534,30.31686,17.042306,40.273745,40.996757,49.879132,37.893521,21.19523,26.351575
4,id_4,inter_0,8.184934,15.715085,16.386386,11.100813,11.343167,4.588172,4.407527,12.815279,...,5.56659,6.594064,2.048207,9.192999,13.381673,16.198752,8.721694,15.430387,1.376282,3.994292
5,id_5,inter_3,10.770955,15.320794,18.789228,9.999711,12.144763,7.933284,6.934366,14.015963,...,8.525386,10.732989,6.742269,5.649314,13.192191,17.253319,8.358501,11.210366,4.904787,8.306424
6,id_6,inter_0,20.991803,18.972327,28.046491,14.052042,18.387821,17.611311,15.608552,22.91154,...,14.036718,16.171084,14.881563,4.934138,18.483148,22.544206,16.952107,12.844032,11.217784,15.619713
7,id_7,inter_2,22.936869,26.860093,33.858548,20.218256,23.700007,17.042199,16.828633,28.469875,...,15.302107,15.020748,12.912726,12.59906,25.604716,29.061027,21.751033,24.080848,9.268888,13.921364
8,id_8,inter_1,30.122046,18.17491,31.902656,17.000735,21.990453,25.867783,24.533301,29.183424,...,18.028917,16.174624,22.311655,3.735082,20.869257,23.385553,26.398742,13.399495,16.159757,20.427229
9,id_9,inter_2,33.257032,37.302665,46.856102,29.081954,32.467982,24.25351,24.836818,40.350184,...,21.19149,19.311848,18.37993,17.756973,35.710779,39.892769,32.577573,34.459222,13.095384,18.979156
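
Both dataframes above are in wide format: two label columns (unit, intervention) followed by one column per time step, which is why the printed shapes are (10, 802) and (10, 202) rather than (10, 800) and (10, 200). The sketch below shows one way to separate the labels from the numeric block and to list which units received each intervention. The toy values and the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for post_df (made-up values): 3 units, 2 post-intervention time steps.
toy_post = pd.DataFrame({
    "unit": ["id_0", "id_1", "id_2"],
    "intervention": ["inter_0", "inter_1", "inter_0"],
    "t_2": [1.0, 2.0, 3.0],
    "t_3": [1.5, 2.5, 3.5],
})

label_cols = ["unit", "intervention"]
time_cols = [c for c in toy_post.columns if c not in label_cols]

# Numeric block of shape (number of units, number of time steps).
values = toy_post[time_cols].to_numpy()

# Units observed under each intervention.
units_per_intervention = (
    toy_post.groupby("intervention")["unit"].apply(list).to_dict()
)

print(toy_post.shape)   # 2 label columns + 2 time columns
print(values.shape)
print(units_per_intervention)
```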


## Section 2 - Producing Counterfactual Estimates: For Each Unit Under Each Intervention

In this section, we show how to use the $\textbf{fill_tensor}$ method to produce counterfactual estimates for each unit under each intervention (i.e., personalized interventions). 

The inputs to $\textbf{fill_tensor}$ are the pre- and post-intervention dataframes. 

The key parameter of the method is $\textit{cumulative_energy} \in [0, 100]$ (passed as cum_energy in the call below), which determines the number of principal components to retain when performing Principal Component Regression to learn the linear coefficients. In essence, we find the minimum number of principal components such that the percentage of spectral energy retained is above the given value. 
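
The selection rule described above can be sketched in a few lines of NumPy. This is an illustration and not the code inside fill_tensor: the helper name `components_for_energy` is hypothetical, spectral energy is taken to mean the squared singular values, and the threshold is read on the $[0, 100]$ percentage scale stated above.

```python
import numpy as np

def components_for_energy(X, cumulative_energy):
    """Smallest k such that the top-k squared singular values of X
    hold at least `cumulative_energy` percent of the total."""
    s = np.linalg.svd(X, compute_uv=False)           # singular values, descending
    energy = 100.0 * np.cumsum(s**2) / np.sum(s**2)  # cumulative percentage
    k = int(np.searchsorted(energy, cumulative_energy)) + 1
    return min(k, len(s))                            # guard against round-off at 100

# Squared singular values 9, 4, 1 give cumulative energies of about 64.3, 92.9, 100.
X = np.diag([3.0, 2.0, 1.0])
print(components_for_energy(X, 60))
print(components_for_energy(X, 80))
print(components_for_energy(X, 95))
```

A higher threshold keeps more components and fits the pre-intervention data more closely, at the cost of also fitting more of the noise.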

The output of $\textbf{fill_tensor}$ is an order-3 tensor (flattened), $\hat{\mathcal{M}}^{\text{Counterfactual}}\in \mathbb{R}^{N \times (T - T_0) \times I}$, termed $\textit{df_output}$. It contains the counterfactual estimates for every unit $n \in [N]$ under each intervention $i \in [I]$ over the entire post-intervention period of $T - T_0$ time steps. 

This is the desired output!
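
To give some intuition for how a single entry of this output could be produced, here is a minimal, noiseless sketch of Principal Component Regression for one target unit and one intervention. It assumes the usual synthetic-control recipe: learn linear weights on "donor" units from pre-intervention data, then apply those weights to the donors' post-intervention measurements. The helper `pcr_weights`, the notion of donors, and the toy data are assumptions for illustration and are not taken from fill_tensor.

```python
import numpy as np

def pcr_weights(donor_pre, target_pre, k):
    """Regress target_pre (T0,) on donor_pre (T0, D) using only the
    top-k singular components of donor_pre."""
    U, s, Vt = np.linalg.svd(donor_pre, full_matrices=False)
    # Pseudo-inverse of the rank-k approximation, applied to the target.
    return Vt[:k].T @ ((U[:, :k].T @ target_pre) / s[:k])

rng = np.random.default_rng(0)
T0, T1, D = 40, 10, 5

# Hypothetical low-rank setup: every unit is a mix of 2 latent time factors.
time_pre = rng.normal(size=(T0, 2))    # factors in the pre-intervention phase
time_post = rng.normal(size=(T1, 2))   # factors under the intervention of interest
donor_load = rng.normal(size=(2, D))
target_load = rng.normal(size=2)

donor_pre = time_pre @ donor_load
target_pre = time_pre @ target_load
donor_post = time_post @ donor_load    # donors are observed under the intervention
truth = time_post @ target_load        # target's counterfactual (unobserved in practice)

w = pcr_weights(donor_pre, target_pre, k=2)
estimate = donor_post @ w
print(np.max(np.abs(estimate - truth)))  # tiny here because the toy data is noiseless
```

With noisy data the match is no longer exact, and the number of retained components k controls how much of the noise is filtered out before the regression.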

In [6]:
df_output = fill_tensor(pre_df, post_df, cum_energy=0.80)
df_output.head(10)

Unnamed: 0,unit,intervention,t_800,t_801,t_802,t_803,t_804,t_805,t_806,t_807,...,t_990,t_991,t_992,t_993,t_994,t_995,t_996,t_997,t_998,t_999
0,id_0,inter_0,22.754078,26.974494,35.130936,19.269421,23.176822,17.449626,15.936407,27.775517,...,15.549893,18.643768,13.581698,10.312616,24.749053,30.205726,19.23437,21.048477,10.067138,15.821658
1,id_0,inter_1,36.360366,31.314348,46.538859,24.630212,31.067721,29.900906,27.802551,39.319022,...,23.352874,24.518213,25.052454,9.503375,31.425269,37.242037,30.866846,23.191302,18.521466,25.301964
2,id_0,inter_2,31.481859,29.378234,39.32967,24.985041,28.455325,24.125679,24.916577,35.929573,...,18.893511,14.890257,18.689918,13.813006,29.935384,32.155786,31.51387,28.400572,13.191165,17.709571
3,id_0,inter_3,12.175769,17.319026,21.239834,11.303935,13.728758,8.967991,7.838789,15.844011,...,9.637319,12.132851,7.621637,6.386132,14.912798,19.503603,9.448668,12.672491,5.5445,9.389799
4,id_1,inter_0,35.101014,41.611533,54.193867,29.725494,35.753149,26.918233,24.583903,42.847212,...,23.987657,28.760346,20.951469,15.908501,38.178512,46.596114,29.671424,32.46991,15.529821,24.406888
5,id_1,inter_1,56.728929,48.856202,72.609268,38.427709,48.471418,46.65097,43.377146,61.344982,...,36.434824,38.252968,39.086484,14.827031,49.02926,58.104499,48.158017,36.182741,28.896929,39.475767
6,id_1,inter_2,49.245525,45.954928,61.52147,39.082873,44.511266,37.738615,38.975778,56.202866,...,29.554192,23.292097,29.235721,21.607007,46.826451,50.299717,49.295598,44.425619,20.634291,27.702212
7,id_1,inter_3,18.887863,26.866425,32.948643,17.535415,21.296963,13.911745,12.16005,24.578283,...,14.95005,18.821286,11.82319,9.906593,23.133724,30.255287,14.657403,19.658411,8.600997,14.56608
8,id_2,inter_0,22.16344,26.274305,34.219027,18.769236,22.575211,16.996678,15.522739,27.054535,...,15.146257,18.159823,13.229151,10.044926,24.106631,29.421663,18.735095,20.502112,9.805821,15.410968
9,id_2,inter_1,35.717545,30.760736,45.71609,24.194771,30.518469,29.372283,27.311025,38.623894,...,22.940015,24.084751,24.609547,9.335363,30.869696,36.583628,30.321146,22.781299,18.194022,24.854646


## Section 3 - Diagnostic: Which Interventions Can We Reliably Produce Counterfactuals For?

In this section we show how to use our diagnostic tool, $\textbf{diagnostic}$. 

$\textbf{diagnostic}$ is a function to assess whether the counterfactual estimates produced are reliable. Recall that, in reality, we do not observe the true counterfactuals. Hence, we need a test of whether any relationship we learn in the pre-intervention phase will continue to hold reliably in the post-intervention phase. 

In essence, $\textbf{diagnostic}$ checks whether, for the (unit, intervention) pairs $\textit{we do observe}$ (i.e., the (unit, intervention) pairs in $\textit{post_df}$), we can reliably reconstruct those trajectories using $\textit{only pre-intervention data}$ (i.e., only data from $\textit{pre_df}$). For each intervention, we report the average $R^2$ value over all units that received that intervention.
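
As a rough illustration of the score being reported, the sketch below computes an $R^2$ value between an observed trajectory and its estimate. The exact formula used inside diagnostic is not shown in this notebook, so the standard coefficient of determination used here, and the helper name `r2_score`, are assumptions.

```python
import numpy as np

def r2_score(observed, estimated):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ss_res = np.sum((observed - estimated) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

observed = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score(observed, observed)              # exact reconstruction
baseline = r2_score(observed, np.full(4, 2.5))      # always predicting the mean
print(perfect, baseline)
```

Under this definition a value close to 1 means the observed trajectories are reconstructed well, while a value near 0 means the estimate is no better than predicting the mean of the trajectory.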

In [7]:
R2_all_interventions = diagnostic(post_df, df_output)
R2_all_interventions

Unnamed: 0,intervention,Average R^2 Value
0,inter_0,0.852785
1,inter_1,0.875171
2,inter_2,0.742668
3,inter_3,1.0
