In [66]:
import pandas as pd
import altair as alt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Data generation
## Why synthetic
To assess whether the models we use to study interactions actually work, we are not going to use a real dataset. Instead, we will create a synthetic one. This puts us in control of the underlying parameters, so we can check whether the models recover them correctly.

## Data shape
We will create a dataset containing 2000 observations. We will pretend that two experiments ran on this dataset, each splitting the universe (the 2000 observations) into two equal halves, one treatment and one control. Operationally, we create four groups of 500 observations each: TT, TC, CT and CC, where the first letter is the assignment in experiment 1 and the second letter the assignment in experiment 2 (T for treatment, C for control). We will set `E(TT) - E(CC) < E(TC) + E(CT) - 2E(CC)`, so that the combined lift is smaller than the sum of the two individual lifts, i.e. a negative interaction (cannibalisation). In practice, each group is a random sample from a normal distribution with its own mean (i.e. location parameter); all groups share the same standard deviation (i.e. scale).
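As a quick sanity check on that inequality, the arithmetic can be done directly on the group means used in the next cell (10, 15, 20 and 22). This is only a sketch; the names `means`, `joint_lift`, `additive_lift` and `interaction` are illustrative and not part of the notebook.

```python
# Population means of the four groups (first letter: experiment 1,
# second letter: experiment 2; T = treatment, C = control).
means = {"CC": 10, "CT": 15, "TC": 20, "TT": 22}

# Lift of receiving both treatments versus receiving neither.
joint_lift = means["TT"] - means["CC"]

# Lift we would expect if the two effects simply added up.
additive_lift = (means["TC"] - means["CC"]) + (means["CT"] - means["CC"])

# A negative value means the combined effect falls short of the sum.
interaction = joint_lift - additive_lift

print(joint_lift, additive_lift, interaction)  # 12 15 -3
```

The joint lift (12) is smaller than the additive lift (15), so the built-in interaction is -3.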

In [67]:
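n = 500  # observations per group
# Group labels: first letter = experiment 1, second letter = experiment 2.
CC = np.random.normal(10, 1, n)  # control in both experiments
CT = np.random.normal(15, 1, n)  # treated in experiment 2 only
TC = np.random.normal(20, 1, n)  # treated in experiment 1 only
TT = np.random.normal(22, 1, n)  # treated in both experiments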

The numbers I will be looking for are +5, +10 and +12: the differences between the CT, TC and TT groups, respectively, and the CC group.
My expectation is to find a linear equation that looks something like this, with `x1` and `x2` the treatment indicators for experiments 1 and 2:  
```y = 10 + 10x1 + 5x2 - 3x1x2```.  
Let's see how close we can get.
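With two binary indicators and an interaction term, the model has one coefficient per group, so the least-squares coefficients should equal simple differences of the group means. The sketch below illustrates this with NumPy alone; the seed, the lowercase variable names and the hand-built design matrix `X` are assumptions made for this example and are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 500
cc = rng.normal(10, 1, n)
ct = rng.normal(15, 1, n)
tc = rng.normal(20, 1, n)
tt = rng.normal(22, 1, n)

# Coefficients implied by the sample group means.
b0 = cc.mean()                                      # intercept
b1 = tc.mean() - cc.mean()                          # experiment 1 effect
b2 = ct.mean() - cc.mean()                          # experiment 2 effect
b3 = tt.mean() - tc.mean() - ct.mean() + cc.mean()  # interaction
from_means = np.array([b0, b1, b2, b3])

# The same coefficients from least squares on [1, x1, x2, x1*x2].
# Rows are ordered CC, CT, TC, TT to match the concatenation below.
x1 = np.repeat([0.0, 0.0, 1.0, 1.0], n)
x2 = np.repeat([0.0, 1.0, 0.0, 1.0], n)
X = np.column_stack([np.ones(4 * n), x1, x2, x1 * x2])
y = np.concatenate([cc, ct, tc, tt])
from_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(from_means, 3))
print(np.round(from_ols, 3))
```

Both printed vectors should agree up to floating-point error and sit close to 10, 10, 5 and -3.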

In [72]:
# Rows are stacked in the order CC, CT, TC, TT.
df = pd.DataFrame(
    {
        "exp1_is_treatment": [False] * 2 * n + [True] * 2 * n,
        "exp2_is_treatment": [False] * n + [True] * n + [False] * n + [True] * n,
        "out": np.concatenate([CC, CT, TC, TT], axis=0),
    }
)
df.dtypes

exp1_is_treatment       bool
exp2_is_treatment       bool
out                  float64
dtype: object

In [73]:
df.head(3)

| | exp1_is_treatment | exp2_is_treatment | out |
|---|---|---|---|
| 0 | False | False | 9.407312 |
| 1 | False | False | 8.643365 |
| 2 | False | False | 11.357186 |


In [74]:
model_1 = smf.ols(formula="out ~ exp1_is_treatment * exp2_is_treatment", data=df).fit()
model_1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | out | R-squared: | 0.955 |
| Model: | OLS | Adj. R-squared: | 0.954 |
| Method: | Least Squares | F-statistic: | 13960.0 |
| Date: | Mon, 01 Apr 2024 | Prob (F-statistic): | 0.0 |
| Time: | 20:12:45 | Log-Likelihood: | -2870.3 |
| No. Observations: | 2000 | AIC: | 5749.0 |
| Df Residuals: | 1996 | BIC: | 5771.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 9.9988 | 0.045 | 219.768 | 0.000 | 9.910 | 10.088 |
| exp1_is_treatment[T.True] | 9.9896 | 0.064 | 155.257 | 0.000 | 9.863 | 10.116 |
| exp2_is_treatment[T.True] | 5.0242 | 0.064 | 78.085 | 0.000 | 4.898 | 5.150 |
| exp1_is_treatment[T.True]:exp2_is_treatment[T.True] | -3.0042 | 0.091 | -33.016 | 0.000 | -3.183 | -2.826 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.783 | Durbin-Watson: | 1.937 |
| Prob(Omnibus): | 0.41 | Jarque-Bera (JB): | 1.789 |
| Skew: | -0.073 | Prob(JB): | 0.409 |
| Kurtosis: | 2.985 | Cond. No. | 6.85 |


# Outcome  
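Success. The estimated coefficients (9.9988, 9.9896, 5.0242 and -3.0042) are very close to the values we built in (10, 10, 5 and -3), and each true value lies inside its 95% confidence interval, which validates our intuition. All coefficients are statistically significant, but this is unsurprising: we set a very low standard deviation (1) relative to the effect sizes, and each group has 500 observations.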
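To see how much the significance depends on the noise level, here is a rough back-of-the-envelope sketch. It assumes four independent groups of equal size `n` with a common standard deviation `sigma`, in which case the interaction estimate (a sum and difference of four group means) has standard error `2 * sigma / sqrt(n)`. The helper `interaction_se` and the grid of `sigma` values are made up for this illustration.

```python
import math


def interaction_se(sigma, n):
    """Standard error of mean(TT) - mean(TC) - mean(CT) + mean(CC),
    assuming four independent groups of size n with common sd sigma."""
    return 2 * sigma / math.sqrt(n)


n = 500
effect = -3.0  # interaction built into the synthetic data

for sigma in (1, 5, 10, 20):
    se = interaction_se(sigma, n)
    # Print sigma, the standard error, and the approximate t-statistic.
    print(sigma, round(se, 3), round(effect / se, 2))
```

For `sigma = 1` this gives a standard error of roughly 0.09, in line with the 0.091 reported in the summary table. Under the same assumptions, the noise would need to be well over ten times larger before the interaction stopped clearing the usual 5% threshold.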

# Pitfalls
In earlier attempts to get here I wrote the interaction formula from memory, i.e. `out ~ exp1_is_treatment + exp2_is_treatment + exp1_is_treatment*exp2_is_treatment`, and got results I did not expect. The point to remember is that the two operators mean different things: `:` adds only the interaction term between two variables, whereas `*` is shorthand for variable 1, variable 2 and their interaction, so `a * b` already expands to `a + b + a:b`. See [here](https://www.econometrics.blog/post/the-r-formula-cheatsheet/) for more info.
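The relationship between the two operators can be checked by fitting both spellings on the same data. The sketch below uses a small stand-in dataset with hypothetical column names `a`, `b` and `y` (not the notebook's `df`) and an arbitrary seed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 500

# Stand-in data with the same structure as the notebook's groups,
# stacked in the order CC, CT, TC, TT.
demo = pd.DataFrame(
    {
        "a": np.repeat([False, False, True, True], n),
        "b": np.repeat([False, True, False, True], n),
        "y": np.concatenate([rng.normal(m, 1, n) for m in (10, 15, 20, 22)]),
    }
)

# `a * b` is shorthand for both main effects plus the interaction ...
star = smf.ols("y ~ a * b", data=demo).fit()
# ... which is the same model as spelling the terms out with `:`.
explicit = smf.ols("y ~ a + b + a:b", data=demo).fit()

print(star.params)
print(explicit.params)
```

Both fits should report the same four coefficients: intercept, two main effects and one interaction.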