# ACE estimations from real RCT data
*This notebook examines the use of the `CausalEffectEstimation` module for estimating Average Causal Effects (ACE) in Randomized Controlled Trials (RCTs) within the Neyman-Rubin potential outcome framework, using data from the STAR trial.*

In [2]:
import pyAgrum as gum
import pyAgrum.lib.discretizer as disc
import pyAgrum.lib.notebook as gnb
import pyAgrum.lib.explain as gexpl

import pyAgrum.causal as csl

import numpy as np
import pandas as pd

### Dataset

The data used in this notebook come from the Tennessee Student/Teacher Achievement Ratio (STAR) trial. This randomized controlled trial was designed to assess the effects of smaller class sizes in primary schools (T) on students' academic performance (Y).

The covariates in this study include:

* `gender`
* `age`
* `g1freelunch`, the number of lunches provided to the child per day
* `g1surban`, the location of the school (inner city, suburban, rural, or urban)
* `ethnicity`

In [3]:
# Preprocessing

# Load data - read everything as a string and then cast
star_df = pd.read_csv("../data/STAR_data.csv", sep=",", dtype=str)
star_df = star_df.rename(columns={"race": "ethnicity"})

# Fill na
star_df = star_df.fillna({"g1freelunch": 0, "g1surban": 0})
drop_star_l = ["g1tlistss", "g1treadss", "g1tmathss", "g1classtype",
"birthyear", "birthmonth", "birthday", "gender",
"ethnicity", "g1freelunch", "g1surban"]
star_df = star_df.dropna(subset=drop_star_l, how='any')

# Cast value types before processing
star_df["gender"] = star_df["gender"].astype(int)
star_df["ethnicity"] = star_df["ethnicity"].astype(int)

star_df["g1freelunch"] = star_df["g1freelunch"].astype(int)
star_df["g1surban"] = star_df["g1surban"].astype(int)
star_df["g1classtype"] = star_df["g1classtype"].astype(int)

# Keep only class types 1 and 2 (in the initial trial,
# 3 class types were assigned; the third one was big classes
# with a teaching assistant)
star_df = star_df[~(star_df["g1classtype"] == 3)].reset_index(drop=True)

# Compute the outcome
star_df["Y"] = (star_df["g1tlistss"].astype(int) +
                star_df["g1treadss"].astype(int) +
                star_df["g1tmathss"].astype(int)) / 3

# Compute the treatment: class type 2 is the control arm (T = 0),
# the remaining class type 1 is the treated arm (T = 1)
star_df["T"] = star_df["g1classtype"].apply(lambda x: 0 if x == 2 else 1)

# Transform date to obtain age (Notice: if na --> date is NaT)
star_df["date"] = pd.to_datetime(star_df["birthyear"] + "/"
+ star_df["birthmonth"] + "/"
+ star_df["birthday"], yearfirst=True, errors="coerce")
star_df["age"] = (np.datetime64("1985-01-01") - star_df["date"])
star_df["age"] = star_df["age"].dt.days / 365.25

# Keep only covariates we consider predictive of the outcome
star_covariates_l = ["gender", "ethnicity", "age",
                     "g1freelunch", "g1surban"]
star_df = star_df[["Y", "T"] + star_covariates_l]

# Map numerical to categorical
star_df["gender"] = star_df["gender"].apply(lambda x: "Girl" if x == 2 \
                                            else "Boy").astype("category")
star_df["ethnicity"] = star_df["ethnicity"].map( \
    {1:"White", 2:"Black", 3:"Asian",
     4:"Hispanic",5:"Nat_American", 6:"Other"}).astype("category")
star_df["g1surban"] = star_df["g1surban"].map( \
    {1:"Inner_city", 2:"Suburban",
     3:"Rural", 4:"Urban"}).astype("category")

star_df.describe()

| | Y | T | age | g1freelunch |
|---|---|---|---|---|
| count | 4215.0 | 4215.0 | 4215.0 | 4215.0 |
| mean | 540.095848 | 0.428233 | 4.879872 | 1.471886 |
| std | 39.267221 | 0.494881 | 0.465104 | 0.534171 |
| min | 439.333333 | 0.0 | 3.129363 | 0.0 |
| 25% | 511.333333 | 0.0 | 4.525667 | 1.0 |
| 50% | 537.333333 | 0.0 | 4.818617 | 1.0 |
| 75% | 566.0 | 1.0 | 5.111567 | 2.0 |
| max | 670.666667 | 1.0 | 7.225188 | 2.0 |


The mean of `T` is about 0.43, so roughly 57% of the units are in the control group. The control and treatment groups nevertheless appear to be similar in distribution, which is consistent with the ignorability assumption expected from a randomized design.

We will explore how the `CausalEffectEstimation` module can estimate the causal effect of $T$ on $Y$ in this dataset.
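
One way to probe the balance claim above is to compare a covariate across the two arms with a standardized mean difference (SMD). The following is a minimal sketch on made-up numbers, not on the STAR data; the helper `standardized_mean_difference` and the toy `ages_*` lists are hypothetical names introduced only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Toy ages (in years) for each arm; illustrative values only.
ages_treated = [4.6, 4.9, 5.1, 4.8, 5.0]
ages_control = [4.7, 4.8, 5.2, 4.9, 4.9]

smd = standardized_mean_difference(ages_treated, ages_control)
print(f"SMD(age) = {smd:.3f}")
```

An SMD close to zero for every covariate is what a successful randomization should produce; large values would call the balance claim into question.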

### Structure Learning and Setup

In the absence of a predefined causal structure, structure learning is utilized to uncover the underlying relationships between the variables in the dataset. To facilitate this process, a slice order will be imposed on the variables. This approach will serve as the foundation for deriving the necessary causal structure for subsequent analysis.

To enable the application of structure learning algorithms, the variables will first be discretized using the `discretizer` module. Following this, the causal structure will be derived using `gum.BNLearner`.
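
To make the discretization step concrete, here is a minimal sketch of equal-width binning in plain Python. It assumes that the `'uniform'` method corresponds to equal-width bins, which is an assumption about the `discretizer` module rather than something this notebook confirms; `uniform_bin_edges`, `bin_index`, and the sample `scores` are hypothetical.

```python
def uniform_bin_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value, edges):
    """Index of the bin containing value; the maximum goes to the last bin."""
    n_bins = len(edges) - 1
    width = edges[1] - edges[0]
    if width == 0:  # all values identical: everything falls in bin 0
        return 0
    return min(int((value - edges[0]) / width), n_bins - 1)


scores = [439.3, 511.3, 537.3, 566.0, 670.7]  # illustrative outcome values
edges = uniform_bin_edges(scores, n_bins=4)
bins = [bin_index(s, edges) for s in scores]
print(edges)
print(bins)
```

With equal-width bins, sparsely populated regions of a variable yield bins with few observations, which is one reason the number of bins per variable is chosen by hand in the cell that follows.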

In [4]:
discretizer = disc.Discretizer(defaultDiscretizationMethod='uniform')
discretizer.setDiscretizationParameters("age", 'uniform', 24)
discretizer.setDiscretizationParameters("Y", 'uniform', 30)

template = discretizer.discretizedTemplate(star_df)

learner = gum.BNLearner(star_df, template)
learner.useNMLCorrection()
learner.useSmoothingPrior(1e-6)
learner.setSliceOrder([["T", "ethnicity", "gender", "age"],
                       ["g1surban", "g1freelunch", ], ["Y"]])
bn = learner.learnBN()

print(learner)

gnb.sideBySide(gexpl.getInformation(bn, size="50"),
               gnb.getInference(bn, size="50"))

Filename               : /tmp/tmpatsgnqpj.csv
Size                   : (4215,7)
Variables              : Y[30], T[2], gender[2], ethnicity[6], age[24], g1freelunch[3], g1surban[4]
Induced types          : False
Missing values         : False
Algorithm              : MIIC
Score                  : BDeu  (Not used for constraint-based algorithms)
Correction             : NML  (Not used for score-based algorithms)
Prior                  : Smoothing
Prior weight           : 0.000001
Constraint Slice Order : {ethnicity:0, T:0, g1surban:1, age:0, gender:0, g1freelunch:1, Y:2}



[Figure: learned structure (left) and inference with marginal distributions (right). Arcs: ethnicity -> g1surban, ethnicity -> g1freelunch, g1surban -> g1freelunch, g1freelunch -> Y, T -> Y. The nodes age and gender have no arcs.]


This initial approach appears promising, as the inferred causal relationships are somewhat consistent with what might be expected from a non-expert perspective.

Given the causal structure, we can now instantiate the `CausalEffectEstimation` class to perform the estimation.

In [5]:
causal_model = csl.CausalModel(bn)
cee = csl.CausalEffectEstimation(star_df, causal_model)

### Causal Identification

The next step involves formal causal identification. As expected, we identify the RCT adjustment, consistent with the experimental design.

In [6]:
cee.identifyAdjustmentSet(intervention="T", outcome="Y")

Randomized Controlled Trial adjustment found. 

Supported estimators include:
- CausalModelEstimator
- DM
If the outcome variable is a cause of other covariates in the causal graph,
Backdoor estimators may also be used.


'Randomized Controlled Trial'
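
The intuition behind this result can be sketched in a few lines: in the learned graph, `T` has no incoming arc, so there is no backdoor path from `T` to `Y` and no adjustment is needed. The snippet below is a simplified illustration in plain Python, not the identification procedure that `identifyAdjustmentSet` actually runs; the `parents` helper is hypothetical, and the arc list is copied from the structure learned above.

```python
# Arcs of the learned graph, as reported by the structure-learning step.
arcs = [
    ("ethnicity", "g1surban"),
    ("ethnicity", "g1freelunch"),
    ("g1surban", "g1freelunch"),
    ("g1freelunch", "Y"),
    ("T", "Y"),
]


def parents(node, arcs):
    """Set of direct causes of `node` in the graph."""
    return {tail for tail, head in arcs if head == node}


# A backdoor path must start with an arc pointing into T.
# If T has no parent, no such path exists (assuming no latent confounder).
treatment_parents = parents("T", arcs)
is_rct_like = len(treatment_parents) == 0
print(f"parents(T) = {treatment_parents}, RCT-like: {is_rct_like}")
```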

### Causal Effect Estimation

Once the adjustment is identified, we can use the appropriate estimators.

In [7]:
cee.fitDM()
tau_hat = cee.estimateCausalEffect()

print(f"ACE = {tau_hat}")

ACE = 12.814738911047016
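
For intuition, and assuming `DM` stands for a difference-in-means estimator, the quantity being computed is the mean outcome of the treated units minus the mean outcome of the control units. The sketch below uses toy numbers and a hypothetical helper `difference_in_means`; it is not the module's implementation.

```python
from statistics import mean


def difference_in_means(outcomes, treatments):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [y for y, t in zip(outcomes, treatments) if t == 1]
    control = [y for y, t in zip(outcomes, treatments) if t == 0]
    return mean(treated) - mean(control)


# Toy outcomes and treatment indicators; illustrative values only.
y = [550.0, 560.0, 530.0, 540.0, 545.0, 535.0]
t = [1, 1, 0, 0, 1, 0]

tau_dm = difference_in_means(y, t)
print(f"ACE (toy data) = {tau_dm:.3f}")
```

Under randomization this simple contrast is an unbiased estimate of the ACE, which is why it is offered for the RCT adjustment.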


In [8]:
cee.fitCausalBNEstimator()
tau_hat = cee.estimateCausalEffect()

print(f"ACE = {tau_hat}")

ACE = 11.515235748777812


Let's evaluate how the backdoor adjustment estimators compare to the previously obtained estimates. In this analysis, we control for the `g1freelunch` variable.

In [9]:
cee.useBackdoorAdjustment(intervention="T", outcome="Y", confounders={"g1freelunch"})
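
As a rough illustration of what adjusting for a discrete confounder means, the backdoor formula averages the within-stratum treated-versus-control contrast over the distribution of the confounder. The following is a minimal sketch with one discrete confounder and toy rows; `backdoor_ace` is a hypothetical helper and does not reproduce any of the module's estimators.

```python
from collections import defaultdict
from statistics import mean


def backdoor_ace(rows):
    """rows: iterable of (y, t, z), with binary t and a discrete confounder z.

    Every stratum must contain both treated and control units (positivity);
    otherwise statistics.mean raises on an empty group.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for y, t, z in rows:
        strata[z][t].append(y)

    n = sum(len(g[0]) + len(g[1]) for g in strata.values())
    ace = 0.0
    for groups in strata.values():
        weight = (len(groups[0]) + len(groups[1])) / n  # estimate of P(Z = z)
        ace += weight * (mean(groups[1]) - mean(groups[0]))
    return ace


# Toy rows (outcome, treatment, confounder level); illustrative values only.
rows = [
    (520.0, 0, 1), (535.0, 1, 1), (525.0, 0, 1), (530.0, 1, 1),
    (550.0, 0, 2), (560.0, 1, 2),
]
tau_backdoor = backdoor_ace(rows)
print(f"ACE (toy data) = {tau_backdoor:.3f}")
```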

In [10]:
cee.fitSLearner()
tau_hat = cee.estimateCausalEffect()

print(f"ACE = {tau_hat}")

ACE = 11.616979201549725


In [11]:
cee.fitTLearner()
tau_hat = cee.estimateCausalEffect()

print(f"ACE = {tau_hat}")

ACE = 11.616705516924535


In [12]:
cee.fitIPW()
tau_hat = cee.estimateCausalEffect()

print(f"ACE = {tau_hat}")

ACE = 11.77382443916551
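
Inverse probability weighting reweights each unit by the inverse of the estimated probability of receiving the treatment it actually received. Here is a minimal sketch, assuming a single discrete confounder and propensities estimated as the treated share within each stratum; `ipw_ace` and the toy rows are hypothetical, and `fitIPW` may well estimate the propensity score differently.

```python
from collections import Counter


def ipw_ace(rows):
    """rows: list of (y, t, z), with binary t and a discrete confounder z.

    The propensity e(z) is the treated share in stratum z and must lie
    strictly between 0 and 1, otherwise a division by zero occurs.
    """
    n_by_z = Counter(z for _, _, z in rows)
    treated_by_z = Counter(z for _, t, z in rows if t == 1)

    total = 0.0
    for y, t, z in rows:
        e = treated_by_z[z] / n_by_z[z]  # estimate of P(T = 1 | Z = z)
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / len(rows)


# Toy rows (outcome, treatment, confounder level); illustrative values only.
rows_ipw = [
    (520.0, 0, 1), (524.0, 0, 1), (532.0, 1, 1),
    (550.0, 0, 2), (558.0, 1, 2), (562.0, 1, 2),
]
tau_ipw = ipw_ace(rows_ipw)
print(f"ACE (toy data) = {tau_ipw:.3f}")
```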


In [13]:
cee.fitPStratification()
tau_hat = cee.estimateCausalEffect()

print(f"ACE = {tau_hat}")

ACE = 10.912212494032811


The estimates are consistent, ranging from about 10.9 to 12.8, which suggests that the true Average Causal Effect is approximately 11.5. For more detailed statistical properties of the estimation, we can employ custom `CausalML` estimators, which offer greater flexibility in producing these estimates.

### Using Custom CausalML Estimators

The `CausalEffectEstimation` framework allows users to specify custom estimators that adhere to the `causalml` API. To demonstrate this feature, we will employ a `BaseSLearner` from the `meta` module, which offers the additional capability of returning a confidence interval for the estimation.

In [14]:
from causalml.inference.meta.slearner import BaseSLearner
from sklearn.linear_model import LinearRegression

cee.fitCustomEstimator(
    estimator=BaseSLearner(
        learner=LinearRegression(),
        ate_alpha=0.05
    )
)

In [15]:
mean, low, high = cee.estimateCausalEffect(return_ci=True)
print(f"ACE = {mean[0]}, CI = [{low[0]}, {high[0]}]")

ACE = 11.616979201549725, CI = [9.39290735710075, 13.8410510459987]


Using the `BaseSLearner`, we obtain the exact same estimate as the built-in S-Learner estimator. However, this approach additionally provides a 95% confidence interval for the estimate, corresponding to `ate_alpha=0.05`.
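
To see how a significance level `alpha` relates to the interval width, the sketch below builds a normal-approximation confidence interval for a difference in means. It is only an illustration of the `alpha` to `1 - alpha` coverage relationship on toy data; `diff_in_means_ci` is a hypothetical helper and is not how `causalml` computes its interval.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_in_means_ci(treated, control, alpha=0.05):
    """Estimate and (1 - alpha) normal-approximation confidence interval."""
    estimate = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated)
              + variance(control) / len(control))
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return estimate, estimate - z * se, estimate + z * se


# Toy outcomes for each arm; illustrative values only.
treated = [551.0, 560.0, 545.0, 556.0]
control = [535.0, 541.0, 538.0, 530.0]

estimate, low, high = diff_in_means_ci(treated, control, alpha=0.05)
print(f"ACE (toy data) = {estimate:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

A smaller `alpha` gives a wider interval: calling the helper with `alpha=0.01` yields a 99% interval that contains the 95% one.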