Example: Generating data
========================

Motivation
----------

The fundamental problem of causal inference is that we only ever
observe one potential outcome per unit. Simulating data is therefore
particularly relevant when working with ATE or CATE estimation: it
provides a ground truth that the real world never gives us.

For instance, when generating data, we have access to the
Individual Treatment Effect and can use that ground truth to evaluate
a given treatment effect estimator.

In the following example we describe how the modules
<a href="../../api_documentation/#metalearners.data_generation"><code>data_generation</code></a> and
<a href="../../api_documentation/#metalearners.outcome_functions"><code>outcome_functions</code></a> can be used to generate data for
treatment effect estimation.


How-to
------

In the context of treatment effect estimation, our data usually
consists of three ingredients:

- Covariates
- Treatment assignments
- Observed outcomes

When simulating data, we can add some quantities of interest which
are not available in the real world:

- Potential outcomes
- True CATE or true ITE

Let's generate those quantities one after another.


### Covariates

Let's start by generating covariates. We will use
<a href="../../api_documentation/#metalearners.data_generation.generate_covariates"><code>generate_covariates</code></a> for that
purpose.

In [1]:
from metalearners.data_generation import generate_covariates

features, categorical_features_idx, n_categories = generate_covariates(
    n_obs=1000,
    n_features=8,
    n_categoricals=3,
    format="pandas",
)
features.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.43258 | -1.956691 | -2.72441 | -4.051359 | -4.275785 | 0 | 2 | 3 |
| 1 | 0.213119 | -1.794246 | 3.335272 | 0.596448 | -8.05307 | 0 | 2 | 3 |
| 2 | -0.333022 | -1.855324 | 2.567406 | -0.507977 | -7.255018 | 2 | 4 | 3 |
| 3 | -1.036547 | -1.37992 | 1.721547 | -2.817249 | -4.626411 | 2 | 3 | 1 |
| 4 | -1.5141 | -3.060547 | -4.077247 | -5.819707 | -4.468868 | 5 | 0 | 3 |


The result is a DataFrame with 8 columns, the last three of which are
categorical.


### Treatment assignments

In this example we replicate the setup of a randomized controlled
trial (RCT), i.e. a setup in which the treatment assignments are
independent of the covariates. We rely on
<a href="../../api_documentation/#metalearners.data_generation.generate_treatment"><code>generate_treatment</code></a>.

In [2]:
import numpy as np
from metalearners.data_generation import generate_treatment

# We use a fair coin flip as a reference.
propensity_scores = 0.5 * np.ones(1000)
treatment = generate_treatment(propensity_scores)
type(treatment), np.unique(treatment), treatment.mean()

(numpy.ndarray, array([0, 1]), 0.514)

As we would expect, an array of binary assignments is generated. Its
mean is approximately equal to the constant propensity score of
$0.5$.
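
To make the idea concrete, here is a minimal NumPy-only sketch of what drawing binary assignments from propensity scores amounts to. It is not the implementation of `generate_treatment`; the seed and the names `rng`, `treatment_sketch` and `share_treated` are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical seed, chosen only to make the sketch reproducible.
rng = np.random.default_rng(42)

n_obs = 1000
# Constant propensity score of 0.5, i.e. a fair coin flip for every unit.
propensity_scores = np.full(n_obs, 0.5)

# Each unit is assigned treatment 1 with probability equal to its
# propensity score, and treatment 0 otherwise.
treatment_sketch = rng.binomial(n=1, p=propensity_scores)

# The share of treated units should be close to the propensity score.
share_treated = treatment_sketch.mean()
```

Because the propensity scores do not depend on the covariates, the assignments mimic an RCT. Passing covariate-dependent scores instead would mimic an observational setting.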


### Potential outcomes

In this example we rely on <a href="../../api_documentation/#metalearners.outcome_functions.linear_treatment_effect"><code>linear_treatment_effect</code></a>, which
generates additive treatment effects that are linear in the features.
Note that other potential outcome functions are available as well.

In [3]:
from metalearners._utils import get_linear_dimension
from metalearners.outcome_functions import linear_treatment_effect

dim = get_linear_dimension(features)
outcome_function = linear_treatment_effect(dim)
potential_outcomes = outcome_function(features)
potential_outcomes

array([[-4.6390948 , -6.99101697],
       [-4.5927874 , -1.43775422],
       [-5.6179741 , -3.62754599],
       ...,
       [-5.81369594, -2.16523526],
       [ 0.89106589,  0.44998321],
       [-6.62191898, -7.66198481]])

The result has one column with the potential outcome $Y(0)$ and one column
with the potential outcome $Y(1)$. The individual treatment
effect is the difference between the two, $Y(1) - Y(0)$.
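
As a small illustration, the difference can be computed column-wise with NumPy. The sketch below hard-codes the first three rows of the output shown above so that it runs on its own; the names `potential_outcomes_head` and `ite_head` are chosen for illustration only.

```python
import numpy as np

# First three rows of the output shown above:
# column 0 holds Y(0), column 1 holds Y(1).
potential_outcomes_head = np.array(
    [
        [-4.6390948, -6.99101697],
        [-4.5927874, -1.43775422],
        [-5.6179741, -3.62754599],
    ]
)

# Individual treatment effect Y(1) - Y(0), one value per row.
ite_head = potential_outcomes_head[:, 1] - potential_outcomes_head[:, 0]
```

The same expression applies unchanged to the full `potential_outcomes` array.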

### Observed outcomes

Lastly, we combine the treatment assignments and potential
outcomes to generate the observed outcomes. Note that noise may cause
the observed outcome to differ from the corresponding potential
outcome. For that purpose we can use <a href="../../api_documentation/#metalearners.data_generation.compute_experiment_outputs"><code>compute_experiment_outputs</code></a> as follows:

In [4]:
from metalearners.data_generation import compute_experiment_outputs

observed_outcomes, true_cate = compute_experiment_outputs(
    potential_outcomes,
    treatment,
)
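
To illustrate the underlying idea, and why a ground truth is useful, here is a minimal NumPy-only sketch on toy data. It does not reproduce the internals of `compute_experiment_outputs`. The seed, the constant effect of 2, the noise scale and all `toy_*` names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical seed, chosen only to make the sketch reproducible.
rng = np.random.default_rng(0)
n_obs = 1000

# Toy potential outcomes: column 0 is Y(0), column 1 is Y(1).
# The treatment effect is assumed to be constant and equal to 2.
y0 = rng.normal(size=n_obs)
toy_potential_outcomes = np.column_stack([y0, y0 + 2.0])

# RCT-like assignment with a propensity score of 0.5.
toy_treatment = rng.binomial(n=1, p=0.5, size=n_obs)

# For every unit, pick the potential outcome of the treatment it received.
realized = toy_potential_outcomes[np.arange(n_obs), toy_treatment]

# Add observation noise (the scale is an assumption).
toy_observed = realized + rng.normal(scale=0.1, size=n_obs)

# Ground truth, available only because the data is simulated.
toy_true_ate = (
    toy_potential_outcomes[:, 1] - toy_potential_outcomes[:, 0]
).mean()

# Difference-in-means estimate, computed from observed data only.
ate_estimate = (
    toy_observed[toy_treatment == 1].mean()
    - toy_observed[toy_treatment == 0].mean()
)
```

Comparing `ate_estimate` with `toy_true_ate` is the kind of evaluation that simulated data makes possible. With real data only `toy_observed` and `toy_treatment` would be available.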