# Evaluating Causal Models


- **Main difficulty:** No ground truth is available, because the individual treatment effect is never observed directly
- Thus, many evaluations rely on **synthetic** and **partially synthetic** data-sets
- Partially synthetic: Use **existing data** and **synthetically** create outcomes for which we **know the individual treatment effect** by construction



**Main drawbacks of synthetic data:**
- The **data generating process** is often quite **simplistic**, so it is questionable how well it represents reality
- The **DGPs** introduce **assumptions** about the **structure** of the **problem**, e.g. the **confounders** or the function that determines the treatment effect from the covariates (it is not clear whether results generalize)
- Reference data-sets and the synthetic outcomes within them have been produced in many different ways, which makes results hard to compare

## 1 Data Generating Processes 

The following dimensions influence the data-generating process:



### 1.1 Covariates
- Two main parameters: 

(1) Dimensionality, $k=|X_i|$

(2) Correlation between the covariates

- More covariates (that are actually relevant) mean a higher-dimensional and, ceteris paribus, more difficult problem if the sample size remains fixed
- Strong correlation or even multicollinearity between covariates makes it harder to choose the right features and to estimate effects correctly
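
The two covariate parameters can be sketched as follows. This is a minimal illustration, assuming standard-normal covariates with one shared pairwise correlation; the function name `sample_covariates` and the values of `n`, `k` and `rho` are hypothetical choices, not part of any benchmark.

```python
import numpy as np

def sample_covariates(n, k, rho, seed=0):
    """Draw n samples of k standard-normal covariates with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    # Equicorrelation matrix: 1 on the diagonal, rho everywhere else.
    cov = np.full((k, k), rho)
    np.fill_diagonal(cov, 1.0)
    return rng.multivariate_normal(mean=np.zeros(k), cov=cov, size=n)

X_indep = sample_covariates(n=5000, k=5, rho=0.0)
X_corr = sample_covariates(n=5000, k=5, rho=0.9)

# The condition number of the empirical correlation matrix grows with collinearity,
# which is one way to quantify how hard it becomes to separate the covariates.
cond_indep = np.linalg.cond(np.corrcoef(X_indep, rowvar=False))
cond_corr = np.linalg.cond(np.corrcoef(X_corr, rowvar=False))
print(cond_indep, cond_corr)
```
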


### 1.2 Heterogeneity

- The homogeneity or heterogeneity of a data generating process refers to the **structure of the treatment effect**
- A **homogeneous treatment effect** means that the **ITE is the same for all instances**, i.e. $τ(X_i) = τ, \; i = 1, \dots, n$
- When we use a homogeneous treatment effect, we also **remove all dependencies** of the **treatment effect on the covariates** and thus **remove** a significant part of the **confounding by design**
- A **heterogeneous treatment effect** is **harder to estimate** for most methods, because it makes the function to be estimated more complex and can introduce additional confounding
- The **strength of heterogeneity** can be expressed as the variance of the effect minus the variance of the error, $\mathrm{Var}(τ(x)) - \mathrm{Var}(\epsilon_i)$
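
A small sketch of the difference between the two effect structures. The constant `2.0` and the functional form of the heterogeneous effect are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 3
X = rng.normal(size=(n, k))

# Homogeneous effect: the same ITE for every instance.
tau_homogeneous = np.full(n, 2.0)

# Heterogeneous effect: the ITE depends on the covariates (hypothetical form).
tau_heterogeneous = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] ** 2

# The variance of the effect is zero in the homogeneous case and positive otherwise.
print(tau_homogeneous.var(), tau_heterogeneous.var())
```
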


### 1.3 Distribution of treatment effects

- Often a **normal distribution** of the treatment effects is assumed
- A **non-normal** distribution can significantly **increase the difficulty**
- This holds **particularly** for **methods** that **cannot capture heterogeneous treatment effects**, e.g. an S-Learner with linear regression
- The distribution can even be **multi-modal**, in which case these methods will fail predictably
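
To make the multi-modal case concrete, the sketch below builds a hypothetical bimodal effect (values `-2` and `+2`) and evaluates the best possible constant estimate, which is what a method without treatment-covariate interactions can return at most. All numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)

# Hypothetical bimodal effect: roughly half the population has effect -2, the rest +2.
tau = np.where(x > 0, 2.0, -2.0)

# The best constant estimate in the squared-error sense is the mean effect.
tau_const = np.full(n, tau.mean())
mse_const = np.mean((tau_const - tau) ** 2)

# The mean effect is close to 0, yet every individual effect is missed by about 2.
print(tau.mean(), mse_const)
```
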


### 1.4 Size of treatment effects

- A smaller treatment effect relative to the total outcome is more difficult to detect

$$τ_{rel}(x_i) = \frac{τ(x_i)}{Y_i(0)}$$
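
As a short worked example of the relative effect, assuming made-up untreated outcomes `y0` (all nonzero) and an identical absolute effect for every instance:

```python
import numpy as np

y0 = np.array([10.0, 100.0, 1000.0])   # untreated potential outcomes Y_i(0), made up
tau = np.array([1.0, 1.0, 1.0])        # identical absolute effects

# Relative effect: the same absolute effect shrinks as the baseline outcome grows.
tau_rel = tau / y0
print(tau_rel)
```
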

### 1.5 Confounding

- The **bigger** the **confounding effects** and the larger the number of confounders, the **more difficult** it is to **estimate** effect sizes
- In practice, we can introduce confounding (in an artificial data-set) by choosing the same **covariates** to **determine treatment and outcome**
- **Hidden confounders**, that is, covariates that **influence both treatment and outcome** but are **not observed** in the data, are a particularly difficult problem
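
The following sketch shows how a single covariate that drives both treatment and outcome biases a naive comparison of group means. The coefficients (`2.0`, `3.0`) and the true effect `tau = 1.0` are hypothetical; the linear adjustment only works here because the outcome was generated linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)                      # observed confounder
p = 1.0 / (1.0 + np.exp(-2.0 * x))          # x raises the chance of treatment
t = rng.binomial(1, p)
tau = 1.0                                   # true (homogeneous) effect
y = 3.0 * x + tau * t + rng.normal(size=n)  # x also raises the outcome

# Naive difference in means mixes the effect of t with the effect of x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusting for x with a linear regression of y on [1, x, t] recovers tau.
design = np.column_stack([np.ones(n), x, t])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = coef[2]
print(naive, adjusted)
```

If `x` were dropped from the data (a hidden confounder), only the naive estimate would be available.
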


### 1.6 Treatment Assignment 

- The treatment assignment is just as essential and **closely related to the design of confounding**. 
- What **separates a randomized trial from an observational study** is the treatment assignment mechanism
- With the **treatment assignment** we also control the **overlap condition**
- That is to say, we control whether the **treatment and control groups** are **similar or completely different**.
- Also, we control the **respective sizes** of the **two groups**, thus also heavily influencing the difficulty of our DGP
- In general, we can say that **less overlap** makes the **problem more difficult**
- We denote the treatment assignment as a function that maps the covariates of an instance to its probability of receiving treatment. In other words, we manually fabricate the propensity score $p(x)$.
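
A minimal sketch of a fabricated propensity score and its effect on overlap. The logistic form, the `strength` parameter and the threshold `eps=0.05` are assumptions for illustration; `strength = 0` corresponds to a randomized trial with $p(x) = 0.5$.

```python
import numpy as np

def propensity(x, strength):
    """Hypothetical logistic propensity score p(x) = sigmoid(strength * x)."""
    return 1.0 / (1.0 + np.exp(-strength * x))

def overlap_share(p, eps=0.05):
    """Share of instances whose propensity lies strictly inside (eps, 1 - eps)."""
    return float(np.mean((p > eps) & (p < 1.0 - eps)))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)

results = {}
for strength in (0.0, 1.0, 5.0):
    p = propensity(x, strength)
    t = rng.binomial(1, p)   # observed treatment indicator
    # Store the treated share and the share of instances with usable overlap.
    results[strength] = (t.mean(), overlap_share(p))
    print(strength, results[strength])
```

The stronger the dependence of the assignment on `x`, the fewer instances have a propensity away from 0 and 1, i.e. the less the two groups overlap.
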

### 1.7 Functional Form 

- For the relationship between the covariates and the treatment, outcome and effect, we are free to choose any functional form
- That is to say, we can map the features using linear functions, polynomials, exponential functions, logits or any other form we can think of
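
A few candidate forms, written out as a sketch. The names and coefficients in `functional_forms` are hypothetical; any of these maps could be plugged into the outcome, the effect or the treatment probability of a DGP.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Hypothetical maps from a single covariate to a DGP component.
functional_forms = {
    "linear": lambda v: 2.0 * v,
    "polynomial": lambda v: v ** 2 - 0.5 * v,
    "exponential": lambda v: np.exp(0.5 * v),
    "logistic": lambda v: 1.0 / (1.0 + np.exp(-v)),
}

# Apply every form to the same covariate to compare the resulting ranges.
components = {name: f(x) for name, f in functional_forms.items()}
for name, values in components.items():
    print(name, values.min(), values.max())
```
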

### 1.8 Noise

Lastly, we can determine the distribution of the error term. From a simple additive
normal error to interacting measurement errors that bias the results, there is a wide range
of possibilities.

### 1.9 Sample size

<img src="http://drive.google.com/uc?export=view&id=1JWYcmYWZxpRPHiPUqWrNWvJGROQsd8Rl" width=75%>

Source: Franz (2019) **"A Systematic Review of Machine Learning Estimators for Causal Effect"**, [here](https://justcause.readthedocs.io/en/latest/_downloads/e054f7a0fc9cf9e680173600cb4b4350/thesis-mfranz.pdf). 

## 2 Existing Benchmark data-sets

### 2.1 Infant Health and Development Program (IHDP)
- The original study was constructed to examine the effect of special child care on **low-birth-weight**, **premature** infants
- In total, **six continuous and 19 binary pretreatment** variables
- Using the covariates of all instances in both treatment groups, the potential outcomes are generated synthetically
- Finally, an observational study is simulated by omitting a non-random set of samples from the treatment group
- The way the subset is generated from the experimental data does not ensure complete overlap
- Specifically, the observational subset is created by removing all children with nonwhite mothers from the treatment group
- The following data generation process is used for the potential outcomes:

$$    Y(0) \sim \mathcal{N}(\exp((X+W)\beta_B), 1) $$

$$    Y(1) \sim \mathcal{N}(X\beta_B-\omega_B^s, 1) $$


where $X$ represents the standardized covariate matrix, $W$ is an offset matrix with the dimensions of $X$ and all values set to 0.5, and $\omega_B^s$ is chosen such that the mean CATE is 4. The entries of the coefficient vector $β_B$ are sampled from the values (0, 0.1, 0.2, 0.3, 0.4) with probabilities (0.6, 0.1, 0.1, 0.1, 0.1) respectively. This
results in a nonlinear response with a heterogeneous treatment effect.

After the adaptations from Hill, we are left with **139 instances in the treated group**
and **608 instances in the control group**.
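
The response surface can be sketched as follows. This is not the original IHDP code: the covariates are replaced by random standard-normal stand-ins (the real data is not loaded here), the exponential is read as applying to the full linear predictor $(X+W)\beta_B$, the sampling probabilities are assumed to be (0.6, 0.1, 0.1, 0.1, 0.1) so that they sum to one, and the offset `omega` is computed over all samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 747, 25   # 139 treated + 608 controls; 6 continuous + 19 binary covariates

# Stand-in for the standardized IHDP covariates (assumption, not the real data).
X = rng.normal(size=(n, k))
W = np.full((n, k), 0.5)   # offset matrix with all values 0.5

# Sparse coefficient vector sampled from the five candidate values.
beta = rng.choice([0.0, 0.1, 0.2, 0.3, 0.4], size=k, p=[0.6, 0.1, 0.1, 0.1, 0.1])

mu0 = np.exp((X + W) @ beta)   # nonlinear mean of Y(0)
mu1_raw = X @ beta             # linear mean of Y(1) before the offset

# Choose the offset so that the mean conditional effect equals 4.
omega = np.mean(mu1_raw - mu0) - 4.0
mu1 = mu1_raw - omega

# Both potential outcomes get unit-variance normal noise.
y0 = rng.normal(mu0, 1.0)
y1 = rng.normal(mu1, 1.0)
cate = mu1 - mu0
print(cate.mean(), cate.std())
```
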

### 2.2 Twins Dataset
The Twins dataset is derived from birth data collected in the US between 1989 and 1991. The original data is compiled and analyzed by Almond et al. [3]. From all these births, only the twins are considered, because these allow us to perform a special kind of synthetic generation. Namely, we only consider twins with a low
birth weight and then define treatment T = 1 as being the heavier twin. In doing so, we follow other authors that have used the Twins dataset for comparisons.

This construction means that we know the outcome, mortality in this case, for both
treatments. The only synthetic part of the data is the assignment of treatment.
From the full data containing both potential outcomes, we want to generate data
that resemble an observational study. As mentioned above, this is done via different
functional relationships in different papers. For our purposes we present the process
described in First, we only use low birth weight twin pairs for which all 30 features are available.
That leaves us with 8215 samples. Since we know the mortality for both twins, we
know the ground truth. To generate an observational study we now assign treatment
by defining


$$    T \mid X \sim Bern(\sigma (W^TX+n)) $$ , with

$$    W \sim \mathcal{U}((-0.1,0.1)^{30 \times 1}) $$ and

$$    n \sim \mathcal{N}(0,0.1) $$

where $\sigma$ is the sigmoid function, $Bern$ refers to the Bernoulli distribution, $\mathcal{U}$ to the uniform distribution and $\mathcal{N}(0,0.1)$ to a normal distribution with mean 0 and standard deviation 0.1.
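
A sketch of this assignment mechanism, under two assumptions: the 30 features are replaced by random stand-ins (the real Twins data is not loaded here), and the noise $n$ is drawn independently for every sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, k = 8215, 30

# Stand-in for the 30 twin-pair features (assumption, not the real data).
X = rng.normal(size=(n_samples, k))

w = rng.uniform(-0.1, 0.1, size=k)              # weight vector W
noise = rng.normal(0.0, 0.1, size=n_samples)    # noise n with standard deviation 0.1

# Propensity via the sigmoid, then a Bernoulli draw per sample.
p = 1.0 / (1.0 + np.exp(-(X @ w + noise)))
t = rng.binomial(1, p)
print(t.mean(), p.min(), p.max())
```

In the real data, `t` then selects which twin's outcome is treated as observed, while the other twin's outcome serves as the held-back counterfactual.
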

### 2.3 Atlantic Causal Inference Conference (ACIC) Challenge
- Based on the Linked Births and Infant Deaths Database (LBIDD)
- Uses real covariates to generate a synthetic data-set
- The exact data generation process is not known

## 3 Measuring Model Performance (Error)

### 3.1 Average Treatment Effect (ATE)

$$ \epsilon_{ATE}=|τ-\hat{τ}|$$

where $\hat{τ}$ is often calculated as the average of the individual effect estimates, $\hat{τ}=n^{-1} \sum_{i=1}^{n}\hat{τ}(x_i)$
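
A minimal sketch of this metric, assuming the true ATE is taken as the mean of the true individual effects (which is only known for synthetic data). The function name `ate_error` and the toy arrays are hypothetical.

```python
import numpy as np

def ate_error(tau_true, tau_hat):
    """Absolute error between the true ATE and the mean of the ITE estimates."""
    return abs(np.mean(tau_true) - np.mean(tau_hat))

tau_true = np.array([1.0, 2.0, 3.0, 4.0])
tau_hat = np.array([2.0, 1.0, 4.0, 3.0])   # every individual estimate is off by 1

# Individual errors cancel in the average, so the ATE error is zero here.
print(ate_error(tau_true, tau_hat))
```
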

### 3.2 Precision in Estimation of Heterogeneous Effects

$$ \epsilon_{PEHE} = n^{-1} \sum_{i=1}^{n}\bigg( [ Y_i (1)-Y_i (0) ] -[\hat{Y}_i(1)-\hat{Y}_i(0)] \bigg)^2$$

$$ = n^{-1} \sum_{i=1}^{n}(\hat{τ}(x_i)-τ(x_i) )^2 $$

Often, instead of reporting $\epsilon_{PEHE}$ directly, the root $\sqrt{\epsilon_{PEHE}}$ is listed, thus making
the PEHE score a Root Mean Squared Error (RMSE) on the individual treatment
effects.
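
Both variants can be computed as follows. The function names `pehe` and `root_pehe` and the toy arrays are hypothetical; the true effects are assumed to be known from a synthetic DGP.

```python
import numpy as np

def pehe(tau_true, tau_hat):
    """Mean squared error between estimated and true individual effects."""
    diff = np.asarray(tau_hat, dtype=float) - np.asarray(tau_true, dtype=float)
    return float(np.mean(diff ** 2))

def root_pehe(tau_true, tau_hat):
    """Square root of the PEHE, i.e. an RMSE on individual effects."""
    return float(np.sqrt(pehe(tau_true, tau_hat)))

tau_true = [1.0, 2.0, 3.0, 4.0]
tau_hat = [2.0, 1.0, 4.0, 3.0]   # every individual estimate is off by 1

# Unlike an error on the average effect, PEHE does not let individual errors cancel.
print(pehe(tau_true, tau_hat), root_pehe(tau_true, tau_hat))
```
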

### 3.3 Bias

The ATE error is agnostic to the direction of the error. The method bias takes the direction into account:

$$ \epsilon_{BIAS}= n^{-1} \sum_{i=1}^{n}(\hat{τ}(x_i)-τ(x_i) )$$
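
A short sketch of the signed bias, with a hypothetical function name `bias` and toy arrays. It shows that over- and underestimation produce opposite signs.

```python
import numpy as np

def bias(tau_true, tau_hat):
    """Signed mean difference between estimated and true individual effects."""
    diff = np.asarray(tau_hat, dtype=float) - np.asarray(tau_true, dtype=float)
    return float(np.mean(diff))

tau_true = np.array([1.0, 2.0, 3.0, 4.0])
over = tau_true + 0.5    # systematic overestimation
under = tau_true - 0.5   # systematic underestimation

# Same magnitude of error, opposite sign.
print(bias(tau_true, over), bias(tau_true, under))
```
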