# 01 Causal Inference

The word "causal" refers to a cause, that is, the reason why something happens.

## Association

The fact that two events are associated does not imply that one causes the other. For instance, consider the following two events:

1. A person goes to bed with shoes on.

2. The person wakes up with a headache the next day.

There is no causal relationship between them, since sleeping with shoes on does not lead to a headache.

<br>

However, there is a non-negligible probability that the person got drunk first, then slept with shoes on and woke up with a headache the next day. So in the real world we will observe that a person who sleeps with shoes on is highly likely to wake up with a headache. This is known as association.
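The shoes-and-headache story can be illustrated with a small simulation. This is only a sketch: the variable names and every probability below are made-up values chosen for illustration, not estimates from any data. Drinking drives both sleeping with shoes on and the headache, while shoes have no effect of their own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical probabilities, chosen only for illustration.
drunk = rng.random(n) < 0.2
# Being drunk makes sleeping with shoes on much more likely.
shoes = rng.random(n) < np.where(drunk, 0.7, 0.05)
# The headache depends on drinking only, never on shoes.
headache = rng.random(n) < np.where(drunk, 0.8, 0.1)

# Association: headache rates differ strongly between the two groups.
p_headache_shoes = headache[shoes].mean()
p_headache_no_shoes = headache[~shoes].mean()

# Among drunk people only, shoes make (almost) no difference.
p_drunk_shoes = headache[drunk & shoes].mean()
p_drunk_no_shoes = headache[drunk & ~shoes].mean()

print(f"P(headache | shoes)           = {p_headache_shoes:.3f}")
print(f"P(headache | no shoes)        = {p_headache_no_shoes:.3f}")
print(f"P(headache | drunk, shoes)    = {p_drunk_shoes:.3f}")
print(f"P(headache | drunk, no shoes) = {p_drunk_no_shoes:.3f}")
```

The first two rates differ widely even though shoes never enter the headache mechanism, while the last two rates are close to each other once we look only at people who were drunk.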


### Notation

Let $\Omega$ be a population and $Y(\omega)$ be the **response** random variable $(\omega\in\Omega)$. Let $A(\omega)$ be the **attributes** (features) of $\omega$.

### Association

For example, if the response and the attribute are both categorical (discrete) variables, then we can write the joint distribution of $(Y, A)$ and the conditional distribution of $Y$ given $A$ as below:

$$\left\{
\begin{aligned} & \mathbb P(Y=y,A=a)=\sum_{Y(\omega)=y, \ A(\omega)=a} \mathbb P(\omega)\\ & 
\\ & \mathbb P(Y=y|A=a)=\frac{\sum_{Y(\omega)=y, \ A(\omega)=a} \mathbb P(\omega)}{\sum_{ \ A(\omega)=a} \mathbb P(\omega)}.
\end{aligned}\right.$$
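As a minimal sketch of the two formulas above, consider a finite population with four members. The probability masses and values are hypothetical, and the helper names `joint` and `conditional` are introduced here only for illustration.

```python
from fractions import Fraction

# Hypothetical population: one tuple (P(omega), A(omega), Y(omega)) per member.
population = [
    (Fraction(1, 4), 1, 1),
    (Fraction(1, 4), 1, 0),
    (Fraction(1, 8), 0, 1),
    (Fraction(3, 8), 0, 0),
]

def joint(y, a):
    """P(Y=y, A=a): total mass of the members with Y(omega)=y and A(omega)=a."""
    return sum(p for p, A, Y in population if Y == y and A == a)

def conditional(y, a):
    """P(Y=y | A=a): joint mass divided by the mass of the members with A(omega)=a."""
    marginal_a = sum(p for p, A, _ in population if A == a)
    return joint(y, a) / marginal_a

print(joint(1, 1))        # 1/4
print(conditional(1, 1))  # 1/2
print(conditional(1, 0))  # 1/4
```

Using `Fraction` keeps the arithmetic exact, so the conditional probabilities for a fixed $a$ sum to exactly one.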




## Counterfactual

We assume both the attribute $A$ and outcome $Y$ are binary variables, $A\in\{0,1\},\ Y\in\{0,1\}$. 

In a controlled experiment, we can assign the attribute of each individual manually.

<br>

For example, suppose we conduct a controlled experiment to study whether a new medical treatment truly benefits patients. We divide the subjects $\omega_1,\dotsc,\omega_n$ into two halves, a treatment group and a control group. In the treatment group, every individual $\omega_i$ receives the new treatment and is assigned $A(\omega_i) = 1$. In the control group, no one receives the treatment, so $A(\omega_i) = 0$.

If $\omega_i$ recovers quickly, we denote the successful outcome by $Y_i = 1$; otherwise we write $Y_i=0$.

### Potential Outcome

For each $\omega_i$, if $A(\omega_i)=1$, meaning it has received the treatment, we write $Y_i^1 = Y_i$ for the observed outcome, and $Y_i^0$ for the potential outcome <font color=red>had it not received</font> the new treatment.

Vice versa: if $A(\omega_i)=0$, meaning it is in the control group, we write $Y_i^0 = Y_i$ for the observed outcome, and $Y_i^1$ for the potential outcome had it received the new treatment.

That is, we always have the identity:

$$Y = Y^A = AY^1 + (1 - A)Y^0\quad\quad A\in\{0,1\}.$$
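The identity can be checked numerically. The sketch below assumes we somehow knew both potential outcomes of five hypothetical subjects, which is never the case in a real experiment. The arrays are made up for illustration.

```python
import numpy as np

# Hypothetical potential outcomes for five subjects.
Y1 = np.array([1, 0, 1, 1, 0])   # outcome if treated
Y0 = np.array([0, 0, 1, 0, 1])   # outcome if not treated
A = np.array([1, 0, 0, 1, 1])    # actual assignment

# Observed outcome via the identity Y = A*Y^1 + (1-A)*Y^0.
Y = A * Y1 + (1 - A) * Y0

# The same selection written as "pick Y1 where A == 1, else Y0".
Y_alt = np.where(A == 1, Y1, Y0)

print(Y)      # [1 0 1 1 0]
print(Y_alt)  # [1 0 1 1 0]
```

Because $A$ is binary, exactly one of the two terms is active for each subject, so the arithmetic form and the selection form agree.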


### Counterfactual


Note that for each subject $\omega_i$, the experiment reveals only $Y_i^{A_i}$, while the opposite outcome $Y_i^{1-A_i}$ remains unknown. This is known as the missing data problem, and it is the fundamental problem of causal inference.

The observed outcome $Y_i^{A_i}$ is called the factual, while the potential (hypothetical) outcome $Y_i^{1-A_i}$ is the counterfactual.

### Causal Effect 

The causal effect of some individual $\omega$ is defined by

$${\rm CE}=Y^1(\omega) - Y^0(\omega).$$

Average causal effect (ACE) measures the average / expected impact of changing the attribute.

$${\rm ACE}=\mathbb E(Y^1)-\mathbb E(Y^0)$$
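To make the two definitions concrete, here is a small sketch on a hypothetical population of six individuals, each with equal probability, for which both potential outcomes are assumed known. In practice they are not, so this is purely illustrative.

```python
import numpy as np

# Hypothetical potential outcomes, one entry per individual (equal weights assumed).
Y1 = np.array([1, 0, 1, 1, 0, 1])
Y0 = np.array([0, 0, 1, 0, 1, 0])

# Individual causal effects: +1 (treatment helps), 0 (no effect), -1 (treatment harms).
CE = Y1 - Y0

# Average causal effect: E(Y^1) - E(Y^0) under the uniform distribution.
ACE = Y1.mean() - Y0.mean()

print(CE)   # [ 1  0  0  1 -1  1]
print(ACE)
```

By linearity of expectation, the ACE equals the average of the individual causal effects, even though those effects differ in sign across individuals.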

### Sample ACE

We can "estimate" ACE with an observed sample $Y_1,\dotsc,Y_n$,

$${\rm SACE}=\widehat{\rm ACE}=\frac1n\sum_{i=1}^n (Y_i^1 -Y_i^0).$$

This is called the sample average causal effect (SACE) or sample average treatment effect (SATE). However, <font color=red>SACE is still unknown</font> because we can only observe one of $Y_i^1$ or $Y_i^0$  for each $i$.

<br>

Alternatively, we can use 

$$\widehat{\rm SACE} = \frac{1}{n_1}\sum_{A_i=1}Y_i^1 - \frac{1}{n_0}\sum_{A_i=0}Y_i^0.$$

Here $n_0$ and $n_1$ stand for the numbers of subjects with $A_i=0$ and $A_i=1$ respectively.

However, this difference in group means recovers the ACE only under certain conditions.


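```python
import pandas as pd
import numpy as np

# Toy data: A is the treatment indicator, Y the observed binary outcome.
data = {'A': [1,1,0,1,0,0,0,1], 'Y': [1,0,1,0,1,0,1,1]}

def SACE(data):
    """Print the difference-in-means estimate and return the observed potential outcomes."""
    data = pd.DataFrame(data)
    # Only the factual outcome is observed; the counterfactual is left as NaN.
    Y1 = np.where(data['A'] == 1, data['Y'], np.nan)
    Y0 = np.where(data['A'] == 0, data['Y'], np.nan)
    # nanmean averages over the observed entries only.
    Y1mean = np.nanmean(Y1)
    Y0mean = np.nanmean(Y0)
    print('SACE = E(Y|A=1) - E(Y|A=0) = %.2f - %.2f = %.2f'%(
                Y1mean, Y0mean, Y1mean - Y0mean))
    return pd.DataFrame({'Y1': Y1, 'Y0': Y0, 'A': data['A']})

SACE(data)
```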

SACE = E(Y|A=1) - E(Y|A=0) = 0.50 - 0.75 = -0.25


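|   | Y1  | Y0  | A |
|---|-----|-----|---|
| 0 | 1.0 | NaN | 1 |
| 1 | 0.0 | NaN | 1 |
| 2 | NaN | 1.0 | 0 |
| 3 | 0.0 | NaN | 1 |
| 4 | NaN | 1.0 | 0 |
| 5 | NaN | 0.0 | 0 |
| 6 | NaN | 1.0 | 0 |
| 7 | 1.0 | NaN | 1 |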


## Assumptions

Consider the following equation

$${\rm ACE}\stackrel{?}{=}\mathbb E(Y|A=1) - \mathbb E(Y|A=0)\quad\quad(\star).$$

The plug-in estimator gives the aforementioned formula:

$$\widehat{\rm SACE}=\frac{1}{n_1}\sum_{A_i=1}Y_i^1 - \frac{1}{n_0}\sum_{A_i=0}Y_i^0.$$

Yet we need four assumptions (conditions) for ($\star$) and its plug-in estimator to hold: consistency, stable unit treatment value assumption, exchangeability and positivity.

If all four conditions are satisfied, we say that the ACE is identifiable in this case.

### Consistency

Consistency is the definition of $Y_i^{A_i}$: i.e. $Y_i^1=Y_i$ if $A_i=1$ while $Y_i^0=Y_i$ if $A_i=0$. Jointly,

$$Y_i = A_iY_i^1+(1-A_i)Y_i^0.$$

### Stable Unit Treatment Value Assumption

Stable unit treatment value assumption (SUTVA) requires no interference between subjects, so that the distribution of $Y_i$ given $A_1,\dotsc,A_n$ is equal to that given merely $A_i$:

$$\mathbb P(Y_i|A_1,\dotsc,A_n) \equiv \mathbb P(Y_i|A_i).$$


### Exchangeability

Apart from the attribute $A$, there may be other **confounding** variables $C$ that affect the outcome. We need to control for such variables so that the treatment group and the control group have the same distribution of these variables.

To achieve this, we assign each $A_i$ before the experiment starts (that is, we decide whether a subject joins the treatment group) to be $0$ or $1$ with equal probability, regardless of $C$:

$$\mathbb P(A=1|C=c)=\mathbb P(A=0|C=c)=\frac12.$$

It is also called the assumption of unconfoundedness or randomization.

<br>

Moreover, the exchangeability leads to 

$$\mathbb E(Y^1|A=1) =\mathbb E(Y^1|A=0)\quad{\rm and}\quad \mathbb E(Y^0|A=1)=\mathbb E(Y^0|A=0).$$

This indicates that if the subjects in the control group had received the new treatment, they would have responded, on average, the same as the current treatment group.

In this case, $\mathbb P$ is a product measure over $A\times C$.
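The effect of randomization can be illustrated by simulation. The sketch below assumes a single binary confounder and made-up outcome probabilities. Both potential outcomes are generated for every subject, which is possible only in a simulation, so that the two equalities above can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical binary confounder that shifts both potential outcomes.
C = rng.random(n) < 0.4
Y0 = (rng.random(n) < np.where(C, 0.6, 0.2)).astype(int)
Y1 = (rng.random(n) < np.where(C, 0.8, 0.5)).astype(int)

# Fair coin assignment, independent of C: P(A=1|C=c) = 1/2.
A = (rng.random(n) < 0.5).astype(int)

# Exchangeability: each potential outcome has (nearly) the same mean in both groups.
gap_Y1 = Y1[A == 1].mean() - Y1[A == 0].mean()
gap_Y0 = Y0[A == 1].mean() - Y0[A == 0].mean()
# The confounder is also balanced across the two groups.
gap_C = C[A == 1].mean() - C[A == 0].mean()

print(f"E(Y1|A=1) - E(Y1|A=0) = {gap_Y1:+.4f}")
print(f"E(Y0|A=1) - E(Y0|A=0) = {gap_Y0:+.4f}")
print(f"P(C|A=1)  - P(C|A=0)  = {gap_C:+.4f}")
```

All three gaps are close to zero up to sampling noise, which is what independence between $A$ and $(C, Y^0, Y^1)$ predicts.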

<br>


### Positivity

For all $C=c$, we need $\mathbb P(A=0|C=c)>0$ and $\mathbb P(A=1|C=c)>0$. This assumption holds automatically under the randomization above, where both probabilities equal $\frac12$.
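In observational data, positivity can at least be checked empirically by estimating $\mathbb P(A=1|C=c)$ within each stratum. The following sketch uses a tiny made-up dataset. The column values and the name `violations` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: C is a categorical confounder, A the treatment indicator.
df = pd.DataFrame({
    'C': ['young', 'young', 'young', 'old', 'old', 'old'],
    'A': [1, 0, 1, 1, 1, 1],
})

# Empirical P(A=1 | C=c) for each stratum c.
propensity = df.groupby('C')['A'].mean()

# Strata where everyone (or no one) is treated violate positivity in the sample.
violations = propensity[(propensity == 0) | (propensity == 1)]

print(propensity)
print(list(violations.index))  # ['old']
```

Here every `old` subject is treated, so $\mathbb E(Y|A=0, C=\text{old})$ cannot be estimated from this sample at all.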


### ACE

**Theorem** When the four assumptions hold, ${\rm ACE} = \mathbb E(Y^1) -\mathbb E(Y^0) = \mathbb E(Y|A=1) -\mathbb E(Y|A=0)$. Also, assuming the confounding variable $C$ is categorical we have 

$$\mathbb E(Y^a) =\sum_c \mathbb E(Y|A=a, \ C=c)\mathbb P(C = c).$$

**Proof** For the first equation, note that the exchangeability implies $\mathbb E(Y^1|A=1) =\mathbb E(Y^1|A=0)=\mathbb E(Y^1)$ and also $\mathbb E(Y^0|A=0) = \mathbb E(Y^0)$. Therefore,

$$\begin{aligned}\mathbb E(Y^1) -\mathbb E(Y^0) \stackrel{\rm Exchangeability}{=\!=\!=\!=\!=\!=\!=}\mathbb E &(Y^1|A=1)-\mathbb E(Y^0|A=0)\\  \stackrel{\rm Consistency}{=\!=\!=\!=}\mathbb E &(Y|A=1) -\mathbb E(Y|A=0).\end{aligned}$$

For the second, we have 

$$\begin{aligned}\mathbb E(Y^a) = \sum_c & \mathbb E(Y^a|C=c)\mathbb P(C=c)
\\ \stackrel{\rm Exchangeability}{=\!=\!=\!=\!=\!=\!=\!=} \sum_c & \mathbb E(Y^a|A=a,C=c)\mathbb P(C=c)
\\ \stackrel{\rm Consistency}{=\!=\!=\!=} \sum_c & \mathbb E(Y|A=a,C=c)\mathbb P(C=c).
\end{aligned}$$

(SUTVA guarantees the plug-in estimator.)
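The second formula of the theorem can be sketched in code as a plug-in estimate. The simulation below is an illustration under stated assumptions: one binary confounder, made-up probabilities, and treatment assignment that depends on $C$, so exchangeability holds only within each stratum of $C$ (which is the form used in the second part of the proof). The function name `standardized_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical setup: C influences both the assignment and the outcomes.
C = (rng.random(n) < 0.5).astype(int)
A = (rng.random(n) < np.where(C == 1, 0.8, 0.2)).astype(int)
Y0 = (rng.random(n) < np.where(C == 1, 0.6, 0.2)).astype(int)
Y1 = (rng.random(n) < np.where(C == 1, 0.8, 0.4)).astype(int)
Y = A * Y1 + (1 - A) * Y0          # consistency: only the factual outcome is observed

df = pd.DataFrame({'C': C, 'A': A, 'Y': Y})

def standardized_mean(df, a):
    """Plug-in estimate of sum_c E(Y | A=a, C=c) * P(C=c)."""
    p_c = df['C'].value_counts(normalize=True)        # empirical P(C=c)
    cond = df[df['A'] == a].groupby('C')['Y'].mean()  # empirical E(Y | A=a, C=c)
    return float((cond * p_c).sum())                  # aligned on the values of C

ace_standardized = standardized_mean(df, 1) - standardized_mean(df, 0)
ace_naive = df.loc[df['A'] == 1, 'Y'].mean() - df.loc[df['A'] == 0, 'Y'].mean()

# By construction, E(Y^1) = 0.6 and E(Y^0) = 0.4, so the true ACE is 0.2.
print(f"standardized estimate: {ace_standardized:.3f}")
print(f"naive difference:      {ace_naive:.3f}")
```

In this setup the naive difference in group means overstates the effect, because treated subjects are more likely to have $C=1$, while the standardized estimate stays close to the true value of $0.2$.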