# Chapter 9 Causal Inference


## 9.1 Two Paradoxes 


**Simpson's paradox** The following table summarizes the number of successes and failures for two treatments (A and B) applied to small and large kidney stones. Treatment A is an open surgical procedure, and Treatment B is a minimally invasive procedure. A more effective treatment should yield a higher success rate than its alternative. 

|     | Small stones: success | Small stones: fail | Large stones: success | Large stones: fail |
| :- | :-: | :-: | :-: | :-: |
| Treatment A | 81 | 6 | 192 | 71 |
| Treatment B | 234 | 36 | 55 | 25 |

We can calculate the success rates as follows (Treatment A vs. Treatment B). 

- Success rate for small stones:   **{{round(81/87, digits=2)}}** (81/87)   > **{{round(234/270, digits=2)}}** (234/270)

- Success rate for large stones:   **{{round(192/263, digits=2)}}** (192/263) > **{{round(55/80, digits=2)}}** (55/80)

- Overall success rate:    **{{round(273/350, digits=2)}}**  (273/350)  <span style="color:red">**<**</span>  **{{round(289/350, digits=2)}}** (289/350)
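The rates above can be reproduced with a short script. This is a minimal sketch: the counts are copied from the table, while the names `counts`, `success_rate`, `stratum_rates`, and `overall_rate` are illustrative and do not come from the original notes.

```python
# Counts (successes, failures) copied from the kidney-stone table.
counts = {
    "A": {"small": (81, 6), "large": (192, 71)},
    "B": {"small": (234, 36), "large": (55, 25)},
}


def success_rate(successes, failures):
    """Proportion of successes among all treated cases."""
    return successes / (successes + failures)


def stratum_rates(treatment):
    """Success rate within each stone-size stratum for one treatment."""
    return {size: success_rate(*sf) for size, sf in counts[treatment].items()}


def overall_rate(treatment):
    """Success rate after pooling small and large stones."""
    successes = sum(s for s, _ in counts[treatment].values())
    failures = sum(f for _, f in counts[treatment].values())
    return success_rate(successes, failures)


for t in ("A", "B"):
    rates = stratum_rates(t)
    print(t, round(rates["small"], 2), round(rates["large"], 2),
          round(overall_rate(t), 2))
```

The pooled comparison reverses because the two treatments were applied to very different mixes of cases: 263 of the 350 patients on Treatment A had large stones, whereas 270 of the 350 patients on Treatment B had small stones.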


<span style = "color:red"> **Which treatment is more effective?** </span> Read more about Simpson's paradox [here](https://en.wikipedia.org/wiki/Simpson's_paradox).


**Lord's paradox** Lord (1967) discussed a paradox concerning whether the  effects of the diet provided in the dining hall differ for males and females. 

The data is shown in the following scatter plot. The variables involved are gender $(G_i)$, weight in 1963 $(X_i)$, and weight in 1964 $(Y_i)$. We have $\text{mean}(Y_i \mid G_i=1)=\text{mean}(X_i \mid G_i=1)=150$ and $\text{mean}(Y_i \mid G_i=0)=\text{mean}(X_i \mid G_i=0)=130$.
 
 
 <img src="../Figures/Ch10/lord.png" style="width: 500px;"/>


- Researcher A: Average weights unchanged for both males and females. 
- Researcher B: $Y_i =\beta_0+\beta_g G_i+\beta_X X_i+\epsilon_i$ $\rightsquigarrow$   $\widehat{\beta}_g=6.34$.

- Statistician C: $Y_i =\beta_0+\beta_g G_i+\beta_X X_i+\beta G_i X_i+\epsilon_i$ $\rightsquigarrow$   $\widehat{\beta}=0.036 \ (\text{s.e.} = 0.019)$


<span style = "color:red"> **Who is correct?** </span> Read more about Lord's paradox [here](https://en.wikipedia.org/wiki/Lord's_paradox). 
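To see how Researcher A's and Researcher B's summaries can both hold on the same data, here is a small simulation sketch. It does not use Lord's data: the within-group correlation `rho = 0.5`, the standard deviation `sd = 10`, and the sample size are assumptions chosen only for illustration, so the fitted coefficient will not match the 6.34 reported above. Only the group means 150 and 130 are taken from the text. The coefficient of $G_i$ in the additive model is computed from the pooled within-group slope, which coincides with the least-squares fit of $Y_i$ on an intercept, $G_i$, and $X_i$.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_group(n, mean_weight, rho, sd):
    """Draw n (x, y) pairs whose x and y share one mean and have correlation rho."""
    noise_sd = sd * (1 - rho**2) ** 0.5
    pairs = []
    for _ in range(n):
        x = random.gauss(mean_weight, sd)
        # y regresses toward the group mean, so mean(y) matches mean(x) in expectation
        y = mean_weight + rho * (x - mean_weight) + random.gauss(0, noise_sd)
        pairs.append((x, y))
    return pairs


def group_coefficient(group1, group0):
    """Least-squares coefficients (beta_g, beta_x) in Y ~ 1 + G + X."""
    sxy = sxx = 0.0
    means = []
    for pairs in (group1, group0):
        xbar = fmean(x for x, _ in pairs)
        ybar = fmean(y for _, y in pairs)
        means.append((xbar, ybar))
        sxy += sum((x - xbar) * (y - ybar) for x, y in pairs)
        sxx += sum((x - xbar) ** 2 for x, _ in pairs)
    beta_x = sxy / sxx  # pooled within-group slope
    (x1, y1), (x0, y0) = means
    beta_g = (y1 - y0) - beta_x * (x1 - x0)
    return beta_g, beta_x


# Hypothetical parameters; only the group means 150 and 130 come from the text.
group1 = simulate_group(5000, 150, rho=0.5, sd=10)
group0 = simulate_group(5000, 130, rho=0.5, sd=10)

gain1 = fmean(y - x for x, y in group1)
gain0 = fmean(y - x for x, y in group0)
beta_g, beta_x = group_coefficient(group1, group0)
print("average weight change:", round(gain1, 2), round(gain0, 2))
print("beta_g:", round(beta_g, 2), "beta_x:", round(beta_x, 2))
```

Under these assumptions the average weight change is close to zero in both groups, yet $\widehat{\beta}_g$ is close to $(1-\rho)\times(150-130)=10$. The two researchers are summarizing different comparisons on the same data.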


## 9.2 Overview of causal inference

The two paradoxes should, at the very least, make it clear that association does not necessarily imply causation. In everyday life, people rarely make an effort to distinguish association from causation, unless they happen to be statisticians. Furthermore, most quantitative methods can only support claims about association, not causation. 

However, it is often causation that matters in the real world. For instance, we might be interested in the following questions.
- What would happen to the patient if they received treatment $A$ instead of $B$?
- What would happen to the unemployment rate if the U.S. government increased minimum wages?
- What would happen to the number of cases if a state took a different action in April?

In all these what-ifs, we are always comparing _outcomes_ under _different_ conditions for the _same_ subject(s). In other words, **causal inference** is the comparison between _potential outcomes_ under _treatment_ and _control_ for the _same_ unit(s). 

It is well known, and common practice, that randomized experiments are warranted for decisions and actions that involve causal inference. Two prominent examples are randomized clinical trials for evaluating drugs and treatments, and A/B testing for evaluating business strategies. 

Randomization seems to have answered the what-ifs. With a randomized clinical trial, we can compare the outcomes of patients who were randomly assigned to the two treatments. However, there is a small but crucial caveat. In a randomized clinical trial, a patient cannot take one treatment and then roll back time to take the other, so the comparison is not made on the _same_ unit. Is it even possible, then, to restrict causal inference to comparisons conducted on the same unit? We need a "language" to discuss this question rigorously. 

We will use the [potential outcome framework](https://en.wikipedia.org/wiki/Rubin_causal_model) in this note. There are many other approaches, including [directed acyclic graphs](https://en.wikipedia.org/wiki/Causal_graph).

Suppose we **observe** a data set with treatment $Z_i$ ($A$ or $B$) and outcome $Y_i$ (1 for success and 0 for failure) for $i=1,2,\ldots, n$. Further consider the **potential outcomes** $Y_i( \text{treatment} = A)$, the outcome if subject $i$ receives treatment $A$, and $Y_i( \text{treatment} = B)$, the outcome if subject $i$ receives treatment $B$. Using the notation of potential outcomes, we can write the observed outcome as $Y_i(Z_i)$, which shows that only one potential outcome is observed for each unit. The following table shows the potential outcomes, with the **observed values** in boldface. 

| Unit $i$ | $Z_i$ | &nbsp; $Y_i(A)$ | &nbsp; $Y_i(B)$ |
| ---  | ---  | ---- | ---- |
|1  |  A     |   **0**        |      1|
|2  |      B    |    0        |       **1** |
|3  |       A  |     **0**        |      0 |
|4  |       A |      **1**       |      1 |


In a real data set, we only get to observe the following. 


| Unit $i$ | $Z_i$ | $Y_i$ | 
| --- | --- | ---  |
|1  |       A     |         **0**        |     
|2  |      B    |       **1** |
|3  |       A  |          **0**        |     
|4  |       A |          **1**       |     
|5  |       B  |           **0** |

With this notation, we can see that the what-ifs describe comparisons of potential outcomes, e.g., $Y_i(A)-Y_i(B)$. To be specific, a causal effect is defined as a comparison of the potential outcomes on the **same units**.
- **Individual causal effect**:  $Y_i(A)-Y_i(B)$.
- **Average causal effect** (ACE): $\text{mean}\{Y_i(A)-Y_i(B)\}$

It is important to notice that $\text{mean}\{Y_i(A)\}$ does not equal $\text{mean}\{Y_i \mid Z_i=A\}$ in general. We can see this using the example in the tables above.
- $\text{mean}\{Y_i(A)\}$: average of $Y_i(A)$ for units 1, 2, 3, 4.
- $\text{mean}\{Y_i\mid Z_i=A\}$: average of $Y_i$ for units 1, 3, 4 (the units that received treatment $A$).
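The quantities defined above can be computed directly from the four-unit potential-outcome table. This is a minimal sketch: the values are copied from that table, and the variable names are illustrative. In a real study the full table is never available, so the first three quantities cannot be computed from observed data alone.

```python
from statistics import fmean

# Rows of the potential-outcome table: (unit, Z_i, Y_i(A), Y_i(B)).
units = [
    (1, "A", 0, 1),
    (2, "B", 0, 1),
    (3, "A", 0, 0),
    (4, "A", 1, 1),
]

# Individual causal effects Y_i(A) - Y_i(B) and their average (the ACE).
individual_effects = [ya - yb for _, _, ya, yb in units]
ace = fmean(individual_effects)

# mean{Y_i(A)} averages the potential outcome under A over ALL units.
mean_potential_a = fmean(ya for _, _, ya, _ in units)

# The observed outcome Y_i = Y_i(Z_i) reveals only one potential outcome per unit.
observed = [(z, ya if z == "A" else yb) for _, z, ya, yb in units]

# mean{Y_i | Z_i = A} averages observed outcomes over the units treated with A.
mean_observed_a = fmean(y for z, y in observed if z == "A")

print(individual_effects, ace)
print(mean_potential_a, mean_observed_a)
```

Here $\text{mean}\{Y_i(A)\} = 1/4$, while $\text{mean}\{Y_i \mid Z_i=A\} = 1/3$, so the two quantities differ in this example.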


The tables above also reveal the fundamental problem of causal inference: only one potential outcome is observed for each unit. In randomized experiments, randomization ensures that $\text{mean}\{Y_i(A)\} = \text{mean}\{Y_i \mid Z_i=A\}$ in expectation, which allows us to estimate the average causal effect. We should always think about what can and what cannot be learned from the observed data.

> If your experiment needs statistics, you ought to have done a better experiment.
>
>        --  Ernest Rutherford 
        

> Regression models often seem to be used to compensate for problems in measurement, data collection, and study design. By the time models are deployed, the scientific position is nearly hopeless. 
> 
>     --  David Freedman 


A new task that will often be discussed in this chapter is **Identification**: what can be identified if there is an infinite amount of data. This task concerns the design of experiments, or how the data are collected. In contrast, the **Inference** problem, where we study what can be learned from a finite sample, is what most statistical methods focus on. 

## 9.3 Potential outcome 



Recall our notation for potential outcomes in Section 9.2. Suppose that we observe a data set with treatment $Z_i \in \{0,1\}$ and outcome $Y_i \in \{0,1\}$ for  $i=1,2,\ldots, n$. Further consider the pair of **potential outcomes** $\{Y_i(1), Y_i(0)\}$ for each unit. We only get to observe $Y_i \equiv Y_i(Z_i)$ in the data set. This notation can be generalized to the case where the treatment takes more than two values. 


Any causal quantity is a function of potential outcomes. We can define the causal effect for unit $i$ as $\tau_i \equiv Y_i(1)-Y_i(0)$. We can define other causal effects as long as they are interpretable and meaningful, such as $\log Y_i(1)-\log Y_i(0)$ or $Y_i(1)/Y_i(0)$ when the outcomes are positive. 

Some notes on the potential outcomes. 
1. Potential outcomes are thought to be fixed for each unit. 
2. The observed outcome is random for each unit, because the treatment is random.
3. Potential outcomes are seen as fixed _attributes_ of each subject. 
4. Potential outcomes have a distribution across units. 
5. Treatments determine which potential outcomes to see. 
6. In real life, only one potential outcome is observed for each unit. 

It is important to note that the notation itself introduces assumptions. 
1. Causal ordering: $Y_i$ cannot causally affect $Z_i$. 
2. No interference between units.
3. Same version of treatments. 

These are combined into the Stable Unit Treatment Value Assumption (SUTVA). It requires that if $Z_i = Z_i'$, then $Y_i(\vec{Z})=Y_i(\vec{Z}')$, where $\vec{Z}\equiv (Z_1,Z_2, \ldots, Z_n)^T$ and $\vec{Z}'\equiv (Z'_1,Z'_2, \ldots, Z'_n)^T$ are two treatment assignment vectors. The SUTVA is violated when there exists simultaneity (feedback effects), spillover effects, or multiple versions of treatments. 

In addition, the treatment cannot be an immutable characteristic, such as gender, race, or age. A general rule for causal inference is that there is no causation without manipulation. There are, however, novel strategies to learn or define causal effects for immutable characteristics.  

We can now examine why causal interpretation is easier in randomized experiments. Recall that the average causal effect is defined as  
\[
{\rm ACE} \equiv \mathbb{E}[Y(1)-Y(0)]=\mathbb{E}[Y(1)]-\mathbb{E}[Y(0)].
\]
Randomization of treatments means $\{Z_i\}_{i=1}^n \perp \{ Y_i(0), Y_i(1)\}_{i=1}^n$. 

We can derive 
\begin{align}
{\rm ACE} & = \mathbb{E}[Y(1)]-\mathbb{E}[Y(0)]=\mathbb{E}[Y(1)|Z=1]-\mathbb{E}[Y(0)|Z=0]\\
& = \mathbb{E}[Y|Z=1]-\mathbb{E}[Y|Z=0],
\end{align}
where the second equality follows from independence between treatments and potential outcomes, and the third from the fact that the observed outcome is $Y = Y(Z)$. 
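The derivation can be checked numerically with a simulation sketch. Everything below is hypothetical: the potential outcomes are drawn with $P\{Y(0)=1\}=0.4$ and $P\{Y(1)=1\}=0.6$, and the "confounded" assignment rule, in which units with $Y(0)=1$ are more likely to be treated, is an assumption made only to contrast with randomization.

```python
import random
from statistics import fmean

random.seed(1)  # fixed seed so the sketch is reproducible
n = 20000

# Hypothetical binary potential outcomes for n units.
y0 = [int(random.random() < 0.4) for _ in range(n)]
y1 = [int(random.random() < 0.6) for _ in range(n)]
true_ace = fmean(y1) - fmean(y0)


def observe(z, y0, y1):
    """Observed outcome Y_i = Y_i(Z_i): only one potential outcome is revealed."""
    return [b if zi == 1 else a for zi, a, b in zip(z, y0, y1)]


def difference_in_means(z, y):
    """Sample analogue of E[Y | Z = 1] - E[Y | Z = 0]."""
    treated = [yi for zi, yi in zip(z, y) if zi == 1]
    control = [yi for zi, yi in zip(z, y) if zi == 0]
    return fmean(treated) - fmean(control)


# Randomized assignment: Z is independent of (Y(0), Y(1)).
z_random = [int(random.random() < 0.5) for _ in range(n)]

# Confounded assignment (assumed rule): treatment probability depends on Y(0).
z_confounded = [int(random.random() < (0.8 if a == 1 else 0.2)) for a in y0]

est_random = difference_in_means(z_random, observe(z_random, y0, y1))
est_confounded = difference_in_means(z_confounded, observe(z_confounded, y0, y1))

print("true ACE:", round(true_ace, 3))
print("randomized estimate:", round(est_random, 3))
print("confounded estimate:", round(est_confounded, 3))
```

Under the randomized assignment, the difference in observed means should be close to the true ACE. Under the confounded rule, the control group is depleted of units with $Y(0)=1$, so the same difference in means overstates the effect.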