# Chapter 9 Causal Inference


## 9.1 Two Paradoxes 


**Simpson's paradox.** The following table summarizes the numbers of successes and failures for two treatments (A and B) on small and large kidney stones. Treatment A is an open surgical procedure, and Treatment B is a minimally invasive procedure. A more effective treatment should yield a higher success rate than its alternative.

|     | Small stones: success | Small stones: fail | Large stones: success | Large stones: fail |
| :- | :-: | :-: | :-: | :-: |
| Treatment A | 81 | 6 | 192 | 71 |
| Treatment B | 234 | 36 | 55 | 25 |

We can calculate the success rates as follows (Treatment A vs. Treatment B).

- Success rate for small stones:   0.93 (81/87)   > 0.87 (234/270)

- Success rate for large stones:   0.73  (192/263) > 0.69 (55/80)

- Overall success rate:   0.78 (273/350)  <span style="color:red"> <  </span> 0.83 (289/350)
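The rates above can be reproduced directly from the table. The following is a minimal sketch in Python; the dictionary layout and the helper name `success_rate` are illustrative choices, not part of the original notes.

```python
from fractions import Fraction

# (successes, failures) from the table above, keyed by (treatment, stone size)
counts = {
    ("A", "small"): (81, 6),
    ("A", "large"): (192, 71),
    ("B", "small"): (234, 36),
    ("B", "large"): (55, 25),
}

def success_rate(cells):
    """Pooled success rate over a list of (successes, failures) cells."""
    successes = sum(s for s, _ in cells)
    total = sum(s + f for s, f in cells)
    return Fraction(successes, total)

rates = {}
for treatment in ("A", "B"):
    for size in ("small", "large"):
        rates[(treatment, size)] = success_rate([counts[(treatment, size)]])
    # Overall rate pools both stone sizes for the same treatment
    rates[(treatment, "overall")] = success_rate(
        [counts[(treatment, "small")], counts[(treatment, "large")]]
    )

for size in ("small", "large", "overall"):
    a, b = rates[("A", size)], rates[("B", size)]
    print(f"{size:>7}: A = {float(a):.2f}, B = {float(b):.2f}, A higher: {a > b}")
```

Treatment A has the higher rate within each stone size, yet the lower rate once the two sizes are pooled, because A was used far more often on large stones.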


<span style = "color:red"> Which treatment is more effective? </span> Read more about Simpson's paradox [here](https://en.wikipedia.org/wiki/Simpson's_paradox).


**Lord's paradox.** Lord (1967) discussed a paradox concerning whether the  effects of the diet provided in the dining hall differ for males and females. 

The data is shown in the following scatter plot. The variables involved are gender $(G_i)$, weight in 1963 $(X_i)$, and weight in 1964 $(Y_i)$. We have $\text{mean}(Y_i \mid G_i=1)=\text{mean}(X_i \mid G_i=1)=150$ and $\text{mean}(Y_i \mid G_i=0)=\text{mean}(X_i \mid G_i=0)=130$.
 
 
 <img src="../Figures/Ch9/lord.png" style="width: 500px;"/>


- Researcher A: Average weights unchanged for both males and females. 
- Researcher B: $Y_i =\beta_0+\beta_g G_i+\beta_X X_i+\epsilon_i$ $\rightsquigarrow$   $\widehat{\beta}_g=6.34$.

- Statistician C: $Y_i =\beta_0+\beta_g G_i+\beta_X X_i+\beta G_i X_i+\epsilon_i$ $\rightsquigarrow$   $\widehat{\beta}=0.036 \ (\text{s.e.} = 0.019)$


<span style = "color:red"> **Who is correct?** </span> Read more about the Lord's paradox [here](https://en.wikipedia.org/wiki/Lord's_paradox). 


## 9.2 Overview of causal inference


The two paradoxes should, at the very least, make it clear that association does not necessarily imply causation. In everyday life, people rarely make an effort to distinguish association from causation, unless they are statisticians. Furthermore, most quantitative methods can only make claims about association, not causation.



However, it is often the causal inference that matters in the real world. For instance, we might be interested in the following questions.
- What would happen to the patient if they received treatment $A$ instead of $B$?
- What would happen to the unemployment rate if the U.S. government increased minimum wages?
- What would happen to the case number if a state took a different action in April?


In all these what-ifs, we compare _outcomes_ under _different_ conditions for the _same_ subject(s). In other words, **causal inference** is the comparison between _potential outcomes_ under _treatment_ and _control_ for the _same_ unit(s).




It is well known, and common practice, that randomized experiments are the preferred basis for decisions and actions that involve causal inference. Two prominent examples are randomized clinical trials for evaluating drugs and treatments, and A/B testing for evaluating business strategies.

Randomization seems to have answered the what-ifs. With a randomized clinical trial, we can compare the outcomes of patients who were randomly assigned to the two treatments. However, there is a small but crucial caveat: the comparison is not made on the _same_ unit, because a patient cannot take one treatment and then roll back time to take the other. Is it even possible to restrict causal inference to comparisons made on the same unit? We need a "language" to discuss this question rigorously.



We will use the [potential outcome framework](https://en.wikipedia.org/wiki/Rubin_causal_model) in this note. There are many other approaches, including [directed acyclic graphs](https://en.wikipedia.org/wiki/Causal_graph).



Suppose we **observe** a data set with treatment $Z_i$ ($A$ or $B$) and outcome $Y_i$ (1 for success and 0 for failure) for $i=1,2,\ldots, n$. Further consider the **potential outcomes** $Y_i( \text{treatment} = A)$, the outcome if subject $i$ receives treatment $A$, and $Y_i( \text{treatment} = B)$, the outcome if subject $i$ receives treatment $B$. Using the notation of potential outcomes, we can write the observed outcome as $Y_i(Z_i)$, which shows that only one potential outcome is observed for each unit. The following table shows the potential outcomes, with the **observed values** in boldface.

| Unit $i$ | $Z_i$ | &nbsp; $Y_i(A)$ | &nbsp; $Y_i(B)$ |
| ---  | ---  | ---- | ---- |
|1  |  A     |   **0**        |      1|
|2  |      B    |    0        |       **1** |
|3  |       A  |     **0**        |      0 |
|4  |       A |      **1**       |      1 |


In real data, we only get to observe the following.


| Unit $i$ | $Z_i$ | $Y_i$ | 
| --- | --- | ---  |
|1  |       A     |         **0**        |     
|2  |      B    |       **1** |
|3  |       A  |          **0**        |     
|4  |       A |          **1**       |     



With these new notations, we can see that the what-ifs describe the comparison of potential outcomes, e.g., $Y_i(A)-Y_i(B)$. To be specific, a causal effect is  defined to be the comparison of the  potential outcomes on the **same units**.
- **Individual causal effect**:  $Y_i(A)-Y_i(B)$.
- **Average causal effect** (ACE): $\text{mean}\{Y_i(A)-Y_i(B)\}$

It is important to notice that $\text{mean}\{Y_i(A)\}$ does not equal $\text{mean}\{Y_i \mid Z_i=A\}$ in general. We can see this using the examples in the above tables.
- $\text{mean}\{Y_i(A)\}$: average of $Y_i(A)$ for units 1, 2, 3, 4.
- $\text{mean}\{Y_i\mid Z_i=A\}$: average of $Y_i$ for units 1, 3, 4 (the units that received treatment $A$).
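The two averages can be computed from the four-unit potential-outcome table. This is a small sketch in Python; the tuple layout is an illustrative assumption, and the full table is of course never available in real data.

```python
# Rows of the potential-outcome table: (unit, Z_i, Y_i(A), Y_i(B))
units = [
    (1, "A", 0, 1),
    (2, "B", 0, 1),
    (3, "A", 0, 0),
    (4, "A", 1, 1),
]

# mean{Y_i(A)}: uses Y_i(A) for every unit, including the unobserved entry of unit 2
mean_potential_A = sum(ya for _, _, ya, _ in units) / len(units)

# mean{Y_i | Z_i = A}: uses only the units that actually received A
observed_A = [ya for _, z, ya, _ in units if z == "A"]
mean_observed_A = sum(observed_A) / len(observed_A)

# Average causal effect of A versus B over the same four units
ace = sum(ya - yb for _, _, ya, yb in units) / len(units)

print(mean_potential_A, mean_observed_A, ace)
```

Here $\text{mean}\{Y_i(A)\}=1/4$ while $\text{mean}\{Y_i \mid Z_i=A\}=1/3$, and the average causal effect over the four units is $-1/2$.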




The above tables also reveal the fundamental problem of causal inference: only one potential outcome is observed for each unit. In randomized experiments, randomization ensures that $\text{mean}\{Y_i(A)\}$ and $\text{mean}\{Y_i \mid Z_i=A\}$ agree in expectation, which allows us to estimate the average causal effect. We should always think about what can and what cannot be learned from the observed data.

> If your experiment needs statistics, you ought to have done a better experiment.
>
>        --  Ernest Rutherford 
        

> Regression models often seem to be used to compensate for problems in measurement, data collection, and study design. By the time models are deployed, the scientific position is nearly hopeless. 
> 
>        --  David Freedman 



A new task that will often be discussed in this chapter is **identification**: what can be identified if there is an infinite amount of data. This task concerns the design of experiments, or how the data are collected. In contrast, the **inference** problem, where we study what can be learned from a finite sample, is the focus of most statistical methods.

## 9.3 Potential outcome 




Recall our notation for potential outcomes in Section 9.2. Suppose that we observe a data set with treatment $Z_i \in \{0,1\}$ and outcome $Y_i \in \{0,1\}$ for  $i=1,2,\ldots, n$. Further consider the pair of **potential outcomes** $\{Y_i(1), Y_i(0)\}$ for each unit. We only get to observe $Y_i \equiv Y_i(Z_i)$ in the data set. This notation can be generalized to the case where the treatment takes more than two values.




Any causal quantity is a function of potential outcomes. We can define the causal effect for unit $i$ as $\tau_i \equiv Y_i(1)-Y_i(0)$. We can define more causal effects as long as they are interpretable and meaningful, such as $\log Y_i(1)-\log Y_i(0)$, $Y_i(1)/Y_i(0)$, etc.




Some notes on the potential outcomes. 
1. Potential outcomes are treated as fixed _attributes_ of each unit. 
2. The **observed** outcome is random for each unit, because the treatment is random.
3. Potential outcomes have a distribution across units. 
4. Treatments determine which potential outcome is observed. 
5. In real life, only one potential outcome is observed for each unit. 



It is important to note that the notation itself introduces assumptions. 
1. Causal ordering: $Y_i$ cannot causally affect $Z_i$. 
2. No interference between units.
3. Same version of treatments. 

These are combined into the *Stable Unit Treatment Value Assumption* (SUTVA). It requires that if $Z_i =Z_i'$, then $Y_i(\vec{Z})=Y_i(\vec{Z}')$, where $\vec{Z}\equiv (Z_1,Z_2, \ldots, Z_n)^T$ and $\vec{Z}'\equiv (Z'_1,Z'_2, \ldots, Z'_n)^T$. SUTVA is violated when there is simultaneity (feedback effects), spillover effects, or multiple versions of treatments. 



In addition, the treatment cannot be an **immutable** characteristic, such as sex, race, or age. A general rule for causal inference is that there is no causation without manipulation. There are, however, novel strategies to learn or define causal effects for immutable characteristics.  



We can now examine why causal interpretation is easier in randomized experiments. Recall that the average causal effect is defined as  
$$
{\rm ACE} \equiv \mathbb{E}[Y(1)-Y(0)]=\mathbb{E}[Y(1)]-\mathbb{E}[Y(0)].
$$
**Randomization of treatments means $\{Z_i\}_{i=1}^n \perp \{ Y_i(0), Y_i(1)\}_{i=1}^n$.**

We can derive 
\begin{align}
{\rm ACE} & = \mathbb{E}[Y(1)]-\mathbb{E}[Y(0)]=\mathbb{E}[Y(1)|Z=1]-\mathbb{E}[Y(0)|Z=0]\\
& = \mathbb{E}[Y|Z=1]-\mathbb{E}[Y|Z=0],
\end{align}
where the second equality follows from independence between treatments and potential outcomes, and the third from $Y=Y(Z)$. 
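A small simulation can illustrate the derivation. The sketch below is not from the original notes: the success probabilities 0.4 and 0.6 and the confounded assignment rule are hypothetical choices made only for illustration. It compares the observed difference in means under a randomized assignment and under an assignment that depends on $Y_i(0)$.

```python
import random

random.seed(0)
n = 20000

# Hypothetical binary potential outcomes for n units
y0 = [1 if random.random() < 0.4 else 0 for _ in range(n)]
y1 = [1 if random.random() < 0.6 else 0 for _ in range(n)]
true_ace = sum(a - b for a, b in zip(y1, y0)) / n

def diff_in_means(z):
    """Observed difference in means, using Y_i = Y_i(Z_i)."""
    observed = [a if t else b for t, a, b in zip(z, y1, y0)]
    treated = [y for t, y in zip(z, observed) if t]
    control = [y for t, y in zip(z, observed) if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Randomized assignment: Z is independent of (Y(0), Y(1))
z_random = [random.random() < 0.5 for _ in range(n)]

# Confounded assignment: units with Y(0) = 1 are more likely to be treated
z_confounded = [random.random() < (0.8 if b == 1 else 0.2) for b in y0]

diff_random = diff_in_means(z_random)
diff_confounded = diff_in_means(z_confounded)
print(f"ACE = {true_ace:.3f}, randomized = {diff_random:.3f}, "
      f"confounded = {diff_confounded:.3f}")
```

Under the randomized assignment the difference in means is close to the ACE; under the confounded assignment it is not, because the control group is selected to have low $Y_i(0)$.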

## 9.4 Observational studies 

### 9.4.1 Overview 

Observational studies are studies in which the treatment is not randomized, often because randomization is infeasible. In the potential outcome framework, this means that $\{Y_i(1), Y_i(0)\}$ is not guaranteed to be independent of $Z_i$. 

Most real-world data are observational. Each field faces a wide range of challenges in analyzing such data. Three main statistical challenges are as follows. 

* Treatment assignment mechanism is often unknown.
* There exist observed and unobserved confounders.
* Missing data are widely present. 

### 9.4.2 Selection bias 



We have discussed in Section 9.2 that, in an observational study, the expected difference between the treatment and control groups does not in general equal the average causal effect (ACE). We can take a closer look at the difference (i.e., the bias):
$$
\mathbb{E}[Y_i|Z_i =1] - \mathbb{E}[Y_i | Z_i=0] - {\rm ACE}={\rm pr}(Z_i=0) \big\{\mathbb{E}[Y_i(1)|Z_i=1]-\mathbb{E}[Y_i(1)|Z_i=0] \big\}+{\rm pr}(Z_i=1) \big\{\mathbb{E}[Y_i(0)|Z_i=1]-\mathbb{E}[Y_i(0)|Z_i=0] \big\}. 
$$
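The decomposition can be checked numerically on a toy finite population, reading expectations as population averages and ${\rm pr}(Z_i=z)$ as population proportions. The ten units below are made up purely for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical population: each row is (Z_i, Y_i(1), Y_i(0))
population = [
    (1, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 0),
    (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 0, 0), (0, 0, 1),
]

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

p1 = mean(z for z, _, _ in population)  # pr(Z = 1)
p0 = 1 - p1                             # pr(Z = 0)

def cond_mean(arm, z_value):
    """E[Y(arm) | Z = z_value]; arm = 1 selects Y(1), arm = 0 selects Y(0)."""
    column = 1 if arm == 1 else 2
    return mean(row[column] for row in population if row[0] == z_value)

ace = mean(y1 - y0 for _, y1, y0 in population)

# Left-hand side: observed difference in means minus the ACE
lhs = cond_mean(1, 1) - cond_mean(0, 0) - ace

# Right-hand side: the two selection-bias terms
rhs = (p0 * (cond_mean(1, 1) - cond_mean(1, 0))
       + p1 * (cond_mean(0, 1) - cond_mean(0, 0)))

print(lhs, rhs)
```

Both sides evaluate to $19/60$ for this population; exact rational arithmetic is used so that the identity holds without rounding error.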



To remove the bias, one common assumption in the literature is the **ignorability** assumption. Roughly, it states that, conditional on covariates $X_i$, treated and control units have the same distributions of potential outcomes, i.e.,
$$
Y_i(1) | Z_i = 1, X_i \sim Y_i(1)|Z_i=0, X_i
$$
$$
Y_i(0) | Z_i = 1, X_i \sim Y_i(0)|Z_i=0, X_i
$$

Formally, the ignorability assumption states that $Z_i \perp Y_i(z) \mid X_i$ for $z=0,1$. This is also known as unconfoundedness, exogeneity, or selection on observables. 



In addition to the ignorability assumption, we need a few more assumptions to identify the ACE in an observational study. The assumptions are as follows. 

- $\{Z_i, Y_i(1), Y_i(0), X_i\}_{i=1}^n$ are i.i.d. (stronger than SUTVA)
- Ignorability
- Overlap: $0<{\rm pr}(Z_i=1| X_i =x)<1$ for any $x$

The overlap assumption is a technical requirement for identification. Without overlap, one of the groups $\{i: Z_i=1, X_i=x\}$ or $\{i: Z_i=0, X_i=x\}$ is empty for some $x$, which means that no comparison is possible at that $x$. 

These assumptions each remove part of the selection bias. To see this, let $S_1$ and $S_0$ denote the supports of $X_i$ given $Z_i=1$ and $Z_i=0$, respectively, and let $S \equiv S_1 \cap S_0$. The second difference in the bias expression can be decomposed as 
\begin{align}
& \mathbb{E}[Y_i(0)|Z_i=1]-\mathbb{E}[Y_i(0)|Z_i=0]\\
= & \int_{S_1 \setminus S} \mathbb{E}[Y_i(0) | Z_i =1, X_i=x] d F_{X_i|Z_i=1} (x)  - \int_{S_0 \setminus S} \mathbb{E}[Y_i(0) | Z_i =0, X_i=x] d F_{X_i|Z_i=0} (x) \\
& + \int_{S} \mathbb{E}[Y_i(0)| Z_i=0, X_i=x] d \{F_{X_i|Z_i=1}(x)-F_{X_i|Z_i=0}(x) \}\\
& + \int_{S} \big\{\mathbb{E}[Y_i(0)| Z_i=1, X_i=x] -\mathbb{E}[Y_i(0)| Z_i=0, X_i=x] \big\} d \{F_{X_i|Z_i=1}(x)\},
\end{align}
where the first two terms are biases due to non-overlap, the second-to-last term is the bias due to imbalance of observables, and the last term is the bias due to unobservables. Overlap removes the first two terms, and ignorability removes the last one. 



### 9.4.3 Elementary strategies

**Stratification**


We can identify the ${\rm ACE}$ via the conditional average causal effect as follows 
\begin{align}
{\rm ACE} & = \mathbb{E}[Y_i(1)-Y_i(0)]= \mathbb{E}\big\{ \mathbb{E}[Y_i(1)-Y_i(0)|X_i] \big\}\\
& = \int \big\{ \mathbb{E}[Y_i|Z_i=1,X_i]-\mathbb{E}[Y_i|Z_i=0,X_i] \big\} d F( X).
\end{align}

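Suppose that $X_i \in \{1,2,\ldots, K\}$. Under the ignorability assumption $Y_{i}(z) \perp Z_i \mid X_i$, the treatment $Z_i$ is as good as randomized within each subgroup $\{i: X_i=k\}$. Let $n_{[k]}$ be the number of units in subgroup $k$ and $\hat{\rm ACE}_{[k]}$ the difference in mean outcomes between treated and control units in that subgroup. 
We then have 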
$$
\hat{\rm ACE}=\sum_{k=1}^K \frac{n_{[k]}}{n} \hat{\rm ACE}_{[k]}.
$$
Stratification becomes challenging when $K$ is large (e.g., when there are multiple covariates), because many strata then contain few or no units from one of the treatment groups. 
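As an illustration, the stratified estimator can be applied to the kidney stone table from Section 9.1, with stone size as the stratifying covariate ($K=2$). Treating stone size as the only confounder is an assumption made here purely for illustration; the notes do not claim it.

```python
from fractions import Fraction

# (successes, failures) per treatment within each stone-size stratum
strata = {
    "small": {"A": (81, 6), "B": (234, 36)},
    "large": {"A": (192, 71), "B": (55, 25)},
}

# Total sample size n
n = sum(s + f for arms in strata.values() for s, f in arms.values())

ace_hat = Fraction(0)
for name, arms in strata.items():
    n_k = sum(s + f for s, f in arms.values())          # stratum size n_[k]
    rate = {arm: Fraction(s, s + f) for arm, (s, f) in arms.items()}
    ace_k = rate["A"] - rate["B"]                       # within-stratum estimate
    ace_hat += Fraction(n_k, n) * ace_k                 # weight by n_[k] / n
    print(f"{name}: n_k = {n_k}, ACE_k = {float(ace_k):.4f}")

# Unadjusted comparison that ignores stone size
pooled = Fraction(81 + 192, 350) - Fraction(234 + 55, 350)

print(f"stratified = {float(ace_hat):.4f}, pooled = {float(pooled):.4f}")
```

The stratified estimate favors Treatment A, while the pooled difference favors Treatment B, which is the reversal seen in Simpson's paradox.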


**Outcome regression**

We can use linear outcome models for the potential outcomes. 
$$
\begin{cases}
Y_i(1)=\alpha_1 + x_i'\gamma_1 + \epsilon_i(1) & \mathbb{E}[\epsilon_i(1)|X_i]=0\\
Y_i(0)=\alpha_0 + x_i'\gamma_0 + \epsilon_i(0) & \mathbb{E}[\epsilon_i(0)|X_i]=0
\end{cases}
$$
This yields 
$$
\begin{cases}
\mathbb{E}[ Y_i(1)| x_i]= \mathbb{E}[ Y_i| Z_i=1, x_i] =\alpha_1 + x_i'\gamma_1 \\
\mathbb{E}[ Y_i(0)| x_i]= \mathbb{E}[ Y_i| Z_i=0, x_i] =\alpha_0 + x_i'\gamma_0
\end{cases}
$$
Then 
$$
{\rm ACE} = \mathbb{E}\big\{ \mathbb{E}[Y_i(1)-Y_i(0)|X_i] \big\}=(\alpha_1-\alpha_0)+(\gamma_1 -\gamma_0)' \mathbb{E}[X_i]
$$

Since the observed outcome is $Y_i=Z_i Y_i(1)+(1-Z_i) Y_i(0)$, defining $\epsilon_i=Z_i \epsilon_i(1)+(1-Z_i) \epsilon_i(0)$ gives 
$$
Y_i = \alpha_0 + (\alpha_1-\alpha_0)Z_i +X_i'\gamma_0+Z_i X_i'(\gamma_1 -\gamma_0)+\epsilon_i.
$$
Note that the same does not hold for logistic regression and other nonlinear regression. 
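The regression approach can be sketched on simulated data with a single covariate. Everything below is a hypothetical setup: the coefficients, the logistic assignment rule, and the helper `fit_line` are chosen only for illustration. Fitting a separate line in each treatment arm is equivalent to fitting the interacted regression above.

```python
import math
import random

random.seed(1)
n = 5000

# Hypothetical data: X ~ N(1, 1); treatment probability increases with X
x = [random.gauss(1.0, 1.0) for _ in range(n)]
z = [1 if random.random() < 1.0 / (1.0 + math.exp(-(xi - 1.0))) else 0 for xi in x]
# Y(1) = 2 + 1.5 X + noise and Y(0) = 1 + 0.5 X + noise, so ACE = 1 + E[X] = 2
y = [(2.0 + 1.5 * xi if zi == 1 else 1.0 + 0.5 * xi) + random.gauss(0.0, 1.0)
     for xi, zi in zip(x, z)]

def mean(values):
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single covariate."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

x_t = [xi for xi, zi in zip(x, z) if zi == 1]
y_t = [yi for yi, zi in zip(y, z) if zi == 1]
x_c = [xi for xi, zi in zip(x, z) if zi == 0]
y_c = [yi for yi, zi in zip(y, z) if zi == 0]

alpha1, gamma1 = fit_line(x_t, y_t)   # fitted model for E[Y | Z = 1, X]
alpha0, gamma0 = fit_line(x_c, y_c)   # fitted model for E[Y | Z = 0, X]

# Plug-in version of (alpha1 - alpha0) + (gamma1 - gamma0) E[X]
ace_reg = (alpha1 - alpha0) + (gamma1 - gamma0) * mean(x)
naive = mean(y_t) - mean(y_c)
print(f"regression = {ace_reg:.3f}, naive difference = {naive:.3f}")
```

In this setup the regression estimate should be close to the true ACE of 2, while the naive difference in means is inflated because treated units tend to have larger $X_i$.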



## 9.5 Propensity score

### 9.5.1 Definition




A key feature of the potential outcome framework is that it separates the science, i.e., ${\rm pr}(Y_i(1), Y_i(0) | X_i)$, from the treatment assignment mechanism  
$$
e(X_i, Y_i(1), Y_i(0)) \equiv {\rm pr}(Z_i=1 | X_i, Y_i(1), Y_i(0)). 
$$
Under the ignorability assumption, the treatment assignment mechanism reduces to $e(X_i)={\rm pr}(Z_i=1 | X_i)$, which is known as the **propensity score** (Rosenbaum and Rubin, 1983). 



**Theorem** (Propensity score as a dimension reduction tool) Assume that 
1. $\{X_i, Y_i(1), Y_i(0), Z_i\}_{i=1}^n$ are i.i.d. 
2. $Z_i \perp \{Y_i(1), Y_i(0)\} | X_i$,
3. $\eta \leq e(X_i) \leq 1- \eta$ for some positive $\eta$. 

Then $Z_i \perp \{Y_i(1), Y_i(0)\} | e(X_i)$. 



**Proof** It suffices to show that
$$
{\rm pr}\{Z=1 | Y(1), Y(0), e(X) \} = {\rm pr}\{ Z=1 | e(X)\}.
$$

The left-hand side satisfies that 
\begin{align}
{\rm LHS} \ & = \mathbb{E}[Z| Y(1), Y(0), e(X)] \\
& = \mathbb{E}[ \mathbb{E}[Z| Y(1), Y(0), e(X), X] | Y(1), Y(0), e(X)] \\
& = \mathbb{E}[e(X)| Y(1), Y(0), e(X)] =e(X),
\end{align}
where the third equality holds because 
\begin{align}
& \mathbb{E}[Z| Y(1), Y(0), e(X), X] \\
= & \mathbb{E}[Z| Y(1), Y(0), X] \ \ \ ({\rm redundancy \  of }\ e(X) \ \ {\rm given} \ X)\\
= & \mathbb{E}[Z|  X] \ \ \  {\rm (Ignorability)} \\
= & {\rm pr}(Z=1|X)=e(X).
\end{align}

Furthermore, we have 
\begin{align}
{\rm RHS} \ & = \mathbb{E}[Z|e(X)] = \mathbb{E}\big[ \mathbb{E}[Z|e(X), X ] \mid e(X) \big]\\
& = \mathbb{E}[e(X)|e(X)]=e(X),
\end{align}
where the third equality holds since 
$$
\mathbb{E}[Z|e(X), X ]= \mathbb{E}[Z|X]={\rm pr}(Z=1|X) =e(X). 
$$



**Take-aways** 

Simple case: $e(X) \in \{ e_1, e_2, \ldots, e_K\}$. Just do stratified analysis! 

Problem: (1) $e(X)$ usually takes continuous values; (2) $e(\cdot)$ is unknown. 

General case: (1) Estimate $e(X_i)$ with logistic regression; (2) Discretize $\hat{e}(X_i)$ with cutoffs.

Problems: (1) Small $K$ leads to bias; large $K$ leads to no overlap within strata. So how to choose $K$? 
(2) $\hat{e}(X)$ is estimated. 


### 9.5.2 Propensity score weighting

**Theorem** (Inverse propensity score weighting) Under conditions 1, 2, and 3 above, we have 
$$
\mathbb{E}[Y_i(1)] = \mathbb{E}\left[\frac{Z_i Y_i}{e(X_i)} \right],\ 
\mathbb{E}[Y_i(0)] = \mathbb{E}\left[\frac{ (1-Z_i) Y_i}{1-e(X_i)} \right].
$$

$$
{\rm ACE} \ = \mathbb{E}[Y_i(1)-Y_i(0)]=\mathbb{E}\left[ \frac{Z_i Y_i }{e(X_i)}\right]-\mathbb{E}\left[\frac{ (1-Z_i) Y_i}{1-e(X_i)} \right].
$$
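One way to see why the first identity holds is sketched below, using $Z_i Y_i = Z_i Y_i(1)$, iterated expectations, and ignorability; the second identity follows in the same way with $1-Z_i$ and $1-e(X_i)$.

\begin{align}
\mathbb{E}\left[\frac{Z_i Y_i}{e(X_i)} \right] & = \mathbb{E}\left[\frac{Z_i Y_i(1)}{e(X_i)} \right] = \mathbb{E}\left\{ \frac{\mathbb{E}[Z_i Y_i(1) \mid X_i]}{e(X_i)} \right\}\\
& = \mathbb{E}\left\{ \frac{\mathbb{E}[Z_i \mid X_i] \, \mathbb{E}[Y_i(1) \mid X_i]}{e(X_i)} \right\} = \mathbb{E}\big\{ \mathbb{E}[Y_i(1) \mid X_i] \big\} = \mathbb{E}[Y_i(1)].
\end{align}

The overlap condition guarantees that the division by $e(X_i)$ is well defined.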

The Horvitz-Thompson inverse probability weighting estimator is 
$$
\widehat{\rm ACE}^{\rm ht} =\frac{1}{n} \sum_{i=1}^n \frac{Z_i Y_i}{\hat{e}(X_i)} - \frac{1}{n} \sum_{i=1}^n \frac{(1-Z_i)Y_i}{1-\hat{e}(X_i)}.
$$
The Hajek inverse probability weighting estimator is 
$$
\widehat{\rm ACE}^{\rm hajek} =\frac{  \sum_{i=1}^n \frac{Z_i Y_i}{\hat{e}(X_i)}}{\sum_{i=1}^n \frac{Z_i }{\hat{e}(X_i)}} - \frac{\sum_{i=1}^n \frac{(1-Z_i)Y_i}{1-\hat{e}(X_i)}}{\sum_{i=1}^n \frac{1-Z_i}{1-\hat{e}(X_i)}}.
$$

The two estimators have the same limit asymptotically. 

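Problems: (1) the estimators are unstable when $\hat{e}(X_i)$ is close to 0 or 1 (a common remedy is to trim the propensity scores); (2) the variance has no simple closed form (use the bootstrap). 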
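Both weighting estimators can be sketched on simulated data. In the hypothetical setup below, the covariate takes three values, the true propensity scores are known and used as weights (in practice they would be estimated), and the treatment effect is constant and equal to 1; none of these choices come from the original notes.

```python
import random

random.seed(2)
n = 10000

# Hypothetical true propensity scores e(x) for a three-level covariate
e = {0: 0.2, 1: 0.5, 2: 0.8}

x = [random.choice([0, 1, 2]) for _ in range(n)]
z = [1 if random.random() < e[xi] else 0 for xi in x]
# Y(0) = X + noise and Y(1) = X + 1 + noise, so the true ACE is 1
y = [xi + zi + random.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

# Inverse probability weights for the treated and control terms
w1 = [zi / e[xi] for xi, zi in zip(x, z)]
w0 = [(1 - zi) / (1 - e[xi]) for xi, zi in zip(x, z)]

# Horvitz-Thompson: normalize by n
ace_ht = (sum(w * yi for w, yi in zip(w1, y)) / n
          - sum(w * yi for w, yi in zip(w0, y)) / n)

# Hajek: normalize by the sum of the weights
ace_hajek = (sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
             - sum(w * yi for w, yi in zip(w0, y)) / sum(w0))

# Unweighted difference in means, for comparison
n1 = sum(z)
naive = (sum(yi for yi, zi in zip(y, z) if zi == 1) / n1
         - sum(yi for yi, zi in zip(y, z) if zi == 0) / (n - n1))

print(f"HT = {ace_ht:.3f}, Hajek = {ace_hajek:.3f}, naive = {naive:.3f}")
```

Both weighted estimates should be near 1 in this setup, whereas the unweighted difference is biased upward because units with larger $X_i$ are more likely to be treated.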


### 9.5.3 Design using propensity scores

The propensity score plays a central role in designing observational studies. 

**Theorem** (Balancing property of propensity score) Assume that conditions 1 and 3 hold. We have $Z \perp X \mid e(X)$ and, for any $h(x)$ with $\mathbb{E}|h(X)| < \infty$, 
$$
\mathbb{E} \left[ \frac{Z h(X)}{e(X)} \right] =\mathbb{E} \left[ \frac{(1-Z) h(X)}{1-e(X)} \right].
$$

Why is this theorem useful? 

* Design of observational studies without outcome 
* Crucial to check covariate balance
* How to choose $h(X)$? 
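A balance check along these lines can be sketched without any outcome data. The simulation below is hypothetical: it uses a three-level covariate with known propensity scores, and the choices $h(x)=x$ and $h(x)=x^2$ are arbitrary examples.

```python
import random

random.seed(3)
n = 20000

# Hypothetical true propensity scores for a three-level covariate
e = {0: 0.2, 1: 0.5, 2: 0.8}
x = [random.choice([0, 1, 2]) for _ in range(n)]
z = [1 if random.random() < e[xi] else 0 for xi in x]

def weighted_means(h):
    """Sample analogues of E[Z h(X) / e(X)] and E[(1 - Z) h(X) / (1 - e(X))]."""
    treated = sum(zi * h(xi) / e[xi] for xi, zi in zip(x, z)) / n
    control = sum((1 - zi) * h(xi) / (1 - e[xi]) for xi, zi in zip(x, z)) / n
    return treated, control

def raw_means(h):
    """Unweighted means of h(X) in the treated and control groups."""
    treated = [h(xi) for xi, zi in zip(x, z) if zi == 1]
    control = [h(xi) for xi, zi in zip(x, z) if zi == 0]
    return sum(treated) / len(treated), sum(control) / len(control)

for label, h in [("x", lambda v: v), ("x^2", lambda v: v * v)]:
    wt, wc = weighted_means(h)
    rt, rc = raw_means(h)
    print(f"h = {label}: weighted {wt:.3f} vs {wc:.3f}, raw {rt:.3f} vs {rc:.3f}")
```

The weighted means should roughly agree across the two groups, while the raw means differ; a large gap after weighting would suggest a misspecified propensity score model.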

### 9.5.4  Double robustness



We know two identification formulas for ${\rm ACE}$. 
$$
{\rm (i)} \ {\rm ACE} \ = \mathbb{E} \left[\frac{ZY}{e(X)} \right] - \mathbb{E} \left[\frac{(1-Z)Y}{1-e(X)} \right] \ \ {\rm (IPW)}.
$$

$$
{\rm (ii)} \ {\rm ACE} \ = \mathbb{E}[\mu_1(X)]-\mathbb{E}[\mu_0(X)],
$$
where $\mu_1(X) = \mathbb{E}[Y(1)\mid X]$ and $\mu_0(X)=\mathbb{E}[Y(0)\mid X]$. 

Both (i) and (ii) hold under the i.i.d., ignorability, and overlap assumptions. Thus, any convex combination of (i) and (ii) gives ${\rm ACE}$. We will discuss one case. 



Consider 
$$
\begin{cases}
Y(1) = Y(1) - \mu_1(X) + \mu_1(X)\\
Y(0) = Y(0) - \mu_0(X) + \mu_0(X)
\end{cases}
$$
We can derive 
\begin{align}
{\rm ACE} \ = \ &  \mathbb{E}[Y(1) - \mu_1(X)] -\mathbb{E}[Y(0) - \mu_0(X)] \quad ({\rm ACE \ for \ residuals})\\
& + \mathbb{E}[\mu_1(X)]-\mathbb{E}[\mu_0(X)] \quad ({\rm ACE \ for }\ X) \\
= & \mathbb{E}\left[\frac{Z [Y-\mu_1(X)] }{e(X)} \right] - \mathbb{E}\left[\frac{(1-Z) [Y-\mu_0(X)] }{1-e(X)} \right]+\mathbb{E}[\mu_1(X)] -\mathbb{E}[\mu_0(X)] \quad ({\rm IPW \ for \ residuals})\\
= & \mathbb{E}\left[\frac{Z [Y-\mu_1(X)] }{e(X)} +\mu_1(X) \right] - \mathbb{E}\left[\frac{(1-Z) [Y-\mu_0(X)] }{1-e(X)} +\mu_0(X)\right].
\end{align}



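**Theorem** (Double robustness property) Let $e(X,\alpha)$, $\mu_1(X,\beta_1)$, and $\mu_0(X,\beta_0)$ be working models for the propensity score and the two outcome means, and define 
$$
\tilde{\mu}_{1,{\rm DR}} \equiv \mathbb{E}\left[\frac{Z [Y-\mu_1(X,\beta_1)] }{e(X,\alpha)} +\mu_1(X,\beta_1) \right], \ 
\tilde{\mu}_{0,{\rm DR}} \equiv \mathbb{E}\left[\frac{(1-Z) [Y-\mu_0(X,\beta_0)] }{1-e(X,\alpha)} +\mu_0(X,\beta_0) \right].
$$
1. If either $e(X,\alpha)=e(X)$ or $\mu_1(X,\beta_1)=\mu_1(X)$, then $\tilde{\mu}_{1,{\rm DR}} =\mathbb{E}[Y(1)]$.
2. If either $e(X,\alpha)=e(X)$ or $\mu_0(X,\beta_0)=\mu_0(X)$, then $\tilde{\mu}_{0,{\rm DR}} =\mathbb{E}[Y(0)]$.
3. If either $e(X,\alpha)=e(X)$ or $[\mu_0(X,\beta_0), \mu_1(X,\beta_1) ]=[ \mu_0(X), \mu_1(X)]$, then ${\rm ACE}=\tilde{\mu}_{1,{\rm DR}}-\tilde{\mu}_{0,{\rm DR}}$.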



**Doubly robust estimator**

Step 1: Fit a propensity score model $e(X,\hat{\alpha})$.

Step 2: Fit two outcome models $\mu_1(X,\hat{\beta}_1)$ and $\mu_0(X,\hat{\beta}_0)$.

Step 3: 
\begin{align}
\hat{\mu}_{1,{\rm DR}} & =\frac{1}{n} \sum_{i=1}^n \left\{ \frac{Z_i\big[Y_i -\mu_1\big(X_i, \hat{\beta}_1\big) \big] }{e\big( X_i, \hat{\alpha}\big)}+\mu_1\big(X_i, \hat{\beta}_1\big) \right\}\\
\hat{\mu}_{0,{\rm DR}} & =\frac{1}{n} \sum_{i=1}^n \left\{ \frac{(1-Z_i)\big[Y_i -\mu_0\big(X_i, \hat{\beta}_0\big) \big] }{1-e\big( X_i, \hat{\alpha}\big)}+\mu_0\big(X_i, \hat{\beta}_0\big) \right\}\\
\widehat{\rm ACE}_{\rm DR} & =
\hat{\mu}_{1,{\rm DR}}- \hat{\mu}_{0,{\rm DR}} = \frac{1}{n}\sum_{i=1}^n \hat{\tau}_{i, {\rm DR}},
\end{align}
where 
$$
\hat{\tau}_{i, {\rm DR}}=\frac{Z_i\big[Y_i -\mu_1\big(X_i, \hat{\beta}_1\big) \big] }{e\big( X_i, \hat{\alpha}\big)}+\mu_1\big(X_i, \hat{\beta}_1\big) - \frac{(1-Z_i)\big[Y_i -\mu_0\big(X_i, \hat{\beta}_0\big) \big] }{1-e\big( X_i, \hat{\alpha}\big)}-\mu_0\big(X_i, \hat{\beta}_0\big).
$$

Variance estimator 

(i) Lunceford and Davidian (2004, Stat Med)
$$
\hat{\rm var}\big(\hat{\rm ACE}_{\rm DR} \big)=\frac{1}{n(n-1)} \sum_{i=1}^n \big(\hat{\tau}_{i,{\rm DR}} - \hat{\rm ACE}_{\rm DR} \big)^2
$$

(ii) Bootstrap 
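The double robustness property can be illustrated with a short simulation. This is a sketch under simplifying assumptions: the working models are fixed functions plugged in by hand rather than fitted models, the data-generating process is hypothetical, and the helper `dr_estimate` is an illustrative name. It returns the point estimate together with the variance estimator in (i).

```python
import random

random.seed(4)
n = 10000

# Hypothetical truth: e(x) as below, mu_1(x) = x + 1, mu_0(x) = x, so ACE = 1
true_e = {0: 0.2, 1: 0.5, 2: 0.8}
x = [random.choice([0, 1, 2]) for _ in range(n)]
z = [1 if random.random() < true_e[xi] else 0 for xi in x]
y = [xi + zi + random.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

def dr_estimate(e, mu1, mu0):
    """Doubly robust ACE estimate and its variance estimate for given working models."""
    taus = []
    for xi, zi, yi in zip(x, z, y):
        m1 = zi * (yi - mu1(xi)) / e(xi) + mu1(xi)
        m0 = (1 - zi) * (yi - mu0(xi)) / (1 - e(xi)) + mu0(xi)
        taus.append(m1 - m0)
    ace = sum(taus) / n
    var = sum((t - ace) ** 2 for t in taus) / (n * (n - 1))
    return ace, var

def e_right(v):
    return true_e[v]

def e_wrong(v):
    return 0.5          # ignores the dependence of treatment on x

def zero(v):
    return 0.0          # badly misspecified outcome model

# Correct propensity score, wrong outcome models
ace_a, var_a = dr_estimate(e_right, zero, zero)
# Wrong propensity score, correct outcome models
ace_b, var_b = dr_estimate(e_wrong, lambda v: v + 1.0, lambda v: float(v))
# Both wrong
ace_c, var_c = dr_estimate(e_wrong, zero, zero)

print(f"e correct: {ace_a:.3f}, mu correct: {ace_b:.3f}, both wrong: {ace_c:.3f}")
```

The first two estimates should be close to the true ACE of 1, while the third is biased, which matches the statement that one correct model suffices but at least one is needed.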

## 9.6 Instrumental variable

Work-in-progress

- Definition of instrumental variable
- IV in DAGs
- Assumptions of IV
- Two-stage least squares estimation