## 03 Observational Study

Sometimes we cannot run a randomized experiment.

For example, suppose we want to study the effect of smoking on lung cancer. We cannot force a group of people to smoke.

In such a case, we may **find** a group of people who smoke and a group of people who do not smoke, and then compare the lung cancer rates of the two groups. This is called an observational study.

## Confounding Variables

We cannot control the confounding variables in an observational study. (Recall that confounding variables are other variables that may affect the outcome.)

For instance, if the group of smokers contains a higher proportion of men than the group of non-smokers, then gender is a confounding variable. We must remove the effect of gender before comparing the lung cancer rates of the two groups.


### No Unobserved Confounding Assumption (NUCA)

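Assume that every confounding variable has been measured and taken into account, and denote the set of these variables by $C$. Formally, within each stratum $C=c$ the treatment is independent of the potential outcomes: $Y^a \perp A \mid C$ for every $a$.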


### g-Formula

Under NUCA, the g-formula for an observational study is:

$$\mathbb E(Y^a) = \sum_{c \in C} \mathbb E(Y | A=a, C=c) \mathbb P(C=c)$$

where $Y^a$ is the outcome we would observe if the treatment were set to $a$, and $C$ is the set of all confounding variables.

**Proof** 

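$$\begin{aligned}\mathbb E(Y^a) &= \sum_{c\in C}\mathbb E(Y^a|C=c)\mathbb P(C=c)
\\ &= \sum_{c\in C}\mathbb E(Y^a|A=a,C=c)\mathbb P(C=c)
\\ &= \sum_{c\in C}\mathbb E(Y|A=a,C=c)\mathbb P(C=c)
\end{aligned}$$

The first equality is the law of total expectation. The second uses NUCA: within each stratum $C=c$, the potential outcome $Y^a$ is independent of the treatment $A$, so additionally conditioning on $A=a$ changes nothing. The third uses consistency: for individuals who actually received $A=a$, the observed outcome $Y$ equals $Y^a$.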

```python
from IPython.display import display
import pandas as pd
import sympy as sp

def g_formula_analyze(data):
    """Print P(C=c) and the standardized means E(Y^a) given by the g-formula."""
    Pc = []
    N = sp.S(data['N'].sum())  # sympy integer, so the ratios below stay exact
    # Marginal distribution of the confounder (assumes C is coded 0, 1, ..., k-1).
    for c in range(data['C'].nunique()):
        Pc.append(data[data['C'] == c]['N'].sum() / N)
        formula = ' + '.join(str(i) for i in data[data['C'] == c]['N'])
        print('P(C=%d) = (%s)/%d = %s'%(c, formula, N, Pc[c]))

    # g-formula: weight each stratum mean E(Y|A=a,C=c) by P(C=c).
    for a in range(2):
        x = data[data['A'] == a]
        formula = ' + '.join('%s*%s'%(e, Pc[i]) for e, i in zip(x['E(Y|A=a,C=c)'], x['C']))
        EYA = sum((e*Pc[i]) for e, i in zip(x['E(Y|A=a,C=c)'], x['C']))
        print('E(Y^%d) = %s = %s'%(a, formula, EYA))

def g_formula_cond_analyze(data):
    """Print P(C=c|A=a) and the observed conditional means E(Y|A=a)."""
    for a in range(2):
        x = data[data['A'] == a]  # rows of treatment group a only
        Pc_cond = []
        N = sp.S(x['N'].sum())    # size of treatment group a
        # Distribution of the confounder within treatment group a.
        for c in range(data['C'].nunique()):
            Pc_cond.append(x[x['C'] == c]['N'].sum() / N)
            formula = ' + '.join(str(i) for i in x['N'])
            print('P(C=%d|A=%d) = %d/(%s) = %s'%(c, a, Pc_cond[c] * N, formula, Pc_cond[c]))
        # Law of total expectation: weight each stratum mean by P(C=c|A=a).
        formula = ' + '.join('%s*%s'%(e, Pc_cond[i]) for e, i in zip(x['E(Y|A=a,C=c)'], x['C']))
        EYA = sum((e*Pc_cond[i]) for e, i in zip(x['E(Y|A=a,C=c)'], x['C']))
        print('E(Y|A=%d) = %s = %s\n'%(a, formula, EYA))

# One row per (A, C) stratum: group size N and the stratum mean of Y.
data = pd.DataFrame({
    'N':  [4000,3000,8000,9000],
    'A':  [1,1,0,0],
    'C': [0,1,0,1],
    'E(Y|A=a,C=c)': [24,36,10,22]
}, index = range(1, 5))
display(data)
g_formula_analyze(data)
print()
g_formula_cond_analyze(data)
```

|   | N | A | C | E(Y\|A=a,C=c) |
|---|---|---|---|---|
| 1 | 4000 | 1 | 0 | 24 |
| 2 | 3000 | 1 | 1 | 36 |
| 3 | 8000 | 0 | 0 | 10 |
| 4 | 9000 | 0 | 1 | 22 |


```text
P(C=0) = (4000 + 8000)/24000 = 1/2
P(C=1) = (3000 + 9000)/24000 = 1/2
E(Y^0) = 10*1/2 + 22*1/2 = 16
E(Y^1) = 24*1/2 + 36*1/2 = 30

P(C=0|A=0) = 8000/(8000 + 9000) = 8/17
P(C=1|A=0) = 9000/(8000 + 9000) = 9/17
E(Y|A=0) = 10*8/17 + 22*9/17 = 278/17

P(C=0|A=1) = 4000/(4000 + 3000) = 4/7
P(C=1|A=1) = 3000/(4000 + 3000) = 3/7
E(Y|A=1) = 24*4/7 + 36*3/7 = 204/7
```
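The gap between the two sets of numbers is easier to see as a difference between treatment groups. The following is a minimal sketch that recomputes both contrasts from the four rows of the table, using Python's `fractions` module for exact arithmetic (a choice made for this sketch, not part of the notebook code). The names `rows`, `standardized_mean` and `observed_mean` are illustrative.

```python
from fractions import Fraction

# Rows of the example table: (N, A, C, E(Y|A=a,C=c)).
rows = [
    (4000, 1, 0, 24),
    (3000, 1, 1, 36),
    (8000, 0, 0, 10),
    (9000, 0, 1, 22),
]

total = sum(n for n, _, _, _ in rows)

def standardized_mean(a):
    """g-formula: weight each stratum mean by the marginal P(C=c)."""
    result = Fraction(0)
    for _, a_row, c, mean in rows:
        if a_row == a:
            n_c = sum(m for m, _, c2, _ in rows if c2 == c)
            result += mean * Fraction(n_c, total)
    return result

def observed_mean(a):
    """Conditional mean: weight each stratum mean by P(C=c|A=a)."""
    n_a = sum(n for n, a_row, _, _ in rows if a_row == a)
    return sum(mean * Fraction(n, n_a)
               for n, a_row, _, mean in rows if a_row == a)

causal_effect = standardized_mean(1) - standardized_mean(0)
naive_difference = observed_mean(1) - observed_mean(0)

print("E(Y^1) - E(Y^0)     =", causal_effect)     # 14
print("E(Y|A=1) - E(Y|A=0) =", naive_difference)  # 1522/119
```

The standardized contrast is 14, while the naive contrast is 1522/119 (about 12.8). The two differ because $C$ is distributed differently in the two treatment groups.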




### Exchangeability

The example above shows that in an observational study we usually have $\mathbb E(Y^a)\neq \mathbb E(Y|A=a)$: there, $\mathbb E(Y^1)=30$ while $\mathbb E(Y|A=1)=\frac{204}{7}$.

**Theorem** Suppose exchangeability holds, in the sense that $\mathbb P(A=1|C=c)=\mathbb P(A= 0|C=c)=\frac12$ for all $c\in C$. Then $\mathbb E(Y^a) = \mathbb E (Y|A=a)$.

**Proof** By the g-formula and the law of total expectation, we have

$$\left\{\begin{aligned}\mathbb E(Y^a) &= \sum_{c\in C}\mathbb E(Y|A=a,C=c)\mathbb P(C=c)
\\ \mathbb E(Y|A=a)  &= \sum_{c\in C}\mathbb E(Y|A=a,C=c)\mathbb P(C=c|A=a)
\end{aligned}\right.$$

By Bayes' rule, $\mathbb P(C=c|A=a) = \dfrac{\mathbb P(A=a|C=c)\mathbb P(C=c)}{\mathbb P(A=a)}= \dfrac{\frac12\mathbb P(C=c)}{\mathbb P(A=a)}$. Moreover,
$$\mathbb P(A=a) = \sum_{c\in C}\mathbb P(A=a|C=c)\mathbb P(C=c) = \sum_{c\in C}\frac12\mathbb P(C=c) = \frac12.$$

Hence $\mathbb P(C=c|A=a) = \mathbb P(C=c)$, so the two sums above coincide and $\mathbb E(Y^a) = \mathbb E (Y|A=a)$.
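As a numerical check of the theorem, the sketch below changes the counts of the earlier table so that each stratum of $C$ is split evenly between $A=1$ and $A=0$. These balanced counts are hypothetical and chosen only for illustration. The stratum means are the ones from the earlier example, and the helper names are illustrative.

```python
from fractions import Fraction

# Hypothetical balanced counts: within each stratum of C, half are treated.
# Columns: (N, A, C, E(Y|A=a,C=c)); the stratum means come from the example.
rows = [
    (4000, 1, 0, 24),
    (3000, 1, 1, 36),
    (4000, 0, 0, 10),
    (3000, 0, 1, 22),
]

total = sum(n for n, _, _, _ in rows)

def treated_share(c):
    """P(A=1|C=c), computed from the counts."""
    n_c = sum(n for n, _, c_row, _ in rows if c_row == c)
    n_treated = sum(n for n, a, c_row, _ in rows if c_row == c and a == 1)
    return Fraction(n_treated, n_c)

def standardized_mean(a):
    """g-formula: weight each stratum mean by the marginal P(C=c)."""
    result = Fraction(0)
    for _, a_row, c, mean in rows:
        if a_row == a:
            n_c = sum(m for m, _, c2, _ in rows if c2 == c)
            result += mean * Fraction(n_c, total)
    return result

def observed_mean(a):
    """Conditional mean: weight each stratum mean by P(C=c|A=a)."""
    n_a = sum(n for n, a_row, _, _ in rows if a_row == a)
    return sum(mean * Fraction(n, n_a)
               for n, a_row, _, mean in rows if a_row == a)

for c in (0, 1):
    print(f"P(A=1|C={c}) = {treated_share(c)}")
for a in (0, 1):
    print(f"E(Y^{a}) = {standardized_mean(a)}, E(Y|A={a}) = {observed_mean(a)}")
```

With these counts both weightings give $\frac{106}{7}$ for $a=0$ and $\frac{204}{7}$ for $a=1$, as the theorem predicts.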


### Conditional Average Causal Effect

We can also define the average causal effect conditional on a subset of the confounding variables. Write $C = \{V,W\}$, where $V$ is the set of variables we want to condition on. The conditional average causal effect is then $\mathbb E(Y^1|V=v) - \mathbb E(Y^0|V=v)$.

Under NUCA, we have an analogous g-formula:
$$\mathbb E(Y^a|V=v) = \sum_{w\in W}\mathbb E(Y|A=a,V=v,W=w)\mathbb P(W=w|V=v).$$
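A minimal sketch of this conditional g-formula for a single binary $V$ and a single binary $W$ follows. All counts and stratum means below are hypothetical, invented purely to show the weighting by $\mathbb P(W=w|V=v)$. The function name `conditional_standardized_mean` is likewise illustrative.

```python
from fractions import Fraction

# Hypothetical strata, invented for illustration only.
# Columns: (N, A, V, W, E(Y|A=a,V=v,W=w)).
rows = [
    (100, 1, 0, 0, 5),  (300, 0, 0, 0, 2),
    (200, 1, 0, 1, 9),  (200, 0, 0, 1, 4),
    (100, 1, 1, 0, 6),  (100, 0, 1, 0, 3),
    (500, 1, 1, 1, 10), (100, 0, 1, 1, 6),
]

def conditional_standardized_mean(a, v):
    """E(Y^a|V=v): weight each stratum mean by P(W=w|V=v)."""
    n_v = sum(n for n, _, v_row, _, _ in rows if v_row == v)
    result = Fraction(0)
    for _, a_row, v_row, w, mean in rows:
        if a_row == a and v_row == v:
            # Everyone with V=v and W=w, regardless of treatment.
            n_vw = sum(m for m, _, v2, w2, _ in rows if v2 == v and w2 == w)
            result += mean * Fraction(n_vw, n_v)
    return result

for v in (0, 1):
    effect = (conditional_standardized_mean(1, v)
              - conditional_standardized_mean(0, v))
    print(f"E(Y^1|V={v}) - E(Y^0|V={v}) = {effect}")
```

For these made-up numbers the conditional effect is 4 when $V=0$ and $\frac{15}{4}$ when $V=1$. The weights are computed within each level of $V$, so the two levels are standardized separately.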