In [1]:
!pip install dowhy

Collecting dowhy
  Downloading dowhy-0.6-py3-none-any.whl (123 kB)
Collecting pydot>=1.4
  Downloading pydot-1.4.2-py2.py3-none-any.whl (21 kB)
Installing collected packages: pydot, dowhy
Successfully installed dowhy-0.6 pydot-1.4.2


# Science of Science Summer School (S4) 2021
## Day 5: Causal Inference
- Daniel E. Acuna, School of Information, Syracuse University

# Contents

- Motivation
- Potential outcomes framework
- DAG
- Backdoor criterion
- Example identification
- Estimation

# Motivation
- We can use ML/AI/Deep learning to predict, but that does not mean that there is a _causal_ effect (even if we have 100% accuracy!)

Examples from Microsoft's DoWhy (https://github.com/microsoft/dowhy)
- Will it work?
  - Does a proposed change to a system improve people's outcomes?
- Why did it work?
  - What led to a change in a system's outcome?
- What should we do?
  - What changes to a system are likely to improve outcomes for people?
- What are the overall effects?
  - How does the system interact with human behavior?
  - What is the effect of a system's recommendations on people's activity?


# Recommendations

- Most of this material is taken from "Causal Inference: The Mixtape" https://mixtape.scunning.com/

# An example

![](images/intervention_data.png)
from https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/

# Potential outcomes framework

> Rubin, Donald. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701.

# Potential outcomes framework (2)

- Potential outcomes are defined as $Y^1_i$ if unit $i$ received the treatment and $Y_i^0$ if the unit did not
- Both outcomes have the same unit $i$ (it is the same person, scientist, journal, etc)
- Therefore, *we can only observe* one state of the world.
- Each unit has only one state of the world (either treatment occurred $Y^1$ or not $Y^0$)
- Observable or "actual" outcomes $Y_i$ are different from potential outcomes. They are the outcomes that actually occur for unit $i$

# Potential outcomes framework (3)

- Let's define $D_i$ as the assignment of unit $i$, where $D_i = 1$ if the unit is assigned to treatment and $D_i = 0$ if not.
- The following equation relates potential outcomes, observable outcomes, and assignment
$$Y_i = D_i Y_i^1 + (1-D_i) Y_i^0$$
- Notice that when $D_i=1$, we recover $Y_i = Y_i^1$ and when $D_i=0$ we recover $Y_i = Y_i^0$
- Let's define the unit specific treatment effect as
$$\delta_i = Y_i^1 - Y_i^0$$
- Try to understand what that means!
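
A minimal sketch of the two equations above, using three hypothetical units. The values of `y1`, `y0`, and `d` are made up for illustration and are not taken from any dataset.

```python
# Hypothetical potential outcomes for three units (illustrative values only).
y1 = [7, 5, 5]   # Y^1_i: outcome if unit i is treated
y0 = [1, 6, 1]   # Y^0_i: outcome if unit i is not treated
d = [1, 0, 1]    # D_i: assignment (1 = treated, 0 = not treated)

# Switching equation: Y_i = D_i * Y^1_i + (1 - D_i) * Y^0_i
y_obs = [di * a + (1 - di) * b for di, a, b in zip(d, y1, y0)]

# Unit-specific treatment effect: delta_i = Y^1_i - Y^0_i
delta = [a - b for a, b in zip(y1, y0)]

print(y_obs)  # [7, 6, 5]
print(delta)  # [6, -1, 4]
```

Note that `delta` can only be computed here because the toy data lists both potential outcomes; with real data we would only ever see `y_obs` and `d`.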

# Average treatment effects
- For a population, we would like to know the _average treatment effect_
\begin{align}
   ATE & = E[\delta_i] \nonumber      \\
       & = E[Y^1_i - Y^0_i] \nonumber \\
       & = E[Y^1_i] - E[Y^0_i]        
\end{align}
- Notice that this requires knowing both potential outcomes for each unit $i$.
- **This is impossible**. This is known as the "fundamental problem of causal inference".
- We must _infer_ $ATE$

# Average treatment effect for the treatment group (ATT)
- Alternatively, we might be interested in the treatment effect for the treated

\begin{align}
   ATT & = E\big[\delta_i\mid D_i=1\big] \nonumber                 
   \\
       & = E\big[Y^1_i - Y^0_i \mid D_i = 1\big] \nonumber          
   \\
       & = E\big[Y^1_i\mid D_i=1\big] - E\big[Y^0_i\mid D_i=1\big] 
\end{align}

- $E\big[Y^0_i\mid D_i=1\big]$ is a **counterfactual**: _what would have happened to unit $i$ who received surgery had they received chemo instead?_
- ATT will likely differ from ATE because the units that end up treated may react to treatment differently from the population as a whole (e.g., people who received the COVID-19 trial vaccines were in general more scientifically oriented, careful, etc.)

# Average treatment effect for the untreated group (ATU)
- Alternatively, we might be interested in the treatment effect for the untreated

\begin{align}
   ATU & = E\big[\delta_i\mid D_i = 0\big] \nonumber                          
   \\
       & = E\big[Y^1_i - Y^0_i\mid D_i = 0\big] \nonumber                     
   \\
       & =E\big[Y^1_i\mid D_i=0\big]-E\big[Y^0_i\mid D_i=0\big] 
\end{align}

- Depending on the questions, ATE, ATT, or ATU, or all three, might be of interest. None of them, however, can be observed.

# Example data

- Patients with cancer can receive one of two treatments: surgery ($D_i=1$) or chemotherapy ($D_i=0$). The potential outcome is post-treatment life span.
- If we could see both potential outcomes for every patient, we could compute the average treatment effect
$$ATE=E[Y_i^1]-E[Y_i^0] = 0.6$$

![](images/potential_outcomes_example.png)

# Example data: perfect doctor
- Assume a perfect doctor who assigns each patient the treatment with the better potential outcome. Then
  - $ATT=E\big[Y^1_i\mid D_i=1\big]-E\big[Y^0_i\mid D_i=1\big]=4.4$
  - $ATU=E\big[Y^1_i\mid D_i=0\big]-E\big[Y^0_i\mid D_i=0\big]=-3.2$
- However, a naive estimate based on the observed outcomes gives $E[Y^1 \mid D=1] - E[Y^0 \mid D=0]=-0.4$ (it seems like surgery is worse!)

![](images/potential_outcomes_perfect_doctor.png)
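
The perfect-doctor numbers can be reproduced with a short sketch. The ten pairs of potential outcomes below are an assumption: they were chosen because they reproduce the four figures quoted in these slides (0.6, 4.4, -3.2, -0.4), and they may not match the table in the image exactly.

```python
from statistics import mean

# Assumed potential outcomes (years of life) for ten hypothetical patients.
y1 = [7, 5, 5, 7, 4, 10, 1, 5, 3, 9]   # Y^1: life span under surgery
y0 = [1, 6, 1, 8, 2, 1, 10, 6, 7, 8]   # Y^0: life span under chemo

delta = [a - b for a, b in zip(y1, y0)]

# Perfect doctor: surgery (D=1) exactly when surgery gives the longer life span.
d = [1 if a > b else 0 for a, b in zip(y1, y0)]

# Observed outcome: Y^1 for surgery patients, Y^0 for chemo patients.
y = [a if di == 1 else b for di, a, b in zip(d, y1, y0)]

ate = mean(delta)
att = mean([x for x, di in zip(delta, d) if di == 1])
atu = mean([x for x, di in zip(delta, d) if di == 0])

# Simple difference in observed outcomes (the naive estimate).
sdo = (mean([yi for yi, di in zip(y, d) if di == 1])
       - mean([yi for yi, di in zip(y, d) if di == 0]))

print(round(ate, 2), round(att, 2), round(atu, 2), round(sdo, 2))
```

Only `sdo` uses quantities we could actually observe; `ate`, `att`, and `atu` all rely on the full list of potential outcomes.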

# What's going on

- We can decompose our estimation:

\begin{align}
\underbrace{\dfrac{1}{N_T} \sum_{i=1}^n \big(y_i\mid d_i=1\big)-\dfrac{1}{N_C}
   \sum_{i=1}^n \big(y_i\mid d_i=0\big)}_{ \text{Simple Difference in Outcomes}}
&= \underbrace{E[Y^1] - E[Y^0]}_{ \text{Average Treatment Effect}}
\\
&+ \underbrace{E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big]}_{ \text{Selection bias}}
\\
& + \underbrace{(1-\pi)(ATT - ATU)}_{ \text{Heterogeneous treatment effect bias}}
\end{align}
where $\pi$ is the share of patients who receive surgery
- Selection bias: the difference in $Y^0$ between the two groups, that is, how the surgery and chemo groups would differ if both had received chemo.
- Heterogeneous treatment effect bias: the difference in treatment effects between the surgery and chemo groups ($ATT - ATU$), weighted by the share of patients who went to chemo ($1-\pi$).
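
As a numeric check of the decomposition, the sketch below computes each term for the perfect-doctor scenario. The potential outcomes are assumed values, chosen to reproduce the figures quoted in these slides; they are not guaranteed to match the table in the image.

```python
from statistics import mean

# Assumed potential outcomes for ten hypothetical patients.
y1 = [7, 5, 5, 7, 4, 10, 1, 5, 3, 9]   # surgery
y0 = [1, 6, 1, 8, 2, 1, 10, 6, 7, 8]   # chemo
d = [1 if a > b else 0 for a, b in zip(y1, y0)]  # perfect doctor

def cond_mean(values, group):
    """Mean of `values` over units whose assignment equals `group`."""
    return mean([v for v, di in zip(values, d) if di == group])

delta = [a - b for a, b in zip(y1, y0)]
pi = mean(d)                                   # share receiving surgery

ate = mean(delta)
att = cond_mean(delta, 1)
atu = cond_mean(delta, 0)

# Left-hand side: simple difference in observed outcomes.
sdo = cond_mean(y1, 1) - cond_mean(y0, 0)

# Right-hand side terms.
selection_bias = cond_mean(y0, 1) - cond_mean(y0, 0)
heterogeneity_bias = (1 - pi) * (att - atu)

print(round(sdo, 2))
print(round(ate, 2), round(selection_bias, 2), round(heterogeneity_bias, 2))
```

With these assumed values the three right-hand terms are 0.6, -4.8, and 3.8, which sum to the naive estimate of -0.4.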

# How to eliminate these biases?

\begin{align}
\underbrace{\dfrac{1}{N_T} \sum_{i=1}^n \big(y_i\mid d_i=1\big)-\dfrac{1}{N_C}
   \sum_{i=1}^n \big(y_i\mid d_i=0\big)}_{ \text{Simple Difference in Outcomes}}
&= \underbrace{E[Y^1] - E[Y^0]}_{ \text{Average Treatment Effect}}
\\
&+ \underbrace{E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big]}_{ \text{Selection bias}}
\\
& + \underbrace{(1-\pi)(ATT - ATU)}_{ \text{Heterogeneous treatment effect bias}}
\end{align}

- Heterogeneous treatment effect bias: assume that treatment effects are the same for all units, $\delta_i = \delta$ for all $i$
- Selection bias: **the goal of causal inference is to try to eliminate selection bias**

# The "simplest" approach

- Assume that potential outcomes are independent of assignment
$$(Y^1 , Y^0) \perp \!\!\! \perp D $$
- This means that we assign patients to the surgery condition for reasons completely unrelated to their potential gains from surgery.
- In our toy example, this is not true: surgery if $Y^1 > Y^0$, otherwise chemo.
- Almost all observations of human behavior violate this assumption: humans choose a treatment _because_ they expect some gain.

# The "simplest" approach (2)
- What if the doctor chooses surgery independently of expected gains? Examples: alphabetical order, first half gets surgery, second half doesn't.
- In general, if there is any source of external randomness for assignment, we can claim independence.
- This would mean:
\begin{align}
   E\big[Y^1\mid D=1\big] - E\big[Y^1\mid D=0\big]=0 \\
   E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big]=0 
\end{align}
- No selection bias.

# The "simplest" approach (3)

- What about heterogeneous treatment effect bias?
\begin{gather}
   ATT = E\big[Y^1\mid D=1\big] - E\big[Y^0\mid D=1\big]
   \\
   ATU = E\big[Y^1\mid D=0\big] - E\big[Y^0\mid D=0\big]
\end{gather}

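therefore, since independence implies $E\big[Y^1\mid D=1\big]=E\big[Y^1\mid D=0\big]$ and $E\big[Y^0\mid D=1\big]=E\big[Y^0\mid D=0\big]$,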

\begin{align}
   ATT-ATU & =\mathbf{E\big[Y^1\mid D=1\big]}-E\big[Y^0\mid D=1\big]    \\
           & - \mathbf{E\big[Y^1 \mid D=0\big]}+ E\big[Y^0\mid D=0\big] \\
           & = 0                                                        
\end{align}
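
A sketch of why independent assignment removes both biases on average. Instead of sampling random assignments, it enumerates every way of sending half of ten hypothetical patients to surgery, which is a simplification chosen here to keep the example deterministic. The potential outcomes are assumed values, not data from the slides' image.

```python
from itertools import combinations
from statistics import mean

# Assumed potential outcomes for ten hypothetical patients.
y1 = [7, 5, 5, 7, 4, 10, 1, 5, 3, 9]   # surgery
y0 = [1, 6, 1, 8, 2, 1, 10, 6, 7, 8]   # chemo
n = len(y1)

ate = mean([a - b for a, b in zip(y1, y0)])

# Every assignment that ignores potential outcomes: choose any 5 of the
# 10 patients for surgery, with all choices equally likely.
sdos = []
for treated in combinations(range(n), n // 2):
    treated = set(treated)
    treated_mean = mean([y1[i] for i in treated])
    control_mean = mean([y0[i] for i in range(n) if i not in treated])
    sdos.append(treated_mean - control_mean)

avg_sdo = mean(sdos)
print(len(sdos), round(avg_sdo, 2), round(ate, 2))
```

Any single assignment can still give a simple difference in outcomes far from the ATE, because ten patients is a tiny sample; it is the average over assignments that matches.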

# Other assumptions
- We usually need to assume that:
   - All units receive the same "treatment" dosage. This is easy to violate: some doctors are better than others.
   - No "externalities" from one unit to others (independence among units). This is easy to violate in networks.
   - No "general equilibrium" effects: this means the results generalize well from the experiment to the real world.

# Directed Acyclic Graphs (DAG)

# Why DAGs

- Before starting causal inference, we must make sure that some conditions are met.
- These conditions are about blocking unwanted "flows" of information through unmeasured variables, or through measured variables, that would change the statistical structure of the problem.
- To explain these ideas, we need to understand Directed Acyclic Graphs.

# DAG

- The kinds of DAGs we need are _causal_ graphs, not probabilistic graphical models
- They represent causal relationships between variables. For example
![](images/dag1.png)
- We have three random variables D, Y, and X. 
- D causes Y and X causes D and Y

# DAG

- We say that there is a path between the assignment $D$ and the outcome $Y$ like so $D \rightarrow Y$
- But we also say that there is a **backdoor path** $D \leftarrow X \rightarrow Y$
![](images/dag1.png)
- We call $X$ a confounder because it jointly determines $D$ and $Y$
- Leaving a "backdoor" open creates bias.
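
One way to make the idea of a backdoor path concrete is to list every path between $D$ and $Y$ and flag those that begin with an arrow pointing into $D$. The sketch below does this for the three-variable graph described above; the helper names `undirected_paths` and `is_backdoor` are hypothetical, and the sketch only finds backdoor paths, it does not check whether they are open or closed.

```python
# Edges of the causal graph as (cause, effect) pairs.
edges = {("D", "Y"), ("X", "D"), ("X", "Y")}

def undirected_paths(edges, start, end):
    """All simple paths from start to end, ignoring arrow direction."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == end:
            paths.append(path)
            continue
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def is_backdoor(edges, path):
    """A backdoor path starts with an arrow pointing into the treatment."""
    return (path[1], path[0]) in edges

paths = undirected_paths(edges, "D", "Y")
backdoor = [p for p in paths if is_backdoor(edges, p)]

print(sorted(paths))  # [['D', 'X', 'Y'], ['D', 'Y']]
print(backdoor)       # [['D', 'X', 'Y']]
```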

# DAG

- Sometimes we have variables that we cannot observe
- Here, we do not observe $U$, so we cannot condition on it and the backdoor path through $U$ stays open.
![](images/dag2.png)

# DAG
- Sometimes, we have variables that are _colliders_. Here, $X$ is a collider: two arrows point into it.
- We have a path $D \rightarrow Y$ and a backdoor path $D \rightarrow X \leftarrow Y$. **However, the backdoor path is _closed_ because of the collider** (causal effects do not flow through $X$).
![](images/dag3.png)

# DAG conditioning

- Conditioning in DAGs is different from conditioning in traditional probability distributions
- To condition on a variable in a causal graph, we have to use methods such as subclassification, matching, regression, etc.
- We have to be careful, however:
   - If a backdoor is _open_, we have to close it by **conditioning** on a variable along that path (for example, the confounder)
   - If a backdoor is _closed_ by a collider, we leave it alone. If we condition on the collider, it will become an **open backdoor**
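
The second rule can be illustrated with a small simulation. The setup is an assumption made for this sketch: $D$ is assigned at random, $D$ has no effect at all on $Y$, and $X$ is a collider caused by both. "Conditioning" is approximated by keeping only units with a large value of $X$; the coefficients, noise levels, and the threshold of 1.0 are arbitrary choices.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

rows = []
for _ in range(20000):
    d = 1 if rng.random() < 0.5 else 0   # treatment, assigned at random
    y = rng.gauss(0, 1)                  # outcome, NOT affected by d
    x = d + y + rng.gauss(0, 0.5)        # collider: D -> X <- Y
    rows.append((d, y, x))

def diff_in_means(sample):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    treated = [y for d, y, _ in sample if d == 1]
    control = [y for d, y, _ in sample if d == 0]
    return mean(treated) - mean(control)

# Leaving the collider alone: the difference is close to the true effect (0).
unconditioned = diff_in_means(rows)

# Conditioning on the collider: a spurious negative "effect" appears.
conditioned = diff_in_means([r for r in rows if r[2] > 1.0])

print(round(unconditioned, 2), round(conditioned, 2))
```

Among units with large $X$, untreated units need a high $Y$ to pass the threshold while treated units do not, which is what creates the spurious association.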

# DAG example

![](images/intervention_data_dag.png)
from https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/

# DAG interventions
- Judea Pearl's do-calculus is equivalent to the potential outcomes framework http://bayes.cs.ucla.edu/home.htm
- It tends to be more popular in AI/ML settings, but is used less often in applied work
- Potential outcomes is more popular in economics, public policy, and statistics, and is used more often in applied work

![](images/intervention_data_results.png)

# Closing backdoors with subclassification

- If we assume the Conditional Independence Assumption (CIA)

$$(Y^1,Y^0) \perp \!\!\! \perp D\mid X$$
- and common support
$$0<p(D=1 \mid X) <1$$

then, within each stratum of $X$, assignment is independent of potential outcomes, and we can make the observations match the Average Treatment Effect (ATE)

\begin{align}
   E\big[Y^1\mid D=1,X\big]=E\big[Y^1\mid D=0,X\big]
   \\
   E\big[Y^0\mid D=1,X\big]=E\big[Y^0\mid D=0,X\big]
\end{align}
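
A minimal sketch of subclassification on simulated data, under the two assumptions above. Everything in the data-generating process is hypothetical: a binary confounder $X$, treatment probabilities of 0.8 and 0.2, a true effect of 2, and a confounder effect of 3. The estimator takes the difference in means inside each stratum of $X$ and weights it by the share of units in that stratum.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
TRUE_EFFECT = 2.0       # assumed effect of D on Y

rows = []
for _ in range(20000):
    x = 1 if rng.random() < 0.5 else 0            # confounder
    p_treat = 0.8 if x == 1 else 0.2              # X -> D, with 0 < p < 1
    d = 1 if rng.random() < p_treat else 0
    y = TRUE_EFFECT * d + 3.0 * x + rng.gauss(0, 1)   # D -> Y and X -> Y
    rows.append((x, d, y))

def diff_in_means(sample):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    treated = [y for _, d, y in sample if d == 1]
    control = [y for _, d, y in sample if d == 0]
    return mean(treated) - mean(control)

# Naive estimate: the backdoor D <- X -> Y is open, so this is biased upward.
naive = diff_in_means(rows)

# Subclassification: stratum-level differences weighted by P(X = x).
subclass_ate = 0.0
for x_value in (0, 1):
    stratum = [r for r in rows if r[0] == x_value]
    weight = len(stratum) / len(rows)
    subclass_ate += weight * diff_in_means(stratum)

print(round(naive, 2), round(subclass_ate, 2))
```

Weighting by the overall share of each stratum targets the ATE; weighting by the distribution of $X$ among treated units would target the ATT instead.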




# Closing backdoors with subclassification


> Subclassification is a method of satisfying the backdoor criterion by weighting differences in means by strata-specific weights. These strata-specific weights will, in turn, adjust the differences in means so that their distribution by strata is the same as that of the counterfactual’s strata. This method implicitly achieves distributional balance between the treatment and control in terms of that known, observable confounder.

# Conclusions
- There are many other ways of identifying causal effects, some of which meet the backdoor criterion by conditioning and some of which rely on other assumptions:
   - Exact matching, approximate matching (propensity score, CEM)
   - Regression discontinuity
   - Instrumental variables
   - Difference-in-differences
   - Synthetic controls
- More in this book "Causal Inference: The Mixtape" https://mixtape.scunning.com/
- Packages for computing causal effects: https://github.com/microsoft/dowhy
