# Introduction to Causal Inference

In [6]:
import pandas as pd
import numpy as np
%matplotlib inline

Why Study Causal Inference? 
- What are some mistakes that could happen if we do a naive analysis?
  - Confounding example
  - The definition of the do-operator

RCTs
    To find causal effects, one could randomize the treatment assignment. 

Chain

Fork

Collider

Descendant of a collider

Backdoor Criterion - 
  - When is seeing = doing?
  - assumes you know the structure of the problem well enough
  
do-calculus

Adjustment Formula
  - derivation
  
Frontdoor Criterion
  - lets us deal with partial knowledge of the graph / unmeasured variables
  
Crazy example

Transportability

Resources

Book Club

## Causality is Ubiquitous

A lot of interesting questions are **causal** questions.

Epidemiology: I heard vaping is killing people. A friend of mine vapes. Should he switch back to smoking?

Economics: What would happen to our economy if we increase taxes?

Software Engineering: What code is *slowing down* our application?

Marketing: What changes on our website would increase visits of Marketing Qualified Leads?

Education: What interventions are *increasing* graduation rates of high school students?

### Hypothetical Scenario

We want to improve students' graduation rates. Let's say we have some treatment T and we're trying to analyze the efficacy of treatment T. We find that on average, the graduation rates of those who get treatment T are 31 percentage points *worse* than those who didn't get treatment T. In other words,
\begin{equation}
\begin{aligned}
    P(G=1 \mid T=1) - P(G=0 \mid T=0) \approx -31\%
\end{aligned}
\end{equation}

### Question:
Should we recommend treatment T to students?

### Answer:

It depends on the data-generating mechanism!

## Correlation Does Not Necessarily Imply Causation
Here's an example where looking at associations naively leads to the wrong conclusion. 

Below is an example where treatment is negatively associated with graduation rate, even though the treatment is actually beneficial, on average!

How could that be? Let's say in a hypothetical world, there are only three variables that matter: treatment $T$, graduation rate $G$. Students classified as "at-risk" $R$ are more likely to be sent to receive tutoring. At-risk students by definition are less likely to graduate. In other words, $R$ is a common cause of $T$ and $G$. Another way to say that is $T$ and $G$ are **confounded** by $R$. So if there's a lot of people at-risk who receive treatment, then it is quite possible that there is a negative association between the treatment and outcome, *even though the treatment actually improves the outcomes for most/all individuals*.

| Risk confounds Tutoring and Graduate |
| - |
![Risk is a common cause of treatment and graduation](./img/risk-tutoring-graduate.png)

### Confounding Example Code

In [59]:
sample_size = 1000000

In [60]:
def risk(sample_size=10000):
    """
      We generate two types of people: At-risk vs. Not at-risk. 
      Risk, for example, could be in relation to dropping out of high school.
    """
    
    return np.random.binomial(n=1, p=0.3, size=sample_size)

In [61]:
def tutoring(riskiness, proba_tutor_given_not_risky=0.1, proba_tutor_given_risky=0.9):
    """
        Non-risky people (riskiness == 0) have a 10% chance of receiving a tutoring.
        However, risky people have a 90% chance of receiving the tutoring.
    """
    
    probability_of_receiving_tutoring = \
        (riskiness == 0) * proba_tutor_given_not_risky + \
        (riskiness == 1) * proba_tutor_given_risky
    
    return np.random.binomial(n=1, p=probability_of_receiving_tutoring)

In [73]:
def graduate(riskiness, tutored):
    """
        Tutoring increases graduation rates by 10 percentage points.
        
        If risky and tutoring, graduation rate = 0.3
        If risky and not tutoring, graduation rate = 0.2
        If not-risky and tutoring, graduation rate = 0.9
        If not-risky and not tutoring, graduation rate = 0.8
        
    """
    
    risky_and_tutored_grad_rate = 0.3
    risky_and_not_tutored_grad_rate = 0.2
    not_risky_and_tutored_grad_rate = 0.9
    not_risky_and_not_tutored_grad_rate = 0.85
    
    graduation_probas = (riskiness == 1) * (tutored == 1) * risky_and_tutored_grad_rate + \
        (riskiness == 1) * (tutored == 0) * risky_and_not_tutored_grad_rate + \
        (riskiness == 0) * (tutored == 1) * not_risky_and_tutored_grad_rate + \
        (riskiness == 0) * (tutored == 0) * not_risky_and_not_tutored_grad_rate
    
    return np.random.binomial(n=1, p=graduation_probas)

In [74]:
riskiness = risk(sample_size)
tutored = drug(riskiness)
graduated = graduate(riskiness, tutored)

In [75]:
df = pd.DataFrame({
    'risk': riskiness,
    'tutored': tutored,
    'graduated': graduated
})

Associated Risk Difference: $P(G=1 \mid T=1) - P(G=0 \mid T=0) \approx -31\%$

In [77]:
round(
    df[df['tutored'] == 1].graduated.mean() - df[df['tutored'] == 0].graduated.mean(),
    2
)

-0.31

However, if we do a randomized control trial (i.e. we randomize the assignment of the treatment), we see that the treatment actually improves the outcomes by 6%, on average!

In [78]:
# We generate a population with roughly the same percentage of at-risk kids
treated_sample = risk(sample_size)
untreated_sample = risk(sample_size)

round(
    graduate(treated_sample, tutored=1).mean() - graduate(untreated_sample, tutored=0).mean(),
    2
)

0.06

Causal Risk Difference: RCT of graduation rates given treated minus graduation rates given untreated = $6\%$

If we were to only look at the associated risk difference (i.e. make a comparison of graduation rates of those who got treated vs. those who didn't get treated), we would erroneously conclude that the treatment, on average, is bad (i.e. decreases graduation rate by 31%), **but in actuality, it boosts graduation rate, on average, by 6 percentage points!**

### The meaning of the *do*-operator

Now we introduce the $do$-notation. $do(T=t)$ means we force the variable $T$ for the population of interest to be set to $t$ in an idealized way. When we $do(T=t)$, $T$ stops listening to other variables (i.e. becomes *exogenous*) and is set to $t$. 

Graphically, we can represent $do(T=t)$ as removing the arrow from $R$ to $T$:

| $do(T=t)$ removes arrow from R to T |
|-|
| ![Graph like above, but R no longer has an arrow to T](./img/rct-tutoring-graduate.png) |

## Randomized Control Trials (A/B Testing)

One way to compute the average causal effect in the example above is to get two random samples of the treatment, assign one of them to the treatment, and then not assign the other. We could then subtract the means of each group. In our example above, the causal risk difference, a measure of causal effect, is computed as follows:

\begin{equation}
\begin{aligned}
    \text{CRD} &= P(G=1 \mid do(T=1)) - P(G=1 \mid do(T=0))
\end{aligned}
\end{equation}

As we can see from above, the results of $P(G=1 \mid do(T=t))$ can be different from $P(G=1 \mid T=t)$, which is just another way of saying that correlation (positive, zero, or negative) does not necessarily imply causation.


### Pros:
- Removes confounding on treatment. *Nice especially when there are a lot of confounding variables*.
- Assuming that all the people in the beginning of the study are still available at the end of the study (i.e. no selection bias), the only thing that's different between the treatment group and the control group is the presence or lack of treatment. Therefore, the difference in outcomes between the two could be attributed to the treatment!
- Good [*internal validity*](https://en.wikipedia.org/wiki/Internal_validity).
- Results are generally easy to understand, easy to interpret.

Fig 1: Treatment Group in RCT (one square = 100k of that animal), assuming no selection bias. Dog='At Risk', Cat='Not At Risk'

|*|*|*|
|-|-|-|
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |

Fig 2: Control Group in RCT (one square = 100k of that animal), assuming no selection bias. Dog='At Risk', Cat='Not At Risk'

|*|*|*|
|-|-|-|
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |

### Cons

- Might be unethical / really expensive (e.g. assessing the effect of smoking on cancer in the 70s).

| * |
| - |
| ![Image of lungs and a lit cigarette](http://a360-wp-uploads.s3.amazonaws.com/wp-content/uploads/rtmagazi/2015/09/dreamstime_xs_58648955-399x350.jpg) |

- Still susceptible to Selection Bias.

| Success Academy Alleged Bias | Corresponding Diagram |
| - | - |
| ![NYTimes: Filing Alleges Bias at Success Academy Network Against Students With Disabilities](./img/success-academy-bias.png) | ![Risk points to Transfer variable, which is adjusted for](./img/transfer-selection-bias.png) | 

Example: If Tutoring is randomized, but we only have access to data of students who *didn't* transfer, then our results could be biased.

In other words, when analyzing the results of the randomized control trial, that gives us $P(G=1 \mid do(T=1), Tr=0) - P(G=1 \mid do(T=0), Tr=0)$, which could be different from our target quantity $P(G=1 \mid do(T=1)) - P(G=1 \mid do(T=0))$.

- Questionable transportability: (e.g. RCT on mice. Would the results of that study be applicable to humans?)

| Lab Mouse |
| - |
| ![Lab mouse](https://speakingofresearch.files.wordpress.com/2018/04/mouse-cv.jpg?w=863) |

To address the shortcomings of RCTs, today we will learn some causal graph theory!

### Chains

| An example of a a chain |
| - |
| ![Middle School -> High School -> College](./img/mid-high-college.png) |

Let $c$ stand for some value of `Graduate College`, $h$ stand for some value of `Graduate High School`, and $m$ stand for some value of `Graduate Middle School`.

The left node and the right node are associated, i.e. the path from $\text{Graduate Middle School} \rightarrow \text{Graduate High School} \rightarrow \text{Graduate College}$ is open. However, if we condition on the middle node $\text{Graduate High School}$ (i.e. if we know someone graduated or did not graduate, then the other nodes are no longer associated. Doing so *blocks* the path $\text{Graduate Middle School} \rightarrow \text{Graduate High School} \rightarrow \text{Graduate College}$.

\begin{equation}
\begin{aligned}
    P(c) &\neq P(c \mid m) &\text{$m$ tells us something about $c$} \\
    P(m) &\neq P(m \mid c) &\text{$c$ tells us something about $m$} \\
    P(c \mid h ) &= P(c \mid h, m) & \text{Once we know h, $m$ tells us nothing about $c$.} \\
    P(m \mid h ) &= P(m \mid h, c) & \text{Once we know h, $c$ tells us nothing about $m$.} \\ 
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    C &\perp \!\!\! \perp  M \mid H & \text{$C$ is independent of $M$ given $H$}
\end{aligned}
\end{equation}

### Forks

| An example of a fork |
| - |
| ![Age causes intelligence and shoe size](./img/age-intel-shoe-size.png) |

Let $i$ stand for some value of `Intelligence`, $a$ for some value of `Age`, and $s$ for some value of `Shoe Size`.

Just like in the chain example, the non-middle-nodes are associated until we know the value of the middle node. The path $\text{Intelligence} \leftarrow \text{Age} \rightarrow \text{Shoe Size}$ is *open* and transmits information. However, if we know the `Age`, then `Intelligence` tells us nothing about `Shoe Size` and vice versa. In other words, conditioning on `Age` blocks the path $\text{Intelligence} \leftarrow \text{Age} \rightarrow \text{Shoe Size}$.

\begin{equation}
\begin{aligned}
    P(i) &\neq P(i \mid s) & \text{$s$ tells us something about $i$.} \\
    P(s) &\neq P(s \mid i) & \text{$i$ tells us something about $s$.} \\
    P(i \mid a) &= P(i \mid a, s) & \text{Once we know $a$, $s$ tells us nothing about $i$.} \\
    P(s \mid a) &= P(s \mid a, i) & \text{Once we know $a$, $i$ tells us nothing about $s$.} \\
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    I &\perp \!\!\! \perp S \mid A & \text{$I$ is independent of $S$ given $A$.}
\end{aligned}
\end{equation}


### Colliders

| An example of a collider |
| - |
| ![SAT causes college scholarship, Musical skill also causes college scholarship](./img/sat-scholarship-musical.png) |

Let $s$ stand for some value of `SAT score`, $m$ for some value of `Musical Skill` , and $c$ for some value of `College Scholarship`.

`Musical Skill` is independent of `SAT score` in the general population. In other words, the path $\text{SAT score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical Skill}$ is blocked if we don't know the value of `College Scholarship`. However, if we know that someone has a college scholarship and they have low `Musical Skill`, then they are more likely to have a high `SAT score`. In other words, conditioning on the value of `College Scholarship` opens the path $\text{SAT score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical Skill}$.

\begin{equation}
\begin{aligned}
    P(s) &= P(s \mid m) & \text{$m$ tells me nothing about $s$, if we don't know about $c$.}\\
    P(m) &= P(m \mid s) & \text{$s$ tells me nothing about $m$, if we don't know about $c$.} \\
    P(s) &\neq P(s \mid m, c) &\text{Once we know about $c$, and $m$, then that tells us something about $s$.}\\
    P(m) &\neq P(m \mid s, c) & \text{Once we know about $c$, and $s$, then that tells us something about $m$.} \\
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    S &\perp\!\!\!\perp M &\text{$S$ and $M$ are marginally independent} \\
    S & {\not\!\perp\!\!\!\perp} M \mid C &\text{$S$ tells us about $M$, and vice versa, once we know $C$.}
\end{aligned}
\end{equation}




### Descendants of Colliders

| Student Debt is a descendant of a collider | 
| - |
| <img alt="Student Debt listens to College Scholarship" src="./img/desc-collider.png" width=500> |

Let $s$ stand for some value of `SAT score`, $m$ for some value of `Musical Skill`, $c$ for some value of `College Scholarship`, and $d$ for some value of `Student Debt`.

Again, like in the collider example above `SAT score` and `Musical Skill` are independent of each other in the general population. However, if we know the value of the descendant of a collider, `Student Debt in the first month of freshman year` then `SAT score` and `Musical Skill` become dependent. In other words, conditioning on a descendant of a collider opens the path $\text{SAT score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical Skill}$.

\begin{equation}
\begin{aligned}
    P(s) &= P(s \mid m) & \text{$m$ tells me nothing about $s$, if we don't know about $c$.}\\
    P(m) &= P(m \mid s) & \text{$s$ tells me nothing about $m$, if we don't know about $c$.} \\
    P(s) &\neq P(s \mid m, d) &\text{Once we know about $d$, then knowing $m$ tells us something about $s$.}\\
    P(m) &\neq P(m \mid s, d) & \text{Once we know about $d$, then knowing $s$ tells us something about $m$.} \\
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    S &\perp\!\!\!\perp M &\text{$S$ and $M$ are marginally independent} \\
    S & {\not\!\perp\!\!\!\perp} M \mid D &\text{$S$ tells us about $M$, and vice versa, once we know $D$.}
\end{aligned}
\end{equation}

### Backdoor Criterion

### $do$-calculus

### Front-door Criterion

### Complex Arbitrary DAG with Unmeasured Variables Example

### Mediation & Direct Effects

### Transportability

# Resources


| Image | Notes | Link |
| - | - | - |
| <img src='https://prodimage.images-bn.com/pimages/9781541698963_p0_v1_s600x595.jpg' alt='Book of Why: The New Science of Cause & Effect cover' width=500> | An introduction meant for the more general public. It still is technical, has some math, but focuses more on stories and anecdotes instead of derivations. | [Book of Why: The New Science of Cause & Effect](https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X) | 
| <img alt='Causal Inference in Statistics: A Primer' src='https://s3.amazonaws.com/vh-woo-images/causal-inference-in-statistics-a-primer-1st-edition.jpg' width=500> | Recommended by Pearl to be read after the Book of Why. Dives more into the math. Has end-of-chapter exercises. *Note: I have the solutions manual! I told Pearl I was self-studying and he graciously gave me a copy!* | [Causal Inference in Statistics: A Primer](https://www.amazon.com/Causal-Inference-Statistics-Judea-Pearl/dp/1119186846) |
| <img src='https://images-na.ssl-images-amazon.com/images/I/511aGcbGLyL._SX343_BO1,204,203,200_.jpg' alt='Causality' width=500> | Goes more in-depth than the Primer book. | [Causality](https://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X) |
| <img alt='Causal Inference: The Mixtape' src='https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1566276665i/47867837._UY630_SR1200,630_.jpg' width=500> | Has a section on DAGs, but focused more on causal inference techniques that are more commonly being used in Economics. *FREELY AVAILABLE*. | [Causal Inference: The Mixtape](https://www.scunning.com/mixtape.html) |
| <img alt='Causal Diagrams: Draw your Assumptions Before Conclusions' src='./img/causal-diagrams-draw-your-assumptions.png' width=500> | *FREE* course on EDX. Makes use of Epidemiological case studies to serve as context as to why drawing your assumptions is important.  | [Causal Diagrams: Draw your Assumptions Before Conclusions](https://online-learning.harvard.edu/course/causal-diagrams-draw-your-assumptions-your-conclusions) |
| not applicable | "Causal Inference: What If" is a book that dives into Hernan & Robins' Potential Outcomes with DAGs approach. *FREE*. | [Causal Inference: What if](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/) |