# Introduction to Causal Inference: Part 1

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

## Causality is Ubiquitous

A lot of interesting questions are **causal** questions.

Epidemiology: I heard vaping is killing people. A friend of mine vapes. Should he switch back to smoking?

Economics: What would happen to our economy if we increase taxes?

Software Engineering: What code is *slowing down* our application?

Marketing: What changes on our website would increase visits of Marketing Qualified Leads?

Education: What interventions are *increasing* graduation rates of high school students?

It is clear that causal questions are everywhere around us. Knowing tools to help answer these questions is quite valuable!

## Randomized Control Trials (A/B Testing)

One way to compute treatment effects is to assign treatment at random. Randomizing the treatment enables researchers to ensure that the treated and the untreated are essentially the same -- the only difference between the two groups is the treatment itself.

### Pros:
- Removes confounding on treatment. *Nice especially when there are a lot of confounding variables*.
- Assuming that all the people in the beginning of the study are still available at the end of the study (i.e. no selection bias), the only thing that's different between the treatment group and the control group is the presence or lack of treatment. Therefore, the difference in outcomes between the two could be attributed to the treatment!
- Good [*internal validity*](https://en.wikipedia.org/wiki/Internal_validity).
- Results are generally easy to understand, easy to interpret.


### Cons

- Might be unethical / really expensive (e.g. assessing the effect of smoking on cancer in the 70s).

Randomizing a treatment, especially when it might lead to adverse outcomes such as death, is problematic.


| * |
| - |
| ![Image of lungs and a lit cigarette](http://a360-wp-uploads.s3.amazonaws.com/wp-content/uploads/rtmagazi/2015/09/dreamstime_xs_58648955-399x350.jpg) |

- Still susceptible to Selection Bias.

| Success Academy Alleged Bias | Corresponding Diagram |
| - | - |
| ![NYTimes: Filing Alleges Bias at Success Academy Network Against Students With Disabilities](./img/success-academy-bias.png) | ![Risk points to Transfer variable, which is adjusted for](./img/transfer-selection-bias.png) | 

Example: Let's say we care about the effect of tutoring on graduation rates. Let's say that we randomized tutoring. However, before students could graduate from the school, the school decides to discriminate against risky students and push them out of the school. As a result, risky students are more likely to transfer out of the school. Even though we have randomized treatment, we could get a biased estimate of the causal effect of tutoring on graduation rates for the general school population, as less risky students will be more represented in our sample. Thus, RCTs do not necessarily protect against *selection bias*, defined as selection after treatment (in Epidemiology).
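To make this concrete, here is a minimal simulation sketch of the scenario above. Every number in it is an assumption chosen for illustration (30% of students at risk, tutoring worth 10 percentage points for at-risk students and 5 for everyone else, and hypothetical transfer rates of 80% vs. 5%), and the variable names are ours. Tutoring is randomized, yet the estimate computed on the students who remain drifts away from the effect for the whole school population.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 1_000_000

# Hypothetical population: 30% at-risk, tutoring assigned by a fair coin flip.
at_risk = rng.binomial(1, 0.3, size=n_students)
got_tutoring = rng.binomial(1, 0.5, size=n_students)

# Assumed graduation rates: tutoring helps at-risk students more.
base_rate = np.where(at_risk == 1, 0.2, 0.85)
tutoring_boost = np.where(at_risk == 1, 0.10, 0.05)
did_graduate = rng.binomial(1, base_rate + tutoring_boost * got_tutoring)

# Assumed selection after treatment: at-risk students are pushed out far more often.
p_transfer = np.where(at_risk == 1, 0.8, 0.05)
stayed = rng.binomial(1, p_transfer) == 0


def risk_difference(outcome, treated):
    """Mean outcome among the treated minus mean outcome among the untreated."""
    return outcome[treated == 1].mean() - outcome[treated == 0].mean()


# What we would see if nobody had been pushed out (not observable in practice).
rd_everyone = risk_difference(did_graduate, got_tutoring)
# What we actually see: only students who stayed until graduation time.
rd_stayers = risk_difference(did_graduate[stayed], got_tutoring[stayed])

print("Risk difference, everyone:", round(rd_everyone, 3))
print("Risk difference, stayers: ", round(rd_stayers, 3))
```

With these made-up numbers, the estimate for everyone should land near 0.065, while the estimate among the students who stayed should land near 0.054. The students who stay are mostly not-at-risk students, for whom tutoring helps less.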

- Questionable transportability: (e.g. RCT on mice. Would the results of that study be applicable to humans?)

| Lab Mouse |
| - |
| ![Lab mouse](https://speakingofresearch.files.wordpress.com/2018/04/mouse-cv.jpg?w=863) |

To address the shortcomings of RCTs, today we will learn some causal graph theory! In this notebook, we will specifically focus on the issue of *confounding* when RCTs are too expensive or unethical. We will address other issues in following notebooks.

### Hypothetical Scenario

We want to improve students' graduation rates. Let's say we have some treatment T and we're trying to analyze the efficacy of treatment T. We find that on average, the graduation rates of those who get treatment T are 40 percentage points *worse* than those who didn't get treatment T. In other words,
\begin{equation}
\begin{aligned}
    P(G=1 \mid T=1) - P(G=1 \mid T=0) \approx -40\%
\end{aligned}
\end{equation}

### Question:
Should we recommend treatment T to students?

### Answer:

We don't have enough information! It depends on the data-generating mechanism, which is not available in the data alone. We need information about the causal structure to know the causal effect of the treatment on the outcome. 

Below, we show a scenario where the treatment actually improves the outcome even though it's negatively associated with it.

## Beneficial Treatment, but Negatively Associated with Outcome
Here's an example where looking at associations naively leads to the wrong conclusion. 

Below is an example where treatment is negatively associated with graduation rate, even though the treatment is actually beneficial, on average!

How could that be? Let's say in a hypothetical world, there are only three variables that matter: treatment $T$, graduation $G$, and risk status $R$. Students classified as "at-risk" ($R=1$) are more likely to be sent to receive tutoring by educators. At-risk students by definition are less likely to graduate. In other words, $R$ is a common cause of $T$ and $G$. Another way to say that is $T$ and $G$ are **confounded** by $R$. So if there are a lot of at-risk people who receive treatment, then it is quite possible that there is a negative association between the treatment and outcome, *even though the treatment actually improves the outcomes for most/all individuals*.

### Treatment and Control Group

Another way to show the above information is through Emojis. Dog emoji represents 10k "not-at-risk" students. Cat emoji represents 10k "at-risk" students.

Fig 1: Treatment Group in Observational Setting. 

|*|*|*|
|-|-|-|
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |


Fig 2: Untreated Group in Observational Setting. 

|*|*|*|
|-|-|-|
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | 

The treatment group has many more at-risk students than the untreated group, so the former group is less likely to do well **even though the treatment improves outcomes, on average!**


### Confounding Example Code

Below we show an example where the Associated Risk Difference, $P(G=1 \mid T=1) - P(G=1 \mid T=0)$, is pretty bad ($\approx -40\%$), but the treatment itself is *actually beneficial.*

In [2]:
sample_size = 100000000

In [3]:
def risk(sample_size=10000):
    """
      We generate two types of people: At-risk vs. Not at-risk. 
      Risk, for example, could be in relation to dropping out of high school.
    """
    
    return np.random.binomial(n=1, p=0.3, size=sample_size)

In [4]:
def tutoring(riskiness, proba_tutor_given_not_risky=0.1, proba_tutor_given_risky=0.9):
    """
        Non-risky people (riskiness == 0) have a 10% chance of receiving a tutoring.
        However, risky people have a 90% chance of receiving the tutoring.
    """
    
    probability_of_receiving_tutoring = \
        (riskiness == 0) * proba_tutor_given_not_risky + \
        (riskiness == 1) * proba_tutor_given_risky
    
    return np.random.binomial(n=1, p=probability_of_receiving_tutoring)

In [5]:
def graduate(riskiness, tutored):
    """
        Tutoring increases graduation rates by 10 percentage points for
        at-risk students and by 5 percentage points for not-at-risk students.
        
        If risky and tutoring, graduation rate = 0.3
        If risky and not tutoring, graduation rate = 0.2
        If not-risky and tutoring, graduation rate = 0.9
        If not-risky and not tutoring, graduation rate = 0.85
        
    """
    
    risky_and_tutored_grad_rate = 0.3
    risky_and_not_tutored_grad_rate = 0.2
    not_risky_and_tutored_grad_rate = 0.9
    not_risky_and_not_tutored_grad_rate = 0.85
    
    graduation_probas = (riskiness == 1) * (tutored == 1) * risky_and_tutored_grad_rate + \
        (riskiness == 1) * (tutored == 0) * risky_and_not_tutored_grad_rate + \
        (riskiness == 0) * (tutored == 1) * not_risky_and_tutored_grad_rate + \
        (riskiness == 0) * (tutored == 0) * not_risky_and_not_tutored_grad_rate
    
    return np.random.binomial(n=1, p=graduation_probas)

In [6]:
riskiness = risk(sample_size)
tutored = tutoring(riskiness)
graduated = graduate(riskiness, tutored)

In [7]:
df = pd.DataFrame({
    'risk': riskiness,
    'tutored': tutored,
    'graduated': graduated
})

Associated Risk Difference: $P(G=1 \mid T=1) - P(G=1 \mid T=0) \approx -40\%$.

In [8]:
round(
    df[df['tutored'] == 1].graduated.mean() - df[df['tutored'] == 0].graduated.mean(),
    2
)

-0.4
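As a sanity check on the simulated number, we can compute the same quantity exactly from the probabilities hard-coded in `risk`, `tutoring`, and `graduate` above. This is only a small sketch; the dictionary layout and the helper name are ours.

```python
# Probabilities copied from the data-generating functions above.
p_at_risk = 0.3
p_tutored_given_risk = {1: 0.9, 0: 0.1}
p_grad_given_risk_tutored = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.85}


def p_grad_given_tutored(t):
    """P(G=1 | T=t), averaging over the risk mix *within* tutoring group t."""
    joint = {}
    for r in (0, 1):
        p_r = p_at_risk if r == 1 else 1 - p_at_risk
        p_t = p_tutored_given_risk[r] if t == 1 else 1 - p_tutored_given_risk[r]
        joint[r] = p_r * p_t  # P(R=r, T=t)
    p_t_total = sum(joint.values())  # P(T=t)
    return sum(joint[r] / p_t_total * p_grad_given_risk_tutored[(r, t)] for r in (0, 1))


exact_associated_rd = p_grad_given_tutored(1) - p_grad_given_tutored(0)
print(round(exact_associated_rd, 3))  # -0.397
```

About 79% of the tutored group is at-risk, compared with under 5% of the untutored group, which is what drives the negative association.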

However, let's consider a hypothetical scenario: Imagine the case of being able to randomly sample (with a really large sample size) from the population and that we had full control in assigning who gets treatment and who doesn't. Let's randomly sample a treatment group and a control group:

Fig 1: Treatment Group in RCT. Dog emoji represents 10k "not-at-risk" students. Cat emoji represents 10k "at-risk" students.

|*|*|*|
|-|-|-|
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |

Fig 2: Control Group in RCT. Dog emoji represents 10k "not-at-risk" students. Cat emoji represents 10k "at-risk" students.

|*|*|*|
|-|-|-|
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |
| ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) | ![Apple-dog](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/dog-face_1f436.png) |

Randomly sampling from the population for both treatment and control groups gives us groups that are (in large samples) essentially the same. In other words, the two groups are *comparable*. Now, imagine that we are somehow able to set the value of a variable of treatment $T$ to 1 for the treated group, and set the value of $T$ to 0 for the untreated group.

The treatment and control groups are the same. **The only thing that is different between the two groups is the treatment! Therefore, whatever difference we find can be attributed to having or not having the treatment**, and it is a causal effect of the treatment for the population.

#### Idealized RCT code

In [9]:
# We generate a population with roughly the same percentage of at-risk kids for treated and untreated.
treated_sample = risk(sample_size)
untreated_sample = risk(sample_size)

# We set the value of Treatment to 1 for the treated group.
# We set the value of Treatment to 0 for the untreated (control) group.
# Then we take the means and subtract them from each other.

idealized_rct_estimate = round(
    graduate(treated_sample, tutored=1).mean() - graduate(untreated_sample, tutored=0).mean(),
    2
)

print_rct_estimate = 'Our RCT causal risk difference estimate of the causal effect of tutoring on graduation is ' + str(idealized_rct_estimate) + '.'
print(print_rct_estimate)

Our RCT causal risk difference estimate of the causal effect of tutoring on graduation is 0.06.


Causal Risk Difference: in the RCT, the graduation rate of the treated minus the graduation rate of the untreated $\approx 6$ percentage points.

If we were to only look at the associated risk difference (i.e. make a comparison of graduation rates of those who got treated vs. those who didn't get treated), we would erroneously conclude that the treatment, on average, is bad (i.e. decreases graduation rate by 40 percentage points), **but in actuality, it boosts graduation rate, on average, by 6 percentage points!**
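Where does the 6 come from? In the idealized RCT, both arms have the same risk mix (30% at-risk, 70% not-at-risk), so the causal risk difference is just the risk-weighted average of the per-group effects hard-coded in `graduate`:

\begin{equation}
\begin{aligned}
    \text{CRD} &= 0.3 \times (0.3 - 0.2) + 0.7 \times (0.9 - 0.85) \\
    &= 0.03 + 0.035 = 0.065
\end{aligned}
\end{equation}

The simulated estimate sits very close to $0.065$, so after rounding to two decimals it can show up as either $0.06$ or $0.07$.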

However, we normally don't have access to the functions of the data-generating process. What we instead have is a probability distribution collected from the data.

### The *do*-operator

As we showed above, the quantity $P(G=1 \mid T=1)$, the probability that people graduate given that we *observe* that people get treatment, is not the same as the probability that people graduate given that we *set* the treatment value. The former is a measure of association, which, as we showed before, could be a mix of causal and spurious relationships. The latter, on the other hand, is purely causal; it captures what happens to $G$ when $T$ is *set* to 1. The latter can be represented with the $do$-operator: $P(G=1 \mid do(T=1))$. We can then represent the causal risk difference in our idealized RCT example as:

\begin{equation}
\begin{aligned}
    \text{CRD} &= P(G=1 \mid do(T=1)) - P(G=1 \mid do(T=0))
\end{aligned}
\end{equation}

The causal risk difference, in plain English, asks: on average, how much do graduation rates change when we switch from no treatment to treatment? Another way to say this is: "What is the causal effect of treatment on the outcome, on average?"

The main question we'll look into today is: **"Can we emulate the idealized RCT above, using observational data?"** As we'll see later on, the answer is *sometimes.* We'll show some examples where it is possible, and show some cases where it is not. To understand those examples, we'll first need to learn some causal graphical theory.

## Directed Acyclic Graphs

Variables are represented as nodes. If a variable $A$ directly affects another variable $B$, then there should be an arrow from $A$ to $B$: $A \rightarrow B$, hence *directed*. The graph is *acyclic*, meaning that no sequence of arrows leads from a variable back to itself, so a variable can't cause itself. For scenarios where there seems to be feedback, we could represent them as varying over time. For example, tutoring at time 1 affects grades at time 2. They both affect tutoring at time 3, which then affects grades at time 4, etc.
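Here is a small sketch of that idea in code. We store a graph as a dictionary from each node to the nodes it points at (our own choice of representation, not something a causal library requires) and check acyclicity with Kahn's algorithm. The node names such as `Tutoring_1` are made up to mirror the time-indexed example.

```python
from collections import deque


def is_acyclic(edges):
    """Kahn's algorithm: keep removing nodes that have no incoming arrows.

    `edges` maps each node to the list of nodes it points at.
    The graph is acyclic exactly when every node can be removed.
    """
    nodes = set(edges) | {child for children in edges.values() for child in children}
    indegree = {node: 0 for node in nodes}
    for children in edges.values():
        for child in children:
            indegree[child] += 1

    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in edges.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)


# Feedback drawn naively: tutoring affects grades and grades affect tutoring.
feedback_loop = {"Tutoring": ["Grades"], "Grades": ["Tutoring"]}

# The same story unrolled over time, as described above.
unrolled = {
    "Tutoring_1": ["Grades_2", "Tutoring_3"],
    "Grades_2": ["Tutoring_3"],
    "Tutoring_3": ["Grades_4"],
}

print(is_acyclic(feedback_loop))  # False
print(is_acyclic(unrolled))       # True
```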

## Risk, Tutoring, Graduation Example

Let's go back to the "Risk, Tutoring, Graduation Rate" example. When treatment is not-randomized, we have the following DAG:


| Risk confounds Tutoring and Graduate |
| - |
| ![Risk is a common cause of treatment and graduation](./img/risk-tutoring-graduate.png) |

While there may be an effect of $\text{Tutoring}$ on $\text{Graduate}$, there is a third variable that relates the two: $\text{Risk}$ is a common cause of $\text{Tutoring}$ and $\text{Graduate}$, which is a source of spurious (non-causal) relationship. In other words, information flows from $T$ to $G$ through the non-causal pathway $T \leftarrow R \rightarrow G$.

In the idealized RCT case, we randomly assigned treatment without regard to a student's $\text{Risk}$ level. 

| $do(T=t)$ removes arrow from R to T |
|-|
| <img alt="Graph like above, but R no longer has an arrow to T" src="./img/rct-tutoring-graduate.png" width=500> |

The $\text{Tutoring}$ variable no longer listens to any variable, including $\text{Risk}$. As will become clearer later on, there are no other paths that transmit information between $\text{Tutoring}$ and $\text{Graduate}$. The only path that exists is $T \rightarrow G$, which is the causal path. Therefore, if we find an associational relationship between $T$ and $G$ in the idealized RCT, then it must be *causal*.

**The million-dollar question is this: How do we stop non-causal information from flowing between the treatment and the outcome? In our example, how do we stop non-causal paths, such as $T \leftarrow R \rightarrow G$, from transmitting information so that the only association between treatment and outcome is causal?**

To answer this question, we'll need to learn about path-blocking!

### Path Blocking

### Chains

| An example of a chain |
| - |
| ![Middle School -> High School -> College](./img/mid-high-college.png) |

Let $c$ stand for some value of $\text{Graduate College}$, $h$ stand for some value of $\text{Graduate High School}$, and $m$ stand for some value of $\text{Graduate Middle School}$.

The left node and the right node are associated, i.e. the path $\text{Graduate Middle School} \rightarrow \text{Graduate High School} \rightarrow \text{Graduate College}$ is open. However, if we condition on the middle node $\text{Graduate High School}$ (i.e. if we know whether someone graduated high school), then the other nodes are no longer associated. Doing so *blocks* the path $\text{Graduate Middle School} \rightarrow \text{Graduate High School} \rightarrow \text{Graduate College}$.

\begin{equation}
\begin{aligned}
    P(c) &\neq P(c \mid m) &\text{$m$ tells us something about $c$} \\
    P(m) &\neq P(m \mid c) &\text{$c$ tells us something about $m$} \\
    P(c \mid h ) &= P(c \mid h, m) & \text{Once we know $h$, $m$ tells us nothing about $c$.} \\
    P(m \mid h ) &= P(m \mid h, c) & \text{Once we know $h$, $c$ tells us nothing about $m$.} \\ 
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    C &\perp \!\!\! \perp  M \mid H & \text{$C$ is independent of $M$ given $H$}
\end{aligned}
\end{equation}
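We can check these statements on simulated data. The sketch below assumes a made-up chain with binary variables; all of the probabilities are hypothetical, and they are deliberately non-zero everywhere so that every subgroup we compare is non-empty.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people = 500_000

# Hypothetical chain: Middle School -> High School -> College.
mid = rng.binomial(1, 0.7, size=n_people)
high = rng.binomial(1, np.where(mid == 1, 0.8, 0.1))       # depends only on mid
college = rng.binomial(1, np.where(high == 1, 0.6, 0.05))  # depends only on high

# Without conditioning, m tells us something about c.
marginal_gap = college[mid == 1].mean() - college[mid == 0].mean()

# Among high-school graduates, m tells us (almost) nothing more about c.
hs_grad = high == 1
conditional_gap = (
    college[hs_grad & (mid == 1)].mean() - college[hs_grad & (mid == 0)].mean()
)

print("P(c | m=1) - P(c | m=0):          ", round(marginal_gap, 3))
print("P(c | h=1, m=1) - P(c | h=1, m=0):", round(conditional_gap, 3))
```

The first gap should be large and the second should be close to zero, up to sampling noise.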

### Forks

| An example of a fork |
| - |
| ![Age causes intelligence and shoe size](./img/age-intel-shoe-size.png) |

Let $i$ stand for some value of $\text{Intelligence}$, $a$ for some value of $\text{Age}$, and $s$ for some value of $\text{Shoe Size}$.

Just like in the chain example, the non-middle-nodes are associated until we know the value of the middle node. The path $\text{Intelligence} \leftarrow \text{Age} \rightarrow \text{Shoe Size}$ is *open* and transmits information. However, if we know the $\text{Age}$, then $\text{Intelligence}$ tells us nothing about $\text{Shoe Size}$ and vice versa. In other words, conditioning on $\text{Age}$ blocks the path $\text{Intelligence} \leftarrow \text{Age} \rightarrow \text{Shoe Size}$.

\begin{equation}
\begin{aligned}
    P(i) &\neq P(i \mid s) & \text{$s$ tells us something about $i$.} \\
    P(s) &\neq P(s \mid i) & \text{$i$ tells us something about $s$.} \\
    P(i \mid a) &= P(i \mid a, s) & \text{Once we know $a$, $s$ tells us nothing about $i$.} \\
    P(s \mid a) &= P(s \mid a, i) & \text{Once we know $a$, $i$ tells us nothing about $s$.} \\
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    I &\perp \!\!\! \perp S \mid A & \text{$I$ is independent of $S$ given $A$.}
\end{aligned}
\end{equation}
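The same check works for a fork. In this sketch we assume continuous variables with made-up linear relationships (a raw test score standing in for intelligence), and we "condition on age" in a simple linear way: we regress each variable on age and correlate what is left over. That is one convenient stand-in for conditioning, not the only one.

```python
import numpy as np

rng = np.random.default_rng(2)
n_children = 100_000

# Hypothetical fork: Age -> test score, Age -> shoe size.
age = rng.uniform(5, 15, size=n_children)
test_score = 10.0 * age + rng.normal(0, 10, size=n_children)
shoe_size = 0.5 * age + rng.normal(0, 0.5, size=n_children)


def residual_after_age(values):
    """Remove the part of `values` that a straight-line fit on age explains."""
    slope, intercept = np.polyfit(age, values, 1)
    return values - (slope * age + intercept)


# Marginally, shoe size "predicts" test score because both grow with age.
corr_marginal = np.corrcoef(test_score, shoe_size)[0, 1]

# After accounting for age, the leftover variation is unrelated.
corr_given_age = np.corrcoef(
    residual_after_age(test_score), residual_after_age(shoe_size)
)[0, 1]

print("Correlation ignoring age: ", round(corr_marginal, 3))
print("Correlation given age:    ", round(corr_given_age, 3))
```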


### Colliders

| An example of a collider |
| - |
| ![SAT causes college scholarship, Musical skill also causes college scholarship](./img/sat-scholarship-musical.png) |

Let $s$ stand for some value of $\text{SAT score}$, $m$ for some value of $\text{Musical Skill}$ , and $c$ for some value of $\text{College Scholarship}$.

$\text{Musical Skill}$ is independent of $\text{SAT score}$ in the general population. In other words, the path $\text{SAT score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical Skill}$ is blocked if we don't know the value of $\text{College Scholarship}$. However, if we know that someone has a college scholarship and they have low $\text{Musical Skill}$, then they are more likely to have a high $\text{SAT score}$. In other words, conditioning on the value of $\text{College Scholarship}$ opens the path $\text{SAT score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical Skill}$.

\begin{equation}
\begin{aligned}
    P(s) &= P(s \mid m) & \text{$m$ tells me nothing about $s$, if we don't know about $c$.}\\
    P(m) &= P(m \mid s) & \text{$s$ tells me nothing about $m$, if we don't know about $c$.} \\
    P(s) &\neq P(s \mid m, c) &\text{Once we know about $c$, and $m$, then that tells us something about $s$.}\\
    P(m) &\neq P(m \mid s, c) & \text{Once we know about $c$, and $s$, then that tells us something about $m$.} \\
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    S &\perp\!\!\!\perp M &\text{$S$ and $M$ are marginally independent} \\
    S & {\not\!\perp\!\!\!\perp} M \mid C &\text{$S$ tells us about $M$, and vice versa, once we know $C$.}
\end{aligned}
\end{equation}
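A quick simulation shows the collider effect. The scholarship rule below (standardized SAT plus standardized musical skill above a cutoff of 1) is an assumption we made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_applicants = 200_000

# SAT score and musical skill are generated independently (standardized units).
sat_z = rng.normal(size=n_applicants)
music_z = rng.normal(size=n_applicants)

# Hypothetical collider: a scholarship goes to anyone whose combined score is high.
has_scholarship = (sat_z + music_z) > 1.0

corr_everyone = np.corrcoef(sat_z, music_z)[0, 1]
corr_scholars = np.corrcoef(sat_z[has_scholarship], music_z[has_scholarship])[0, 1]

print("Correlation, everyone:           ", round(corr_everyone, 3))
print("Correlation, scholarship holders:", round(corr_scholars, 3))
```

In the full sample the correlation should be close to zero. Among scholarship holders it should be clearly negative: a scholarship holder with low musical skill must have gotten in on a high SAT score.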




### Descendants of Colliders

| Student Debt is a descendant of a collider | 
| - |
| <img alt="Student Debt listens to College Scholarship" src="./img/desc-collider.png" width=500> |

Let $s$ stand for some value of $\text{SAT score}$, $m$ for some value of $\text{Musical Skill}$, $c$ for some value of $\text{College Scholarship}$, and $d$ for some value of $\text{Student Debt}$.

Again, like in the collider example above, $\text{SAT score}$ and $\text{Musical Skill}$ are independent of each other in the general population. However, if we know the value of the descendant of a collider, $\text{Student Debt}$, then $\text{SAT score}$ and $\text{Musical Skill}$ become dependent. In other words, conditioning on a descendant of a collider opens the path $\text{SAT score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical Skill}$.

\begin{equation}
\begin{aligned}
    P(s) &= P(s \mid m) & \text{$m$ tells me nothing about $s$, if we don't know about $c$ or $d$.}\\
    P(m) &= P(m \mid s) & \text{$s$ tells me nothing about $m$, if we don't know about $c$ or $d$.} \\
    P(s) &\neq P(s \mid m, d) &\text{Once we know about $d$, then knowing $m$ tells us something about $s$.}\\
    P(m) &\neq P(m \mid s, d) & \text{Once we know about $d$, then knowing $s$ tells us something about $m$.} \\
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}
    S &\perp\!\!\!\perp M &\text{$S$ and $M$ are marginally independent} \\
    S & {\not\!\perp\!\!\!\perp} M \mid D &\text{$S$ tells us about $M$, and vice versa, once we know $D$.}
\end{aligned}
\end{equation}
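Conditioning on a descendant can be sketched the same way. Here we never look at the scholarship itself, only at a noisy consequence of it. The scholarship rule and the debt probabilities (95% of scholarship holders have low debt, 5% of everyone else) are hypothetical, and how strong the induced dependence is will vary with them.

```python
import numpy as np

rng = np.random.default_rng(4)
n_applicants = 200_000

# Independent causes of the collider (standardized units).
sat_z = rng.normal(size=n_applicants)
music_z = rng.normal(size=n_applicants)
has_scholarship = (sat_z + music_z) > 1.0

# Hypothetical descendant: low student debt is far more likely with a scholarship.
p_low_debt = np.where(has_scholarship, 0.95, 0.05)
low_debt = rng.random(n_applicants) < p_low_debt

corr_everyone = np.corrcoef(sat_z, music_z)[0, 1]
corr_low_debt = np.corrcoef(sat_z[low_debt], music_z[low_debt])[0, 1]

print("Correlation, everyone:         ", round(corr_everyone, 3))
print("Correlation, low-debt students:", round(corr_low_debt, 3))
```

Even though we conditioned only on debt, SAT score and musical skill are no longer uncorrelated in the low-debt group.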

### Backdoor Criterion

When analyzing the effect of some treatment on an outcome using observational data, we need to *block all backdoor paths* that transmit information in a non-causal way. Once we do that (without conditioning on mediators), we can find the total average effect of treatment on the outcome, sometimes for certain subgroups, and sometimes for the whole population, as we'll see below.

| Risk confounds Tutoring and Graduate |
| - |
| ![Risk is a common cause of treatment and graduation](./img/risk-tutoring-graduate.png) |

In this example, the causal path is $\text{Tutoring} \rightarrow \text{Graduate}$. However, there is a non-causal path $\text{Tutoring} \leftarrow \text{Risk} \rightarrow \text{Graduate}$ that is transmitting a spurious relationship between $\text{Tutoring}$ and $\text{Graduate}$. By conditioning on $\text{Risk}$, we block the only non-causal *backdoor path* $\text{Tutoring} \leftarrow \text{Risk} \rightarrow \text{Graduate}$ (since conditioning on a middle node of a fork blocks the path). **In this case, the only way that $\text{Tutoring}$ would be related to $\text{Graduate}$ is through the causal path $\text{Tutoring} \rightarrow \text{Graduate}$.**

\begin{equation}
\begin{aligned}
    P(G=1 \mid T=1, R=1) &= P(G=1 \mid do(T=1), R=1) & \text{Seeing is doing, for that subpopulation of at-risk students!}
\end{aligned}
\end{equation}

Likewise:

\begin{equation}
\begin{aligned}
    P(G=1 \mid T=1, R=0) &= P(G=1 \mid do(T=1), R=0) & \text{Seeing is doing, for that subpopulation of not-at-risk students!}
\end{aligned}
\end{equation}


Assuming the data was generated by the DAG above, if we only look at those who were treated and were at-risk and looked at the graduation rates, it would be as if we only looked at the people who were at risk, and then gave those people the treatment (i.e. set the value of $\text{Tutoring}$ to 1).

The CRD for the at-risk subpopulation is then:

\begin{equation}
\begin{aligned}
    P(G=1 \mid do(T=1), R=1) - P(G=1 \mid do(T=0), R=1) &= P(G=1 \mid T=1, R=1) - P(G=1 \mid T=0, R=1) 
\end{aligned}
\end{equation}

Visualizing the control and treatment groups could help explain this:

Fig 1: Treatment Group in our Observational Sample. Dog emoji represents 10k "not-at-risk" students. Cat emoji represents 10k "at-risk" students.

|*|*|*|
|-|-|-|
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |

Fig 2: Control Group in our Observational Sample. Dog emoji represents 10k "not-at-risk" students. Cat emoji represents 10k "at-risk" students.

|*|*|*|
|-|-|-|
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |
| ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) | ![Apple-cat](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/apple/225/cat_1f408.png) |

**By looking only at at-risk students from our observational data (with large samples), the treatment and control groups are the *same*. The only difference between them is the treatment! Therefore, the differences in outcome between the two groups can be attributed to the presence or lack of treatment!**
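In code, conditioning on risk simply means comparing tutored and untutored students within each risk group. The sketch below regenerates a smaller observational sample with the same probabilities as the confounding example (the variable names and the sample size are ours) and computes the risk difference inside each stratum.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_students = 1_000_000

# Same probabilities as the confounded data-generating process in this notebook.
at_risk = rng.binomial(1, 0.3, size=n_students)
got_tutoring = rng.binomial(1, np.where(at_risk == 1, 0.9, 0.1))
grad_rate = np.where(at_risk == 1, 0.2, 0.85) + np.where(at_risk == 1, 0.10, 0.05) * got_tutoring
did_graduate = rng.binomial(1, grad_rate)

obs = pd.DataFrame(
    {"at_risk": at_risk, "got_tutoring": got_tutoring, "did_graduate": did_graduate}
)

# Graduation rate in each (risk, tutoring) cell: rows are risk, columns are tutoring.
rates = obs.groupby(["at_risk", "got_tutoring"])["did_graduate"].mean().unstack("got_tutoring")

# Risk difference within each risk stratum: tutored minus untutored.
stratum_rd = rates[1] - rates[0]
print(stratum_rd.round(2))
```

Within each risk group the difference should come out positive, close to 0.10 for at-risk students and 0.05 for not-at-risk students, even though the unstratified comparison was about -0.40.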


Earlier, we showed the result of randomizing $\text{Tutoring}$ in that example. That removes the arrows leading to $\text{Tutoring}$ such as the arrow $\text{Risk} \rightarrow \text{Tutoring}$: 

| $do(T=t)$ removes arrow from R to T |
|-|
| <img alt="Graph like above, but R no longer has an arrow to T" src="./img/rct-tutoring-graduate.png" width=500> |

In this case, all the backdoor paths are blocked -- there are no backdoor paths to block in the first place. Let $P_m$ represent the probability distributions induced by data-generating processes in the "mutilated" graph produced by our idealized RCT. The only way for $\text{Tutoring}$ and $\text{Graduate}$ to be associated is the path  $\text{Tutoring} \rightarrow \text{Graduate}$, and therefore:

\begin{equation}
\begin{aligned}
    P_m(G=1 \mid T=1) &= P_m(G=1 \mid do(T=1)) & \text{Seeing is "doing" in the idealized RCT}.
\end{aligned}
\end{equation}

This implies that the Causal Risk Difference for the whole population can be estimated with the associated differences in the treatment and control groups of our RCT.

\begin{equation}
\begin{aligned}
    \text{CRD} &= P_m(G=1 \mid do(T=1)) - P_m(G=1 \mid do(T=0)) \\
    &= P_m(G=1 \mid T=1) - P_m(G=1 \mid T=0)
\end{aligned}
\end{equation}

In the confounded case where we had observational data, we were able to find average causal effects for subpopulations (i.e. those at risk and those not at risk). In the non-confounded, idealized RCT case, we were able to find average causal effects for the whole population. One might wonder if we can use the values of the former to get the latter. It turns out that we can. We'll show this later on.

But first, let's look at other examples:

### M-bias example

| M-bias example |
|- |
| <img alt="M-bias example" src="./img/m-bias.png" width=600> |

#### Question: What variables should we condition on to block the backdoor path?

#### Answer

We don't need to condition on anything! $\text{Treatment_X} \leftarrow \text{SAT_score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical_Skill} \rightarrow \text{Outcome_Y}$ is already blocked by default without conditioning on anything! It's blocked because that's the only backdoor path and $\text{College Scholarship}$ is a collider on it. If we don't condition on $\text{College Scholarship}$ or its descendant $\text{Student Debt}$, then the path stays blocked. Thus:

\begin{equation}
\begin{aligned}
    P(Y \mid X=1) &= P(Y \mid do(X=1)) & \text{Seeing is doing for the general population!}
\end{aligned}
\end{equation}

#### What happens when we condition on $\text{College Scholarship}$?

If we were to condition on $\text{College Scholarship}$, then that opens the backdoor path.
Therefore, 

\begin{equation}
\begin{aligned}
    P(Y \mid X=1, C=1) &\neq P(Y \mid do(X=1), C=1) & \text{Seeing is } \textit{not} \text{ doing for that subpopulation!}
\end{aligned}
\end{equation}


To block it again, we'd need to condition on another variable along the path, such as $\text{SAT_score}$, which blocks $\text{Treatment_X} \leftarrow \text{SAT_score} \rightarrow \text{College Scholarship} \leftarrow \text{Musical_Skill} \rightarrow \text{Outcome_Y}$. Thus, the set consisting of $\text{SAT_score}$ and $\text{College Scholarship}$ is also a valid *deconfounding set*. It's sufficient, *but not necessary*, since the empty set already works.

\begin{equation}
\begin{aligned}
    P(Y \mid X=1, C=1, S=1) &= P(Y \mid do(X=1), C=1, S=1) & \text{Seeing is doing for that subpopulation!}
\end{aligned}
\end{equation}
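
The three cases above can be checked with a small simulation. This is only a sketch under loud assumptions: the linear-Gaussian structural equations and all coefficients are made up, the true effect of $X$ on $Y$ is set to zero so that any nonzero slope is pure bias, and "conditioning" is approximated by including a variable as a covariate in an ordinary least squares regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed linear-Gaussian version of the M-bias graph (illustrative only).
sat = rng.normal(size=n)                      # SAT_score
music = rng.normal(size=n)                    # Musical_Skill
x = sat + rng.normal(size=n)                  # Treatment_X, caused by SAT_score
y = music + rng.normal(size=n)                # Outcome_Y; true effect of X is 0 here
# College Scholarship is the collider: caused by both SAT_score and Musical_Skill.
scholarship = (sat + music + rng.normal(size=n) > 0).astype(float)

def slope_on_x(*adjust_for):
    """OLS coefficient of X when regressing Y on X plus the given covariates."""
    design = np.column_stack([np.ones(n), x, *adjust_for])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

no_adjustment = slope_on_x()                      # path blocked by the collider
collider_only = slope_on_x(scholarship)           # conditioning opens the path
collider_and_sat = slope_on_x(scholarship, sat)   # SAT_score blocks it again

print(f"no adjustment:          {no_adjustment:+.3f}")
print(f"adjust for scholarship: {collider_only:+.3f}")
print(f"adjust for both:        {collider_and_sat:+.3f}")
```

In this setup the first and third slopes should sit near zero, while adjusting for the collider alone should produce a clearly negative slope even though $X$ does nothing to $Y$.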


### What if I have a complex web of variables? Is there software that gives me a deconfounding set, if it exists?

Yes there is! See [dagitty.net/dags.html](http://dagitty.net/dags.html).

It gives you a deconfounding (i.e. adjustment) set, if one exists. Here's its output for the M-bias example above. It says we don't need to adjust for anything!

It also gives us testable implications of our graph. If our graph is true, then the conditional independencies such as $\text{SAT_score}  \perp\!\!\!\perp \text{Musical_Skill}$ must also hold in the data! If that doesn't hold, then there's something wrong with the graph.

![M-bias dagitty recommendations](./img/m-bias-dagitty.png)
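
A rough sketch of how such a testable implication could be checked: simulate one dataset that agrees with the graph and one that violates it, then compare the sample correlation between the two variables. Both data-generating processes below are hypothetical. Note also that a near-zero correlation is only a necessary sign of independence, not a proof of it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# World A: generated consistently with the M-bias graph (no link between the two).
sat_a = rng.normal(size=n)
music_a = rng.normal(size=n)

# World B: hypothetical violation, where SAT_score also influences Musical_Skill.
sat_b = rng.normal(size=n)
music_b = 0.5 * sat_b + rng.normal(size=n)

corr_a = np.corrcoef(sat_a, music_a)[0, 1]
corr_b = np.corrcoef(sat_b, music_b)[0, 1]

print(f"graph holds:    corr = {corr_a:+.3f}")
print(f"graph violated: corr = {corr_b:+.3f}")
```

If real data looked like World B, the implied independence would fail and the graph would need revising.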


## Average Causal Effects for the Population

In our earlier example, where $\text{Risk}$ confounds $\text{Tutoring}$ and $\text{Graduation}$, we found that conditioning on $\text{Risk}$ blocks the only backdoor path.

| Risk confounds Tutoring and Graduate |
| - |
| ![Risk is a common cause of treatment and graduation](./img/risk-tutoring-graduate.png) |

We found that: 

\begin{equation}
\begin{aligned}
    P(G=1 \mid T=1, R=1) &= P(G=1 \mid do(T=1), R=1) & \text{Seeing is doing, for that subpopulation of at-risk students!}
\end{aligned}
\end{equation}

and

\begin{equation}
\begin{aligned}
    P(G=1 \mid T=1, R=0) &= P(G=1 \mid do(T=1), R=0) & \text{Seeing is doing, for that subpopulation of not-at-risk students!}
\end{aligned}
\end{equation}

We were able to answer subpopulation-specific queries such as $P(G=1 \mid do(T=1), R=1)$ and $P(G=1 \mid do(T=0), R=1)$. But what if we're interested in the effect of treatment $T$ on the whole population, i.e. $P(G=1 \mid do(T=1))$?

There are multiple ways to derive it. Two of them are (1) the $do$-calculus and (2) truncated factorization.

### $do$-calculus


#### Action/Observation Exchange

We already learned this rule.

If all the backdoor paths are blocked, then the only way the treatment and the outcome could be related is through the causal pathway $\text{Tutoring} \rightarrow \text{Graduate}$. Therefore, any association left over after blocking backdoor paths is causal. Because it's causal, we can write as follows:

\begin{equation}
\begin{aligned}
    P(G=1 \mid T=1, R=1) &= P(G=1 \mid do(T=1), R=1) & \text{Seeing is doing, for that subpopulation!}
\end{aligned}
\end{equation}

#### Insertion & Removal of Actions

When we *set* the treatment to a value regardless of any other variable, written $do(T=t)$, we depict that by removing the arrows going into $T$. Let's say we care about the effect of the treatment $T$ on $R$, i.e. $P(R=1 \mid do(T=1))$.

| $do(T=t)$ removes arrow from R to T |
|-|
| <img alt="Graph like above, but R no longer has an arrow to T" src="./img/rct-tutoring-graduate.png" width=500> |

In this modified graph, the only path between $T$ and $R$ is $T \rightarrow G \leftarrow R$, on which $G$ is a collider. Since we are not conditioning on $G$, the path is blocked, so $T$ and $R$ are independent: $T$ has no effect on $R$.

Since $T$ has no effect on $R$, we can remove $do(T=1)$ from the conditioning part of the expression:

\begin{equation}
\begin{aligned}
    P(R=1 \mid do(T=1)) &= P(R=1) & \text{Intervening on T tells us nothing about R}
\end{aligned}
\end{equation}
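
To make the difference between seeing and doing concrete here, the sketch below simulates a small structural model in which $R$ causes $T$, then compares $P(R=1 \mid T=1)$ with $P(R=1 \mid do(T=1))$. The probabilities (30% at risk, tutoring rates of 80% and 20%) are assumptions chosen for illustration, not values from the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def simulate(do_tutoring=None):
    """Sample (risk, tutored) from an assumed model; optionally force T for everyone."""
    risk = rng.binomial(1, 0.3, size=n)              # R has no parents
    if do_tutoring is None:
        # Observational regime: at-risk students are tutored more often (R -> T).
        tutored = rng.binomial(1, np.where(risk == 1, 0.8, 0.2))
    else:
        # Interventional regime: do(T=t) ignores R, i.e. the arrow R -> T is cut.
        tutored = np.full(n, do_tutoring)
    return risk, tutored

risk_obs, tutored_obs = simulate()
risk_do, _ = simulate(do_tutoring=1)

p_r_given_t1 = risk_obs[tutored_obs == 1].mean()   # seeing: P(R=1 | T=1)
p_r_do_t1 = risk_do.mean()                         # doing:  P(R=1 | do(T=1))
p_r = risk_obs.mean()                              # P(R=1)

print(f"P(R=1 | T=1)     = {p_r_given_t1:.3f}")
print(f"P(R=1 | do(T=1)) = {p_r_do_t1:.3f}")
print(f"P(R=1)           = {p_r:.3f}")
```

Observing $T=1$ is informative about $R$ because at-risk students seek tutoring more often, whereas forcing $T=1$ leaves the share of at-risk students unchanged.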

### How do we put it all together? (Adjustment Formula Derivations)

#### Derivation 1: $do$-calculus

Here's our example: 

| Risk confounds Tutoring and Graduate |
| - |
| ![Risk is a common cause of treatment and graduation](./img/risk-tutoring-graduate.png) |

\begin{equation}
\begin{aligned}
    P(G=1 \mid do(T=1)) &= \sum_r P(G=1, R=r \mid do(T=1)) &\text{Law of total probability} \\
    &= \sum_r P(G=1 \mid do(T=1), R=r) \cdot P(R=r \mid do(T=1)) &\text{Chain rule} \\
    &= \sum_r P(G=1 \mid do(T=1), R=r) \cdot P(R=r) &\text{do-calc: Insertion / Deletion of Actions} \\
    &= \sum_r P(G=1 \mid T=1, R=r) \cdot P(R=r) &\text{do-calc: Action / Observation Exchange}
\end{aligned}
\end{equation}

Thus, we have the causal risk difference for the whole population:

\begin{equation}
\begin{aligned}
    \text{CRD} &= P(G=1 \mid do(T=1)) - P(G=1 \mid do(T=0)) \\
    &= \sum_r P(G=1 \mid T=1, R=r) \cdot P(R=r) - \sum_r P(G=1 \mid T=0, R=r) \cdot P(R=r) \\
    &= \sum_r [P(G=1 \mid T=1, R=r) - P(G=1 \mid T=0, R=r)] \cdot P(R=r)
\end{aligned}
\end{equation}

We now have a $do$-free expression, which means that, for this example, we can use observational data to estimate the causal effect of the treatment on the outcome, provided we have measured the variables in the deconfounding set (here, $\text{Risk}$).
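
One way to sanity-check the formula is on simulated data where the true effect is known. The sketch below assumes a made-up structural model whose true CRD is 0.07 in both risk strata. Every parameter and the `obs` DataFrame are hypothetical and separate from the notebook's own `df`. It compares the naive contrast with the adjusted one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 500_000

# Assumed structural model (illustrative parameters only).
risk = rng.binomial(1, 0.3, size=n)                          # R
tutored = rng.binomial(1, np.where(risk == 1, 0.8, 0.2))     # R -> T
p_grad = np.where(risk == 1, 0.50, 0.90) + 0.07 * tutored    # true CRD = 0.07
graduated = rng.binomial(1, p_grad)                          # R -> G <- T
obs = pd.DataFrame({"risk": risk, "tutored": tutored, "graduated": graduated})

# Naive contrast: P(G=1 | T=1) - P(G=1 | T=0), confounded by risk.
naive = (obs.loc[obs.tutored == 1, "graduated"].mean()
         - obs.loc[obs.tutored == 0, "graduated"].mean())

# Adjustment formula: stratum-specific differences weighted by P(R=r).
adjusted = 0.0
for r in (0, 1):
    stratum = obs[obs.risk == r]
    effect_r = (stratum.loc[stratum.tutored == 1, "graduated"].mean()
                - stratum.loc[stratum.tutored == 0, "graduated"].mean())
    adjusted += effect_r * (obs.risk == r).mean()

print(f"naive: {naive:+.3f}, adjusted: {adjusted:+.3f}")
```

With these assumed parameters the naive contrast even has the wrong sign, because tutored students are disproportionately at risk, while the adjusted estimate should land close to the 0.07 built into the simulation.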

#### Derivation 2: Truncated Factorization

We could find the CRD another way. 

We can look back at the idealized RCT graph:


| $do(T=t)$ removes arrow from R to T |
|-|
| <img alt="Graph like above, but R no longer has an arrow to T" src="./img/rct-tutoring-graduate.png" width=500> |

We could make use of Markovian factorization to get the joint distribution of observed variables. $pa(x_i)$ stands for parents of the $X_i$ variable. $x_i$ is a particular realization of $X_i$.

\begin{equation}
\begin{aligned}
    P(x_0, x_1, \ldots, x_z) &= \prod_{i=0}^z P(x_i \mid pa(x_i))
\end{aligned}
\end{equation}

In the confounded case, we have:

\begin{equation}
\begin{aligned}
    P(g, t, r) &= P(g \mid t, r) \cdot P(t \mid r) \cdot P(r) \\
\end{aligned}
\end{equation}


In the modified graph, we have:

\begin{equation}
\begin{aligned}
    P_m(g, t, r) &= P_m(g \mid t, r) \cdot P_m(t) \cdot P_m(r) \\
    P_m(g, r \mid t) &= P_m(g \mid t, r) \cdot P_m(r) & \text{Divide both sides by } P_m(t) \\
    P_m(g \mid t) &= \sum_r P_m(g \mid t, r) \cdot P_m(r) & \text{Sum over } r \\
    &= \sum_r P(g \mid t, r) \cdot P_m(r) & \text{Invariance of } P(g \mid t, r) \\
    &= \sum_r P(g \mid t, r) \cdot P(r) & \text{Invariance of } P(r)
\end{aligned}
\end{equation}

In the modified graph there are no backdoor paths, so $P_m(g \mid t) = P(g \mid do(T=t))$. Setting $T$ to $t$ changes neither $P(g \mid t, r)$ nor $P(r)$: both are *invariant* across the confounded and non-confounded scenarios, which gives us the license to swap $P_m$ for $P$.
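
The factorizations can also be evaluated exactly from conditional probability tables. The sketch below uses hypothetical tables (the numbers are assumptions chosen so that the true CRD is 0.07) and contrasts the full factorization, which keeps $P(t \mid r)$, with the truncated one, which drops it.

```python
# Hypothetical conditional probability tables for the graph R -> T, R -> G, T -> G.
p_r = {0: 0.7, 1: 0.3}                                     # P(R=r)
p_t_given_r = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}   # p_t_given_r[r][t]
p_g1_given_tr = {(0, 0): 0.90, (1, 0): 0.97,               # keys are (t, r)
                 (0, 1): 0.50, (1, 1): 0.57}

def observational_p_g1_given_t(t):
    """P(G=1 | T=t) from the full factorization P(g|t,r) P(t|r) P(r)."""
    joint_g1_t = sum(p_g1_given_tr[(t, r)] * p_t_given_r[r][t] * p_r[r] for r in p_r)
    marginal_t = sum(p_t_given_r[r][t] * p_r[r] for r in p_r)
    return joint_g1_t / marginal_t

def interventional_p_g1(t):
    """P(G=1 | do(T=t)): drop the factor P(t|r), keep P(g|t,r) P(r)."""
    return sum(p_g1_given_tr[(t, r)] * p_r[r] for r in p_r)

crd = interventional_p_g1(1) - interventional_p_g1(0)
naive = observational_p_g1_given_t(1) - observational_p_g1_given_t(0)

print(f"CRD via truncated factorization: {crd:.3f}")
print(f"Naive observational difference:  {naive:.3f}")
```

The only difference between the two functions is whether $P(t \mid r)$ appears, which is exactly the factor that the intervention removes.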

### Adjustment Formula in Code (with large samples)

Let's pretend, as in many real-life scenarios, that we couldn't randomize the treatment, and use the Adjustment Formula to estimate the causal risk difference from observational data. Our estimate of the average effect of tutoring on graduation rates should be close to 7 percentage points.

In [13]:
def prob_grad_do(df, tutored=1):
    """
    Estimate P(graduated=1 | do(tutored)) with the adjustment formula,
    adjusting for risk.

        tutored: 1 or 0, which represents the treatment
        df: the DataFrame containing observational data.
    """
    estimate = 0.0
    for risk in (1, 0):
        in_stratum = df['risk'] == risk
        # P(R=r) times P(G=1 | T=tutored, R=r)
        p_risk = in_stratum.mean()
        p_grad = df.loc[in_stratum & (df['tutored'] == tutored), 'graduated'].mean()
        estimate += p_risk * p_grad
    return estimate

In [14]:
crd_estimate = prob_grad_do(df, tutored=1) - prob_grad_do(df, tutored=0)
print(f"The Causal Risk Difference estimate from observational data is {round(crd_estimate, 2)}.")

The Causal Risk Difference estimate from observational data is 0.07.


In [12]:
print(print_rct_estimate)

Our RCT causal risk difference estimate of the causal effect of tutoring on graduation is 0.06.


As we can see from above, our CRD estimate using observational data is very close to the CRD acquired through our RCT! In other words, we were able to estimate the causal effect of a treatment on the outcome using observational data sampled from the population. We were able to do that by:

1. having enough knowledge about the DAG. 

Having the DAG structure lets us know in what ways the treatment and the outcome could be spuriously related, and how they could be blocked (if possible). Assuming we could block all backdoor paths (i.e. satisfy the backdoor criterion), we could then do the next step, which is

2. using the adjustment formula (g-formula).

We derived the adjustment formula in two ways. Once we know what variables to adjust for (i.e. a deconfounding set which blocks all backdoor paths), we could use the adjustment formula to find the causal effect of treatment on the outcome using observational data. **This helps researchers find causal effects when randomizing treatment is unethical or too costly.**
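
When the deconfounding set contains more than one variable, the same formula can be written once for any list of discrete adjustment columns. The function below is a sketch, not the notebook's implementation: `adjusted_mean`, its parameters, and the toy data are all hypothetical, and it assumes the listed columns really do form a valid deconfounding set.

```python
import pandas as pd

def adjusted_mean(data, outcome, treatment, value, adjust_for):
    """Estimate P(outcome=1 | do(treatment=value)) by the adjustment formula.

    adjust_for: list of column names forming an (assumed valid) deconfounding set.
    """
    # P(Z=z): share of the whole sample in each stratum of the adjustment set.
    weights = data.groupby(adjust_for).size() / len(data)
    # P(outcome=1 | treatment=value, Z=z) within each stratum.
    treated = data[data[treatment] == value]
    stratum_means = treated.groupby(adjust_for)[outcome].mean()
    # Positivity check: every stratum must contain units with this treatment value.
    missing = weights.index.difference(stratum_means.index)
    if len(missing) > 0:
        raise ValueError(f"No units with {treatment}={value} in strata: {list(missing)}")
    return (stratum_means * weights).sum()

# Toy data (made up) to show the call pattern.
toy = pd.DataFrame({
    "risk":      [1, 1, 1, 1, 0, 0, 0, 0],
    "tutored":   [1, 1, 1, 0, 1, 0, 0, 0],
    "graduated": [1, 0, 1, 0, 1, 1, 1, 0],
})
crd_toy = (adjusted_mean(toy, "graduated", "tutored", 1, ["risk"])
           - adjusted_mean(toy, "graduated", "tutored", 0, ["risk"]))
print(crd_toy)
```

The explicit positivity check matters in practice: if some stratum has no treated (or no untreated) units, the formula has nothing to average there, and silently skipping that stratum would bias the estimate.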

### In the future

In the following notebooks, we'll cover more causal graph theory that will help us go beyond the limitations of RCTs.

We'll cover:
* estimating causal effects with smaller sample sizes. 
* validating our causal model (i.e. testable implications of the model)
* other strategies for finding causal effects, such as by satisfying the Front-door criterion and making use of the $do$-calculus. 
* mediation strategies.
* how using DAGs could illuminate issues on selection bias, and what to do next, if possible.
* how to transparently represent combining data from different sources to be able to generalize our findings (i.e. transportability).

# Resources


| Image | Notes | Link |
| - | - | - |
| <img src='https://prodimage.images-bn.com/pimages/9781541698963_p0_v1_s600x595.jpg' alt='Book of Why: The New Science of Cause & Effect cover' width=500> | An introduction meant for the more general public. It still is technical, has some math, but focuses more on stories and anecdotes instead of derivations. | [Book of Why: The New Science of Cause & Effect](https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X) | 
| <img alt='Causal Inference in Statistics: A Primer' src='https://s3.amazonaws.com/vh-woo-images/causal-inference-in-statistics-a-primer-1st-edition.jpg' width=500> | Recommended by Pearl to be read after the Book of Why. Dives more into the math. Has end-of-chapter exercises. *Note: I have the solutions manual! I told Pearl I was self-studying and he graciously gave me a copy!* | [Causal Inference in Statistics: A Primer](https://www.amazon.com/Causal-Inference-Statistics-Judea-Pearl/dp/1119186846) |
| <img src='https://images-na.ssl-images-amazon.com/images/I/511aGcbGLyL._SX343_BO1,204,203,200_.jpg' alt='Causality' width=500> | Goes more in-depth than the Primer book. | [Causality](https://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X) |
| <img alt='Causal Inference: The Mixtape' src='https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1566276665i/47867837._UY630_SR1200,630_.jpg' width=500> | Has a section on DAGs, but focuses more on causal inference techniques commonly used in economics. *FREELY AVAILABLE*. | [Causal Inference: The Mixtape](https://www.scunning.com/mixtape.html) |
| <img alt='Causal Diagrams: Draw Your Assumptions Before Your Conclusions' src='./img/causal-diagrams-draw-your-assumptions.png' width=500> | *FREE* course on edX. Uses epidemiological case studies to show why drawing your assumptions is important. | [Causal Diagrams: Draw Your Assumptions Before Your Conclusions](https://online-learning.harvard.edu/course/causal-diagrams-draw-your-assumptions-your-conclusions) |
| not applicable | "Causal Inference: What If" is a book that dives into Hernán & Robins' Potential Outcomes with DAGs approach. *FREE*. | [Causal Inference: What If](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/) |