In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression
NUM_DATA_POINTS = 100000

# Introduction

I just finished studying Brady Neal's [Introduction to Causal Inference](https://www.bradyneal.com/causal-inference-course) course. It was a great introduction to the topic for me, and the author was kind enough to release both the lectures and the associated course book for free (at the time of writing, a few chapters were not yet published). The main goal of this article is to get my hands dirty and solidify some of the things I learned - it would be a nice bonus if it turns out to help someone else as well!

Let's start with some terminology. Figure 4.1 from the course book illustrates the general procedure of causal inference: the flow from a causal estimand + causal model to a statistical estimand, and from a statistical estimand + data to an estimate. Here are some relevant terms, using the definitions from the [Introduction to Causal Inference](https://www.bradyneal.com/causal-inference-course) book:

* **Causal Estimand**
  * The causal expression we are interested in evaluating. It cannot be estimated directly from data.
  * Contains one or more potential outcomes or $do$-operators, e.g. $\mathbb{E}[Y(t)] = \mathbb{E}[Y|do(T=t)]$.
    * Both sides of the equality can be read as "the expected value of the outcome variable $Y$, given that the population of interest receives treatment $T=t$". The left side utilizes potential outcome notation and the right side $do$-operator notation.
    * Note that "receives treatment" doesn't have to mean a designed intervention in this context, see [Does Obesity Shorten Life? Or is it the Soda? On Non-manipulable Causes (Pearl, 2018)](https://ftp.cs.ucla.edu/pub/stat_ser/r483-reprint.pdf) for more information.
  * In this article I will primarily consider the specific causal estimand *average treatment effect* (ATE): $\mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[Y|do(T=1)] - \mathbb{E}[Y|do(T=0)]$. I will focus on the case where the treatment is binary, with $T=1$ denoting treatment and $T=0$ no treatment. However, most of the concepts I will cover easily extend to the continuous treatment case.
* **Statistical Estimand**
  * Unlike causal estimands, statistical estimands contain no potential outcomes or $do$-operators and can be estimated from data. Therefore a big part of causal inference is identification - the process of moving from a causal estimand to a statistical estimand.
  * Which statistical estimand correctly identifies the causal estimand of interest depends on the causal model. Causal models are commonly visualized by a *directed acyclic graph* (DAG) and/or specified by a *structural causal model* (SCM).
* **Estimate (noun)**
  * An approximation of the estimand we want to estimate (verb) - a concrete value or distribution. The outcome of estimating a statistical estimand using data.
  * An estimator is a function that maps from a dataset to an estimate of the (statistical) estimand we are considering.
  * There are many ways to do estimation. In this article I will focus on *conditional outcome modeling* (COM). Other types of methods include, but are not limited to:
    * Matching methods
    * Inverse probability weighting
    * Double machine learning
    * Causal trees and forests

![figures/identification_estimation_flowchart.png](figures/identification_estimation_flowchart.png)

# Causal Model

When working with observational data and attempting to make causal inferences, thinking about the underlying causal model that generated the data is essential. (In the experimental setting it's less important: in a well-designed randomized controlled trial, random assignment to treatment and control groups ensures there is no confounding.) Association does not imply causation, for two reasons. First, association has no direction (e.g. a rooster's crow is associated with the sun rising, but is not the cause of it). Second, confounding variables can affect both the treatment and the outcome (e.g. eating ice cream is probably associated with drowning, since going to the beach increases the probability of both ice cream consumption and drowning). One might be tempted (many researchers have been) to simply control for as many covariates as possible in order to reduce the probability of significant confounding. However, the section about colliders will explain why this can be a bad strategy.

## Graphical Causal Model (DAG)

Graphical models can be used to visualize causal relationships between variables. DAGs are commonly used, where each node represents a variable and each edge a causal relationship. An edge from $A$ to $B$ means that $A$ is a cause of $B$. You can read more about causal graphs in chapters 3 and 4 of the [Introduction to Causal Inference](https://www.bradyneal.com/causal-inference-course) book. The simple DAG below (copied from the book) has a common structure, where treatment $T$ is a cause of outcome $Y$, and confounder $W$ is a common cause of both $T$ and $Y$. We are usually interested in the causal effect of the treatment on the outcome.

![figures/dag_1.png](figures/dag_1.png)

## Structural Causal Model (SCM)

Graphical causal models are very useful, but SCMs can bring additional clarity and allow us to compute counterfactuals. You can read about SCMs in section 4.5 of the [Introduction to Causal Inference](https://www.bradyneal.com/causal-inference-course) book, where they are defined as:

> A structural causal model is a tuple of the following sets:
> 1. A set of endogenous variables $V$
> 2. A set of exogenous variables $U$
> 3. A set of functions $f$, one to generate each endogenous variable as a function of other variables

Endogenous variables are variables with parents in the corresponding causal graph, i.e. the variables we are modeling the cause(s) of. Exogenous variables are the variables without parents in the causal graph. An example from the book is the following set of *structural equations*

![figures/scm_example.png](figures/scm_example.png)

that correspond to the following DAG

![figures/scm_example_dag.png](figures/scm_example_dag.png)

Here the endogenous variables are $\{B,C,D\}$ and the exogenous variables are $\{A,U_B,U_C,U_D\}$. The variables in $\{U_B,U_C,U_D\}$ are noise variables that make the endogenous variables random variables even if all functions in $f$ are deterministic (noise variables are usually omitted from DAGs).

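The data generating process in the next section corresponds to the following structural equations, where $\mathbb{1}[\cdot]$ is the indicator function, $U_W \sim \mathrm{Bernoulli}(0.5)$, and $U_T, U_Y \sim \mathcal{N}(0, 1)$ are independent noise variables:

$$
\begin{aligned}
W &:= U_W \\
T &:= \mathbb{1}[0.5 W + U_T > 0] \\
Y &:= 0.8 W + 1.2 T + U_Y
\end{aligned}
$$

Since $Y$ is linear in $T$ with coefficient $1.2$, the true ATE is $1.2$.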

## Data generating process

In [2]:
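def f_w():
    # Confounder W: a fair coin flip (the exogenous noise u_w)
    u_w = np.random.choice([0, 1])
    return u_w

def f_t(w):
    # Binary treatment T: more likely to be 1 when w = 1
    u_t = np.random.normal()
    intermediate = 0.5 * w + u_t
    return 1 if intermediate > 0 else 0

def f_y(w, t):
    # Outcome Y: the true causal effect of t on y is 1.2
    u_y = np.random.normal()
    return 0.8 * w + 1.2 * t + u_y

# Sample the variables in causal order: W, then T, then Y
W = np.array([f_w() for _ in range(NUM_DATA_POINTS)])
T = np.array([f_t(w) for w in W])
Y = np.array([f_y(w, t) for w, t in zip(W, T)])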

In [None]:
# TODO: Show that association is not causal effect here
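
One way to see that association is not causal effect for this data generating process is to compare the naive difference in group means, $\mathbb{E}[Y|T=1] - \mathbb{E}[Y|T=0]$, with the true ATE of $1.2$ (the coefficient on `t` in `f_y`). The sketch below is my own illustration: it re-creates the same structural equations in vectorized form, and the seed and lowercase variable names are assumptions that do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
n = 100_000

# Same structural equations as f_w, f_t and f_y, but vectorized
w = rng.integers(0, 2, size=n)                        # confounder
t = (0.5 * w + rng.normal(size=n) > 0).astype(int)    # binary treatment
y = 0.8 * w + 1.2 * t + rng.normal(size=n)            # outcome

# Associational difference: mean outcome of treated minus mean outcome of untreated
naive_difference = y[t == 1].mean() - y[t == 0].mean()

print("Naive difference in means:", naive_difference)
print("True ATE:", 1.2)
```

The naive difference should come out noticeably above $1.2$, because units with $W=1$ are both more likely to be treated and have higher outcomes regardless of treatment.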

# Identification

Backdoor criterion

Backdoor adjustment
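
A brief sketch of these two ideas, based on my reading of the course book (see chapter 4 there for the precise statements and assumptions). A set of variables $W$ satisfies the backdoor criterion relative to $T$ and $Y$ if it blocks all backdoor paths from $T$ to $Y$ and contains no descendants of $T$. When that holds (together with positivity), the backdoor adjustment identifies the interventional distribution:

$$
P(y \mid do(t)) = \sum_w P(y \mid t, w) \, P(w)
$$

For the ATE with binary treatment this gives the statistical estimand

$$
\mathbb{E}[Y(1) - Y(0)] = \mathbb{E}_W\big[\mathbb{E}[Y \mid T=1, W] - \mathbb{E}[Y \mid T=0, W]\big]
$$

In the simple DAG from the causal model section, $W$ blocks the only backdoor path $T \leftarrow W \rightarrow Y$ and is not a descendant of $T$, so adjusting for $W$ is sufficient there.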

# Estimation

Linear model
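
A sketch of the conditional outcome modeling idea used in the next cell, with $\hat{\mu}$ and $\hat{\tau}$ as my own notation (an assumption, not taken from the notebook): fit a model $\hat{\mu}(t, w) \approx \mathbb{E}[Y \mid T=t, W=w]$, then average the difference between its predictions under $T=1$ and $T=0$ over the observed values $w_i$:

$$
\hat{\tau} = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{\mu}(1, w_i) - \hat{\mu}(0, w_i)\big)
$$

In the code, $\hat{\mu}$ is a linear regression of $Y$ on $W$ and $T$, and the two prediction sets come from copies of the data with the treatment column set to all ones and all zeros.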

In [3]:
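# Feature matrix with columns [W, T], plus two copies with T forced to 1 and to 0
WT = np.array([W, T]).T
WT_1 = np.array([W, np.ones(len(W))]).T
WT_0 = np.array([W, np.zeros(len(W))]).T
# Fit the outcome model, then average the predicted difference between T=1 and T=0
model = LinearRegression()
model.fit(WT, Y)
ate_estimate = np.mean(model.predict(WT_1) - model.predict(WT_0))
print("ATE estimate:", ate_estimate)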

ATE estimate: 1.2012474981998513


In [4]:
# The coefficient on T (second column of WT) gives the same estimate
print("ATE estimate:", model.coef_[1])

ATE estimate: 1.2012474981998507
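
The two estimates agree up to floating point error, and a short derivation shows why. Assume the fitted model has the usual linear form with an intercept and no interaction terms (the symbols are my own notation):

$$
\hat{\mu}(t, w) = \hat{\beta}_0 + \hat{\beta}_W w + \hat{\beta}_T t
$$

Then the predicted difference is the same for every data point:

$$
\hat{\mu}(1, w_i) - \hat{\mu}(0, w_i) = \hat{\beta}_T
$$

so the average over all data points is also $\hat{\beta}_T$, which is `model.coef_[1]`. With a nonlinear model, or with interaction terms between $T$ and $W$, this shortcut would no longer hold and the averaging step would be needed.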


# Colliders

![figures/dag_2.png](figures/dag_2.png)

In [5]:
def f_w():
    u_w = np.random.choice([0, 1])
    return u_w

def f_t(w):
    u_t = np.random.normal()
    intermediate = 0.5 * w + u_t
    return 1 if intermediate > 0 else 0

def f_y(w, t):
    u_y = np.random.normal()
    return 0.8 * w + 1.2 * t + u_y

def f_z(t, y):
    u_z = np.random.normal()
    intermediate = 1.5 * t + y - 2 + u_z
    return 1 if intermediate > 0 else 0

W = np.array([f_w() for _ in range(NUM_DATA_POINTS)])
T = np.array([f_t(w) for w in W])
Y = np.array([f_y(w, t) for w, t in zip(W, T)])
Z = np.array([f_z(t, y) for t, y in zip(T, Y)])

In [6]:
# Adjust for both W and the collider Z
WZT = np.array([W, Z, T]).T
WZT_1 = np.array([W, Z, np.ones(len(W))]).T
WZT_0 = np.array([W, Z, np.zeros(len(W))]).T
model = LinearRegression()
model.fit(WZT, Y)
ate_estimate = np.mean(model.predict(WZT_1) - model.predict(WZT_0))
print("ATE estimate:", ate_estimate)

ATE estimate: 0.40757848747942743


In [7]:
# Adjust for W only, leaving the collider Z out
WT = np.array([W, T]).T
WT_1 = np.array([W, np.ones(len(W))]).T
WT_0 = np.array([W, np.zeros(len(W))]).T
model = LinearRegression()
model.fit(WT, Y)
ate_estimate = np.mean(model.predict(WT_1) - model.predict(WT_0))
print("ATE estimate:", ate_estimate)

ATE estimate: 1.196696813274864


# Unobserved confounding

![figures/dag_3.png](figures/dag_3.png)

In [8]:
def f_w():
    u_w = np.random.choice([0, 1])
    return u_w

def f_u():
    u_u = np.random.normal()
    return u_u

def f_t(w, u):
    u_t = np.random.normal()
    intermediate = 0.5 * w + 0.8 * u + u_t
    return 1 if intermediate > 0 else 0

def f_y(w, u, t):
    u_y = np.random.normal()
    return 0.8 * w - 2 * u + 1.2 * t + u_y

W = np.array([f_w() for _ in range(NUM_DATA_POINTS)])
U = np.array([f_u() for _ in range(NUM_DATA_POINTS)])
T = np.array([f_t(w, u) for w, u in zip(W, U)])
Y = np.array([f_y(w, u, t) for w, u, t in zip(W, U, T)])

In [9]:
WT = np.array([W, T]).T
WT_1 = np.array([W, np.ones(len(W))]).T
WT_0 = np.array([W, np.zeros(len(W))]).T
model = LinearRegression()
model.fit(WT, Y)
ate_estimate = np.mean(model.predict(WT_1) - model.predict(WT_0))
print("ATE estimate:", ate_estimate)

ATE estimate: -0.8108111630460578


## Sensitivity analysis

# Conclusion

TODO

This article covers just a small part of causal inference.