### Ladder of Causation

A metaphor for understanding the distinct levels of relationships between variables. The ladder has three rungs. Each rung is associated with a different activity, answers a different type of causal question, and comes with its own set of mathematical tools.

1. <u><b>Association</b></u>: The activity related to this level is <u>observing</u>. Using association, we can answer questions about how seeing one thing changes our belief about another thing. E.g. how observing a successful product launch changes our belief that the company's stock price will go up.
2. <u><b>Intervention</b></u>: The activity related to this level is <u>doing</u>. We intervene on one variable to check how it influences another variable. E.g. if I go to bed earlier, will I have more energy the following morning?
3. <u><b>Counterfactual reasoning</b></u>: The activity related to this level is <u>imagining</u> (or understanding). We ask what would have happened had we done something differently. E.g. would I have made it to the office on time had I taken the train rather than the car?

#### Associations

We can quantify associational relationships using <u>conditional probability</u>. E.g. what is the probability that a person will buy book A, given they've bought book B: P(book A|book B).

This quantity gives us no information about the causal relationship between the two events. We don't know whether buying book B caused the customer to buy book A, whether it was the other way around, or whether some other (unobserved) event caused both. We only learn about the non-causal association between these events.
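
The common-cause possibility can be made concrete with a minimal sketch. The scenario is hypothetical and is not the model used in the rest of this section: an unobserved interest in a topic drives both purchases, and the variable names and probabilities are assumptions chosen purely for illustration. Neither purchase influences the other, yet the conditional probability still differs from the marginal one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical unobserved common cause: interest in the topic
interested = rng.random(n) < 0.3

# Each purchase depends only on the interest, never on the other purchase
p_buy = np.where(interested, 0.8, 0.1)
buy_a = rng.random(n) < p_buy
buy_b = rng.random(n) < p_buy

p_a = buy_a.mean()                 # marginal P(book A)
p_a_given_b = buy_a[buy_b].mean()  # conditional P(book A|book B)

print(f"P(book A)        = {p_a:.3f}")
print(f"P(book A|book B) = {p_a_given_b:.3f}")
```

With these assumed probabilities, the marginal is 0.31 in expectation and the conditional is about 0.64, even though no arrow runs between the two purchases.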

<b><u>Structural Causal Models (SCMs)</u></b> are a simple tool for encoding causal relationships between variables. We use one as our <u>data-generating process</u>. After generating the data, we will pretend to forget what our SCM was, in order to mimic a frequent real-world scenario where the true data-generating process is unknown and the only thing we have is observational data. The example below uses the following SCM:

U<sub>0</sub> ~ U(0,1) (a continuous random variable uniformly distributed between 0 and 1)  
U<sub>1</sub> ~ N(0,1) (a normally distributed random variable with a mean value of 0 and a standard deviation of 1)  
A := 1<sub>{U<sub>0</sub>>0.61}</sub> (a binary variable)  
B := 1<sub>{(A + 0.5 * U<sub>1</sub>)>0.2}</sub> (a binary variable)

In [1]:
import numpy as np
from scipy import stats

In [2]:
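class BookSCM:
    def __init__(self, random_seed=None):
        self.random_seed = random_seed
        self.u_0 = stats.uniform()  # U_0 ~ U(0, 1)
        self.u_1 = stats.norm()     # U_1 ~ N(0, 1)

    def sample(self, sample_size=100):
        """Samples from the SCM"""
        # Seed NumPy's global generator, which scipy.stats uses by default
        if self.random_seed is not None:
            np.random.seed(self.random_seed)
        u_0 = self.u_0.rvs(sample_size)
        u_1 = self.u_1.rvs(sample_size)
        a = u_0 > 0.61             # A := 1{U_0 > 0.61}
        b = (a + 0.5 * u_1) > 0.2  # B := 1{A + 0.5 * U_1 > 0.2}

        return a, b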

In [3]:
# Instantiate SCM and set random seed
scm = BookSCM(random_seed=42)

# Sample 100 samples from it
buy_book_a, buy_book_b = scm.sample(100)

# Check whether shapes are as expected
print(f"Book a shape: {buy_book_a.shape}")
print(f"Book b shape: {buy_book_b.shape}")

Book a shape: (100,)
Book b shape: (100,)


In [4]:
# Compute the conditional probability P(book A|book B):
# the share of book A buyers among the customers who bought book B
proba_book_a_given_book_b = buy_book_a[buy_book_b].mean()

print(f"Probability of buying book A given B: {proba_book_a_given_book_b:0.3f}")

Probability of buying book A given B: 0.567


In this sample, the probability of buying book A given that we bought book B is 56.7%. This indicates a positive relationship between the two variables: if there were no association between them, we would expect a result close to 39%.
- If A and B were independent (not associated), then knowing B = 1 wouldn't change the chance of A = 1.
- So the conditional probability P(A=1|B=1) would simply equal the marginal probability P(A=1).
- The marginal probability follows from how A is defined: P(A=1) = P(U<sub>0</sub> > 0.61) = 1 - 0.61 = 0.39, i.e. 39%.
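
The 39% baseline can be compared with the exact conditional probability implied by the SCM. The sketch below assumes the structural equations for A and B given earlier and applies the law of total probability followed by Bayes' rule; the variable names are illustrative.

```python
from scipy import stats

# P(A=1) = P(U_0 > 0.61) for U_0 ~ U(0, 1)
p_a = 1 - 0.61

# B = 1 when A + 0.5 * U_1 > 0.2, i.e. when U_1 > (0.2 - A) / 0.5
p_b_given_a1 = stats.norm.sf((0.2 - 1) / 0.5)  # P(U_1 > -1.6)
p_b_given_a0 = stats.norm.sf((0.2 - 0) / 0.5)  # P(U_1 > 0.4)

# Law of total probability, then Bayes' rule
p_b = p_a * p_b_given_a1 + (1 - p_a) * p_b_given_a0
p_a_given_b = p_a * p_b_given_a1 / p_b

print(f"P(A=1)     = {p_a:.3f}")
print(f"P(A=1|B=1) = {p_a_given_b:.3f}")
```

Under these equations the exact value is about 0.64, so an estimate based on 100 samples can be expected to scatter around that figure rather than around 0.39.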

The result tells us that we can make meaningful predictions using observations alone. This is at the core of most supervised ML models. Associations can have high practical significance in the absence of knowledge of the data-generating process. Conditional probability allowed us to draw conclusions based on the observational data alone.

#### Interventions

We change one thing in the world and observe whether and how this change affects another thing in the world. To describe interventions mathematically, we use the <i>do</i> operator, like so: P(Y = 1|<i>do</i>(X = 0)).

The above reads as the probability that Y = 1, given that we set X to 0. The fact that we change the value of X is critical here: this is the inherent difference between intervening and conditioning (the operation used to obtain conditional probabilities). Conditioning only modifies our view of the data, while intervening affects the distribution by actively setting one (or more) variable(s) to a fixed value (or a distribution). Intervention changes the system, but conditioning does not.
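
The difference can be sketched in code. The function below is a hypothetical illustration: its name, parameters and probabilities are assumptions and are not part of the examples in this section. A hidden variable Z influences both X and Y, and X also influences Y. Conditioning filters the rows where X happens to be 0, whereas the <i>do</i> operator overwrites the assignment of X and then regenerates Y.

```python
import numpy as np

def sample(n, do_x=None, seed=0):
    """Draw n samples from a hypothetical SCM with Z -> X, Z -> Y, X -> Y.

    If do_x is given, X is set to that value for every unit, which
    cuts the influence of Z on X (the intervention do(X = do_x)).
    """
    rng = np.random.default_rng(seed)
    z = rng.random(n) < 0.5
    if do_x is None:
        # Observational regime: X mostly follows Z
        x = rng.random(n) < np.where(z, 0.9, 0.1)
    else:
        # Interventional regime: X is fixed, regardless of Z
        x = np.full(n, bool(do_x))
    # Y depends on both Z and X
    y = rng.random(n) < 0.2 + 0.5 * z + 0.2 * x
    return x, y

# Conditioning: keep only the observed rows where X = 0
x_obs, y_obs = sample(200_000)
p_cond = y_obs[~x_obs].mean()

# Intervening: set X = 0 for everyone, then look at Y
_, y_do = sample(200_000, do_x=0)
p_do = y_do.mean()

print(f"P(Y = 1|X = 0)     = {p_cond:.3f}")
print(f"P(Y = 1|do(X = 0)) = {p_do:.3f}")
```

With these assumed probabilities, the conditional probability is 0.25 in expectation and the interventional one is 0.45. Filtering on X = 0 mostly selects units with Z = 0, while setting X = 0 leaves the distribution of Z untouched.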

For this example, we use the following SCM:<br>
U<sub>0</sub> ~ N(0,1)  
U<sub>1</sub> ~ N(0,1)  
A := U<sub>0</sub>  
B := 5A + U<sub>1</sub>

Here we set A and B to be continuous variables as opposed to binary. 

In [3]:
# Define sample size and set random seed
SAMPLE_SIZE = 100
np.random.seed(42)

# Build SCM
u_0 = np.random.randn(SAMPLE_SIZE)
u_1 = np.random.randn(SAMPLE_SIZE)
a = u_0
b = 5 * a + u_1

# Compute correlation coefficient between A and B
r, p = stats.pearsonr(a, b)

print(f"Mean of B before any intervention: {b.mean():.3f}")
print(f"Variance of B before any intervention: {b.var():.3f}")
print(f"Correlation between A and B:\nr = {r:.3f}; p = {p:.3f}\n")

Mean of B before any intervention: -0.497
Variance of B before any intervention: 20.144
Correlation between A and B:
r = 0.978; p = 0.000



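The correlation between A and B is very high, which is not surprising since B is a linear function of A plus noise. The mean of B is slightly below 0, and the variance is around 20. The population values implied by the SCM are 0 and 5<sup>2</sup> · 1 + 1 = 26; the gap is sampling noise from using only 100 draws. Now let's intervene on A by fixing its value at 1.5.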

In [4]:
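# Intervene on A: replace its structural equation with the constant 1.5
a = np.full(SAMPLE_SIZE, 1.5)
# B keeps its original equation and reuses the same noise term u_1
b = 5 * a + u_1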

print(f"Mean of B after intervention: {b.mean():.3f}")
print(f"Variance of B after intervention: {b.var():.3f}")

Mean of B after intervention: 7.522
Variance of B after intervention: 0.900


The intervention has changed the system: both the mean and the variance of B have changed. The mean moved to roughly 5 · 1.5 = 7.5, because the value we assigned to A is much larger than what we would expect under the original distribution of A (centered at 0). The variance shrank because A became constant, so the only remaining variability in B comes from its stochastic parent, U<sub>1</sub>. What happens if we intervene on B instead?

In [5]:
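# A keeps its original structural equation
a = u_0
# Intervene on B: overwrite it with fresh N(0, 1) draws that ignore A
b = np.random.randn(SAMPLE_SIZE)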

r, p = stats.pearsonr(a, b)

print(f"Mean of B after the intervention on B: {b.mean():.3f}")
print(f"Variance of B after the intervention on B: {b.var():.3f}")
print(f"Correlation between A and B after intervening on B:\nr = {r:.3f}; p = {p:.3f}\n")

Mean of B after the intervention on B: 0.065
Variance of B after the intervention on B: 1.164
Correlation between A and B after intervening on B:
r = 0.191; p = 0.057



The correlation dropped substantially (from 0.978 to 0.191), and the p-value of 0.057 means it is no longer significant at the conventional 0.05 level. After the intervention, A and B show no reliable linear relationship. The result suggests no causal link from B to A, whereas the previous result demonstrated a causal link from A to B: changing A changed B, but changing B left A untouched.
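
The same contrast can be checked from the conditioning side. This is a minimal sketch, assuming the SCM above with a larger sample; the function name, the seed, the sample size, the threshold of 5 and the intervention value of 10 are arbitrary choices made for illustration. The same filter is applied in both regimes: in the observed data it selects units with a large A, while under the intervention it carries no information about A.

```python
import numpy as np

def simulate(n, do_b=None, seed=42):
    """Sample from A := U_0, B := 5A + U_1, optionally under do(B = do_b)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n)
    u_1 = rng.standard_normal(n)
    # Under the intervention, B ignores its usual parents A and U_1
    b = 5 * a + u_1 if do_b is None else np.full(n, float(do_b))
    return a, b

a_obs, b_obs = simulate(100_000)
a_do, b_do = simulate(100_000, do_b=10)

# Same filter in both regimes: units whose B exceeds 5
mean_a_obs = a_obs[b_obs > 5].mean()
mean_a_do = a_do[b_do > 5].mean()

print(f"Mean of A where B > 5 (observed):   {mean_a_obs:.3f}")
print(f"Mean of A where B > 5 (do(B = 10)): {mean_a_do:.3f}")
```

The first mean is well above zero, because in the observed data a large B is evidence of a large A. The second stays near zero, because forcing B to 10 tells us nothing about A.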