6E1.  List three mechanisms by which multiple regression can produce false inferences about causal effects

**AI Prompt**

some variable is included which opens the gate in the block, variable is missing in the fork, variable is missing in the pipe case

**AI Response**

I can see you're working on understanding how multiple regression can lead to incorrect causal inferences. Your answer touches on some key concepts, but let me help you refine and expand on these ideas.

You've identified three important scenarios. Let me clarify what these might refer to:

1. **Collider bias** (your "gate" case) - including a variable that's caused by both the treatment and outcome
2. **Confounding** (your "fork" case) - omitting a variable that affects both treatment and outcome  
3. **Post-treatment bias** (your "pipe" case) - including a variable that lies on the causal pathway between treatment and outcome, which blocks part of the effect you want to measure

Are you familiar with causal diagrams (DAGs) to visualize these relationships? And would you like to work through a specific example of one of these mechanisms first?

Which of these three feels least clear to you right now?
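
To make the confounding (fork) case concrete, here is a minimal simulation sketch. It assumes linear effects with unit coefficients and Gaussian noise; the helper `ols_slopes` and all variable names are illustrative and not part of the exercise. X has no effect on Y by construction, so any non-zero slope for X comes from the omitted common cause Z.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(0)
n = 10_000

# Fork: Z causes both X and Y. X has NO causal effect on Y (assumed setup).
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = Z + rng.normal(size=n)

naive = ols_slopes(Y, X)[0]        # Z omitted: the fork is left open
adjusted = ols_slopes(Y, X, Z)[0]  # Z included: the fork is closed

print(f"slope of X with Z omitted:  {naive:.2f}")
print(f"slope of X with Z included: {adjusted:.2f}")
```

Under these assumptions the theoretical slopes are 0.5 when Z is omitted and 0 when Z is included, so the simulated values should land close to those.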



6E3.  List the four elemental confounds. Can you explain the conditional dependencies of each?

**AI Prompt**

collider, fork, pipe and descendant

**AI Response**

Good start! You've identified four important causal structures. Let me help you think through the conditional dependencies for each one.

You mentioned: collider, fork, pipe, and descendant. 

Let's start with one at a time. Can you explain what happens with conditional dependencies in a **fork** structure? 

For example, if we have X ← Z → Y (where Z is a common cause of both X and Y), what happens to the relationship between X and Y when we condition on Z versus when we don't?

Once you work through that one, we can move to the others. What do you think?
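
The conditional dependencies of all four structures can be checked numerically. The sketch below assumes linear Gaussian relationships with unit coefficients and uses partial correlation as a stand-in for "conditioning"; the helper names are illustrative. For each structure it reports the correlation between X and Y before and after conditioning.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing c out of each."""
    def residual(v):
        slope, intercept = np.polyfit(c, v, 1)
        return v - (slope * c + intercept)
    return corr(residual(a), residual(b))

rng = np.random.default_rng(1)
n = 20_000

def noise():
    return rng.normal(size=n)

results = {}

# Fork: X <- Z -> Y, conditioning on Z
Z = noise()
X = Z + noise()
Y = Z + noise()
results["fork"] = (corr(X, Y), partial_corr(X, Y, Z))

# Pipe: X -> Z -> Y, conditioning on Z
X = noise()
Z = X + noise()
Y = Z + noise()
results["pipe"] = (corr(X, Y), partial_corr(X, Y, Z))

# Collider: X -> Z <- Y, conditioning on Z
X = noise()
Y = noise()
Z = X + Y + noise()
results["collider"] = (corr(X, Y), partial_corr(X, Y, Z))

# Descendant: X -> Z <- Y with Z -> D, conditioning on D only
D = Z + noise()
results["descendant"] = (corr(X, Y), partial_corr(X, Y, D))

for name, (marginal, conditional) in results.items():
    print(f"{name:10s} marginal r = {marginal:+.2f}   conditional r = {conditional:+.2f}")
```

In the fork and the pipe, X and Y are associated until Z is conditioned on. In the collider they are independent until Z is conditioned on, and conditioning on the descendant D has a similar but weaker effect.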



6E4.  How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.

**AI Prompt**

biased sample has had an effect on some variable that is not in the model, and thus the true cause of the effect is hidden

**AI Response**

You're on the right track with thinking about hidden causes, but let me help you connect this more directly to collider bias.

In a biased sample, the **selection process itself** acts like conditioning on a collider. 

Think about it this way: if your sample selection depends on both the treatment and the outcome (or variables related to both), then you've inadvertently conditioned on a collider - the selection variable.

For example, if you're studying the effect of education on income, but your sample only includes people who applied for a particular job, what might that create? The job application itself might depend on both education level and expected income.

Can you think through what this sampling bias would do to the apparent relationship between education and income in your data?
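
A small simulation can illustrate the selection effect. This is only a sketch under strong assumptions: education and expected income are drawn as independent standardized scores (an assumption made purely for illustration), and people "apply" when the sum of the two is in the top 10%. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Independent in the full population by construction (illustrative assumption).
education = rng.normal(size=n)
income = rng.normal(size=n)

# Selection depends on both variables, so "applied" is a collider.
score = education + income
applied = score > np.quantile(score, 0.9)

r_population = np.corrcoef(education, income)[0, 1]
r_selected = np.corrcoef(education[applied], income[applied])[0, 1]

print(f"correlation in full population: {r_population:+.2f}")
print(f"correlation among applicants:   {r_selected:+.2f}")
```

Restricting the data to applicants is the same as conditioning on the selection variable, which is why a clearly negative correlation appears in the selected sample even though none exists in the population.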



6M2.  Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

In [None]:
import numpy as np
import pandas as pd

np.random.seed(42)
n = 100

# X -> Z -> Y chain
X = np.random.normal(0, 1, n)
Z = 2 * X + np.random.normal(0, 0.5, n)  
Y = 3 * Z + np.random.normal(0, 1, n)

data = pd.DataFrame({'X': X, 'Z': Z, 'Y': Y})

In [None]:
data.corr()

|   | X | Z | Y |
|---|---|---|---|
| X | 1.0 | 0.965494 | 0.951633 |
| Z | 0.965494 | 1.0 | 0.982584 |
| Y | 0.951633 | 0.982584 | 1.0 |


**AI Prompt**

is simulation correct?

**AI Response**

Your simulation looks correct for the DAG X → Z → Y! You've created:

- X as a random normal variable
- Z that depends strongly on X (correlation of 0.965)
- Y that depends on Z

This matches the causal structure you want to test.

Now, to answer the multicollinearity question, you'll need to fit a regression model with both X and Z predicting Y. 

What do you think will happen when you include both X and Z in the model? Will there be multicollinearity problems, even though X and Z are highly correlated?

Try fitting the model and see what happens to the coefficients and their standard errors.



In [None]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

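with pm.Model() as simple:
    # Priors. Note: Normal(0, 0.5) is tight relative to the simulated slope of 3.
    sigma = pm.Exponential("sigma", 1)
    b1 = pm.Normal("b1", 0, 0.5)  # coefficient for X
    a = pm.Normal("a", 0, 0.2)    # intercept
    b2 = pm.Normal("b2", 0, 0.5)  # coefficient for Z
    mu = pm.Deterministic("mu", a + b1 * data["X"] + b2 * data["Z"])

    # Likelihood (bound to y_obs in Python so the simulated array Y is not overwritten)
    y_obs = pm.Normal("Y", mu, sigma, observed=data["Y"])

    simple_trace = pm.sample()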

Initializing NUTS using jitter+adapt_diag...


Multiprocess sampling (4 chains in 4 jobs)


NUTS: [sigma, b1, a, b2]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.


In [None]:
az.summary(simple_trace)

|   | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| b1 | 0.763 | 0.317 | 0.191 | 1.354 | 0.007 | 0.005 | 1792.0 | 2013.0 | 1.00 |
| a | 0.066 | 0.095 | -0.109 | 0.250 | 0.002 | 0.002 | 2621.0 | 2309.0 | 1.01 |
| b2 | 2.699 | 0.164 | 2.402 | 3.001 | 0.004 | 0.003 | 1810.0 | 2175.0 | 1.00 |
| sigma | 1.086 | 0.078 | 0.951 | 1.240 | 0.001 | 0.001 | 3240.0 | 2237.0 | 1.00 |
| mu[0] | 1.216 | 0.152 | 0.918 | 1.486 | 0.003 | 0.003 | 1881.0 | 1751.0 | 1.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| mu[95] | -8.429 | 0.187 | -8.775 | -8.083 | 0.003 | 0.003 | 3627.0 | 2766.0 | 1.00 |
| mu[96] | 0.698 | 0.121 | 0.466 | 0.922 | 0.003 | 0.002 | 2001.0 | 1694.0 | 1.00 |
| mu[97] | 1.882 | 0.105 | 1.682 | 2.073 | 0.002 | 0.002 | 2792.0 | 2854.0 | 1.00 |
| mu[98] | 0.176 | 0.095 | -0.002 | 0.355 | 0.002 | 0.002 | 2646.0 | 2255.0 | 1.01 |


**AI Prompt**

dont see multicollinearity because this is pipe case where Z needs to be included to account for

**AI Response**

Exactly right! You've identified the key insight. 

In this **pipe/chain** structure (X → Z → Y), the strong correlation between X and Z (r = 0.965) does not stop the model from telling the two predictors apart.

Look at your results: the posterior for b2 is tight (mean ≈ 2.70, sd ≈ 0.16), and the model attributes most of the association to Z. One caution about b1: in your simulation Y depends only on Z, so the true direct effect of X is 0 and the true coefficient for Z is 3. The estimate b1 ≈ 0.76 most likely reflects the tight Normal(0, 0.5) priors, which pull b2 below 3 and let the correlated X absorb the difference. Its standard deviation (≈ 0.32) is also about twice that of b2, so the correlation does cost some precision.

This is different from the "legs example" because there both leg lengths carried almost exactly the same information about height, so the model could not tell them apart. Here X and Z are in a causal sequence: Z has its own variation that X does not explain, and Y responds to that variation, so the data can distinguish Z from X.

What do you think would happen if you ran a model with *only* X predicting Y? How would that coefficient compare to what you see here?
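
One way to explore this question without priors getting in the way is ordinary least squares. The sketch below reuses the coefficients from the simulation (Z = 2X + noise, Y = 3Z + noise) but with a larger sample and a different random generator, both of which are assumptions made to keep the estimates stable; `ols_slopes` is an illustrative helper.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(42)
n = 10_000  # larger than the n = 100 used above (assumption, for stability)

# Pipe: X -> Z -> Y
X = rng.normal(0, 1, n)
Z = 2 * X + rng.normal(0, 0.5, n)
Y = 3 * Z + rng.normal(0, 1, n)

total_effect = ols_slopes(Y, X)[0]   # X only: total effect of X on Y
b_x, b_z = ols_slopes(Y, X, Z)       # X and Z together

print(f"X only:      slope of X = {total_effect:.2f}")
print(f"X and Z:     slope of X = {b_x:.2f}, slope of Z = {b_z:.2f}")
```

In theory the X-only slope is 2 × 3 = 6, the whole effect of X flowing through Z. With Z included, the slope of X should be near 0 and the slope of Z near 3, because Z carries everything X has to say about Y.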



6M3. Learning to analyze DAGs requires practice. For each of the four DAGs below, state which variables, if any, you must adjust for (condition on) to estimate the total causal influence of X on Y.

**AI Prompt**

first dag: A > Z > X > Y, Z > Y, A > Y. control for A and Z?

**AI Response**

Good start! Let me help you think through this systematically.

For the DAG: A → Z → X → Y, with additional edges Z → Y and A → Y

You suggested controlling for A and Z. Let's check this step by step:

1. What paths connect X to Y? Can you trace all the pathways from X to Y (both direct and indirect)?

2. Which of these paths represent causal effects we want to keep open, and which represent confounding paths we need to block?

Try listing out the paths first, then we can determine what to condition on. What paths do you see from X to Y?



**AI Prompt**

X > Y is the path from X to Y

**AI Response**

Good! X → Y is indeed one path from X to Y.

But look more carefully at the DAG. Are there any other paths that connect X to Y? Remember, paths can go backwards through arrows too.

For example, could there be a path that goes from X back through some other variables and then to Y?

Take another look at: A → Z → X → Y, Z → Y, A → Y

What other routes can you trace from X to Y?



**AI Prompt**

one path goes through Z and one through A > Z

**AI Response**

Exactly! You've found the backdoor paths. So we have:

1. **X → Y** (the direct causal path we want to measure)
2. **X ← Z → Y** (backdoor path through Z)  
3. **X ← Z ← A → Y** (backdoor path through A and Z)

Now, to estimate the total causal effect of X on Y, we need to block the backdoor paths while keeping the direct path open.

What do you think happens if we condition on Z? Does that block the backdoor paths?



**AI Prompt**

yes

**AI Response**

Great! Yes, conditioning on Z blocks both backdoor paths:

- X ← Z → Y gets blocked at Z
- X ← Z ← A → Y also gets blocked at Z

So you only need to condition on **Z** to identify the total causal effect of X on Y.

Your original suggestion of "A and Z" would work, but it's more than necessary - Z alone is sufficient.

Ready for the next DAG in the problem?
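
This conclusion can be checked with a quick simulation of the first DAG. The sketch assumes linear effects, all with coefficient 1, and Gaussian noise; the true effect of X on Y is therefore 1. The helper `ols_slopes` is illustrative.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(3)
n = 20_000

# First DAG: A -> Z -> X -> Y, Z -> Y, A -> Y (all coefficients assumed to be 1)
A = rng.normal(size=n)
Z = A + rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = X + Z + A + rng.normal(size=n)

unadjusted = ols_slopes(Y, X)[0]          # both backdoor paths open
adjust_z = ols_slopes(Y, X, Z)[0]         # Z blocks both backdoor paths
adjust_a_z = ols_slopes(Y, X, Z, A)[0]    # also valid, but A is not needed

print(f"no adjustment:   {unadjusted:.2f}")
print(f"adjust for Z:    {adjust_z:.2f}")
print(f"adjust for A, Z: {adjust_a_z:.2f}")
```

Both adjusted estimates should sit near the true value of 1, while the unadjusted slope is inflated by the open backdoor paths.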



**AI Prompt**

X > Z > Y, X > Y, A > Z, A > Y. control for A?

**AI Response**

Let me help you work through this DAG systematically.

For: X → Z → Y, X → Y, A → Z, A → Y

First, let's identify all the paths from X to Y:

1. **X → Y** (direct path)
2. **X → Z → Y** (indirect path through Z)
3. Are there any backdoor paths?

Look at the arrows again - can you trace any path that goes backwards from X to some common cause and then forward to Y?

What do you see?



**AI Prompt**

no

**AI Response**

Correct! There are no backdoor paths in this DAG.

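Every path from X to Y leaves X along an outgoing arrow:
- X → Y (direct effect)
- X → Z → Y (indirect effect through Z)
- X → Z ← A → Y (non-causal, but closed because Z is a collider on this path)

Since there are no backdoor paths to block, you don't need to condition on anything to estimate the total causal effect of X on Y. Conditioning on Z would actually hurt here: it would block the indirect effect and open the collider path through A.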

So the answer is: **no adjustment needed**.

Ready for the third DAG?
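
A short simulation of the second DAG shows what is at stake. It is a sketch assuming linear effects with coefficient 1 on every arrow, so the total effect of X on Y is 1 (direct) + 1 × 1 (through Z) = 2. The helper `ols_slopes` is illustrative.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(4)
n = 20_000

# Second DAG: X -> Z -> Y, X -> Y, A -> Z, A -> Y (all coefficients assumed to be 1)
X = rng.normal(size=n)
A = rng.normal(size=n)
Z = X + A + rng.normal(size=n)
Y = X + Z + A + rng.normal(size=n)

unadjusted = ols_slopes(Y, X)[0]     # no backdoor paths, so this is the total effect
with_z = ols_slopes(Y, X, Z)[0]      # Z is a mediator and a collider between X and A

print(f"no adjustment: {unadjusted:.2f}")
print(f"adjust for Z:  {with_z:.2f}")
```

Under these assumptions the unadjusted slope should be close to 2. Including Z gives a slope near 0.5, which is neither the total effect (2) nor the direct effect (1), because conditioning on Z both blocks the indirect path and opens the path through A.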



**AI Prompt**

A > X > Y, A > Z, X > Z, Y > Z, control for nothing?

**AI Response**

Let me help you work through this one step by step.

For the DAG: A → X → Y, A → Z, X → Z, Y → Z

You suggested controlling for nothing. Let's check by finding all paths from X to Y.

I see the direct path X → Y. 

Are there any backdoor paths? Look for paths that go backwards from X through a common cause and then forward to Y.

What do you see when you trace backwards from X?



**AI Prompt**

I see none

**AI Response**

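Good! You're right that nothing needs blocking, with one refinement: there is a backdoor path, but it is already closed.

The paths from X to Y are:
- X → Y (the direct causal path)
- X → Z ← Y (closed, because Z is a collider)
- X ← A → Z ← Y (a backdoor path, also closed at the collider Z)

A does point into X, but its only other route to Y runs through the collider Z, so A does not confound X and Y.

Z is a collider on both non-causal paths, so we definitely don't want to condition on it: that would open them.

So your answer is correct: **control for nothing**.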

Ready for the fourth and final DAG?
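
To see why conditioning on the collider is harmful in the third DAG, here is a minimal simulation sketch. It assumes linear effects with coefficient 1 on every arrow, so the true effect of X on Y is 1; `ols_slopes` is an illustrative helper.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(5)
n = 20_000

# Third DAG: A -> X -> Y, A -> Z, X -> Z, Y -> Z (all coefficients assumed to be 1)
A = rng.normal(size=n)
X = A + rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = A + X + Y + rng.normal(size=n)

unadjusted = ols_slopes(Y, X)[0]   # nothing to block: recovers the true effect
with_z = ols_slopes(Y, X, Z)[0]    # conditioning on the collider Z opens non-causal paths

print(f"no adjustment: {unadjusted:.2f}")
print(f"adjust for Z:  {with_z:.2f}")
```

The unadjusted slope should be close to 1. Once Z is included, the slope of X moves far from 1, which is the collider bias the DAG predicts.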



**AI Prompt**

A > X > Z > Y, A > Z, X > Y, I should control for A and Z?

**AI Response**

Let me help you work through this one carefully.

For the DAG: A → X → Z → Y, A → Z, X → Y

You suggested controlling for A and Z. Let's trace the paths from X to Y first:

1. **X → Y** (direct path)
2. **X → Z → Y** (indirect path through Z)
3. Any backdoor paths?

Look at A - it connects to X, and does it also connect to Y? Trace the paths that go backwards from X through A.

What paths do you find?



**AI Prompt**

A > Z > Y

**AI Response**

Exactly! So you found the backdoor path: X ← A → Z → Y

Now we have:
1. **X → Y** (direct causal path)
2. **X → Z → Y** (indirect causal path) 
3. **X ← A → Z → Y** (backdoor confounding path)

To estimate the total causal effect of X on Y, we need to block that backdoor path.

Which variable could you condition on to block the path X ← A → Z → Y without also blocking the causal path X → Z → Y?



**AI Prompt**

Z

**AI Response**

Close, but not quite. Conditioning on **Z** does block the backdoor path X ← A → Z → Y, but Z also sits on the causal path X → Z → Y. Conditioning on it would block the indirect effect too, leaving only the direct effect X → Y.

So for this DAG, you need to **control for A** to identify the total causal effect of X on Y. A blocks the backdoor path and leaves both causal paths open.

Your original suggestion of "A and Z" has the same problem as Z alone: including Z removes the indirect part of the effect.

Nice work getting through all four DAGs! You're getting good at identifying backdoor paths. Do you feel confident about the approach, or would you like to practice with another example?



**AI Prompt**

I would like to understand the process. I should identify backdoor paths and then try to block them?

**AI Response**

Exactly! You've got the core process right. Here's the systematic approach:

1. **Identify all paths** from X to Y (both causal and backdoor)
2. **Keep causal paths open** - these represent the effect you want to measure
3. **Block backdoor paths** - these create confounding

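A backdoor path is any path from X to Y that starts with an arrow pointing into X (X ← ...), typically running back to a common cause and then forward to Y.

To block a backdoor path, condition on a non-collider along it. Avoid conditioning on colliders (that opens a path) and on variables that lie on a causal path from X to Y (that blocks part of the effect you want to measure).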

Does this process make sense? Would you like to try applying it to a new example, or do you have questions about any of these steps?
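
The path-tracing process can also be sketched in plain Python. The functions below are illustrative helpers, not a library API: they list every path between X and Y, flag the ones that start with an arrow into X, and test whether each is open given a conditioning set. The sketch only checks backdoor paths; it does not verify that the conditioning set avoids variables on causal paths.

```python
def undirected_paths(edges, start, end):
    """All simple paths from start to end, ignoring arrow direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def descendants(edges, node):
    """Every node reachable from node by following arrows forward."""
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def is_open(edges, path, given=()):
    """True if the path transmits association given the conditioning set."""
    given, edge_set = set(given), set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        if (prev, mid) in edge_set and (nxt, mid) in edge_set:
            # Collider: closed unless it or one of its descendants is conditioned on.
            if not ({mid} | descendants(edges, mid)) & given:
                return False
        elif mid in given:
            # Fork or pipe: closed once the middle variable is conditioned on.
            return False
    return True

def open_backdoor_paths(edges, x, y, given=()):
    """Open paths from x to y whose first arrow points into x."""
    edge_set = set(edges)
    return [p for p in undirected_paths(edges, x, y)
            if (p[1], x) in edge_set and is_open(edges, p, given)]

# Edges are (cause, effect) pairs for the first and third DAGs discussed above.
dag1 = [("A", "Z"), ("A", "Y"), ("Z", "X"), ("Z", "Y"), ("X", "Y")]
dag3 = [("A", "X"), ("X", "Y"), ("A", "Z"), ("X", "Z"), ("Y", "Z")]

for name, dag in [("first DAG", dag1), ("third DAG", dag3)]:
    for given in [set(), {"Z"}]:
        paths = open_backdoor_paths(dag, "X", "Y", given)
        print(name, "given", sorted(given), "->", [" - ".join(p) for p in paths])
```

For the first DAG, conditioning on Z closes both backdoor paths. For the third DAG there is nothing open to begin with, and conditioning on the collider Z opens the backdoor path through A.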



**AI Prompt**

in the last example why not condition on A?

**AI Response**

Great question! Let's think about what happens if you condition on A in that last DAG.

The backdoor path was: X ← A → Z → Y

If you condition on A, you block this path at A, which eliminates the confounding.

So conditioning on A alone is a valid solution, and for the total effect it is the right one.

You can block a backdoor path by conditioning on any non-collider along it, so Z would also close this path. But Z is a mediator on the causal path X → Z → Y as well, so conditioning on Z would remove the indirect part of the effect of X.

What would the coefficient on X represent if you conditioned on Z instead of A?
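
A simulation of the last DAG makes the difference between conditioning on A and conditioning on Z concrete. This is a sketch assuming linear effects with coefficient 1 on every arrow, so the direct effect of X on Y is 1 and the total effect is 1 + 1 × 1 = 2. The helper `ols_slopes` is illustrative.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(6)
n = 20_000

# Fourth DAG: A -> X -> Z -> Y, A -> Z, X -> Y (all coefficients assumed to be 1)
A = rng.normal(size=n)
X = A + rng.normal(size=n)
Z = X + A + rng.normal(size=n)
Y = X + Z + rng.normal(size=n)

unadjusted = ols_slopes(Y, X)[0]   # backdoor path through A is open
adjust_a = ols_slopes(Y, X, A)[0]  # backdoor closed, both causal paths open
adjust_z = ols_slopes(Y, X, Z)[0]  # backdoor closed, but the mediator is blocked too

print(f"no adjustment: {unadjusted:.2f}")
print(f"adjust for A:  {adjust_a:.2f}")
print(f"adjust for Z:  {adjust_z:.2f}")
```

Under these assumptions, adjusting for A should give a slope near the total effect of 2, adjusting for Z should give a slope near the direct effect of 1, and the unadjusted slope is biased upward by the open backdoor path.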

