### Ex. 6.7.1 - Collider Bias

Collider bias arises when we want to estimate the relationship between two variables but inappropriately condition on a third variable that both of them influence (a collider). The simplest way to explain it is through an example. Imagine that you want to find out how strongly an actor's acting skill is correlated with their attractiveness. Now let's say that you also know how popular each actor is. We'll suppose that the actor's popularity is linear in both acting skill and attractiveness.

If you just regress the actor's acting skill on their attractiveness, you'll get a correct estimate, but if you include popularity in the regression, something strange happens. If an actor is popular, we can suppose that they are some combination of skillful, hot, and lucky, so popularity leaks some information about acting skill, especially when combined with hotness. As a result, a hot actor will appear less likely than average to be skillful, even if the two traits are independently distributed: an actor who is extremely hot but not very popular is probably being held back by weak acting skill, and vice versa. Popularity is a collider here (a common effect of skill and hotness), and conditioning on it biases the estimator.

In [9]:
##########################
# Demo Code
##########################
import pandas as pd 
from random import random
import statsmodels.formula.api as smf

# Set up the dataset
hotness = [random() for i in range(100000)]
skill = [random() for i in range(100000)]
popularity = [h + s + random() for h, s in zip(hotness,skill)]

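# Column names match the names used in the formulas below, so the models
# read their variables from the DataFrame rather than from the enclosing scope.
actors = pd.DataFrame({'hotness': hotness, 'skill': skill, 'popularity': popularity})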

# Predict the parameter estimates in each case.
# H, S and e_P are independent Uniform(0, 1) draws, and P = H + S + e_P.
# Direct model: S is independent of H, so E[S | H] = E[S] = 1/2 (slope on H is 0).
# For reference, E[S | P] = P/3: H, S and e_P are exchangeable and sum to P.
# Collider model: P - H = S + e_P, and S and e_P are exchangeable, so
# -> E[S | P, H] = (P - H)/2
# Holding P fixed, the slope of this conditional mean in H is -1/2.

# Predicted slope on hotness: unbiased (direct) vs. biased (collider)
hs_correct = 0
hs_incorrect = -1/2

direct_model = smf.ols('skill ~ hotness', data = actors).fit()
indirect_model = smf.ols('skill ~ hotness + popularity', data = actors).fit()

print(f"Predicted Connection Direct {hs_correct}")
print(f"Estimated Connection Direct: {direct_model.params['hotness']}")
print(f"Predicted Connection Collider {hs_incorrect}")
print(f"Estimated Connection Collider: {indirect_model.params['hotness']}")

Predicted Connection Direct 0
Estimated Connection Direct: -0.0011147943011112545
Predicted Connection Collider -0.5
Estimated Connection Collider: -0.49765322575896237
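
The predicted slopes can also be checked without simulation. The following is a minimal sketch, assuming the same setup as the demo (hotness, skill and noise are independent Uniform(0, 1) draws, each with variance 1/12). It solves the population normal equations for a regression of skill on hotness and popularity. The variable names (`v`, `cov_xx`, `cov_xy`, `beta_h`, `beta_p`) are illustrative and do not appear in the demo code.

```python
import numpy as np

# Population moments for independent Uniform(0, 1) draws of hotness (H),
# skill (S) and noise (e), with popularity P = H + S + e.
v = 1.0 / 12.0  # variance of a Uniform(0, 1) variable

# Covariance matrix of the regressors (H, P):
# Var(H) = v, Cov(H, P) = v, Var(P) = 3v.
cov_xx = np.array([[v, v],
                   [v, 3 * v]])

# Covariance of each regressor with the outcome S:
# Cov(H, S) = 0, Cov(P, S) = v.
cov_xy = np.array([0.0, v])

# Population OLS slopes solve cov_xx @ beta = cov_xy.
beta_h, beta_p = np.linalg.solve(cov_xx, cov_xy)

# With hotness as the only regressor, the slope is Cov(H, S) / Var(H).
beta_h_direct = 0.0 / v

print(f"Direct slope on hotness: {beta_h_direct}")
print(f"Collider slope on hotness: {beta_h:.4f}")
print(f"Collider slope on popularity: {beta_p:.4f}")
```

Up to floating-point error, the slope on hotness is 0 in the direct model and -1/2 once popularity is included, in line with the estimates printed above.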


The results above match what we expect: conditioning on popularity opens an otherwise blocked, non-causal path between hotness and skill, which biases the estimator. This path isn't all bad, though, since it does give some insight into how these variables are related under those conditions. Consider a real-world case: in Hollywood we effectively condition on both an actor being popular and on their hotness. So we should expect that if an actor is ugly but still in Hollywood, they either have acting chops or some crazy luck or connections.
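
The Hollywood case can be sketched as selection on the collider rather than as a regression control. The example below is an illustration under assumptions that are not in the original demo: "being in Hollywood" is modelled as landing in the top 10% of popularity, and the seed and the names `cutoff` and `in_hollywood` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 100_000

# Same data-generating process as the demo: independent traits plus luck.
hotness = rng.random(n)
skill = rng.random(n)
popularity = hotness + skill + rng.random(n)

# Hypothetical selection rule: only the most popular 10% make it to Hollywood.
cutoff = np.quantile(popularity, 0.9)
in_hollywood = popularity >= cutoff

# Correlation between hotness and skill, before and after selection.
corr_all = np.corrcoef(hotness, skill)[0, 1]
corr_selected = np.corrcoef(hotness[in_hollywood], skill[in_hollywood])[0, 1]

print(f"Correlation, all actors: {corr_all:.3f}")
print(f"Correlation, Hollywood only: {corr_selected:.3f}")
```

Across all simulated actors the correlation should sit near zero, while among the selected actors it should come out clearly negative: within Hollywood, a less attractive actor is more likely to have made it on skill or luck.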