# Sporious Relationships

In [1]:
from scipy.stats import norm
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrices
%matplotlib inline

  from pandas.core import datetools


## Introduction
a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are not causally related to each other, yet it may be wrongly inferred that they are, due to either coincidence or the presence of a certain third, unseen factor. [Wikipedia](https://en.wikipedia.org/wiki/Spurious_relationship)

### Case 1: Antecedent (common cause)
A non-causal correlation can be spuriously created by an antecedent (W) which causes both  X and Y (W → X and W → Y). 

Example: Economic grows (W) results in home construction bulding growth (X) and jewelry sales grows (Y).  

Lets simulate that:

In [2]:
N = 500
df = pd.DataFrame()
df['W'] = norm.rvs(size=N)

In [3]:
a = 5
b = 0.5
df['X'] = a + b * norm.rvs(size=N, loc=df['W'])
df['Y'] = a + b * norm.rvs(size=N, loc=df['W'])

Let's look at the linear relation between W and X, which should be similar to W and Y:

In [4]:
Y, W = dmatrices('Y ~ W', data=df)
mod = sm.OLS(Y, W)
res = mod.fit()
res.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.491
Dependent Variable:,Y,AIC:,734.2489
Date:,2017-11-20 11:02,BIC:,742.6781
No. Observations:,500,Log-Likelihood:,-365.12
Df Model:,1,F-statistic:,482.3
Df Residuals:,498,Prob (F-statistic):,2.91e-75
R-squared:,0.492,Scale:,0.25325

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,4.9998,0.0225,222.0567,0.0000,4.9556,5.0440
W,0.5134,0.0234,21.9622,0.0000,0.4674,0.5593

0,1,2,3
Omnibus:,1.143,Durbin-Watson:,2.088
Prob(Omnibus):,0.565,Jarque-Bera (JB):,1.186
Skew:,0.054,Prob(JB):,0.553
Kurtosis:,2.788,Condition No.:,1.0


And now look at the relationship between Y and X:

In [5]:
Y, X = dmatrices('Y ~ X', data=df)
mod2 = sm.OLS(Y, X)
res2 = mod2.fit()
res2.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.248
Dependent Variable:,Y,AIC:,929.4226
Date:,2017-11-20 11:02,BIC:,937.8518
No. Observations:,500,Log-Likelihood:,-462.71
Df Model:,1,F-statistic:,165.5
Df Residuals:,498,Prob (F-statistic):,6.63e-33
R-squared:,0.249,Scale:,0.37418

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,2.3735,0.2048,11.5885,0.0000,1.9711,2.7759
X,0.5265,0.0409,12.8652,0.0000,0.4461,0.6069

0,1,2,3
Omnibus:,0.459,Durbin-Watson:,2.112
Prob(Omnibus):,0.795,Jarque-Bera (JB):,0.488
Skew:,0.073,Prob(JB):,0.783
Kurtosis:,2.954,Condition No.:,39.0


#### Conclusion for case 1
for both Y ~ W and Y ~ W the linear regression indicated significant and positive association. Please note that the AIC and BIC are better for Y ~ W.  

### Case 2: Intervening variables
X → W → Y

Example for this case: Good rain season last year (X) resulted in increase in vegetation at the desert (W). As as result, the population of the rabbits has grown (Y).
One might assume, that rain causes the increase in rabbit population (X $\rightarrow$ Y).

Lets simulate that:

In [6]:
df2 = pd.DataFrame()
df2['X'] = norm.rvs(size=N)

In [7]:
a2 = 1
b2 = 0.2
df2['W'] = a2 + b2 * norm.rvs(size=N, loc=df['X'])
df2['Y'] = a2 + b2 * norm.rvs(size=N, loc=df['W'])

Now check the relationship between X ~ Y and W ~ W:

In [8]:
Y, X = dmatrices('Y ~ X', data=df2)
mod3 = sm.OLS(Y, X)
res3 = mod3.fit()
res3.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.018
Dependent Variable:,Y,AIC:,158.7247
Date:,2017-11-20 11:02,BIC:,167.1539
No. Observations:,500,Log-Likelihood:,-77.362
Df Model:,1,F-statistic:,9.963
Df Residuals:,498,Prob (F-statistic):,0.00169
R-squared:,0.020,Scale:,0.080105

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,1.0049,0.0127,79.2869,0.0000,0.9800,1.0298
X,-0.0410,0.0130,-3.1564,0.0017,-0.0666,-0.0155

0,1,2,3
Omnibus:,3.05,Durbin-Watson:,1.985
Prob(Omnibus):,0.218,Jarque-Bera (JB):,2.748
Skew:,-0.106,Prob(JB):,0.253
Kurtosis:,2.705,Condition No.:,1.0


In [9]:
Y, W = dmatrices('Y ~ W', data=df2)
mod4 = sm.OLS(Y, W)
res4 = mod4.fit()
res4.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.025
Dependent Variable:,Y,AIC:,155.0097
Date:,2017-11-20 11:02,BIC:,163.4389
No. Observations:,500,Log-Likelihood:,-75.505
Df Model:,1,F-statistic:,13.75
Df Residuals:,498,Prob (F-statistic):,0.000232
R-squared:,0.027,Scale:,0.079512

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,0.6068,0.1075,5.6437,0.0000,0.3956,0.8181
W,0.1996,0.0538,3.7083,0.0002,0.0939,0.3054

0,1,2,3
Omnibus:,1.61,Durbin-Watson:,1.97
Prob(Omnibus):,0.447,Jarque-Bera (JB):,1.547
Skew:,-0.055,Prob(JB):,0.461
Kurtosis:,2.751,Condition No.:,21.0


This time, the linear regression between X ~ Y doesn't look very promising, with a very small coefficient for X and a large P value. The short conclusion is, that intervening variables are easier to detect (is that always true?).