
# Correlation is Not Causation

In this chapter, we explore one of the most common misunderstandings in statistics: confusing **correlation** with **causation**. Along the way we also distinguish both from **independence**.

Understanding these concepts is essential for data analysis, scientific research, and informed decision-making.

---



## Definitions

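**Correlation** measures the strength and direction of the statistical association between two variables. The Pearson coefficient used in this chapter captures *linear* association and ranges from -1 to 1. A coefficient close to -1 or 1 indicates that the variables move together, but not necessarily that one causes the other.

**Independence** means that knowing the value of one variable gives no information about the other. Independent variables are uncorrelated, but uncorrelated variables are not necessarily independent.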

**Causation** implies that changes in one variable bring about changes in another.

We will explore these using data, visualizations, and tests.


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr, chi2_contingency

sns.set_theme(style='whitegrid')
np.random.seed(42)


In [None]:

# Simulate correlated data
x = np.random.normal(0, 1, 100)
y = 2 * x + np.random.normal(0, 1, 100)

df = pd.DataFrame({'x': x, 'y': y})
sns.scatterplot(data=df, x='x', y='y')
plt.title('Scatter Plot of Correlated Variables')
plt.show()

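# Pearson correlation coefficient (strength of the linear association)
corr, p_value = pearsonr(df['x'], df['y'])
# Use general formatting so a very small p-value is not printed as 0.000
print(f"Pearson correlation: {corr:.2f}, p-value: {p_value:.3g}")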
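
Pearson's coefficient only captures linear association, so it can understate a relationship that is strong but curved. The sketch below compares it with Spearman's rank correlation on a hypothetical pair in which one variable is a strictly increasing, nonlinear function of the other. The names `x_mono` and `y_mono`, the seed, and the sample size are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)  # illustrative seed (assumption)

# Hypothetical data: y_mono is a strictly increasing but nonlinear function of x_mono
x_mono = rng.normal(0, 1, 500)
y_mono = np.exp(x_mono)

# Pearson measures linear association; Spearman measures monotonic association
pearson_r, _ = pearsonr(x_mono, y_mono)
spearman_rho, _ = spearmanr(x_mono, y_mono)

print(f"Pearson r:    {pearson_r:.2f}")
print(f"Spearman rho: {spearman_rho:.2f}")
```

Because the ranks of `x_mono` and `y_mono` agree exactly, Spearman's coefficient reaches its maximum while Pearson's stays noticeably lower.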


In [None]:

# Simulate independent variables
a = np.random.normal(0, 1, 100)
b = np.random.normal(0, 1, 100)

df_indep = pd.DataFrame({'a': a, 'b': b})
sns.scatterplot(data=df_indep, x='a', y='b')
plt.title('Scatter Plot of Independent Variables')
plt.show()

# Pearson test: a non-significant result is consistent with independence but does not prove it
corr, p_value = pearsonr(df_indep['a'], df_indep['b'])
print(f"Pearson correlation: {corr:.2f}, p-value: {p_value:.3f}")
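
A Pearson test only looks for linear association. One way to probe independence more broadly is to bin both variables and run a chi-square test of independence on the resulting contingency table, using `chi2_contingency` from the imports above. The following is a rough sketch under stated assumptions: the variables `u` and `v`, the seed, the sample size, and the choice of quartile bins are all illustrative, and the result depends on how the data are binned.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)  # illustrative seed (assumption)

# Hypothetical independent variables, generated separately
u = rng.normal(0, 1, 1000)
v = rng.normal(0, 1, 1000)

# Bin each variable into quartiles (the number of bins is an arbitrary choice)
u_bins = pd.qcut(u, q=4, labels=False)
v_bins = pd.qcut(v, q=4, labels=False)

# Count how often each pair of bins occurs together
table = pd.crosstab(u_bins, v_bins)

# Chi-square test of independence on the 4x4 contingency table
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.3g}")
```

A large p-value here means the test found no evidence of dependence at this bin resolution. It does not establish independence.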


In [None]:

# Simulate a confounding variable: z drives both x and y, which do not affect each other
z = np.random.normal(0, 1, 100)
x = 2 * z + np.random.normal(0, 1, 100)
y = -3 * z + np.random.normal(0, 1, 100)

df_spurious = pd.DataFrame({'x': x, 'y': y, 'z': z})
sns.scatterplot(data=df_spurious, x='x', y='y')
plt.title('Spurious Correlation via a Confounding Variable')
plt.show()

corr, _ = pearsonr(df_spurious['x'], df_spurious['y'])
print(f"Correlation between x and y: {corr:.2f} (spurious)")
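
When the confounder is observed, its influence can be removed before correlating the two variables. The sketch below estimates a partial correlation by regressing each variable on the confounder and correlating the residuals. It regenerates data with the same structure as the cell above, but the variable names (`z_c`, `x_c`, `y_c`), the helper `residuals`, the seed, and the larger sample size are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)  # illustrative seed (assumption)
n = 1000

# Same structure as the confounding example: z_c drives both x_c and y_c
z_c = rng.normal(0, 1, n)
x_c = 2 * z_c + rng.normal(0, 1, n)
y_c = -3 * z_c + rng.normal(0, 1, n)

def residuals(target, control):
    """Return what is left of `target` after a least-squares linear fit on `control`."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)

# Raw correlation, which includes the shared dependence on z_c
raw_r, _ = pearsonr(x_c, y_c)

# Partial correlation: correlate the parts of x_c and y_c not explained by z_c
partial_r, _ = pearsonr(residuals(x_c, z_c), residuals(y_c, z_c))

print(f"Raw correlation:     {raw_r:.2f}")
print(f"Partial correlation: {partial_r:.2f}")
```

The raw correlation is strongly negative, while the partial correlation is close to zero, which is what we expect when the association comes entirely from the confounder. This approach only works for confounders that are measured and related linearly to both variables.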



## Your Turn: Explore Causation

Try changing the relationships between variables and test for correlation. Does correlation imply causation? Try creating a scenario where there is causation but low correlation.
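
One possible starting point for the last task is a relationship that is causal by construction but symmetric rather than linear. In this sketch the outcome is generated directly from the square of the input, so the input determines the outcome, yet the linear correlation is close to zero. The names `x_sym` and `y_sq`, the seed, and the noise level are hypothetical choices.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)  # illustrative seed (assumption)

# x_sym is spread symmetrically around zero; y_sq is generated from x_sym squared
x_sym = np.linspace(-3, 3, 201)
y_sq = x_sym ** 2 + rng.normal(0, 0.5, x_sym.size)

# The linear correlation is near zero because the positive and negative halves cancel
r_linear, _ = pearsonr(x_sym, y_sq)

# Correlating with the squared input recovers the relationship
r_after_transform, _ = pearsonr(x_sym ** 2, y_sq)

print(f"Correlation of x with y:   {r_linear:.2f}")
print(f"Correlation of x^2 with y: {r_after_transform:.2f}")
```

Try replacing the square with other symmetric functions, such as `np.abs` or `np.cos`, and see whether the same pattern holds.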
