
# "Correlation is Not Causation"

This article examines one of the most frequently quoted sentences in statistics. Starting from the sentence "Correlation is not causation", it discusses the definitions of **correlation**, **independence**, and **causation**.

<!--
Understanding these concepts is essential for data analysis, scientific research, and informed decision-making.
-->



## Definitions

<!--
**Correlation** measures the statistical association between two variables. A high correlation indicates that the variables move together, but not necessarily that one causes the other.

**Independence** means that knowing the value of one variable gives no information about the other.

**Causation** implies that changes in one variable bring about changes in another.

We will explore these using data, visualizations, and tests.
-->

### Correlation

In the broad sense, correlation is any statistical relationship between two random variables, whether causal or not.

The most common measure of correlation is the **Pearson correlation coefficient**. The Pearson correlation between two random variables $X$, $Y$ is defined as the ratio

$$\rho_{XY} := \dfrac{\text{cov}(X,Y)}{\sigma_X \sigma_Y} \ ,$$

of the covariance $\text{cov}(X,Y) = \mathbb{E}\left[ (X-\mu_X) (Y-\mu_Y) \right]$ to the product of the standard deviations, where $\mu_Z = \mathbb{E}\left[ Z \right]$ is the expected value of a variable $Z$ and $\sigma_{Z} = \sqrt{\mathbb{E}\left[ \left( Z - \mu_Z \right)^2 \right]}$ is its standard deviation.
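
As a minimal sketch of the definition above, the following snippet estimates $\rho_{XY}$ term by term from samples and compares the result with NumPy's `np.corrcoef`. The data (a linear relation with unit noise), the sample size and the seed are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical data: y depends linearly on x, plus independent noise
x = rng.normal(0.0, 1.0, 1000)
y = 2.0 * x + rng.normal(0.0, 1.0, 1000)

# Sample estimates of the quantities appearing in the definition
mu_x, mu_y = x.mean(), y.mean()
cov_xy = np.mean((x - mu_x) * (y - mu_y))     # cov(X, Y)
sigma_x = np.sqrt(np.mean((x - mu_x) ** 2))   # sigma_X
sigma_y = np.sqrt(np.mean((y - mu_y) ** 2))   # sigma_Y

rho_manual = cov_xy / (sigma_x * sigma_y)
rho_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual: {rho_manual:.4f}, numpy: {rho_numpy:.4f}")
```

The normalization factor ($1/N$ or $1/(N-1)$) cancels in the ratio, so both estimates agree up to rounding.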

### Statistical independence

Two random variables $X$, $Y$ are **statistically independent** if the conditional probability $p(X|Y)$ is equal to the unconditional probability $p(X)$,

$$p(X|Y) = p(X)$$

Thus, the joint probability reads

$$p(X,Y) = p(X|Y) p(Y) = p(X) p(Y) \ , $$

i.e. the joint probability of independent random variables is the product of their unconditional probabilities.

As $p(X,Y) = p(Y|X) p(X)$, it also follows that $p(Y|X) = p(Y)$.
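
The relations above can be checked numerically on a small discrete example. This is only a sketch: the marginal probabilities `p_x` and `p_y` are made-up values chosen for illustration.

```python
import numpy as np

# Hypothetical marginal distributions of two discrete variables
p_x = np.array([0.2, 0.5, 0.3])   # p(X = x_i)
p_y = np.array([0.6, 0.4])        # p(Y = y_j)

# Joint distribution of independent variables: p(x_i, y_j) = p(x_i) p(y_j)
p_xy = np.outer(p_x, p_y)

# Conditional distributions obtained from the joint distribution
p_x_given_y = p_xy / p_xy.sum(axis=0, keepdims=True)  # column j is p(X | Y = y_j)
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)  # row i is p(Y | X = x_i)

# Every column of p(X|Y) equals p(X), every row of p(Y|X) equals p(Y)
print(np.allclose(p_x_given_y, p_x[:, None]))
print(np.allclose(p_y_given_x, p_y[None, :]))
```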

#### Statistical independence implies no correlation

Since statistical independence of the variables $X$, $Y$ implies $p(X,Y) = p(X) p(Y)$, the expected value of a product factorizes into the product of the expected values, and direct computation of the covariance $\text{cov}(X,Y)$ gives

$$\begin{aligned}
  \text{cov}(X,Y)
  & = \mathbb{E}\left[ \left( X - \mu_X \right) \left( Y - \mu_Y \right) \right] = \\
  & = \mathbb{E}\left[ X - \mu_X \right] \mathbb{E} \left[ Y - \mu_Y \right] = 0 \ .
\end{aligned}$$
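
The converse does not hold in general: zero covariance does not imply independence. A minimal sketch, assuming a variable $X$ uniformly distributed on $\{-1, 0, 1\}$ and $Y = X^2$ (an example chosen here for illustration), computes the exact covariance and compares a conditional with an unconditional probability.

```python
import numpy as np

# Hypothetical example: X uniform on {-1, 0, 1}, and Y = X^2 (fully determined by X)
x_vals = np.array([-1.0, 0.0, 1.0])
p_x = np.array([1 / 3, 1 / 3, 1 / 3])
y_vals = x_vals ** 2

# Exact expected values and covariance
mu_x = np.sum(p_x * x_vals)
mu_y = np.sum(p_x * y_vals)
cov_xy = np.sum(p_x * (x_vals - mu_x) * (y_vals - mu_y))

# Dependence: p(Y = 1 | X = 1) differs from p(Y = 1)
p_y1 = np.sum(p_x[y_vals == 1.0])
p_y1_given_x1 = float(y_vals[x_vals == 1.0][0] == 1.0)

print(f"cov(X, Y) = {cov_xy:.3f}")
print(f"p(Y=1) = {p_y1:.3f}, p(Y=1 | X=1) = {p_y1_given_x1:.3f}")
```

Here the covariance vanishes by symmetry, while $p(Y|X) \neq p(Y)$, so the variables are uncorrelated but not independent.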

#### Correlation of samples drawn from independent random variables

The sample covariance $\hat{S}_N$ of $N$ samples $\{ (X_n, Y_n) \}_{n=1:N}$,

$$\hat{S}_N := \dfrac{1}{N-1} \sum_{n=1}^{N} \left(X_n - \overline{X}_N \right) \left(Y_n - \overline{Y}_N \right) \ ,$$

drawn from independent random variables $X$, $Y$ is itself a random variable with zero expected value, but its realizations are non-zero in general.

In other words, finite samples of independent (and thus uncorrelated) variables have, in general, non-zero sample covariance and hence non-zero sample correlation.
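
A short simulation can make this concrete. The sketch below assumes independent standard normal variables and repeats the experiment many times for a few sample sizes; the function name `sample_covariances`, the number of trials and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n_trials = 2000                 # number of repeated experiments per sample size

def sample_covariances(n_samples):
    """Sample covariance of n_samples independent pairs, for each trial."""
    x = rng.normal(size=(n_trials, n_samples))
    y = rng.normal(size=(n_trials, n_samples))
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / (n_samples - 1)

results = {n: sample_covariances(n) for n in (10, 100, 1000)}

for n, s in results.items():
    print(f"N={n:5d}  mean={s.mean():+.4f}  std={s.std():.4f}")
```

The mean over trials stays close to zero for every $N$, while the spread of the individual realizations shrinks as $N$ grows.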

### Causality

Causality is the relation between two events, in which one (the **cause**) is - at least partly - responsible for the other event (the **effect**), and the effect is - at least partly - dependent on the cause.

The principle of causality implies that the cause precedes the effect.

In general, an event may have multiple causes (that lie in its past) or have multiple effects.

#### Necessary, sufficient and contributory causes
- $x$ is necessary for $y$ if the occurrence of $y$ implies a prior occurrence of $x$
- $x$ is sufficient for $y$ if the occurrence of $x$ implies the subsequent occurrence of $y$
- $x$ is contributory for $y$ if it is one among several co-occurring causes.
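
As a rough illustration of the first two definitions, the sketch below checks whether a set of recorded occurrences is consistent with $x$ being necessary or sufficient for $y$. The observations and the helper names `is_necessary` and `is_sufficient` are hypothetical, the temporal ordering of the events is ignored, and consistency with the data does not by itself establish a causal relation.

```python
# Hypothetical observations: (x occurred, y occurred)
observations = [(True, True), (True, False), (False, False), (True, True)]

def is_necessary(obs):
    """Consistent with 'x is necessary for y': y never occurs without x."""
    return all(x for x, y in obs if y)

def is_sufficient(obs):
    """Consistent with 'x is sufficient for y': y occurs whenever x does."""
    return all(y for x, y in obs if x)

print(is_necessary(observations))   # True
print(is_sufficient(observations))  # False, because of the pair (True, False)
```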

## Pearl's work, *Causal Inference in Statistics: A Primer*





### Ladder of causation

Three levels of causation:

- **Association** is defined as the conditional probability,
  
   $$P(A|B) \ ,$$

   and has no causal implication: it carries no cause-effect directionality, and both events may be caused by a third event

- **Intervention** requires an event to be performed (and not just observed), in a *minimal way*, i.e. with minimal intrusiveness and unintended effects on the world. This action is represented mathematically with the *do-calculus* formalism. To quantify the effect of performing action $B$ on $A$, the probability

   $$P(A| \text{do}(B)) \ ,$$

   is required, where $\text{do}(\cdot)$ is the operator representing the intervention

- **Counterfactuals** involve the consideration of an alternate version of the cause (a past event), and the analysis of its effects on the same experimental unit/system of interest. ...

  $$P(A| B, C)$$
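
The difference between the first two levels can be sketched with a simulation. The model below is an assumption made for illustration only: a hidden binary event $Z$ influences both $B$ and $A$, and $B$ also has a direct effect on $A$; all probabilities are made-up values. Observing $B$ gives $P(A|B)$, while forcing $B$ regardless of $Z$ gives $P(A|\text{do}(B))$.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 200_000

def simulate(do_b=None):
    """Simulate events A and B; if do_b is given, B is set by intervention."""
    z = rng.random(n) < 0.5                         # hidden common cause Z
    if do_b is None:
        b = rng.random(n) < np.where(z, 0.8, 0.2)   # B is influenced by Z
    else:
        b = np.full(n, do_b, dtype=bool)            # B is forced, ignoring Z
    a = rng.random(n) < (0.1 + 0.3 * b + 0.5 * z)   # A depends on both B and Z
    return a, b

# Association: observe the system and condition on B
a_obs, b_obs = simulate()
p_a_given_b = a_obs[b_obs].mean()

# Intervention: perform B on every unit
a_do, _ = simulate(do_b=True)
p_a_do_b = a_do.mean()

print(f"P(A | B)     ~ {p_a_given_b:.3f}")
print(f"P(A | do(B)) ~ {p_a_do_b:.3f}")
```

In this model the two quantities differ because, in the observational data, seeing $B$ also carries information about $Z$.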

   

### Model

**Causal diagram**: directed graph showing causal relationships, built from nodes (the variables) connected by arrows representing causal influence.

**Elements.**
- **Junction** patterns:
  - chain, $A \rightarrow B \rightarrow C$
  - fork at $B$, $A \leftarrow B \rightarrow C$
  - collider at $B$, $A \rightarrow B \leftarrow C$
- **Node** types:
  - mediator
  - confounder: affects multiple outcomes, creating a (positive or negative) correlation among them that does not reflect a causal link between them
  - instrumental variable...
  - ...
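
The three junction patterns behave differently when the middle node $B$ is taken into account. The sketch below assumes linear relations with unit Gaussian noise (an illustrative choice) and compares the correlation between $A$ and $C$ with the correlation that remains after removing the linear effect of $B$ from both; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 50_000

def noise():
    return rng.normal(size=n)

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

def residual(u, b):
    """Part of u that is not explained by a linear fit on b."""
    slope, intercept = np.polyfit(b, u, 1)
    return u - (slope * b + intercept)

def analyse(a, b, c):
    """Correlation of A and C, before and after controlling for B."""
    return corr(a, c), corr(residual(a, b), residual(c, b))

# Chain: A -> B -> C
a = noise(); b = a + noise(); c = b + noise()
chain = analyse(a, b, c)

# Fork at B: A <- B -> C
b = noise(); a = b + noise(); c = b + noise()
fork = analyse(a, b, c)

# Collider at B: A -> B <- C
a = noise(); c = noise(); b = a + c + noise()
collider = analyse(a, b, c)

for name, (marginal, given_b) in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(f"{name:9s} corr(A, C) = {marginal:+.3f}   controlling for B: {given_b:+.3f}")
```

In the chain and the fork, controlling for $B$ removes the correlation between $A$ and $C$; in the collider, $A$ and $C$ are uncorrelated until $B$ is controlled for.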

### Associations


### Interventions

### Counterfactuals


## Examples

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr, chi2_contingency

sns.set(style='whitegrid')
np.random.seed(42)


In [None]:

# Simulate correlated data
x = np.random.normal(0, 1, 100)
y = 2 * x + np.random.normal(0, 1, 100)

df = pd.DataFrame({'x': x, 'y': y})
sns.scatterplot(data=df, x='x', y='y')
plt.title('Scatter Plot of Correlated Variables')
plt.show()

# Pearson correlation coefficient
corr, p_value = pearsonr(df['x'], df['y'])
print(f"Pearson correlation: {corr:.2f}, p-value: {p_value:.3f}")


In [None]:

# Simulate independent variables
a = np.random.normal(0, 1, 100)
b = np.random.normal(0, 1, 100)

df_indep = pd.DataFrame({'a': a, 'b': b})
sns.scatterplot(data=df_indep, x='a', y='b')
plt.title('Scatter Plot of Independent Variables')
plt.show()

# Correlation test
corr, p_value = pearsonr(df_indep['a'], df_indep['b'])
print(f"Pearson correlation: {corr:.2f}, p-value: {p_value:.3f}")


In [None]:

# Simulate a confounding variable
z = np.random.normal(0, 1, 100)
x = 2 * z + np.random.normal(0, 1, 100)
y = -3 * z + np.random.normal(0, 1, 100)

df_spurious = pd.DataFrame({'x': x, 'y': y, 'z': z})
sns.scatterplot(data=df_spurious, x='x', y='y')
plt.title('Spurious Correlation via a Confounding Variable')
plt.show()

corr, _ = pearsonr(df_spurious['x'], df_spurious['y'])
print(f"Correlation between x and y: {corr:.2f} (spurious)")
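
A hedged follow-up to the confounder example: the sketch below reuses the same generating mechanism (with its own seed and a larger sample, both arbitrary choices) and compares the observed correlation with the one obtained when `x` is set by an experimenter, independently of `z`. Since `y` is generated without any reference to `x`, the intervention leaves `y` unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n = 10_000

# Same mechanism as in the confounder example: z drives both x and y
z = rng.normal(0, 1, n)
x_obs = 2 * z + rng.normal(0, 1, n)
y = -3 * z + rng.normal(0, 1, n)

# Intervention: x is assigned independently of z; y does not depend on x
x_do = rng.normal(0, 1, n)

corr_obs = np.corrcoef(x_obs, y)[0, 1]
corr_do = np.corrcoef(x_do, y)[0, 1]

print(f"Observed correlation between x and y: {corr_obs:.2f}")
print(f"Correlation when x is set by intervention: {corr_do:.2f}")
```

The strong observed correlation disappears under intervention, which is consistent with `x` having no causal effect on `y` in this model.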



## Your Turn: Explore Causation

Try changing the relationships between variables and test for correlation. Does correlation imply causation? Try creating a scenario where there is causation but low correlation.
