### Confounding

Confounding occurs when a model omits a variable that influences both a predictor and the outcome, so that the estimated coefficient for that predictor is misleading.

For example, suppose we were interested in how smoking affects lung capacity in children. If we were to fit a linear model of lung capacity against rate of smoking, we would likely observe that smoking is positively correlated with lung capacity. This may lead us to the false conclusion that smoking increases lung capacity. However, we have left out a highly relevant predictor, namely age. Age is highly correlated with both lung capacity and smoking rate: older children tend to have larger lungs and are more likely to smoke. So if we refit our model, this time also including age as a predictor, we will likely see a different result. That is, after accounting for age, smoking will be associated with smaller lung capacities.
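
The sign flip can be made precise with the standard omitted-variable bias formula. As a sketch, assume (purely for illustration) that the true relationship is linear, with $X$ the smoking rate, $Z$ the age, $Y$ the lung capacity, and an error term $\varepsilon$ that is uncorrelated with both $X$ and $Z$:

$$
Y = \beta_0 + \beta_X X + \beta_Z Z + \varepsilon
$$

If we regress $Y$ on $X$ alone, the slope we estimate converges to

$$
\tilde{\beta}_X = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = \beta_X + \beta_Z \, \frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)}
$$

The second term is the bias. In the smoking example we would expect $\beta_X < 0$, $\beta_Z > 0$ and $\operatorname{Cov}(X, Z) > 0$, so the bias is positive and can be large enough to make $\tilde{\beta}_X$ positive even though $\beta_X$ is negative.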

Confounding leads to biased or misleading estimates of effect sizes and directions, often due to omitted variables, and must be carefully controlled for to avoid false conclusions. It is difficult to directly prevent confounding but the following may help:
- Domain knowledge: Identify potential confounders before modeling (e.g., age, gender, socioeconomic status in medical studies).
- Study design: Randomised controlled trials are designed to eliminate confounding by randomly assigning treatment.
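
To illustrate why random assignment helps, here is a minimal simulation sketch. The effect sizes (`-1.0` for the treatment, `2.0` for the confounder), the sample size, and the helper names `simulate_outcome` and `diff_in_means` are all assumptions made up for this illustration. In the observational setting the confounder drives who receives the treatment; in the randomised setting treatment is assigned by a coin flip, so it is independent of the confounder by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder (e.g. standardised age)
z = rng.normal(0, 1, size=n)

# Observational setting: units with larger z are more likely to be treated
x_obs = (z + rng.normal(0, 1, size=n) > 0).astype(float)

# Randomised setting: treatment assigned by a fair coin flip, independent of z
x_rct = rng.integers(0, 2, size=n).astype(float)


def simulate_outcome(x, z, rng):
    """Outcome with an assumed true treatment effect of -1.0 and confounder effect of 2.0."""
    return -1.0 * x + 2.0 * z + rng.normal(0, 1, size=len(x))


def diff_in_means(x, y):
    """Naive effect estimate: mean outcome of treated minus mean outcome of untreated."""
    return y[x == 1].mean() - y[x == 0].mean()


y_obs = simulate_outcome(x_obs, z, rng)
y_rct = simulate_outcome(x_rct, z, rng)

naive_obs = diff_in_means(x_obs, y_obs)
naive_rct = diff_in_means(x_rct, y_rct)

print(f"Observational estimate (confounded): {naive_obs:.3f}")
print(f"Randomised estimate:                 {naive_rct:.3f}")
```

Under these assumptions the observational estimate has the wrong sign, while the randomised estimate lands close to the assumed true effect of -1.0 without the confounder ever being measured or included.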

In the example below we generate a dataset in which two predictors are correlated with each other as well as with the target. When we fit a model with just one predictor, we get a very different estimate of its effect than when we use both predictors.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# --- Step 1. Generate data ---
np.random.seed(0)
n = 200

# Confounder (Z)
Z = np.random.normal(0, 10, n)

# Predictor (X), correlated with Z
X = Z + np.random.normal(0, 1, n)

# Outcome (Y) depends on both X and Z
eps = np.random.normal(0, 1, n)
Y = 2*Z - 1.5*X + eps

# Put into a DataFrame
df = pd.DataFrame({"X": X, "Z": Z, "Y": Y})


# --- Step 2. Fit models ---

# Model A: Y ~ X only (ignores confounder)
Xa = sm.add_constant(df["X"])
modelA = sm.OLS(df["Y"], Xa).fit()
print("Model A (ignoring confounder):")
print(modelA.summary(), "\n")
print(f"When omitting the confounder Z, the coefficient for X is {modelA.params['X']:.4f}, implying a positive association between X and Y")

# Model B: Y ~ X + Z (includes confounder)
Xb = sm.add_constant(df[["X", "Z"]])
modelB = sm.OLS(df["Y"], Xb).fit()
print("Model B (with confounder):")
print(modelB.summary())
print(f"When including the confounder Z, the coefficient for X is {modelB.params['X']:.4f}, implying a negative association between X and Y once Z is held fixed")

print("Notice how the inclusion of the confounding factor caused the model to completely change its estimate of the size and direction of the effect of X on Y.")


Model A (ignoring confounder):
                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.859
Model:                            OLS   Adj. R-squared:                  0.858
Method:                 Least Squares   F-statistic:                     1205.
Date:                Tue, 02 Sep 2025   Prob (F-statistic):           4.08e-86
Time:                        08:42:58   Log-Likelihood:                -417.34
No. Observations:                 200   AIC:                             838.7
Df Residuals:                     198   BIC:                             845.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1682