
# GLM vs. Linear Regression — Fit Line with Variance Area

This notebook shows how **Generalized Linear Models (GLMs)** differ from **ordinary linear regression** by visualizing:
- the **fit line** (predicted mean) and
- a **variance area (±2·SD)** on the **natural scale**.

It uses a synthetic count dataset whose *true data-generating process* is Poisson with a log link, plus a Gamma-distributed outcome with the same mean to illustrate a second variance structure.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, PoissonRegressor, GammaRegressor
from sklearn.metrics import mean_squared_error
np.random.seed(42)


## 1) Generate synthetic count data (Poisson, log link)

Create a single predictor `x` and a Poisson outcome `y` with:
\[
\log(\mu) = \beta_0 + \beta_1 x,\qquad y \sim \text{Poisson}(\mu).
\]

This ensures the **mean grows exponentially** with `x` and the **variance equals the mean**.
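
The mean-variance claim can be checked by simulation. The following is a minimal sketch, not part of the original analysis: the mean levels, sample size, and seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice, for reproducibility only

# Hypothetical mean levels; any positive values would work
results = {}
for mu in (2.0, 8.0, 16.0):
    draws = rng.poisson(mu, size=200_000)
    results[mu] = (draws.mean(), draws.var(ddof=1))
    print(f"mu={mu:5.1f}  sample mean={results[mu][0]:7.3f}  sample var={results[mu][1]:7.3f}")
```

For each level, the sample mean and the sample variance should both land close to `mu`, which is the property the Poisson band relies on later.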


In [None]:
# Create data
n = 400
x = np.linspace(0, 10, n)
beta0, beta1 = 0.6, 0.22   # true coefficients

# Poisson true mean and outcome
mu_true = np.exp(beta0 + beta1 * x)
y = np.random.poisson(mu_true)

# Gamma true mean (same mu_true, but different variance structure)
# Var(Y) = phi * mu^2 ; choose phi > 0 (dispersion)
phi = 0.5
shape = 1.0 / phi                 # Gamma shape (k)
scale = mu_true * phi             # Gamma scale = mu / shape
y_gamma = np.random.gamma(shape, scale)

# Put into a DataFrame
df = pd.DataFrame({
    "x": x,
    "y_pois": y,
    "y_gamma": y_gamma,
    "mu_true": mu_true
})

df.head()


## 2) Fit models
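- **Linear Regression (OLS)** on counts (treats variance as constant, can predict negatives).
- **Poisson GLM** with log link (ensures a positive mean and variance = mean).
- **Gamma GLM** with log link, fitted to the positive continuous outcome `y_gamma` (variance proportional to the squared mean).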
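
One practical difference between the fits is the scale of their coefficients: the Poisson GLM estimates an intercept and slope on the log scale, so they are directly comparable to the true `beta0` and `beta1`, while the OLS coefficients live on the natural scale. A small sketch, assuming the same coefficients as in the data-generation cell but a separate random generator and hypothetical `*_demo` names so it runs on its own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor

rng = np.random.default_rng(0)                       # independent of the notebook's seed
x_demo = np.linspace(0, 10, 400)
y_demo = rng.poisson(np.exp(0.6 + 0.22 * x_demo))    # true beta0 = 0.6, beta1 = 0.22
X_demo = x_demo.reshape(-1, 1)

pois_demo = PoissonRegressor(alpha=0.0, max_iter=5000).fit(X_demo, y_demo)
ols_demo = LinearRegression().fit(X_demo, y_demo)

# GLM coefficients are on the log scale; OLS coefficients are on the count scale
print("Poisson GLM: intercept, slope =", pois_demo.intercept_, pois_demo.coef_[0])
print("OLS:         intercept, slope =", ols_demo.intercept_, ols_demo.coef_[0])
```

The Poisson estimates should sit near the true values, whereas the OLS slope is a count-per-unit-`x` quantity and has no direct relation to `beta1`.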


In [None]:
# Fit OLS
X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
yhat_ols = ols.predict(X)

# Fit Poisson GLM
pois = PoissonRegressor(alpha=0.0, max_iter=5000).fit(X, y)
yhat_pois = pois.predict(X)

# Fit Gamma GLM
gamma = GammaRegressor(alpha=0.0, max_iter=5000).fit(X, y_gamma)
yhat_gamma = gamma.predict(X)

print("OLS   RMSE:", (mean_squared_error(y, yhat_ols)**0.5))
print("Poiss RMSE:", (mean_squared_error(y, yhat_pois)**0.5))
print("Gamma RMSE:", (mean_squared_error(y_gamma, yhat_gamma)**0.5))  # compare against the Gamma outcome, not the counts


## 3) Variance areas (±2·SD) on the natural scale

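- **OLS (Linear Regression)** assumes **constant variance**: estimate \(\sigma^2\) from residuals and draw a **constant-width** band around the fitted line.
- **Poisson GLM** assumes \(\mathrm{Var}(Y)=\mu\): the **band widens** as the mean increases (using \(\pm 2\sqrt{\mu}\)).
- **Gamma GLM** assumes \(\mathrm{Var}(Y)=\phi\mu^2\): the SD is proportional to the mean, so the band uses \(\pm 2\sqrt{\hat\phi}\,\mu\), with \(\hat\phi\) estimated from the Pearson chi-square statistic divided by the residual degrees of freedom.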
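
The band rules above share one shape: mean plus or minus a multiple of a standard deviation, where only the SD differs by model. A minimal sketch of that idea follows; `sd_band` is a hypothetical helper (not a library function), and the means and the constant SD of 3.0 are made-up illustration values.

```python
import numpy as np

def sd_band(mu, sd, k=2.0, lower_bound=None):
    """Return (lower, upper) = mu -/+ k*sd, optionally clipping the lower edge."""
    mu = np.asarray(mu, dtype=float)
    sd = np.broadcast_to(np.asarray(sd, dtype=float), mu.shape)
    lower, upper = mu - k * sd, mu + k * sd
    if lower_bound is not None:
        lower = np.maximum(lower, lower_bound)
    return lower, upper

mu_demo = np.array([1.0, 4.0, 16.0])                  # hypothetical predicted means

ols_lo, ols_hi = sd_band(mu_demo, sd=3.0)             # constant SD (OLS-style)
pois_lo, pois_hi = sd_band(mu_demo, np.sqrt(mu_demo), lower_bound=0.0)  # SD = sqrt(mu)

print("OLS band width:    ", ols_hi - ols_lo)
print("Poisson band width:", pois_hi - pois_lo)
```

The OLS-style width is the same at every mean, while the Poisson width grows with `mu`; clipping at zero shortens the Poisson band at small means.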


In [None]:
# OLS band (constant)
resid_ols = y - yhat_ols
sigma_ols = np.std(resid_ols, ddof=1)
ols_upper = yhat_ols + 2*sigma_ols
ols_lower = yhat_ols - 2*sigma_ols

# Poisson band (variance = mean)
sd_pois = np.sqrt(np.maximum(yhat_pois, 1e-12))
pois_upper = yhat_pois + 2*sd_pois
pois_lower = np.maximum(0.0, yhat_pois - 2*sd_pois)  # counts can't go below 0

# Gamma band (variance = phi * mu^2)
# Estimate dispersion phi via Pearson chi-square / df
est_g = gamma               # GammaRegressor
p_params = est_g.coef_.size + 1                # +1 for intercept
dof_resid = max(len(y_gamma) - p_params, 1)    # residual degrees of freedom; named so it does not overwrite the DataFrame `df`

mu_g = np.asarray(yhat_gamma, dtype=float)         # predicted mean on natural scale
phi_hat = float(np.sum((y_gamma - mu_g)**2 / np.maximum(mu_g**2, 1e-12)) / dof_resid)

# ±2·SD band on natural scale (clip at 0 since costs are >= 0)
sd_g = np.sqrt(phi_hat) * mu_g                 # SD = sqrt(phi * mu^2) = sqrt(phi) * mu
gamma_upper = mu_g + 2 * sd_g
gamma_lower = np.maximum(0.0, mu_g - 2 * sd_g)


## 4) Visualization A — Poisson GLM: Fit line with variance area
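- The shaded region **expands** as `x` increases because \(\mathrm{Var}(Y)\) grows with \(\mu\).
- This matches the **count data reality** (higher means → higher variance).
- The same cell also draws the **Gamma GLM** band, plotted against samples sorted by predicted mean; its width grows in proportion to \(\mu\).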


In [None]:
# Scatter and Poisson GLM band
plt.figure(figsize=(8,5))
plt.scatter(x, y, alpha=0.35, label="Observed counts")
plt.plot(x, yhat_pois, linewidth=2, label="Poisson GLM (mean)")
plt.fill_between(x, pois_lower, pois_upper, alpha=0.2, label="±2·SD (Poisson)")
plt.xlabel("x")
plt.ylabel("y (counts)")
plt.title("Poisson GLM — Fit line with variance area")
plt.legend()
plt.tight_layout()
plt.show()

# Gamma GLM band, plotted against sample rank (samples sorted by predicted mean)
order = np.argsort(mu_g)
x_axis = np.arange(len(mu_g))   # rank positions 0..n-1; the y-values are reordered with `order` below
plt.figure(figsize=(8,5))
plt.plot(x_axis, mu_g[order], label="Gamma (mean μ)", linewidth=2)
plt.fill_between(x_axis, gamma_lower[order], gamma_upper[order], alpha=0.2, label="Gamma band (±2·SD)")
plt.scatter(x_axis, np.asarray(y_gamma)[order], s=14, alpha=0.25, label="Observed")
plt.xlabel("samples (sorted by μ)"); plt.ylabel("cost (natural scale)")
plt.title("Gamma GLM — Fit line with variance area")
plt.legend(); plt.tight_layout()
plt.show()


## 5) Visualization B — OLS: Fit line with constant variance area
- The shaded region is **constant width** because OLS assumes **homoscedasticity** (constant variance).
- Notice how the band **understates variability** at high means and **overstates it** at low means.
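
The mismatch can also be quantified as empirical coverage: the share of observations that fall inside the constant-width band, computed separately for low and high `x`. This is a sketch under assumptions: it regenerates data with the same coefficients but a larger sample and its own seed, and the cut-offs 2.5 and 7.5 are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 4000)                     # larger n for stable coverage estimates
y_demo = rng.poisson(np.exp(0.6 + 0.22 * x_demo))
X_demo = x_demo.reshape(-1, 1)

fit = LinearRegression().fit(X_demo, y_demo)
mean_hat = fit.predict(X_demo)
sigma_hat = np.std(y_demo - mean_hat, ddof=1)         # single residual SD, as in the OLS band
inside = np.abs(y_demo - mean_hat) <= 2 * sigma_hat   # True where y lies within the band

low, high = x_demo < 2.5, x_demo > 7.5                # arbitrary low/high regions
cov_low, cov_high = inside[low].mean(), inside[high].mean()
print(f"coverage at low x:  {cov_low:.3f}")
print(f"coverage at high x: {cov_high:.3f}")
```

Coverage should come out higher in the low-`x` region than in the high-`x` region, which is the numerical counterpart of the band being too wide at low means and too narrow at high means.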


In [None]:
# Scatter and OLS band
plt.figure(figsize=(8,5))
plt.scatter(x, y, alpha=0.35, label="Observed counts")
plt.plot(x, yhat_ols, linewidth=2, label="OLS (mean)")
plt.fill_between(x, ols_lower, ols_upper, alpha=0.2, label="±2·SD (OLS)")
plt.xlabel("x")
plt.ylabel("y (counts)")
plt.title("OLS — Fit line with constant variance area")
plt.legend()
plt.tight_layout()
plt.show()


## 6) Visualization C — Overlay of mean fits (no bands)
- Compare the **mean predictions** directly:
  - OLS line can cross negative territory (not shown here) and grows linearly.
  - Poisson GLM grows **exponentially** (log link) and stays non‑negative.
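
The negative-prediction risk does not appear inside the plotted range, but it shows up under extrapolation. A minimal sketch, assuming the same true coefficients as above, hypothetical `*_demo` names, and an illustrative out-of-range point `x = -5`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor

rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 400)
y_demo = rng.poisson(np.exp(0.6 + 0.22 * x_demo))
X_demo = x_demo.reshape(-1, 1)

ols_demo = LinearRegression().fit(X_demo, y_demo)
pois_demo = PoissonRegressor(alpha=0.0, max_iter=5000).fit(X_demo, y_demo)

X_new = np.array([[-5.0], [0.0], [10.0]])   # -5 lies outside the training range
ols_new = ols_demo.predict(X_new)
pois_new = pois_demo.predict(X_new)

for xv, o, p in zip(X_new.ravel(), ols_new, pois_new):
    print(f"x={xv:5.1f}  OLS={o:7.2f}  Poisson GLM={p:7.2f}")
```

At `x = -5` the straight line drops below zero, while the Poisson prediction, being the exponential of a linear predictor, stays positive at every `x`.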


In [None]:
plt.figure(figsize=(8,5))
plt.scatter(x, y, alpha=0.2, label="Observed counts")
plt.plot(x, yhat_ols, linewidth=2, label="OLS (mean)")
plt.plot(x, yhat_pois, linewidth=2, label="Poisson GLM (mean)")
plt.xlabel("x")
plt.ylabel("y (counts)")
plt.title("Mean fits — OLS vs Poisson GLM (no bands)")
plt.legend()
plt.tight_layout()
plt.show()


## 7) What to tell a stakeholder

- **Why GLM?** For counts, **variance rises with the mean**; a Poisson GLM models that explicitly. Linear regression assumes constant variance, so its uncertainty bands can be too wide at low means and too narrow at high means.
- **Visual cue:** The **Poisson band widens** as the mean rises, which matches how count processes behave (e.g., more traffic → more variability).
- **Risk:** OLS can suggest **negative counts** or **flat uncertainty** where it shouldn’t.
- **Action:** Use GLM families that match the outcome type (Poisson for counts, Gamma for positive skewed costs, Binomial with a logit link, i.e. logistic regression, for binary outcomes).
