# Modeling — Linear Regression (Baseline & Diagnostics)

**Chain statement:** In the lecture, we learned to fit a baseline model and to diagnose its assumptions using residuals. Here we apply that workflow to our own dataset to judge how far the model can be trusted and how useful it is.

## 1) Load dataset (cleaned or synthetic fallback)

In [None]:
import numpy as np, pandas as pd
from pathlib import Path

# Change this path to your cleaned dataset if available
DATA = Path("../data/raw/cleaned_or_project.csv")
DATA.parent.mkdir(parents=True, exist_ok=True)

# Synthetic fallback to mimic a regression task
if not DATA.exists():
    rng = np.random.default_rng(10)
    n = 500
    x1 = rng.normal(0, 1, n)
    x2 = rng.normal(10, 2, n)
    noise = rng.normal(0, 0.8, n)
    y = 1.5 * x1 - 0.7 * x2 + 5 + noise
    df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
    DATA = DATA.with_name("synthetic_regression.csv")
    df.to_csv(DATA, index=False)
else:
    df = pd.read_csv(DATA)

print("Using data:", DATA)
df.head()

## 2) Train/Test split

In [None]:
from sklearn.model_selection import train_test_split

target_col = "y"
feature_cols = [c for c in df.columns if c != target_col]

X = df[feature_cols].copy()
y = df[target_col].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

## 3) Fit baseline LinearRegression

In [None]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)

coef_table = pd.DataFrame({
    "feature": ["intercept"] + feature_cols,
    "coef": [linreg.intercept_] + list(linreg.coef_)
})
coef_table

## 4) Predictions, residuals, R², and RMSE

In [None]:
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

y_pred_train = linreg.predict(X_train)
y_pred_test = linreg.predict(X_test)

resid_train = y_train - y_pred_train
resid_test = y_test - y_pred_test

r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)
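# RMSE as the square root of MSE; the `squared=False` argument was deprecated
# and later removed in newer scikit-learn releases, so avoid relying on it.
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))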

pd.DataFrame({
    "set": ["train","test"],
    "R2": [r2_train, r2_test],
    "RMSE": [rmse_train, rmse_test]
})

## 5) Residual diagnostics (assumptions)

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Residuals vs Fitted (train)
axes[0,0].scatter(y_pred_train, resid_train, alpha=0.6)
axes[0,0].axhline(0, color="black", linewidth=1)
axes[0,0].set_title("Residuals vs Fitted (train)")
axes[0,0].set_xlabel("Fitted values")
axes[0,0].set_ylabel("Residuals")

# Histogram of residuals (train)
axes[0,1].hist(resid_train, bins=30)
axes[0,1].set_title("Residual histogram (train)")

# QQ plot (train)
stats.probplot(resid_train, dist="norm", plot=axes[1,0])
axes[1,0].set_title("QQ-plot of residuals (train)")

# Residuals vs key predictor (x1 if exists)
if "x1" in X_train.columns:
    axes[1,1].scatter(X_train["x1"], resid_train, alpha=0.6)
    axes[1,1].axhline(0, color="black", linewidth=1)
    axes[1,1].set_title("Residuals vs x1 (train)")
    axes[1,1].set_xlabel("x1")
    axes[1,1].set_ylabel("Residuals")
else:
    axes[1,1].axis("off")
    axes[1,1].set_title("Add residuals vs key predictor here")

plt.tight_layout()
plt.show()

## 6) Interpretation of assumptions

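- **Linearity**: Inspect *Residuals vs Fitted*. If the residuals scatter randomly around 0 with no systematic shape, the linear-relationship assumption is supported.
- **Independence**: If the observations were sampled independently and the residual plots show no obvious serial correlation (possible extension: a lag-1 scatter plot or autocorrelation plot of the residuals), independence is more credible.
- **Homoscedasticity (constant variance)**: A roughly uniform vertical spread of the residuals indicates that the error variance changes little with the fitted values; a funnel shape suggests heteroscedasticity.
- **Normality**: A roughly symmetric histogram and QQ-plot points that lie close to the reference line support approximately normal residuals, which makes confidence intervals and significance tests approximately valid.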
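
The visual checks above can be complemented with a few numbers. The following is a minimal sketch, not part of the original notebook: `residual_checks` is a hypothetical helper, the demo data are synthetic stand-ins for the fitted values and residuals, and the statistics are rough screening heuristics rather than formal tests.

```python
import numpy as np

def residual_checks(fitted, resid):
    """Return simple numeric summaries that complement the residual plots."""
    fitted = np.asarray(fitted, dtype=float)
    resid = np.asarray(resid, dtype=float)
    centered = resid - resid.mean()
    std = centered.std()
    return {
        # Independence: correlation between consecutive residuals (near 0 is good)
        "lag1_autocorr": float(np.corrcoef(resid[:-1], resid[1:])[0, 1]),
        # Durbin-Watson statistic: values near 2 suggest little lag-1 correlation
        "durbin_watson": float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)),
        # Homoscedasticity: |residual| should not trend with the fitted values
        "abs_resid_vs_fitted_corr": float(np.corrcoef(fitted, np.abs(resid))[0, 1]),
        # Normality: skewness and excess kurtosis are both 0 for a normal distribution
        "skewness": float(np.mean(centered ** 3) / std ** 3),
        "excess_kurtosis": float(np.mean(centered ** 4) / std ** 4 - 3.0),
    }

# Synthetic stand-in (assumption): well-behaved residuals unrelated to the fit
rng = np.random.default_rng(0)
n = 2000
fitted_demo = rng.normal(0, 2, n)
resid_demo = rng.normal(0, 0.8, n)

checks = residual_checks(fitted_demo, resid_demo)
for name, value in checks.items():
    print(f"{name:>26}: {value: .3f}")
```

In this notebook the helper could be called as `residual_checks(y_pred_train, resid_train)`. The two lag-based numbers are only meaningful when the rows have a natural order (for example, time); after a shuffled train/test split they say little about independence.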

## 7) (Optional) Add a transformed feature and refit

In [None]:
# Example: add x1^2 to model to capture mild curvature while remaining a linear regression in parameters
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

if "x1" in X.columns:
    X_aug = X.copy()
    X_aug["x1_sq"] = X_aug["x1"] ** 2
    X_train_aug, X_test_aug, y_train_aug, y_test_aug = train_test_split(X_aug, y, test_size=0.2, random_state=42)

    linreg_aug = LinearRegression().fit(X_train_aug, y_train_aug)
    y_pred_test_aug = linreg_aug.predict(X_test_aug)

    r2_test_aug = r2_score(y_test_aug, y_pred_test_aug)
    # Square root of MSE; avoids the `squared=False` argument removed in newer scikit-learn
    rmse_test_aug = np.sqrt(mean_squared_error(y_test_aug, y_pred_test_aug))

    compare = pd.DataFrame({
        "model": ["baseline", "augmented (+ x1^2)"],
        "R2_test": [r2_test, r2_test_aug],
        "RMSE_test": [rmse_test, rmse_test_aug]
    })
    # A bare expression inside an if-block is not rendered by the notebook, so print it
    print(compare)
else:
    print("No x1 column present; skip augmented example.")

## 8) Conclusion — Do we trust this model?

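- **Fit quality**: Compare R² and RMSE on train and test to look for signs of overfitting or underfitting.
- **Assumptions**: If the residual diagnostics show no strong systematic violations (linearity, constant variance, approximate normality), the model is more trustworthy.
- **Actionables**: If heteroscedasticity or nonlinearity is evident, you can:
  1) transform the target or key features (log, squared terms, interaction terms);
  2) use stabilizing techniques (e.g., weighted regression, robust regression);
  3) collect more features or fix data-quality issues.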
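
As an illustration of the weighted-regression option listed above, here is a minimal sketch on synthetic heteroscedastic data. It is not part of the original notebook: `fit_least_squares` is a hypothetical helper, and the weights `1 / x**2` are an assumption that happens to match how the demo noise is generated. With real data the variance structure is unknown and has to be estimated or argued for.

```python
import numpy as np

def fit_least_squares(X, y, weights=None):
    """Fit least squares with an intercept; returns [intercept, slope_1, ...].

    If `weights` is given, each row is scaled by sqrt(weight), which is
    equivalent to minimizing the weighted sum of squared residuals.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    if weights is not None:
        w = np.sqrt(np.asarray(weights, dtype=float))
        A = A * w[:, None]
        y = y * w
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Synthetic data (assumption): noise standard deviation grows in proportion to x
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

beta_ols = fit_least_squares(x, y)                        # ordinary least squares
beta_wls = fit_least_squares(x, y, weights=1.0 / x**2)    # inverse-variance weights

print("OLS [intercept, slope]:", np.round(beta_ols, 3))
print("WLS [intercept, slope]:", np.round(beta_wls, 3))
```

Both fits should recover a slope close to the true value of 3; the weighted fit downweights the noisy high-`x` rows. With scikit-learn, the same idea can be expressed by passing `sample_weight` to `LinearRegression.fit`.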