# Notebook 03: Multicollinearity & PCA

## Detecting Redundancy

When features are highly correlated, coefficients become unstable and hard to interpret. Variance Inflation Factor (VIF) diagnoses the problem. Principal Component Analysis (PCA) reveals the underlying structure, rotating data to uncorrelated axes.

---

## What is Multicollinearity?

Multicollinearity occurs when features are highly correlated with each other. This causes:

- **Unstable coefficients**: Small data changes cause large coefficient swings
- **High variance**: Coefficients have large standard errors
- **Uninterpretable results**: Coefficients don't reflect true feature importance
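
The instability described above can be seen on synthetic data. The following is a minimal sketch, not part of the diabetes workflow: the helper `coef_spread`, the correlation values, and the sample sizes are all illustrative assumptions. It refits ordinary least squares on bootstrap resamples and compares how much the coefficients move when two features are nearly collinear versus uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200


def coef_spread(rho, n_resamples=200):
    """Std. dev. of OLS slopes across bootstrap resamples.

    x1 and x2 are built to have population correlation `rho`.
    (Hypothetical helper for illustration only.)
    """
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = x1 + x2 + rng.normal(size=n)          # true slopes are (1, 1)
    X = np.column_stack([np.ones(n), x1, x2])

    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)      # sample rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs.append(beta[1:])                # drop the intercept
    return np.std(coefs, axis=0)


spread_collinear = coef_spread(rho=0.999)
spread_uncorrelated = coef_spread(rho=0.0)
print("slope std, rho=0.999:", spread_collinear)
print("slope std, rho=0.0  :", spread_uncorrelated)
```

Both settings use the same true coefficients and noise level, so any difference in spread comes from the correlation between the two features.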

## Variance Inflation Factor (VIF)

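VIF measures how much the variance of a coefficient estimate is inflated by that feature's correlation with the other features: $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the others. A VIF of 1 means no inflation; a common rule of thumb treats VIF > 10 as problematic multicollinearity.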
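
To make the definition concrete, here is a small sketch that computes VIF directly from auxiliary regressions using only NumPy. The function name `vif_by_hand` and the synthetic columns are assumptions for illustration; the notebook itself uses `variance_inflation_factor` from statsmodels.

```python
import numpy as np


def vif_by_hand(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Auxiliary regression: intercept plus every column except j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)   # nearly a copy of `a`
c = rng.normal(size=500)             # unrelated to both
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs)
```

The two near-duplicate columns should come out well above the threshold of 10, while the unrelated column stays close to 1.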

## Principal Component Analysis (PCA)

PCA rotates data to uncorrelated axes (principal components), ordered by how much variance each one captures. Here it serves as a diagnostic tool: if a few components capture most of the variance, the original features are largely redundant.
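
The rotation can be sketched with a plain eigendecomposition of the covariance matrix. This is an illustrative example on two synthetic features (the variable names and coefficients are assumptions), not the scikit-learn `PCA` call used later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=1000)   # strongly correlated with x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                         # center before rotating

# Eigenvectors of the covariance matrix are the principal axes.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending order
order = np.argsort(eigvals)[::-1]              # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs                                # data expressed on the principal axes
corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
variance_share = eigvals / eigvals.sum()

print("correlation before rotation:", corr_before)
print("correlation after rotation :", corr_after)
print("variance share per component:", variance_share)
```

After the rotation the two columns are uncorrelated up to floating-point error, and the first component carries most of the total variance because the second feature adds little new information.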

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

import sys
from pathlib import Path
# Use the parent directory as project root when running from notebooks/
cwd = Path().resolve()
project_root = cwd.parent if cwd.name == 'notebooks' else cwd
sys.path.insert(0, str(project_root))

from src.utils import set_seed

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

set_seed(42)
print("✓ Imports successful!")

## Step 1: Load and Prepare Data

In [None]:
# Load and split data
data = load_diabetes(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

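# Standardize for VIF and PCA (scaler is fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Keep the original row index so the frame stays aligned with y_train
X_train_scaled_df = pd.DataFrame(
    X_train_scaled, columns=X_train.columns, index=X_train.index
)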

print(f"Training set: {X_train.shape}")

## Step 2: Compute VIF

VIF quantifies multicollinearity for each feature. As a rule of thumb, VIF > 10 indicates problematic correlation with the other features.

In [None]:
# === TODO: Compute VIF for numeric features after standardization
# Hints:
#   - Use variance_inflation_factor from statsmodels
#   - Add constant column: add_constant(X_train_scaled_df)
#   - Compute VIF for each feature column (skip the added 'const' column)
#   - Flag features with VIF > 10
# Acceptance: VIF table printed; flag VIF > 10

## Step 3: PCA Analysis

PCA reveals the underlying structure and shows how many components capture most variance.

In [None]:
# === TODO: PCA on standardized X; plot cumulative explained variance
# Hints:
#   - Fit PCA on X_train_scaled
#   - Compute cumulative explained variance
#   - Plot scree plot (components vs explained variance)
#   - Find number of components for 90% variance
#   - Save to images/03_pca_scree_plot.png
# Acceptance: Scree plot and number of components for 90% variance
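
A sketch of one way to approach this step follows. It is self-contained and reloads the data; the plot styling, the variable names (`cum_var`, `n_components_90`, `out_path`), and the assumption that `images/` is relative to the current working directory are all choices made for this example.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same split and scaling as Step 1.
X = load_diabetes(as_frame=True).data
X_train, _ = train_test_split(X, test_size=0.2, random_state=42)
X_scaled = StandardScaler().fit_transform(X_train)

pca = PCA().fit(X_scaled)                       # keep all components
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative share reaches 90%.
n_components_90 = int(np.searchsorted(cum_var, 0.90) + 1)

components = np.arange(1, len(cum_var) + 1)
fig, ax = plt.subplots()
ax.bar(components, pca.explained_variance_ratio_, alpha=0.6, label="Per component")
ax.plot(components, cum_var, marker="o", color="black", label="Cumulative")
ax.axhline(0.90, linestyle="--", color="red", label="90% threshold")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("Scree plot")
ax.legend()

# Assumption: images/ lives under the current working directory.
out_path = Path("images") / "03_pca_scree_plot.png"
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, dpi=150, bbox_inches="tight")

print(f"Components needed for 90% variance: {n_components_90}")
```

If the notebook is run from `notebooks/`, the output path may need to be built from `project_root` instead.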

## Step 4: Optional - Refit on Principal Components

Compare model performance using original features vs principal components.

In [None]:
# === TODO: Optional: refit Ridge on top k principal components
# Hints:
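#   - Transform X_train and X_test with the scaler and PCA fitted on X_train only
#   - Fit Ridge (from sklearn.linear_model; not imported in Setup) on the top k components
#   - Compare RMSE to a Ridge fit on the original features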
# Acceptance: Report RMSE on PCs vs original features
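
The comparison could be sketched as below. The values `k = 5` and `alpha = 1.0` are untuned placeholders, and the helper `rmse` is a hypothetical convenience function. Wrapping the scaler and PCA in a pipeline keeps both fitted on the training split only, so nothing from the test split leaks into the transformation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def rmse(model):
    """Fit on the training split and return RMSE on the test split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


k = 5          # placeholder; Step 3 suggests a data-driven choice
alpha = 1.0    # Ridge default, not tuned

rmse_original = rmse(make_pipeline(StandardScaler(), Ridge(alpha=alpha)))
rmse_pcs = rmse(
    make_pipeline(StandardScaler(), PCA(n_components=k), Ridge(alpha=alpha))
)

print(f"RMSE, original features : {rmse_original:.2f}")
print(f"RMSE, top {k} components  : {rmse_pcs:.2f}")
```

Dropping components discards some variance in the features, so the PCA model is not guaranteed to match the full model; the point of the exercise is to see how large that gap is.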

## Summary

VIF diagnoses multicollinearity feature by feature, while PCA reveals the underlying structure of the feature set as a whole. Both help us understand feature redundancy.

**Next**: Notebook 04 will explore SHAP values for model interpretability.