# 02 – PCA Analysis

Goals:
- Load the data and reuse the preprocessor built earlier
- Apply PCA to the scaled numeric (or fully encoded) data
- Decide the number of components via cumulative explained variance (e.g., 90–95%)
- Visualize the first 2–3 PCs colored by target

Key Concepts:
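PCA projects data onto orthogonal axes that successively maximize variance. It requires centered input, and scaled input when features use different units (scikit-learn's `PCA` centers the data itself but does not scale it). One-hot encoding (OHE) of categorical features inflates dimensionality, and components are harder to interpret than the original features.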
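
To make the idea concrete, here is a minimal sketch on synthetic data (not the heart disease dataset) showing that centering plus an SVD reproduces the explained variance ratios reported by `PCA`, and that the resulting axes are orthogonal. The names `X_demo`, `ratio_svd`, and `pca_demo` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 correlated features (purely illustrative)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X_demo = rng.normal(size=(200, 3)) @ mixing

# Center the data, then take the SVD; rows of Vt are the principal axes
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured along each axis, and its share of the total
var_svd = S**2 / (len(X_demo) - 1)
ratio_svd = var_svd / var_svd.sum()

pca_demo = PCA().fit(X_demo)
print(np.allclose(ratio_svd, pca_demo.explained_variance_ratio_))  # ratios agree
print(np.allclose(Vt @ Vt.T, np.eye(3)))                           # axes are orthonormal
```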

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import joblib

DATA_PATH = Path('../data/heart_disease.csv')
df = pd.read_csv(DATA_PATH)
target_col = 'target' if 'target' in df.columns else 'num'
y = df[target_col]
X = df.drop(columns=[target_col])

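# Load the preprocessor built earlier. It is assumed to be already fitted,
# so only transform here; refitting would discard the saved fit.
# If it was saved unfitted, use fit_transform instead.
preprocessor = joblib.load('../models/preprocessor.pkl')
X_enc = preprocessor.transform(X)
X_enc.shape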
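
The fit-versus-transform distinction matters once a held-out split exists. A minimal sketch, assuming a plain `StandardScaler` in place of the saved preprocessor and synthetic data in place of the CSV (`X_all`, `scaler_demo`, and the split sizes are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix (illustrative only)
X_all = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
X_tr, X_te = train_test_split(X_all, test_size=0.25, random_state=42)

# Learn the scaling statistics from the training split only
scaler_demo = StandardScaler().fit(X_tr)

# Reuse those statistics for both splits; never refit on the test split
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

The training split ends up with zero mean per feature, while the test split is only approximately centered, which is the expected behavior.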

## 1. Fit PCA (Full Dimensionality)

In [None]:
pca_full = PCA(random_state=42)
pca_full.fit(X_enc.toarray() if hasattr(X_enc,'toarray') else X_enc)
expl_var = pca_full.explained_variance_ratio_
expl_var[:10]

## 2. Cumulative Explained Variance Plot

In [None]:
cum_var = expl_var.cumsum()
plt.figure(figsize=(8,5))
plt.plot(range(1, len(cum_var)+1), cum_var, marker='o')
plt.axhline(0.90, color='r', ls='--', label='90%')
plt.axhline(0.95, color='g', ls='--', label='95%')
plt.xlabel('Components')
plt.ylabel('Cumulative Explained Variance')
plt.legend(); plt.title('Cumulative Explained Variance by #Components')
plt.show()

## 3. Choose k Components
Pick the smallest k whose cumulative explained variance reaches the chosen threshold (e.g., 95%).

In [None]:
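# argmax returns the first index where the condition is True;
# +1 converts that zero-based index into a component count.
k_95 = np.argmax(cum_var >= 0.95) + 1
k_90 = np.argmax(cum_var >= 0.90) + 1
k_90, k_95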
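
As a cross-check, scikit-learn can apply the same rule itself: passing a float between 0 and 1 as `n_components` (with `svd_solver='full'`) keeps just enough components to exceed that share of variance. A minimal sketch on synthetic data, where `X_demo`, `k_manual`, and `pca_95` are illustrative names:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded matrix: 12 features with decaying scales
X_demo = rng.normal(size=(300, 12)) * np.linspace(3.0, 0.2, 12)

# Manual rule: smallest k whose cumulative ratio reaches 95%
cum = PCA().fit(X_demo).explained_variance_ratio_.cumsum()
k_manual = int(np.argmax(cum >= 0.95) + 1)

# Built-in rule: a float n_components is read as a variance threshold
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

print(k_manual, pca_95.n_components_)
```

The two counts should agree except in the edge case where a cumulative value equals the threshold exactly.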

## 4. 2D Projection (PC1 vs PC2) Colored by Target

In [None]:
pca2 = PCA(n_components=2, random_state=42)
X_pca2 = pca2.fit_transform(X_enc.toarray() if hasattr(X_enc,'toarray') else X_enc)
plt.figure(figsize=(7,6))
sns.scatterplot(x=X_pca2[:,0], y=X_pca2[:,1], hue=y, palette='coolwarm', alpha=0.7)
plt.title('PCA Scatter (PC1 vs PC2)')
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.show()

## 5. Variance Contribution of First 10 PCs

In [None]:
n_top = min(10, len(expl_var))  # guard against fewer than 10 components
plt.figure(figsize=(8,4))
plt.bar(range(1, n_top + 1), expl_var[:n_top])
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title(f'Top {n_top} Principal Components')
plt.show()

## 6. Save Fitted PCA (Optional)

In [None]:
joblib.dump(pca_full, '../models/pca_full.pkl')
print('Saved PCA model.')
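
A saved full PCA can later be reloaded and truncated to the first k components by slicing the transformed output. A minimal sketch, assuming synthetic data and a temporary directory instead of `../models/`; `k_demo`, `X_train_demo`, and `X_new_demo` are hypothetical:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(100, 6))  # stand-in for the encoded training data
X_new_demo = rng.normal(size=(5, 6))      # stand-in for unseen rows

pca_fit = PCA(random_state=42).fit(X_train_demo)
k_demo = 3  # hypothetical choice; in practice use k_90 or k_95

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'pca_demo.pkl'
    joblib.dump(pca_fit, path)
    pca_loaded = joblib.load(path)

# Components are ordered by variance, so the first k columns are the top-k projection
Z_new = pca_loaded.transform(X_new_demo)[:, :k_demo]
print(Z_new.shape)  # (5, 3)
```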

## Notes & Pitfalls
- This notebook densifies sparse OHE output with `.toarray()` before PCA; watch memory, and consider `TruncatedSVD` for high-dimensional sparse data.
- Interpret components cautiously: each is a linear combination of the encoded variables.
- Consider running PCA only on the scaled numeric subset if categorical interpretability is needed.
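
For the sparse case mentioned above, a minimal sketch of `TruncatedSVD` on a random sparse matrix (a stand-in for wide OHE output; the shape, density, and number of components are arbitrary assumptions). Unlike `PCA`, `TruncatedSVD` does not center the data, so its components are not identical to principal components.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a wide one-hot encoded design matrix
X_sparse_demo = sparse.random(500, 200, density=0.02, format='csr', random_state=0)

# Works directly on the sparse matrix; no .toarray() needed
svd_demo = TruncatedSVD(n_components=10, random_state=42)
Z_sparse = svd_demo.fit_transform(X_sparse_demo)

print(Z_sparse.shape)  # (500, 10)
print(svd_demo.explained_variance_ratio_.sum())
```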