# 🧩 Titanic PCA — v4 (Dimensionality Reduction)


> **v4 Enhancements**  
> - Robust local CSV loader with fallback (`titanic.csv` or `train.csv`)  
> - EDA-first template with clear "What/Why" notes  
> - Version-agnostic metrics (manual RMSE), safe ROC plotting  
> - Target NaN handling (drop before split)  
> - "What we infer" summary cells at the end  
> - Reproducible `random_state=42`  


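**Why PCA?** PCA projects the preprocessed features onto the two directions of greatest variance, so we can plot the passengers in 2D. The explained variance ratio then tells us how much of the total variance that 2D view retains.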
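To make "explained variance" concrete, here is a minimal NumPy sketch on synthetic data. The array `X`, its shape, and the per-column scales are assumptions for illustration only, not the Titanic features. It centers the data, takes an SVD, and turns the singular values into per-component variance ratios, which should correspond to what scikit-learn reports as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative assumption): 200 samples, 5 features with very different spreads
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# PCA works on centered data
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance along each principal direction, and its share of the total variance
component_variance = S**2 / (X.shape[0] - 1)
explained_ratio = component_variance / component_variance.sum()

# 2D projection: coordinates of each sample along the first two directions
Z2 = Xc @ Vt[:2].T

print(explained_ratio[:2], explained_ratio[:2].sum())
```

If the first two ratios sum to a value close to 1, the 2D plot loses little information; a small sum means the scatter hides most of the structure.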

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Local helper module (not shown in this notebook)
from utils import load_titanic, basic_eda

df = load_titanic()
basic_eda(df)
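The `utils` module is not shown here, so the real `load_titanic` may differ. As a rough sketch of the "local CSV loader with fallback" idea from the v4 notes, a loader could try each candidate filename in turn. The name `load_titanic_sketch` and its `candidates` and `data_dir` parameters are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_titanic_sketch(candidates=("titanic.csv", "train.csv"), data_dir="."):
    """Hypothetical loader: return the first candidate CSV found in data_dir."""
    for name in candidates:
        path = Path(data_dir) / name
        if path.is_file():
            return pd.read_csv(path)
    # No candidate exists: fail loudly and say where we looked
    raise FileNotFoundError(
        f"None of {list(candidates)} found in {Path(data_dir).resolve()}"
    )
```

Trying `titanic.csv` first and `train.csv` second follows the order in the v4 notes. Raising `FileNotFoundError` instead of returning an empty frame keeps later cells from running on missing data.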


In [None]:

cols = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
data = df[cols].copy()

num_features = ['Pclass','Age','SibSp','Parch','Fare']
cat_features = ['Sex','Embarked']

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

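# Numeric columns: fill missing values with the median, then standardize
numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
# Categorical columns: fill missing values with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# sparse_threshold=0 always returns a dense array, so PCA never receives sparse input
preprocess = ColumnTransformer([
    ('num', numeric_pipe, num_features),
    ('cat', categorical_pipe, cat_features),
], sparse_threshold=0)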

pipe = Pipeline([('preprocess', preprocess), ('pca', PCA(n_components=2, random_state=42))])
Z = pipe.fit_transform(data)

plt.figure(figsize=(6, 5))
plt.scatter(Z[:, 0], Z[:, 1], alpha=0.6)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA 2D Projection')
plt.show()

# Fraction of total variance captured by each component, and by both together
explained = pipe.named_steps['pca'].explained_variance_ratio_
explained, explained.sum()


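**What we infer:** The explained variance ratios show how much of the total variance PC1 and PC2 capture, individually and combined. The scatter plot shows whether the 2D projection separates passengers into visible groups; coloring the points by `Survived` (if labels are available) makes this easier to judge.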
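One possible way to color the projection by label is sketched below. The helper `scatter_by_label` is a hypothetical name, and the demo arrays are synthetic stand-ins for `Z` and the `Survived` column, so the picture it draws says nothing about the real data.

```python
import numpy as np
import matplotlib.pyplot as plt


def scatter_by_label(Z, labels, ax=None):
    """Hypothetical helper: draw one scatter series per distinct label value."""
    Z = np.asarray(Z)
    labels = np.asarray(labels)
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))
    for value in np.unique(labels):
        mask = labels == value
        ax.scatter(Z[mask, 0], Z[mask, 1], alpha=0.6, label=f"Survived={value}")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_title("PCA 2D Projection by Survived")
    ax.legend()
    return ax


# Synthetic stand-ins (assumption): 100 projected points, half labeled 0 and half labeled 1
rng = np.random.default_rng(42)
Z_demo = rng.normal(size=(100, 2))
labels_demo = np.repeat([0, 1], 50)

ax = scatter_by_label(Z_demo, labels_demo)
```

In this notebook the call would presumably be `scatter_by_label(Z, df['Survived'])` followed by `plt.show()`. That assumes `df` has a `Survived` column with no missing values and that its rows still line up with `Z`.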