# 🔷 Titanic KMeans — v4 (Unsupervised Clustering)


> **v4 Enhancements**  
> - Robust local CSV loader with fallback (`titanic.csv` or `train.csv`)  
> - EDA-first template with clear "What/Why" notes  
> - Version-agnostic metrics (manual RMSE), safe ROC plotting  
> - Target NaN handling (drop before split)  
> - "What we infer" summary cells at the end  
> - Reproducible `random_state=42`  


**Why KMeans?** It uncovers latent groups without using labels, which makes it useful for segmentation.

In [None]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Project helpers: CSV loader and EDA summary
from utils import load_titanic, basic_eda

df = load_titanic()
basic_eda(df)
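
The body of `load_titanic` lives in `utils` and is not shown in this notebook. As a rough illustration of the "local CSV loader with fallback" idea from the enhancement notes, a minimal sketch might look like the following. The function name `load_titanic_sketch`, its parameters, and the error message are assumptions made for illustration, not the actual helper.

```python
from pathlib import Path

import pandas as pd


def load_titanic_sketch(search_dir=".", candidates=("titanic.csv", "train.csv")):
    """Return the first candidate CSV found in search_dir as a DataFrame.

    Hypothetical stand-in for utils.load_titanic; the real helper may differ.
    """
    for name in candidates:
        path = Path(search_dir) / name
        if path.is_file():
            return pd.read_csv(path)
    # No candidate file exists: fail loudly instead of returning an empty frame.
    raise FileNotFoundError(
        f"None of {list(candidates)} found in {Path(search_dir).resolve()}"
    )
```

Trying the file names in a fixed order keeps the behaviour predictable when both files are present: `titanic.csv` wins.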


In [None]:

from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Features used for clustering (no target column is involved)
num_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
cat_features = ['Sex', 'Embarked']
data = df[num_features + cat_features].copy()

# Numeric columns: fill gaps with the median, then standardize so that
# large-valued features such as Fare do not dominate the distance metric
numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical columns: fill gaps with the mode, then one-hot encode
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer([
    ('num', numeric_pipe, num_features),
    ('cat', categorical_pipe, cat_features),
])

pipe = Pipeline([
    ('preprocess', preprocess),
    ('kmeans', KMeans(n_clusters=3, n_init=10, random_state=42)),
])
clusters = pipe.fit_predict(data)

# Number of passengers assigned to each cluster
pd.Series(clusters).value_counts().sort_index()
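
The cell above fixes `n_clusters=3` without showing how that value compares to alternatives. One common sanity check, which is not part of the original notebook, is to compare silhouette scores across several values of `k`. The sketch below runs on synthetic blobs from `make_blobs` as a stand-in for the preprocessed Titanic features, so the scores it produces say nothing about the real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): swap in the preprocessed feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("highest silhouette at k =", best_k)
```

To apply this to the notebook's data, the feature matrix could be obtained with `preprocess.fit_transform(data)` before looping over `k`.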


**What we infer:** Cluster sizes show how passengers split across the three groups, and comparing feature means per cluster (e.g., class, age, fare) shows what distinguishes each segment.
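
Inspecting feature means per cluster can be sketched as a small `groupby`. The helper name `profile_clusters` and the four-row toy frame are assumptions made for illustration; in the notebook, the call would presumably be `profile_clusters(data, clusters, num_features)`.

```python
import pandas as pd


def profile_clusters(data, labels, num_features):
    """Return cluster sizes and per-cluster means of the numeric features."""
    labelled = data.assign(cluster=labels)
    profile = labelled.groupby("cluster")[num_features].mean()
    # Put the cluster size in the first column for quick reading.
    profile.insert(0, "n", labelled.groupby("cluster").size())
    return profile


# Toy values for illustration only, not taken from the Titanic data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Age": [40.0, 50.0, 20.0, 30.0],
    "Fare": [80.0, 100.0, 8.0, 10.0],
})
toy_labels = [0, 0, 1, 1]

profile = profile_clusters(toy, toy_labels, ["Pclass", "Age", "Fare"])
print(profile)
```

The means are computed on the original (unscaled) columns so the profile stays in interpretable units, even though clustering ran on standardized features.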