# 🌲 Titanic Decision Tree — v4 (Classification)


> **v4 Enhancements**  
> - Robust local CSV loader with fallback (`titanic.csv` or `train.csv`)  
> - EDA-first template with clear "What/Why" notes  
> - Version-agnostic metrics (manual RMSE), safe ROC plotting  
> - Target NaN handling (drop before split)  
> - "What we infer" summary cells at the end  
> - Reproducible `random_state=42`  


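**Why a tree?** A decision tree learns human-readable if/then rules, captures nonlinear effects and feature interactions, and needs no feature scaling because each split depends only on the ordering of a feature's values.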
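
To make those three properties concrete, here is a minimal sketch on made-up toy data (not the Titanic set). The feature names `a` and `b`, the labeling rule, and the scale factor are assumptions chosen for illustration. It prints the learned rules with scikit-learn's `export_text` and checks that rescaling one feature leaves the predictions unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (invented for illustration): the label is 1 only when
# "a" is large AND "b" is small, a rule no single linear cut on one feature captures.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] < 0.5)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Interpretable rules: the fitted tree as nested if/then text
rules = export_text(tree, feature_names=["a", "b"])
print(rules)

# No scaling needed: stretch feature "a" and refit.
# Thresholds move with the scale, but the predictions stay the same.
X_scaled = X * np.array([1024.0, 1.0])
tree_scaled = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_scaled, y)
same_predictions = np.array_equal(tree.predict(X), tree_scaled.predict(X_scaled))
print("Predictions unchanged after rescaling:", same_predictions)
```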

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load_titanic and basic_eda are helpers from this notebook's local utils module
from utils import load_titanic, basic_eda

df = load_titanic()
basic_eda(df)
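
The `utils` module is not shown in this notebook, so the following is only a sketch of what a fallback loader along the lines of `load_titanic` might look like. The function name `load_titanic_sketch`, its parameters, and the search order are assumptions; the real helper may differ.

```python
from pathlib import Path

import pandas as pd


def load_titanic_sketch(candidates=("titanic.csv", "train.csv"), base_dir="."):
    """Hypothetical loader: return the first candidate CSV found under base_dir."""
    for name in candidates:
        path = Path(base_dir) / name
        if path.is_file():
            return pd.read_csv(path)
    # Fail loudly with the names that were tried instead of returning an empty frame
    raise FileNotFoundError(
        f"None of {list(candidates)} found in {Path(base_dir).resolve()}"
    )
```

Raising `FileNotFoundError` with the attempted filenames keeps a missing dataset from surfacing later as a confusing error during EDA.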


In [None]:

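cols = ['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
# Drop rows with a missing target before splitting, so the stratified split
# and the metrics only ever see valid labels
data = df[cols].dropna(subset=['Survived']).copy()
data['Survived'] = data['Survived'].astype(int)
# Alone = 1 when the passenger has no siblings/spouse and no parents/children aboard
data['Alone'] = ((data['SibSp'].fillna(0) + data['Parch'].fillna(0)) == 0).astype(int)

y = data['Survived']
X = data.drop(columns=['Survived'])
# Pclass is ordinal (1-3), so it is treated as numeric here
num_features = ['Pclass','Age','SibSp','Parch','Fare']
# Alone is already 0/1; one-hot encoding it is redundant but harmless for a tree
cat_features = ['Sex','Embarked','Alone']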


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

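# Numeric columns: fill missing values with the median
# Categorical columns: fill with the most frequent value, then one-hot encode
preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), num_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_features),
])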

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

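# Preprocessing lives inside the pipeline, so imputers and the encoder are fit on the training split only
model = Pipeline([
    ('preprocess', preprocess),
    ('clf', DecisionTreeClassifier(random_state=42, max_depth=4)),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)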

{'accuracy': round(accuracy_score(y_test, y_pred),3),
 'precision': round(precision_score(y_test, y_pred),3),
 'recall': round(recall_score(y_test, y_pred),3),
 'f1': round(f1_score(y_test, y_pred),3)}
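
To show how the four scores above relate to each other, here is a small worked example on hand-made labels (invented for illustration, not Titanic results). It derives each metric from the confusion-matrix counts and compares against scikit-learn's functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # of predicted positives, how many are real
recall = tp / (tp + fn)                      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)                        # 3 1 1 3
print(accuracy, precision, recall, f1)       # 0.75 0.75 0.75 0.75
```

On an imbalanced target, accuracy alone can look good while recall on the minority class is poor, which is why the cell reports all four.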


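**What we infer:** A depth-4 tree yields a small set of readable decision rules. `max_depth` is the main knob for overfitting: deeper trees fit the training data more closely but may generalize worse, so compare train and test scores when tuning it.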
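
One way to explore the effect of `max_depth` is to sweep it and compare train and test accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption made so the cell runs without the CSV); the depth grid is also arbitrary. With the Titanic pipeline, the same loop would wrap `model` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not Titanic), with some label noise via flip_y
X_syn, y_syn = make_classification(n_samples=600, n_features=8, n_informative=4,
                                   flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=42, stratify=y_syn)

results = []
for depth in [1, 2, 4, 8, None]:  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    results.append((depth, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f}")
```

A widening gap between train and test accuracy as depth grows is the usual sign of overfitting; the unlimited-depth tree memorizes the training split, noise included.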