# 📘 Titanic Logistic Regression — v4 (CSV, EDA-first)

> **v4 Enhancements**  
> - Robust local CSV loader with fallback (`titanic.csv` or `train.csv`)  
> - EDA-first template with clear "What/Why" notes  
> - Version-agnostic metrics, safe ROC plotting (guarded by `try`/`except`)  
> - Target NaN handling (drop before split)  
> - "What we infer" summary cells at the end  
> - Reproducible `random_state=42`  


## 0) Goal & Why
- **Task:** Predict `Survived` (0/1) → binary classification
- **Why:** Classic example to teach metrics beyond accuracy

## 1) Load Data (Local CSV)

In [None]:

import numpy as np
import pandas as pd

# Helpers from the local `utils` module
from utils import load_titanic, basic_eda, plot_hist, bar_from_group, print_section

df = load_titanic()
df.head()
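
The `utils` module is not shown in this notebook, so the exact behaviour of `load_titanic` is unknown. As a rough sketch of the fallback idea described in the v4 notes (try `titanic.csv`, then `train.csv`), a loader could look like the following. The name `load_titanic_sketch` and its parameters are hypothetical and are not the real helper's API.

```python
from pathlib import Path

import pandas as pd


def load_titanic_sketch(data_dir=".", candidates=("titanic.csv", "train.csv")):
    """Hypothetical loader: return the first candidate CSV found in data_dir."""
    base = Path(data_dir)
    for name in candidates:
        path = base / name
        if path.is_file():
            # First match wins, so titanic.csv takes priority over train.csv
            return pd.read_csv(path)
    raise FileNotFoundError(
        f"None of {list(candidates)} found in {base.resolve()}"
    )
```

Raising `FileNotFoundError` with the searched names makes a missing-file failure explicit instead of surfacing later as an undefined `df`.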


## 2) EDA — Structure, Missingness, Class Balance + Key Rates

In [None]:

basic_eda(df)
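
What `basic_eda` prints is defined in `utils` and not shown here. For the "structure and missingness" part of this step, a minimal stand-in might summarise dtype and missing values per column. The function name `eda_summary_sketch` and the toy frame below are illustrative assumptions, not the real helper or the Titanic data.

```python
import numpy as np
import pandas as pd


def eda_summary_sketch(frame):
    """Hypothetical summary: dtype, missing count, and missing percent per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
    })


# Toy frame (illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Sex": ["male", "female", "female", "male"],
})
toy_summary = eda_summary_sketch(toy)
```

Columns with a high `pct_missing` are the ones the imputers in step 4 will have to fill.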


In [None]:

# Class balance
prop = df['Survived'].value_counts(normalize=True).rename('proportion')
prop
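
The class proportions also give the accuracy of a trivial "always predict the majority class" model, which is the number any real model has to beat. The sketch below uses made-up toy labels, not the Titanic data, to show why accuracy alone is not enough.

```python
import pandas as pd

# Toy labels (illustrative only): 6 negatives, 4 positives
toy_y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="Survived")

toy_prop = toy_y.value_counts(normalize=True)
majority_class = toy_prop.idxmax()
# Accuracy of a model that always predicts the majority class
baseline_accuracy = toy_prop.max()

# The same model never predicts the positive class, so its recall is zero
always_majority = pd.Series([majority_class] * len(toy_y))
positives = toy_y == 1
baseline_recall = (always_majority[positives] == 1).mean()
```

Here the baseline reaches 60% accuracy while finding none of the positives, which is the motivation for reporting precision, recall, and F1 later on.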


In [None]:

# Survival rate by Sex and Pclass
sex_rate = df.groupby('Sex')['Survived'].mean().sort_values(ascending=False)
pclass_rate = df.groupby('Pclass')['Survived'].mean()
bar_from_group(sex_rate, title="Survival Rate by Sex", ylabel="Rate", ylim01=True)
bar_from_group(pclass_rate, title="Survival Rate by Pclass", ylabel="Rate", ylim01=True)


## 3) Target & Features (safety: drop NaN target if present)

In [None]:

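cols = ['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
data = df[cols].copy()

# Alone = 1 when the passenger has no siblings/spouse or parents/children aboard
# (missing counts are treated as 0)
data['Alone'] = ((data['SibSp'].fillna(0) + data['Parch'].fillna(0)) == 0).astype(int)

# Drop rows with a missing target before splitting
data = data.dropna(subset=['Survived']).reset_index(drop=True)

y = data['Survived']
X = data.drop(columns=['Survived'])

# Pclass is ordinal (1/2/3) and is treated as numeric here
num_features = ['Pclass','Age','SibSp','Parch','Fare']
cat_features = ['Sex','Embarked','Alone']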
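
To see how the `Alone` flag and the target-NaN drop behave on edge cases, here is a small check on a made-up frame (the values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative only): one missing target, one missing SibSp
toy = pd.DataFrame({
    "Survived": [1, np.nan, 0, 1],
    "SibSp": [0, 1, np.nan, 0],
    "Parch": [0, 0, 0, 2],
})

# Same expression as in the cell above: missing counts are treated as 0
toy["Alone"] = ((toy["SibSp"].fillna(0) + toy["Parch"].fillna(0)) == 0).astype(int)

# Rows without a target cannot be used for supervised training
toy_clean = toy.dropna(subset=["Survived"]).reset_index(drop=True)
```

Note that the row with a missing `SibSp` ends up flagged as alone, because `fillna(0)` assumes no relatives when the count is unknown.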


## 4) Preprocessing + Split — What & Why

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Numeric: fill gaps with the median, then standardize
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
# Categorical: fill gaps with the mode, then one-hot encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocess = ColumnTransformer([
    ('num', numeric_transformer, num_features),
    ('cat', categorical_transformer, cat_features),
])

# Stratify so train and test keep the same survival rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
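
The `stratify=y` argument is what keeps the class balance the same in both splits. A quick way to convince yourself is to run the same call on synthetic labels with a known 60/40 balance (the toy arrays below are illustrative only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 60 negatives, 40 positives
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 60 + [1] * 40)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Positive rate in each split; with stratification both match the full data
train_rate = toy_y_tr.mean()
test_rate = toy_y_te.mean()
```

Without stratification, a small test set can drift away from the overall positive rate by chance, which makes test metrics harder to compare across runs.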


## 5) Train & Evaluate — Metrics to report

In [None]:

model = Pipeline([('preprocess', preprocess), ('clf', LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

metrics = {
    'accuracy': round(accuracy_score(y_test, y_pred),3),
    'precision': round(precision_score(y_test, y_pred),3),
    'recall': round(recall_score(y_test, y_pred),3),
    'f1': round(f1_score(y_test, y_pred),3),
    'roc_auc': round(roc_auc_score(y_test, y_proba),3)
}
metrics
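
For reference, the threshold-based metrics above all derive from the four confusion-matrix counts. The sketch below computes them by hand on made-up predictions (not this model's output) and can be compared against the scikit-learn functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels and predictions (illustrative only)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

manual_accuracy = (tp + tn) / len(toy_true)
manual_precision = tp / (tp + fp)   # of predicted positives, how many are right
manual_recall = tp / (tp + fn)      # of actual positives, how many are found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)
```

ROC-AUC is different in kind: it is computed from the predicted probabilities (`y_proba`), not from the thresholded `y_pred`.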


## 6) Diagnostics — Confusion Matrix & ROC Curve

In [None]:

from utils import confusion_df
confusion_df(y_test, y_pred)
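
`confusion_df` comes from the local `utils` module and its implementation is not shown. A plausible minimal version, assuming it simply wraps `sklearn.metrics.confusion_matrix` in a labelled DataFrame, might look like this. The name `confusion_df_sketch` and the row/column labels are hypothetical.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def confusion_df_sketch(y_true, y_pred, labels=(0, 1)):
    """Hypothetical helper: confusion matrix with readable row/column labels."""
    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    return pd.DataFrame(
        cm,
        index=[f"actual_{label}" for label in labels],
        columns=[f"pred_{label}" for label in labels],
    )


# Toy usage (illustrative only)
toy_cm = confusion_df_sketch([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

In scikit-learn's convention, rows are actual classes and columns are predicted classes, so false negatives sit at (`actual_1`, `pred_0`).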


In [None]:

import matplotlib.pyplot as plt
try:
    from sklearn.metrics import RocCurveDisplay
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    plt.title('ROC Curve'); plt.show()
except Exception as e:
    print("ROC curve not available in this sklearn version:", e)
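
If `RocCurveDisplay` is unavailable, the curve can still be built from `roc_curve`, which returns the false-positive and true-positive rates at each threshold. The sketch below uses made-up scores (not this model's probabilities) and checks that the area under those points matches `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores (illustrative only)
toy_true = np.array([0, 0, 1, 1])
toy_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per decision threshold
fpr, tpr, thresholds = roc_curve(toy_true, toy_score)

# Trapezoidal area under the ROC points
toy_auc = auc(fpr, tpr)
toy_auc_direct = roc_auc_score(toy_true, toy_score)
```

With the real model, passing `y_test` and `y_proba` to `roc_curve` and then calling `plt.plot(fpr, tpr)` would draw the same curve by hand.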


## 7) Interpretation — Coefficients & Odds Ratios

In [None]:

ct = model.named_steps['preprocess']
ohe = ct.named_transformers_['cat'].named_steps['onehot']
# Column order matches the ColumnTransformer: numeric first, then one-hot columns
num_names = list(num_features)
cat_names = list(ohe.get_feature_names_out(cat_features))
all_feature_names = num_names + cat_names

coef = model.named_steps['clf'].coef_[0]
coef_df = pd.DataFrame({'feature': all_feature_names, 'coef': coef})
# exp(coef) is the multiplicative change in the odds of survival
coef_df['odds_ratio'] = np.exp(coef_df['coef'])
coef_df.sort_values('odds_ratio', ascending=False).head(12)
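
An odds ratio multiplies the odds, not the probability, which is easy to misread. The worked example below uses a made-up coefficient of ln(2) and a made-up starting probability of 0.2, not values from this model.

```python
import math

# Hypothetical coefficient chosen so that the odds ratio is exactly 2
coef_example = math.log(2.0)
odds_ratio_example = math.exp(coef_example)

# Start from a survival probability of 0.2, i.e. odds of 0.2 / 0.8 = 0.25
p_before = 0.2
odds_before = p_before / (1 - p_before)

# A one-unit increase in the feature doubles the odds
odds_after = odds_before * odds_ratio_example

# Convert the new odds back to a probability
p_after = odds_after / (1 + odds_after)
```

Doubling the odds moves the probability from 0.2 to about 0.33, not to 0.4. For the standardized numeric features in this pipeline, "one unit" means one standard deviation.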


## ✅ What we infer
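- We report **accuracy alongside precision/recall/F1** (accuracy alone can mislead when classes are imbalanced) and **ROC‑AUC**, which does not depend on a single decision threshold.
- The confusion matrix separates the two failure modes: false positives (predicted survived, did not) and false negatives (predicted died, survived).
- Coefficients and odds ratios help communicate drivers of survival: an odds ratio above 1 raises the odds of survival, below 1 lowers them. Numeric features are standardized, so their ratios are per one standard deviation.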