# Titanic Classification — End-to-End Guide

Welcome! This notebook walks you through building a complete machine learning pipeline that predicts passenger survival on the Titanic.

**You will:**
1. Understand the data and target (Survived).
2. Explore and visualize key factors (gender, age, class, fare, family, embarkation).
3. Engineer features (Title, FamilySize, IsAlone, TicketGroupSize).
4. Build robust preprocessing with `ColumnTransformer` and `Pipeline`.
5. Train and compare multiple models (Logistic Regression, RandomForest, GradientBoosting; optional XGBoost).
6. Evaluate with cross-validation, confusion matrix, ROC AUC.
7. Interpret feature importance & permutation importance.
8. Export a trained model and generate a Kaggle submission.

---
### How to use this notebook
- If you're on **Google Colab**:
  - Upload `train.csv` and `test.csv` from Kaggle's Titanic competition to the Colab session (or mount Google Drive and set the paths below).
  - Run the cells from top to bottom.
- If you're on **local Jupyter**:
  - Place `train.csv` and `test.csv` in the same folder as this notebook or update the paths below.

**Dataset source**: Kaggle Titanic — Machine Learning from Disaster.

---

## 0) Setup
Install and import the libraries you need. Skip installs if already available.

In [0]:
# If in Colab, uncomment these:
# !pip install scikit-learn pandas numpy matplotlib joblib shap xgboost --quiet

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix, RocCurveDisplay)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.inspection import permutation_importance
import joblib
import re
import warnings
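# Silence all warnings to keep cell output tidy.
# This also hides convergence warnings, so comment it out while debugging.
warnings.filterwarnings('ignore')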


## 1) Load data
Set the paths to your CSVs. If you're in Colab, use the file upload panel (left sidebar) and keep these default names.

In [0]:
TRAIN_PATH = 'train.csv'
TEST_PATH  = 'test.csv'  # used later for submission

train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)
train.head()
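
If either file is missing, `pd.read_csv` stops with a `FileNotFoundError`. A minimal sketch of a pre-flight check, assuming the default file names used above; `missing_files` is a hypothetical helper and not part of the notebook:

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of `paths` that do not exist as files on disk."""
    return [p for p in paths if not Path(p).is_file()]

# Hypothetical usage with the notebook's default file names
missing = missing_files(['train.csv', 'test.csv'])
if missing:
    print('Missing files:', ', '.join(missing))
else:
    print('Both CSVs found.')
```

Running this before the load cell gives a clearer message than the traceback, especially on Colab where uploaded files disappear when the session resets.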

## 2) Quick data check

In [0]:
display(train.shape)
display(train.isna().sum())
train.describe(include='all')
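
Raw missing-value counts are easier to act on next to percentages. A small sketch of that summary, assuming a hypothetical helper `missing_summary` and a toy frame with made-up values standing in for `train`:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    pct = (counts / len(df) * 100).round(1)
    out = pd.DataFrame({'missing': counts, 'percent': pct})
    # Keep only columns that actually have gaps
    return out[out['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame (made-up values) standing in for `train`
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 8.05],
})
print(missing_summary(toy))
```

On the real data you would call `missing_summary(train)`; columns with a high percentage are candidates for dropping or for a dedicated imputation strategy.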

## 3) Basic EDA (Exploratory Data Analysis)

In [0]:
# Target balance
surv_counts = train['Survived'].value_counts().sort_index()
surv_counts.plot(kind='bar')
plt.title('Target balance: Survived (0 = No, 1 = Yes)')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

# Survival by Sex
train.groupby('Sex')['Survived'].mean().plot(kind='bar')
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.show()

# Survival by Pclass
train.groupby('Pclass')['Survived'].mean().plot(kind='bar')
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.show()

# Age distribution by Survival
train[['Age','Survived']].dropna().hist(by='Survived', column='Age', bins=30, sharex=True)
plt.suptitle('Age Distribution by Survival')
plt.show()

# Embarked vs Survival
train.groupby('Embarked')['Survived'].mean().plot(kind='bar')
plt.title('Survival Rate by Embarked')
plt.ylabel('Survival Rate')
plt.show()
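
The plots above look at one factor at a time. A minimal sketch of a two-factor view using `pivot_table`, shown on a toy frame with made-up values so it runs on its own; on the real data you would pass `train` instead:

```python
import pandas as pd

# Toy data (made-up values) with the same column names as the Titanic CSV
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass':   [3, 1, 3, 1, 3, 3],
    'Survived': [0, 1, 1, 1, 0, 0],
})

# Mean of a 0/1 target is the survival rate within each (Sex, Pclass) cell
rates = toy.pivot_table(index='Sex', columns='Pclass',
                        values='Survived', aggfunc='mean')
print(rates)
```

A table like this can reveal interactions that single-factor bar charts hide, for example a class effect that differs between men and women.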


## 4) Feature Engineering
We'll create useful features that capture social status and family structure:
- **Title** extracted from `Name` (e.g., Mr, Mrs, Miss, Master, etc.)
- **FamilySize** = `SibSp + Parch + 1`
- **IsAlone** = 1 if `FamilySize == 1` else 0
- **TicketGroupSize** = number of passengers sharing the same ticket (proxy for group travel)

Titles other than Mr, Mrs, Miss, and Master are grouped into a single `Rare` category.

In [0]:
def extract_title(name: str) -> str:
    m = re.search(r',\s*([^\.]+)\.', name)
    return m.group(1).strip() if m else 'Unknown'

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['Title'] = df['Name'].apply(extract_title)
    # Map rare titles
    common = {'Mr','Mrs','Miss','Master'}
    df['Title'] = df['Title'].apply(lambda t: t if t in common else 'Rare')
    
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    
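    # Ticket group size: number of rows in this DataFrame sharing the same ticket.
    # Counts are per frame, so train and test are counted separately.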
    ticket_counts = df['Ticket'].value_counts()
    df['TicketGroupSize'] = df['Ticket'].map(ticket_counts)
    return df

train_fe = add_engineered_features(train)
test_fe  = add_engineered_features(test)
train_fe[['Name','Title','FamilySize','IsAlone','Ticket','TicketGroupSize']].head()
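
To see what the title regex captures, here is a small standalone check. It reuses the notebook's pattern on a few strings written in the dataset's `Surname, Title. Given names` format; the last string is a made-up edge case with no title at all:

```python
import re

def extract_title(name: str) -> str:
    # Same pattern as the notebook: text between the comma and the first period
    m = re.search(r',\s*([^\.]+)\.', name)
    return m.group(1).strip() if m else 'Unknown'

COMMON = {'Mr', 'Mrs', 'Miss', 'Master'}

samples = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'No title here',  # made-up edge case: no comma, so no match
]
for name in samples:
    raw = extract_title(name)
    grouped = raw if raw in COMMON else 'Rare'
    print(f'{name!r:60} raw={raw!r:15} grouped={grouped!r}')
```

Note that names without a match come back as `Unknown`, which the grouping step then folds into `Rare` along with every uncommon title.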

## 5) Preprocessing & Train/Validation Split
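We hold out 20% of the training data as a stratified validation set, then define the preprocessing: numeric features get median imputation and standardization, and categorical features get most-frequent imputation and one-hot encoding.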

In [0]:
TARGET = 'Survived'
feature_cols = ['Pclass','Sex','Age','Fare','Embarked','FamilySize','IsAlone','Title','TicketGroupSize']

X = train_fe[feature_cols]
y = train_fe[TARGET]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

numeric_features = ['Age','Fare','FamilySize','TicketGroupSize']
categorical_features = ['Pclass','Sex','Embarked','IsAlone','Title']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train.shape, X_valid.shape

## 6) Baseline Model — Logistic Regression

In [0]:
log_reg = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', LogisticRegression(max_iter=200))
])

log_reg.fit(X_train, y_train)
preds = log_reg.predict(X_valid)
probs = log_reg.predict_proba(X_valid)[:,1]

def eval_metrics(y_true, y_pred, y_prob):
    print('Accuracy :', round(accuracy_score(y_true, y_pred), 4))
    print('Precision:', round(precision_score(y_true, y_pred), 4))
    print('Recall   :', round(recall_score(y_true, y_pred), 4))
    print('F1       :', round(f1_score(y_true, y_pred), 4))
    print('ROC AUC  :', round(roc_auc_score(y_true, y_prob), 4))
    
eval_metrics(y_valid, preds, probs)

cm = confusion_matrix(y_valid, preds)
fig, ax = plt.subplots()
ax.imshow(cm)
ax.set_title('Confusion Matrix — Logistic Regression')
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
for (i, j), val in np.ndenumerate(cm):
    ax.text(j, i, int(val), ha='center', va='center')
plt.show()

RocCurveDisplay.from_estimator(log_reg, X_valid, y_valid)
plt.title('ROC Curve — Logistic Regression')
plt.show()
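
The metrics printed by `eval_metrics` can all be read off the confusion matrix. A short worked example with a made-up matrix, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, in label order 0, 1):

```python
import numpy as np

# Made-up confusion matrix in scikit-learn's layout:
#            predicted 0   predicted 1
# true 0         TN            FP
# true 1         FN            TP
cm = np.array([[50, 10],
               [ 5, 35]])
tn, fp, fn, tp = cm.ravel()

accuracy  = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)            # of predicted survivors, how many survived
recall    = tp / (tp + fn)            # of actual survivors, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(f'Accuracy : {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall   : {recall:.4f}')
print(f'F1       : {f1:.4f}')
```

ROC AUC is the one metric that cannot be derived this way, because it depends on the predicted probabilities rather than on the thresholded labels.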

## 7) Model Comparison
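We'll add two tree-based ensembles (random forest and gradient boosting) and compare them with the logistic regression baseline using 5-fold stratified cross-validation on accuracy.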

In [0]:
models = {
    'LogisticRegression': LogisticRegression(max_iter=200),
    'RandomForest'     : RandomForestClassifier(n_estimators=400, random_state=42),
    'GradientBoosting' : GradientBoostingClassifier(random_state=42),
    # Optional: needs `from xgboost import XGBClassifier`, which is not imported in Setup
    # 'XGBoost'       : XGBClassifier(n_estimators=400, max_depth=4, learning_rate=0.1, subsample=0.9, colsample_bytree=0.9, random_state=42)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}
for name, clf in models.items():
    pipe = Pipeline(steps=[('prep', preprocessor), ('clf', clf)])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
    cv_results[name] = (scores.mean(), scores.std())

for name, (mean_acc, std_acc) in cv_results.items():
    print(f"{name:18s}  Acc: {mean_acc:.4f} ± {std_acc:.4f}")
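
The loop above scores on accuracy only. If you also want ROC AUC from the same folds, `cross_validate` accepts several scorers at once. A minimal sketch on synthetic data from `make_classification` (not the Titanic data), so it runs standalone; the `_demo` names are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
res = cross_validate(LogisticRegression(max_iter=200), X_demo, y_demo,
                     cv=cv_demo, scoring=['accuracy', 'roc_auc'])

# Results are stored under 'test_<scorer name>', one value per fold
for metric in ('accuracy', 'roc_auc'):
    scores = res[f'test_{metric}']
    print(f'{metric:9s} {scores.mean():.4f} +/- {scores.std():.4f}')
```

In the notebook you would pass `pipe`, `X`, `y`, and `cv` in place of the demo objects.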

## 8) Fit Best Model & Interpretability
The cell below selects the model with the highest mean CV accuracy (override `best_name` to pick a different one), refits it on the training split, and inspects built-in feature importance (tree models only) and permutation importance.

In [0]:
best_name = max(cv_results, key=lambda k: cv_results[k][0])
best_clf = models[best_name]
best_pipe = Pipeline(steps=[('prep', preprocessor), ('clf', best_clf)])
best_pipe.fit(X_train, y_train)
print('Best model selected:', best_name)

# Scores on validation
val_preds = best_pipe.predict(X_valid)
val_probs = best_pipe.predict_proba(X_valid)[:,1] if hasattr(best_pipe.named_steps['clf'], 'predict_proba') else None
eval_metrics(y_valid, val_preds, val_probs if val_probs is not None else val_preds)

# Feature importance for tree-based models
if hasattr(best_pipe.named_steps['clf'], 'feature_importances_'):
    # Retrieve feature names after preprocessing
    ohe = best_pipe.named_steps['prep'].named_transformers_['cat'].named_steps['onehot']
    num_feats = ['Age','Fare','FamilySize','TicketGroupSize']
    cat_feats = list(ohe.get_feature_names_out(['Pclass','Sex','Embarked','IsAlone','Title']))
    all_feats = num_feats + cat_feats
    importances = best_pipe.named_steps['clf'].feature_importances_
    imp = pd.Series(importances, index=all_feats).sort_values(ascending=False)
    ax = imp.head(20).plot(kind='bar')
    ax.set_title(f'Feature Importance — {best_name}')
    ax.set_ylabel('Importance')
    plt.show()

# Permutation importance (model-agnostic)
r = permutation_importance(best_pipe, X_valid, y_valid, n_repeats=10, random_state=42)
perm_imp = pd.Series(r.importances_mean, index=feature_cols).sort_values(ascending=False)
ax = perm_imp.plot(kind='bar')
ax.set_title('Permutation Importance (Validation Set)')
ax.set_ylabel('Importance')
plt.show()
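
Permutation importance measures how much a score drops when one feature's values are shuffled, which breaks that feature's link to the target. The sketch below illustrates the idea by hand on synthetic data; it is not scikit-learn's implementation, and `manual_permutation_importance` is a hypothetical helper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            importances[j] += baseline - model.score(X_perm, y)
    return importances / n_repeats

# Synthetic stand-in for the train/validation split
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

demo_model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
demo_imp = manual_permutation_importance(demo_model, X_va, y_va)
print(demo_imp.round(3))
```

Values near zero (or slightly negative) suggest the model barely relies on that feature; shuffling is done on held-out data so the drop reflects generalization rather than memorization.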

## 9) Train on Full Data & Export Model

In [0]:
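# Refit on ALL training data using the best model.
# clone() returns an unfitted copy, so refitting here does not overwrite the
# preprocessor and classifier objects that best_pipe (section 8) still holds.
from sklearn.base import clone
final_pipe = clone(best_pipe)
final_pipe.fit(X, y)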
joblib.dump(final_pipe, 'titanic_model.joblib')
print('Saved model to titanic_model.joblib')
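
After exporting, it is worth confirming that the saved file reloads and predicts the same as the in-memory model. A minimal round-trip sketch on synthetic data, writing to a temporary directory; the pipeline and file name here are placeholders, not the notebook's objects:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fitted final pipeline
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=200))]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_model.joblib')
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions exactly
same = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print('Round-trip predictions identical:', same)
```

Keep in mind that joblib files are pickles: load them only from sources you trust, and preferably with the same scikit-learn version that wrote them.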

## 10) Generate Submission for Kaggle

In [0]:
# Prepare test features with the exact same engineering as training
X_test = test_fe[feature_cols]
test_pred = final_pipe.predict(X_test)
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': test_pred
})
submission.to_csv('submission.csv', index=False)
submission.head()
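
A quick sanity check before uploading can catch formatting mistakes. The sketch below uses a hypothetical helper `check_submission`; the default of 418 rows is an assumption based on the size of the competition's `test.csv`, so adjust it if your file differs:

```python
import pandas as pd

def check_submission(sub, expected_rows=418):
    """Return a list of problems found; an empty list means the frame looks valid."""
    if list(sub.columns) != ['PassengerId', 'Survived']:
        return [f'unexpected columns: {list(sub.columns)}']
    problems = []
    if len(sub) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(sub)}')
    if sub['PassengerId'].duplicated().any():
        problems.append('duplicate PassengerId values')
    if not sub['Survived'].isin([0, 1]).all():
        problems.append('Survived contains values other than 0 and 1')
    return problems

# Toy submission (made-up values) to show the call
toy_sub = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
print(check_submission(toy_sub, expected_rows=3))
```

In the notebook you would call `check_submission(submission, expected_rows=len(test))` right before `to_csv`.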

## 11) Use the Model for New Passengers
Example: predict survival for a hypothetical passenger, given as a dictionary with the same keys as `feature_cols`.

In [0]:
def predict_one(passenger_features: dict):
    df = pd.DataFrame([passenger_features])
    # Minimal fields expected: Pclass, Sex, Age, Fare, Embarked, FamilySize, IsAlone, Title, TicketGroupSize
    model = joblib.load('titanic_model.joblib')
    prob = None
    if hasattr(model.named_steps['clf'], 'predict_proba'):
        prob = model.predict_proba(df)[:,1][0]
    pred = model.predict(df)[0]
    return pred, prob

example = {
    'Pclass': 1,
    'Sex': 'female',
    'Age': 28,
    'Fare': 80,
    'Embarked': 'S',
    'FamilySize': 1,
    'IsAlone': 1,
    'Title': 'Miss',
    'TicketGroupSize': 1
}

pred, prob = predict_one(example)
print(f'Prediction (1 = survived, 0 = did not): {pred} | P(survived): {prob}')
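
`predict_one` expects engineered features, but a new passenger record usually arrives with raw fields such as `Name`, `SibSp`, and `Parch`. A sketch of the conversion, mirroring the rules from section 4; `raw_to_features` is a hypothetical helper, the passenger is made up, and `ticket_group_size` has to be supplied by the caller because a single record has no ticket counts:

```python
import re

COMMON_TITLES = {'Mr', 'Mrs', 'Miss', 'Master'}

def raw_to_features(raw, ticket_group_size=1):
    """Build the engineered feature dict from raw Titanic-style fields."""
    m = re.search(r',\s*([^\.]+)\.', raw['Name'])
    title = m.group(1).strip() if m else 'Unknown'
    family_size = raw['SibSp'] + raw['Parch'] + 1
    return {
        'Pclass': raw['Pclass'],
        'Sex': raw['Sex'],
        'Age': raw['Age'],
        'Fare': raw['Fare'],
        'Embarked': raw['Embarked'],
        'FamilySize': family_size,
        'IsAlone': int(family_size == 1),
        'Title': title if title in COMMON_TITLES else 'Rare',
        'TicketGroupSize': ticket_group_size,  # assumption: provided by the caller
    }

# Made-up passenger in the raw CSV format
raw_passenger = {
    'Name': 'Doe, Miss. Jane', 'Pclass': 1, 'Sex': 'female', 'Age': 28,
    'Fare': 80, 'Embarked': 'S', 'SibSp': 0, 'Parch': 0,
}
features = raw_to_features(raw_passenger)
print(features)
```

The resulting dictionary can be passed straight to `predict_one(features)`.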

## 12) Next Steps / Ideas
- Hyperparameter tuning with `GridSearchCV` or `RandomizedSearchCV`.
- Use kNN imputation for `Age`.
- Try more features: cabin deck (first letter of Cabin), ticket prefix, age/fare bins.
- Try `XGBoost`/`LightGBM`/`CatBoost` and calibrate probabilities.
- Use `SHAP` for more detailed model explanations.
- Log experiments and metrics with MLflow.
- Build a simple FastAPI/Streamlit app around your trained model.
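
As a starting point for the first idea in the list, here is a minimal `GridSearchCV` sketch. It runs on synthetic data with a simplified pipeline so it is self-contained; the grid values are arbitrary examples, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=200))])

# Pipeline parameters are addressed as '<step name>__<parameter>'
param_grid = {'clf__C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipe, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
)
search.fit(X_demo, y_demo)
print('Best params:', search.best_params_)
print('Best CV accuracy:', round(search.best_score_, 4))
```

With the notebook's pipelines the same naming applies, for example `clf__n_estimators` for the random forest step; `search.best_estimator_` is already refit on all the data passed to `fit`.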