# Titanic Q2 — Improved Preprocessing & Cross‑Validated Model Comparison
    
**Course:** Data Mining  
**Objective:** Improve upon the common Kaggle preprocessing by using leakage‑safe Pipelines, better feature engineering, and Stratified K‑Fold evaluation.  
**Deliverable:** A table of mean CV accuracies (± std) for nine classifiers, with reproducible code.
    
### What this notebook does
1. Loads the Titanic training data (`train.csv`) and (optionally) test data (`test.csv`).
2. Builds a **leakage‑safe preprocessing pipeline** using scikit‑learn:
   - Robust feature engineering (Deck, Title, FamilySize, Ticket group size, interactions).
   - Rare-category grouping inside CV folds.
   - Proper imputers per data type, and one‑hot encoding with `handle_unknown="ignore"`.
   - Scaling only where it helps (linear, SVM, KNN).
3. Evaluates nine classifiers via **StratifiedKFold** with a fixed random seed, reports **mean ± std accuracy**, and saves the results to CSV.
    
> **Important:** Place `train.csv` (and optionally `test.csv`) in the same folder as this notebook. These are the standard Kaggle Titanic files.


## 0) Imports & Config

In [1]:

import os
import re
import numpy as np
import pandas as pd

from typing import List, Optional, Tuple

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Classifiers
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 1) Load Data
    
Edit `DATA_DIR` if your CSVs are somewhere else.


In [2]:

DATA_DIR = '.'  # change if needed
TRAIN_PATH = os.path.join(DATA_DIR, 'train.csv')
TEST_PATH  = os.path.join(DATA_DIR, 'test.csv')

assert os.path.exists(TRAIN_PATH), f"train.csv not found at {TRAIN_PATH}. Please add it and re-run."

train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH) if os.path.exists(TEST_PATH) else None

train_df.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


## 2) Improved Feature Engineering (leakage‑safe via a custom transformer)
We avoid computing any statistics on the full dataset outside CV. All logic runs **inside** the Pipeline.
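
To make the leakage point concrete, here is a minimal sketch on made-up numbers. The toy array and the train/validation split are assumptions for illustration, not Titanic data. A median imputer fitted on all rows absorbs the validation values, while one fitted on the training rows only does not.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data: rows 0-3 play the training fold, rows 4-5 the validation fold.
X = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0], [np.nan]])
train_idx, valid_idx = [0, 1, 2, 3], [4, 5]

# Leaky: the median is learned from every row, including the validation rows.
leaky = SimpleImputer(strategy="median").fit(X)

# Fold-safe: the median is learned from the training rows only, which is what
# happens when the imputer sits inside a Pipeline passed to cross_val_score.
safe = SimpleImputer(strategy="median").fit(X[train_idx])

leaky_median = float(leaky.statistics_[0])
safe_median = float(safe.statistics_[0])

# The missing validation value is filled with the training-fold median.
valid_filled = safe.transform(X[valid_idx])
```

The two medians differ only because the extreme validation value (100.0) is visible to the leaky imputer.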


In [3]:

class FeatureBuilder(BaseEstimator, TransformerMixin):
    """Reproducible, leakage‑safe feature engineering for Titanic.
    
    - Title extracted from Name (rare titles are grouped later in the pipeline)
    - Deck from the first character of Cabin; 'U' if unknown
    - FamilySize, IsAlone
    - TicketPrefix and GroupSizeByTicket
    - Interaction: AgeClass = Age * Pclass (computed before imputation; NaNs are left for the imputer)
    - FarePerPerson = Fare / FamilySize (FamilySize is >= 1 for valid SibSp/Parch)
    - Optional binned copies (AgeBin/FareBin) as categorical features
    """
    def __init__(self, make_bins: bool = True):
        self.make_bins = make_bins

    @staticmethod
    def _extract_title(name: str) -> str:
        if pd.isna(name):
            return 'Unknown'
        try:
            part = name.split(',')[1]
            title = part.split('.')[0].strip()
            return title
        except Exception:
            return 'Unknown'

    @staticmethod
    def _ticket_prefix(ticket: str) -> str:
        if pd.isna(ticket):
            return 'MISSING'
        # Use the first token that is not purely digits as the prefix; 'NONE' if there is none
        toks = re.split(r'[\s/]+', str(ticket))
        toks = [t for t in toks if t and not t.isdigit()]
        return toks[0].upper() if toks else 'NONE'

    @staticmethod
    def _deck_from_cabin(cabin: str) -> str:
        if pd.isna(cabin) or str(cabin).strip() == '' or str(cabin).lower() == 'nan':
            return 'U'  # Unknown
        return str(cabin)[0].upper()

    def fit(self, X: pd.DataFrame, y=None):
        # Nothing is learned here. Note that transform() still computes per-batch
        # values (ticket counts, Fare quantile bins) on whatever rows it receives.
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        df = X.copy()

        # Basic safe casts
        for col in ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass']:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors='coerce')

        # Title, Deck, TicketPrefix
        df['Title'] = df['Name'].apply(self._extract_title) if 'Name' in df.columns else 'Unknown'
        df['Deck'] = df['Cabin'].apply(self._deck_from_cabin) if 'Cabin' in df.columns else 'U'
        df['TicketPrefix'] = df['Ticket'].apply(self._ticket_prefix) if 'Ticket' in df.columns else 'NONE'

        # Family features
        df['SibSp'] = df['SibSp'].fillna(0)
        df['Parch'] = df['Parch'].fillna(0)
        df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
        df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

        # Group size by ticket
        if 'Ticket' in df.columns:
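            # Counts are taken over the batch passed to transform(), so a validation
            # fold is counted on its own rows only, separately from its training fold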
            df['GroupSizeByTicket'] = df.groupby('Ticket')['Ticket'].transform('count')
        else:
            df['GroupSizeByTicket'] = 1

        # Fare per person; FamilySize is >= 1 here, so the inf replacement is only a defensive guard
        df['FarePerPerson'] = df['Fare'] / df['FamilySize']
        df['FarePerPerson'] = df['FarePerPerson'].replace([np.inf, -np.inf], np.nan)

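        # Optional coarse bins as categorical helpers.
        # Caveat: qcut recomputes quantile edges on every transform() call, so the
        # FareBin labels of a validation fold can differ from those of its training fold.
        if self.make_bins:
            age_edges = [0, 12, 18, 30, 45, 60, 80, np.inf]
            df['AgeBin'] = pd.cut(df['Age'], bins=age_edges, right=False, include_lowest=True).astype('category')
            df['FareBin'] = pd.qcut(df['Fare'], q=5, duplicates='drop').astype('category')
        else:
            # index=df.index keeps the constant column aligned with shuffled CV rows
            df['AgeBin'] = pd.Series(['NA'] * len(df), index=df.index, dtype='category')
            df['FareBin'] = pd.Series(['NA'] * len(df), index=df.index, dtype='category')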

        # Simple interaction that uses numeric Age & Pclass; leave NaNs for imputer
        if 'Age' in df.columns and 'Pclass' in df.columns:
            df['AgeClass'] = df['Age'] * df['Pclass']
        else:
            df['AgeClass'] = np.nan

        # Keep a curated set of columns for modeling
        keep_cols = [
            # Original numeric
            'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
            # Engineered numeric
            'FamilySize', 'IsAlone', 'GroupSizeByTicket', 'FarePerPerson', 'AgeClass',
            # Categorical
            'Sex', 'Embarked', 'Title', 'Deck', 'TicketPrefix', 'AgeBin', 'FareBin'
        ]
        # Survive gracefully if some columns are missing in the raw data
        keep_cols = [c for c in keep_cols if c in df.columns]

        return df[keep_cols]
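
Because `transform()` calls `pd.qcut` and the ticket `groupby` on whatever rows it receives, a validation fold gets its own bin edges and counts. One possible alternative, sketched here under assumptions, is to learn both in `fit()` and reuse them in `transform()`. The class name `FoldFittedBins`, its `q` parameter, and the toy frames are hypothetical and not part of this notebook. Swapping something like this in would change the scores reported later.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FoldFittedBins(BaseEstimator, TransformerMixin):
    """Hypothetical helper: learn Fare quantile edges and ticket counts in fit()."""

    def __init__(self, q: int = 5):
        self.q = q

    def fit(self, X: pd.DataFrame, y=None):
        # Quantile edges come from the rows seen during fit() only.
        probs = np.linspace(0, 1, self.q + 1)
        edges = np.unique(X["Fare"].quantile(probs).to_numpy())
        # Open the outer edges so unseen extreme fares still fall into a bin.
        edges[0], edges[-1] = -np.inf, np.inf
        self.fare_edges_ = edges
        # Ticket frequencies as seen during fit().
        self.ticket_counts_ = X["Ticket"].value_counts()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        out = X.copy()
        # Same edges for every batch, so bin labels match between folds.
        out["FareBin"] = pd.cut(out["Fare"], bins=self.fare_edges_).astype(str)
        # Tickets never seen during fit() get a count of 0.
        out["GroupSizeByTicket"] = (
            out["Ticket"].map(self.ticket_counts_).fillna(0).astype(int)
        )
        return out


# Made-up training and validation folds (illustration only).
train_fold = pd.DataFrame({
    "Fare": [5.0, 10.0, 20.0, 40.0, 80.0],
    "Ticket": ["A", "A", "B", "C", "C"],
})
valid_fold = pd.DataFrame({"Fare": [7.0, 500.0], "Ticket": ["A", "Z"]})

binner = FoldFittedBins(q=2).fit(train_fold)
train_out = binner.transform(train_fold)
valid_out = binner.transform(valid_fold)
```

With this design every validation bin label also exists in the training fold, at the cost of training rows counting themselves in `GroupSizeByTicket` while validation rows only count matches in the training fold. Whether that asymmetry is acceptable is a modelling decision.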


## 3) Rare-Category Grouper (fit **inside** CV folds)
Groups very infrequent categories to `Rare` based on training‑fold frequencies.
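
The grouping rule can be sketched with plain pandas. The title counts below are made up for illustration, and the 0.02 threshold mirrors the `min_freq` default used in this notebook.

```python
import pandas as pd

# Made-up titles for a training fold of 100 passengers (illustration only).
train_titles = pd.Series(["Mr"] * 60 + ["Miss"] * 25 + ["Mrs"] * 14 + ["Don"] * 1)
min_freq = 0.02

# Relative frequency of each category in the training fold.
freq = train_titles.value_counts(normalize=True)
keep = set(freq[freq >= min_freq].index)

# Validation values are mapped with the training-fold keep set:
# 'Don' was too rare and 'Jonkheer' was never seen, so both become 'Rare'.
valid_titles = pd.Series(["Mr", "Don", "Jonkheer", "Mrs"])
grouped = valid_titles.where(valid_titles.isin(keep), other="Rare")
```

Categories that never appeared in the training fold are treated the same way as rare ones, which keeps the downstream one-hot encoder's vocabulary small.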


In [4]:

class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    """Group infrequent categories to 'Rare' per column.
    
    Parameters
    ----------
    min_freq : float
        Minimum relative frequency required to keep a category.
    columns : List[str]
        Categorical columns to apply grouping.
    """
    def __init__(self, min_freq: float = 0.02, columns: Optional[List[str]] = None):
        self.min_freq = min_freq
        self.columns = columns
        # keep_maps_ is created in fit(); fitted attributes are not set in __init__

    def fit(self, X: pd.DataFrame, y=None):
        self.keep_maps_ = {}
        cols = self.columns if self.columns is not None else X.select_dtypes(include=['object', 'category']).columns.tolist()
        for col in cols:
            vc = X[col].astype('category').value_counts(normalize=True, dropna=False)
            keep = set(vc[vc >= self.min_freq].index.astype(str).tolist())
            self.keep_maps_[col] = keep
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        for col, keep in self.keep_maps_.items():
            if col in X.columns:
                X[col] = X[col].astype(str).where(X[col].astype(str).isin(keep), other='Rare')
        return X


## 4) ColumnTransformer: Imputation, Encoding, and Scaling
- Numeric: `SimpleImputer(strategy='median')` then **scale for linear/knn/svm** models.
- Categorical: `SimpleImputer(strategy='most_frequent')` and `OneHotEncoder(handle_unknown='ignore')`.
    
We will build two preprocessors:
1. `preprocess_scaled` for models that need scaling (LogReg, LinearSVC/SGD, SVC-RBF, KNN, Perceptron)
2. `preprocess_tree` for tree/forest/naive Bayes (no scaling required; `sparse_output=False` yields the dense array that GaussianNB needs)


In [10]:

# Helper to infer column types after FeatureBuilder
def infer_columns(df: pd.DataFrame) -> Tuple[List[str], List[str]]:
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    return num_cols, cat_cols

# Fit the feature builder once on the raw train_df to discover columns (doesn't learn stats)
fb = FeatureBuilder(make_bins=True)
feat_preview = fb.fit_transform(train_df)

num_cols, cat_cols = infer_columns(feat_preview)

rare_cols = [c for c in cat_cols if c in ['Title', 'Deck', 'TicketPrefix', 'AgeBin', 'FareBin']]

# Pipelines use RareCategoryGrouper -> ColumnTransformer
grouper = RareCategoryGrouper(min_freq=0.02, columns=rare_cols)

# For models that need scaling
numeric_scaled = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

categorical_common = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocess_scaled = Pipeline([
    ('features', fb),
    ('rare', grouper),
    ('ct', ColumnTransformer([
        ('num', numeric_scaled, num_cols),
        ('cat', categorical_common, cat_cols)
    ], remainder='drop', verbose_feature_names_out=False))
])

# For tree/naive bayes models (no scaling)
numeric_tree = Pipeline([
    ('impute', SimpleImputer(strategy='median'))
])

preprocess_tree = Pipeline([
    ('features', fb),
    ('rare', grouper),
    ('ct', ColumnTransformer([
        ('num', numeric_tree, num_cols),
        ('cat', categorical_common, cat_cols)
    ], remainder='drop', verbose_feature_names_out=False))
])


## 5) Build Model Pipelines

In [11]:

def make_pipelines():
    models = {}

    # Linear / margin / distance-based -> scaled preprocessor
    models['Logistic Regression'] = Pipeline([
        ('prep', preprocess_scaled),
        ('clf', LogisticRegression(max_iter=2000, random_state=RANDOM_STATE))
    ])

    models['Linear SVC'] = Pipeline([
        ('prep', preprocess_scaled),
        ('clf', LinearSVC(random_state=RANDOM_STATE))
    ])

    models['SGD (Log Loss)'] = Pipeline([
        ('prep', preprocess_scaled),
        ('clf', SGDClassifier(loss='log_loss', max_iter=2000, random_state=RANDOM_STATE))
    ])

    models['Perceptron'] = Pipeline([
        ('prep', preprocess_scaled),
        ('clf', Perceptron(max_iter=2000, random_state=RANDOM_STATE))
    ])

    models['SVC (RBF)'] = Pipeline([
        ('prep', preprocess_scaled),
        ('clf', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=RANDOM_STATE))
    ])

    models['KNN (k=15)'] = Pipeline([
        ('prep', preprocess_scaled),
        ('clf', KNeighborsClassifier(n_neighbors=15))
    ])

    # Tree-based / NB -> unscaled preprocessor
    models['Random Forest'] = Pipeline([
        ('prep', preprocess_tree),
        ('clf', RandomForestClassifier(
            n_estimators=400, max_depth=None, min_samples_split=2,
            min_samples_leaf=1, random_state=RANDOM_STATE, n_jobs=-1
        ))
    ])

    models['Decision Tree'] = Pipeline([
        ('prep', preprocess_tree),
        ('clf', DecisionTreeClassifier(random_state=RANDOM_STATE))
    ])

    # GaussianNB expects dense input; our OneHotEncoder uses sparse_output=False above
    models['Gaussian NB'] = Pipeline([
        ('prep', preprocess_tree),
        ('clf', GaussianNB())
    ])

    return models

models = make_pipelines()
list(models.keys())


['Logistic Regression',
 'Linear SVC',
 'SGD (Log Loss)',
 'Perceptron',
 'SVC (RBF)',
 'KNN (k=15)',
 'Random Forest',
 'Decision Tree',
 'Gaussian NB']

## 6) Cross‑Validated Evaluation (Stratified K‑Fold)
- **n_splits=5**, shuffled with fixed seed for reproducibility
- Metric: **accuracy**
    
We also keep the same folds across models for fairness.
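
The two properties relied on here (identical folds on every call, and class balance preserved per fold) can be checked on toy labels. The 60/40 label vector below is an assumption for illustration, not the Titanic class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 60 negatives and 40 positives (illustration only).
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# With an integer random_state, every call to split() yields the same folds,
# so each model passed the same cv object is scored on identical splits.
first = [valid.tolist() for _, valid in cv.split(X, y)]
second = [valid.tolist() for _, valid in cv.split(X, y)]

# Each validation fold keeps the overall class ratio (40% positives here).
positive_rates = [float(y[valid].mean()) for valid in first]
```

If `random_state` were a shared `RandomState` instance instead of an integer, repeated calls to `split()` would produce different folds and the comparison across models would no longer be like for like.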


In [15]:

X = train_df.drop(columns=['Survived']) if 'Survived' in train_df.columns else train_df.copy()
y = train_df['Survived'].astype(int) if 'Survived' in train_df.columns else None
assert y is not None, "The training data must include a 'Survived' column."

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

rows = []
for name, pipe in models.items():
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
    rows.append({
        'Model': name,
        'Mean Accuracy': scores.mean(),
        'Std': scores.std(),  # NumPy default ddof=0: population std across the folds
        'Fold Scores': scores
    })

results_df = pd.DataFrame(rows).sort_values(by='Mean Accuracy', ascending=False).reset_index(drop=True)
results_df[['Model', 'Mean Accuracy', 'Std']]


| | Model | Mean Accuracy | Std |
|---|---|---|---|
| 0 | SVC (RBF) | 0.828278 | 0.014953 |
| 1 | Linear SVC | 0.826044 | 0.011695 |
| 2 | Logistic Regression | 0.824926 | 0.015922 |
| 3 | Random Forest | 0.823765 | 0.018212 |
| 4 | KNN (k=15) | 0.801324 | 0.019533 |
| 5 | SGD (Log Loss) | 0.79796 | 0.014892 |
| 6 | Gaussian NB | 0.787848 | 0.018901 |
| 7 | Decision Tree | 0.785619 | 0.033304 |
| 8 | Perceptron | 0.728379 | 0.022818 |


## 7) Fit Best Model on Full Training Set (Optional)

In [13]:

best_row = results_df.iloc[0]
best_name = best_row['Model']
best_model = models[best_name]
best_model.fit(X, y)

print(f"Best model: {best_name}")


Best model: SVC (RBF)


## 8) Save Results

In [14]:

OUT_CSV = 'titanic_q2_cv_results.csv'
results_df.to_csv(OUT_CSV, index=False)
print(f"Saved cross-validated results to {OUT_CSV}")


Saved cross-validated results to titanic_q2_cv_results.csv


## 9) Notes for Your Report
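- The preprocessing avoids train/validation leakage by fitting the imputers, rare-category grouping, scaler, and encoders **inside** each CV fold.
- Both preprocessors keep the continuous features alongside the one-hot encoded bins; only `preprocess_scaled` standardizes the numeric columns.
- Title, Deck, TicketPrefix, FamilySize, GroupSizeByTicket, FarePerPerson, and AgeClass are the engineered features intended to add signal; this notebook does not measure their individual contribution.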
- Report your **Mean ± Std accuracy** per model and briefly justify why this pipeline addresses Kaggle’s common pitfalls.
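
For the report table, one way to format the scores is sketched below. It uses the top three rows of the results table from section 6. The `Report` column and the `within_one_std` check are illustrative additions, not part of the notebook.

```python
import pandas as pd

# Top three rows copied from the cross-validated results in section 6.
results = pd.DataFrame({
    "Model": ["SVC (RBF)", "Linear SVC", "Logistic Regression"],
    "Mean Accuracy": [0.828278, 0.826044, 0.824926],
    "Std": [0.014953, 0.011695, 0.015922],
})

# Render each row as "mean ± std" with three decimals.
results["Report"] = results.apply(
    lambda r: f"{r['Mean Accuracy']:.3f} ± {r['Std']:.3f}", axis=1
)

# Gap between the best and second-best mean, compared with the best model's fold std.
gap = results.loc[0, "Mean Accuracy"] - results.loc[1, "Mean Accuracy"]
within_one_std = bool(gap < results.loc[0, "Std"])
```

Here the gap between the top two models is much smaller than one fold standard deviation, so the ranking among the leading models should be reported with that caveat.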
