# 6. Hyperparameter Tuning - Titanic Survival

**Goal:** Tune the best-performing model(s) from the benchmark to improve cross-validated accuracy.

**Selected Methods:**
1. **Random Forest Tuning** (Grid Search)
2. **XGBoost Tuning** (Randomized Search)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix

np.random.seed(42)

## 1. Data Pipeline Setup

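Rebuild the feature engineering and preprocessing steps so that tuning runs on the same inputs as the benchmark. Numeric features are median-imputed and standardized; categorical features are imputed with the most frequent value and one-hot encoded.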

In [2]:
df = pd.read_csv('../data/raw/train.csv')
df['Title'] = df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

X = df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title', 'FamilySize', 'IsAlone']]
y = df['Survived']

numeric_features = ['Age', 'Fare', 'FamilySize']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Title', 'IsAlone']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), numeric_features),
        ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)
    ])
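The engineered features above are simple enough to check by hand. The sketch below mirrors the `Title`, `FamilySize`, and `IsAlone` logic in plain Python; the helper names `extract_title` and `family_features` and the sample names are illustrative assumptions, not part of the notebook.

```python
def extract_title(name: str) -> str:
    """Return the text between the first comma and the following period."""
    # Same logic as the lambda applied to df['Name'] above.
    return name.split(',')[1].split('.')[0].strip()


def family_features(sibsp: int, parch: int) -> tuple:
    """Return (FamilySize, IsAlone) for one passenger."""
    family_size = sibsp + parch + 1      # siblings/spouses + parents/children + self
    is_alone = int(family_size == 1)     # 1 when travelling without family
    return family_size, is_alone


# Illustrative names in the "Surname, Title. Given names" layout of the Name column.
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
]

titles = [extract_title(n) for n in sample_names]
print(titles)                  # ['Mr', 'Miss', 'Mrs']
print(family_features(1, 0))   # (2, 0)
print(family_features(0, 0))   # (1, 1)
```

Note that the split relies on every name containing a comma; a name without one would raise an `IndexError`.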

## 2. Random Forest Grid Search

Grid search evaluates every combination of the listed values: 3 × 4 × 3 × 3 = 108 candidates, each scored with 5-fold cross-validation.

In [3]:
pipeline_rf = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', RandomForestClassifier(random_state=42))])

param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 5, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

grid_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, verbose=1, n_jobs=-1)
grid_rf.fit(X, y)

print("Best RF Params:", grid_rf.best_params_)
print("Best RF Score:", grid_rf.best_score_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


Best RF Params: {'classifier__max_depth': 10, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 10, 'classifier__n_estimators': 100}
Best RF Score: 0.8361308141359614
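To see where the "108 candidates, totalling 540 fits" in the log comes from, the grid can be expanded by hand. This is a minimal sketch using only the standard library; `expand_grid` is a hypothetical helper, and scikit-learn's `ParameterGrid` performs the equivalent enumeration internally.

```python
from itertools import product

# Same grid as param_grid_rf above.
param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 5, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}


def expand_grid(grid):
    """Return one dict per combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid_rf)
n_folds = 5  # matches cv=5 in the GridSearchCV call

print(len(candidates))             # 108
print(len(candidates) * n_folds)   # 540
```

Each extra value in any list multiplies the total, which is why exhaustive search gets expensive quickly.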


## 3. XGBoost Randomized Search

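Randomized search samples a fixed number of combinations (here 20 of the 432 possible) instead of trying them all, so a wider range of values can be explored at a fraction of the cost.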

In [4]:
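# Note: recent XGBoost releases ignore use_label_encoder (hence the "not used"
# warning in the output below); the argument can be dropped there.
pipeline_xgb = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'))])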

param_dist_xgb = {
    'classifier__n_estimators': [100, 300, 500],
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.3],
    'classifier__max_depth': [3, 5, 7, 9],
    'classifier__subsample': [0.6, 0.8, 1.0],
    'classifier__colsample_bytree': [0.6, 0.8, 1.0]
}

rand_xgb = RandomizedSearchCV(pipeline_xgb, param_dist_xgb, n_iter=20, cv=5, verbose=1, n_jobs=-1, random_state=42)
rand_xgb.fit(X, y)

print("Best XGB Params:", rand_xgb.best_params_)
print("Best XGB Score:", rand_xgb.best_score_)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


Best XGB Params: {'classifier__subsample': 0.8, 'classifier__n_estimators': 100, 'classifier__max_depth': 7, 'classifier__learning_rate': 0.05, 'classifier__colsample_bytree': 0.8}
Best XGB Score: 0.843977151465696


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
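The randomized search above covers only a small slice of the full grid. The sketch below illustrates the idea with the standard library: enumerate all combinations, then draw 20 without replacement. It is an assumption-laden illustration, not a reproduction of scikit-learn's sampler, so the 20 combinations drawn here will not match the ones `RandomizedSearchCV` evaluated.

```python
import random
from itertools import product

# Same search space as param_dist_xgb above.
param_dist_xgb = {
    'classifier__n_estimators': [100, 300, 500],
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.3],
    'classifier__max_depth': [3, 5, 7, 9],
    'classifier__subsample': [0.6, 0.8, 1.0],
    'classifier__colsample_bytree': [0.6, 0.8, 1.0],
}

keys = list(param_dist_xgb)
all_combos = [dict(zip(keys, values))
              for values in product(*param_dist_xgb.values())]

# Draw 20 distinct combinations, mirroring n_iter=20.
rng = random.Random(42)
sampled = rng.sample(all_combos, k=20)

print(len(all_combos))                          # 432
print(len(sampled))                             # 20
print(f"{len(sampled) / len(all_combos):.1%}")  # 4.6%
```

With fewer than 5% of the grid evaluated, raising `n_iter` is a cheap way to check whether the reported best score is stable.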


## 4. Final Evaluation

Compare the tuned models by mean cross-validated accuracy and keep the better one.

In [5]:
if grid_rf.best_score_ > rand_xgb.best_score_:
    print("🏅 Winner: Random Forest")
    best_model = grid_rf.best_estimator_
else:
    print("🏅 Winner: XGBoost")
    best_model = rand_xgb.best_estimator_

# Caution: best_model was refit on all of X, so the report below is an in-sample
# (optimistic) estimate. The cross-validated scores above are the more reliable
# figure; a separate hold-out set would be better if one were available.
y_pred = best_model.predict(X)
print(classification_report(y, y_pred))

🏅 Winner: XGBoost
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       549
           1       0.92      0.83      0.88       342

    accuracy                           0.91       891
   macro avg       0.91      0.89      0.90       891
weighted avg       0.91      0.91      0.91       891
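The report above scores the model on data it was fit on (0.91 accuracy against roughly 0.84 under cross-validation). One way to get an unbiased final estimate is to set aside a stratified hold-out set before tuning. The sketch below shows the splitting step in plain Python; `stratified_holdout` is a hypothetical helper, and the label list is rebuilt only from the class counts in the report (549 and 342), not from the real passenger order.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                             # shuffle within each class
        n_test = round(len(indices) * test_fraction)     # per-class test size
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Stand-in labels built from the support column of the report above.
labels = [0] * 549 + [1] * 342
train_idx, test_idx = stratified_holdout(labels)

print(len(train_idx), len(test_idx))  # 713 178
```

In the notebook itself, scikit-learn's `train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)` would do the same job; the searches would then be fit on the training part only, and `classification_report` called on the held-out part.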

