<font color = "DeepSkyBlue">**A precision-tuned XGBoost workflow focused on stability, feature logic, and reliable CV performance.**

This notebook builds a tuned XGBoost model for the Kaggle Playground Series S5E11 competition. It begins by loading the training and test data and applying a small domain-specific transformation to the grade_subgrade feature to extract the numerical subgrade. The code collects all base features other than the target and ID, and divides them into categorical and numeric groups. It then creates three kinds of engineered features: leakage-safe out-of-fold target mean encoding, frequency encoding for every base column, and quantile-based bin features for numeric columns. These engineered features are concatenated with the raw features, and a fixed list of columns that earlier experiments found unhelpful is dropped. All remaining categorical features are cast to the pandas categorical type so XGBoost can use its built-in categorical handling.

Next, the script aligns feature columns between train and test to ensure consistency. It then uses xgb.cv with early stopping to estimate a good number of boosting rounds, using one representative set of jittered hyperparameters as a probe. After this first estimate, the code performs a small “micro-sweep” over three candidates (the estimate itself, +10, and +20) and scores each by out-of-fold AUC with the same probe configuration. The best candidate from this sweep becomes the final n_estimators for all models in the ensemble.

The notebook then trains a small internal ensemble of five XGBoost models, one per pairing of seed and parameter jitter. Each model is trained on the full training data and predicts probabilities on the test data. To tune the final blend, the script also builds out-of-fold predictions for each ensemble member using Stratified K-Fold. From these OOF predictions it constructs two meta-signals: the mean predicted probability across members and the mean rank of those predictions, scaled to [0, 1]. The script then searches a grid of blending weights β from 0.20 to 0.40 in steps of 0.05 and keeps the value that maximizes OOF AUC for the mix (1 − β) · probability + β · rank. Because ranks ignore each model's probability scale, the rank term is meant to make the blend less sensitive to any single member.

Finally, the tuned blend is applied to the ensemble’s predictions on the test set, and the resulting values are clipped to [0, 1] and written to submission.csv. The script concludes by printing the reference CV AUC from xgb.cv, the best result from the micro-sweep, and the blended OOF AUC that was used to select β.

<font color = "DeepSkyBlue">**Imports**

This section loads all required libraries for the modeling pipeline. It brings in core Python tools for file handling, randomness control, system operations, and memory management. NumPy and pandas are included to handle numerical processing and tabular data. XGBoost and its classifier interface provide the gradient-boosted tree model used throughout the workflow, including native support for categorical features. From scikit-learn, it imports utilities for cross-validation and the AUC metric used for evaluation. Warning messages are then disabled to keep the notebook output focused and uncluttered.

In [None]:
# Imports
import os, gc, warnings, random
from pathlib import Path

import numpy as np
import pandas as pd

import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

warnings.filterwarnings("ignore")

<font color = "DeepSkyBlue">**Config**

This section defines the paths to the training and test files within the Kaggle input directory. It sets the names of the ID column and the target column used in the prediction task. It also specifies the number of folds for cross-validation and establishes a fixed random seed to ensure reproducible results.

In [None]:
# Config
DATA_DIR = Path("/kaggle/input/playground-series-s5e11")
TRAIN_PATH = DATA_DIR / "train.csv"
TEST_PATH  = DATA_DIR / "test.csv"

ID_COL = "id"
TARGET = "loan_paid_back"

N_FOLDS = 7
SEED = 42

<font color = "DeepSkyBlue">**Boosting budget & early stopping used by xgb.cv to find a good n_estimators**

This section sets the upper limit for how many boosting rounds XGBoost is allowed to explore during cross-validation. It also defines the early stopping threshold, which halts training if performance does not improve for a specified number of rounds, helping identify an efficient and well-performing number of estimators.
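
The stopping rule described above can be illustrated without XGBoost. The following is a minimal sketch in plain Python, assuming a higher-is-better metric such as AUC; the function name and the score history are hypothetical and are not taken from the notebook or from the XGBoost API.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_index, rounds_run) for a sequence of validation scores.

    Training stops once `patience` consecutive rounds fail to beat the
    best score seen so far. Higher scores are better (as with AUC).
    """
    best_index, best_score = -1, float("-inf")
    rounds_run = 0
    for i, score in enumerate(scores):
        rounds_run = i + 1
        if score > best_score:
            best_index, best_score = i, score
        elif i - best_index >= patience:
            break  # no improvement for `patience` rounds in a row
    return best_index, rounds_run


# Hypothetical validation AUC per boosting round (made-up numbers).
history = [0.70, 0.74, 0.76, 0.765, 0.764, 0.763, 0.765, 0.762]
best_index, rounds_run = best_round_with_early_stopping(history, patience=3)
n_trees = best_index + 1  # the index is 0-based, the tree count is 1-based
print(best_index, rounds_run, n_trees)  # 3 7 4
```

With a budget of 20000 rounds and a patience of 50, the budget acts only as a ceiling; in practice the patience decides where training ends.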

In [None]:
# Boosting budget & early stopping used by xgb.cv to find a good n_estimators
NUM_BOOST_ROUND = 20000
EARLY_STOP = 50

<font color = "DeepSkyBlue">**Small internal ensemble: same model with slight param jitters and different seeds**

This section specifies a set of random seeds and small parameter variations used to train several versions of the same XGBoost model under slightly different conditions. Each seed is paired with one jitter, and the jittered parameters introduce subtle changes in tree structure and regularization strength. Because the base configuration defined in the next section turns off row and column subsampling, the seeds are likely to have little effect on their own, so most of the diversity comes from the jitters. These variations let the ensemble average over several nearby configurations, which is intended to improve robustness without relying on external blending or combining unrelated models.

In [None]:
# Small internal ensemble: same model with slight param jitters and different seeds
ENSEMBLE_SEEDS = [42, 7, 19, 77, 123]
JITTERS = [
    dict(max_leaves=4,  min_child_weight=89, reg_alpha=1.4, reg_lambda=5.9),
    dict(max_leaves=4,  min_child_weight=82, reg_alpha=1.1, reg_lambda=6.3),
    dict(max_leaves=5,  min_child_weight=95, reg_alpha=1.6, reg_lambda=5.6),
    dict(max_leaves=5,  min_child_weight=88, reg_alpha=1.3, reg_lambda=6.1),
    dict(max_leaves=4,  min_child_weight=92, reg_alpha=1.2, reg_lambda=6.0),
]

<font color = "DeepSkyBlue">**Base XGBoost params (hist + categorical); regularization/leaves are provided via jitters**

This section defines the baseline XGBoost configuration used throughout the workflow. It selects the histogram tree method and requests GPU training through the device setting. The model is set up for binary classification with a logistic objective, so it outputs probabilities, and it uses AUC as its evaluation metric. All row and feature sampling ratios are set to 1.0, so every tree sees the full dataset and subsampling adds no randomness. Structural and regularization parameters are intentionally omitted here because they are supplied later through the jitter dictionaries.

In [None]:
# Base XGBoost params (hist + categorical); regularization/leaves are provided via jitters
BASE_PARAMS = {
    'tree_method': 'hist',
    'device': 'cuda',            # GPU training; intended to fall back to CPU if no GPU is visible
    'predictor': 'auto',         # legacy knob: let XGBoost pick the predictor (newer releases may ignore it)
    'eval_metric': 'auc',
    'objective': 'binary:logistic',
    'subsample': 1.0,
    'colsample_bytree': 1.0,
    'colsample_bylevel': 1.0,
    'colsample_bynode': 1.0,
    'gamma': 0.0,
    'scale_pos_weight': 1.0,
}

<font color = "DeepSkyBlue">**Reproducible randomness for Python, NumPy, and hashing**

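This section fixes the main sources of randomness under the notebook's control. It sets PYTHONHASHSEED, seeds Python's standard random generator, and seeds NumPy's global generator. PYTHONHASHSEED is read when the interpreter starts, so assigning it here affects child processes rather than the hashing of the kernel that is already running.

Together with the fixed random_state values passed to scikit-learn and XGBoost, these seeds make the data splits and parameter choices repeat across executions. Model outputs should then match closely from run to run, although they may still vary slightly across hardware or library versions.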
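
As a small illustration of why a fixed seed matters for data splits, the sketch below assigns rows to folds with a seeded generator from the standard library. The helper name draw_fold_ids is hypothetical and the notebook itself relies on StratifiedKFold for this step.

```python
import random


def draw_fold_ids(n_rows, n_folds, seed):
    """Assign each row to a fold using a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; the global state is untouched
    fold_ids = [i % n_folds for i in range(n_rows)]  # balanced fold sizes
    rng.shuffle(fold_ids)
    return fold_ids


run_a = draw_fold_ids(n_rows=10, n_folds=5, seed=42)
run_b = draw_fold_ids(n_rows=10, n_folds=5, seed=42)
print(run_a == run_b)  # True: same seed, same fold assignment
```

Using a local generator rather than the global one keeps the result independent of any other code that draws random numbers in between.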

In [None]:
# Reproducible randomness for Python, NumPy, and hashing
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

<font color = "DeepSkyBlue">**Helpers**

This helper module provides the building blocks for a clean, leakage-safe training flow. It starts with read_data, which simply loads the train and test files. The target_encoding routine then creates strictly out-of-fold target means using stratified folds, so that each training row receives a target mean computed without seeing itself, while the test set is encoded with a single mapping computed from the full training set. Complementing that, create_frequency_and_bins adds lightweight signal: frequency encodings for every column, with unseen test values filled by the mean frequency, and quantile bins at three resolutions (5, 10, and 15) for numeric features. If quantile binning fails for a column, the helper falls back to a constant bin of 0.
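
The out-of-fold idea can be traced by hand on a toy column. The sketch below is a dependency-free illustration rather than the notebook's pandas implementation; the function name, the data, and the fold assignment are all hypothetical.

```python
from collections import defaultdict


def oof_target_encode(categories, targets, fold_ids):
    """Out-of-fold mean encoding.

    Each row is encoded with the target mean of its category computed
    only from rows in other folds. Categories absent from the other
    folds get None (a pandas version would yield NaN there).
    """
    encoded = [None] * len(categories)
    for fold in sorted(set(fold_ids)):
        sums, counts = defaultdict(float), defaultdict(int)
        # Accumulate statistics from every row outside the current fold.
        for cat, y, f in zip(categories, targets, fold_ids):
            if f != fold:
                sums[cat] += y
                counts[cat] += 1
        # Encode the rows that belong to the current fold.
        for i, (cat, f) in enumerate(zip(categories, fold_ids)):
            if f == fold and counts[cat] > 0:
                encoded[i] = sums[cat] / counts[cat]
    return encoded


cats    = ["A", "A", "B", "B", "A", "B"]
targets = [1,   0,   1,   1,   1,   0]
folds   = [0,   1,   0,   1,   2,   2]  # hypothetical fold assignment
print(oof_target_encode(cats, targets, folds))
# [0.5, 1.0, 0.5, 0.5, 0.5, 1.0]
```

The second row has target 0 but is encoded as 1.0, because the mean for category A is taken only from the other two A rows. A naive full-data mean would give every A row 2/3, a value that includes the row's own label.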

To make XGBoost’s native categorical handling work correctly, enable_categoricals casts chosen columns to pandas’ categorical dtype. For model length selection, do_cv_nround runs xgb.cv with early stopping and returns the boosting round at which the cross-validated AUC peaks, giving a data-driven estimate of n_estimators. Finally, oof_auc_for_n performs a focused, stratified out-of-fold evaluation for a specific number of estimators and parameter set, enabling a small, reliable micro-sweep around the cross-validated peak to lock in a stable training budget.

In [None]:
# Helpers
def read_data():
    """Load train/test CSVs from Kaggle input path."""
    train = pd.read_csv(TRAIN_PATH)
    test  = pd.read_csv(TEST_PATH)
    return train, test

def target_encoding(train, test, cols, target_col, n_splits=10, seed=42):
    """
    Out-of-fold target mean encoding for leakage-safe training.
    - Uses StratifiedKFold for stable class balance per fold.
    - Applies global mapping to test.
    """
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    te_train = pd.DataFrame(index=train.index)
    te_test  = pd.DataFrame(index=test.index)

    y = train[target_col].values
    for col in cols:
        oof_vals = np.zeros(len(train), dtype=float)
        for tr_idx, va_idx in kf.split(train, y):
            tr_fold = train.iloc[tr_idx]
            va_fold = train.iloc[va_idx]
            mean_map = tr_fold.groupby(col)[target_col].mean()
            oof_vals[va_idx] = va_fold[col].map(mean_map).astype(float)
        te_train[f"mean_{col}"] = oof_vals

        # Global mapping for test rows (no leakage)
        global_map = train.groupby(col)[target_col].mean()
        te_test[f"mean_{col}"] = test[col].map(global_map).astype(float)

    return te_train, te_test

def create_frequency_and_bins(train, test, cols, num_cols):
    """
    Lightweight encodings:
    - Frequency encoding for all columns.
    - Quantile bins (5/10/15) for numeric columns to add coarse order information.
    """
    tr_new = pd.DataFrame(index=train.index)
    te_new = pd.DataFrame(index=test.index)

    for col in cols:
        # Frequency (fallback to mean frequency for unseen test values)
        freq = train[col].value_counts()
        tr_new[f"{col}_freq"] = train[col].map(freq).astype(float)
        te_new[f"{col}_freq"] = test[col].map(freq).astype(float).fillna(freq.mean())

        # Quantile bins only for numeric columns; protect against constant columns
        if col in num_cols:
            for q in [5, 10, 15]:
                try:
                    tr_bins, bins = pd.qcut(train[col], q=q, labels=False, retbins=True, duplicates="drop")
                    tr_new[f"{col}_bin{q}"] = tr_bins.astype(float)
                    te_new[f"{col}_bin{q}"] = pd.cut(test[col], bins=bins, labels=False, include_lowest=True).astype(float)
                except Exception:
                    tr_new[f"{col}_bin{q}"] = 0.0
                    te_new[f"{col}_bin{q}"] = 0.0
    return tr_new, te_new

def enable_categoricals(df, cat_cols):
    """Cast listed columns to pandas 'category' so XGBoost can use enable_categorical=True."""
    for c in cat_cols:
        if df[c].dtype.name != "category":
            df[c] = df[c].astype("category")
    return df

def do_cv_nround(train_df, features, target, base_params):
    """
    Estimate a good number of boosting rounds via xgb.cv with early stopping.
    Returns (best_round, best_auc) based on test-auc-mean peak.
    """
    dtrain = xgb.DMatrix(train_df[features], label=train_df[target], enable_categorical=True)
    cv = xgb.cv(
        params=base_params,
        dtrain=dtrain,
        nfold=N_FOLDS,
        num_boost_round=NUM_BOOST_ROUND,
        metrics='auc',
        verbose_eval=False,
        early_stopping_rounds=EARLY_STOP,
        seed=SEED,
        shuffle=True,
        stratified=True,
    )
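    # idxmax() is the 0-based row of the peak, so the number of trees is that index + 1
    best_idx   = int(cv['test-auc-mean'].idxmax())
    best_round = best_idx + 1
    best_auc   = float(cv['test-auc-mean'].iloc[best_idx])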
    print(f"[CV] Best round: {best_round} | Best CV AUC: {best_auc:.7f}")
    return best_round, best_auc

def oof_auc_for_n(X, y, n_estimators, params):
    """
    Compute OOF AUC for a given n_estimators and params using StratifiedKFold.
    Called by the micro-sweep in main() to refine around the cv-chosen round.
    """
    skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
    oof = np.zeros(len(X), dtype="float32")
    for tr_idx, va_idx in skf.split(X, y):
        X_tr_f, y_tr_f = X.iloc[tr_idx], y[tr_idx]
        X_va_f, y_va_f = X.iloc[va_idx], y[va_idx]
        m = XGBClassifier(**params, n_estimators=n_estimators, enable_categorical=True)
        m.fit(X_tr_f, y_tr_f)
        oof[va_idx] = m.predict_proba(X_va_f)[:, 1].astype("float32")
        del m
        gc.collect()
    return roc_auc_score(y, oof)

<font color = "DeepSkyBlue">**Main**

The main function begins by loading the training and test data, displaying their shapes, and extracting a useful numeric feature from the grade_subgrade field. It then prepares the full feature set by identifying categorical and numeric columns, generating leakage-safe target encodings, and adding frequency and quantile-bin based transformations. After combining these engineered features with the original ones, it removes a small group of columns that previous experimentation showed to be less beneficial, ensures proper categorical typing for XGBoost, and synchronizes the train and test matrices so that both share the same structure.
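
The frequency and quantile-bin transformations mentioned above can be sketched in plain Python. This is an illustrative approximation under stated assumptions: the function names and data are hypothetical, and unlike a pandas-based version that marks out-of-range test values as missing, this sketch places them in the first or last bin.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles


def frequency_encode(train_values, test_values):
    """Map each value to its count in the training column.

    Test values never seen in training get the mean count per
    distinct training value as a fallback.
    """
    counts = Counter(train_values)
    fallback = sum(counts.values()) / len(counts)
    train_enc = [float(counts[v]) for v in train_values]
    test_enc = [float(counts.get(v, fallback)) for v in test_values]
    return train_enc, test_enc


def quantile_bin(train_values, test_values, q):
    """Bin numbers into up to `q` quantile bins with edges taken from train only."""
    # Duplicate cut points are dropped, so skewed columns may get fewer bins.
    cuts = sorted(set(quantiles(train_values, n=q, method="inclusive")))

    def to_bin(x):
        # Right-closed bins: a value equal to a cut stays in the lower bin.
        return bisect_left(cuts, x)

    return [to_bin(x) for x in train_values], [to_bin(x) for x in test_values]


freq_tr, freq_te = frequency_encode(["car", "car", "car", "home"], ["home", "boat"])
print(freq_tr, freq_te)  # [3.0, 3.0, 3.0, 1.0] [1.0, 2.0]

bins_tr, bins_te = quantile_bin([1, 2, 3, 4, 5, 6, 7, 8], [0, 4.5, 100], q=4)
print(bins_tr, bins_te)  # [0, 0, 1, 1, 2, 2, 3, 3] [0, 1, 3]
```

In both functions the statistics come from the training column alone, so the test rows are transformed with exactly the mapping the model saw during training.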

Once the data is ready, the function determines an effective number of boosting rounds by running cross-validation with a representative parameter configuration. It refines this estimate through a short sweep around the cross-validated peak, selecting the value that yields the strongest out-of-fold performance. Using this finalized boosting length, the method trains several stability-oriented model runs that differ only in seed and slight parameter variations, collecting both their test-time predictions and their out-of-fold predictions on the training data.
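
The sweep itself is a simple argmax over a handful of candidates. The sketch below shows that selection logic in isolation; micro_sweep and the score table are hypothetical stand-ins, with the table replacing a real out-of-fold evaluation.

```python
def micro_sweep(best_round, offsets, evaluate):
    """Score `best_round + offset` for each offset and keep the top scorer.

    `evaluate` is any callable mapping n_estimators to a validation score.
    The strict `>` comparison means ties go to the earlier candidate.
    """
    best_n, best_score = None, float("-inf")
    for n in (best_round + off for off in offsets):
        score = evaluate(n)
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score


# Hypothetical score table standing in for a real OOF evaluation.
fake_scores = {300: 0.9210, 310: 0.9213, 320: 0.9213}
best_n, best_score = micro_sweep(300, offsets=(0, 10, 20), evaluate=fake_scores.get)
print(best_n, best_score)  # 310 0.9213
```

Preferring the earlier candidate on ties keeps the model slightly smaller when extra rounds bring no measurable gain.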

From these training-side predictions, the function creates two complementary signals: the average predicted probability and the average rank of the predictions. It then evaluates a small set of mixing weights to identify the most reliable combination of these signals in terms of AUC. Finally, it applies the chosen weight to the test-side predictions, clips them into a valid probability range, saves them as submission.csv, and prints key reference metrics drawn from cross-validation, the sweep, and the final combined output.
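
The blend of mean probability and mean rank can be reproduced on a few rows. The following is a minimal pure-Python sketch, assuming average ranks for ties and a rank-sum formula for AUC; the helper names and the predictions are hypothetical, and the notebook itself uses pandas and scikit-learn for these steps.

```python
def average_ranks(values):
    """1-based ranks where tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formula."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def blend(model_preds, beta):
    """(1 - beta) * mean probability + beta * min-max scaled mean rank."""
    n = len(model_preds[0])
    m = len(model_preds)
    mean_prob = [sum(p[i] for p in model_preds) / m for i in range(n)]
    per_model_ranks = [average_ranks(p) for p in model_preds]
    mean_rank = [sum(r[i] for r in per_model_ranks) / m for i in range(n)]
    lo = min(min(r) for r in per_model_ranks)
    hi = max(max(r) for r in per_model_ranks)
    scaled = [(v - lo) / (hi - lo + 1e-12) for v in mean_rank]
    return [(1 - beta) * p + beta * s for p, s in zip(mean_prob, scaled)]


# Two hypothetical models scoring five rows (made-up numbers).
y = [0, 0, 1, 1, 1]
preds = [[0.10, 0.40, 0.35, 0.80, 0.90],
         [0.20, 0.30, 0.60, 0.70, 0.95]]
beta_grid = [0.20, 0.25, 0.30, 0.35, 0.40]
# On toy data this small the candidates can tie; max() then keeps the first.
best_beta = max(beta_grid, key=lambda b: auc(y, blend(preds, b)))
print(best_beta, auc(y, blend(preds, best_beta)))
```

Since AUC depends only on the ordering of the scores, β changes the result only when the probability ordering and the rank ordering disagree for some pairs of rows, which is more likely on large datasets with several models than in a toy case.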

In [None]:
# Main
def main():
    train, test = read_data()
    print(f"train: {train.shape} | test: {test.shape}")

    # Minimal domain feature: extract numeric subgrade from 'grade_subgrade' (e.g., 'A7' -> 7)
    train['subgrade'] = train['grade_subgrade'].str[1:].astype(int)
    test['subgrade']  = test['grade_subgrade'].str[1:].astype(int)

    # Build feature list excluding ID and target
    base_cols = train.drop(columns=[TARGET, ID_COL]).columns.tolist()

    # Split into categorical vs numeric for later encodings
    cat_cols = [c for c in base_cols if train[c].dtype in ["object", "category"]]
    num_cols = [c for c in base_cols if c not in cat_cols]

    # Leakage-safe target encoding on all base columns (categorical & numeric)
    te_tr, te_te = target_encoding(train, test, base_cols, TARGET, n_splits=10, seed=SEED)

    # Frequency + quantile-bin encodings
    fq_tr, fq_te = create_frequency_and_bins(train, test, base_cols, num_cols)

    # Concatenate original features with encodings
    X_tr = pd.concat([train[base_cols], te_tr, fq_tr], axis=1)
    X_te = pd.concat([test[base_cols],  te_te, fq_te], axis=1)

    # Optional feature drops kept from prior experimentation
    drops = [
        "education_level","loan_purpose","grade_subgrade","interest_rate","marital_status",
        "employment_status_freq", "credit_score_bin5", "loan_amount_bin5", "debt_to_income_ratio_bin5"
    ]
    drops = [d for d in drops if d in X_tr.columns]
    X_tr = X_tr.drop(columns=drops, errors="ignore")
    X_te = X_te.drop(columns=drops, errors="ignore")

    # Ensure XGBoost categorical support by casting objects/categories
    cat_all = [c for c in X_tr.columns if X_tr[c].dtype in ["object","category"]]
    X_tr = enable_categoricals(X_tr, cat_all)
    X_te = enable_categoricals(X_te, cat_all)

    # Align columns between train and test for safety
    common_cols = [c for c in X_tr.columns if c in X_te.columns]
    X_tr = X_tr[common_cols]
    X_te = X_te[common_cols]

    print(f"Final feature count: {X_tr.shape[1]}")

    # Find a good n_estimators via xgb.cv using a representative jitter probe
    base_for_cv = BASE_PARAMS.copy()
    probe = dict(max_leaves=4, min_child_weight=89, reg_alpha=1.4, reg_lambda=5.9)
    base_for_cv.update(probe)
    best_round, best_auc = do_cv_nround(pd.concat([X_tr, train[[TARGET]]], axis=1), common_cols, TARGET, base_for_cv)

    # Micro-sweep around the cv peak to lock n_estimators (best, +10, +20)
    y = train[TARGET].values
    strong = BASE_PARAMS.copy()
    strong.update(probe)
    strong["random_state"] = SEED

    candidates = [best_round, best_round + 10, best_round + 20]
    best_n, best_n_auc = None, -1.0
    print(f"[n-sweep] candidates: {candidates}")
    for n in candidates:
        auc_n = oof_auc_for_n(X_tr[common_cols], y, n_estimators=n, params=strong)
        print(f"  n_estimators={n} -> OOF AUC={auc_n:.6f}")
        if auc_n > best_n_auc:
            best_n_auc = auc_n
            best_n = n

    n_estimators = int(best_n)
    print(f"[n-sweep] chosen n_estimators={n_estimators} | OOF AUC={best_n_auc:.6f}")

    # Train the internal ensemble (same n_estimators, jittered params, different seeds)
    preds = []
    for seed, jitter in zip(ENSEMBLE_SEEDS, JITTERS):
        params = BASE_PARAMS.copy()
        params.update(jitter)
        params["random_state"] = seed

        model = XGBClassifier(
            **params,
            n_estimators=n_estimators,
            enable_categorical=True
        )
        model.fit(X_tr, train[TARGET])
        pred = model.predict_proba(X_te)[:, 1].astype("float32")
        preds.append(pred)
        del model
        gc.collect()

    # Build OOF stack for the same ensemble to tune a simple convex blend (prob vs rank)
    print("[OOF] Building internal OOF for blend beta …")
    skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
    oof_stack = np.zeros((len(train), len(preds)), dtype="float32")

    for fold, (tr_idx, va_idx) in enumerate(skf.split(X_tr, y), 1):
        X_tr_f, y_tr_f = X_tr.iloc[tr_idx], y[tr_idx]
        X_va_f, y_va_f = X_tr.iloc[va_idx], y[va_idx]
        for m_idx, jitter in enumerate(JITTERS):
            params = BASE_PARAMS.copy()
            params.update(jitter)
            params["random_state"] = ENSEMBLE_SEEDS[m_idx]

            m = XGBClassifier(
                **params,
                n_estimators=n_estimators,
                enable_categorical=True
            )
            m.fit(X_tr_f, y_tr_f)
            oof_stack[va_idx, m_idx] = m.predict_proba(X_va_f)[:, 1].astype("float32")
            del m
        auc_fold = roc_auc_score(y_va_f, oof_stack[va_idx].mean(axis=1))
        print(f"  Fold {fold}: mean-prob AUC = {auc_fold:.6f}")
        gc.collect()

    # Tune beta for convex combination of mean probabilities and mean ranks (robustness)
    prob_oof = oof_stack.mean(axis=1)
    ranks = np.column_stack([pd.Series(oof_stack[:, i]).rank(method="average").values for i in range(oof_stack.shape[1])])
    rank_oof = (ranks.mean(axis=1) - ranks.min()) / (ranks.max() - ranks.min() + 1e-12)

    beta_grid = [0.20, 0.25, 0.30, 0.35, 0.40]
    best_beta, best_beta_auc = 0.25, -1.0
    for b in beta_grid:
        mix = (1-b)*prob_oof + b*rank_oof
        auc = roc_auc_score(y, mix)
        if auc > best_beta_auc:
            best_beta_auc = auc
            best_beta = b
    print(f"[Blend] Best beta={best_beta} | OOF AUC={best_beta_auc:.6f}")

    # Apply tuned beta to test predictions and write submission
    prob_test = np.mean(preds, axis=0)
    ranks_te = np.column_stack([pd.Series(preds[i]).rank(method="average").values for i in range(len(preds))])
    rank_test = (ranks_te.mean(axis=1) - ranks_te.min()) / (ranks_te.max() - ranks_te.min() + 1e-12)
    final_pred = (1-best_beta)*prob_test + best_beta*rank_test
    final_pred = np.clip(final_pred, 0.0, 1.0).astype("float32")

    sub = pd.DataFrame({ID_COL: test[ID_COL], TARGET: final_pred})
    sub.to_csv("submission.csv", index=False)
    print("\n[DONE] Wrote submission.csv")
    print(f"CV ref AUC (probe): {best_auc:.6f} | OOF AUC (n-sweep best): {best_n_auc:.6f} | Blend OOF AUC: {best_beta_auc:.6f}")

if __name__ == "__main__":
    main()
