# Tuning an XGBoost Model

The goal of this notebook is to train and evaluate an XGBoost model, comparing its performance on a holdout set against other types of models (LR, SVC, LDA).

To ensure reproducibility and consistent evaluation across models, all datasets were **pre-split into cross-val data and holdout data** as below:

| Split type           | CV training file     | Holdout file              | Description                              |
| -------------------- | -------------------- | ------------------------- | ---------------------------------------- |
| **Random**           | `apps_cv_random.csv` | `apps_holdout_random.csv` | Simple random sampling                   |
| **Stratified**       | `apps_cv_strat.csv`  | `apps_holdout_strat.csv`  | Stratified by `TARGET`                   |
| **Multi-Stratified** | `apps_cv_multi.csv`  | `apps_holdout_multi.csv`  | Stratified by `TARGET` + `CODE_GENDER_M` |

Each cross-validation dataset (`apps_cv_*.csv`) also contains a `fold` column with pre-assigned folds 1-5, created with the corresponding splitting method to ensure consistent evaluation. Therefore, no additional splitting is needed inside this notebook -- we can simply loop through the assigned folds for cross-validation.
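
As a minimal sketch of that loop (the toy DataFrame and the helper name `iter_folds` are illustrative assumptions, not part of the project code), each pre-assigned fold serves once as the validation set while the remaining folds form the training set:

```python
import pandas as pd

def iter_folds(data, fold_col="fold"):
    """Yield (fold_id, train_df, val_df) using pre-assigned fold labels."""
    for f in sorted(data[fold_col].unique()):
        train = data[data[fold_col] != f]   # every other fold
        val = data[data[fold_col] == f]     # the held-out fold
        yield int(f), train, val

# toy stand-in for an apps_cv_*.csv file (hypothetical values)
toy = pd.DataFrame({
    "x": range(10),
    "TARGET": [0, 1] * 5,
    "fold": [1, 2, 3, 4, 5] * 2,
})

# (train rows, validation rows) per fold
fold_sizes = {f: (len(tr), len(va)) for f, tr, va in iter_folds(toy)}
print(fold_sizes)
```

Because the folds are read from the file, every model that uses the same CSV sees exactly the same train/validation partitions.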


In [149]:
import pandas as pd 
import numpy as np 
from xgboost import XGBClassifier
from itertools import product
import time

## Evaluation Functions

#### Metric calculators:

Copied from `cross_val.ipynb`

In [150]:
# METRICS 

def classification_metrics(y_true, y_pred):
    """
    Computes confusion matrix + accuracy, precision, recall, F1, and balanced accuracy.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion matrix components
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    # Metrics
    acc  = (tp + tn) / max((tp + tn + fp + fn), 1)
    prec = tp / max((tp + fp), 1)
    rec  = tp / max((tp + fn), 1)
    f1   = (2 * prec * rec / max((prec + rec), 1e-12)) if (prec + rec) > 0 else 0.0

    # Specificity (True Negative Rate)
    spec = tn / max((tn + fp), 1)

    # Balanced accuracy
    bal_acc = 0.5 * (rec + spec)

    metrics = {
        "n": len(y_true),
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "acc": acc, "bal_acc": bal_acc, "prec": prec, "rec": rec, "spec": spec,
        "f1": f1
    }
    return metrics

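def roc_auc_from_probs(y_true, y_prob):
    """
    Computes ROC-AUC by walking down the predictions sorted by probability
    (highest first) and integrating TPR over FPR with the trapezoid rule.
    Note: tied probabilities are stepped through one row at a time, not grouped.
    """
    # convert first so lists / pandas Series are negated and indexed positionally
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    desc_sort_indices = np.argsort(-y_prob)
    y_true = y_true[desc_sort_indices]
    y_prob = y_prob[desc_sort_indices]
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)

    # running totals for TPR/FPR
    tpr = [0.0]
    fpr = [0.0]
    tp = fp = 0
    for i in range(len(y_true)):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # max(..., 1) avoids division by zero if one class is missing
        tpr.append(tp / max(pos, 1))
        fpr.append(fp / max(neg, 1))

    # area under the (FPR, TPR) curve
    auc = np.trapz(tpr, fpr)
    return auc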
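
The step-wise curve above handles tied probabilities one row at a time, so its value can depend slightly on how ties happen to be ordered. As a cross-check, here is a hedged sketch of the rank-based (Mann-Whitney) form of ROC-AUC, which counts tied positive/negative pairs as half. The function names `average_ranks` and `roc_auc_from_ranks` are assumptions for illustration and are not part of `cross_val.ipynb`:

```python
import numpy as np

def average_ranks(values):
    """1-based ranks where tied values share the average of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    # first index and size of each group of equal values in the sorted array
    _, first_idx, counts = np.unique(sorted_vals, return_index=True, return_counts=True)
    group_rank = first_idx + (counts + 1) / 2.0
    ranks = np.empty(len(values))
    ranks[order] = np.repeat(group_rank, counts)
    return ranks

def roc_auc_from_ranks(y_true, y_prob):
    """ROC-AUC = P(score of a random positive > score of a random negative)."""
    y_true = np.asarray(y_true)
    ranks = average_ranks(y_prob)
    pos = int(np.sum(y_true == 1))
    neg = int(np.sum(y_true == 0))
    if pos == 0 or neg == 0:
        return float("nan")  # AUC is undefined with a single class
    rank_sum_pos = ranks[y_true == 1].sum()
    return float((rank_sum_pos - pos * (pos + 1) / 2.0) / (pos * neg))

auc_no_ties = roc_auc_from_ranks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_all_tied = roc_auc_from_ranks([0, 1], [0.5, 0.5])
print(auc_no_ties, auc_all_tied)
```

When there are no ties, this should agree with the trapezoid version; a large disagreement on real predictions would suggest that many rows share the same predicted probability.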

#### Cross-validation function

Note that in `cross_val.ipynb`, we already assigned folds for each type of splitting to keep comparisons consistent across models. So there will be no explicit splitting in this file; we will just use the folds already created for whichever split type we want to use.

Also, while we will be tuning many xgboost parameters, some will be held fixed for consistency and reproducibility. These include:

- eval_metric='auc' : tells xgboost to report ROC-AUC as its evaluation metric (this does not change the training objective)
- random_state=42 : due to random subsampling in tree building, we fix the random seed to get the same results every run
- n_jobs=-1 : uses all available CPU cores in parallel to speed up training
- tree_method='hist' : a histogram-based algorithm provided by xgboost that can be faster than the exact greedy method with similar performance
- scale_pos_weight = 0.5 * (neg/pos) : corrects for class imbalance by upweighting the minority (positive) class so the model focuses more on it in training. The function below uses half of the full neg/pos ratio; we may change this ratio throughout testing, but it won't be a part of grid searches
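
To make the class-weight arithmetic concrete, here is a small sketch on toy labels. The helper name `half_balanced_weight` and the 8%/92% class mix are assumptions for illustration; the 0.5 factor mirrors the multiplier used in the cross-validation function:

```python
import numpy as np

def half_balanced_weight(y, factor=0.5):
    """Return factor * (neg / pos), guarding against a fold with no positives."""
    y = np.asarray(y)
    pos = int((y == 1).sum())
    neg = int((y == 0).sum())
    return factor * neg / max(pos, 1)

# toy labels: 8 positives, 92 negatives (hypothetical class mix)
y_toy = np.array([1] * 8 + [0] * 92)
weight = half_balanced_weight(y_toy)
print(weight)  # full ratio would be 92 / 8 = 11.5; half of it is 5.75
```

A factor of 1.0 would fully rebalance the classes; a smaller factor upweights positives less aggressively.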

In [151]:
def cv_xgb(data, feature_cols, target_col, params=None):
    
    if params is None:
        params = {}

    fold_metrics = []
    fold_preds = []
    for f in data.fold.unique():

        # split into train and test based on folds
        train = data[data.fold != f]
        test = data[data.fold == f]
        X_train, y_train = train[feature_cols], train[target_col]
        X_test, y_test = test[feature_cols], test[target_col]

        # calculate counts for class weighting
        pos = (y_train == 1).sum()
        neg = (y_train == 0).sum()
        balanced_weight = (neg / max(pos, 1)) * 0.5

        # fit model with specified params
        model = XGBClassifier(eval_metric='auc', 
                              random_state=42, 
                              n_jobs=-1, 
                              tree_method='hist',
                              scale_pos_weight=balanced_weight,
                              **params
                            )
        model.fit(X_train, y_train)
        
        # get predictions (probabilities and decisions)
        y_prob = model.predict_proba(X_test)[:, 1]
        y_train_prob = model.predict_proba(X_train)[:,1] # make training predictions as well to assess overfit
        y_pred = (y_prob >= 0.5).astype(int)

        # calculate classification metrics from previously defined functions
        metrics = classification_metrics(y_test, y_pred)
        metrics['roc_auc'] = roc_auc_from_probs(y_test, y_prob)
        metrics['train_roc_auc'] = roc_auc_from_probs(y_train, y_train_prob)
        metrics['fold'] = int(f)

        # add to list of all fold metrics
        fold_metrics.append(metrics)

        # store raw fold predictions for later threshold tuning
        fold_preds.append(pd.DataFrame({
            "fold": f,
            "y_true": y_test.values,
            "y_prob": y_prob
        }))

    # combine fold predictions into one long DataFrame
    preds_df = pd.concat(fold_preds, ignore_index=True)

    # summary DataFrame for quick metrics view
    metrics_df = pd.DataFrame(fold_metrics).sort_values("fold").reset_index(drop=True)

    # return results in dataframe
    return metrics_df, preds_df

#### A grid-search function:

Tests every combination of hyperparameters -- very slow/inefficient, so consider size of grid.
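
To get a feel for how quickly a grid grows, the sketch below expands a grid with `itertools.product` and counts the combinations. The grid values are the ones used in the first tuning pass of this notebook; the helper names and the 40 seconds per cross-validated run are assumptions for illustration only:

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(param_grid)
    return [dict(zip(keys, values)) for values in product(*param_grid.values())]

def estimate_minutes(n_combos, seconds_per_run):
    """Very rough runtime estimate: combinations times seconds per CV run."""
    return n_combos * seconds_per_run / 60

first_pass_grid = {
    "learning_rate": [0.05],
    "n_estimators": [500],
    "max_depth": [3, 4, 5, 6, 7],
    "min_child_weight": [1, 3, 5, 8, 12],
    "subsample": [0.7, 0.85, 1.0],
    "colsample_bytree": [0.7, 0.85, 1.0],
}

combos = expand_grid(first_pass_grid)
n_combos = len(combos)                       # 1 * 1 * 5 * 5 * 3 * 3
est_minutes = estimate_minutes(n_combos, 40)  # 40 s per run is an assumed figure
print(n_combos, est_minutes)
```

Adding one more value to any list multiplies the total, so it is worth checking the count before launching a search.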

In [152]:
def grid_search_xgb(data, feature_cols, target_col, param_grid):
    
    # get all possible combinations of parameters
    keys = list(param_grid.keys())
    combos = [dict(zip(keys, v)) for v in product(*param_grid.values())]

    # initialize stuff for tracking and results
    results = []
    total = len(combos)
    start = time.time()
    next_checkpoint = 5 
    best_roc_auc = 0
    best_params = None

    # evaluate every possible combo
    for i, params in enumerate(combos, 1):

        # run cross validation and store results
        metrics, preds = cv_xgb(data, feature_cols, target_col, params)
        mean_roc_auc = metrics["roc_auc"].mean()

        results.append({
            'params': params,
            'mean_roc_auc': mean_roc_auc,
            'mean_f1': metrics['f1'].mean(),
            'mean_acc': metrics['acc'].mean(),
            'mean_bal_acc': metrics['bal_acc'].mean(),
            'mean_prec': metrics['prec'].mean(),
            'mean_rec': metrics['rec'].mean(),
        })

        # tracker for updates
        if mean_roc_auc > best_roc_auc:
            best_roc_auc = mean_roc_auc
            best_params = params

        # print progress checkpoints
        pct_done = (i/total)*100
        elapsed = time.time() - start
        if pct_done >= next_checkpoint or i == total:
            print(f"{i}/{total} ({pct_done:5.1f}% in {elapsed/60:.1f} mins) | Best ROC-AUC: {best_roc_auc:.4f} | Best Params: {best_params}")
            next_checkpoint += 5


    
    return pd.DataFrame(results).sort_values("mean_roc_auc", ascending=False).reset_index(drop=True)

# Model Development

**Notes:** 
- All evaluation will focus on stratified cross-validation, but we will test the other methods as well. 
- Recall that folds have been pre-assigned to ensure consistency across different model development processes
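- For our other models, we have decided to scale + PCA, but this is not necessary for nonlinear tree-based algorithms like XGBoost
    - scaling does not change tree splits and PCA can obscure the original features, so these steps offer little benefit and may hurt XGBoost; we will not use them here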

**Process:**
1. Setting a baseline
    - evaluating an xgb model with all default parameters to build off of
2. Hyperparameter tuning
    - evaluate many different combinations of parameters
    - choose the best set based on average ROC-AUC across all folds
3. Holdout evaluation
    - evaluate on the corresponding holdout table. the performance here is what we will compare with other models (LR, SVC, LDA)
4. Threshold tuning
    - tweak the threshold on the best model to maximize another chosen metric (recall, precision, f1, balanced accuracy, etc.) 
        - note that roc-auc is not affected by threshold, hence the need for a different optimizing metric
    - what metric we choose to optimize with threshold depends on business needs
        - consider the cost of mislabeling someone as high risk versus trusting an applicant that you shouldn't. will there be human review?
        - something we can include in the write-up as optionality moving forward, not something we have to decide now on our own
        - "our model is very solid at ranking applicants from low-risk to high-risk, but in terms of actual classification, we can move the threshold based on what matters most to the business"
5. Fairness evaluation
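
For step 4, a threshold sweep could look like the following sketch. F1 is used only as an example metric, and the function names, threshold grid, and toy predictions are assumptions; in practice the inputs would be the `y_true` and `y_prob` columns of the out-of-fold predictions returned by the cross-validation function:

```python
import numpy as np

def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score of the 0/1 decisions produced by a given probability cutoff."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def best_threshold(y_true, y_prob, thresholds=None):
    """Return (threshold, score) for the first threshold with the highest F1."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    scores = [f1_at_threshold(y_true, y_prob, t) for t in thresholds]
    best_idx = int(np.argmax(scores))
    return float(thresholds[best_idx]), scores[best_idx]

# toy predictions: the model ranks both positives highest,
# but only one of them clears the default 0.5 cutoff
y_true_toy = np.array([0, 0, 0, 1, 1])
y_prob_toy = np.array([0.10, 0.20, 0.28, 0.36, 0.90])

default_f1 = f1_at_threshold(y_true_toy, y_prob_toy, 0.5)
tuned_threshold, tuned_f1 = best_threshold(y_true_toy, y_prob_toy)
print(default_f1, tuned_threshold, tuned_f1)
```

Swapping `f1_at_threshold` for recall, precision, or balanced accuracy changes which cutoff wins, which is why the choice is left to business needs.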

## 0. Setup

Recall that we have different datasets for each type of cross validation. We are currently focusing on the stratified splitting method, but random and multiple-stratification methods are available for testing/comparison.

In [157]:
apps_cv_strat = pd.read_csv("data/apps_cv_strat.csv")
apps_holdout_strat = pd.read_csv("data/apps_holdout_strat.csv")
target_col = 'TARGET'
feature_cols = [col for col in apps_cv_strat.columns if col not in 
                [target_col, 'SK_ID_CURR', 'fold', 'neighbors_target_mean_500', 'AGE_INT', 'CODE_GENDER_M',
                 'CODE_GENDER_XNA', 'DAYS_BIRTH',
                 'NAME_FAMILY_STATUS_Previously Married', 'NAME_FAMILY_STATUS_Single']]

Even though we aren't doing PCA and xgboost handles correlated/unnecessary features well, we can still simplify our model a bit by removing some of them. It may help us run faster and make feature importance easier to assess at the end. This may or may not be used depending on results, but I'll make the set now to keep as an option.

In [6]:
corr = apps_cv_strat[feature_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

print(f"Dropping {len(to_drop)} highly correlated features")

feature_cols_pruned = [f for f in feature_cols if f not in to_drop]

Dropping 35 highly correlated features


## 1. Setting a Baseline

Fitting an XGBoost model with default parameters to understand baseline predictive power and what we can build on.

In [158]:
baseline_metrics, baseline_preds = cv_xgb(apps_cv_strat, feature_cols, target_col, params=None) # send no parameters
baseline_metrics

Unnamed: 0,n,tp,tn,fp,fn,acc,bal_acc,prec,rec,spec,f1,roc_auc,train_roc_auc,fold
0,49156,1603,40700,4487,2366,0.860587,0.652291,0.263218,0.40388,0.900702,0.31872,0.767795,0.906028,1
1,49156,1621,40578,4609,2348,0.858471,0.653208,0.260193,0.408415,0.898002,0.317874,0.766144,0.903025,2
2,49156,1523,40422,4765,2446,0.853304,0.639137,0.242207,0.383724,0.894549,0.296968,0.749658,0.903107,3
3,49155,1563,40575,4611,2406,0.857247,0.645879,0.253158,0.393802,0.897955,0.308193,0.762121,0.905266,4
4,49154,1597,40603,4583,2371,0.858526,0.650522,0.258414,0.40247,0.898575,0.314742,0.759541,0.904998,5


Precision, recall, and F1 are all quite low, while ROC-AUC is relatively strong. This indicates that the model does a good job ranking applicants from low-risk to high-risk (it tends to assign higher probabilities to actual positives), but the 0/1 decisions at the default threshold of 0.5 are poor, likely due to class imbalance. Also, the training ROC-AUC (~0.90) is well above the validation ROC-AUC (~0.76), which indicates that the default model is overfitting.

Therefore, we expect significant improvement after threshold tuning at the end. This is also why we focus on ROC-AUC during hyperparameter tuning -- as long as the model’s ability to rank cases is good (high ROC-AUC), we can later adjust the threshold to manipulate precision/recall/F1.

## 2. Hyperparameter Tuning

Now, finding hyperparameters that maximize ROC-AUC (while keeping an eye on other metrics). 

XGBoost has a lot of parameters, many of which have a big impact on predictions, so selecting the best ones is more complicated than for the other (linear) models. A simple grid search over a huge parameter grid would take forever, considering that just one cross-validated run of the baseline model took 15+ seconds.

Therefore, a smart approach may be to do a multi-step grid search. There are two ways we could do this:

1. Tree structure --> learning dynamics
    - first a grid search on parameters that affect the shape of the trees essentially (max_depth, subsample, etc.)
    - then a grid search on parameters that affect how the model learns (learning_rate, n_estimators, etc.)
2. Wide range, big step size --> small range, small step size
    - a grid search on *all* parameters with a very wide range for each parameter
    - then a more granular search on the best parameters to find exact optimal values

We will start with the first structural approach, but then may incorporate the second approach at some points as well. 
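
The second (wide-then-narrow) approach amounts to building a small grid centered on the best value from the wide search. A minimal sketch, assuming a hypothetical helper `narrow_grid` that is not part of this notebook's functions:

```python
def narrow_grid(best, step, n_steps=1, ndigits=6):
    """Return values centered on `best`, spaced by `step` on each side.

    Rounding removes floating-point noise such as 0.019999999999999997.
    """
    return [round(best + k * step, ndigits) for k in range(-n_steps, n_steps + 1)]

# example centers and step sizes (illustrative values)
lr_values = narrow_grid(0.03, 0.01)
tree_counts = narrow_grid(1500, 150)
alpha_values = narrow_grid(0.5, 0.1, n_steps=2)
print(lr_values, tree_counts, alpha_values)
```

The step size should be smaller than the spacing of the wide grid; otherwise the narrow pass just re-tests values that were already evaluated.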

### First Pass:

Optimizing tree structure.

In [32]:
param_grid = {

    # fix learning dynamics this pass
    "learning_rate": [0.05], # low learning rate at first to establish reliable tree structure
    "n_estimators": [500], 

    # tree structure parameters to test
    "max_depth": [3, 4, 5, 6, 7],
    "min_child_weight": [1, 3, 5, 8, 12],
    "subsample": [0.7, 0.85, 1.0],
    "colsample_bytree": [0.7, 0.85, 1.0]
}

search1_results = grid_search_xgb(apps_cv_strat, feature_cols, target_col, param_grid)

12/225 (  5.3% in 7.5 mins) | Best ROC-AUC: 0.7790 | Best Params: {'learning_rate': 0.05, 'n_estimators': 500, 'max_depth': 3, 'min_child_weight': 3, 'subsample': 0.7, 'colsample_bytree': 0.85}
23/225 ( 10.2% in 14.7 mins) | Best ROC-AUC: 0.7790 | Best Params: {'learning_rate': 0.05, 'n_estimators': 500, 'max_depth': 3, 'min_child_weight': 3, 'subsample': 0.7, 'colsample_bytree': 0.85}
34/225 ( 15.1% in 21.5 mins) | Best ROC-AUC: 0.7791 | Best Params: {'learning_rate': 0.05, 'n_estimators': 500, 'max_depth': 3, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 0.85}
45/225 ( 20.0% in 27.5 mins) | Best ROC-AUC: 0.7791 | Best Params: {'learning_rate': 0.05, 'n_estimators': 500, 'max_depth': 3, 'min_child_weight': 12, 'subsample': 0.7, 'colsample_bytree': 0.85}
57/225 ( 25.3% in 34.6 mins) | Best ROC-AUC: 0.7810 | Best Params: {'learning_rate': 0.05, 'n_estimators': 500, 'max_depth': 4, 'min_child_weight': 3, 'subsample': 0.7, 'colsample_bytree': 0.7}
68/225 ( 30.2% in 41.3 min

In [34]:
# save so we dont have to run it all again
search1_results.to_csv('results/xgb_search1_results.csv')

In [47]:
search1_results.sort_values(by='mean_roc_auc', ascending=False)[
    ['params', 'mean_roc_auc', 'mean_prec', 'mean_rec', 'mean_f1']].head(10)

Unnamed: 0,params,mean_roc_auc,mean_prec,mean_rec,mean_f1
0,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781358,0.184567,0.68585,0.290855
1,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781235,0.192193,0.661913,0.29788
2,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781106,0.184413,0.684237,0.290517
3,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781089,0.184827,0.684439,0.291051
4,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781088,0.19236,0.66151,0.29804
5,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781088,0.184615,0.684943,0.290832
6,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.78108,0.1843,0.686051,0.290543
7,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.78105,0.191841,0.664533,0.297723
8,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781029,0.184236,0.683632,0.290245
9,"{'learning_rate': 0.05, 'n_estimators': 500, '...",0.781008,0.184761,0.684539,0.290977


It looks like we found decent lift (+0.02 ROC-AUC over the baseline) from tuning the tree structure. We also see virtually identical scores at the top, not just one outlier, which is a good sign that we found a stable optimal region. If these top scores also share similar parameters, that would be further evidence:

In [58]:
params_df = search1_results['params'].apply(pd.Series)
results = pd.concat([search1_results.drop(columns='params'), params_df], axis=1)
results.sort_values(by='mean_roc_auc', ascending=False)[
    ['mean_roc_auc', 'max_depth', 'min_child_weight', 'subsample', 'colsample_bytree']].head()

Unnamed: 0,mean_roc_auc,max_depth,min_child_weight,subsample,colsample_bytree
0,0.781358,4.0,8.0,0.7,0.7
1,0.781235,5.0,8.0,0.85,1.0
2,0.781106,4.0,12.0,0.7,1.0
3,0.781089,4.0,12.0,0.85,1.0
4,0.781088,5.0,8.0,0.7,1.0


The top 5 models by ROC-AUC differ only slightly in their parameters. The optimal region occurs with moderate `max_depth` (4–5), high `min_child_weight` (8–12), and `subsample` between 0.7–0.85, which suggests the result is reproducible and not driven by one lucky configuration. We also don't see any parameter pinned to a single edge of the grid across all top performers (for example, all of the best models at the highest depth), so there is no strong reason to believe we are missing anything beyond our grid.

These parameters will be fixed while tuning learning dynamics in the next phase:

- max_depth = 4
- min_child_weight = 8
- subsample = 0.7
- colsample_bytree = 1

These were the parameters of the best model by ROC-AUC, *except* for `colsample_bytree` because four out of the top five models had this parameter set to 1, so I felt it was safest to keep it as that, even though the #1 model was 0.7.

### Second Pass: Learning Dynamics

Now that we have a reliable tree structure, we will focus on the parameters that affect how the model learns.

I think in this case it makes more sense to do the wide-then-narrow search approach because the learning parameters are continuous and small changes can make a big difference. Therefore, we will first find the appropriate scale of each parameter, and then do a narrow search over a smaller range to find more exact optimal parameters. Also, the difference from the best to 2nd (or even 5th) best is so small it's negligible.

**Finding Scale of Learning Dynamics:**

This is a bigger grid than before, so it will take much longer. One option is to only run 'reasonable' learning_rate + n_estimators pairings -- low learning rates are very likely to underfit with a low number of estimators (trees) because the model can't converge in so few boosting rounds, so you could skip the pairings you think are 'unreasonable'.

However, I'm going to run this overnight either way and hopefully won't need to run again, so why not be thorough with the full grid. 
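
For reference, the pairing idea mentioned above could be sketched as a filter on the product `learning_rate * n_estimators`. This is a rough heuristic, and the bounds of 10 and 70 as well as the helper name `reasonable_pairs` are assumptions chosen purely for illustration, not values used in this notebook:

```python
from itertools import product

def reasonable_pairs(learning_rates, n_estimators, min_budget=10, max_budget=70):
    """Keep (learning_rate, n_estimators) pairs whose product falls in a band.

    A tiny product suggests too few small steps to converge (underfit);
    a huge one suggests many large steps (slow, and more room to overfit).
    """
    pairs = []
    for lr, n in product(learning_rates, n_estimators):
        budget = lr * n
        if min_budget <= budget <= max_budget:
            pairs.append((lr, n))
    return pairs

learning_rates = [0.01, 0.03, 0.05, 0.07, 0.10]
tree_counts = [300, 500, 800, 1200]

pairs = reasonable_pairs(learning_rates, tree_counts)
print(len(pairs), "of", len(learning_rates) * len(tree_counts), "pairings kept")
```

Note that `grid_search_xgb` takes a full grid, so using such a filter would require looping over the kept pairs and calling it (or `cv_xgb`) once per pair.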

In [59]:
param_grid = {

    # fix tree structure to optimal parameters found in pass 1
    "max_depth": [4],
    "min_child_weight": [8],
    "subsample": [0.7],
    "colsample_bytree": [1],

    # wide grid to find scale of learning dynamics
    "learning_rate": [0.01, 0.03, 0.05, 0.07, 0.10],
    "n_estimators": [300, 500, 800, 1200],
    "gamma": [0.0, 0.1, 0.2, 0.3],
    "reg_alpha": [0.0, 0.5, 1.0],
    "reg_lambda": [1.0, 2.0, 5.0]
}

search2_results = grid_search_xgb(apps_cv_strat, feature_cols, target_col, param_grid)

36/720 (  5.0% in 20.0 mins) | Best ROC-AUC: 0.7565 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.01, 'n_estimators': 300, 'gamma': 0.0, 'reg_alpha': 0.0, 'reg_lambda': 1.0}
72/720 ( 10.0% in 46.7 mins) | Best ROC-AUC: 0.7660 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.01, 'n_estimators': 500, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 1.0}
108/720 ( 15.0% in 82.6 mins) | Best ROC-AUC: 0.7727 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.01, 'n_estimators': 800, 'gamma': 0.3, 'reg_alpha': 0.5, 'reg_lambda': 1.0}
144/720 ( 20.0% in 134.2 mins) | Best ROC-AUC: 0.7772 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.01, 'n_estimators': 1200, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 5.0}
180/720 ( 25.0% in 149.6

In [60]:
# save so we dont have to run it all again
search2_results.to_csv('results/xgb_search2_results.csv')

In [61]:
params_df = search2_results['params'].apply(pd.Series)
results = pd.concat([search2_results.drop(columns='params'), params_df], axis=1)
results.sort_values(by='mean_roc_auc', ascending=False)[
    ['mean_roc_auc', 'learning_rate', 'n_estimators', 'gamma', 'reg_alpha', 'reg_lambda']].head(10)

Unnamed: 0,mean_roc_auc,learning_rate,n_estimators,gamma,reg_alpha,reg_lambda
0,0.783057,0.03,1200.0,0.0,0.5,5.0
1,0.782981,0.03,1200.0,0.3,0.5,5.0
2,0.782968,0.03,1200.0,0.1,0.5,5.0
3,0.782923,0.03,1200.0,0.2,0.5,5.0
4,0.782863,0.05,800.0,0.1,0.0,5.0
5,0.782842,0.03,1200.0,0.2,0.5,1.0
6,0.782842,0.05,800.0,0.0,0.0,5.0
7,0.782786,0.03,1200.0,0.3,0.5,1.0
8,0.782783,0.05,800.0,0.2,0.0,5.0
9,0.782778,0.05,800.0,0.3,0.0,5.0


We see some lift again (+0.002 ROC-AUC), but much smaller than in the last pass. This is expected, as we were just hoping to squeeze out a few more points at this stage. Additionally, since all of the top scorers have almost identical performance, we are likely close to optimal learning dynamics.

However, I notice that `n_estimators` and `reg_lambda` consistently sit on the upper edge of our parameter range, so before doing our narrow search, I want to expand the range of these a bit to see if they continue to do better at larger values. Also, `gamma` does not seem to have an optimal value at all, so I'll test a wider range for that as well, though I suspect it may not have a real impact on our trees.

Note that increasing `n_estimators` has the biggest impact on speed, so that is something to consider if speed is required for this model -- but for now, we are just looking for best performance.
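
The edge check described above can also be done programmatically. The sketch below uses the wide grid and the top four rows of the results table from this pass; the helper name `params_on_edge` is an assumption for illustration:

```python
def params_on_edge(top_rows, grid):
    """Flag parameters whose top results all share one value at a grid boundary."""
    flagged = {}
    for name, values in grid.items():
        top_values = {row[name] for row in top_rows}
        if len(values) < 2 or len(top_values) != 1:
            continue  # fixed parameter, or the top results disagree on it
        value = next(iter(top_values))
        if value == min(values):
            flagged[name] = "lower"
        elif value == max(values):
            flagged[name] = "upper"
    return flagged

wide_grid = {
    "learning_rate": [0.01, 0.03, 0.05, 0.07, 0.10],
    "n_estimators": [300, 500, 800, 1200],
    "gamma": [0.0, 0.1, 0.2, 0.3],
    "reg_alpha": [0.0, 0.5, 1.0],
    "reg_lambda": [1.0, 2.0, 5.0],
}

# top four rows of the wide learning-dynamics search
top4 = [
    {"learning_rate": 0.03, "n_estimators": 1200, "gamma": g, "reg_alpha": 0.5, "reg_lambda": 5.0}
    for g in (0.0, 0.3, 0.1, 0.2)
]

edges = params_on_edge(top4, wide_grid)
print(edges)
```

A flagged parameter is a hint to extend the grid in that direction; it does not by itself prove that larger values will do better.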

In [62]:
param_grid = {

    # fix tree structure to optimal parameters found in pass 1
    "max_depth": [4],
    "min_child_weight": [8],
    "subsample": [0.7],
    "colsample_bytree": [1],

    "learning_rate": [0.03], # fixed, stable
    "n_estimators": [1200, 1500, 1800, 2000], # expanding higher
    "gamma": [0.0, 0.2, 0.4, 0.6], # trying some other values 
    "reg_alpha": [0.5], # fixed, stable
    "reg_lambda": [5.0, 7.5, 10.0, 15.0] # expanding higher
}

search2_expanded_results = grid_search_xgb(apps_cv_strat, feature_cols, target_col, param_grid)

4/64 (  6.2% in 6.6 mins) | Best ROC-AUC: 0.7833 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1200, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 15.0}
7/64 ( 10.9% in 10.6 mins) | Best ROC-AUC: 0.7833 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1200, 'gamma': 0.0, 'reg_alpha': 0.5, 'reg_lambda': 15.0}
10/64 ( 15.6% in 14.9 mins) | Best ROC-AUC: 0.7833 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1200, 'gamma': 0.2, 'reg_alpha': 0.5, 'reg_lambda': 15.0}
13/64 ( 20.3% in 19.0 mins) | Best ROC-AUC: 0.7834 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1200, 'gamma': 0.4, 'reg_alpha': 0.5, 'reg_lambda': 15.0}
16/64 ( 25.0% in 23.0 mins)

In [81]:
# save so we dont have to run it all again
search2_expanded_results.to_csv('results/xgb_search2_expanded_results.csv')

In [64]:
params_df = search2_expanded_results['params'].apply(pd.Series)
results = pd.concat([search2_expanded_results.drop(columns='params'), params_df], axis=1)
results.sort_values(by='mean_roc_auc', ascending=False)[
    ['mean_roc_auc', 'n_estimators', 'gamma', 'reg_lambda']].head(10)

Unnamed: 0,mean_roc_auc,n_estimators,gamma,reg_lambda
0,0.783527,1500.0,0.6,15.0
1,0.783494,1500.0,0.4,15.0
2,0.78341,1500.0,0.0,15.0
3,0.783397,1500.0,0.2,15.0
4,0.783394,1200.0,0.4,15.0
5,0.783341,1200.0,0.6,15.0
6,0.783317,1800.0,0.4,15.0
7,0.783302,1200.0,0.2,15.0
8,0.783298,1800.0,0.6,15.0
9,0.783287,1500.0,0.0,10.0


- `n_estimators` seems optimal at 1500
- `gamma` does not appear to affect the model, probably because we have so much regularization elsewhere that it doesn't matter much
- `reg_lambda` is still at the top of our range, so let's do one final sweep to see if any more performance can be squeezed out before diminishing returns

A final search for `reg_lambda` scale before narrowing all parameters further:

In [75]:
param_grid = {

    # fix tree structure to optimal parameters found in pass 1
    "max_depth": [4],
    "min_child_weight": [8],
    "subsample": [0.7],
    "colsample_bytree": [1],

    "learning_rate": [0.03], # fixed, stable
    "n_estimators": [1500], # fixed, stable
    "gamma": [1], # fixed, not very useful
    "reg_alpha": [0.5], # fixed, stable
    "reg_lambda": [10, 15, 20, 25, 30] # further expansion
}

reg_lambda_expansion = grid_search_xgb(apps_cv_strat, feature_cols, target_col, param_grid)

1/5 ( 20.0% in 2.2 mins) | Best ROC-AUC: 0.7831 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1500, 'gamma': 1, 'reg_alpha': 0.5, 'reg_lambda': 10}
2/5 ( 40.0% in 4.0 mins) | Best ROC-AUC: 0.7836 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1500, 'gamma': 1, 'reg_alpha': 0.5, 'reg_lambda': 15}
3/5 ( 60.0% in 5.7 mins) | Best ROC-AUC: 0.7836 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1500, 'gamma': 1, 'reg_alpha': 0.5, 'reg_lambda': 15}
4/5 ( 80.0% in 7.4 mins) | Best ROC-AUC: 0.7836 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.03, 'n_estimators': 1500, 'gamma': 1, 'reg_alpha': 0.5, 'reg_lambda': 15}
5/5 (100.0% in 9.2 mins) | Best ROC-AUC: 0.7836 | Be

In [None]:
# save so we dont have to run it all again
reg_lambda_expansion.to_csv('results/xgb_reg_lambda_expansion_results.csv')

In [76]:
params_df = reg_lambda_expansion['params'].apply(pd.Series)
results = pd.concat([reg_lambda_expansion.drop(columns='params'), params_df], axis=1)
results.sort_values(by='mean_roc_auc', ascending=False)[
    ['mean_roc_auc', 'reg_lambda']].head(10)

Unnamed: 0,mean_roc_auc,reg_lambda
0,0.783609,15.0
1,0.783512,25.0
2,0.783464,20.0
3,0.78345,30.0
4,0.783123,10.0


It looks like we do peak at `reg_lambda`=15 before flattening out. 

I think we have found the appropriate scale for each of our learning dynamics hyperparameters:

- `learning_rate` = 0.03
- `n_estimators` = 1500
- `gamma` = 1
- `reg_alpha` = 0.5
- `reg_lambda` = 15

Now, a refined, narrow search around these parameters. Not expecting a huge lift here.

**Finding exact optimal parameters:**

We have already determined that `gamma` does not affect much and that `reg_lambda` plateaus at 15, so these do not need to be further fine-tuned. Changes due to `n_estimators` tend to be pretty smooth, and we already tested with a step size of 300, so we can't expect too much of a difference, but a smaller step size could be worth trying. `learning_rate` and `reg_alpha` will be the primary parameters to refine.

In [82]:
param_grid = {

    # fix tree structure to optimal parameters found in pass 1
    "max_depth": [4],
    "min_child_weight": [8],
    "subsample": [0.7],
    "colsample_bytree": [1],

    # very narrow search over the parameters we just found
    "learning_rate": [0.02, 0.03, 0.04], 
    "n_estimators": [1350, 1500, 1650], 
    "gamma": [1], # fixed, not very useful
    "reg_alpha": [0.3, 0.4, 0.5, 0.6, 0.7],
    "reg_lambda": [15] 
}

narrow_learning_search = grid_search_xgb(apps_cv_strat, feature_cols, target_col, param_grid)

3/45 (  6.7% in 5.5 mins) | Best ROC-AUC: 0.7828 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.02, 'n_estimators': 1350, 'gamma': 1, 'reg_alpha': 0.5, 'reg_lambda': 15}
5/45 ( 11.1% in 9.2 mins) | Best ROC-AUC: 0.7828 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.02, 'n_estimators': 1350, 'gamma': 1, 'reg_alpha': 0.5, 'reg_lambda': 15}
7/45 ( 15.6% in 12.7 mins) | Best ROC-AUC: 0.7832 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.02, 'n_estimators': 1500, 'gamma': 1, 'reg_alpha': 0.4, 'reg_lambda': 15}
9/45 ( 20.0% in 16.3 mins) | Best ROC-AUC: 0.7832 | Best Params: {'max_depth': 4, 'min_child_weight': 8, 'subsample': 0.7, 'colsample_bytree': 1, 'learning_rate': 0.02, 'n_estimators': 1500, 'gamma': 1, 'reg_alpha': 0.5, 'reg_lambda': 15}
12/45 ( 26.7% in 21.6 mins) | Best ROC-AUC: 0.

In [83]:
# save so we dont have to run it all again
narrow_learning_search.to_csv('results/xgb_narrow_learning_search_results.csv')

In [85]:
params_df = narrow_learning_search['params'].apply(pd.Series)
results = pd.concat([narrow_learning_search.drop(columns='params'), params_df], axis=1)
results.sort_values(by='mean_roc_auc', ascending=False)[
    ['mean_roc_auc', 'learning_rate', 'n_estimators', 'reg_alpha']].head(10)

Unnamed: 0,mean_roc_auc,learning_rate,n_estimators,reg_alpha
0,0.783621,0.03,1350.0,0.5
1,0.783609,0.03,1500.0,0.5
2,0.783561,0.03,1650.0,0.5
3,0.783549,0.03,1350.0,0.3
4,0.783547,0.03,1500.0,0.3
5,0.783544,0.03,1500.0,0.7
6,0.783534,0.03,1350.0,0.7
7,0.783519,0.02,1650.0,0.5
8,0.783482,0.03,1500.0,0.6
9,0.783476,0.03,1350.0,0.6


### Confirmation

Now I just want to confirm that these results are stable, not overfitting, and safe from leakage. 

**Stability across folds:**

In [159]:
tuned_params = {

    # tree structure
    "max_depth": 4,
    "min_child_weight": 8,
    "subsample": 0.7,
    "colsample_bytree": 1,

    # learning dynamics
    "learning_rate": 0.03, 
    "n_estimators": 1350, 
    "gamma": 1,
    "reg_alpha": 0.5,
    "reg_lambda": 15 

    # rest are already set in cv function
}

fold_metrics, preds = cv_xgb(apps_cv_strat, feature_cols, target_col, tuned_params)
fold_metrics

Unnamed: 0,n,tp,tn,fp,fn,acc,bal_acc,prec,rec,spec,f1,roc_auc,train_roc_auc,fold
0,49156,1775,40409,4778,2194,0.858166,0.670739,0.270868,0.447216,0.894262,0.337388,0.784749,0.840518,1
1,49156,1804,40240,4947,2165,0.855318,0.672522,0.26722,0.454523,0.890522,0.336567,0.783701,0.840971,2
2,49156,1695,40200,4987,2274,0.852287,0.658348,0.253667,0.42706,0.889636,0.31828,0.770565,0.842926,3
3,49155,1772,40178,5008,2197,0.853423,0.667815,0.261357,0.44646,0.889169,0.329705,0.782082,0.841722,4
4,49154,1765,40352,4834,2203,0.856838,0.668914,0.267465,0.444808,0.89302,0.334059,0.777351,0.841928,5


In [160]:
print(f"Mean AUC: {fold_metrics.roc_auc.mean():.6f}")
print(f"Std AUC: {fold_metrics.roc_auc.std():.6f}")

Mean AUC: 0.779690
Std AUC: 0.005834


Performance appears to be very stable: the standard deviation of ROC-AUC across folds is only about 0.006, meaning that the average score isn't carried by just one fold getting 'lucky'. We can expect consistent performance from our model.

**Overfit check:**

We want to see a minimal gap between training and validation ROC-AUC.

In [161]:
print("Mean Train AUC:", fold_metrics.train_roc_auc.mean())
print("Mean Val AUC:", fold_metrics.roc_auc.mean())
print("Gap:", fold_metrics.train_roc_auc.mean() - fold_metrics.roc_auc.mean())

Mean Train AUC: 0.8416130608394795
Mean Val AUC: 0.7796898608913626
Gap: 0.0619231999481169


There is some overfitting (a train/validation gap of about 0.06), though the stability across folds suggests it is not hurting generalization much. We can try adding regularization and simplifying the trees to shrink that gap while keeping validation performance high.

In [162]:
final_params = {

    # tree structure
    "max_depth": 4,
    "min_child_weight": 8,
    "subsample": 0.5,
    "colsample_bytree": 0.7,

    # learning dynamics
    "learning_rate": 0.03, 
    "n_estimators": 1000, 
    "gamma": 2,
    "reg_alpha": 0.5,
    "reg_lambda": 20 

    # rest are already set in cv function
}
fold_metrics, preds = cv_xgb(apps_cv_strat, feature_cols, target_col, final_params)
print("Mean Train AUC:", fold_metrics.train_roc_auc.mean())
print("Mean Val AUC:", fold_metrics.roc_auc.mean())
print("Gap:", fold_metrics.train_roc_auc.mean() - fold_metrics.roc_auc.mean())
print(f"Std AUC: {fold_metrics.roc_auc.std():.6f}")

Mean Train AUC: 0.8245869891682421
Mean Val AUC: 0.7789686872814786
Gap: 0.045618301886763524
Std AUC: 0.006055


After making a few tweaks to the complexity of the model, we have closed the gap between training and validation performance by 26% (0.062 to 0.046) without affecting validation performance in a significant way (0.7797 to 0.7790). I believe the gap is now within a comfortable range to assume we are *not* overfitting. Variation between folds rose only marginally (std 0.0058 to 0.0061).
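The size of the improvement can be checked directly from the printed means. This is a small arithmetic sketch using the numbers from the two `cv_xgb` runs above; the variable names are made up for illustration.

```python
# Mean train / validation ROC-AUC copied from the two cv_xgb runs above
gap_before = 0.8416130608394795 - 0.7796898608913626   # tuned_params
gap_after = 0.8245869891682421 - 0.7789686872814786    # final_params

relative_reduction = (gap_before - gap_after) / gap_before
val_auc_change = 0.7789686872814786 - 0.7796898608913626

print(f"gap: {gap_before:.4f} -> {gap_after:.4f} ({relative_reduction:.1%} smaller)")
print(f"validation AUC change: {val_auc_change:+.4f}")
```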

**Leakage Check:**

Lastly, I want to confirm that we have a *safe* model by confirming as best as possible that we evaluated our model properly, especially by not revealing information from our test set to our training set.

One way to assess this is by shuffling the target variable, removing any structural patterns from the data, which means we should expect ~0.5 AUC (randomly guessing). If we see significantly higher, it would indicate that the model is exploiting some sort of pattern that it could only pick up on through data leaking from testing to training.   

In [163]:
shuffled = apps_cv_strat.copy()
shuffled['TARGET'] = np.random.permutation(shuffled['TARGET'].values)
fold_metrics_shuffled, preds_shuffled = cv_xgb(shuffled, feature_cols, target_col, final_params)
print("Shuffled mean AUC:", fold_metrics_shuffled.roc_auc.mean())

Shuffled mean AUC: 0.49585671451650304


A shuffled-target AUC of 0.496 is indistinguishable from random guessing, so this check shows no sign of leakage.
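The logic of the shuffle test can be illustrated on synthetic data without XGBoost. The sketch below is only an illustration under stated assumptions: it swaps in a least-squares linear scorer for the real model, uses a single train/test split instead of the five folds, and the helpers `auc_from_scores` and `holdout_auc` are hypothetical names, not functions from this notebook.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity; assumes no tied scores."""
    y_true = np.asarray(y_true)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def holdout_auc(X, y, split):
    """Fit a least-squares linear scorer on the first `split` rows, score the rest."""
    Xb = np.column_stack([X, np.ones(len(X))])            # add an intercept column
    w, *_ = np.linalg.lstsq(Xb[:split], y[:split], rcond=None)
    return auc_from_scores(y[split:], Xb[split:] @ w)

rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # signal lives in column 0

auc_real = holdout_auc(X, y, split=n // 2)
auc_shuffled = holdout_auc(X, rng.permutation(y), split=n // 2)
print(f"real labels: {auc_real:.3f}, shuffled labels: {auc_shuffled:.3f}")
```

With real labels the scorer recovers the signal; with permuted labels the feature-target link is destroyed, so any AUC well above 0.5 would have to come from a flaw in the evaluation itself.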

### Final Hyperparameters

Now that we have confirmed a safe and reliable model, with reduced risk of overfitting and leakage, I think we are comfortable with these hyperparameters (including the pre-set params):

- `max_depth` = 4
- `min_child_weight` = 8
- `subsample` = 0.5
- `colsample_bytree` = 0.7
- `learning_rate` = 0.03
- `n_estimators` = 1000
- `gamma` = 2
- `reg_alpha` = 0.5
- `reg_lambda` = 20
- `scale_pos_weight` = 0.5 * #neg/#pos (found the 0.5 scaler helped a little)
- `eval_metric` = 'auc'
- `tree_method` = 'hist'
- `random_state` = 42
- `n_jobs` = -1

# 3. Holdout Evaluation

We are going to compare this model to other models through performance on the holdout set. To do this, we will train the model specified above on the entire dataset used so far for model selection, then test it on the holdout set, which we have not touched yet, for an unbiased evaluation.

In [164]:
# separate data
X_train, y_train = apps_cv_strat[feature_cols], apps_cv_strat[target_col]
X_test, y_test = apps_holdout_strat[feature_cols], apps_holdout_strat[target_col]

# initialize our finalized parameters
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()
params = {
    "max_depth": 4,
    "min_child_weight": 8,
    "subsample": 0.5,
    "colsample_bytree": 0.7,
    "learning_rate": 0.03, 
    "n_estimators": 1000, 
    "gamma": 2,
    "reg_alpha": 0.5,
    "reg_lambda": 20,
    'eval_metric': 'auc', 
    'random_state': 42, 
    'n_jobs': -1, 
    'tree_method': 'hist',
    'scale_pos_weight': (neg / max(pos, 1)) * 0.5 
}

# fit model on training data
model = XGBClassifier(**params)
model.fit(X_train, y_train)

# predict on holdout data
y_prob = model.predict_proba(X_test)[:, 1]
y_pred  = (y_prob >= 0.5).astype(int) # just the default 0.5 threshold for now
y_train_prob = model.predict_proba(X_train)[:,1] # make training predictions as well to assess overfit

# calculate classification metrics from previously defined functions
metrics = classification_metrics(y_test, y_pred)
metrics['roc_auc'] = roc_auc_from_probs(y_test, y_prob)
metrics['train_roc_auc'] = roc_auc_from_probs(y_train, y_train_prob)
metrics

{'n': 61443,
 'tp': 2200,
 'tn': 50282,
 'fp': 6201,
 'fn': 2760,
 'acc': 0.8541575118402421,
 'bal_acc': 0.666881570989387,
 'prec': 0.26187358647780024,
 'rec': 0.4435483870967742,
 'spec': 0.8902147548819999,
 'f1': 0.32931666791407826,
 'roc_auc': 0.7859435296832104,
 'train_roc_auc': 0.8174978218587409}

Since we still haven't tuned threshold, we are really only looking at ROC-AUC right now. 

The performance from cross-validation carried over to the holdout set, with even a small improvement (0.786 vs. a CV mean of 0.779), which suggests the model generalizes well and shows no sign of leakage. The train/holdout gap of about 0.03 keeps overfitting risk low. Overall, a successful run on the holdout set, and I'm pretty confident we squeezed out as much performance as we can from this XGBoost model.

An `ROC-AUC` of **0.786** is what we will compare to our other models.

### Kaggle Predictions

The Kaggle competition has its own holdout set that we need to predict on for our submission. It doesn't include the true targets, so we can't evaluate it ourselves, but we can make the predictions and submit them for an overall score.

In [165]:
# read in the full training and test data
## AFTER running through our pipeline files
train = pd.read_csv('data/apps_all_background.csv')
test = pd.read_csv('data/apps_all_background_test.csv')

# separate data
X_train, y_train = train[feature_cols], train[target_col]
X_test = test[feature_cols]

# fit model on training data
model = XGBClassifier(**params)
model.fit(X_train, y_train)

# predict on the Kaggle test data
test["TARGET"] = model.predict_proba(X_test)[:, 1]

# create submission csv
submission = test[["SK_ID_CURR", "TARGET"]]
submission.to_csv("results/submission_xgb.csv", index=False)

**RESULTS:** 
- `Public`: 0.788
- `Private`: 0.781

Another submission with the parameters *before* adjusting for overfit just to try:

In [166]:
params = {
    "max_depth": 4,
    "min_child_weight": 8,
    "subsample": 0.7,
    "colsample_bytree": 1,
    "learning_rate": 0.03, 
    "n_estimators": 1350, 
    "gamma": 1,
    "reg_alpha": 0.5,
    "reg_lambda": 15,
    'eval_metric': 'auc', 
    'random_state': 42, 
    'n_jobs': -1, 
    'tree_method': 'hist',
    'scale_pos_weight': (neg / max(pos, 1)) * 0.5 
}

# fit model on training data
model = XGBClassifier(**params)
model.fit(X_train, y_train)

# predict on the Kaggle test data
test["TARGET"] = model.predict_proba(X_test)[:, 1]

# create submission csv
submission = test[["SK_ID_CURR", "TARGET"]]
submission.to_csv("results/submission2_xgb.csv", index=False)

**RESULTS:** 
- `Public`: 0.790
- `Private`: 0.783

As an experiment, I'm going to cross-validate the XGBoost model on just the original application data, using the same train/holdout split and folds, so we can measure the lift from feature engineering and the lift from tuning:

In [167]:
# read in original applications table
apps = pd.read_csv("data/application_train.csv")  

# get IDs from the tables we're currently using to match up with the original table
train_prev = pd.read_csv("data/apps_cv_strat.csv", usecols=["SK_ID_CURR"])
hold_prev  = pd.read_csv("data/apps_holdout_strat.csv",   usecols=["SK_ID_CURR"])
train_ids = set(train_prev["SK_ID_CURR"])
hold_ids  = set(hold_prev["SK_ID_CURR"])

# split original applications data based on how we already split current data
apps_train = apps[apps["SK_ID_CURR"].isin(train_ids)].copy()
apps_hold  = apps[apps["SK_ID_CURR"].isin(hold_ids)].copy()

# add in folds for cross validation
folds = pd.read_csv("data/apps_cv_strat.csv", usecols=["SK_ID_CURR", "fold"])
apps_train = apps_train.merge(folds, on="SK_ID_CURR", how="left")

# cast objects to category (xgboost can handle it)
for df in (apps_train, apps_hold):
    for c in df.select_dtypes(include="object").columns:
        df[c] = df[c].astype("category")

target_col = "TARGET"
feature_cols_apps = [c for c in apps.columns if c not in ["SK_ID_CURR", target_col, "fold", 'CODE_GENDER_M', 'CODE_GENDER', 'AGE_INT']]
X_apps_train, y_apps_train = apps_train[feature_cols_apps], apps_train[target_col]

print("X_apps_train shape:", X_apps_train.shape)

X_apps_train shape: (245777, 119)


In [168]:
og_data_metrics, og_data_preds = cv_xgb(apps_train, feature_cols_apps, target_col, params={'enable_categorical':True})
og_data_metrics

Unnamed: 0,n,tp,tn,fp,fn,acc,bal_acc,prec,rec,spec,f1,roc_auc,train_roc_auc,fold
0,49156,1421,40704,4483,2548,0.856966,0.629407,0.240684,0.358025,0.90079,0.287856,0.738588,0.897444,1
1,49156,1389,40722,4465,2580,0.856681,0.625575,0.237274,0.349962,0.901188,0.282806,0.736199,0.895527,2
2,49156,1359,40614,4573,2610,0.853873,0.620601,0.229096,0.342404,0.898798,0.274518,0.724889,0.894448,3
3,49155,1431,40571,4615,2538,0.854481,0.629205,0.236685,0.360544,0.897867,0.285771,0.736082,0.89655,4
4,49154,1433,40564,4622,2535,0.854396,0.629425,0.236664,0.361139,0.897712,0.285942,0.728191,0.894462,5


So overall for our 3 main stages of improvement:

| Stage | Description | Key Changes | Mean ROC-AUC | Approx. Lift from Base |
| :--- | :--- | :--- | :---: | :---: |
| **1. Base** | Model trained only on the raw `application_train.csv` features. | No joins or aggregations. Baseline feature set. | **~0.733** | |
| **2. Feature-Engineered** | Added engineered + aggregated variables from internal / external credit tables. | Joins, aggregations, missing-value flags, structural imputations. | **~0.76** | **+0.026** |
| **3. Tuned XGBoost** | Final optimized parameters (depth = 4, min_child_weight = 8, subsample = 0.5, etc.). | Full cross-validated tuning on tree structure + learning dynamics. | **~0.779** (CV); 0.786 on holdout | **+0.046** |
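The lift figures can be reproduced from outputs already shown in this notebook. The sketch below hard-codes the baseline fold AUCs from `og_data_metrics` and the final model's mean CV and holdout AUCs; the variable names are illustrative. Note that only the CV-to-CV difference is a like-for-like comparison, since the baseline was scored with cross-validation rather than on the holdout set.

```python
import numpy as np

# Validation ROC-AUC per fold for the raw application_train.csv baseline
base_fold_auc = np.array([0.738588, 0.736199, 0.724889, 0.736082, 0.728191])
base_mean = base_fold_auc.mean()

# Final tuned model: mean CV ROC-AUC and holdout ROC-AUC, copied from earlier outputs
tuned_cv_mean = 0.7789686872814786
tuned_holdout = 0.7859435296832104

lift_cv = tuned_cv_mean - base_mean
lift_holdout = tuned_holdout - base_mean

print(f"base CV mean:         {base_mean:.4f}")
print(f"lift (CV vs CV):      {lift_cv:+.4f}")
print(f"lift (holdout vs CV): {lift_holdout:+.4f}")
```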


# 4. Threshold Tuning

With a solid ROC-AUC, we know that our model can rank applicants based on risk well, but the cutoff for decision-making is highly business-dependent. The main question that needs to be considered is the cost of mislabeling in both directions. We can attempt to answer this on our own and come up with a proposed solution, but it can also be something we leave a bit open-ended in our report asking for feedback on what the client would prefer. 

In [169]:
# try 200 thresholds evenly spaced from 0 to 1
## should be fast. no model required with the predictions we already have

thresholds = np.linspace(0, 1, 200)
scores = []
for t in thresholds:
    y_pred = (preds["y_prob"] >= t).astype(int)
    tp = ((y_pred == 1) & (preds["y_true"] == 1)).sum()
    fp = ((y_pred == 1) & (preds["y_true"] == 0)).sum()
    tn = ((y_pred == 0) & (preds["y_true"] == 0)).sum()
    fn = ((y_pred == 0) & (preds["y_true"] == 1)).sum()

    prec = tp / max(tp + fp, 1)
    rec  = tp / max(tp + fn, 1)
    f1   = 2 * prec * rec / max(prec + rec, 1e-12)
    spec = tn / max(tn + fp, 1)
    bal_acc = (rec + spec) / 2

    scores.append((t, prec, rec, f1, bal_acc))

scores_df = pd.DataFrame(scores, columns=["threshold", "precision", "recall", "f1", "bal_acc"])

Sorted by precision:

In [170]:
scores_df[scores_df.precision != 1].sort_values("precision", ascending=False).head(5)

Unnamed: 0,threshold,precision,recall,f1,bal_acc
186,0.934673,0.916667,0.000554,0.001108,0.500275
187,0.939698,0.857143,0.000302,0.000605,0.500149
184,0.924623,0.857143,0.001209,0.002415,0.500596
185,0.929648,0.833333,0.000756,0.00151,0.500371
183,0.919598,0.76087,0.001764,0.003519,0.500858


Sorted by recall:

In [171]:
scores_df[scores_df.recall != 1].sort_values("recall", ascending=False).head(5)

Unnamed: 0,threshold,precision,recall,f1,bal_acc
4,0.020101,0.080914,0.999899,0.149713,0.501171
5,0.025126,0.081159,0.999698,0.150129,0.502805
6,0.030151,0.081536,0.999395,0.150771,0.505308
7,0.035176,0.082028,0.998841,0.151606,0.508534
8,0.040201,0.082666,0.998085,0.152686,0.512651


Sorted by F1:

In [172]:
scores_df.sort_values("f1", ascending=False).head(5)

Unnamed: 0,threshold,precision,recall,f1,bal_acc
102,0.512563,0.268768,0.428492,0.330335,0.66305
101,0.507538,0.265817,0.435295,0.330073,0.664849
105,0.527638,0.277585,0.406823,0.330002,0.656916
100,0.502513,0.263148,0.44225,0.329962,0.666741
103,0.517588,0.271264,0.420429,0.329763,0.660614


Sorted by balanced accuracy:

In [173]:
scores_df.sort_values("bal_acc", ascending=False).head(5)

Unnamed: 0,threshold,precision,recall,f1,bal_acc
64,0.321608,0.177605,0.705201,0.283747,0.709197
63,0.316583,0.175505,0.712004,0.281597,0.709109
65,0.326633,0.179645,0.698246,0.285767,0.709095
62,0.311558,0.173513,0.718656,0.279534,0.708998
61,0.306533,0.171554,0.725509,0.277492,0.708895


Obviously there is a big tradeoff between precision and recall. For the rest of our evaluation, we will continue with balanced accuracy as our optimizing metric, but will offer other options:

| Goal                                  | Choose threshold | Reason                                                                      |
| ------------------------------------- | ---------------- | ------------------------------------------------------------------------------ |
| **Balanced performance**              | ~ 0.30      | Maximizes Balanced Accuracy (~0.71) and keeps F1 moderate.                       |
| **Cautious - low false alarms**       | ≥ 0.45           | Higher precision, lower recall - good if rejecting loans is costly.            |
| **Aggressive - flag all defaulters** | ≤ 0.20           | Boosts recall but also false positives - good for an initial screen followed by human review. |


Any threshold in between these ranges can be used to balance the different objectives.
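If the client can put even rough costs on the two kinds of mistakes, the threshold can be picked by minimizing total cost rather than by a statistical metric. The sketch below is a minimal illustration on synthetic scores: the 10:1 cost ratios are hypothetical placeholders, the data only stands in for `preds["y_true"]` and `preds["y_prob"]`, and `min_cost_threshold` is a made-up helper name.

```python
import numpy as np

def min_cost_threshold(y_true, y_prob, cost_fn, cost_fp, thresholds):
    """Return the threshold with the lowest total misclassification cost."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    costs = []
    for t in thresholds:
        y_pred = y_prob >= t
        fn = np.sum(~y_pred & (y_true == 1))   # missed defaulters
        fp = np.sum(y_pred & (y_true == 0))    # good applicants flagged
        costs.append(cost_fn * fn + cost_fp * fp)
    return thresholds[int(np.argmin(costs))]

# Synthetic stand-in for the out-of-fold predictions (about 8% positives)
rng = np.random.default_rng(0)
y_true = (rng.random(20000) < 0.08).astype(int)
y_prob = np.clip(0.25 + 0.25 * y_true + rng.normal(0, 0.15, size=20000), 0, 1)

thresholds = np.linspace(0, 1, 200)
t_recall_heavy = min_cost_threshold(y_true, y_prob, cost_fn=10, cost_fp=1, thresholds=thresholds)
t_precision_heavy = min_cost_threshold(y_true, y_prob, cost_fn=1, cost_fp=10, thresholds=thresholds)
print(f"FN 10x as costly: t={t_recall_heavy:.3f}; FP 10x as costly: t={t_precision_heavy:.3f}")
```

Making missed defaulters more expensive pushes the chosen threshold down, and making false alarms more expensive pushes it up, which mirrors the "aggressive" and "cautious" rows of the table.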

# 5. Fairness Evaluation

The groups we are interested in are gender (male vs. female) and age (binned into decades); marital status is reported as well. Note that the gender and age features have been removed from our model for *Group Unawareness*. Since unknown proxy attributes may still exist, we will compute several fairness metrics as confirmation.

After fixing the decision threshold at 0.3 (chosen to maximize balanced accuracy), we will evaluate whether the model treats these groups equitably using several fairness metrics from class:

- Demographic Parity: compares overall positive prediction rates across groups
- Equal Opportunity: compares true positive rates (recall) across groups
- Equalized Odds: compares both true and false positive rates across groups
- Predictive Parity (PPVP): compares precision across groups

These metrics will be computed on the holdout predictions to assess if any group is disproportionately disadvantaged by the model.

First, fit holdout data again and get demographic groups:

In [174]:
# separate data
X_train, y_train = apps_cv_strat[feature_cols], apps_cv_strat[target_col]
X_test, y_test = apps_holdout_strat[feature_cols], apps_holdout_strat[target_col]

# initialize our finalized parameters
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()
params = {
    "max_depth": 4,
    "min_child_weight": 8,
    "subsample": 0.5,
    "colsample_bytree": 0.7,
    "learning_rate": 0.03, 
    "n_estimators": 1000, 
    "gamma": 2,
    "reg_alpha": 0.5,
    "reg_lambda": 20,
    'eval_metric': 'auc', 
    'random_state': 42, 
    'n_jobs': -1, 
    'tree_method': 'hist',
    'scale_pos_weight': (neg / max(pos, 1)) * 0.5 
}

# fit model on training data
model = XGBClassifier(**params)
model.fit(X_train, y_train)

# predict on holdout data
apps_holdout_strat['y_prob'] = model.predict_proba(X_test)[:, 1]
apps_holdout_strat['y_pred']  = (apps_holdout_strat['y_prob'] >= 0.3).astype(int) # optimal 0.3 threshold

In [184]:
# convert gender back to readable label
apps_holdout_strat['GENDER'] = np.where(apps_holdout_strat['CODE_GENDER_M'], 'Male', 'Female')

# convert marital status to readable label
status_cols = [c for c in apps_holdout_strat.columns if c.startswith("NAME_FAMILY_STATUS_")]
def get_status(row):
    for c in status_cols:
        if row[c] == 1:
            return c.replace("NAME_FAMILY_STATUS_", "")
    return "Married"  # no status dummy set; assumes "Married" is the dropped reference level
apps_holdout_strat["MARITAL_STATUS"] = apps_holdout_strat.apply(get_status, axis=1)


# bin ages by decades
bins   = [20,30,40,50,60,70]
labels = ["20–29","30–39","40–49","50–59","60–69"]
apps_holdout_strat['AGE_GROUP'] = pd.cut(apps_holdout_strat['AGE_INT'], bins=bins, labels=labels, right=False)
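With `right=False`, each bin is closed on the left and open on the right, so an age of exactly 30 lands in the 30–39 group, while ages below 20 or at 70 and above fall outside every bin and become `NaN`. The sketch below probes those edges with a few hypothetical ages.

```python
import pandas as pd

bins = [20, 30, 40, 50, 60, 70]
labels = ["20–29", "30–39", "40–49", "50–59", "60–69"]

# Hypothetical ages chosen to probe the bin edges
ages = pd.Series([20, 29, 30, 69, 70, 19])
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(groups.tolist())
print("unbinned:", int(groups.isna().sum()))
```

Rows with a `NaN` group are dropped by `groupby`, so it is worth checking that the per-group `n` values add up to the full holdout size. In the age output further down they sum to 61,443, which matches the holdout `n`, so no applicants were lost here.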

A function for fairness metrics:

In [185]:
# function to calculate fairness metrics based on a protected group
def fairness_metrics(df, group_col, y_true="TARGET", y_pred="y_pred"):
    results = []
    for g, sub in df.groupby(group_col):
        tp = ((sub[y_pred]==1) & (sub[y_true]==1)).sum()
        fp = ((sub[y_pred]==1) & (sub[y_true]==0)).sum()
        tn = ((sub[y_pred]==0) & (sub[y_true]==0)).sum()
        fn = ((sub[y_pred]==0) & (sub[y_true]==1)).sum()
        
        # core rates
        pos_rate = (tp + fp) / max(len(sub), 1)
        tpr = tp / max(tp + fn, 1)   # recall
        fpr = fp / max(fp + tn, 1)
        prec = tp / max(tp + fp, 1)
        
        results.append({
            group_col: g,
            "n": len(sub),
            "Pos Rate": pos_rate,
            "TPR (Equal Opportunity)": tpr,
            "FPR": fpr,
            "Precision (PPVP)": prec,
        })

    results_df = pd.DataFrame(results)
    results_df['TPR_Ratio'] = results_df['TPR (Equal Opportunity)'] / results_df['TPR (Equal Opportunity)'].max()
    results_df['FPR_Ratio'] = results_df['FPR'] / results_df['FPR'].max()

    return results_df

In [186]:
fairness_metrics(apps_holdout_strat, 'GENDER', 'TARGET', 'y_pred')

Unnamed: 0,GENDER,n,Pos Rate,TPR (Equal Opportunity),FPR,Precision (PPVP),TPR_Ratio,FPR_Ratio
0,Female,40608,0.327842,0.741143,0.296634,0.158717,0.962482,0.807144
1,Male,20835,0.408255,0.770033,0.36751,0.190924,1.0,1.0


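All metrics are higher for men, meaning men are flagged as likely defaulters more often (positive rate 0.41 vs. 0.33), but the gaps are moderate: the TPR ratio is 0.96 and the FPR ratio is 0.81.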
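Each of the four fairness criteria listed earlier can be condensed into a single gap between groups. The sketch below does this for gender using the rates from the output above. Treating the equalized-odds gap as the larger of the TPR and FPR gaps is one common convention, not the only one, and the `gap` helper is a hypothetical name.

```python
# Rates copied from the GENDER output above
rates = {
    "Female": {"pos_rate": 0.327842, "tpr": 0.741143, "fpr": 0.296634, "ppv": 0.158717},
    "Male":   {"pos_rate": 0.408255, "tpr": 0.770033, "fpr": 0.367510, "ppv": 0.190924},
}

def gap(metric):
    """Largest minus smallest value of `metric` across groups."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)

demographic_parity_gap = gap("pos_rate")
equal_opportunity_gap = gap("tpr")
equalized_odds_gap = max(gap("tpr"), gap("fpr"))   # worst of the two rate gaps
predictive_parity_gap = gap("ppv")

for name, value in [
    ("demographic parity gap", demographic_parity_gap),
    ("equal opportunity gap", equal_opportunity_gap),
    ("equalized odds gap", equalized_odds_gap),
    ("predictive parity gap", predictive_parity_gap),
]:
    print(f"{name}: {value:.4f}")
```

A gap of 0 would mean the groups are treated identically on that criterion; how large a gap is acceptable depends on the application and is not settled by the numbers alone.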

In [187]:
fairness_metrics(apps_holdout_strat, 'AGE_GROUP', 'TARGET', 'y_pred')

Unnamed: 0,AGE_GROUP,n,Pos Rate,TPR (Equal Opportunity),FPR,Precision (PPVP),TPR_Ratio,FPR_Ratio
0,20–29,9083,0.580975,0.877437,0.541094,0.179079,1.0,1.0
1,30–39,16277,0.403207,0.7885,0.363723,0.181777,0.898639,0.672198
2,40–49,15236,0.320688,0.73361,0.285226,0.180925,0.836082,0.527127
3,50–59,13624,0.271359,0.659314,0.246643,0.145523,0.751408,0.455822
4,60–69,7223,0.193271,0.507163,0.177335,0.126791,0.578005,0.327734


In [188]:
fairness_metrics(apps_holdout_strat, 'MARITAL_STATUS', 'TARGET', 'y_pred')

Unnamed: 0,MARITAL_STATUS,n,Pos Rate,TPR (Equal Opportunity),FPR,Precision (PPVP),TPR_Ratio,FPR_Ratio
0,Married,45224,0.349018,0.754053,0.31422,0.170933,0.955277,0.800637
1,Previously Married,7190,0.297775,0.685371,0.268869,0.159738,0.868267,0.685081
2,Single,9029,0.431277,0.789354,0.392463,0.178993,1.0,1.0
