# Tuning an XGBoost Model

The goal of this notebook is to train and evaluate an XGBoost model, comparing its performance on a holdout set against other types of models (LR, SVC, LDA).

To ensure reproducibility and consistent evaluation across models, all datasets were **pre-split into cross-val data and holdout data** as below:

| Split type           | CV training file     | Holdout file              | Description                              |
| -------------------- | -------------------- | ------------------------- | ---------------------------------------- |
| **Random**           | `apps_cv_random.csv` | `apps_holdout_random.csv` | Simple random sampling                   |
| **Stratified**       | `apps_cv_strat.csv`  | `apps_holdout_strat.csv`  | Stratified by `TARGET`                   |
| **Multi-Stratified** | `apps_cv_multi.csv`  | `apps_holdout_multi.csv`  | Stratified by `TARGET` + `CODE_GENDER_M` |

Each cross-validation dataset (`apps_cv_*.csv`) also contains a `fold` column with pre-assigned folds from 1 to 5, created with the corresponding splitting method to ensure consistent evaluation. No additional splitting is therefore needed inside this notebook; we can simply loop through the assigned folds for cross-validation.
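Looping over pre-assigned folds can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper name `iter_folds` and the toy DataFrame are assumptions, and only the `fold` and `TARGET` column names come from the description above.

```python
import pandas as pd

def iter_folds(df, fold_col="fold"):
    """Yield (fold, train_df, valid_df) for each pre-assigned fold."""
    for f in sorted(df[fold_col].unique()):
        valid_mask = df[fold_col] == f
        # rows in fold f are held out; all other rows are used for training
        yield int(f), df[~valid_mask], df[valid_mask]

# Toy stand-in for an apps_cv_*.csv file (values are made up for illustration)
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "x": list(range(10)),
    "fold": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
})

for f, train, valid in iter_folds(toy):
    print(f, len(train), len(valid))
```

Because the folds are stored in the file, every model notebook that uses the same loop sees exactly the same train/validation rows.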


In [34]:
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

## Evaluation Functions

These metric functions are copied from `cross_val.ipynb`. That notebook already assigned folds for each splitting method so that comparisons across models stay consistent. This file therefore does no explicit splitting; it uses the pre-assigned folds while still making clear which splitting method was used.

In [35]:
# METRICS 

def classification_metrics(y_true, y_pred):
    """
    computes conf matrix + acc, prec, rec, and f1
    
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # conf matrix
    tp = np.sum((y_true==1) & (y_pred==1))
    tn = np.sum((y_true==0) & (y_pred==0))
    fp = np.sum((y_true==0) & (y_pred==1))
    fn = np.sum((y_true==1) & (y_pred==0))

    acc  = (tp + tn) / max((tp + tn + fp + fn), 1)
    prec = tp / max((tp + fp), 1)
    rec  = tp / max((tp + fn), 1)
    f1   = (2*prec*rec / max((prec+rec), 1e-12)) if (prec+rec)>0 else 0.0

    metrics = {"n":len(y_true),
        "tp":tp, "tn": tn, "fp":fp, "fn":fn, "acc":acc, "prec":prec, "rec": rec, "f1":f1
    }
    return metrics

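def roc_auc_from_probs(y_true, y_prob):
    """
    computes roc-auc by sorting on predicted probability (descending)
    and integrating tpr over fpr with the trapezoid rule
    """
    # convert first so lists and pandas Series are handled the same way
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    desc_sort_indices = np.argsort(-y_prob)
    y_true = y_true[desc_sort_indices]
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    if pos == 0 or neg == 0:
        return np.nan  # auc is undefined when only one class is present

    # running totals for TPR/FPR
    tpr = [0.0]
    fpr = [0.0]
    tp = fp = 0
    for i in range(len(y_true)):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / pos)
        fpr.append(fp / neg)

    # get auc (np.trapz was renamed to np.trapezoid in NumPy 2.0)
    trapezoid = getattr(np, "trapezoid", None) or np.trapz
    auc = trapezoid(tpr, fpr)
    return auc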
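The loop above adds one ROC point per row, so when several rows share the same predicted probability the result can depend on how those ties happen to be ordered. As a cross-check, ROC-AUC can also be computed as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as half. The sketch below is not part of the original notebook. The name `roc_auc_pairwise` is an assumption, and the pairwise comparison uses memory proportional to positives × negatives, so it is meant for small sanity checks only.

```python
import numpy as np

def roc_auc_pairwise(y_true, y_prob):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined when only one class is present
    # compare every positive score with every negative score
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Small hand-checkable example: 3 of the 4 positive/negative pairs are ordered correctly
print(roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

On data without tied probabilities this should agree with the trapezoid-based value; a noticeable gap would suggest that ties are affecting the loop-based calculation.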

# Model Development

**Notes:** 
- All evaluation will focus on stratified cross-validation, but we will test the other methods as well. 
- Recall that folds have been pre-assigned to ensure consistency across different model development processes
- For our other models, we decided to scale the features and apply PCA, but neither step is necessary for nonlinear tree-based algorithms like XGBoost
    - at best these steps leave XGBoost unchanged and they can hurt it, so we will not use them here

**Process:**
1. Setting a baseline
    - evaluating an xgb model with all default parameters to build off of
2. Hyperparameter tuning
    - evaluate many different combinations of parameters
    - choose the best set based on average ROC-AUC across all folds
3. Threshold tuning
    - tweak the threshold on the best model to maximize F1 
        - note that roc-auc is not affected by threshold, so we need a different optimizing metric
4. Holdout evaluation
    - evaluate on the corresponding holdout table. the performance here is what we will compare with other models (LR, SVC, LDA)
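Step 3 of the process above can be sketched as a simple sweep over candidate thresholds. This is an illustrative sketch, not the notebook's implementation: the function name `best_f1_threshold`, the default grid of thresholds from 0.01 to 0.99, and the toy data are all assumptions. In practice the threshold would presumably be chosen on cross-validation predictions so that the holdout set stays untouched until step 4.

```python
import numpy as np

def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)  # assumed grid: 0.01, 0.02, ..., 0.99
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0  # F1 = 2TP / (2TP + FP + FN)
        if f1 > best_f1:  # strict ">" keeps the lowest threshold among ties
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Made-up scores in which every positive scores below 0.5 but above every negative
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.05, 0.11, 0.12, 0.15, 0.21, 0.22, 0.245, 0.305, 0.41]
print(best_f1_threshold(y_true, y_prob))
```

In this toy case the default threshold of 0.5 predicts no positives at all, while a lower threshold separates the classes, which is the situation threshold tuning is meant to address.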

### 0. Reading in the data

Recall that we have a different dataset for each type of cross-validation. We are currently focusing on the stratified splitting method, but the random and multi-stratified methods are available for testing and comparison.

In [36]:
apps_cv_strat = pd.read_csv("data/apps_cv_strat.csv")
apps_holdout_strat = pd.read_csv("data/apps_holdout_strat.csv")
target_col = 'TARGET'
feature_cols = [col for col in apps_cv_strat.columns if col not in [target_col, 'SK_ID_CURR', 'fold']]

### 1. Setting a Baseline

Fitting an XGBoost model with default parameters to understand baseline predictive power and what we can build on.

In [39]:
fold_metrics = []
for f in apps_cv_strat.fold.unique():

    # split into train and test based on folds
    train = apps_cv_strat[apps_cv_strat.fold != f]
    test = apps_cv_strat[apps_cv_strat.fold == f]
    X_train, y_train = train[feature_cols], train[target_col]
    X_test, y_test = test[feature_cols], test[target_col]

    # fit baseline model: xgboost with default parameters (optimizing for roc-auc)
    xgb = XGBClassifier(eval_metric='auc', random_state=42, n_jobs=-1, tree_method='hist')
    xgb.fit(X_train, y_train)
    
    # get predictions (probabilities and decisions)
    y_prob = xgb.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= 0.5).astype(int)

    # calculate classification metrics from previously defined functions
    metrics = classification_metrics(y_test, y_pred)

    # add roc-auc and the fold number to metrics
    roc_auc = roc_auc_from_probs(y_test, y_prob)
    metrics.update({'fold': int(f), 'roc_auc': roc_auc})

    # add to list of all fold metrics
    fold_metrics.append(metrics)

# store results in dataframe
baseline_results = pd.DataFrame(fold_metrics).sort_values("fold").reset_index(drop=True)

In [40]:
baseline_results

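|     | n     | tp  | tn    | fp  | fn   | acc      | prec     | rec      | f1       | fold | roc_auc  |
| --- | ----- | --- | ----- | --- | ---- | -------- | -------- | -------- | -------- | ---- | -------- |
| 0   | 49156 | 253 | 44913 | 274 | 3716 | 0.91883  | 0.480076 | 0.063744 | 0.112544 | 1    | 0.7667   |
| 1   | 49156 | 224 | 44928 | 259 | 3745 | 0.918545 | 0.463768 | 0.056437 | 0.100629 | 2    | 0.766153 |
| 2   | 49156 | 226 | 44935 | 252 | 3743 | 0.918728 | 0.472803 | 0.056941 | 0.101642 | 3    | 0.768592 |
| 3   | 49155 | 232 | 44903 | 283 | 3737 | 0.918218 | 0.450485 | 0.058453 | 0.103479 | 4    | 0.772013 |
| 4   | 49154 | 223 | 44899 | 287 | 3745 | 0.917972 | 0.437255 | 0.0562   | 0.099598 | 5    | 0.765584 |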


Recall (about 0.06) and F1 (about 0.10) are very low and precision stays below 0.5, while ROC-AUC is relatively strong (about 0.77). This indicates that the model does a good job of ranking applicants from low-risk to high-risk, that is, it tends to assign higher probabilities to actual positives, but the 0/1 decisions at the default threshold of 0.5 are poor. Class imbalance is the likely cause: only about 8% of the rows in each fold are positive (`tp + fn` out of `n`).

Therefore, we expect significant improvement after threshold tuning at the end. This is also why we focus on ROC-AUC during hyperparameter tuning: as long as the model's ability to rank cases is good (high ROC-AUC), we can later adjust the threshold to trade off precision, recall, and F1.
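The point that ROC-AUC depends only on ranking, while thresholded metrics depend on where the scores sit relative to the cutoff, can be illustrated on synthetic data. Everything in this sketch is an assumption made for illustration (the helper names `pairwise_auc` and `recall_at`, the simulated scores, the roughly 8% positive rate); it does not use the notebook's data or model.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def recall_at(y_true, scores, threshold):
    """Recall when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(scores) >= threshold
    return float(np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1))

# Synthetic data, NOT the notebook's: ~8% positives whose scores overlap the negatives
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.08).astype(int)
scores = np.clip(0.08 + 0.10 * y + rng.normal(0.0, 0.08, size=2000), 0.0, 1.0)
squashed = scores ** 2  # strictly increasing on [0, 1], so the ranking is unchanged

print("AUC raw / squashed:", pairwise_auc(y, scores), pairwise_auc(y, squashed))
print("recall at 0.50:", recall_at(y, scores, 0.50))
print("recall at 0.15:", recall_at(y, scores, 0.15))
```

The two AUC values match because squaring preserves the order of the scores, whereas recall changes substantially with the cutoff. That is why ranking quality and the decision threshold can be tuned in separate steps.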

### 2. Hyperparameter Tuning

Next, we look for hyperparameters that maximize ROC-AUC (while keeping an eye on the other metrics).

XGBoost has many hyperparameters, several of which strongly affect predictions, so selecting them is more involved than for the other models. An exhaustive search over a large parameter grid would be very slow, considering that a single cross-validated run of the baseline model already took 15+ seconds. A smarter approach is to first identify good ranges for a few parameters and then search over a narrower grid, in other words an iterative grid search rather than one large search. Our approach will likely adapt as we try things out.
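One way to organize an iterative search is to tune a small group of parameters at a time and carry the best values forward. The sketch below is a hypothetical outline, not the procedure this notebook ends up using. `staged_grid_search` is an assumed helper name, and `toy_cv_auc` is a made-up scoring function standing in for "mean ROC-AUC across the five folds" so that the example runs quickly without training any model. The parameter names (`max_depth`, `learning_rate`, `subsample`) are real XGBoost parameters, but the candidate values are arbitrary.

```python
from itertools import product

def staged_grid_search(score_fn, base_params, stages):
    """Tune one group of parameters per stage, keeping earlier winners fixed."""
    best_params = dict(base_params)
    best_score = score_fn(best_params)
    history = []
    for stage in stages:
        names = list(stage)
        for values in product(*(stage[n] for n in names)):
            # override only this stage's parameters; everything else stays fixed
            candidate = {**best_params, **dict(zip(names, values))}
            score = score_fn(candidate)
            history.append((candidate, score))
            if score > best_score:
                best_params, best_score = candidate, score
    return best_params, best_score, history

def toy_cv_auc(params):
    """Made-up stand-in for cross-validated ROC-AUC (peaks at 4 / 0.05 / 0.8)."""
    return (0.77
            - 0.002 * abs(params["max_depth"] - 4)
            - 0.1 * abs(params["learning_rate"] - 0.05)
            - 0.01 * abs(params["subsample"] - 0.8))

base = {"max_depth": 6, "learning_rate": 0.3, "subsample": 1.0}
stages = [
    {"max_depth": [3, 4, 6, 8]},
    {"learning_rate": [0.03, 0.05, 0.1, 0.3]},
    {"subsample": [0.6, 0.8, 1.0]},
]

best_params, best_score, history = staged_grid_search(toy_cv_auc, base, stages)
print(best_params, len(history))
```

Here the staged search scores 11 candidates plus the starting point, whereas the full grid over the same values has 4 × 4 × 3 = 48 combinations. The trade-off is that a staged search can miss good combinations when parameters interact strongly, which is one reason to revisit earlier stages as results come in.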
