<a href="https://colab.research.google.com/github/B0BWAX/AMEX-DEFAULT-PREDICTION/blob/main/models/AMEX_XGB_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
# set up Kaggle API credentials (-p avoids an error if the directory already exists)
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/API-KEYS/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [12]:
# downloading modified dataset from last notebook
!kaggle datasets download -d bobwax/amex-agg-dataset
!unzip amex-agg-dataset.zip

Downloading amex-agg-dataset.zip to /content
 99% 900M/905M [00:11<00:00, 125MB/s]
100% 905M/905M [00:11<00:00, 82.9MB/s]
Archive:  amex-agg-dataset.zip
  inflating: amex_agg_data.csv       


In [13]:
# installing the RAPIDS GPU libraries (cuDF, cuML)
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
***********************************************************************
Woo! Your instance has a Tesla T4 GPU!
We will install the latest stable RAPIDS via pip 24.4.*!  Please stand by, should be quick...
***********************************************************************

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu12==24.4.*
  Using cached https://pypi.nvidia.com/cudf-cu12/cudf_cu12-24.4.0-cp310-cp310-manylinux_2_28_x86_64.whl (473.3 MB)
Collecting cuml-cu12==24.4.*
  Downloading https://pypi.nvidia.com/cuml-cu12/cuml_cu12-24.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1200.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 GB 1.2 MB/s eta 0:00:00
Collecting cugraph-cu12==24.4.*
  Downloading https://pypi.nvidia.com/cugraph-cu12/cugraph_cu12-24.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1429.1 MB)
 

In [1]:
import cuml, cudf # GPU libraries
import pandas as pd, numpy as np # CPU libraries
import pickle
import gc

In [2]:
from cuml.model_selection import train_test_split

In [3]:
# importing data
data = cudf.read_csv('/content/amex_agg_data.csv')

In [4]:
data.head()

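| | P_2_mean | P_2_std | P_2_min | P_2_max | P_2_last | D_39_mean | D_39_std | D_39_min | D_39_max | D_39_last | ... | D_64_count | D_64_last | D_64_nunique | D_66_count | D_66_last | D_66_nunique | D_68_count | D_68_last | D_68_nunique | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.933824 | 0.024194 | 0.86858 | 0.960384 | 0.934745 | 0.230769 | 0.83205 | 0 | 3 | 0 | ... | 13 | 0 | 1 | 13 | -1 | 1 | 13 | 6 | 1 | 0 |
| 1 | 0.89982 | 0.022119 | 0.861109 | 0.929122 | 0.880519 | 7.153846 | 6.743468 | 0 | 19 | 6 | ... | 13 | 0 | 1 | 13 | -1 | 1 | 13 | 6 | 1 | 0 |
| 2 | 0.878454 | 0.028911 | 0.79767 | 0.904482 | 0.880875 | 0.0 | 0.0 | 0 | 0 | 0 | ... | 13 | 2 | 1 | 13 | -1 | 1 | 13 | 6 | 1 | 0 |
| 3 | 0.598969 | 0.020107 | 0.567442 | 0.623392 | 0.621776 | 1.538462 | 3.017046 | 0 | 9 | 0 | ... | 13 | 0 | 1 | 13 | -1 | 1 | 13 | 3 | 3 | 0 |
| 4 | 0.891679 | 0.042325 | 0.805045 | 0.940382 | 0.8719 | 0.0 | 0.0 | 0 | 0 | 0 | ... | 13 | 0 | 1 | 13 | 1 | 1 | 13 | 6 | 1 | 0 |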


In [5]:
# splitting data
X = data.drop(columns=['target'])
y = data['target']
del data # no longer needed

In [31]:
gc.collect() # free up GPU RAM

0

### Metric
Models are compared using the competition metric, which is defined as:

>The evaluation metric, M, for this competition is the mean of two measures of rank ordering: Normalized Gini Coefficient, $G$, and default rate captured at 4%, $D$.
>$$M = 0.5 \cdot (G + D)$$
>The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions, and represents a Sensitivity/Recall statistic.
>For both of the sub-metrics $G$ and $D$, the negative labels are given a weight of 20 to adjust for downsampling.
>This metric has a maximum value of 1.0.

In [11]:
def amex_metric_mod(y_true, y_pred):
    """Competition metric M = 0.5 * (G + D); negative labels are weighted 20x."""

    labels     = np.transpose(np.array([y_true, y_pred]))
    labels     = labels[labels[:, 1].argsort()[::-1]]
    weights    = np.where(labels[:,0]==0, 20, 1)
    cut_vals   = labels[np.cumsum(weights) <= int(0.04 * np.sum(weights))]
    top_four   = np.sum(cut_vals[:,0]) / np.sum(labels[:,0])

    gini = [0,0]
    for i in [1,0]:
        labels         = np.transpose(np.array([y_true, y_pred]))
        labels         = labels[labels[:, i].argsort()[::-1]]
        weight         = np.where(labels[:,0]==0, 20, 1)
        weight_random  = np.cumsum(weight / np.sum(weight))
        total_pos      = np.sum(labels[:, 0] *  weight)
        cum_pos_found  = np.cumsum(labels[:, 0] * weight)
        lorentz        = cum_pos_found / total_pos
        gini[i]        = np.sum((lorentz - weight_random) * weight)

    return 0.5 * (gini[1]/gini[0] + top_four)

*The normalized Gini component of this metric is closely related to ROC AUC: without the 20x weighting, $G = 2 \cdot \text{AUC} - 1$.

*An accuracy score is also computed, but accuracy alone is not a good evaluation metric in this case: it depends on a fixed decision threshold and ignores how well the predictions are ranked.
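
To make the link between the Gini component and ROC AUC concrete, here is a minimal sketch (not part of the original notebook) that computes an unweighted normalized Gini and a rank-based AUC on toy data. The helper names `auc_from_ranks` and `normalized_gini_unweighted` are hypothetical, all weights are assumed to be 1, and scores are assumed to have no ties. The competition's 20x weighting of negatives is deliberately left out, so this does not reproduce `amex_metric_mod`.

```python
import numpy as np

def auc_from_ranks(y_true, y_score):
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)
    ranks = np.empty(len(y_true), dtype=float)
    ranks[order] = np.arange(1, len(y_true) + 1)  # ascending ranks, 1..N
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def normalized_gini_unweighted(y_true, y_score):
    """Normalized Gini with every weight equal to 1 (assumption)."""
    y_true = np.asarray(y_true, dtype=float)

    def gini(order_by):
        order = np.argsort(order_by)[::-1]             # highest first
        lorenz = np.cumsum(y_true[order]) / y_true.sum()
        random = np.arange(1, len(y_true) + 1) / len(y_true)
        return np.sum(lorenz - random)

    # Model ordering divided by the perfect ordering.
    return gini(y_score) / gini(y_true)

# Toy data, not from the notebook.
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8])

auc = auc_from_ranks(y_toy, s_toy)               # 0.75
g = normalized_gini_unweighted(y_toy, s_toy)     # 0.5
print(auc, g, np.isclose(g, 2 * auc - 1))
```
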

## Model

In [25]:
import xgboost as xgb

*The XGBoost parameters are taken from [this notebook](https://www.kaggle.com/code/bobwax/xgboost-starter-0-793/edit)

In [12]:
xgb_parms = {
    'max_depth':4,
    'learning_rate':0.05,
    'subsample':0.8,
    'colsample_bytree':0.6,
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist',
    'predictor':'gpu_predictor',
    'random_state':42
}
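
The training log further down reports that `predictor` is not used and suggests `tree_method = "hist", device = "cuda"`. A sketch of the same settings written in that newer style, assuming XGBoost 2.0 or later and an available CUDA GPU. The name `xgb_parms_v2` is hypothetical, and these settings were not used to produce the results in this notebook.

```python
# Hypothetical rewrite of xgb_parms for XGBoost >= 2.0 (assumption: CUDA GPU available).
xgb_parms_v2 = {
    'max_depth': 4,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.6,
    'eval_metric': 'logloss',
    'objective': 'binary:logistic',
    'tree_method': 'hist',   # replaces the deprecated 'gpu_hist'
    'device': 'cuda',        # selects the GPU; 'predictor' is dropped
    'random_state': 42,
}
```
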

*For this model we will use a validation dataset. Because the validation set is taken as 20% of the remaining 80%, the splits are:
* train set = 64% of dataset
* validation = 16% of dataset
* test set = 20% of dataset
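
The split sizes follow from chaining two `train_test_split` calls: the second `test_size` applies only to what is left after the first. A small arithmetic sketch (the helper `two_stage_fractions` is hypothetical and not part of the notebook):

```python
def two_stage_fractions(test_size, valid_size):
    """Return (train, valid, test) fractions of the full dataset for two chained splits."""
    test = test_size
    valid = (1.0 - test_size) * valid_size  # valid_size is relative to the remainder
    train = 1.0 - test - valid
    return train, valid, test

# Both splits at 0.2, as in this notebook: roughly 64% / 16% / 20%.
print([f"{f:.2f}" for f in two_stage_fractions(0.2, 0.2)])

# A second split of 0.25 would give roughly 60% / 20% / 20% instead.
print([f"{f:.2f}" for f in two_stage_fractions(0.2, 0.25)])
```
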

In [14]:
X.memory_usage()

P_2_mean        3671304
P_2_std         3671304
P_2_min         3671304
P_2_max         3671304
P_2_last        3671304
                 ...   
D_66_nunique    3671304
D_68_count      3671304
D_68_last       3671304
D_68_nunique    3671304
Index                 0
Length: 919, dtype: int64

In [16]:
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
del X, y # free up GPU RAM
gc.collect()

4630

In [20]:
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.2, random_state=42)

In [21]:
del X_temp, y_temp # no longer needed
gc.collect() # free up GPU RAM

4630

In [26]:
# conversion to DMatrix for efficiency
dtrain_split = xgb.DMatrix(X_train, label=y_train)
dvalid_split = xgb.DMatrix(X_valid, label=y_valid)

In [29]:
del X_valid, y_valid
gc.collect()

57

In [32]:
xgb_model = xgb.train(xgb_parms,
                  dtrain=dtrain_split,
                  evals=[(dtrain_split, 'train'), (dvalid_split, 'valid')],
                  num_boost_round=9999,
                  early_stopping_rounds=100,
                  verbose_eval=100)

xgb_model.save_model('AMEX_XGB_model.xgb') # save model


    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.



[0]	train-logloss:0.54900	valid-logloss:0.54645
[100]	train-logloss:0.23470	valid-logloss:0.23688
[200]	train-logloss:0.22117	valid-logloss:0.22631
[300]	train-logloss:0.21485	valid-logloss:0.22268
[400]	train-logloss:0.21042	valid-logloss:0.22084
[500]	train-logloss:0.20681	valid-logloss:0.21975
[600]	train-logloss:0.20366	valid-logloss:0.21916
[700]	train-logloss:0.20081	valid-logloss:0.21874
[800]	train-logloss:0.19809	valid-logloss:0.21846
[900]	train-logloss:0.19553	valid-logloss:0.21819
[1000]	train-logloss:0.19303	valid-logloss:0.21801
[1100]	train-logloss:0.19060	valid-logloss:0.21792
[1200]	train-logloss:0.18834	valid-logloss:0.21783
[1300]	train-logloss:0.18610	valid-logloss:0.21769
[1400]	train-logloss:0.18392	valid-logloss:0.21763
[1500]	train-logloss:0.18179	valid-logloss:0.21754
[1600]	train-logloss:0.17976	valid-logloss:0.21747
[1700]	train-logloss:0.17772	valid-logloss:0.21742
[1800]	train-logloss:0.17571	valid-logloss:0.21743
[1857]	train-logloss:0.17456	valid-logloss:



In [33]:
dtest = xgb.DMatrix(X_test)
predictions = xgb_model.predict(dtest)

In [35]:
xgb_comp = amex_metric_mod(y_test.to_numpy(), predictions)
xgb_accuracy = cuml.metrics.accuracy_score(y_test, predictions)
print(f" XGBoost Scores \n ------------------------------------- \n Accuracy: {xgb_accuracy} \n Competition Metric: {xgb_comp}")

 XGBoost Scores 
 ------------------------------------- 
 Accuracy: 0.7399381399154663 
 Competition Metric: 0.7928970491095906
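
With `'objective':'binary:logistic'`, `xgb_model.predict` returns probabilities rather than 0/1 labels, so an accuracy score is only meaningful after choosing a cut-off. A minimal sketch on toy data, assuming a threshold of 0.5 (an assumption; the notebook does not specify one) and a hypothetical helper `accuracy_at_threshold`. It also shows what happens if probabilities are cast to integers instead: every value below 1 becomes 0, and the score reduces to the share of negative labels.

```python
import numpy as np

def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Accuracy after turning probabilities into 0/1 labels at `threshold`."""
    y_true = np.asarray(y_true)
    y_label = (np.asarray(y_prob) >= threshold).astype(y_true.dtype)
    return float(np.mean(y_label == y_true))

# Toy data, not from the notebook.
y_true_toy = np.array([0, 1, 1, 0])
y_prob_toy = np.array([0.1, 0.9, 0.4, 0.2])

print(accuracy_at_threshold(y_true_toy, y_prob_toy))  # 0.75

# Integer casting truncates the probabilities to 0, so this equals
# the share of negative labels in the toy data.
print(float(np.mean(y_prob_toy.astype(np.int64) == y_true_toy)))  # 0.5
```

Applied to this notebook, the same idea would be to threshold `predictions` before scoring. The 0.5 cut-off is only a starting point and could be tuned on the validation set.
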
