# Group 16

Credit cards play an important role in modern life, offering convenience and flexibility for everyday purchases. Whether at a restaurant or shopping online, a physical or digital card can complete the payment. While cards let consumers spend without carrying cash and effectively “buy now, pay later,” the responsibility for managing the resulting risk falls on card issuers. One of the biggest questions for lenders, especially commercial banks, is whether a customer is likely to repay what they borrow.

Accurately predicting credit default is essential for financial institutions because it lets them make more informed lending decisions and manage risk more effectively. In this project, we apply machine learning techniques to develop and optimize models that predict credit default using the large and complex dataset provided by American Express. The dataset includes anonymized customer profiles along with behavioral data collected over time, which creates both an opportunity and a challenge for model development.

Although the goal of our project is simply to predict whether a customer will default in the future, it also presents an opportunity to build a model that improves on existing solutions, helping lenders make better decisions and offering a smoother experience for customers seeking credit.


In [1]:
print("\nChecking CUDA toolkit (for V100 GPU support):")
import subprocess
try:
    output = subprocess.check_output(["nvcc", "--version"]).decode()
    print("✅ CUDA toolkit is installed.")
    print(output.split('\n')[-2])  # Last non-empty line of the nvcc output: the build string
except Exception as e:
    print("❌ CUDA toolkit (nvcc) is NOT installed or not in PATH.")


Checking CUDA toolkit (for V100 GPU support):
✅ CUDA toolkit is installed.
Build cuda_12.6.r12.6/compiler.35059454_0


In [2]:
import importlib

required_libraries = [
    "gc", "warnings", "scipy", "numpy", "pandas", "tqdm", "itertools",
    "os", "random", "joblib", "lightgbm", "sklearn"
]

print("Checking Python libraries:")
for lib in required_libraries:
    try:
        importlib.import_module(lib)
        print(f"✅ {lib} is installed.")
    except ImportError:
        print(f"❌ {lib} is NOT installed.")

Checking Python libraries:
✅ gc is installed.
✅ scipy is installed.
✅ numpy is installed.
✅ pandas is installed.
✅ tqdm is installed.
✅ itertools is installed.
✅ os is installed.
✅ random is installed.
✅ joblib is installed.
✅ lightgbm is installed.
✅ sklearn is installed.
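
The check above only confirms that each module can be imported. As a minimal sketch, the installed versions could also be reported with the standard-library `importlib.metadata`. The helper name `installed_version` is hypothetical, and the lookup uses distribution names, which can differ from import names (for example, `sklearn` is distributed as `scikit-learn`).

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Distribution names (as used by pip), not import names.
for dist in ["numpy", "pandas", "lightgbm", "scikit-learn"]:
    version = installed_version(dist)
    print(f"{dist}: {version if version is not None else 'not installed'}")
```

Standard-library modules such as `gc` or `os` have no distribution metadata, so they are left out of this list.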


# Preprocessing

In [None]:
# ====================================================
# Library
# ====================================================
import gc
import warnings
warnings.filterwarnings('ignore')
import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
import itertools

# ====================================================
# Read & preprocess data and save it to disk
# ====================================================
def read_preprocess_data():
    train = pd.read_parquet('content/data/train.parquet')
    features = train.drop(['customer_ID', 'S_2'], axis = 1).columns.to_list()
    cat_features = [
        "B_30",
        "B_38",
        "D_114",
        "D_116",
        "D_117",
        "D_120",
        "D_126",
        "D_63",
        "D_64",
        "D_66",
        "D_68",
    ]
    num_features = [col for col in features if col not in cat_features]
    print('Starting training feature engineer...')
    train_num_agg = train.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last'])
    train_num_agg.columns = ['_'.join(x) for x in train_num_agg.columns]
    train_num_agg.reset_index(inplace = True)
    train_cat_agg = train.groupby("customer_ID")[cat_features].agg(['count', 'last', 'nunique'])
    train_cat_agg.columns = ['_'.join(x) for x in train_cat_agg.columns]
    print("1")
    train_cat_agg.reset_index(inplace = True)
    train_labels = pd.read_csv('content/data/train_labels.csv')
    train = train_num_agg.merge(train_cat_agg, how = 'inner', on = 'customer_ID').merge(train_labels, how = 'inner', on = 'customer_ID')
    del train_num_agg, train_cat_agg
    gc.collect()
    print("2")
    test = pd.read_parquet('content/data/test.parquet')
    print('Starting test feature engineer...')
    test_num_agg = test.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last'])
    test_num_agg.columns = ['_'.join(x) for x in test_num_agg.columns]
    test_num_agg.reset_index(inplace = True)
    print("3")
    test_cat_agg = test.groupby("customer_ID")[cat_features].agg(['count', 'last', 'nunique'])
    test_cat_agg.columns = ['_'.join(x) for x in test_cat_agg.columns]
    test_cat_agg.reset_index(inplace = True)
    print("4")
    test = test_num_agg.merge(test_cat_agg, how = 'inner', on = 'customer_ID')
    del test_num_agg, test_cat_agg
    print("5")
    gc.collect()
    # Save files to disk
    train.to_parquet('content/data/train_fe.parquet')
    print("6")
    test.to_parquet('content/data/test_fe.parquet')
    print("Data preprocessing complete!")

# Read & Preprocess Data
read_preprocess_data()

Starting training feature engineer...
1
2
Starting test feature engineer...
3
4
5
6
Data preprocessing complete!
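
To make the aggregation step concrete, here is a small sketch on a toy DataFrame (invented values, not the competition data). It applies the same `groupby(...).agg(...)` calls and the same column flattening as `read_preprocess_data`, so each customer ends up with exactly one row.

```python
import pandas as pd

# Toy stand-in for the statement-level data: two customers, a few statements each.
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "P_2": [0.5, 0.7, 0.9, 0.2, 0.4],         # numeric feature
    "D_63": ["CO", "CO", "CR", "CL", "CL"],   # categorical feature
})

# Numeric features: five summary statistics per customer.
num_agg = toy.groupby("customer_ID")[["P_2"]].agg(["mean", "std", "min", "max", "last"])
num_agg.columns = ["_".join(col) for col in num_agg.columns]  # ('P_2', 'mean') -> 'P_2_mean'

# Categorical features: three summary statistics per customer.
cat_agg = toy.groupby("customer_ID")[["D_63"]].agg(["count", "last", "nunique"])
cat_agg.columns = ["_".join(col) for col in cat_agg.columns]

toy_fe = num_agg.reset_index().merge(cat_agg.reset_index(), how="inner", on="customer_ID")
print(toy_fe.columns.tolist())
```

Note that `last` takes the final row of each group in its existing order, so it only means "most recent statement" if the rows are already sorted by date within each customer.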


In [5]:
import pandas as pd

# Load the engineered training features (one row per customer) as a sanity check
train_fe = pd.read_parquet('content/data/train_fe.parquet')
print(train_fe.shape)  # Shows the (rows, columns) of the DataFrame
print(train_fe.head()) # Displays the first few rows

(458913, 920)
                                         customer_ID  P_2_mean   P_2_std   P_2_min   P_2_max  P_2_last  D_39_mean  D_39_std  D_39_min  D_39_max  D_39_last  B_1_mean   B_1_std   B_1_min   B_1_max  B_1_last  B_2_mean   B_2_std   B_2_min   B_2_max  B_2_last  R_1_mean   R_1_std   R_1_min   R_1_max  R_1_last  S_3_mean   S_3_std   S_3_min   S_3_max  S_3_last  D_41_mean  D_41_std  D_41_min  D_41_max  D_41_last  B_3_mean   B_3_std   B_3_min   B_3_max  B_3_last  D_42_mean  D_42_std  D_42_min  D_42_max  D_42_last  D_43_mean  D_43_std  D_43_min  D_43_max  D_43_last  D_44_mean  D_44_std  D_44_min  D_44_max  D_44_last   B_4_mean   B_4_std  B_4_min  B_4_max  B_4_last  D_45_mean  D_45_std  D_45_min  D_45_max  D_45_last  B_5_mean   B_5_std   B_5_min   B_5_max  B_5_last  R_2_mean  R_2_std  R_2_min  R_2_max  R_2_last  D_46_mean  D_46_std  D_46_min  D_46_max  D_46_last  D_47_mean  D_47_std  D_47_min  D_47_max  D_47_last  D_48_mean  D_48_std  D_48_min  D_48_max  D_48_last  D_49_mean  D_49_st
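
The column count of 920 can be cross-checked against the aggregation logic. The sketch below assumes the raw training file has 188 feature columns besides `customer_ID` and `S_2`; that number is not stated in this notebook and is an assumption, while the other counts come from the preprocessing code above.

```python
n_raw_features = 188   # ASSUMPTION: raw feature columns, not stated in this notebook
n_cat = 11             # length of cat_features in read_preprocess_data
n_num = n_raw_features - n_cat

n_num_aggs = 5         # mean, std, min, max, last
n_cat_aggs = 3         # count, last, nunique

# Aggregated features plus the customer_ID and target columns.
n_columns = n_num * n_num_aggs + n_cat * n_cat_aggs + 2
print(n_columns)
```

Under that assumption the result is 177 × 5 + 11 × 3 + 2 = 920, which agrees with the shape printed above.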

In [6]:
import lightgbm as lgb
import numpy as np

def check_lightgbm_gpu():
    try:
        X = np.random.rand(50, 2)
        y = np.random.randint(0, 2, 50)
        dtrain = lgb.Dataset(X, label=y)
        params = {
            'objective': 'binary',
            'device': 'gpu',  # Try using GPU
            'verbose': -1,
            'num_iterations': 1
        }
        lgb.train(params, dtrain)
        print("✅ LightGBM GPU support is ENABLED.")
    except Exception as e:
        print("❌ LightGBM GPU support is NOT enabled.")
        print("Error message:", e)

check_lightgbm_gpu()


✅ LightGBM GPU support is ENABLED.


In [7]:
import lightgbm
print("LightGBM version:", lightgbm.__version__)


LightGBM version: 4.6.0


In [8]:
import subprocess

try:
    output = subprocess.check_output('nvidia-smi', encoding='utf-8')
    print(output)
except Exception as e:
    print("No Nvidia GPU detected or drivers not installed.")
    print("Error message:", e)


Wed Apr 23 11:18:20 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 572.83                 Driver Version: 572.83         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1650 ...  WDDM  |   00000000:02:00.0 Off |                  N/A |
| N/A   49C    P0             13W /   35W |       0MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Training & Inference

In [None]:
# ====================================================
# Library
# ====================================================
import os
import gc
import warnings
warnings.filterwarnings('ignore')
import random
import scipy as sp
import numpy as np
import pandas as pd
import joblib
import itertools
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from itertools import combinations

# ====================================================
# Configurations
# ====================================================
class CFG:
    input_dir = 'content/data/'
    seed = 42
    n_folds = 2
    target = 'target'

# ====================================================
# Seed everything
# ====================================================
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

# ====================================================
# Read data
# ====================================================
def read_data():
    train = pd.read_parquet(CFG.input_dir + 'train_fe.parquet')
    test = pd.read_parquet(CFG.input_dir + 'test_fe.parquet')
    return train, test

# ====================================================
# Amex metric
# ====================================================
def amex_metric(y_true, y_pred):
    labels = np.transpose(np.array([y_true, y_pred]))
    labels = labels[labels[:, 1].argsort()[::-1]]
    weights = np.where(labels[:,0]==0, 20, 1)
    cut_vals = labels[np.cumsum(weights) <= int(0.04 * np.sum(weights))]
    top_four = np.sum(cut_vals[:,0]) / np.sum(labels[:,0])
    gini = [0,0]
    for i in [1,0]:
        labels = np.transpose(np.array([y_true, y_pred]))
        labels = labels[labels[:, i].argsort()[::-1]]
        weight = np.where(labels[:,0]==0, 20, 1)
        weight_random = np.cumsum(weight / np.sum(weight))
        total_pos = np.sum(labels[:, 0] *  weight)
        cum_pos_found = np.cumsum(labels[:, 0] * weight)
        lorentz = cum_pos_found / total_pos
        gini[i] = np.sum((lorentz - weight_random) * weight)
    return 0.5 * (gini[1]/gini[0] + top_four)

# ====================================================
# LGBM amex metric
# ====================================================
def lgb_amex_metric(y_pred, y_true):
    y_true = y_true.get_label()
    return 'amex_metric', amex_metric(y_true, y_pred), True

# ====================================================
# Train & Evaluate
# ====================================================
def train_and_evaluate(train, test):
    # Label encode categorical features
    cat_features = [
        "B_30",
        "B_38",
        "D_114",
        "D_116",
        "D_117",
        "D_120",
        "D_126",
        "D_63",
        "D_64",
        "D_66",
        "D_68"
    ]
    cat_features = [f"{cf}_last" for cf in cat_features]
    for cat_col in cat_features:
        encoder = LabelEncoder()
        train[cat_col] = encoder.fit_transform(train[cat_col])
        test[cat_col] = encoder.transform(test[cat_col])
    # Round the 'last' float features to 2 decimal places and add them as extra columns
    num_cols = list(train.dtypes[(train.dtypes == 'float32') | (train.dtypes == 'float64')].index)
    num_cols = [col for col in num_cols if 'last' in col]
    for col in num_cols:
        train[col + '_round2'] = train[col].round(2)
        test[col + '_round2'] = test[col].round(2)
    # Get feature list
    features = [col for col in train.columns if col not in ['customer_ID', CFG.target]]
    params = {
        'objective': 'binary',
        'metric': "binary_logloss",
        'boosting': 'dart',
        'seed': CFG.seed,
        'num_leaves': 100,
        'learning_rate': 0.01,
        'feature_fraction': 0.20,
        'bagging_freq': 10,
        'bagging_fraction': 0.50,
        'n_jobs': -1,
        'lambda_l2': 2,
        'min_data_in_leaf': 40,
        'device': 'gpu',                # Enable GPU
        'gpu_platform_id': 0,           # Usually 0
        'gpu_device_id': 0,             # Usually 0 for single GPU
        'max_bin': 63,                  # Recommended for GPU speedup
        'gpu_use_dp': False             # Use single precision for consumer GPUs
        }

    # Create a numpy array to store test predictions
    test_predictions = np.zeros(len(test))
    # Create a numpy array to store out of folds predictions
    oof_predictions = np.zeros(len(train))
    kfold = StratifiedKFold(n_splits = CFG.n_folds, shuffle = True, random_state = CFG.seed)
    for fold, (trn_ind, val_ind) in enumerate(kfold.split(train, train[CFG.target])):
        print(' ')
        print('-'*50)
        print(f'Training fold {fold} with {len(features)} features...')
        x_train, x_val = train[features].iloc[trn_ind], train[features].iloc[val_ind]
        y_train, y_val = train[CFG.target].iloc[trn_ind], train[CFG.target].iloc[val_ind]
        lgb_train = lgb.Dataset(x_train, y_train, categorical_feature = cat_features)
        lgb_valid = lgb.Dataset(x_val, y_val, categorical_feature = cat_features)
        model = lgb.train(
            params = params,
            train_set = lgb_train,
            num_boost_round = 10500,
            valid_sets = [lgb_train, lgb_valid],
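            feval = lgb_amex_metric,
            # LightGBM 4.x configures early stopping and logging through callbacks;
            # the older early_stopping_rounds / verbose_eval arguments are no longer accepted.
            # Note: early stopping is not available in dart mode, so LightGBM
            # disables this callback with a warning and all rounds are trained.
            callbacks=[lgb.early_stopping(stopping_rounds=100),
                       lgb.log_evaluation(period=500)]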
            )
        # Save the model trained on this fold
        joblib.dump(model, f'content/Models/lgbm_fold{fold}_seed{CFG.seed}.pkl')
        # Predict validation
        val_pred = model.predict(x_val)
        # Add to out of folds array
        oof_predictions[val_ind] = val_pred
        # Predict the test set
        test_pred = model.predict(test[features])
        test_predictions += test_pred / CFG.n_folds
        # Compute fold metric
        score = amex_metric(y_val, val_pred)
        print(f'Our fold {fold} CV score is {score}')
        del x_train, x_val, y_train, y_val, lgb_train, lgb_valid
        gc.collect()
    # Compute out of folds metric
    score = amex_metric(train[CFG.target], oof_predictions)
    print(f'Our out of folds CV score is {score}')
    # Create a dataframe to store out of folds predictions
    oof_df = pd.DataFrame({'customer_ID': train['customer_ID'], 'target': train[CFG.target], 'prediction': oof_predictions})
    oof_df.to_csv(f'content/OOF/oof_lgbm_baseline_{CFG.n_folds}fold_seed{CFG.seed}.csv', index = False)
    # Create a dataframe to store test prediction
    test_df = pd.DataFrame({'customer_ID': test['customer_ID'], 'prediction': test_predictions})
    test_df.to_csv(f'content/Predictions/test_lgbm_baseline_{CFG.n_folds}fold_seed{CFG.seed}.csv', index = False)
    
seed_everything(CFG.seed)
train, test = read_data()
train_and_evaluate(train, test)

Using GPU device: NVIDIA GeForce GTX 1650 with Max-Q Design on platform: NVIDIA CUDA
 
--------------------------------------------------
Training fold 0 with 1011 features...
[LightGBM] [Info] Number of positive: 59414, number of negative: 170042
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 47730
[LightGBM] [Info] Number of data points in the train set: 229456, number of used features: 1002
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce GTX 1650 with Max-Q Design, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 555 dense feature groups (121.67 MB) transferred to GPU in 0.093427 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.258934 -> initscore=-1.051516
[LightGBM] [Info] Start training from score -1.051516
[500]

OSError: [Errno 28] No space left on device
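
The run above stopped with `OSError: [Errno 28] No space left on device`, and the output does not show which write failed. As a minimal sketch, free disk space could be checked before training with the standard-library `shutil.disk_usage`. The helper name `has_free_space` and the 2 GiB threshold are hypothetical choices for illustration, not values taken from this project.

```python
import shutil

def has_free_space(path, required_bytes):
    """Return True if the filesystem containing `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes

# Hypothetical threshold: require 2 GiB free before writing model files.
REQUIRED_BYTES = 2 * 1024**3
if not has_free_space(".", REQUIRED_BYTES):
    print("Warning: less than 2 GiB free; saving models or predictions may fail with Errno 28.")
```

Running such a check before the fold loop would surface the problem before hours of training are spent.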

# Seed Blending
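
The notebook does not show how predictions from different seeds were combined. One common approach, assumed here purely for illustration, is to average the `prediction` column across runs after aligning on `customer_ID`. The function name `blend_predictions` and the toy frames are hypothetical.

```python
import pandas as pd

def blend_predictions(frames):
    """Average the 'prediction' column of several frames, aligned on customer_ID."""
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.groupby("customer_ID", as_index=False, sort=False)["prediction"].mean()

# Toy predictions from two hypothetical seeds; row order differs on purpose.
seed_a = pd.DataFrame({"customer_ID": ["a", "b"], "prediction": [0.10, 0.80]})
seed_b = pd.DataFrame({"customer_ID": ["b", "a"], "prediction": [0.60, 0.30]})

blended = blend_predictions([seed_a, seed_b])
print(blended)
```

Grouping by `customer_ID` instead of averaging arrays position by position keeps the blend correct even when the files list customers in a different order.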

# Read Submission File

This is the submission file corresponding to the output of the previous pipeline.

In [3]:
sub = pd.read_csv('../input/new-data/submission(9).csv')
#sub.to_csv('test_lgbm_baseline_5fold_seed42.csv', index = False)

In [4]:

sub.describe()

         prediction
count      924621.0
mean       0.225231
std        0.362114
min       -0.065756
25%       -0.036171
50%        0.001452
75%        0.471146
max        1.037961


In [5]:
sub['prediction'] *= .99
sub.to_csv('submission.csv', index=False)
sub.describe()

         prediction
count      924621.0
mean       0.222979
std        0.358493
min       -0.065099
25%       -0.035809
50%        0.001438
75%        0.466434
max        1.027581
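
The notebook does not say why the predictions are multiplied by 0.99. What can be checked is that the metric defined in the training cell depends only on the ordering of the predictions, so a positive rescaling should leave it unchanged. The sketch below repeats that `amex_metric` function and compares the score before and after scaling on synthetic data (toy labels and scores, not the competition data).

```python
import numpy as np

def amex_metric(y_true, y_pred):
    # Same computation as the amex_metric function in the training cell.
    labels = np.transpose(np.array([y_true, y_pred]))
    labels = labels[labels[:, 1].argsort()[::-1]]
    weights = np.where(labels[:, 0] == 0, 20, 1)
    cut_vals = labels[np.cumsum(weights) <= int(0.04 * np.sum(weights))]
    top_four = np.sum(cut_vals[:, 0]) / np.sum(labels[:, 0])
    gini = [0, 0]
    for i in [1, 0]:
        labels = np.transpose(np.array([y_true, y_pred]))
        labels = labels[labels[:, i].argsort()[::-1]]
        weight = np.where(labels[:, 0] == 0, 20, 1)
        weight_random = np.cumsum(weight / np.sum(weight))
        total_pos = np.sum(labels[:, 0] * weight)
        cum_pos_found = np.cumsum(labels[:, 0] * weight)
        lorentz = cum_pos_found / total_pos
        gini[i] = np.sum((lorentz - weight_random) * weight)
    return 0.5 * (gini[1] / gini[0] + top_four)

# Synthetic labels and scores for illustration only.
rng = np.random.default_rng(0)
y_pred = rng.random(200)
y_true = (rng.random(200) < y_pred).astype(int)

original = amex_metric(y_true, y_pred)
scaled = amex_metric(y_true, y_pred * 0.99)
print(np.isclose(original, scaled))
```

Because only the ranking matters, values below 0 or above 1 in the submission do not by themselves affect this metric.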
