<a href="https://colab.research.google.com/github/AjjayK/Amex-Credit-Default-Prediction/blob/main/LGBM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#American Express &ndash; Default Prediction
<p style='margin-top:0cm;margin-right:0cm;margin-bottom:8.0pt;margin-left:-20.45pt;line-height:107%;font-size:15px;font-family:"Calibri",sans-serif;text-indent:41.75pt;'><span style='font-family:"Open Sans",sans-serif;'>Predict if a customer will default in the future</span></p>

---


<p style='margin-top:0cm;margin-right:0cm;margin-bottom:8.0pt;margin-left:21.3pt;line-height:107%;font-size:15px;font-family:"Calibri",sans-serif;'><strong><span style='font-size:19px;line-height:107%;font-family:"Open Sans",sans-serif;'>About Notebook</span></strong></p>
<p style='margin-top:0cm;margin-right:0cm;margin-bottom:8.0pt;margin-left:21.3pt;line-height:107%;font-size:15px;font-family:"Calibri",sans-serif;'><span style='font-size:16px;line-height:107%;font-family:"Open Sans",sans-serif;'>This notebook extends the previous notebook, which covered EDA and deep learning models for predicting credit default. It trains a LightGBM (LGBM) model and requires Python 3.9 with the latest LightGBM library.</span></p>

###**Initializing the notebook**
This notebook does not run on Google Colab: Colab uses Python 3.7, where the LightGBM log_evaluation callback imported below is not supported.

In [None]:
#Importing the libraries used in this notebook
import numpy as np 
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import roc_auc_score, roc_curve, auc
from lightgbm import LGBMClassifier, early_stopping, log_evaluation
import warnings, gc
warnings.filterwarnings("ignore")


###**Loading the Dataset**
The dataset was downloaded from Kaggle and saved on the local machine. Enter the path where the dataset is saved.

In [None]:
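#Loading the training data
#input_directory must be set beforehand to the path of the training Feather file
train_data = pd.read_feather(input_directory)
#Keep only the last row per customer (the most recent statement, assuming rows are in chronological order)
train_data = train_data.groupby('customer_ID').tail(1).set_index('customer_ID')

#Loading the test data and keeping only the last row per customer
test_data = pd.read_feather(r"C:\Users\ajjay\Amex Kaggle\test_data.ftr\test_data.ftr")
test_data = test_data.groupby('customer_ID').tail(1).set_index('customer_ID')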
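
The loading cell reduces each customer's history to a single row with `groupby('customer_ID').tail(1)`. As a minimal sketch of what that step does, the toy frame below uses made-up values (the `statements` frame and its contents are hypothetical, not taken from the Amex data) and assumes rows are already in chronological order within each customer:

```python
import pandas as pd

# Hypothetical toy data: several monthly statements per customer
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b", "b", "c"],
    "S_2": ["2018-01", "2018-02", "2018-01", "2018-02", "2018-03", "2018-03"],
    "P_2": [0.5, 0.6, 0.1, 0.2, 0.3, 0.9],
})

# tail(1) keeps the last row of each group, preserving the original row order
latest = statements.groupby("customer_ID").tail(1).set_index("customer_ID")
print(latest)
```

Each customer ends up with exactly one row, so the earlier statements are discarded rather than aggregated.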

In [None]:
#Column stratification
col = train_data.columns.to_list()
cat_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
non_cat_features = [x for x in col if x not in cat_features]

delinquency_variables = [i for i in col if i.startswith('D_')]
spend_variables = [i for i in col if (i.startswith('S_') and i != 'S_2')]
payment_variables = [i for i in col if i.startswith('P_')]
balance_variables = [i for i in col if i.startswith('B_')]
risk_variables = [i for i in col if i.startswith('R_')]

In [None]:

del train_data['S_2']
gc.collect()
del test_data['S_2']
gc.collect()



###Light Gradient Boosting Machine
LightGBM is a gradient boosting framework based on decision trees that is designed to increase training efficiency and reduce memory usage.
It uses two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to address the limitations of the histogram-based algorithm that is primarily used in GBDT (Gradient Boosting Decision Tree) frameworks. GOSS keeps the training instances with large gradients and randomly samples from those with small gradients, while EFB bundles mutually exclusive features (features that rarely take non-zero values at the same time) to reduce the number of features. Together, these two techniques make training efficient and give LightGBM an edge over other GBDT frameworks.
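
To make the GOSS idea more concrete, here is a minimal NumPy sketch of one sampling step. It is an illustration, not LightGBM's implementation: the function name `goss_sample` and the rates `top_rate` and `other_rate` are assumptions chosen for this example. Note also that the model trained in this notebook uses `boosting_type='gbdt'`, so it does not enable GOSS.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by a GOSS-style sampling step."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rank instances by absolute gradient, largest first
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]

    # Randomly sample from the remaining small-gradient instances
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Up-weight the sampled small-gradient instances by (1 - a) / b
    # so that their total contribution is not under-represented
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

# Hypothetical gradients for 1000 training instances
grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w[0], w[-1])
```

With these rates, only 30% of the instances are used to evaluate a split, while every large-gradient instance is retained.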

In [None]:
#LGBM
#Note: amex_metric (defined in the Model Evaluation section) must be run before this cell
enc = LabelEncoder()
for col in cat_features:
    train_data[col] = enc.fit_transform(train_data[col])
    test_data[col] = enc.transform(test_data[col])

X=train_data.drop(['target'],axis=1)
y=train_data['target']
y_valid, gbm_val_probs, gbm_test_preds, gini=[],[],[],[]
ft_importance=pd.DataFrame(index=X.columns)
sk_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
for fold, (train_idx, val_idx) in enumerate(sk_fold.split(X, y)):
    
    print("\nFold {}".format(fold+1))
    X_train, y_train = X.iloc[train_idx,:], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx,:], y.iloc[val_idx]
    print("Train shape: {}, {}, Valid shape: {}, {}\n".format(
        X_train.shape, y_train.shape, X_val.shape, y_val.shape))
    
    params = {'boosting_type': 'gbdt',
              'n_estimators': 1000,
              'num_leaves': 50,
              'learning_rate': 0.05,
              'colsample_bytree': 0.9,
              'min_child_samples': 2000,
              'max_bins': 500,
              'reg_alpha': 2,
              'objective': 'binary',
              'random_state': 21}
    
    gbm = LGBMClassifier(**params).fit(X_train, y_train, 
                                       eval_set=[(X_train, y_train), (X_val, y_val)],
                                       callbacks=[early_stopping(200), log_evaluation(500)],
                                       eval_metric=['auc','binary_logloss'])
    gbm_prob = gbm.predict_proba(X_val)[:,1]
    gbm_val_probs.append(gbm_prob)
    y_valid.append(y_val)
    
    y_pred=pd.DataFrame(data={'prediction':gbm_prob})
    y_true=pd.DataFrame(data={'target':y_val.reset_index(drop=True)})
    gini_score=amex_metric(y_true = y_true, y_pred = y_pred)
    gini.append(gini_score)
    
    auc_score=roc_auc_score(y_val, gbm_prob)
    gbm_test_preds.append(gbm.predict_proba(test_data)[:,1])    
    ft_importance["Importance_Fold"+str(fold)]=gbm.feature_importances_    
    print("Validation Gini: {:.5f}, AUC: {:.4f}".format(gini_score,auc_score))
    
    del X_train, y_train, X_val, y_val
    _ = gc.collect()
    
del X, y


Fold 1
Train shape: (413021, 188), (413021,), Valid shape: (45892, 188), (45892,)

Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.971327	training's binary_logloss: 0.191196	valid_1's auc: 0.959852	valid_1's binary_logloss: 0.221868
Early stopping, best iteration is:
[559]	training's auc: 0.97245	training's binary_logloss: 0.188027	valid_1's auc: 0.959899	valid_1's binary_logloss: 0.221776
Validation Gini: 0.78717, AUC: 0.9599

Fold 2
Train shape: (413021, 188), (413021,), Valid shape: (45892, 188), (45892,)

Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.971298	training's binary_logloss: 0.191223	valid_1's auc: 0.959829	valid_1's binary_logloss: 0.222303
Early stopping, best iteration is:
[530]	training's auc: 0.971893	training's binary_logloss: 0.189559	valid_1's auc: 0.959853	valid_1's binary_logloss: 0.222244
Validation Gini: 0.78309, AUC: 0.9599

Fold 3
Train shape: (413021, 188), (413021,), Valid shape:

In [None]:
#Predicting the response for training data and test data
#Note: gbm is the model fitted in the last cross-validation fold
y_pred = gbm.predict_proba(train_data.loc[:,train_data.columns[0:188]])
y_pred_test = gbm.predict_proba(test_data)
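
The prediction cell above uses `gbm`, which after the loop refers to the model from the final fold, while the per-fold test predictions collected in `gbm_test_preds` are not used afterwards. One common alternative, which this notebook does not apply, is to average the fold predictions. A minimal sketch with made-up numbers (the `fold_test_preds` list is a hypothetical stand-in for `gbm_test_preds`):

```python
import numpy as np

# Hypothetical stand-in for gbm_test_preds: one array of P(default) per fold
fold_test_preds = [
    np.array([0.10, 0.80, 0.40]),
    np.array([0.20, 0.70, 0.50]),
    np.array([0.15, 0.90, 0.30]),
]

# Stack to shape (n_folds, n_customers), then average across folds
blended = np.mean(np.vstack(fold_test_preds), axis=0)
print(blended)
```

Whether averaging would improve the submitted score here is untested; the sketch only shows the mechanics.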

###**Model Evaluation**
#####**Amex Metrics**
The evaluation metric, M, for this competition is the mean of two measures of rank ordering: Normalized Gini Coefficient, G, and default rate captured at 4%, D.

$$M = 0.5 \cdot (G + D)$$

The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions, and represents a Sensitivity/Recall statistic.

For both of the sub-metrics G and D, the negative labels are given a weight of 20 to adjust for downsampling.

This metric has a maximum value of 1.0.
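
As a small worked example of the default rate captured at 4%, the sketch below re-expresses that sub-metric in NumPy on toy data. The function name `default_rate_captured` and the toy labels and scores are assumptions made for illustration; the notebook's own `amex_metric` function remains the reference.

```python
import numpy as np

def default_rate_captured(y_true, y_score, neg_weight=20, top_frac=0.04):
    """Share of defaults found in the top-ranked 4% of weighted predictions."""
    y_true = np.asarray(y_true)
    # Rank customers from highest to lowest predicted score
    order = np.argsort(-np.asarray(y_score), kind="stable")
    y_sorted = y_true[order]
    # Negatives get weight 20 to adjust for downsampling, positives weight 1
    weights = np.where(y_sorted == 0, neg_weight, 1)
    cutoff = int(top_frac * weights.sum())
    in_top = np.cumsum(weights) <= cutoff
    return y_sorted[in_top].sum() / y_true.sum()

# Toy data: 5 defaults and 20 non-defaults, listed from highest to lowest score
y_true = np.array([1, 1, 0, 1, 1, 1] + [0] * 19)
y_score = np.linspace(1.0, 0.0, num=25)

d = default_rate_captured(y_true, y_score)
print(d)  # 0.4
```

Here the total weight is 5 + 20 × 20 = 405, so the cutoff is int(0.04 × 405) = 16. The first two customers (both defaults) have cumulative weight 2, but the third is a non-default with weight 20, which pushes the cumulative weight to 22 and past the cutoff. Two of five defaults are captured, giving D = 0.4.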
#####**Sparse Categorical Accuracy**
Since the Amex metric is difficult to interpret, sparse categorical accuracy is also computed. Because 'target' values are not available for the test data, this metric is evaluated only on the training data. It was found that the higher the categorical accuracy on the training data, the higher the Amex metric on the test data. Hence, a high categorical accuracy on the training data should also indicate good performance on the test data.
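
As a quick illustration of what this accuracy measures, the sketch below takes a two-column probability array of the kind returned by `predict_proba`, picks the more probable class for each row, and compares it with the true label. The numbers are made up for the example.

```python
import numpy as np

# Hypothetical class probabilities: columns are P(no default), P(default)
proba = np.array([[0.9, 0.1],
                  [0.3, 0.7],
                  [0.6, 0.4],
                  [0.2, 0.8]])
labels = np.array([0, 1, 1, 1])

# The predicted class is the column with the highest probability
predicted = np.argmax(proba, axis=1)

# Accuracy is the percentage of rows where prediction and label agree
accuracy = np.mean(predicted == labels) * 100
print(predicted, accuracy)  # [0 1 0 1] 75.0
```

Unlike the Amex metric, this measure ignores how confident each prediction is and only checks which side of 0.5 it falls on.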

In [None]:
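#Sparse Categorical Accuracy
def sparse_categorical_accuracy(y_true, y_pred):
  #y_pred holds per-class probabilities; the predicted class is the most probable one
  correct = np.equal(y_true, np.argmax(y_pred, axis=1))
  #Percentage of rows whose predicted class matches the true label
  return np.mean(correct) * 100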

In [None]:
def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:

    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)



In [None]:
y_true = train_data['target']
print("Sparse categorical accuracy: {:.5f}%".format(sparse_categorical_accuracy(y_true,y_pred)))

Sparse categorical accuracy: 92.87599%


In [None]:
#Note: the metric is computed here on hard class labels (argmax), not on predicted probabilities
print("Amex metric for LGBM model: {:.5f}".format(amex_metric(y_true.to_frame(),pd.DataFrame(np.argmax(y_pred, axis=1), columns = ['prediction'], index = train_data.index))))

Amex metric for LGBM model: 0.68132


###Test Data Evaluation
The Kaggle API is used to submit the predictions to the competition. Enter your Kaggle username and Kaggle key.

In [None]:
lgbm_out = pd.DataFrame(y_pred_test[:,1], columns = ['prediction'], index = test_data.index)
lgbm_out.to_csv('lgbm_out.csv')

In [None]:
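import os
#username and key must be set beforehand to your Kaggle credentials
#The Kaggle API reads the upper-case environment variables KAGGLE_USERNAME and KAGGLE_KEY
os.environ['KAGGLE_USERNAME'] = username
os.environ['KAGGLE_KEY'] = key
import kaggle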

In [None]:
!kaggle competitions submit -c amex-default-prediction -f "C:\Users\ajjay\Amex Kaggle\lgbm_out.csv" -m lgbm

Successfully submitted to American Express - Default Prediction




###**Conclusion**

1.   LGBM, the model trained in this notebook, stands out as the best model in terms of the Amex metric on test data (0.79546) and categorical accuracy on training data (92.87%)
2.   The Amex metric for the Deep Neural Decision Forest model was better than that of the Deep Neural Decision Tree, since the forest is an ensemble model

Overall results are given in the table below.

<table style="margin-left:21.3pt;border-collapse:collapse;border:none;">
    <tbody>
        <tr>
            <td style="width: 134.85pt;border: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;'><strong><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>Model</span></strong></p>
            </td>
            <td style="width: 134.85pt;border-top: 1pt solid windowtext;border-right: 1pt solid windowtext;border-bottom: 1pt solid windowtext;border-image: initial;border-left: none;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;'><strong><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>Categorical Accuracy on Training Data</span></strong></p>
            </td>
            <td style="width: 134.9pt;border-top: 1pt solid windowtext;border-right: 1pt solid windowtext;border-bottom: 1pt solid windowtext;border-image: initial;border-left: none;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;'><strong><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>Amex Metric on Training Data</span></strong></p>
            </td>
            <td style="width: 134.9pt;border-top: 1pt solid windowtext;border-right: 1pt solid windowtext;border-bottom: 1pt solid windowtext;border-image: initial;border-left: none;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;'><strong><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>Amex Metric on Test Data</span></strong></p>
            </td>
        </tr>
        <tr>
            <td style="width: 134.85pt;border-right: 1pt solid windowtext;border-bottom: 1pt solid windowtext;border-left: 1pt solid windowtext;border-image: initial;border-top: none;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>LGBM</span></p>
            </td>
            <td style="width: 134.85pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>92.8759%</span></p>
            </td>
            <td style="width: 134.9pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>0.6813</span></p>
            </td>
            <td style="width: 134.9pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>0.7954</span></p>
            </td>
        </tr>
        <tr>
            <td style="width: 134.85pt;border-right: 1pt solid windowtext;border-bottom: 1pt solid windowtext;border-left: 1pt solid windowtext;border-image: initial;border-top: none;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>Deep Neural Decision Tree</span></p>
            </td>
            <td style="width: 134.85pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>90.4552%</span></p>
            </td>
            <td style="width: 134.9pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>0.7907</span></p>
            </td>
            <td style="width: 134.9pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>0.7818</span></p>
            </td>
        </tr>
        <tr>
            <td style="width: 134.85pt;border-right: 1pt solid windowtext;border-bottom: 1pt solid windowtext;border-left: 1pt solid windowtext;border-image: initial;border-top: none;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>Deep Neural Decision Forest</span></p>
            </td>
            <td style="width: 134.85pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>90.5866%</span></p>
            </td>
            <td style="width: 134.9pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>0.8001</span></p>
            </td>
            <td style="width: 134.9pt;border-top: none;border-left: none;border-bottom: 1pt solid windowtext;border-right: 1pt solid windowtext;padding: 0cm 5.4pt;vertical-align: top;">
                <p style='margin-top:0cm;margin-right:0cm;margin-bottom:0cm;margin-left:0cm;line-height:normal;font-size:15px;font-family:"Calibri",sans-serif;text-align:center;'><span style='font-size:16px;font-family:"Open Sans",sans-serif;'>0.7917</span></p>
            </td>
        </tr>
    </tbody>
</table>