# <h1 align = "center"><div style = "background-color: #0070D1; color:white; border-radius: 15px; padding: 20px; margin: 2px;">American Express - Default Prediction</div></h1>

<img src = "https://d187qskirji7ti.cloudfront.net/news/wp-content/uploads/2020/02/Blue-Business-Cash-Card-From-American-Express-Credit-Card.jpg" width = 100%>

# <h1><div style = "background-color: #0070D1; color:white; border-radius: 15px; padding: 20px; margin: 2px;">0. Introduction and Overview 📖</div></h1>
**Default risk** is the chance that a company or individual will be unable to make the required payments on a debt obligation. In other words, it is the ***probability*** that a borrower will not repay the money they were lent on time. Lenders and investors are exposed to default risk in virtually all forms of credit extension.

**American Express** is a globally integrated payments company and the largest payment card issuer in the world. The objective of this competition is to predict the probability that a customer will not pay back their credit card balance in the future, based on their monthly customer profile. The binary target is calculated by observing an 18-month performance window after the latest credit card statement: if the customer does not pay the amount due within 120 days of their latest statement date, it is considered a **default event**.

The dataset contains different features for the customers such as:
> - **`D_*`**: Delinquency variables
> - **`S_*`**: Spend variables
> - **`P_*`**: Payment variables
> - **`B_*`**: Balance variables
> - **`R_*`**: Risk variables

The categorical features of the dataset are: `['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']`

**Note**: The negative class has been subsampled for this dataset at 5%, and thus receives a 20x weighting in the scoring metric.
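
To make the 20x weighting concrete, here is a minimal sketch on made-up labels (not competition data). It assumes the weight of a negative row is simply the inverse of the 5% sampling rate, and compares the default rate seen in the subsampled data with the rate implied after weighting the negatives back up. The names `NEGATIVE_SAMPLING_RATE`, `observed_rate` and `implied_rate` are my own.

```python
import numpy as np

# Toy labels: 1 = default, 0 = no default (illustrative values only).
y = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# Negatives were kept at a 5% rate, so each one stands in for 1 / 0.05 = 20 originals.
NEGATIVE_SAMPLING_RATE = 0.05
weights = np.where(y == 0, 1.0 / NEGATIVE_SAMPLING_RATE, 1.0)

# Default rate as observed in the subsampled data.
observed_rate = y.mean()

# Default rate implied once every negative counts 20 times.
implied_rate = (weights * y).sum() / weights.sum()

print(f"observed default rate: {observed_rate:.4f}")
print(f"implied default rate:  {implied_rate:.4f}")
```

The gap between the two numbers is the reason the raw class balance in this dataset should not be read as a real-world default rate.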

# <h1><div style = "background-color: #0070D1; color:white; border-radius: 15px; padding: 20px; margin: 2px;">1. Imports ⚙️</div></h1>

In [None]:
import os
import warnings
import math
# ===============================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# ===============================
import cudf
import cupy 
import cuml
# ===============================
warnings.filterwarnings('ignore')

In [None]:
!nvidia-smi

In [None]:
!nvcc --version

In [None]:
custom_colors = (['#0070D1', '#0099E7', '#00BBD8', '#00D8B0', '#8AED85', '#F9F871', '#F4F4B0'])
# `sns.set_palette` returns None, so there is nothing useful to assign
sns.set_palette(sns.color_palette(custom_colors))
sns.palplot(sns.color_palette(custom_colors), size = 1)
plt.tick_params(axis = 'both', labelsize = 0, length = 0)
plt.style.use('dark_background')

Instead of the original competition files, I will be using the compressed `.parquet` files shared by [@raddar](https://www.kaggle.com/raddar) in this [dataset](https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format). 

In [None]:
def check_memory_usage(df):
    # total memory footprint of the dataframe (index included), in MB
    mem_mb = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(mem_mb))

FILE_PATH_PARQUETS = '../input/amex-data-integer-dtypes-parquet-format/'
FILE_PATH_CSV = '../input/amex-default-prediction/'

train_features = cudf.read_parquet(FILE_PATH_PARQUETS + 'train.parquet')
test = cudf.read_parquet(FILE_PATH_PARQUETS + 'test.parquet')
train_labels = cudf.read_csv(FILE_PATH_CSV + 'train_labels.csv')

check_memory_usage(train_features)
check_memory_usage(test)
check_memory_usage(train_labels)

train = cudf.merge(train_features, train_labels, on = 'customer_ID')
check_memory_usage(train)
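
As a rough illustration of why integer dtypes help with memory, the sketch below stores the same hypothetical small-integer column as `float64` and as `int8` and compares the footprint. It uses plain pandas on the CPU as a stand-in for `cudf`, and the column name `feature` and the helper `memory_mb` are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000

# A hypothetical column that only ever takes small integer values.
as_float = pd.DataFrame({"feature": rng.integers(0, 10, size=n_rows).astype("float64")})
as_int8 = as_float.astype("int8")


def memory_mb(df):
    """Memory used by the columns of `df` (index excluded), in MB."""
    return df.memory_usage(index=False).sum() / 1024 ** 2


print(f"float64: {memory_mb(as_float):.3f} MB")
print(f"int8:    {memory_mb(as_int8):.3f} MB")
```

A `float64` value takes 8 bytes and an `int8` value takes 1, so the integer version of this column is 8 times smaller.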

We merge `train_features` and `train_labels` into a single dataframe, `train`, so that the features and labels are in one place. With the dataframe ready, we can look at the number of missing values in each column and in the dataset as a whole.

In [None]:
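# every column except the label; note that this still includes `customer_ID` and the statement date `S_2`
all_features = [af for af in train.columns if af != 'target']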
categorical_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
numerical_features = [nf for nf in all_features if nf not in categorical_features]

print(train.isna().sum())
print('================================')
print('Total Missing Values = {}'.format(train.isna().sum().sum()))
print('================================')
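
The merge and the missing-value count can be sketched on a toy frame. This is a pandas stand-in for the `cudf` calls above, with made-up values. The `validate` argument is a pandas option and I have not checked whether `cudf.merge` accepts it.

```python
import pandas as pd

# Several statement rows per customer (values are invented).
features = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "c", "c", "c"],
    "P_2": [0.9, 0.8, None, 0.4, 0.5, None],
})

# Exactly one label per customer.
labels = pd.DataFrame({"customer_ID": ["a", "b", "c"], "target": [0, 1, 0]})

# Each customer's label is repeated on every one of their statement rows.
merged = features.merge(labels, on="customer_ID", how="inner", validate="many_to_one")

missing_per_column = merged.isna().sum()
total_missing = int(missing_per_column.sum())

print(merged)
print(missing_per_column)
print("Total Missing Values =", total_missing)
```

Because the label is copied onto every statement row, any count taken over `target` in the merged frame is a count of rows, not of customers.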

# <h1><div style = "background-color: #0070D1; color:white; border-radius: 15px; padding: 20px; margin: 2px;">2. Exploratory Data Analysis (EDA) 📊</div></h1>

In [None]:
# count missing values per column, keeping only the columns that have any
missing_value_dict = {col: int(train[col].isna().sum()) for col in train.columns if train[col].isna().any()}

# keep the 10 columns with the most missing values, in descending order
missing_value_dict = dict(sorted(missing_value_dict.items(), key = lambda x: x[1], reverse = True)[:10])
    
feature_names_with_missing_vals = list(missing_value_dict.keys())
missing_vals = list(missing_value_dict.values())

plt.figure(figsize = (15, 12))
ax = plt.axes()
ax.set_facecolor('black')
sns.barplot(x = missing_vals, y = feature_names_with_missing_vals, 
            palette = ['#0000C7', custom_colors[0], custom_colors[1], custom_colors[2], custom_colors[3], 
                        custom_colors[4], '#A2F79F', custom_colors[5], custom_colors[6], '#F7F6DA'])
plt.title('Features with Highest Missing Values', size = 25)
plt.xlabel('Missing Values', size = 20)
plt.xticks(size = 12)
plt.ylabel('Feature Names', size = 20)
plt.yticks(size = 12)
bbox_args = dict(boxstyle = 'round', fc = '0.9')
for p in ax.patches:
    width = p.get_width()
    plt.text(9.5 + p.get_width(), p.get_y() + 0.5 * p.get_height(), '{:1.0f}'.format(width), 
             ha = 'center', 
             va = 'center', 
             color = 'black', 
             bbox = bbox_args, 
             fontsize = 15)
plt.show()

Looking at the 10 features with the most missing values, **`D_88`** tops the list with 5525447 missing entries. **`D_110`** and **`B_39`** follow closely, with 5500117 and 5497819 missing values respectively. With 190 features in the dataset, it might be better to drop these 10 before training our model, since so many of their values are missing.
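
One way to turn this into a rule is to drop every column whose share of missing values exceeds a cut-off. The sketch below does that on a toy pandas frame. The threshold of `0.5` and the column names are hypothetical, and a sensible cut-off for the real data would need to be validated against model performance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "some_missing": [0.2, np.nan, 0.4, 0.1, 0.3],
    "complete": [1, 2, 3, 4, 5],
})

MISSING_THRESHOLD = 0.5  # hypothetical cut-off, not tuned on the competition data

# Fraction of missing values per column, highest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)

# Columns above the cut-off are dropped; the rest are kept as they are.
to_drop = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()
reduced = df.drop(columns=to_drop)

print(missing_fraction)
print("dropped:", to_drop)
```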

In [None]:
target_index = cupy.array(train['target'].value_counts().index)
target_index = target_index.tolist()
target_vals = cudf.DataFrame(train['target'].value_counts())
target_values = cupy.array(target_vals['target']).tolist()

plt.figure(figsize = (15, 12))
ax = plt.axes()
ax.set_facecolor('black')
sns.barplot(x = target_index, y = target_values, palette = [custom_colors[0], custom_colors[2]], edgecolor = 'white', linewidth = 1.1)
plt.title('Credit Default Count', size = 25)
plt.xlabel('Credit Default', size = 20)
plt.xticks(size = 15)
plt.ylabel('Count', size = 20)
plt.yticks(size = 15)
bbox_args = dict(boxstyle = 'round', fc = '0.9')
for p in ax.patches:
    ax.annotate('{:.0f} = {:.2f}%'.format(p.get_height(), (p.get_height() / len(train['target'])) * 100), 
                (p.get_x() + 0.25, p.get_height() + (50000 + 1e4)), 
                color = 'black',
                bbox = bbox_args,
                fontsize = 15)
plt.show()

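Rows belonging to customers who pay their dues on time make up `75.09%` of the merged dataset, and the remaining `24.91%` belong to customers who default. Since `train` holds one row per statement, these are shares of rows rather than of distinct customers.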
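
The difference between counting rows and counting customers can be seen on a toy example. In the sketch below (made-up data, pandas instead of `cudf`), the two defaulting customers have fewer statements than the others, so the row-level default rate comes out lower than the customer-level one.

```python
import pandas as pd

# Four hypothetical customers with different numbers of statements.
train_toy = pd.DataFrame({
    "customer_ID": ["a"] * 13 + ["b"] * 3 + ["c"] * 13 + ["d"] * 1,
    "target": [0] * 13 + [1] * 3 + [0] * 13 + [1] * 1,
})

# Share of rows whose label is 1.
row_level_rate = train_toy["target"].mean()

# Share of customers whose label is 1 (the label is constant within a customer).
customer_level_rate = train_toy.groupby("customer_ID")["target"].first().mean()

print(f"row-level default rate:      {row_level_rate:.4f}")
print(f"customer-level default rate: {customer_level_rate:.4f}")
```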

In [None]:
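def fill_missing_with_mean(column_startletter):
    # numerical columns of the requested group (categorical ones are skipped)
    cols = [c for c in train.columns if (c.startswith(column_startletter) and (c not in categorical_features))]
    
    # the Spend group has no categorical columns, but `S_2` is the statement date and must be excluded
    if(column_startletter == 'S'):
        cols = [c for c in train.columns if (c.startswith(column_startletter) and (c != 'S_2'))]
        
    # note: this fills the global `train` dataframe in place
    for x in cols: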
        if(train[x].isna().sum() != 0):
            train[x] = train[x].astype(float)
            train[x] = train[x].fillna(train[x].mean())

    temp = cudf.DataFrame(train)
    temp = temp[cols]
    temp['target'] = train['target']
    return cols, temp

def create_variable_distributions(column_startletter, set_sample_size, color1, color2):
    
    cols, temp = fill_missing_with_mean(column_startletter)
    
    no_of_columns = 5
    no_of_rows = math.ceil(len(cols) / no_of_columns)
    
    if(column_startletter == 'D'):
        fig, axes = plt.subplots(no_of_rows, no_of_columns, figsize = (25, 75))
        column_startletter = 'Delinquency'
        fig.suptitle('Distribution of ' + column_startletter + ' Variables', fontsize = 28)
    if(column_startletter == 'S'):
        fig, axes = plt.subplots(no_of_rows, no_of_columns, figsize = (25, 30))
        column_startletter = 'Spend'
        fig.suptitle('Distribution of ' + column_startletter + ' Variables', fontsize = 28)
    if(column_startletter == 'P'):
        fig, axes = plt.subplots(no_of_rows, no_of_columns, figsize = (45, 7))
        column_startletter = 'Payment'
        fig.suptitle('Distribution of ' + column_startletter + ' Variables', fontsize = 28, x = 0.27, y = 0.983)
    if(column_startletter == 'B'):
        fig, axes = plt.subplots(no_of_rows, no_of_columns, figsize = (25, 40))
        column_startletter = 'Balance'
        fig.suptitle('Distribution of ' + column_startletter + ' Variables', fontsize = 28)
    if(column_startletter == 'R'):
        fig, axes = plt.subplots(no_of_rows, no_of_columns, figsize = (25, 30))
        column_startletter = 'Risk'
        fig.suptitle('Distribution of ' + column_startletter + ' Variables', fontsize = 28)

        
    for i, ax in enumerate(axes.reshape(-1)):
            if i < len(cols):
                
                # sample `set_sample_size` rows once so that every value keeps its own label
                sampled = temp[[cols[i], 'target']].sample(set_sample_size)
                sns.kdeplot(x = cupy.array(sampled[cols[i]]).tolist(), 
                            hue = cupy.array(sampled['target']).tolist(),
                            palette = [color1, color2],
                            fill = True, 
                            legend = False, 
                            linewidth = 1.8, 
                            ax = ax)
                ax.set_title(cols[i], fontsize = 20)
                ax.tick_params(left = False, bottom = False, labelsize = 15)
                
    if(column_startletter.startswith('D')):
        for col in range(2, 5):
            axes[17, col].set_visible(False)
        plt.tight_layout(rect = [0, 0.2, 0.99, 0.975])
        fig.legend(labels = ['Default','Paid'], ncol = 2, bbox_to_anchor = (0.18, 0.983), prop = {'size': 20})
    if(column_startletter.startswith('S')):
        for col in range(1, 5):
            axes[4, col].set_visible(False)
        plt.tight_layout(rect = [0, 0.2, 0.99, 0.975])
        fig.legend(labels = ['Default','Paid'], ncol = 2, bbox_to_anchor = (0.18, 0.983), prop = {'size': 20})
    if(column_startletter.startswith('P')):
        for col in range(3, 5):
            axes[col].set_visible(False)
        plt.tight_layout(rect = [0, 0.2, 0.9, 0.9])
        fig.legend(labels = ['Default','Paid'], ncol = 2, bbox_to_anchor = (0.08, 0.983), prop = {'size': 15})
    if(column_startletter.startswith('B')):
        for col in range(3, 5):
            axes[7, col].set_visible(False)
        plt.tight_layout(rect = [0, 0.2, 0.99, 0.975])
        fig.legend(labels = ['Default','Paid'], ncol = 2, bbox_to_anchor = (0.18, 0.983), prop = {'size': 20})
    if(column_startletter.startswith('R')):
        for col in range(3, 5):
            axes[5, col].set_visible(False)
        plt.tight_layout(rect = [0, 0.2, 0.99, 0.975])
        fig.legend(labels = ['Default','Paid'], ncol = 2, bbox_to_anchor = (0.18, 0.983), prop = {'size': 20})

    sns.despine(bottom = True, trim = True)
    plt.show()
    
create_variable_distributions('D', 100000, custom_colors[1], custom_colors[6])
create_variable_distributions('S', 100000, custom_colors[2], custom_colors[5])
create_variable_distributions('P', 100000, custom_colors[0], custom_colors[6])
create_variable_distributions('B', 100000, custom_colors[1], custom_colors[5])
create_variable_distributions('R', 100000, custom_colors[0], custom_colors[3])
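
When a class-conditional density plot is drawn from a sample, the feature values and the labels have to come from the same sampled rows. The sketch below shows why on synthetic data, using pandas rather than `cudf`, with a hypothetical `feature` whose mean shifts by 2 between the classes. Sampling rows once preserves the shift, while sampling the two columns independently scrambles the labels and the shift disappears.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic data: the feature mean is 0 for target 0 and 2 for target 1.
target = rng.integers(0, 2, size=n)
feature = rng.normal(loc=target * 2.0, scale=1.0)
df = pd.DataFrame({"feature": feature, "target": target})

# Paired: one draw of rows, so each value keeps its own label.
paired = df.sample(2_000, random_state=0)
means = paired.groupby("target")["feature"].mean()
gap_paired = means.loc[1] - means.loc[0]

# Unpaired: two independent draws, so values and labels no longer match.
x_only = df["feature"].sample(2_000, random_state=1).to_numpy()
y_only = df["target"].sample(2_000, random_state=2).to_numpy()
gap_unpaired = x_only[y_only == 1].mean() - x_only[y_only == 0].mean()

print(f"class gap with paired sampling:   {gap_paired:.2f}")
print(f"class gap with unpaired sampling: {gap_unpaired:.2f}")
```

With unpaired sampling the two class curves in a KDE plot would look nearly identical for every feature, whatever the real relationship with the target.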

In [None]:
def plot_feature_correlations(type_of_feature, color):
    correlational_features = [cc for cc in train.columns if cc.startswith(type_of_feature) and (cc not in categorical_features)]
    corr_data = train[correlational_features]
    corr_data = corr_data.select_dtypes(exclude = ['object'])
    # keep only the columns without missing values
    missing_value_cols = [col for col in corr_data.columns if corr_data[col].isna().any()]
    corr_data = corr_data.drop(columns = missing_value_cols)
    
    # considering a sample of 100000
    limited_samples = corr_data.sample(100000)
    corr_values = (limited_samples.iloc[:, :].corr()).values
    corr_values = cupy.float32(corr_values.get())
    
    plt.figure(figsize = (25, 25))
    
    # the Delinquency group has many more columns, so its annotations use one decimal place
    annotation_format = '.1f' if type_of_feature == 'D' else '.2f'
    sns.heatmap(corr_values, annot = True, vmin = -1, vmax = 1, center = 0, square = True, fmt = annotation_format, cbar = False, cmap = color)

    feature_type_names = {'D': 'Delinquency', 'S': 'Spend', 'P': 'Payment', 'B': 'Balance', 'R': 'Risk'}
    plt.title(feature_type_names.get(type_of_feature, type_of_feature) + ' Variables Correlation', size = 25)
    plt.show()
    
plot_feature_correlations('D', 'Blues')
plot_feature_correlations('S', 'YlGnBu')
plot_feature_correlations('P', 'PuBuGn')
plot_feature_correlations('B', 'PuBu')
plot_feature_correlations('R', 'GnBu')
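
Large heatmaps can be hard to read, so it may help to also list the strongly correlated pairs by name. The sketch below does this on synthetic data with pandas. The column names `X_1` to `X_3` and the `0.9` cut-off are made up for the example, and only the upper triangle of the matrix is kept so that each pair appears once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=5_000)

df = pd.DataFrame({
    "X_1": base,
    "X_2": base + rng.normal(scale=0.1, size=5_000),  # nearly a copy of X_1
    "X_3": rng.normal(size=5_000),                     # independent noise
})

# Correlation matrix on a random sample of rows, with column labels kept.
corr = df.sample(1_000, random_state=0).corr()

# Mask the diagonal and the lower triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per (column, column) pair; masked entries never pass the filter.
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > 0.9]

print(strong_pairs)
```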

# <h1><div style = "background-color: #0070D1; color:white; border-radius: 15px; padding: 20px; margin: 2px;">3. References 📚</div></h1>
> - https://www.kaggle.com/code/datark1/american-express-eda
> - https://www.kaggle.com/code/kellibelcher/amex-default-prediction-eda-lgbm-baseline
> - https://www.kaggle.com/code/cdeotte/xgboost-starter-0-793
> - https://mycolor.space/

<div class="alert alert-warning">
  <strong>🚧 Work In Progress 🚧</strong>
</div>