# American Express - Default Prediction


This analysis reuses code from the following notebooks:
- https://www.kaggle.com/code/ambrosm/amex-eda-which-makes-sense/notebook
- https://www.kaggle.com/code/ihelon/default-prediction-eda-and-modeling

It also draws on information from this discussion:
- https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464

## Let's look at the task description

In this competition, you’ll apply your machine learning skills to predict credit default. Specifically, you will leverage an industrial scale data set to build a machine learning model that challenges the current model in production. Training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information. You're free to explore any technique to create the most powerful model, from creating features to using the data in a more organic way within a model.

# Competition metric


### 0.5 * (G + D) 

The competition metric has two components: the normalized Gini coefficient and the default rate captured at 4 %:

* (G) The normalized Gini coefficient is simply a stretched AUC. AUC is the area under the ROC curve and lies between 0 and 1; the normalized Gini coefficient equals 2*AUC-1 and therefore lies between -1 and 1. The larger the area under the curve, the better the score.
* (D) The default rate captured at 4 % is the true positive rate (recall) at a threshold set at 4 % of the total (weighted) sample count. It corresponds to the y coordinate of the point where the 4 % cutoff line intersects the ROC curve and is always between 0 and 1. The higher the intersection point, the better the score.
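The two components can be sketched in a few lines of NumPy. This is a minimal illustration, not the official scoring code: the function name `amex_metric_sketch` is hypothetical, and the weight of 20 for non-default rows (`neg_weight`) is an assumption taken from the competition's evaluation description rather than from this notebook.

```python
import numpy as np

def amex_metric_sketch(y_true, y_pred, neg_weight=20.0):
    """Return 0.5 * (G + D) for binary labels and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    def sorted_target_and_weight(scores):
        # Sort rows from highest to lowest score.
        order = np.argsort(-scores, kind="stable")
        target = y_true[order]
        # Assumption: non-defaults (target == 0) carry weight neg_weight.
        weight = np.where(target == 0, neg_weight, 1.0)
        return target, weight

    def weighted_gini(scores):
        target, weight = sorted_target_and_weight(scores)
        random = np.cumsum(weight) / weight.sum()
        lorentz = np.cumsum(target * weight) / (target * weight).sum()
        return ((lorentz - random) * weight).sum()

    # G: Gini of the predictions, normalized by the Gini of a perfect ranking.
    g = weighted_gini(y_pred) / weighted_gini(y_true)

    # D: share of all defaults found in the top 4 % of the weighted rows.
    target, weight = sorted_target_and_weight(y_pred)
    in_top = np.cumsum(weight) <= 0.04 * weight.sum()
    d = target[in_top].sum() / target.sum()

    return 0.5 * (g + d)

# Toy data: 10 defaults among 1000 customers.
labels = np.array([1] * 10 + [0] * 990)
perfect_score = amex_metric_sketch(labels, labels)      # perfect ranking
worst_score = amex_metric_sketch(labels, 1 - labels)    # reversed ranking
```

A perfect ranking reaches the maximum of 1.0 because both G and D equal 1; the reversed ranking captures no defaults in the top 4 % and scores far lower.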

In [None]:
import os
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

In [None]:
data_path = Path("../input/amex-default-prediction")
os.listdir(data_path)

In [None]:
n_rows = 15000

# With chunksize set, read_csv returns an iterator over DataFrames, not a single DataFrame
train_df = pd.read_csv(data_path / "train_data.csv", chunksize=n_rows)
train_labels_df = pd.read_csv(data_path / "train_labels.csv")

In [None]:
train_df_example = next(train_df)  # first chunk of n_rows rows


In [None]:
train_df_example.info()

In [None]:
train_df_example.head()

In [None]:
train_df_example.tail()

In [None]:
train_labels_df.head()

In [None]:
train_labels_df.shape

In [None]:
train_labels_df.customer_ID.duplicated().sum()

In [None]:
train_df_example[train_df_example["customer_ID"] == "009469964a6c21c6f1f50bb9a9881dce39dcaa47801b4f09d6d65d6610a1e0e9"]

In [None]:
cols = list(train_df_example.columns)
print(cols)

### Features are anonymized and normalized, and fall into the following general categories:

* D_* = Delinquency variables
* S_* = Spend variables
* P_* = Payment variables
* B_* = Balance variables
* R_* = Risk variables

In [None]:
cat_cols = [
    'B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 
    'D_126', 'D_63', 'D_64', 'D_66', 'D_68',
]

### Let's try to work with all the data and not a part

The dataset for this competition is large: read from the raw CSV files, the data barely fits in memory. That is why we read it from @munumbutt's AMEX-Feather-Dataset instead. In these Feather files, floating-point precision has been reduced from 64 bits to 16 bits, and because Feather is a binary format, the files load faster than the CSV files.
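To get a feel for what the precision reduction buys, here is a small sketch on synthetic data. The frame `demo` and its column names are made up for illustration; it only shows the general effect of casting `float64` to `float16`, not how the Feather dataset was actually produced.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a block of normalized features in [0, 1).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1000, 5)), columns=[f"f{i}" for i in range(5)])

bytes_64 = demo.memory_usage(index=False).sum()

# Cast every column from 64-bit to 16-bit floats.
demo_16 = demo.astype(np.float16)
bytes_16 = demo_16.memory_usage(index=False).sum()

# Largest absolute rounding error introduced by the cast.
max_err = (demo - demo_16.astype(np.float64)).abs().to_numpy().max()

print(f"float64: {bytes_64} bytes, float16: {bytes_16} bytes, max error: {max_err:.6f}")
```

The memory footprint drops by a factor of four, at the cost of a small rounding error per value.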

In [None]:
train = pd.read_feather('../input/amexfeather/train_data.ftr')
test = pd.read_feather('../input/amexfeather/test_data.ftr')

In [None]:
train.shape

In [None]:
test.shape

### Our observations
* The test set has almost twice as many rows as the training set
* The AMEX-Feather-Dataset fits in RAM

In [None]:
msno.matrix(train[0:2000], figsize = (10,5))
plt.show()

In [None]:
train.info(max_cols=200, show_counts=True)


### Our observations
* There are many missing values in the data
* There are columns with almost no data
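These observations can be quantified by computing the share of missing values per column. The sketch below uses a tiny made-up frame (the column names merely mimic the dataset's naming), and the 50 % threshold for "almost no data" is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration only.
demo = pd.DataFrame({
    "P_2": [0.9, 0.8, np.nan, 0.7],
    "D_42": [np.nan, np.nan, np.nan, 0.1],
    "B_1": [0.01, 0.02, 0.03, 0.04],
})

# Fraction of missing values per column, worst columns first.
missing_share = demo.isna().mean().sort_values(ascending=False)

# Columns where more than half of the values are missing (assumed threshold).
mostly_empty = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_empty)
```

The same two lines can be applied to `train` to rank the real features by their share of missing values.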

In [None]:
temp = pd.concat([train[['customer_ID', 'S_2']], test[['customer_ID', 'S_2']]], axis=0)
temp.set_index('customer_ID', inplace=True)
temp['last_month'] = temp.groupby('customer_ID').S_2.max().dt.month

plt.figure(figsize=(16, 4))
plt.hist([temp.S_2[temp.last_month == 3],   # ending 03/18 -> training
          temp.S_2[temp.last_month == 4],   # ending 04/19 -> public lb
          temp.S_2[temp.last_month == 10]], # ending 10/19 -> private lb
         bins=pd.date_range("2017-03-01", "2019-11-01", freq="MS"),
         label=['Training', 'Public leaderboard', 'Private leaderboard'],
         stacked=True)
plt.xticks(pd.date_range("2017-03-01", "2019-11-01", freq="QS"))
plt.xlabel('Statement date')
plt.ylabel('Count')
plt.title('The three datasets', fontsize=20)
plt.legend()
plt.show()

### Our observations
* The statement dates of the training data and the test data do not overlap
* The date ranges of the public and private leaderboard data do overlap

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
train_sc = train.customer_ID.value_counts().value_counts().sort_index(ascending=False).rename('Train statements per customer')
ax1.pie(train_sc, labels=train_sc.index)
ax1.set_title(train_sc.name)
test_sc = test.customer_ID.value_counts().value_counts().sort_index(ascending=False).rename('Test statements per customer')
ax2.pie(test_sc, labels=test_sc.index)
ax2.set_title(test_sc.name)
plt.show()

### Our observations
* The most common number of statements per customer is 13

# Categorical data

In [None]:
cat_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
ind = 0
for col in cat_features:
    if ind % 4 == 0:
        plt.figure(figsize=(16, 3))
    plt.subplot(1, 4, ind % 4 + 1)
    
    sns.countplot(data=train, x=col, hue="target")
    plt.ylabel("")
    
    if ind % 4 == 3:
        plt.show()
    
    ind += 1

### Our observations
* Features B_30, D_116, D_63, D_64, D_66 and D_68 contain categories that are relatively rare
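Rarity can also be read off numerically instead of from the count plots. A minimal sketch, assuming a made-up column and an arbitrary 2 % cutoff for what counts as rare:

```python
import pandas as pd

# Toy categorical column; the level names and counts are invented.
demo = pd.DataFrame({"D_63": ["CO"] * 95 + ["CR"] * 4 + ["XZ"] * 1})

# Relative frequency of each level.
shares = demo["D_63"].value_counts(normalize=True)

# Levels that cover less than 2 % of the rows (assumed cutoff).
rare_levels = shares[shares < 0.02].index.tolist()

print(shares)
print(rare_levels)
```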

# Continuous features

In [None]:
ind = 0  # reset the subplot counter left over from the categorical plots
for col in list(train.columns):
    if col in ["S_2", "customer_ID", "target"] + cat_features:
        continue
    
    if ind % 4 == 0:
        plt.figure(figsize=(16, 4))
    plt.subplot(1, 4, ind % 4 + 1)
    
    sns.histplot(data=train, x=col, hue="target", bins=20)
    plt.ylabel("")
    
    if ind % 4 == 3:
        plt.show()
    
    ind += 1

## Let's look at the data from a different angle

In [None]:
cont_features = sorted([f for f in train.columns if f not in cat_features + ['customer_ID', 'target', 'S_2']])
# print(cont_features)
ncols = 4
for i, f in enumerate(cont_features):
    if i % ncols == 0: 
        if i > 0: plt.show()
        plt.figure(figsize=(16, 3))
        if i == 0: plt.suptitle('Continuous features', fontsize=20, y=1.02)
    plt.subplot(1, ncols, i % ncols + 1)
    plt.hist(train[f], bins=200)
    plt.xlabel(f)
plt.show()

### Our observations
* Features whose histograms show large empty ranges contain outliers, and quite a few features are affected
* S_8, B_18 and B_16 may actually be categorical
* There are strange gaps in the distributions of S_15, S_18, P_2 and D_47
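One way to follow up on the suspicion that some continuous features are really categorical is to count distinct values per column. The sketch below runs on synthetic data; the column names and the cutoff of 20 distinct values are assumptions, and real features that carry small numeric noise might need rounding before such a check is meaningful.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "smooth": rng.random(1000),                                  # truly continuous
    "few_levels": rng.integers(0, 5, size=1000).astype(float),   # only a handful of values
})

# Number of distinct non-missing values per column.
n_unique = demo.nunique()

# Columns with at most 20 distinct values are candidates for categorical treatment.
candidates = n_unique[n_unique <= 20].index.tolist()

print(n_unique)
print(candidates)
```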

# Feature correlation

In [None]:
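# Restrict to numeric columns so that IDs, dates and categorical columns do not break corr()
correlations = train.select_dtypes(include="number").corr()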

In [None]:
fig, ax = plt.subplots(figsize=(14,14))
sns.heatmap(correlations,ax = ax)
plt.show()

In [None]:
correlations = correlations.unstack()
correlations.sort_values(ascending=False, kind="quicksort").drop_duplicates().head(30)

In [None]:
correlations.sort_values(ascending=False, kind="quicksort").drop_duplicates().tail(30)

### Our observations
* Quite a lot of features are strongly correlated with each other.
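Sorting the unstacked matrix lists every pair twice and includes each feature's correlation with itself. An alternative sketch, on synthetic data with made-up column names, keeps only the upper triangle so that each pair appears once:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.05, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                     # independent of both
})

corr = demo.corr()

# Keep only entries strictly above the diagonal: one value per feature pair.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Flatten to (feature, feature) -> correlation and sort by absolute strength.
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)

print(pairs.head())
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, so a single listing covers both ends.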

In [None]:
correlations.to_csv("Corr.csv")