# American Express - Default Prediction

![Amex-intro-image](https://storage.googleapis.com/kaggle-organizations/3804/thumbnail.png)

In this competition, the aim is to predict credit [default](https://www.investopedia.com/terms/d/default2.asp) using industrial-scale data provided by [American Express](https://www.americanexpress.com/en-in/).

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        file_size = round(os.path.getsize(os.path.join(dirname, filename)) / (1e9), 2)
        print(f"Filename : {filename} \t File Size : {file_size} GB")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import warnings
from tqdm import tqdm

random.seed(42)
np.random.seed(42)  # NumPy keeps its own RNG state; seed it so np.random.choice is repeatable
plt.style.use('fivethirtyeight')
warnings.filterwarnings('ignore')
sns.color_palette("flare", as_cmap=True)

# Data Loading ...

## Loading Dataset
Since the dataset is just too big, we can't read it directly: it would take up all the memory available in our Kaggle kernel.

## How to read it then?
1. Reading in chunks
2. Reading from another storage format such as Parquet, Feather, etc.
3. Assigning a smaller dtype to each column
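
As a rough illustration of option 1, the sketch below reads a CSV in chunks and accumulates a per-column null count without ever holding the full table in memory. The tiny in-memory CSV and the `CHUNK_SIZE` value are made up for the example; on the real data you would pass the path to the competition CSV instead.

```python
import io
import pandas as pd

# Toy stand-in for a large CSV file (made-up values, B_1 is empty on every 4th row)
csv_text = "customer_ID,P_2,B_1\n" + "\n".join(
    f"c{i % 3},{i * 0.1:.1f},{'' if i % 4 == 0 else i}" for i in range(10)
)

CHUNK_SIZE = 4  # hypothetical value; tune it to the available memory
null_counts = None
n_rows = 0

# Passing chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=CHUNK_SIZE):
    n_rows += len(chunk)
    counts = chunk.isna().sum()
    null_counts = counts if null_counts is None else null_counts + counts

print("Rows read :", n_rows)
print(null_counts)
```

Statistics that can be summed across chunks (counts, sums) work well this way; anything that needs the whole column at once, such as a median or a histogram with data-driven bins, does not.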

## Reading in other format
I'm using this option because:
* Reading in chunks won't give us an overall look at the distributions, and choosing a good `CHUNK_SIZE` is a question in itself
* I tried assigning smaller dtypes such as `float16` as shown [here](https://www.kaggle.com/code/sudalairajkumar/simple-explroration-notebook-amex-default?scriptVersionId=96684923&cellId=7), but it was still taking too long to load
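
For reference, here is a minimal sketch of what the smaller-dtype approach buys in memory, on a toy frame with illustrative column names (not the real data). It also checks the rounding error, since `float16` keeps only about three significant decimal digits.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame standing in for the real features; values lie in [0, 1)
df = pd.DataFrame(rng.random((1000, 4)), columns=["P_2", "B_1", "S_3", "R_1"])

before = df.memory_usage(deep=True).sum()
df_small = df.astype(np.float16)  # 2 bytes per value instead of 8
after = df_small.memory_usage(deep=True).sum()

print(f"float64 : {before} bytes, float16 : {after} bytes")

# Largest difference introduced by the down-cast
max_err = (df - df_small.astype(np.float64)).abs().max().max()
print(f"Max absolute rounding error : {max_err:.6f}")
```

The memory saving is real, but parsing the CSV text is still the slow part, which matches the experience above that loading stayed slow.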

Thanks to [raddar](https://www.kaggle.com/raddar) for providing the dataset in Parquet format: https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format

In [None]:
%%time
df_train = pd.read_parquet("../input/amex-data-integer-dtypes-parquet-format/train.parquet")

Zuppppp ⚡

In [None]:
%%time
labels = pd.read_csv('../input/amex-default-prediction/train_labels.csv')
# Attach the per-customer target to every statement row
df_train = df_train.merge(labels, on='customer_ID')

This one takes around a minute 🤷
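
A quick sanity check after a join like this is to confirm that it neither added nor dropped rows. The sketch below uses made-up toy frames and, unlike the cell above, a left join with `validate="many_to_one"`, so pandas raises an error if the labels table ever contains a customer twice. Treat it as one possible check, not the only one.

```python
import pandas as pd

# Toy statements: several rows per customer (values are made up)
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "c", "c", "c"],
    "P_2": [0.9, 0.8, 0.4, 0.7, 0.6, 0.5],
})
# Toy labels: exactly one row per customer
toy_labels = pd.DataFrame({"customer_ID": ["a", "b", "c"], "target": [0, 1, 0]})

merged = statements.merge(
    toy_labels,
    on="customer_ID",
    how="left",              # keep every statement row, labelled or not
    validate="many_to_one",  # raises MergeError if labels repeat a customer
)

# A label join should neither add nor drop statement rows
assert len(merged) == len(statements)
print("Rows without a label :", merged["target"].isna().sum())
```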

In [None]:
print("Shape of dataset :", df_train.shape)

In [None]:
df_train.head()

## Column info

The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

* D_* = Delinquency variables (late or missed payments on money owed)
* S_* = Spend variables
* P_* = Payment variables
* B_* = Balance variables
* R_* = Risk variables

with the following features being categorical:  
`['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']`



# Let's Explore ... 🚀

# Missing values

In [None]:
null_vals = df_train.isna().sum().sort_values(ascending=False)
null_vals[null_vals > 0 ]

In [None]:
plt.title("Distribution of null values")
null_vals[null_vals > 0 ].plot(kind = 'hist');

So there are some columns whose number of missing values runs into the **millions**, which is close to the number of rows in our dataset, so removing those columns would be better...
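
One way to act on this is to drop columns by their fraction of missing rows rather than by raw counts. This is only a sketch on a toy frame: the column names are illustrative and the `THRESHOLD` value is a hypothetical cut-off, not something derived from this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are illustrative only
df = pd.DataFrame({
    "D_1": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "B_1": [0.1, 0.2, np.nan, 0.4, 0.5],
    "P_2": [0.9, 0.8, 0.7, 0.6, 0.5],
})

THRESHOLD = 0.75  # hypothetical cut-off on the missing fraction

null_frac = df.isna().mean()  # fraction of missing rows per column
to_drop = null_frac[null_frac > THRESHOLD].index.tolist()
df_reduced = df.drop(columns=to_drop)

print("Dropped :", to_drop)
print("Kept    :", df_reduced.columns.tolist())
```

Keep in mind that a mostly-missing column can still be informative if the missingness itself relates to the target, so checking that before dropping is worthwhile.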

In [None]:
plt.figure(figsize=(40,10))
plt.title("Null value count")
plt.xlabel("Columns")
plt.ylabel("Count")
null_vals[null_vals > 0 ].plot(kind="bar");

**Note** :  
The scale on the y-axis is in millions, which shows that a lot of columns need to be removed or preprocessed

# Is the target imbalanced?

In [None]:
# Pass the column as x explicitly; recent seaborn versions do not accept it positionally
sns.countplot(
    x=df_train["target"],
).set_xlabel("Target");

**Yes it is!**
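
Note that a count plot over all rows weights each customer by their number of statements. Since the label is per customer, the per-customer share can differ from the per-row share. The toy example below (made-up data) shows both ways of measuring the imbalance.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "c", "c", "d"],
    "target":      [0,   0,   0,   1,   0,   0,   0],
})

# Share of rows per class (what a count plot over all rows shows)
row_share = toy["target"].value_counts(normalize=True)

# Share of customers per class: keep one label per customer first
cust_share = (
    toy.drop_duplicates("customer_ID")["target"].value_counts(normalize=True)
)

print(row_share)
print(cust_share)
```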

## How is target calculated ?

The binary target is calculated by observing an 18-month performance window after the latest credit card statement. If the customer does not pay the amount due within 120 days of their latest statement date, it is considered a default event.

# Customers

In [None]:
print("Number of unique customers :", df_train["customer_ID"].nunique())

In [None]:
cust_id = np.random.choice(df_train["customer_ID"])
df_train[df_train["customer_ID"] == cust_id]

In [None]:
df_train[df_train["target"] == 1].iloc[100].customer_ID

In [None]:
cust_id = '000473eb907b57c8c23f652bba40f87fe7261273dda47034d46fc46821017e50'
df_train[df_train["customer_ID"] == cust_id]

In [None]:
df_train[df_train["customer_ID"] == cust_id]["S_2"]

So for each customer, we have data for multiple statement dates  
Let's check if the number of statements is consistent across customers...

In [None]:
rand_customers = np.unique(df_train["customer_ID"])[:100] # first 100 customer IDs in sorted order (not a random sample)
id_counts = df_train[df_train["customer_ID"].isin(rand_customers)].groupby("customer_ID").agg("count")

In [None]:
plt.figure(figsize=(20,10))
id_counts["S_2"].plot(kind='bar');

Looks like the number of statement dates (rows) is not the same for every customer
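
The same check can be run over every customer at once by counting rows per customer and then counting how often each row count occurs. A minimal sketch on made-up data (the row counts here are invented for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a"] * 13 + ["b"] * 13 + ["c"] * 5 + ["d"] * 1,
    "S_2": pd.date_range("2017-03-01", periods=32, freq="D"),
})

# Rows per customer, then how many customers share each row count
rows_per_customer = toy.groupby("customer_ID").size()
distribution = rows_per_customer.value_counts().sort_index()

print(distribution)
```

Applied to `df_train`, the resulting table would show directly how many customers have a full history and how many have only a few statements.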

# How does each variable correlate with the target?

## Count of each type of variables

In [None]:
# Map each column-name prefix to its variable group
prefix_names = {
    "S_": "Spend variables",
    "D_": "Delinquency variables",
    "B_": "Balance variables",
    "R_": "Risk variables",
    "P_": "Payment variables",
}
var_count = {}
for col in df_train.columns:
    for prefix, name in prefix_names.items():
        if col.startswith(prefix):
            var_count[name] = var_count.get(name, 0) + 1
plt.figure(figsize=(15,5))
sns.barplot(x=list(var_count.keys()), y=list(var_count.values()));

## Payment variables (P_*) vs Target

Let's see how the payment variables affect default.

In [None]:
payment_vars = [col for col in df_train.columns if col.startswith("P_")]
corr = df_train[payment_vars+["target"]].corr()
sns.heatmap(corr, annot=True, cmap="Purples");

In [None]:
fig, axes = plt.subplots(1,3, figsize=(20,5))
axes = axes.ravel()

for i, col in enumerate(payment_vars)  :
    sns.histplot(data = df_train, x = col, hue='target', ax=axes[i])

fig.suptitle("Distribution of Payment Variables w.r.t target")
fig.tight_layout()

In [None]:
df_train["P_4"].value_counts()

Looks like another case of artificial noise added to the categorical column `P_4`

In [None]:
# Binarize P_4: 0 stays 0, everything else becomes 1 (vectorized instead of apply)
df_train["P_4"] = (df_train["P_4"] != 0).astype(int)
plt.title("P_4 w.r.t target")
sns.countplot(data = df_train, x = "P_4", hue = "target");

* the higher the `P_2`, the lower the chance of default
* `target = 1` (i.e. default) roughly follows a normal distribution in both `P_2` and `P_3`
* when `P_4` is 1, there's about a 50% chance of **default**, but when it is 0 far fewer cases seem to default
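
The last observation can be put into numbers instead of being read off the plot: the mean of a 0/1 target within each group is that group's default rate. A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "P_4":    [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [0, 0, 0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 target within each group = default rate of that group
default_rate = toy.groupby("P_4")["target"].mean()
print(default_rate)
```

Running the same `groupby` on `df_train` after binarizing `P_4` would give the actual rates for this dataset.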

More to be added soon...