**Created by Sanskar Hasija**

**🚀AMEX-Default Prediction- Detailed EDA📊📈**

**26 May 2022**


# <center> AMEX-DEFAULT PREDICTION- DETAILED EDA📊 </center>
## <center>If you find this notebook useful, support with an upvote👍</center>

# Table of Contents
<a id="toc"></a>
- [1. Introduction](#1)
- [2. Imports](#2)
- [3. Data Loading and Preparation](#3)
    - [3.1 Exploring Train Data](#3.1)
    - [3.2 Exploring Test Data](#3.2)
    - [3.3 Submission File](#3.3)
- [4. EDA](#4)
    - [4.1 Null Value Distribution](#4.1)
    - [4.2 Continuous and Categorical Data Distribution](#4.2)
    - [4.3 Target Distribution](#4.3)
    - [4.4 Continuous Features Distribution](#4.4)
    - [4.5 Categorical Features Distribution](#4.5)

<a id="1"></a>
# **<center><span style="color:#00BFC4;">Introduction  </span></center>**

![](https://raw.githubusercontent.com/sanskar-hasija/kaggle/main/images/amex-header.png)

**The competition is organised by `American Express`, and the task is `Credit default prediction`.**

**In this competition, you are asked to predict credit default. Submissions are evaluated on a custom metric, defined as follows:**

<center><b>M = 0.5*(G+D)</b></center><br>



<b>Here G is the Normalized Gini Coefficient, and D is the default rate captured at 4%.</b>
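
To make the formula concrete, here is a minimal NumPy sketch of how `M` could be computed. It is not the official scoring code: the function name `amex_metric_sketch`, the `neg_weight=20.0` weighting of non-defaults, and the exact weighted-Gini form are my assumptions, so check them against the competition's evaluation page before relying on the numbers.

```python
import numpy as np

def amex_metric_sketch(y_true, y_pred, neg_weight=20.0):
    """Sketch of M = 0.5 * (G + D). neg_weight is an assumed parameter."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: non-defaults (target 0) carry a larger weight than defaults.
    weight = np.where(y_true == 0, neg_weight, 1.0)

    def weighted_gini(scores):
        # Sort by score, highest first, and compare the cumulative share of
        # captured defaults against the cumulative share of weight.
        order = np.argsort(-scores, kind="stable")
        t, w = y_true[order], weight[order]
        random = np.cumsum(w) / w.sum()
        lorentz = np.cumsum(t * w) / (t * w).sum()
        return np.sum((lorentz - random) * w)

    # D: share of all defaults found in the top 4% of total weight.
    order = np.argsort(-y_pred, kind="stable")
    t, w = y_true[order], weight[order]
    cutoff = int(0.04 * w.sum())
    in_top = np.cumsum(w) <= cutoff
    d = t[in_top].sum() / t.sum()

    # G: Gini of the predictions, normalized by the Gini of a perfect ranking.
    g = weighted_gini(y_pred) / weighted_gini(y_true)
    return 0.5 * (g + d)

# Toy labels: one default and ten non-defaults.
y_true_demo = np.array([1] + [0] * 10)
perfect_score = amex_metric_sketch(y_true_demo, y_true_demo)
reversed_score = amex_metric_sketch(y_true_demo, 1 - y_true_demo)
print(perfect_score, reversed_score)
```

A perfect ranking should score higher than a reversed one, which is a quick sanity check for any reimplementation of the metric.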

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="2"></a>
# **<center><span style="color:#00BFC4;">Imports  </span></center>**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots


import time
import warnings
warnings.filterwarnings('ignore')

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="3"></a>
# **<center><span style="color:#00BFC4;">Data Loading and Preparation </span></center>**

##### I have created a Parquet version of the dataset files for faster loading under the low-memory constraints of Kaggle Kernels.

#### Dataset link - www.kaggle.com/datasets/odins0n/amex-parquet
#### Example Notebook to open Parquet files from the dataset - www.kaggle.com/code/odins0n/load-parquet-files-with-low-memory

In [None]:
train = pd.read_parquet('../input/amex-parquet/train_data.parquet')
submission = pd.read_csv("../input/amex-default-prediction/sample_submission.csv")
RANDOM_STATE = 12 

In [None]:
import gc
gc.collect()
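
Since memory is the main constraint here, one common trick is to store floating-point columns at lower precision. The sketch below is only an illustration on a made-up frame; `downcast_floats` is a hypothetical helper, not something the Parquet dataset or this notebook defines, and `float32` may lose precision you care about.

```python
import numpy as np
import pandas as pd

def downcast_floats(df):
    """Return a copy of df with float64 columns stored as float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: "float32" for col in float_cols})

# Made-up demo data; the column names only mimic the dataset's naming style.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"P_2": rng.random(1000), "B_1": rng.random(1000)})

small = downcast_floats(demo)
bytes_before = demo.memory_usage(deep=True).sum()
bytes_after = small.memory_usage(deep=True).sum()
print(bytes_before, bytes_after)
```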

## <span style="color:#e76f51;"> Column Descriptions  : </span>

`customer_ID` = Unique Customer ID<br>
`D_*` = Delinquency variables<br>
`S_*` = Spend variables<br>
`P_*` = Payment variables<br>
`B_*` = Balance variables<br>
`R_*` = Risk variables<br>
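
The prefixes above make it easy to count how many columns fall into each group. This is a small sketch on a made-up column list; `count_by_prefix` and `PREFIX_MEANING` are hypothetical names, and on the real data you would pass `train.columns` instead.

```python
from collections import Counter

# Mapping taken from the column descriptions above.
PREFIX_MEANING = {
    "D": "Delinquency",
    "S": "Spend",
    "P": "Payment",
    "B": "Balance",
    "R": "Risk",
}

def count_by_prefix(columns):
    """Count columns per variable group, ignoring columns with other prefixes."""
    counts = Counter()
    for col in columns:
        prefix = col.split("_")[0]
        if prefix in PREFIX_MEANING:
            counts[PREFIX_MEANING[prefix]] += 1
    return dict(counts)

# Made-up demo columns in the dataset's naming style.
demo_columns = ["customer_ID", "S_3", "P_2", "D_39", "B_1", "B_2", "R_1", "target"]
demo_counts = count_by_prefix(demo_columns)
print(demo_counts)
```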



<a id="3.1"></a>
## <span style="color:#e76f51;"> Exploring Train Data : </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Train Data:</u></b><br>
 
* <i> There are a total of <b><u>190</u></b> columns and <b><u>5531451</u></b> rows in <b><u>train</u></b> data.</i><br>
* <i> Train data contains <b><u>890116722</u></b> non-missing values and <b><u>160858968</u></b> missing values.</i><br>
</div>

### <span style="color:#e76f51;"> Quick view of Train Data : </span>

Below are the first 5 rows of the train dataset:

In [None]:
train.head()

In [None]:
print(f'\033[94mNumber of rows in train data: {train.shape[0]}')
print(f'\033[94mNumber of columns in train data: {train.shape[1]}')
print(f'\033[94mNumber of values in train data: {train.count().sum()}')
print(f'\033[94mNumber of missing values in train data: {train.isna().sum().sum()}')
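
As a quick cross-check of the figures reported in the observations, the number of cells (rows times columns) should equal the non-missing count plus the missing count. The sketch below only redoes that arithmetic with the numbers quoted in this notebook; the variable names are mine.

```python
# Figures quoted in the train data observations.
n_rows, n_cols = 5531451, 190
n_present, n_missing = 890116722, 160858968

total_cells = n_rows * n_cols
counts_agree = total_cells == n_present + n_missing
missing_share = n_missing / total_cells

print(f"Total cells: {total_cells}")
print(f"Counts agree: {counts_agree}")
print(f"Share of missing cells: {missing_share:.2%}")
```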

### <span style="color:#e76f51;"> Column Wise missing values : </span>

In [None]:
print(f'\033[94m')
print(train.isna().sum().sort_values(ascending = False))

### <span style="color:#e76f51;"> Basic statistics of training data : </span>

Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
train.iloc[:, :-1].describe().T.sort_values(by='std' , ascending = False)\
                     .style.background_gradient(cmap='GnBu')\
                     .bar(subset=["max"], color='#F8766D')\
                     .bar(subset=["mean",], color='#00BFC4')

<a id="3.3"></a>
## <span style="color:#e76f51;"> Submission File </span>

### <span style="color:#e76f51;"> Quick view of Submission File </span>

In [None]:
submission.head()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="4"></a>
# **<center><span style="color:#00BFC4;"> EDA </span></center>**

<a id="4.1"></a>
## <span style="color:#e76f51;"> Null Value Distribution  </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Null Value Distribution :</u></b><br>
 
* <i> The maximum number of missing values in a row is <b><u>102</u></b> and the minimum is <b><u>9</u></b>.</i><br>
* <i> All rows have at least <b><u>9</u></b> missing values.</i><br>
* <i> The <b><u>D88</u></b> feature has the most missing values, with a total of <b><u>5527586</u></b>.</i><br>
* <i> <b><u>68</u></b> features have no missing values, whereas <b><u>122</u></b> features have at least 1 missing value.</i><br>
</div>
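
Observations like the ones above can be derived with a few pandas reductions. The following is a sketch on a tiny made-up frame; `null_summary` is a hypothetical helper, and on the real data you would call it on `train`.

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Summarize missing values per row and per column."""
    per_row = df.isna().sum(axis=1)   # missing values in each row
    per_col = df.isna().sum()         # missing values in each column
    return {
        "max_missing_in_row": int(per_row.max()),
        "min_missing_in_row": int(per_row.min()),
        "most_missing_column": per_col.idxmax(),
        "columns_without_missing": int((per_col == 0).sum()),
        "columns_with_missing": int((per_col > 0).sum()),
    }

# Made-up demo data.
demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 1.0],
    "C": [1.0, 2.0, 3.0],
})
summary = null_summary(demo)
print(summary)
```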


<a id="4.2.1"></a>
### <span style="color:#e76f51;">Column wise Null Value Distribution   </span>

In [None]:
train_null = pd.DataFrame(train.isna().sum())
train_null = train_null[train_null[0]>0]
train_null = train_null.sort_values(by = 0 ,ascending = True)
fig = px.bar(x=train_null[0],y=train_null.index,color_discrete_sequence = ["#DE3163"])
fig.update_layout(showlegend=False, 
                  title_text="Column Wise Null Value Distribution", 
                  title_x=0.5,
                  xaxis_title="Missing Value Count",
                  yaxis_title="Feature Name")
fig.show()

<a id="4.7.2"></a>
### <span style="color:#e76f51;">Row wise Null Value Distribution   </span>

In [None]:
missing_train_row = train.isna().sum(axis=1)
missing_train_row = pd.DataFrame(missing_train_row.value_counts()/train.shape[0]).reset_index()
missing_train_row.columns = ['no', 'count']
missing_train_row["count"] = missing_train_row["count"]*100

fig = px.bar(x=missing_train_row["no"], 
                     y=missing_train_row["count"] ,
             color_discrete_sequence = ["#DE3163"])
fig.update_layout(showlegend=False, 
                  title_text="Row wise Null Value Distribution", 
                  title_x=0.5,
                  xaxis_title="Number of Missing Values per Row",
                  yaxis_title="Percentage of Rows")
fig.show()
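
The same row-wise percentages can be computed a little more directly with `value_counts(normalize=True)`. This is a sketch on a made-up frame; `row_missing_percentages` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def row_missing_percentages(df):
    """Percentage of rows for each distinct number of missing values per row."""
    per_row = df.isna().sum(axis=1)
    pct = per_row.value_counts(normalize=True).sort_index() * 100
    return pct.rename_axis("missing_in_row").rename("percent_of_rows")

# Made-up demo data: two rows with 0 missing values, one with 1, one with 2.
demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, 1.0, 2.0],
})
pct = row_missing_percentages(demo)
print(pct)
```

Sorting by the index keeps the bars in order of missing-value count rather than frequency, which tends to be easier to read.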

### <span style="color:#e76f51;">Dealing with missing values (reference)  </span>
Some references on how to deal with missing values:
- [Missing Values](https://www.kaggle.com/alexisbcook/missing-values) by [Alexis Cook](https://www.kaggle.com/alexisbcook)
- [Data Cleaning Challenge: Handling missing values](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values) by [Rachael Tatman](https://www.kaggle.com/rtatman)
- [A Guide to Handling Missing values in Python ](https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python) by [Parul Pandey](https://www.kaggle.com/parulpandey)

Some models that can handle missing values by default are:
- XGBoost: https://xgboost.readthedocs.io/en/latest/faq.html
- LightGBM: https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
- Catboost: https://catboost.ai/docs/concepts/algorithm-missing-values-processing.html
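
For models that do not handle missing values natively, a simple baseline is median imputation plus an indicator column that records where a value was missing. This is only a sketch of one common approach, not something this notebook applies; `impute_median_with_flag` is a hypothetical helper, and the medians should be computed on training data only.

```python
import numpy as np
import pandas as pd

def impute_median_with_flag(df, columns):
    """Fill NaNs with the column median and add a 0/1 '<col>_was_missing' flag."""
    out = df.copy()
    for col in columns:
        out[f"{col}_was_missing"] = out[col].isna().astype("int8")
        out[col] = out[col].fillna(out[col].median())
    return out

# Made-up demo data.
demo = pd.DataFrame({"P_2": [0.2, np.nan, 0.8, 0.4]})
imputed = impute_median_with_flag(demo, ["P_2"])
print(imputed)
```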

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="4.2"></a>
## <span style="color:#e76f51;">Continuous and Categorical Data Distribution </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Continuous and Categorical Data Distribution :</u></b><br>
 
* <i> There are a total of  <b><u>190</u></b> features, out of which  <b><u>177</u></b> features are continuous, <b><u>1</u></b> feature represents date and <b><u>11</u></b> features are categorical.</i><br>
</div>

In [None]:
FEATURES = list(train.columns[2:190])
TARGET = "target"
cat_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
# Continuous features: everything in FEATURES that is neither categorical nor the target
cont_features = [col for col in FEATURES if col not in cat_features and col != TARGET]
labels=['Categorical', 'Continuous']
values= [len(cat_features), len(cont_features)]
colors = ['#DE3163', '#58D68D']

print(f'\033[94mTotal number of features: {len(FEATURES) + 2   }')
print(f'\033[94mNumber of categorical features: {len(cat_features)}')
print(f'\033[94mNumber of continuous features: {len(cont_features)}')

fig = go.Figure(data=[go.Pie(
    labels=labels, 
    values=values, pull=[0.1, 0 ],
    marker=dict(colors=colors, 
                line=dict(color='#000000', 
                          width=2))
)])
fig.show()
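
The split above relies on column positions (`train.columns[2:190]`). A name-based version can be a bit more robust if the column order ever changes. The sketch below uses a made-up column list; `split_features` and the `non_features` names (including `"date"`) are hypothetical and would need to match the real ID, date and target columns.

```python
def split_features(columns, cat_features, non_features):
    """Split column names into categorical and continuous feature lists."""
    features = [col for col in columns if col not in non_features]
    cat = [col for col in features if col in cat_features]
    cont = [col for col in features if col not in cat_features]
    return cat, cont

# Made-up demo columns; "date" stands in for the date column.
demo_columns = ["customer_ID", "date", "P_2", "B_30", "D_63", "S_3", "target"]
demo_cat, demo_cont = split_features(
    demo_columns,
    cat_features=["B_30", "D_63", "D_64"],
    non_features=("customer_ID", "date", "target"),
)
print(demo_cat, demo_cont)
```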


<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="4.3"></a>
## <span style="color:#e76f51;">  Target Distribution </span>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Target Distribution :</u></b><br>
 
* <i>There are two target values - <b><u>0</u></b> and <b><u>1</u></b>.</i><br>
* <i>The percentages of Target <b><u>0</u></b> and Target <b><u>1</u></b> are <b><u>74.11%</u></b> and <b><u>25.89%</u></b>, respectively. </i><br>
</div>

In [None]:
target_df = pd.DataFrame(train['target'].value_counts()).reset_index()
target_df.columns = ['target', 'count']
fig = px.bar(data_frame =target_df, 
             x = 'target',
             y = 'count'
            ) 
fig.update_traces(marker_color =['#58D68D','#DE3163'], 
                  marker_line_color='rgb(0,0,0)',
                  marker_line_width=2,)
fig.update_layout(title = "Target Distribution",
                  template = "plotly_white",
                  title_x = 0.5)
print("\033[94mPercentage of Target = 0: {:.2f} %".format(target_df["count"][0]*100 / (target_df["count"][0]+ target_df["count"][1])))
print("\033[94mPercentage of Target = 1: {:.2f} %".format(target_df["count"][1]* 100 / (target_df["count"][0]+ target_df["count"][1])))
fig.show()
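
The class percentages can also be read off directly with `value_counts(normalize=True)`, which avoids indexing the counts by position. A minimal sketch on a made-up target column; `class_percentages` is a hypothetical helper.

```python
import pandas as pd

def class_percentages(target):
    """Percentage of rows per target value, indexed by the target value."""
    return (target.value_counts(normalize=True).sort_index() * 100).round(2)

# Made-up demo target: three non-defaults and one default.
demo_target = pd.Series([0, 0, 0, 1], name="target")
target_pct = class_percentages(demo_target)
print(target_pct)
```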

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<a id="4.4"></a>
## <span style="color:#e76f51;"> Continuous Features Distribution  </span>

In [None]:
RANDOM_SPLIT = 100000
ncols = 5
nrows = 36
n_features = cont_features
fig, axes = plt.subplots(nrows, ncols, figsize=(25, 15*8))

for r in range(nrows):
    for c in range(ncols):
        if r*ncols+c == len(cont_features):
            break
        col = n_features[r*ncols+c]
        sns.histplot(data= train.iloc[:RANDOM_SPLIT],  x=col, ax=axes[r, c], hue= "target", bins = 20, palette =['#DE3163','#58D68D'])
        axes[r,c].legend()
        axes[r, c].set_ylabel('')
        axes[r, c].set_xlabel(col, fontsize=8)
        axes[r, c].tick_params(labelsize=5, width=0.5)
        axes[r, c].xaxis.offsetText.set_fontsize(6)
        axes[r, c].yaxis.offsetText.set_fontsize(4)
fig.delaxes(axes[35][2])
fig.delaxes(axes[35][3])   
fig.delaxes(axes[35][4])
plt.show()
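
The grid size and the empty axes removed with `fig.delaxes` are hard-coded in the plotting cells. As a sketch, they can be derived from the number of plots instead; `grid_shape` is a hypothetical helper, and the counts 177 and 11 are the feature counts reported earlier in this notebook.

```python
import math

def grid_shape(n_plots, ncols):
    """Return the number of rows and the (row, col) positions left empty."""
    nrows = math.ceil(n_plots / ncols)
    unused = [divmod(i, ncols) for i in range(n_plots, nrows * ncols)]
    return nrows, unused

cont_rows, cont_unused = grid_shape(177, 5)
cat_rows, cat_unused = grid_shape(11, 5)
print(cont_rows, cont_unused)
print(cat_rows, cat_unused)
```

Each `(row, col)` pair in the returned list corresponds to one `fig.delaxes(axes[row][col])` call.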

<a id="4.5"></a>
## <span style="color:#e76f51;"> Categorical Features Distribution  </span>

In [None]:
sns.set_style(style='white')
ncols = 5
nrows = (len(cat_features) + ncols - 1) // ncols  # ceiling division over the categorical features

fig, axes = plt.subplots(nrows, ncols, figsize=(18, 15), facecolor='#EAEAF2')

for r in range(nrows):
    for c in range(ncols):
        if r*ncols+c >= len(cat_features):
            break
        col = cat_features[r*ncols+c]
        sns.countplot(data=train.iloc[:RANDOM_SPLIT] , x = col, ax=axes[r, c], hue = "target", palette =['#DE3163','#58D68D'])
        axes[r, c].set_ylabel('')
        axes[r, c].set_xlabel(col, fontsize=8, fontweight='bold')
        axes[r, c].tick_params(labelsize=5, width=0.5)
        axes[r, c].xaxis.offsetText.set_fontsize(4)
        axes[r, c].yaxis.offsetText.set_fontsize(4)
fig.delaxes(axes[2][1])     
fig.delaxes(axes[2][2]) 
fig.delaxes(axes[2][3]) 
fig.delaxes(axes[2][4]) 
plt.show()
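
Count plots show how many rows fall into each category, but the default rate per category is often easier to compare as a number. The sketch below is an illustration on made-up values; `default_rate_by_category` is a hypothetical helper, and the category labels in the demo frame are invented.

```python
import pandas as pd

def default_rate_by_category(df, col, target="target"):
    """Mean of the target and row count for each value of a categorical column."""
    grouped = df.groupby(col, dropna=False)[target].agg(["mean", "size"])
    return grouped.rename(columns={"mean": "default_rate", "size": "n_rows"})

# Made-up demo data.
demo = pd.DataFrame({
    "D_63": ["CO", "CO", "CR", "CR", "CR"],
    "target": [0, 1, 1, 1, 0],
})
rates = default_rate_by_category(demo, "D_63")
print(rates)
```

Keeping `n_rows` next to the rate helps avoid over-reading categories that contain only a handful of rows.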

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">


### <center>Work in Progress 🙂</center>
### <center>If you have any feedback or find anything wrong, please let me know!</center>
</div>
