# Home Credit - Credit Risk Model Stability

In this competition, we predict client default based on the internal and external information available for each client.

The dataset contains a large number of tables because it draws on diverse data sources with varying levels of aggregation. The EDA below covers the previous-application tables at depth 1 (File 1, File 2) and depth 2.

Each table is joined on case_id to the Base table, which stores the basic information about each observation. The case_id is a unique identifier for each credit case.

This notebook contains EDA for the following files:

train_applprev_1_0.csv

train_applprev_1_1.csv

train_applprev_2.csv

train_base.csv

Depth values:

depth=0 - These are static features directly tied to a specific case_id.
depth=1 - Each case_id has an associated historical record, indexed by num_group1.
depth=2 - Each case_id has an associated historical record, indexed by both num_group1 and num_group2.

For depth=0 tables, predictors can be used directly as features. For tables with depth>0, however, you may need aggregation functions that condense the historical records associated with each case_id into a single feature. When num_group1 or num_group2 stands for a person index (the predictor definitions make this clear), the zero index has a special meaning: num_groupN=0 denotes the applicant (the person who applied for the loan).

In [1]:
import polars as pl
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.filterwarnings("ignore")

In [2]:
def group_file_data(
    df: pl.DataFrame, 
    num_cols: list[str] = [], 
    date_cols: list[str] = [], 
    cat_cols: list[str] = []
) -> pl.DataFrame:
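    '''
    Aggregate numerical, date, and categorical columns to one row per case_id.

    Parameters:
    -----------
    df : Polars DataFrame
    num_cols : List of numerical column names (remember to drop num_group columns);
        aggregated with min, max, mean, median, and sum
    date_cols : List of date column names stored as strings;
        parsed to dates, then aggregated with min, max, and distinct count
    cat_cols : List of categorical column names (becomes dummies, summed per case_id)
    '''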
    
    # Convert date columns
    df_date = df[['case_id'] + date_cols].with_columns([ pl.col(col).str.to_date() for col in date_cols ])

    # One-hot categories
    df_dummies = df[['case_id'] + cat_cols].to_dummies(cat_cols)

    # Num DataFrame
    df_num = df[['case_id'] + num_cols]

    # Date aggs
    date_aggs = [ pl.min(col).name.suffix('_min') for col in date_cols ] +\
                [ pl.max(col).name.suffix('_max') for col in date_cols ] +\
                [ pl.n_unique(col).name.suffix('_distinct') for col in date_cols]
    df_date_grouped = df_date.group_by('case_id').agg(date_aggs)

    # One-hot aggs
    dummy_cols = [ col for col in df_dummies.columns if col != 'case_id']
    dummies_aggs = [ pl.sum(col).name.suffix('_sum') for col in dummy_cols ]
    df_dummies_grouped = df_dummies.group_by('case_id').agg(dummies_aggs)

    # Numerical aggs
    num_aggs = [ pl.min(col).name.suffix('_min') for col in num_cols ] +\
            [ pl.max(col).name.suffix('_max') for col in num_cols ] +\
            [ pl.mean(col).name.suffix('_mean') for col in num_cols ] +\
            [ pl.median(col).name.suffix('_median') for col in num_cols ] +\
            [ pl.sum(col).name.suffix('_sum') for col in num_cols ]
    df_num_grouped = df_num.group_by('case_id').agg(num_aggs)

    # Join DataFrames
    df_joined = df_num_grouped.join(df_date_grouped, on='case_id')
    df_joined = df_joined.join(df_dummies_grouped, on='case_id')

    return df_joined

## Read and inspect the previous application files

In [3]:
# Depth 1, file 1
prev_app_0 = pl.read_csv('csv_files/train/train_applprev_1_0.csv')

# Depth 1, file 2
prev_app_1 = pl.read_csv('csv_files/train/train_applprev_1_1.csv')

#Depth 2
prev_app_2 = pl.read_csv('csv_files/train/train_applprev_2.csv')

#Base table
base = pl.read_csv('csv_files/train/train_base.csv')

In [29]:
# pandas DataFrames for the initial EDA checks
prev_app_0_pd = pd.read_csv('csv_files/train/train_applprev_1_0.csv')

prev_app_1_pd = pd.read_csv('csv_files/train/train_applprev_1_1.csv')

prev_app_2_pd = pd.read_csv('csv_files/train/train_applprev_2.csv')

base_pd = pd.read_csv('csv_files/train/train_base.csv')

In [30]:
print(f"Unique case IDs in train_applprev_1_0: {prev_app_0_pd['case_id'].nunique()}")
print(f"Unique case IDs in train_applprev_1_1: {prev_app_1_pd['case_id'].nunique()}")
print(f"Unique case IDs in train_applprev_2: {prev_app_2_pd['case_id'].nunique()}")
print(f"Unique case IDs in train_base: {base_pd['case_id'].nunique()}")

Unique case IDs in train_applprev_1_0: 782997
Unique case IDs in train_applprev_1_1: 438525
Unique case IDs in train_applprev_2: 1221522
Unique case IDs in train_base: 1526659
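
The counts above show fewer distinct case IDs in the previous-application files than in the base table, so some base cases presumably have no previous-application history. The first two counts also add up to the third (782997 + 438525 = 1221522), which would be consistent with the two depth-1 files covering disjoint sets of cases, though that is not verified here. A minimal sketch of a coverage check, using made-up IDs rather than the real files:

```python
import pandas as pd

# Toy stand-ins for base_pd['case_id'] and a history file's case_id column.
base_ids = pd.Series([1, 2, 3, 4, 5], name="case_id")
history_ids = pd.Series([1, 1, 3, 3, 3], name="case_id")

# Share of base cases that appear at least once in the history table.
n_base = base_ids.nunique()
n_covered = int(base_ids.isin(history_ids.unique()).sum())
coverage = n_covered / n_base

print(f"{n_covered} of {n_base} base cases have history ({coverage:.0%})")
```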


### Initial checks on the dataset

For each dataset, we examine the first five rows, the data types of the attributes, the summary statistics, and the total number of rows and columns.

In [20]:
def perform_eda_checks(df1: pd.DataFrame, df2: pd.DataFrame, df3: pd.DataFrame) -> None:
    """
    Perform initial EDA checks (.head(), .info(), .shape, .describe()) on three given dataframes.

    Args:
    df1, df2, df3 (pd.DataFrame): Dataframes to perform the EDA checks on.
    """
    dataframes = [df1, df2, df3]
    for i, df in enumerate(dataframes, start=1):
        print(f"Previous Application {i} EDA Checks\n" + "-" * 20)

        print("\nFirst 5 rows:")
        print(df.head())

        print("\nInfo:")
        df.info()  # info() prints directly and returns None

        print("\nShape:")
        print(df.shape)

        print("\nDescription:")
        print(df.describe())

        print("\n" + "=" * 50 + "\n")

In [21]:
perform_eda_checks(prev_app_0_pd, prev_app_1_pd, prev_app_2_pd)

Previous Application 1 EDA Checks
--------------------

First 5 rows:
   case_id  actualdpd_943P  annuity_853A approvaldate_319D  \
0        2             0.0         640.2               NaN   
1        2             0.0        1682.4               NaN   
2        3             0.0        6140.0               NaN   
3        4             0.0        2556.6               NaN   
4        5             0.0           NaN               NaN   

   byoccupationinc_3656910L cancelreason_3545846M  childnum_21L  \
0                       NaN              a55475b1           0.0   
1                       NaN              a55475b1           0.0   
2                       NaN           P94_109_143           NaN   
3                       NaN             P24_27_36           NaN   
4                       NaN           P85_114_140           NaN   

  creationdate_885D  credacc_actualbalance_314A  credacc_credlmt_575A  ...  \
0        2013-04-03                         NaN                   0.0  ...  

### Aggregation and Merge to the Base table

In [10]:
# Aggregation functions for numerical columns
numerical_agg_funcs = {
    'min': 'min',
    'max': 'max',
    'mean': 'mean',
    'median': 'median',
    'sum': 'sum'
}

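# Aggregation functions for categorical columns
categorical_agg_funcs = {
    # Most frequent value; None when the group has no non-null values
    'mode': lambda x: x.mode().iloc[0] if not x.mode().empty else None,
    'one_hot_encoding': lambda x: x.sum()  # Sum of one-hot indicator columns (per-category counts)
}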

# Aggregation functions for date columns
date_agg_funcs = {
    'min': 'min',
    'max': 'max',
    'distinct_count': 'nunique'  # Count of distinct values
}
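
The dictionaries above only name the aggregations. As a hedged sketch of how such a mapping could be applied with pandas, the example below runs the numerical functions over a toy frame; the `amount` column and the output naming scheme are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with a made-up numerical column.
toy = pd.DataFrame(
    {
        "case_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
    }
)

numerical_agg_funcs = {
    "min": "min",
    "max": "max",
    "mean": "mean",
    "median": "median",
    "sum": "sum",
}

# Apply every function to the column, then prefix the result columns with the column name.
agg = toy.groupby("case_id")["amount"].agg(list(numerical_agg_funcs.values()))
agg.columns = [f"amount_{name}" for name in numerical_agg_funcs]
agg = agg.reset_index()
```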

### Test the aggregation on one column of each type

In [13]:
df = prev_app_1

# Select at most one date column
date_cols = [col for col in df.columns if 'dat' in col and df[col].dtype == pl.String][:1]

# Select at most one categorical column, excluding already selected date columns
cat_cols = [col for col in df.columns if col not in date_cols and df[col].dtype == pl.String][:1]

# Define ignored columns
ignore_cols = ['case_id', 'num_group1']

# Select at most one numerical column, excluding already selected date and categorical columns
num_cols = [col for col in df.columns 
            if col not in date_cols + cat_cols + ignore_cols 
            and df[col].dtype != pl.String][:1]

# Group data using the selected columns (at most one of each type)
df_prev_app_1_agg = group_file_data(df, num_cols, date_cols, cat_cols)


In [16]:
df_prev_app_1_agg.head()

case_id,actualdpd_943P_min,actualdpd_943P_max,actualdpd_943P_mean,actualdpd_943P_median,actualdpd_943P_sum,approvaldate_319D_min,approvaldate_319D_max,approvaldate_319D_distinct,cancelreason_3545846M_P107_145_100_sum,cancelreason_3545846M_P116_157_162_sum,cancelreason_3545846M_P118_140_56_sum,cancelreason_3545846M_P118_30_169_sum,cancelreason_3545846M_P11_156_146_sum,cancelreason_3545846M_P11_56_131_sum,cancelreason_3545846M_P120_0_10_sum,cancelreason_3545846M_P122_66_161_sum,cancelreason_3545846M_P123_22_171_sum,cancelreason_3545846M_P128_12_74_sum,cancelreason_3545846M_P129_101_181_sum,cancelreason_3545846M_P129_162_80_sum,cancelreason_3545846M_P141_135_146_sum,cancelreason_3545846M_P145_10_63_sum,cancelreason_3545846M_P145_77_120_sum,cancelreason_3545846M_P150_0_30_sum,cancelreason_3545846M_P151_143_25_sum,cancelreason_3545846M_P163_9_145_sum,cancelreason_3545846M_P166_126_174_sum,cancelreason_3545846M_P169_159_178_sum,cancelreason_3545846M_P16_126_23_sum,cancelreason_3545846M_P175_4_106_sum,cancelreason_3545846M_P180_60_137_sum,cancelreason_3545846M_P183_71_60_sum,cancelreason_3545846M_P185_66_167_sum,cancelreason_3545846M_P187_10_172_sum,cancelreason_3545846M_P188_66_164_sum,cancelreason_3545846M_P191_55_173_sum,…,cancelreason_3545846M_P203_151_99_sum,cancelreason_3545846M_P204_22_168_sum,cancelreason_3545846M_P205_40_167_sum,cancelreason_3545846M_P20_125_107_sum,cancelreason_3545846M_P23_22_19_sum,cancelreason_3545846M_P24_27_36_sum,cancelreason_3545846M_P26_44_63_sum,cancelreason_3545846M_P28_71_137_sum,cancelreason_3545846M_P30_86_84_sum,cancelreason_3545846M_P32_163_96_sum,cancelreason_3545846M_P32_86_86_sum,cancelreason_3545846M_P46_50_166_sum,cancelreason_3545846M_P52_67_90_sum,cancelreason_3545846M_P53_10_15_sum,cancelreason_3545846M_P55_81_55_sum,cancelreason_3545846M_P57_100_127_sum,cancelreason_3545846M_P59_114_135_sum,cancelreason_3545846M_P5_143_178_sum,cancelreason_3545846M_P60_137_164_sum,cancelreason_3545846M_P60_96_75_sum,cancelreason_3545846M_P64_121_167_sum,cancelreason_3545846M_P65_58_157_sum,cancelreason_3545846M_P69_72_116_sum,cancelreason_3545846M_P72_115_176_sum,cancelreason_3545846M_P73_130_169_sum,cancelreason_3545846M_P7_85_64_sum,cancelreason_3545846M_P84_14_61_sum,cancelreason_3545846M_P85_114_140_sum,cancelreason_3545846M_P8_141_180_sum,cancelreason_3545846M_P91_110_150_sum,cancelreason_3545846M_P94_109_143_sum,cancelreason_3545846M_P94_154_184_sum,cancelreason_3545846M_P95_76_117_sum,cancelreason_3545846M_P98_38_170_sum,cancelreason_3545846M_P99_98_113_sum,cancelreason_3545846M_P9_82_76_sum,cancelreason_3545846M_a55475b1_sum
i64,f64,f64,f64,f64,f64,date,date,u32,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,…,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
222314,0.0,0.0,0.0,0.0,0.0,2015-03-25,2016-11-01,3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,2
969122,0.0,0.0,0.0,0.0,0.0,2020-03-22,2020-03-22,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1773802,0.0,0.0,0.0,0.0,0.0,2017-12-14,2019-06-17,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
2672892,0.0,0.0,0.0,0.0,0.0,2006-05-23,2018-12-08,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
51683,0.0,0.0,0.0,0.0,0.0,2020-03-11,2020-03-11,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,1


### Previous Application: Depth 1, File 1

In [None]:
df = prev_app_0
# Date columns: string-typed columns whose name contains 'dat'
date_cols = [col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String]

# Categorical columns: the remaining string-typed columns
cat_cols = [col for col, dtype in zip(df.columns, df.dtypes) if col not in date_cols and dtype == pl.String]

# Numerical columns: everything else except the identifier and group-index columns
ignore_cols = ['case_id', 'num_group1']
num_cols = [col for col in df.columns if col not in date_cols + cat_cols + ignore_cols]

# Group data
df_prev_app_0_agg = group_file_data(df, num_cols, date_cols, cat_cols)

### Previous Application: Depth 1, File 2

In [None]:
df = prev_app_1
# Date columns: string-typed columns whose name contains 'dat'
date_cols = [col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String]

# Categorical columns: the remaining string-typed columns
cat_cols = [col for col, dtype in zip(df.columns, df.dtypes) if col not in date_cols and dtype == pl.String]

# Numerical columns: everything else except the identifier and group-index columns
ignore_cols = ['case_id', 'num_group1']
num_cols = [col for col in df.columns if col not in date_cols + cat_cols + ignore_cols]

# Group data
df_prev_app_1_agg = group_file_data(df, num_cols, date_cols, cat_cols)

### Previous Application: Depth 2

In [None]:
df = prev_app_2
# Date columns: string-typed columns whose name contains 'dat'
date_cols = [col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String]

# Categorical columns: the remaining string-typed columns
cat_cols = [col for col, dtype in zip(df.columns, df.dtypes) if col not in date_cols and dtype == pl.String]

# Numerical columns: everything else except the identifier and group-index columns
ignore_cols = ['case_id', 'num_group1', 'num_group2']
num_cols = [col for col in df.columns if col not in date_cols + cat_cols + ignore_cols]

# Group data
df_prev_app_2_agg = group_file_data(df, num_cols, date_cols, cat_cols)

### Join all the tables to Base table

In [None]:
# Join the previous application tables to the base table
prev_app_base = base.join(
    df_prev_app_0_agg, how="left", on="case_id"
).join(
    df_prev_app_1_agg, how="left", on="case_id"
).join(
    df_prev_app_2_agg, how="left", on="case_id"
)