## <u> Feature Preparation </u>

This notebook covers feature engineering and feature selection for the Home Credit Default Risk model. While using all 100 available features is computationally feasible, deliberate dimensionality reduction improves model performance by removing noise and enhances interpretability by clarifying each feature's contribution to default risk. This follows the exploratory analysis from the previous notebook, applying those insights to construct a refined feature set optimized for both predictive accuracy and business utility.

### <u> 1. Setup</u>


In [45]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import phik

# Set root directory for module imports
sys.path.append('..')

# Import modules
from modules.modules_eda import *


In [46]:
# Import dataset and column descriptions
home_credit = pd.read_csv('../data/processed/home_credit_cleaned.csv', index_col = 0)
columns_info = pd.read_csv('../data/processed/columns_info.csv')
print(f'Shape of the dataset: {home_credit.shape}')

Shape of the dataset: (307511, 100)


### <u> 2. Phi-K Correlation</u>

Because the dataset mixes binary, categorical, and continuous variables, standard Pearson correlation is unsuitable for feature selection: it measures only linear association between numeric variables. Phi-k correlation is used instead. It quantifies association strength across mixed variable types on a 0-1 scale by comparing the observed joint distribution with the one expected under statistical independence. Feature relevance can therefore be compared regardless of whether variables are binary, categorical, or continuous, which supports a first selection pass before feature engineering.
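
To make the observed-versus-expected comparison concrete, the sketch below builds a contingency table for a small made-up sample and computes the Pearson chi-square statistic by hand. This statistic is the starting point that phi-k builds on. The sketch does not reproduce the further steps phi-k applies (such as binning continuous variables and transforming the result onto a 0-1 scale), and the toy values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy sample (hypothetical values): one categorical feature and a binary target
toy = pd.DataFrame({
    "income_type": ["Working", "Working", "Pensioner", "Pensioner",
                    "Working", "Pensioner", "Working", "Working"],
    "TARGET": [1, 0, 0, 0, 1, 0, 0, 1],
})

# Observed joint distribution as a contingency table
observed = pd.crosstab(toy["income_type"], toy["TARGET"])

# Expected counts if the feature and the target were independent:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-square: large values mean the observed table
# deviates strongly from the independence assumption
chi2 = ((observed.to_numpy() - expected) ** 2 / expected).sum()

print(observed)
print(expected)
print(round(chi2, 2))
```

A feature whose observed table matches the expected one cell by cell would give a statistic of zero, which corresponds to no measurable association with the target.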

In [47]:
# Calculate phik correlation matrix
# (one-off step, left commented out; the saved matrix is loaded in the next cell)
# phik_corr = home_credit.phik_matrix()

# Save phik correlation matrix
# phik_corr.to_csv('../data/processed/phik_correlation_matrix.csv')
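
The compute-once, then-load-from-disk pattern used for the phi-k matrix could also be wrapped in a small helper so that the computation does not need to be commented in and out by hand. The following is a minimal sketch under that assumption; `load_or_compute` and `fake_matrix` are hypothetical names introduced for illustration and are not part of the project's modules.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_compute(path, compute_fn):
    """Return the DataFrame cached at `path`; on a cache miss, compute and save it."""
    path = Path(path)
    if path.exists():
        return pd.read_csv(path, index_col=0)
    result = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    result.to_csv(path)
    return result


# Demo with a stand-in computation (hypothetical). In the notebook, compute_fn
# would be home_credit.phik_matrix and path the processed-data CSV.
calls = []


def fake_matrix():
    calls.append(1)  # record how often the expensive step actually runs
    return pd.DataFrame({"a": [1.0, 0.5], "b": [0.5, 1.0]}, index=["a", "b"])


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "phik_demo.csv"
    first = load_or_compute(cache_path, fake_matrix)   # computes and saves
    second = load_or_compute(cache_path, fake_matrix)  # loads from disk

print(len(calls))
```

One design caveat: the cache is keyed only on the file path, so the file must be deleted manually whenever the underlying dataset changes.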

In [48]:
# Extract target correlations
phik_corr = pd.read_csv('../data/processed/phik_correlation_matrix.csv', index_col = 0)
target_corr = phik_corr['TARGET'].sort_values(ascending = False).drop('TARGET')
target_corr = target_corr.reset_index()
target_corr.index = target_corr.index + 1
target_corr.index.name = 'importance_rank'
target_corr.columns = ['feature', 'phik corr. TARGET']

target_corr.head(30)

importance_rank,feature,phik corr. TARGET
1,EXT_SOURCE_3,0.24768
2,EXT_SOURCE_1,0.217846
3,EXT_SOURCE_2,0.213965
4,YEARS_EMPLOYED,0.103535
5,YEARS_BIRTH,0.102328
6,OCCUPATION_TYPE,0.090029
7,ORGANIZATION_TYPE,0.089164
8,NAME_INCOME_TYPE,0.084831
9,REG_CITY_NOT_WORK_CITY,0.079946
10,YEARS_LAST_PHONE_CHANGE,0.073182


<u> **Comment:** </u>

The sorted rankings show that predictive value diminishes beyond the top 25 variables, where phi-k correlations with `TARGET` drop below 0.048. The top 25 variables are therefore selected as the initial modeling set. This cutoff retains the features with the strongest predictive signal while keeping the model parsimonious and computationally efficient.

Notably, only a few of our engineered bureau aggregation features appear in the top rankings, specifically `bur_cnt_active` and `bur_has_history` at positions 14 and 25 respectively. These variables capture the active loan count in the credit bureau and whether the applicant has any bureau history at all. The strongest predictors are the `EXT_SOURCE` variables, representing normalized scores from external credit bureaus. Despite containing substantial missing values (EXT_SOURCE_1 has 56% missingness and EXT_SOURCE_3 has 20%), these variables demonstrate the highest correlations with default risk.

The remaining top features include expected borrower profiling variables with clear business logic such as employment tenure (YEARS_EMPLOYED), age (YEARS_BIRTH), occupation and organization type, income characteristics, and residential stability indicators. These variables help define the borrower's financial stability and the risk profile associated with their application. Together, this feature set balances statistical predictive power with business interpretability for our modeling.
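
The top-k cutoff described above can be checked by looking at the weakest score that is kept next to the strongest score that is dropped. The sketch below shows one way to do that; `top_k_features` is a hypothetical helper and the scores are made-up numbers, whereas in the notebook they would come from `phik_corr['TARGET']`.

```python
import pandas as pd

# Made-up association scores with the target (illustrative only)
scores = pd.Series(
    {"f1": 0.25, "f2": 0.10, "f3": 0.08, "f4": 0.03, "f5": 0.01, "TARGET": 1.0}
)


def top_k_features(target_scores, k, target="TARGET"):
    """Rank features by score and return the top k plus the scores around the cutoff."""
    ranked = target_scores.drop(target).sort_values(ascending=False)
    k = min(k, len(ranked))
    selected = ranked.head(k).index.tolist()
    cutoff = ranked.iloc[k - 1]                           # weakest score kept
    next_best = ranked.iloc[k] if len(ranked) > k else None  # strongest score dropped
    return selected, cutoff, next_best


selected, cutoff, next_best = top_k_features(scores, k=3)
print(selected, cutoff, next_best)
```

A large gap between `cutoff` and `next_best` supports the chosen k; a small gap suggests the threshold is somewhat arbitrary and could be revisited during modeling.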


In [49]:
top_25_features = target_corr['feature'].head(25).tolist()
eda_top25 = eda(home_credit, top_25_features, categorical=True)
eda_top25

,Variable,Missing_Count,Missing_Percentage,Distinct_Values_Count
18,OWN_CAR_AGE,202929,65.99,62
1,EXT_SOURCE_1,173378,56.38,114584
0,EXT_SOURCE_3,60965,19.83,814
3,YEARS_EMPLOYED,55374,18.01,464
2,EXT_SOURCE_2,660,0.21,119828
15,AMT_GOODS_PRICE,278,0.09,1002
9,YEARS_LAST_PHONE_CHANGE,1,0.0,117
16,AMT_CREDIT,0,0.0,5603
19,YEARS_REGISTRATION,0,0.0,555
4,YEARS_BIRTH,0,0.0,483


### <u> 3. Categorical Feature Engineering</u>

In [50]:
# Adjust dataset
home_credit_final = home_credit[top_25_features + ['TARGET']].copy()


# Consolidation of rare categories in 'NAME_INCOME_TYPE' (<500 obs.) into 'Other' to address sample size instability
income_mapping = {
    'Unemployed': 'Other',
    'Maternity leave': 'Other',
    'Student': 'Other',
    'Businessman': 'Other'
}

home_credit_final['NAME_INCOME_TYPE'] = home_credit_final['NAME_INCOME_TYPE'].replace(income_mapping)


In [51]:
# Imputation based on 01_eda.ipynb analysis (2 obs. recoded to the mode category 'Married')
family_status_mapping = {'Unknown' : 'Married'}
home_credit_final['NAME_FAMILY_STATUS'] = home_credit_final['NAME_FAMILY_STATUS'].replace(family_status_mapping)

Although `OCCUPATION_TYPE` contains 19 distinct professions, each category holds at least 500 observations, and default rates vary from 4.8% (Accountants) to 17.2% (Low-skill Laborers). This risk gradient across occupations suggests valuable predictive signal, so the full granularity is preserved: all 19 occupation types are kept for model development without consolidation.
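
A per-category summary of sample size and default rate, like the one behind the figures above, can be produced with a grouped aggregation. The sketch below uses a tiny made-up sample and a minimum size of 3 in place of the 500-observation threshold; the numbers are illustrative assumptions only.

```python
import pandas as pd

# Toy sample (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["Accountants"] * 4 + ["Laborers"] * 4 + ["Drivers"] * 2,
    "TARGET": [0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
})

# Observation count and default rate (mean of the binary target) per category
summary = (
    toy.groupby("OCCUPATION_TYPE")["TARGET"]
    .agg(n_obs="size", default_rate="mean")
    .sort_values("default_rate")
)

# Categories that fall below the minimum sample size and may need consolidation
min_obs = 3
small = summary.index[summary["n_obs"] < min_obs].tolist()

print(summary)
print(small)
```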

In [52]:
# Impute the single missing value in YEARS_LAST_PHONE_CHANGE with the median
home_credit_final['YEARS_LAST_PHONE_CHANGE'] = home_credit_final['YEARS_LAST_PHONE_CHANGE'].fillna(home_credit_final['YEARS_LAST_PHONE_CHANGE'].median())

In [53]:
# Consolidation of rare categories (<500 obs.) into 'Other' to address sample size instability
organization_mapping = {
    'Industry: type 13': 'Other',      # 67 obs
    'Trade: type 5': 'Other',          # 49 obs  
    'Trade: type 4': 'Other',          # 64 obs
    'Religion': 'Other',               # 85 obs
    'Industry: type 10': 'Other',      # 109 obs
    'Industry: type 6': 'Other',       # 112 obs
    'Transport: type 1': 'Other',      # 201 obs
    'Cleaning': 'Other',               # 260 obs
    'Legal Services': 'Other',         # 305 obs
    'Mobile': 'Other',                 # 317 obs
    'Trade: type 1': 'Other',          # 348 obs 
    'Industry: type 12': 'Other',      # 369 obs 
    'Culture': 'Other',                # 379 obs 
    'Realtor': 'Other',                # 396 obs 
    'Advertising': 'Other',            # 429 obs 
    'Industry: type 2': 'Other',       # 458 obs 
}

home_credit_final['ORGANIZATION_TYPE'] = home_credit_final['ORGANIZATION_TYPE'].replace(organization_mapping)
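
The hand-written mappings above could alternatively be derived from the category counts themselves. The sketch below shows that variant; `consolidate_rare` is a hypothetical helper, it assumes an object or string column rather than a pandas categorical, and applying it blindly may merge categories that the manual mapping deliberately kept.

```python
import pandas as pd


def consolidate_rare(series, min_count=500, other_label="Other"):
    """Replace categories with fewer than `min_count` observations by `other_label`.

    Missing values are left untouched.
    """
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)


# Toy column (hypothetical values) with a small threshold for demonstration
toy = pd.Series(["A"] * 5 + ["B"] * 2 + ["C"] + [None])
result = consolidate_rare(toy, min_count=3)

print(result.value_counts())
```

Hard-coding the mapping, as done in the cells above, has the advantage that the same categories are merged when the transformation is later applied to new data, whereas a count-based rule depends on the sample it is computed on.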

In [56]:
eda_final = eda(home_credit_final, top_25_features, categorical=True)
eda_final

,Variable,Missing_Count,Missing_Percentage,Distinct_Values_Count
18,OWN_CAR_AGE,202929,65.99,62
1,EXT_SOURCE_1,173378,56.38,114584
0,EXT_SOURCE_3,60965,19.83,814
3,YEARS_EMPLOYED,55374,18.01,464
2,EXT_SOURCE_2,660,0.21,119828
15,AMT_GOODS_PRICE,278,0.09,1002
16,AMT_CREDIT,0,0.0,5603
19,YEARS_REGISTRATION,0,0.0,555
4,YEARS_BIRTH,0,0.0,483
14,YEARS_ID_PUBLISH,0,0.0,177


<u>Comment:</u>

Several features retain substantial missing values (`OWN_CAR_AGE` at 66%, `EXT_SOURCE_1` at 56%, `EXT_SOURCE_3` at 20%, and `YEARS_EMPLOYED` at 18%), but these missingness patterns are treated as informative signals rather than data quality issues. For the external credit bureau scores, missing values indicate credit-invisible applicants without prior bureau records. Missing `OWN_CAR_AGE` flags applicants who do not own a vehicle, and missing `YEARS_EMPLOYED` corresponds to the sentinel value that denotes non-employment status. Similarly, missing `AMT_GOODS_PRICE` values likely represent loans not tied to a specific goods purchase rather than data entry errors. Because these patterns are informative and tree-based algorithms with native missing-value handling can use them directly, all missing values are preserved in their original state for model training.
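
Whether missingness carries signal can be checked by comparing the default rate of rows where a feature is missing with the rate where it is present. The sketch below illustrates the idea on made-up values; `default_rate_by_missingness` is a hypothetical helper and the toy numbers are not results from the dataset.

```python
import numpy as np
import pandas as pd

# Toy sample (hypothetical values)
toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.7, np.nan, 0.4, np.nan, 0.9, np.nan, 0.6, 0.2],
    "TARGET": [0, 1, 0, 1, 0, 0, 0, 1],
})


def default_rate_by_missingness(df, feature, target="TARGET"):
    """Compare sample size and default rate for missing vs. present values of a feature."""
    status = (
        df[feature].isna()
        .map({True: "missing", False: "present"})
        .rename("status")
    )
    return df.groupby(status)[target].agg(n_obs="size", default_rate="mean")


rates = default_rate_by_missingness(toy, "EXT_SOURCE_1")
print(rates)
```

A clear difference between the two rows supports keeping the missing values as a signal; near-identical rates would suggest that missingness for that feature is uninformative.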

In [58]:
# Save final dataset for model deployment
home_credit_final.to_csv('../data/processed/home_credit_final.csv')