## <u> Feature Preparation </u>

This notebook covers feature engineering and feature selection for the Home Credit Default Risk model. While using all 100 available features is computationally feasible, deliberate dimensionality reduction can improve model performance by removing noise and makes each feature's contribution to default risk easier to interpret. The notebook builds on the exploratory analysis from the previous notebook and applies those insights to construct a refined feature set suited to both predictive accuracy and business utility.

### <u> 1. Setup</u>


In [1]:
# Import libraries
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import phik

# Set root directory for module imports
sys.path.append('..')

# Import modules
from modules.modules_eda import *


In [2]:
# Import dataset and column descriptions
home_credit = pd.read_csv('../data/processed/home_credit_cleaned.csv', index_col = 0)
columns_info = pd.read_csv('../data/processed/columns_info.csv')
print(f'Shape of the dataset: {home_credit.shape}')

Shape of the dataset: (307511, 100)


### <u> 2. Phi-K Correlation</u>

Because the dataset mixes binary, categorical and continuous variables, standard Pearson correlation is ill-suited for feature selection: it only measures linear association between numerical variables. Phi-k correlation is used instead. Phi-k measures association strength across mixed variable types on a 0-1 scale by comparing the observed joint distribution with the one expected under statistical independence. This makes feature relevance comparable regardless of whether variables are binary, categorical, or continuous, and supports a first selection pass before feature engineering.
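
The comparison of observed and expected joint distributions can be illustrated on a small contingency table. The sketch below is not the phi-k coefficient itself: it only computes the Pearson chi-square statistic from observed and expected counts, which is the kind of quantity phi-k is derived from. The toy data and column values are hypothetical and not taken from the Home Credit dataset.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "income_type": ["Working", "Working", "Pensioner", "Pensioner",
                    "Working", "Pensioner", "Working", "Working"],
    "TARGET":      [1, 0, 0, 0, 1, 0, 1, 0],
})

# Observed joint distribution: counts per (category, target) cell
observed = pd.crosstab(toy["income_type"], toy["TARGET"])

# Expected counts under independence: row total * column total / n
row_totals = observed.sum(axis=1).to_numpy()[:, None]
col_totals = observed.sum(axis=0).to_numpy()[None, :]
n = observed.to_numpy().sum()
expected = row_totals * col_totals / n

# Pearson chi-square: squared deviations from independence, scaled by expectation
chi2 = float(((observed.to_numpy() - expected) ** 2 / expected).sum())

print(observed)
print(expected.round(3))
print(f"chi2 = {chi2:.3f}")
```

The larger the deviation of the observed counts from the expected ones, the stronger the association. Continuous variables have to be binned before such a table can be built.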

In [3]:
# Calculate phik correlation matrix (run once; the result is cached to CSV and reloaded in the next cell)
# phik_corr = home_credit.phik_matrix()

# Save phik correlation matrix
# phik_corr.to_csv('../data/processed/phik_correlation_matrix.csv')

In [4]:
# Extract target correlations
phik_corr = pd.read_csv('../data/processed/phik_correlation_matrix.csv', index_col = 0)
target_corr = phik_corr['TARGET'].sort_values(ascending = False).drop('TARGET')
target_corr = target_corr.reset_index()
target_corr.index = target_corr.index + 1
target_corr.index.name = 'importance_rank'
target_corr.columns = ['feature', 'phik corr. TARGET']

target_corr.head(50)

importance_rank,feature,phik corr. TARGET
1,EXT_SOURCE_3,0.24768
2,EXT_SOURCE_1,0.217846
3,EXT_SOURCE_2,0.213965
4,YEARS_EMPLOYED,0.103535
5,YEARS_BIRTH,0.102328
6,OCCUPATION_TYPE,0.090029
7,ORGANIZATION_TYPE,0.089164
8,NAME_INCOME_TYPE,0.084831
9,REG_CITY_NOT_WORK_CITY,0.079946
10,YEARS_LAST_PHONE_CHANGE,0.073182


<u> **Comment:** </u>

The sorted ranking shows a notable drop after the 41st variable, where the phi-k correlation with the target falls from 0.028 to 0.018. Based on this natural break in the distribution of correlation strengths, the top 41 variables are selected as the candidate modeling set. This threshold keeps the features with meaningful predictive signal while maintaining model parsimony and computational efficiency.
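
The cutoff above was chosen by inspecting the ranking. As a rough cross-check, a break of this kind can also be located programmatically. The sketch below is one possible heuristic, not the procedure used in this notebook: `largest_gap_cutoff` is a hypothetical helper, and the scores are made-up values. Because the largest drop in a ranking usually sits at the very top, the `min_keep` parameter restricts the search to the tail.

```python
import pandas as pd

def largest_gap_cutoff(scores: pd.Series, min_keep: int = 1) -> int:
    """Number of top-ranked items to keep when cutting at the largest
    drop between consecutive sorted scores (hypothetical helper)."""
    ordered = scores.sort_values(ascending=False).reset_index(drop=True)
    # drops[i] is the fall from the item at position i to the next one;
    # the last entry is NaN and is excluded, as are cuts before min_keep
    drops = (ordered - ordered.shift(-1)).iloc[min_keep - 1 : -1]
    return int(drops.idxmax()) + 1

# Made-up scores, for illustration only
toy_scores = pd.Series([0.25, 0.22, 0.21, 0.10, 0.09, 0.03, 0.028, 0.018, 0.017])

print(largest_gap_cutoff(toy_scores))              # search the whole ranking
print(largest_gap_cutoff(toy_scores, min_keep=6))  # search only beyond the top 6
```

On the real ranking, the input would be the `phik corr. TARGET` column of `target_corr`.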

The strongest predictors are the `EXT_SOURCE_1`, `EXT_SOURCE_2` and `EXT_SOURCE_3` variables, representing normalized scores from external credit bureaus. Despite containing substantial missing values, with `EXT_SOURCE_1` at 56% missingness and `EXT_SOURCE_3` at 20%, these variables demonstrate the highest correlations with default risk. The remaining features include borrower profiling variables such as `YEARS_EMPLOYED`, `YEARS_BIRTH`, `OCCUPATION_TYPE`, `ORGANIZATION_TYPE` and `NAME_INCOME_TYPE`, alongside engineered aggregations from the bureau and previous application tables such as `bur_cnt_active`, `bur_has_history` and `prev_cnt_consumer_approved`. Together, this feature set balances statistical predictive power with business interpretability for our modelling.


In [5]:
top_features = target_corr['feature'].head(41).tolist()
eda_features = eda(home_credit, top_features, categorical=True).reset_index(drop=True)
eda_features

,Variable,Missing_Count,Missing_Percentage,Distinct_Values_Count
0,OWN_CAR_AGE,202929,65.99,62
1,EXT_SOURCE_1,173378,56.38,114584
2,EXT_SOURCE_3,60965,19.83,814
3,YEARS_EMPLOYED,55374,18.01,464
4,EXT_SOURCE_2,660,0.21,119828
5,AMT_GOODS_PRICE,278,0.09,1002
6,AMT_ANNUITY,12,0.0,13672
7,YEARS_LAST_PHONE_CHANGE,1,0.0,117
8,AMT_CREDIT,0,0.0,5603
9,YEARS_REGISTRATION,0,0.0,555


### <u> 3. Categorical Feature Engineering</u>

In [6]:
# Create final dataset with top features and target variable
home_credit_final = home_credit[top_features + ['TARGET']].copy()

# Assign the 4 'XNA' entries in CODE_GENDER to the modal class 'F'
gender_mapping = {'XNA': 'F'}
home_credit_final['CODE_GENDER'] = home_credit_final['CODE_GENDER'].replace(gender_mapping)


# Consolidate rare categories in NAME_INCOME_TYPE (<500 obs.) into 'Other' to address sample size instability
income_mapping = {
    'Unemployed': 'Other',
    'Maternity leave': 'Other',
    'Student': 'Other',
    'Businessman': 'Other'
}

home_credit_final['NAME_INCOME_TYPE'] = home_credit_final['NAME_INCOME_TYPE'].replace(income_mapping)


In [7]:
# Imputation based on 01_eda.ipynb analysis (2 obs. labelled 'Unknown' assigned to the modal category 'Married')
family_status_mapping = {'Unknown' : 'Married'}
home_credit_final['NAME_FAMILY_STATUS'] = home_credit_final['NAME_FAMILY_STATUS'].replace(family_status_mapping)

Although `OCCUPATION_TYPE` contains 19 distinct professions, each category holds at least 500 observations, and default rates vary from 4.8% (Accountants) to 17.2% (Low-skill Laborers). This risk gradient across occupations suggests valuable predictive signal, so the full granularity is preserved without consolidation. All 19 occupation types are therefore retained for model development.

In [None]:
# Impute the single missing value in YEARS_LAST_PHONE_CHANGE with the median
home_credit_final['YEARS_LAST_PHONE_CHANGE'] = home_credit_final['YEARS_LAST_PHONE_CHANGE'].fillna(home_credit_final['YEARS_LAST_PHONE_CHANGE'].median())

# Imputation of 12 missing values in AMT_ANNUITY with median
home_credit_final['AMT_ANNUITY'] = home_credit_final['AMT_ANNUITY'].fillna(home_credit_final['AMT_ANNUITY'].median())

In [9]:
# Consolidation of rare categories (<500 obs.) into 'Other' to address sample size instability
organization_mapping = {
    'Industry: type 13': 'Other',      # 67 obs
    'Trade: type 5': 'Other',          # 49 obs  
    'Trade: type 4': 'Other',          # 64 obs
    'Religion': 'Other',               # 85 obs
    'Industry: type 10': 'Other',      # 109 obs
    'Industry: type 6': 'Other',       # 112 obs
    'Transport: type 1': 'Other',      # 201 obs
    'Cleaning': 'Other',               # 260 obs
    'Legal Services': 'Other',         # 305 obs
    'Mobile': 'Other',                 # 317 obs
    'Trade: type 1': 'Other',          # 348 obs 
    'Industry: type 12': 'Other',      # 369 obs 
    'Culture': 'Other',                # 379 obs 
    'Realtor': 'Other',                # 396 obs 
    'Advertising': 'Other',            # 429 obs 
    'Industry: type 2': 'Other',       # 458 obs 
}

home_credit_final['ORGANIZATION_TYPE'] = home_credit_final['ORGANIZATION_TYPE'].replace(organization_mapping)
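
The hand-written mappings in this section could also be derived from the category counts, which avoids retyping category names if the threshold changes. This is a minimal sketch under that assumption: `rare_category_mapping` is a hypothetical helper, not part of `modules_eda`, and the toy series stands in for a column such as `ORGANIZATION_TYPE`.

```python
import pandas as pd

def rare_category_mapping(series: pd.Series, min_count: int = 500,
                          other_label: str = "Other") -> dict:
    """Map every category with fewer than min_count observations to other_label
    (hypothetical helper)."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return {category: other_label for category in rare}

# Toy column with a low threshold, for illustration only
toy = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1 + ["D"] * 2)
mapping = rare_category_mapping(toy, min_count=3)
consolidated = toy.replace(mapping)

print(mapping)
print(consolidated.value_counts())
```

The resulting dictionary has the same shape as `organization_mapping` and can be passed to `.replace()` in the same way.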

### <u> 4. Multicollinearity across features</u>

In [11]:
# Check correlation between EXT_SOURCE variables
ext_sources = home_credit_final[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']]
ext_corr = ext_sources.corr()
print(ext_corr)

              EXT_SOURCE_1  EXT_SOURCE_2  EXT_SOURCE_3
EXT_SOURCE_1      1.000000      0.213982      0.186846
EXT_SOURCE_2      0.213982      1.000000      0.109167
EXT_SOURCE_3      0.186846      0.109167      1.000000


In [12]:
# Check phik correlation between EXT_SOURCE variables
phik_ext_sources = phik_corr.loc[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3'], ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']]
phik_ext_sources

,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3
EXT_SOURCE_1,1.0,0.2392,0.20364
EXT_SOURCE_2,0.2392,1.0,0.125038
EXT_SOURCE_3,0.20364,0.125038,1.0


<u>Comment:</u>

Given the strength of the `EXT_SOURCE` variables as top predictors, a multicollinearity check was run to assess potential redundancy. Both Pearson and phi-k correlations show only weak associations between the three external sources (ranging from 0.11 to 0.24), so multicollinearity is not a concern here. Each variable appears to capture a distinct aspect of creditworthiness from a different bureau source, which supports retaining all three features without dimensionality reduction.

In [13]:
# Extract final phik correlation matrix for top 41 features
phik_final = phik_corr.loc[top_features, top_features]

# Mask the upper triangle (including the diagonal) so each pair is listed once
mask = np.triu(np.ones_like(phik_final, dtype=bool))
high_correlation = phik_final.where(~mask).stack()
high_correlation_pairs = high_correlation[high_correlation > 0.7].sort_values(ascending=False)

print("Feature pairs with correlation > 0.7:")
print(high_correlation_pairs)

Feature pairs with correlation > 0.7:
FLAG_EMP_PHONE               ORGANIZATION_TYPE                1.000000
                             NAME_INCOME_TYPE                 1.000000
REGION_RATING_CLIENT         REGION_RATING_CLIENT_W_CITY      0.998765
bur_has_history              AMT_REQ_CREDIT_BUREAU_flag_na    0.998621
AMT_CREDIT                   AMT_GOODS_PRICE                  0.984828
LIVE_CITY_NOT_WORK_CITY      REG_CITY_NOT_WORK_CITY           0.962696
FLAG_EMP_PHONE               YEARS_BIRTH                      0.911242
FLAG_DOCUMENT_6              FLAG_EMP_PHONE                   0.806903
ORGANIZATION_TYPE            OCCUPATION_TYPE                  0.802636
NAME_INCOME_TYPE             ORGANIZATION_TYPE                0.775911
FLAG_DOCUMENT_6              NAME_INCOME_TYPE                 0.774435
YEARS_ID_PUBLISH             YEARS_BIRTH                      0.761784
FLAG_EMP_PHONE               OCCUPATION_TYPE                  0.759266
FLAG_DOCUMENT_6              ORGANIZATI

In [14]:
# Feature to drop based on high correlation analysis
features_to_drop = ['AMT_GOODS_PRICE', 'AMT_REQ_CREDIT_BUREAU_flag_na',
                     'FLAG_EMP_PHONE', 'REGION_RATING_CLIENT',
                     'LIVE_CITY_NOT_WORK_CITY']

# Drop from dataframe
home_credit_final = home_credit_final.drop(columns=features_to_drop)

# Drop from list
final_features = [f for f in top_features if f not in features_to_drop]

<u>Comment:</u>

The analysis of feature intercorrelations reveals substantial multicollinearity among several selected variables, leading to the removal of five features due to information redundancy.

- `AMT_GOODS_PRICE` exhibits near-perfect correlation (0.985) with `AMT_CREDIT`, as it represents a subset capturing only the goods value for consumer loans while `AMT_CREDIT` reflects the total credit amount across all loan types. Given this redundancy and the presence of missing values in `AMT_GOODS_PRICE` for non-consumer loans, this feature is dropped.

- `AMT_REQ_CREDIT_BUREAU_flag_na` shows extremely high correlation (0.999) with `bur_has_history`, as both capture whether the applicant has prior credit bureau records through different mechanisms. The former flags missing enquiry counts while the latter directly indicates bureau history presence, making `AMT_REQ_CREDIT_BUREAU_flag_na` redundant.

- `FLAG_EMP_PHONE` shows perfect phi-k correlation (1.000) with both `ORGANIZATION_TYPE` and `NAME_INCOME_TYPE`, because work phone provision is largely determined by employment sector. Since `ORGANIZATION_TYPE` captures this employment stability signal while also providing granular employer sector information, the binary flag is redundant.

- `REGION_RATING_CLIENT` and `REGION_RATING_CLIENT_W_CITY` share a correlation of 0.999, as the former is a broader regional rating while the latter incorporates city-level information. The city-level version is retained because it carries a more granular geographic risk signal and has the higher phi-k correlation with the target.

- Finally, `LIVE_CITY_NOT_WORK_CITY` and `REG_CITY_NOT_WORK_CITY` correlate at 0.963, both capturing residential and work location mismatches. The latter is retained given its stronger phi-k correlation with the target.

These decisions reduce the feature set from 41 to 36 variables while preserving all unique predictive information captured in the input space.
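
For pairs where the tie-breaker is the phi-k correlation with the target, the choice can be expressed as a simple rule. The sketch below illustrates only that rule: `weaker_of_pairs` is a hypothetical helper, the feature names and scores are placeholders, and the drops made above also rely on domain reasoning that this function does not capture (several pairs above 0.7 were deliberately kept).

```python
import pandas as pd

def weaker_of_pairs(pairs, target_strength: pd.Series) -> list:
    """For each correlated pair, mark the feature with the weaker target
    association for removal (hypothetical helper)."""
    to_drop = []
    for a, b in pairs:
        if a in to_drop or b in to_drop:
            continue  # pair already resolved by an earlier drop
        weaker = a if target_strength[a] < target_strength[b] else b
        to_drop.append(weaker)
    return to_drop

# Placeholder features and scores, for illustration only
target_strength = pd.Series({"x1": 0.10, "x2": 0.05, "x3": 0.02, "x4": 0.08})
pairs = [("x1", "x2"), ("x3", "x4"), ("x2", "x3")]

print(weaker_of_pairs(pairs, target_strength))
```

With the notebook's objects, `pairs` could be taken from `high_correlation_pairs.index` and `target_strength` from `target_corr.set_index('feature')['phik corr. TARGET']`, with the output treated as a list of candidates to review rather than a final decision.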

### <u> 5. Data Type</u>

In this final step, we ensure that all features are expressed using appropriate data types, namely int or float for numerical variables and binary variables, and category for categorical variables.

In [15]:
cat_cols = home_credit_final.select_dtypes(include="object").columns

for col in cat_cols:
    home_credit_final[col] = home_credit_final[col].astype("category")


In [16]:
home_credit_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 307511 entries, 149977 to 309985
Data columns (total 37 columns):
 #   Column                       Non-Null Count   Dtype   
---  ------                       --------------   -----   
 0   EXT_SOURCE_3                 246546 non-null  float64 
 1   EXT_SOURCE_1                 134133 non-null  float64 
 2   EXT_SOURCE_2                 306851 non-null  float64 
 3   YEARS_EMPLOYED               252137 non-null  float64 
 4   YEARS_BIRTH                  307511 non-null  float64 
 5   OCCUPATION_TYPE              307511 non-null  category
 6   ORGANIZATION_TYPE            307511 non-null  category
 7   NAME_INCOME_TYPE             307511 non-null  category
 8   REG_CITY_NOT_WORK_CITY       307511 non-null  int64   
 9   YEARS_LAST_PHONE_CHANGE      307511 non-null  float64 
 10  REG_CITY_NOT_LIVE_CITY       307511 non-null  int64   
 11  FLAG_DOCUMENT_3              307511 non-null  int64   
 12  bur_cnt_active               307511 non-null

### <u> 6. Final dataset</u>

In [17]:
eda_final = eda(home_credit_final, final_features, categorical=True).reset_index(drop=True)
eda_final

,Variable,Missing_Count,Missing_Percentage,Distinct_Values_Count
0,OWN_CAR_AGE,202929,65.99,62
1,EXT_SOURCE_1,173378,56.38,114584
2,EXT_SOURCE_3,60965,19.83,814
3,YEARS_EMPLOYED,55374,18.01,464
4,EXT_SOURCE_2,660,0.21,119828
5,AMT_ANNUITY,0,0.0,13672
6,AMT_CREDIT,0,0.0,5603
7,YEARS_REGISTRATION,0,0.0,555
8,YEARS_BIRTH,0,0.0,483
9,YEARS_ID_PUBLISH,0,0.0,177


<u>Comment:</u>

While several features retain substantial missing values (`EXT_SOURCE_1` at 56%, `EXT_SOURCE_3` at 20%, `OWN_CAR_AGE` at 66%, and `YEARS_EMPLOYED` at 18%), these missingness patterns represent informative signals rather than data quality issues. For the external credit bureau scores, missing values indicate credit-invisible applicants lacking prior bureau records. Missing `OWN_CAR_AGE` flags applicants without a vehicle, while missing `YEARS_EMPLOYED` corresponds to the sentinel value denoting non-employment. Given the informative nature of these patterns and the native missing-value handling of tree-based algorithms, these values are preserved as missing for model training. By contrast, the 12 missing values in `AMT_ANNUITY` were uninformative, resulting from wrong or missing computation; they were imputed with the median in Section 3, and such a small fraction would in any case be negligible for the boosted tree model employed.
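
Whether a missingness pattern is informative can be checked by comparing the default rate of rows with and without a value. The sketch below shows the idea on made-up data; `default_rate_by_missingness` is a hypothetical helper and `score` is a placeholder feature, so the numbers say nothing about the Home Credit variables.

```python
import numpy as np
import pandas as pd

def default_rate_by_missingness(df: pd.DataFrame, feature: str,
                                target: str = "TARGET") -> pd.DataFrame:
    """Mean target and row count for missing vs. present values of a feature
    (hypothetical helper)."""
    status = df[feature].isna().map({True: "missing", False: "present"})
    return df.groupby(status.rename(f"{feature}_status"))[target].agg(["mean", "count"])

# Made-up data, for illustration only
toy = pd.DataFrame({
    "score":  [0.7, np.nan, 0.4, np.nan, 0.9, np.nan, 0.6, 0.5],
    "TARGET": [0, 1, 0, 1, 0, 0, 0, 1],
})

rates = default_rate_by_missingness(toy, "score")
print(rates)
```

A clear gap between the two groups suggests that missingness itself carries signal, which is the situation in which leaving the values as missing for a tree-based model is a reasonable choice.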

In [18]:
# Save final dataset for model deployment
home_credit_final.to_csv('../data/processed/home_credit_final.csv')
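
One practical caveat: CSV files do not store dtypes, so the `category` columns set in the Data Type section come back as plain strings when the file is reloaded. The sketch below shows the round trip on a toy frame using an in-memory buffer in place of the real file; the column values are placeholders.

```python
import io
import pandas as pd

# Toy frame with one categorical column, for illustration only
toy = pd.DataFrame({
    "ORGANIZATION_TYPE": pd.Categorical(["School", "Other", "School"]),
    "TARGET": [0, 1, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)

# Plain reload: the categorical dtype is lost
buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)

# Reload with an explicit dtype to restore the categorical column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0,
                       dtype={"ORGANIZATION_TYPE": "category"})

print(plain.dtypes)
print(restored.dtypes)
```

In the modeling notebook, the same `dtype` argument (or a repeat of the conversion loop above) would be needed after loading `home_credit_final.csv`.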