# 02 — Feature Engineering

In this notebook we'll:

1. One-hot / target-encode high-cardinality categoricals  
2. Bin continuous variables (age, income)  
3. Create interaction terms  
4. Apply PCA on correlated cash-flow features  


In [None]:
#%pip install scikit-learn

import pandas as pd
import numpy as np

# scikit-learn
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA


# load cleaned data
df = pd.read_csv("C:/Users/HP PAVILION 15 CS/OneDrive/loan_default_model_Ren/data/processed/cleaned_data.csv")



In [6]:
df.head()

Unnamed: 0,Index,LoanAmount,CreationDate,fpd_15,data.Request.Input.CB2.MaxDPD,data.Request.Input.CB2.Labelling,data.Request.Input.CB2.CurrentDPD,data.Request.Input.CB2.Outstandingloan,data.Request.Input.CB1.MaxDPD,data.Request.Input.CB1.Labelling,...,data.Request.Input.SalaryService.MonthlyCashFlow3,data.Request.Input.BVN.StateOfOrigin,data.Request.Input.PrevApplication.LoanAmount,data.Request.Input.PrevApplication.LoanTerm,data.Request.Input.PrevApplication.InterestRate,data.Request.Input.Application.RequestedLoanTerm,data.Request.Input.SalaryService.OpeningBalance,Age,TimeOnBook,DebtToIncome
0,0,1410000.0,2023-09-04,0,0.0,Sub-Prime,0.0,70000.0,0.0,Prime,...,-2000.0,Jigawa State,377000.0,13.0,12.0,17.0,30000.0,0.0,631,0.175391
1,1,710000.0,2023-06-26,0,0.0,Near-Prime,0.0,0.0,0.0,Near-Prime,...,-2000.0,Niger State,377000.0,13.0,12.0,18.0,45000.0,36.3,701,1.198997
2,2,5640000.0,2023-08-20,0,0.0,NTC,0.0,20000.0,0.0,Prime,...,-2000.0,Ogun State,4802000.0,18.0,12.0,14.0,30000.0,43.3,646,0.098621
3,3,120000.0,2023-05-22,1,0.0,Prime,0.0,0.0,0.0,NTC,...,-2000.0,Cross River State,262000.0,15.0,13.0,14.0,30000.0,30.0,736,0.105971
4,4,160000.0,2023-05-17,0,0.0,Sub-Prime,0.0,140000.0,0.0,Sub-Prime,...,-2000.0,Abia State,377000.0,13.0,12.0,12.0,30000.0,42.7,741,0.114135


In [None]:
#%pip install category_encoders


**High-cardinality fields:**  

- `data.Request.Input.Customer.AddressLGA`  
- `data.Request.Input.Customer.Employment.EmployerLGA`  
- `data.Request.Input.BVN.StateOfOrigin`  

We'll one-hot encode any level appearing in ≥ 1% of rows, and target-encode the rest.
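
A minimal sketch of the split described above, on toy data. The notebook installs `category_encoders`, which provides a `TargetEncoder`; the sketch below uses plain pandas instead so the arithmetic is visible. The helper names `split_levels` and `encode_mixed`, the smoothing weight `m`, and the toy columns are assumptions for illustration and are not part of this notebook.

```python
import pandas as pd


def split_levels(s: pd.Series, min_share: float = 0.01):
    """Return (frequent, rare) level lists based on each level's share of rows."""
    share = s.value_counts(normalize=True)
    frequent = share[share >= min_share].index.tolist()
    rare = share[share < min_share].index.tolist()
    return frequent, rare


def encode_mixed(df, col, target, min_share=0.01, m=10.0):
    """One-hot encode frequent levels; smoothed target-encode rare ones."""
    frequent, rare = split_levels(df[col], min_share)
    out = pd.DataFrame(index=df.index)

    # One indicator column per frequent level.
    for lvl in frequent:
        out[f"{col}_{lvl}"] = (df[col] == lvl).astype(float)

    # Smoothed target mean per level: shrink towards the global mean (prior)
    # so that levels with very few rows do not get extreme values.
    prior = df[target].mean()
    stats = df.groupby(col)[target].agg(["sum", "count"])
    smoothed = (stats["sum"] + m * prior) / (stats["count"] + m)

    # Rare-level rows get their smoothed mean; all other rows get the prior.
    is_rare = df[col].isin(rare)
    out[f"{col}_te"] = df[col].map(smoothed).where(is_rare, prior)
    return out


# Toy data (hypothetical): two frequent levels and two levels below 1% of rows.
toy = pd.DataFrame({
    "state": ["Lagos"] * 150 + ["Kano"] * 48 + ["Rare1", "Rare2"],
    "default": [0] * 150 + [1] * 48 + [1, 0],
})
encoded = encode_mixed(toy, "state", "default")
print(encoded.tail(3))
```

In a real workflow the target statistics would normally be computed on the training fold only, since encoding with the full dataset's target leaks label information into the features.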


In [11]:
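from sklearn.compose import make_column_transformer

# NOTE: `oh_cols` is not defined in the cells shown here. It is expected to be
# an iterable of (column_name, <unused>) pairs prepared in an earlier step.
# One OneHotEncoder per column; categories unseen at fit time are encoded as all zeros.
ohe = make_column_transformer(
    *[(OneHotEncoder(handle_unknown='ignore'), [col]) for col, _ in oh_cols],
    remainder='drop'
)

X_ohe = ohe.fit_transform(df)
ohe_feats = ohe.get_feature_names_out()

# Convert to a dense array if a sparse matrix is returned
if hasattr(X_ohe, 'toarray'):
    X_ohe = X_ohe.toarray()

df_ohe = pd.DataFrame(X_ohe, columns=ohe_feats, index=df.index)

# Append the indicator columns to the original frame
# (re-running this cell will append them a second time)
df = pd.concat([df, df_ohe], axis=1)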

# Binning Continuous Variables

We'll:
- Bucket **Age** into 5 equal-width bins  
- Split **Income.Final** into deciles
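
The difference between the two binning strategies is easiest to see on a small skewed series. This is only an illustrative sketch: the values are made up, and `duplicates="drop"` is an optional guard against repeated quantile edges that the notebook itself does not use.

```python
import pandas as pd

# Nine small values and one outlier (made-up data).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Equal-width: the range 1..100 is cut into 5 intervals of equal length,
# so the outlier pushes almost every row into the first bin.
width_bins = pd.cut(values, bins=5, labels=False)

# Quantile: edges are chosen so each bin holds roughly the same number of rows.
quant_bins = pd.qcut(values, q=5, labels=False, duplicates="drop")

print(width_bins.value_counts().sort_index().to_dict())
print(quant_bins.value_counts().sort_index().to_dict())
```

The same effect applies to `Age`: if the `0.0` in the first row of `df.head()` is a placeholder for a missing age (an assumption), it stretches the equal-width range and changes where every bin edge falls.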


In [12]:
# Equal-width bins for Age: pd.cut splits the observed range into 5 intervals of equal length
df['Age_bin'] = pd.cut(df['Age'], bins=5, labels=False)

# Quantile bins for income: pd.qcut places roughly 10% of rows in each decile
df['Income_decile'] = pd.qcut(df['data.Request.Input.Customer.Income.Final'], 10, labels=False)

df[['Age', 'Age_bin', 'Income_decile']].head()


Unnamed: 0,Age,Age_bin,Income_decile
0,0.0,0,7
1,36.3,2,0
2,43.3,3,9
3,30.0,2,2
4,42.7,3,2


# Interaction Terms

Create:  
- **DebtToIncome × TimeOnBook**  
- **DebtToIncome × NumberOfChildren**  
- **MonthlyCashFlow2 × MonthlyCashFlow3**
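
One way to keep such product terms in a single place is a small helper that takes a mapping from new column name to the pair of source columns. The helper `add_interactions`, the toy column names, and the `both_negative` flag are hypothetical and not part of this notebook. The toy values also show a caveat of multiplying signed quantities: two negative cash flows give a positive product, just as two positive ones do.

```python
import pandas as pd


def add_interactions(df: pd.DataFrame, pairs: dict) -> pd.DataFrame:
    """Return a copy of df with one product column per {name: (a, b)} entry."""
    out = df.copy()
    for name, (a, b) in pairs.items():
        out[name] = out[a] * out[b]
    return out


# Toy cash-flow values (made up for illustration).
toy = pd.DataFrame({
    "cf2": [-2000.0, 11000.0, 3000.0],
    "cf3": [-2000.0, -2000.0, 3000.0],
})
toy = add_interactions(toy, {"cf2_x_cf3": ("cf2", "cf3")})

# Optional flag (an assumption, not in the notebook) that separates
# "both negative" from "both positive", which share the same product sign.
toy["both_negative"] = (toy["cf2"] < 0) & (toy["cf3"] < 0)
print(toy)
```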


In [13]:
# Interaction terms: element-wise products of the paired columns
df['DTI_x_TOB'] = df['DebtToIncome'] * df['TimeOnBook']
df['DTI_x_Children'] = df['DebtToIncome'] * df['data.Request.Input.Customer.NumberOfChildren']
df['CF2_x_CF3'] = (
    df['data.Request.Input.SalaryService.MonthlyCashFlow2']
    * df['data.Request.Input.SalaryService.MonthlyCashFlow3']
)

df[['DTI_x_TOB','DTI_x_Children','CF2_x_CF3']].head()


Unnamed: 0,DTI_x_TOB,DTI_x_Children,CF2_x_CF3
0,110.67186,0.175391,4000000.0
1,840.496656,1.198997,-22000000.0
2,63.709108,0.098621,4000000.0
3,77.994621,0.317913,4000000.0
4,84.573696,0.114135,4000000.0


In [14]:
# PCA on the cash-flow and exposure columns, which are highly correlated, to reduce dimensionality
cashflow_cols = [
    'data.Request.Input.SalaryService.MonthlyCashFlow2',
    'data.Request.Input.SalaryService.MonthlyCashFlow3',
    'data.Request.Input.Customer.TotalExistingExposure'
]

# scale them
scaler = StandardScaler()
X_cash = scaler.fit_transform(df[cashflow_cols])

# fit PCA to capture 90% variance
pca = PCA(n_components=0.90, random_state=42)
pcs = pca.fit_transform(X_cash)

# add components back
for i in range(pcs.shape[1]):
    df[f'CF_PCA_{i+1}'] = pcs[:, i]

print(f"PCA retained {pcs.shape[1]} components.")


PCA retained 2 components.
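
The same scale-then-project steps can also be wrapped in a `Pipeline`, so that the scaler and PCA are fitted on training rows only and then reused on new rows. The sketch below uses synthetic data rather than the loan columns; the step names, the train/new split, and the explicit `svd_solver="full"` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=500)

# Three synthetic columns: the first two are strongly correlated,
# the third is independent noise.
X = np.column_stack([
    base + 0.1 * rng.normal(size=500),
    base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])
X_train, X_new = X[:400], X[400:]

pipe = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps the fewest components whose cumulative
    # explained variance exceeds that fraction.
    ("pca", PCA(n_components=0.90, svd_solver="full")),
])

pcs_train = pipe.fit_transform(X_train)  # fit scaler and PCA on training rows
pcs_new = pipe.transform(X_new)          # reuse the fitted steps on new rows

ratios = pipe.named_steps["pca"].explained_variance_ratio_
print(pcs_train.shape[1], ratios.round(3), ratios.sum().round(3))
```

Inspecting `explained_variance_ratio_` shows how much each retained component contributes, which is a useful check before adding the components back as features.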


In [15]:
df.head()

Unnamed: 0,Index,LoanAmount,CreationDate,fpd_15,data.Request.Input.CB2.MaxDPD,data.Request.Input.CB2.Labelling,data.Request.Input.CB2.CurrentDPD,data.Request.Input.CB2.Outstandingloan,data.Request.Input.CB1.MaxDPD,data.Request.Input.CB1.Labelling,...,onehotencoder-3__data.Request.Input.BVN.StateOfOrigin_Yobe State,onehotencoder-3__data.Request.Input.BVN.StateOfOrigin_Zamfara State,onehotencoder-3__data.Request.Input.BVN.StateOfOrigin_kaduna State,Age_bin,Income_decile,DTI_x_TOB,DTI_x_Children,CF2_x_CF3,CF_PCA_1,CF_PCA_2
0,0,1410000.0,2023-09-04,0,0.0,Sub-Prime,0.0,70000.0,0.0,Prime,...,0.0,0.0,0.0,0,7,110.67186,0.175391,4000000.0,-0.022071,0.75534
1,1,710000.0,2023-06-26,0,0.0,Near-Prime,0.0,0.0,0.0,Near-Prime,...,0.0,0.0,0.0,2,0,840.496656,1.198997,-22000000.0,-0.02056,-0.46524
2,2,5640000.0,2023-08-20,0,0.0,NTC,0.0,20000.0,0.0,Prime,...,0.0,0.0,0.0,3,9,63.709108,0.098621,4000000.0,-0.034684,2.359499
3,3,120000.0,2023-05-22,1,0.0,Prime,0.0,0.0,0.0,NTC,...,0.0,0.0,0.0,2,2,77.994621,0.317913,4000000.0,-0.009728,-0.814544
4,4,160000.0,2023-05-17,0,0.0,Sub-Prime,0.0,140000.0,0.0,Sub-Prime,...,0.0,0.0,0.0,3,2,84.573696,0.114135,4000000.0,-0.009485,-0.845458


In [17]:

# Save the feature-engineered dataset to the data/processed folder
df.to_csv('C:/Users/HP PAVILION 15 CS/OneDrive/loan_default_model_Ren/data/processed/loans_featured.csv', index=False)
print("Feature-engineered dataset saved to data/processed/loans_featured.csv")


Feature-engineered dataset saved to data/processed/loans_featured.csv
