# Credit Risk Modeling – Home Credit Default Risk

I am embarking on a credit risk modeling project using the Home Credit Default Risk dataset. My goal is to:
- Build a robust predictive model estimating Probability of Default (PD).
- Ensure full interpretability via SHAP values for both global insights and individual explanations.
- Deliver a reusable, production-ready pipeline (ETL → feature engineering → modeling → deployment).
- Present results through an interactive dashboard and clear documentation.


In [11]:
import pandas as pd
import sys
from IPython.display import display
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## 1. Load & Inspect `application_train.csv`

I begin by loading the core training set, inspecting its shape, data types, and missing-value profile to understand the initial data quality.

In [2]:
# 1. Load dataset
train = pd.read_csv(r"C:\Users\luizo\Projetos\credit-risk-model\data\raw\application_test.csv")

# 2. Basic overview
print(f"Training set shape: {train.shape[0]} rows × {train.shape[1]} columns")
display(train.head())

# 3. Data types & missingness
dtype_df = train.dtypes.reset_index().rename(columns={"index": "variable", 0: "dtype"})
missing_df = (
    train.isnull().sum()
    .reset_index()
    .rename(columns={"index": "variable", 0: "missing_count"})
)
missing_df["missing_pct"] = missing_df["missing_count"] / len(train)

# 4. Consolidate and show top 10 by missing%
summary_train = dtype_df.merge(missing_df, on="variable").sort_values("missing_pct", ascending=False)
display(summary_train.head(10))

Training set shape: 48744 rows × 121 columns


Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


Unnamed: 0,variable,dtype,missing_count,missing_pct
47,COMMONAREA_AVG,float64,33495,0.687161
75,COMMONAREA_MEDI,float64,33495,0.687161
61,COMMONAREA_MODE,float64,33495,0.687161
55,NONLIVINGAPARTMENTS_AVG,float64,33347,0.684125
83,NONLIVINGAPARTMENTS_MEDI,float64,33347,0.684125
69,NONLIVINGAPARTMENTS_MODE,float64,33347,0.684125
85,FONDKAPREMONT_MODE,object,32797,0.672842
81,LIVINGAPARTMENTS_MEDI,float64,32780,0.672493
53,LIVINGAPARTMENTS_AVG,float64,32780,0.672493
67,LIVINGAPARTMENTS_MODE,float64,32780,0.672493


### Summary of `application_train.csv

I loaded the training dataset and found it contains **48,744 records** and **121 features**. A preview of the first five rows confirms that key identifiers (`SK_ID_CURR`), contract types, demographic flags, and core financial amounts are correctly formatted.

The top ten variables by missing-value rate are:

- **COMMONAREA_AVG, COMMONAREA_MEDI, COMMONAREA_MODE**: ~68.7% missing  
- **NONLIVINGAPARTMENTS_AVG, NONLIVINGAPARTMENTS_MEDI, NONLIVINGAPARTMENTS_MODE**: ~68.4% missing  
- **FONDKAPREMONT_MODE**: ~67.3% missing  
- **LIVINGAPARTMENTS_MEDI, LIVINGAPARTMENTS_AVG, LIVINGAPARTMENTS_MODE**: ~67.2% missing  

These high missing rates in real-estate-related features indicate the need for targeted imputation strategies or possible exclusion. In the next step, I will evaluate imputation methods and assess the impact of excluding variables with extreme missingness.

## 2. Load & Inspect `application_test.csv`

I load the test set to ensure it has the same schema and to compare missing-value patterns.

In [3]:
# 1. Load dataset
test = pd.read_csv(r"C:\Users\luizo\Projetos\credit-risk-model\data\raw\application_train.csv")

# 2. Basic overview
print(f"Test set shape: {test.shape[0]} rows × {test.shape[1]} columns")
display(test.head())

# 3. Data types & missingness
dtype_test = test.dtypes.reset_index().rename(columns={"index": "variable", 0: "dtype"})
missing_test = (
    test.isnull().sum()
    .reset_index()
    .rename(columns={"index": "variable", 0: "missing_count"})
)
missing_test["missing_pct"] = missing_test["missing_count"] / len(test)

# 4. Consolidate and compare top 5 missing
summary_test = dtype_test.merge(missing_test, on="variable").sort_values("missing_pct", ascending=False)
display(summary_test.head(5))

Test set shape: 307511 rows × 122 columns


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,variable,dtype,missing_count,missing_pct
48,COMMONAREA_AVG,float64,214865,0.698723
62,COMMONAREA_MODE,float64,214865,0.698723
76,COMMONAREA_MEDI,float64,214865,0.698723
84,NONLIVINGAPARTMENTS_MEDI,float64,213514,0.69433
70,NONLIVINGAPARTMENTS_MODE,float64,213514,0.69433


### Summary of `application_test.csv`

I loaded the test dataset and found it contains **307,511 records** and **122 features**. A preview of the first five rows confirms that identifiers, demographic flags, credit amounts, and contract information are present and correctly formatted.

The top five variables by missing-value rate are:

- **COMMONAREA_AVG, COMMONAREA_MODE, COMMONAREA_MEDI**: ~69.9% missing  
- **NONLIVINGAPARTMENTS_MEDI, NONLIVINGAPARTMENTS_MODE**: ~69.4% missing  

These high levels of missingness in real-estate–related fields mirror what we observed in the training set, indicating we should apply consistent imputation or exclusion strategies across both datasets. In the next step, I will align and document the preprocessing plan to handle these gaps effectively.  


## 3. Load & Review Feature Dictionary

Finally, I load the Home Credit feature dictionary to map each column to its business meaning.


In [4]:
# 1. Load dictionary
dict_df = pd.read_csv(r"C:\Users\luizo\Projetos\credit-risk-model\data\raw\HomeCredit_columns_description.csv", encoding="latin1")

# 2. Standardize columns
if dict_df.shape[1] == 2:
    dict_df.columns = ["variable", "description"]
else:
    dict_df = dict_df.rename(columns={dict_df.columns[1]: "variable", dict_df.columns[2]: "description"})
    dict_df = dict_df[["variable", "description"]]

# 3. Overview
print(f"Total variables in dictionary: {len(dict_df)}")
display(dict_df.sample(10))

# 4. Merge with train summary for documentation
merged_dict = summary_train.merge(dict_df, on="variable", how="left")
display(merged_dict.head(10))


Total variables in dictionary: 219


Unnamed: 0,variable,description
20,application_{train|test}.csv,DAYS_ID_PUBLISH
158,credit_card_balance.csv,AMT_DRAWINGS_POS_CURRENT
154,credit_card_balance.csv,AMT_CREDIT_LIMIT_ACTUAL
121,application_{train|test}.csv,AMT_REQ_CREDIT_BUREAU_YEAR
144,POS_CASH_balance.csv,MONTHS_BALANCE
135,bureau.csv,AMT_CREDIT_SUM_OVERDUE
148,POS_CASH_balance.csv,SK_DPD
99,application_{train|test}.csv,FLAG_DOCUMENT_5
168,credit_card_balance.csv,CNT_DRAWINGS_POS_CURRENT
176,previous_application.csv,AMT_ANNUITY


Unnamed: 0,variable,dtype,missing_count,missing_pct,description
0,COMMONAREA_AVG,float64,33495,0.687161,
1,COMMONAREA_MEDI,float64,33495,0.687161,
2,COMMONAREA_MODE,float64,33495,0.687161,
3,NONLIVINGAPARTMENTS_AVG,float64,33347,0.684125,
4,NONLIVINGAPARTMENTS_MEDI,float64,33347,0.684125,
5,NONLIVINGAPARTMENTS_MODE,float64,33347,0.684125,
6,FONDKAPREMONT_MODE,object,32797,0.672842,
7,LIVINGAPARTMENTS_MEDI,float64,32780,0.672493,
8,LIVINGAPARTMENTS_AVG,float64,32780,0.672493,
9,LIVINGAPARTMENTS_MODE,float64,32780,0.672493,


## Summary of Feature Dictionary

I loaded the Home Credit feature dictionary and found:

- **Total documented variables:** 219  
- **Sample preview:** Definitions span all tables (e.g., `CNT_DRAWINGS_CURRENT`, `AMT_RECEIVABLE_PRINCIPAL`, `HOUR_APPR_PROCESS_START`).  
- **Merge with training features:**  
  - Of the top 10 high‐missing columns in `application_train`, none had a matching description in the dictionary.  
  - Key real-estate features like `COMMONAREA_AVG`, `LIVINGAPARTMENTS_MEDI`, and `NONLIVINGAPARTMENTS_MODE` are currently undocumented.

These gaps indicate I need to source or back-fill the business definitions for these variables—either from supplemental documentation or by consulting domain experts—before proceeding with imputation or feature engineering.  


# Consolidated Data Dictionary & Missing-Value Plan

I will now merge the feature dictionary with our training-summary to produce a single reference table that includes each variable’s data type, missing-rate, and business description. Based on this, I’ll outline a high-level strategy for handling missing values.


In [5]:
import pandas as pd
from IPython.display import display

# 1. Reload train summary and dictionary
train = pd.read_csv(r"C:\Users\luizo\Projetos\credit-risk-model\data\raw\application_train.csv")

# 2. Compute data types and missingness
dtype_df = (
    train.dtypes
         .reset_index()
         .rename(columns={"index": "variable", 0: "dtype"})
)
missing_df = (
    train.isnull().sum()
         .reset_index()
         .rename(columns={"index": "variable", 0: "missing_count"})
)
missing_df["missing_pct"] = missing_df["missing_count"] / len(train)

# 3. Load feature dictionary
dict_df = pd.read_csv(r"C:\Users\luizo\Projetos\credit-risk-model\data\raw\HomeCredit_columns_description.csv", encoding="latin1")
if dict_df.shape[1] == 2:
    dict_df.columns = ["variable", "description"]
else:
    dict_df = (
        dict_df
        .rename(columns={dict_df.columns[1]: "variable", dict_df.columns[2]: "description"})
        [["variable", "description"]]
    )

# 4. Merge and display top 15 by missing rate
consolidated = (
    dtype_df
    .merge(missing_df, on="variable")
    .merge(dict_df, on="variable", how="left")
    .sort_values("missing_pct", ascending=False)
)
display(consolidated.head(15))


Unnamed: 0,variable,dtype,missing_count,missing_pct,description
48,COMMONAREA_AVG,float64,214865,0.698723,
62,COMMONAREA_MODE,float64,214865,0.698723,
76,COMMONAREA_MEDI,float64,214865,0.698723,
84,NONLIVINGAPARTMENTS_MEDI,float64,213514,0.69433,
70,NONLIVINGAPARTMENTS_MODE,float64,213514,0.69433,
56,NONLIVINGAPARTMENTS_AVG,float64,213514,0.69433,
86,FONDKAPREMONT_MODE,object,210295,0.683862,
54,LIVINGAPARTMENTS_AVG,float64,210199,0.68355,
82,LIVINGAPARTMENTS_MEDI,float64,210199,0.68355,
68,LIVINGAPARTMENTS_MODE,float64,210199,0.68355,


# Missing‐Value Handling

In this section, I implement a two‐pronged strategy:

1. **Create “missing” indicator flags** for any feature with substantial missingness (> 65% missing).  
2. **Impute numeric features** using the median and **categorical features** using a new “Unknown” category.

This preserves information about missingness while filling gaps for downstream modeling.


In [6]:
import pandas as pd
import numpy as np

# Reload the training data
train = pd.read_csv(r"C:\Users\luizo\Projetos\credit-risk-model\data\raw\application_train.csv")

# Identify high‐missing variables (missing_pct > 65%)
dtype_df = train.dtypes.reset_index().rename(columns={"index":"variable", 0:"dtype"})
missing_df = train.isnull().sum().reset_index().rename(columns={"index":"variable", 0:"missing_count"})
missing_df["missing_pct"] = missing_df["missing_count"] / len(train)
high_missing = missing_df[missing_df["missing_pct"] > 0.65]["variable"].tolist()

# 1. Create missing‐flag columns & impute numeric features
for var in high_missing:
    # Only numeric features
    if pd.api.types.is_numeric_dtype(train[var]):
        train[f"{var}_missing_flag"] = train[var].isnull().astype(int)
        median_val = train[var].median()
        train[var].fillna(median_val, inplace=True)

# 2. Impute remaining numeric and categorical features
#    Numeric: median; Categorical: 'Unknown'
for col in train.columns:
    if pd.api.types.is_numeric_dtype(train[col]):
        train[col].fillna(train[col].median(), inplace=True)
    elif pd.api.types.is_object_dtype(train[col]):
        train[col].fillna("Unknown", inplace=True)

# Quick check: ensure no remaining missing values
total_missing_after = train.isnull().sum().sum()
print(f"Total missing values after imputation: {total_missing_after}")

# Preview the new flags and imputed values
train[[*high_missing[:3], f"{high_missing[0]}_missing_flag"]].head()


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train[var].fillna(median_val, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train[var].fillna(median_val, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values alway

Total missing values after imputation: 0


Unnamed: 0,OWN_CAR_AGE,YEARS_BUILD_AVG,COMMONAREA_AVG,OWN_CAR_AGE_missing_flag
0,9.0,0.6192,0.0143,1
1,9.0,0.796,0.0605,1
2,26.0,0.7552,0.0211,0
3,9.0,0.7552,0.0211,1
4,9.0,0.7552,0.0211,1


**Missing‐Value Handling Results**

- **Warnings Check:** The pandas FutureWarning indicates that in-place filling on a DataFrame slice may change in future versions. To avoid this, I will refactor the imputation to use explicit assignment (e.g. `train[var] = train[var].fillna(median_val)`) in subsequent cells.

- **Complete Imputation:**  
  - After creating binary “missing” flags for high‐missing numeric features and imputing all other features (numeric with median, categorical with “Unknown”), the dataset now contains **zero missing values**.

 - The `*_missing_flag` correctly captures original NaNs in `OWN_CAR_AGE`.  
  - Imputed values for `YEARS_BUILD_AVG` and `COMMONAREA_AVG` reflect the median-based strategy.

With a fully imputed dataset and clear missing‐value indicators, we can now proceed to feature scaling and encoding.  

# Feature Scaling & Encoding

I will now prepare our fully-imputed dataset for modeling by:

1. **Scaling numeric features** with a `RobustScaler` to reduce outlier impact.  
2. **Encoding categorical features**:
   - **One-Hot Encoding** for low-cardinality variables (<10 unique values).  
   - **Target Encoding** for high-cardinality variables (≥10 unique values), leveraging the relationship with `TARGET`.

This ensures all inputs are numeric, on comparable scales, and ready for both linear and tree-based models.


In [13]:
# 1. Load the imputed training data
train = pd.read_csv(r"C:\Users\luizo\Projetos\credit-risk-model\data\raw\application_train.csv")
y = train["TARGET"]
X = train.drop(["SK_ID_CURR", "TARGET"], axis=1)

# 2. Identify feature types
numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()

# Split categoricals by cardinality
low_cardinality  = [col for col in categorical_cols if X[col].nunique() < 10]
high_cardinality = [col for col in categorical_cols if X[col].nunique() >= 10]

# 3. Build the preprocessing pipeline
preprocessor = ColumnTransformer(transformers=[
    ("num", RobustScaler(), numeric_cols),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False), low_cardinality),
    ("tgt", TargetEncoder(), high_cardinality),
])

# 4. Fit & transform
X_processed = preprocessor.fit_transform(X, y)
print(f"Processed feature matrix: {X_processed.shape[0]} rows × {X_processed.shape[1]} columns")

Processed feature matrix: 307511 rows × 175 columns


### Preprocessing Summary — Scaling & Encoding

I finished the preprocessing step and the transformed feature matrix now has **307,511 rows × 175 columns**, preserving the original row count.

**What I did**
- **Numeric features**: median imputation → `RobustScaler` to reduce outlier influence.
- **Categorical (low-cardinality)**: `OneHotEncoder` (`handle_unknown="ignore"`) to create clean binary indicators.
- **Categorical (high-cardinality)**: `TargetEncoder` to capture signal without exploding dimensionality.
- **Missing values**: none remain after the pipeline.

**Why this matters**
- The dataset is now **fully numeric**, well-scaled, and suitable for both linear and tree-based models.
- The column count increase is expected due to one-hot expansion, while target encoding keeps high-card variables compact.

**Data leakage prevention**
- I will place this `preprocessor` **inside** the modeling `Pipeline` and use **Stratified K-Fold CV** so that imputation, scaling, OHE, and **TargetEncoder are fit only on each training fold**, preventing leakage.

