# Notebook 00: Home Credit Data Preparation & Subsampling

This notebook loads the Home Credit Default Risk dataset, cleans it, imputes missing values, and draws a 5,000-row stratified sample for use in the hybrid credit scoring analysis. Each step is explained before the code that performs it.


## Imports


In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


## Step 1: Load the Home Credit application_train.csv

We load the main training dataset from the raw data folder.


In [7]:
data = pd.read_csv('/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/raw/application_train.csv')  # Adjust path as needed
print("Data shape:", data.shape)
data.head()


Data shape: (307511, 122)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


## Step 2: Remove duplicate applications

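Duplicate rows would be counted more than once and can bias summary statistics and fitted models, so we drop them. Note that `drop_duplicates()` without arguments removes only rows that are identical in every column.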


In [8]:
data = data.drop_duplicates()
print("Data shape after dropping duplicates:", data.shape)


Data shape after dropping duplicates: (307511, 122)


In [9]:
print("Data shape:", data.shape)
data.head()

Data shape: (307511, 122)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


No duplicates were found: the shape is unchanged at (307511, 122).
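Because the default check compares entire rows, two records that share an application ID but differ in a single field would not be caught. The following is a minimal sketch of a stricter check on the key column alone. It assumes `SK_ID_CURR` is meant to identify exactly one application, and it runs on a small made-up frame rather than the real data.

```python
import pandas as pd

# Toy frame standing in for application_train.csv (all values are made up).
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100003, 100004, 100004],
    "TARGET": [1, 0, 0, 0, 0],
    "AMT_CREDIT": [406597.5, 1293502.5, 1293502.5, 135000.0, 140000.0],
})

# Rows identical in every column: what drop_duplicates() removes by default.
n_full_dupes = int(toy.duplicated().sum())

# Rows that repeat an application ID, regardless of the other columns.
n_id_dupes = int(toy.duplicated(subset="SK_ID_CURR").sum())

# Keep the first record per ID (an assumption; another tie-break may suit better).
deduped = toy.drop_duplicates(subset="SK_ID_CURR", keep="first")

print("Full-row duplicates:", n_full_dupes)
print("Repeated IDs:", n_id_dupes)
print("Shape after ID-level de-duplication:", deduped.shape)
```

If both counts are zero on the real data, the two checks agree and the default call is sufficient.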


## Step 3: Drop columns with more than 60% missing values

Columns in which more than 60% of the values are missing carry little information and would consist mostly of imputed values, so we drop them.


In [10]:
missing_pct = data.isnull().mean()
cols_to_drop = missing_pct[missing_pct > 0.60].index.tolist()
print("Dropping columns:", cols_to_drop)
data = data.drop(columns=cols_to_drop)
print("Data shape after dropping columns:", data.shape)


Dropping columns: ['OWN_CAR_AGE', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'FLOORSMIN_AVG', 'LIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_AVG', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'FLOORSMIN_MODE', 'LIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MODE', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'FLOORSMIN_MEDI', 'LIVINGAPARTMENTS_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'FONDKAPREMONT_MODE']
Data shape after dropping columns: (307511, 105)
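The 17 dropped columns account for the change from 122 to 105 columns. The threshold test uses a strict inequality, so a column that is exactly 60% missing is kept. The sketch below illustrates this on a made-up frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns at different missingness levels.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "exactly_60": [np.nan, np.nan, np.nan, 1.0, 2.0],          # 60% missing
    "complete": [1, 2, 3, 4, 5],                               # 0% missing
})

# Fraction of missing values per column.
missing_pct = toy.isnull().mean()

# Strictly greater than 0.60, as in the cell above.
cols_to_drop = missing_pct[missing_pct > 0.60].index.tolist()
kept = toy.drop(columns=cols_to_drop)

print(missing_pct)
print("Dropped:", cols_to_drop)
print("Kept:", list(kept.columns))
```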


## Step 4: Impute remaining missing values

- For numeric columns: fill with the column median.
- For categorical columns: fill with the mode, or with the string "MISSING" if the column has no mode (every value is missing).


In [11]:
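for col in data.columns:
    # Skip columns that are already complete.
    if data[col].isnull().any():
        # is_numeric_dtype covers every integer and float width, not just 64-bit.
        if pd.api.types.is_numeric_dtype(data[col]):
            median = data[col].median()
            data[col] = data[col].fillna(median)
        else:
            # mode() is empty when every value in the column is missing.
            mode = data[col].mode()
            if not mode.empty:
                data[col] = data[col].fillna(mode[0])
            else:
                data[col] = data[col].fillna('MISSING')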

# Confirm that no missing values remain:
print(data.isnull().sum().sort_values(ascending=False).head())


SK_ID_CURR            0
TARGET                0
NAME_CONTRACT_TYPE    0
CODE_GENDER           0
FLAG_OWN_CAR          0
dtype: int64
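The same imputation rule can also be written without modifying the frame column by column: collect one fill value per column, then call `fillna` once with a dictionary. This is a sketch on a made-up frame with hypothetical column names, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns (all values are made up).
toy = pd.DataFrame({
    "num_feature": [100.0, np.nan, 300.0, 500.0],
    "cat_feature": ["A", None, "A", "B"],
    "all_missing_cat": pd.Series([None, None, None, None], dtype="object"),
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)

# Numeric columns: median. Categorical columns: mode, or "MISSING" if there is none.
fill_values = toy[num_cols].median().to_dict()
for col in cat_cols:
    mode = toy[col].mode()
    fill_values[col] = mode.iloc[0] if not mode.empty else "MISSING"

# One fillna call with a {column: value} mapping.
imputed = toy.fillna(fill_values)

print(fill_values)
print(imputed)
```

Keeping `fill_values` as a dictionary also makes it easy to log or reuse the exact values that were imputed.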


In [14]:
print("Data shape:", data.shape)
data.head()

Data shape: (307511, 105)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


## Step 5: Check original default rate

Before subsampling, let's check the proportion of defaults (TARGET=1) in the full, cleaned dataset.


In [15]:
default_rate_full = data['TARGET'].mean()
print(f"Original default rate in cleaned dataset: {default_rate_full:.2%}")
print(f"Count of defaults: {data['TARGET'].sum()}")
print(f"Count of non-defaults: {(data['TARGET']==0).sum()}")


Original default rate in cleaned dataset: 8.07%
Count of defaults: 24825
Count of non-defaults: 282686


## Step 6: Stratified subsample (5,000 rows, preserve default rate)

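We draw 5,000 rows with `train_test_split`, stratifying on TARGET so that the sample keeps the original ratio of defaults (TARGET=1). The remainder of the split is discarded.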


In [13]:
sample, _ = train_test_split(
    data, 
    train_size=5000, 
    stratify=data['TARGET'], 
    random_state=42
)
print("Sample shape:", sample.shape)
print("Default rate:", sample['TARGET'].mean())


Sample shape: (5000, 105)
Default rate: 0.0808
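The sample rate of 0.0808 corresponds to 404 defaults out of 5,000 rows, which is the nearest whole number to 5,000 × 24,825 / 307,511 ≈ 403.6. The sketch below repeats the stratified draw on a made-up population where the proportions divide evenly, so the class share is reproduced exactly. The frame and its `feature` column are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy population: 1,000 rows with a 10% positive rate (made-up data).
toy = pd.DataFrame({
    "feature": range(1000),
    "TARGET": [1] * 100 + [0] * 900,
})

# Same call pattern as above: keep the "train" part, discard the rest.
toy_sample, _ = train_test_split(
    toy,
    train_size=100,
    stratify=toy["TARGET"],
    random_state=42,
)
toy_rate = toy_sample["TARGET"].mean()

# Arithmetic behind the real sample: expected number of defaults in 5,000 rows.
expected_defaults = 5000 * 24825 / 307511

print("Toy sample rate:", toy_rate)
print("Expected defaults in the real sample:", round(expected_defaults, 1))
```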


## Step 7: Select analysis variables

We select variables based on their use in Berg et al. (2020) and credit scoring practice.


In [19]:
selected_columns = [
    'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
    'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
    'DAYS_BIRTH', 'CODE_GENDER', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS',
    'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
    'NAME_INCOME_TYPE', 'AMT_INCOME_TOTAL',
    'REGION_POPULATION_RELATIVE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
    'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
    'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_EMAIL',
    'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
    'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
    'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
    'TARGET'
]
# Keep only the requested columns that survived the missing-value filter in Step 3
selected_columns = [col for col in selected_columns if col in sample.columns]
sample = sample[selected_columns]
print("Sample shape:", sample.shape)
sample.head()


Sample shape: (5000, 34)


Unnamed: 0,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,DAYS_BIRTH,CODE_GENDER,CNT_CHILDREN,CNT_FAM_MEMBERS,...,FLAG_EMAIL,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,TARGET
89336,0.505998,0.159679,0.771362,808650.0,26217.0,675000.0,-16699,M,0,2.0,...,0,TUESDAY,11,0,0,0,0,0,0,0
79664,0.88976,0.799346,0.691021,472500.0,44991.0,454500.0,-20431,F,0,1.0,...,0,FRIDAY,16,0,0,0,0,0,0,0
120949,0.344848,0.197456,0.58674,267102.0,21415.5,247500.0,-16296,M,1,3.0,...,0,SATURDAY,8,0,0,0,0,0,0,0
286018,0.505998,0.478192,0.535276,176328.0,11911.5,139500.0,-14065,M,1,3.0,...,0,THURSDAY,18,0,0,0,0,0,0,0
207633,0.505998,0.622893,0.641368,490500.0,19449.0,490500.0,-22022,F,0,2.0,...,0,FRIDAY,15,0,0,0,0,0,0,0
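The resulting shape of (5000, 34) shows that all 34 requested columns were present. The list comprehension would otherwise drop absent names silently, so it can help to report them explicitly. A minimal sketch on a made-up one-row frame, with a shortened, hypothetical request list:

```python
import pandas as pd

# Toy sample and a shortened request list (for illustration only).
toy_sample = pd.DataFrame({
    "EXT_SOURCE_1": [0.5],
    "AMT_CREDIT": [1000.0],
    "TARGET": [0],
})
requested = ["EXT_SOURCE_1", "OWN_CAR_AGE", "AMT_CREDIT", "TARGET"]

# Split the request into available and unavailable columns, preserving order.
present = [col for col in requested if col in toy_sample.columns]
missing = [col for col in requested if col not in toy_sample.columns]

if missing:
    print("Requested but not available:", missing)

subset = toy_sample[present]
print("Subset shape:", subset.shape)
```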


## Step 8: Save cleaned sample and summary statistics

We export the cleaned sample and a table of summary statistics for downstream analysis.


In [22]:
sample.to_csv('/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/home_credit_sample.csv', index=False)

summary = sample.describe(include='all').T
summary['missing_pct'] = sample.isnull().mean()
summary.to_csv('/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/results/tables/home_credit_sample_summary.csv')

print("Saved cleaned sample and summary stats.")


Saved cleaned sample and summary stats.
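The absolute paths above only work on one machine. One possible alternative is to build paths relative to a project root with `pathlib`. In the sketch below, `project_root` is a hypothetical variable that points at a temporary directory so the example runs anywhere; the folder and file names mirror the ones used above, and the data are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical project root; a temporary directory keeps the sketch self-contained.
project_root = Path(tempfile.mkdtemp())
processed_dir = project_root / "data" / "processed"
tables_dir = project_root / "results" / "tables"
for directory in (processed_dir, tables_dir):
    directory.mkdir(parents=True, exist_ok=True)

# Toy sample (made-up values).
toy_sample = pd.DataFrame({
    "AMT_CREDIT": [1000.0, 2000.0],
    "CODE_GENDER": ["M", "F"],
    "TARGET": [0, 1],
})

sample_path = processed_dir / "home_credit_sample.csv"
toy_sample.to_csv(sample_path, index=False)

# Same summary table as above: one row per variable plus the missing share.
summary = toy_sample.describe(include="all").T
summary["missing_pct"] = toy_sample.isnull().mean()
summary.to_csv(tables_dir / "home_credit_sample_summary.csv")

# Read the sample back to confirm the round trip.
reloaded = pd.read_csv(sample_path)
print(reloaded.shape)
```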


## Step 9: Check sample and variable list

Print the final sample size, default rate, and variable list for the paper or notebook log.


In [23]:
print("Final sample size:", sample.shape[0])
print("Final default rate:", sample['TARGET'].mean())
print("Variables used:", selected_columns)


Final sample size: 5000
Final default rate: 0.0808
Variables used: ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_BIRTH', 'CODE_GENDER', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'AMT_INCOME_TOTAL', 'REGION_POPULATION_RELATIVE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_EMAIL', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'TARGET']


## Step 10: Save the variable list


In [25]:
with open('/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/results/tables/selected_variables.txt', 'w') as f:
    f.write('\n'.join(selected_columns))
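A downstream notebook can recover the list by reading the file and splitting on line breaks. The sketch below writes a shortened, illustrative list to a temporary file (the location is an assumption) and reads it back.

```python
import tempfile
from pathlib import Path

# Shortened list for illustration; the real list has 34 entries.
selected_columns = ["EXT_SOURCE_1", "AMT_CREDIT", "TARGET"]

# Temporary location so the sketch runs anywhere.
path = Path(tempfile.mkdtemp()) / "selected_variables.txt"
path.write_text("\n".join(selected_columns), encoding="utf-8")

# One variable name per line; splitlines() drops the line breaks.
loaded = path.read_text(encoding="utf-8").splitlines()
print(loaded)
```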
