# Loan Default Prediction: Data Cleaning & Preprocessing

From our EDA and visualizations, we found several data quality problems that must be fixed before building any models:

- Very large income values that appear to be errors
- DTI values set to 999, which are placeholders
- Missing entries in `emp_length` and `revol_util`
- Categorical fields that need to be encoded

In this notebook, we'll clean and preprocess the dataset to make it ready for modeling.
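Before cleaning, it can help to count how often each of these problems occurs. The snippet below is a minimal sketch on a tiny made-up DataFrame, not the real loan data. The helper name `summarize_quality_issues` and its parameters are hypothetical, and the 4,000,000 income cap simply mirrors the filter applied later in this notebook.

```python
import numpy as np
import pandas as pd


def summarize_quality_issues(df, income_cap=4_000_000, dti_placeholder=999):
    """Count the data quality problems listed above (hypothetical helper)."""
    return {
        # Incomes at or above the cap are treated as likely entry errors
        "income_above_cap": int((df["annual_inc"] >= income_cap).sum()),
        # DTI rows that carry the placeholder value
        "dti_placeholder": int((df["dti"] == dti_placeholder).sum()),
        "missing_emp_length": int(df["emp_length"].isna().sum()),
        "missing_revol_util": int(df["revol_util"].isna().sum()),
        # Non-numeric columns will need encoding before modeling
        "categorical_columns": [
            col for col in df.columns
            if not pd.api.types.is_numeric_dtype(df[col])
        ],
    }


# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    "annual_inc": [52_000.0, 9_000_000.0, 71_000.0],
    "dti": [18.2, 999.0, 24.5],
    "emp_length": ["3 years", None, "10+ years"],
    "revol_util": [45.1, np.nan, 12.0],
    "grade": ["B", "C", "A"],
})

report = summarize_quality_issues(toy)
print(report)
```

Running the same helper on `df_subset` would give a per-issue count to compare against after cleaning.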

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Load the first 15,000 rows of the raw data
df = pd.read_csv('../raw_data/accepted_2007_to_2018Q4.csv', nrows=15000)


In [2]:
# Select key columns (same as in the EDA)
key_columns = [
    'loan_amnt', 'int_rate', 'grade', 'emp_length', 'annual_inc',
    'dti', 'fico_range_low', 'revol_util', 'purpose',
    'home_ownership', 'loan_status'
]
df_subset = df[key_columns].copy()

In [None]:
##  Handle Missing Values

# Check missing values
print("Missing Values Before Cleaning:")
print(df_subset.isnull().sum())
print(f"\nTotal missing: {df_subset.isnull().sum().sum()}")

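# Create a clean copy
df_clean = df_subset.copy()

# emp_length is categorical: label missing entries as 'Unknown'
df_clean['emp_length'] = df_clean['emp_length'].fillna('Unknown')
# revol_util is numeric: impute missing entries with the column median
df_clean['revol_util'] = df_clean['revol_util'].fillna(df_clean['revol_util'].median())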

# Verify missing values are handled
print("\n" + "="*50)
print("Missing Values After Cleaning:")
print(df_clean.isnull().sum())
print(f"\nTotal missing: {df_clean.isnull().sum().sum()}")

Missing Values Before Cleaning:
loan_amnt           0
int_rate            0
grade               0
emp_length        895
annual_inc          0
dti                 0
fico_range_low      0
revol_util          7
purpose             0
home_ownership      0
loan_status         0
dtype: int64

Total missing: 902

Missing Values After Cleaning:
loan_amnt         0
int_rate          0
grade             0
emp_length        0
annual_inc        0
dti               0
fico_range_low    0
revol_util        0
purpose           0
home_ownership    0
loan_status       0
dtype: int64

Total missing: 0
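Median imputation hides which rows were originally missing. One optional variation, sketched here on made-up values, is to record a flag column before filling. The column name `revol_util_missing` is an assumption for illustration and is not part of the original dataset or of the cleaning above.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "emp_length": ["2 years", None, "10+ years", None],
    "revol_util": [30.0, np.nan, 50.0, 70.0],
})

# Record which rows were missing before imputing (hypothetical flag column)
toy["revol_util_missing"] = toy["revol_util"].isna().astype(int)

# Same imputation strategy as the cell above
toy["emp_length"] = toy["emp_length"].fillna("Unknown")
toy["revol_util"] = toy["revol_util"].fillna(toy["revol_util"].median())

print(toy)
```

Whether the flag is useful depends on the model. With only 7 missing `revol_util` values in this sample it is unlikely to matter much, but the pattern scales to columns with more gaps.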


In [None]:
# Remove extreme income outliers (annual income of 4,000,000 or more)
df_clean = df_clean[df_clean['annual_inc'] < 4000000].copy()

# Replace the DTI placeholder value 999 with NaN, then fill with the median
df_clean.loc[df_clean['dti'] == 999, 'dti'] = np.nan
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection: that form acts on a copy and raises a FutureWarning in pandas
df_clean['dti'] = df_clean['dti'].fillna(df_clean['dti'].median())
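The remaining item from the list at the top is encoding the categorical fields. The snippet below is a rough sketch on made-up rows, not the notebook's actual encoding step. It assumes `grade` is ordered from A (best) to G (worst) and treats `home_ownership` as unordered. Both choices, and the `grade_encoded` column name, are assumptions.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000],
    "grade": ["A", "C", "B"],
    "home_ownership": ["RENT", "MORTGAGE", "RENT"],
})

# Ordinal encoding for grade: assumes A is the best grade and G the worst
grade_order = {grade: rank for rank, grade in enumerate("ABCDEFG")}
toy["grade_encoded"] = toy["grade"].map(grade_order)

# One-hot encoding for the unordered home_ownership field
encoded = pd.get_dummies(
    toy.drop(columns="grade"),
    columns=["home_ownership"],
    dtype=int,
)

print(encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category. Fields with many levels, such as `purpose`, may call for a different approach.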
