# Loan Default Analysis and Prediction

## Data Scientist: 

### Project Overview:                                                                                              

This project examines loan borrower behavior to support both borrower segmentation and loan default detection, with the goal of reducing financial risk while improving understanding of borrower behavior. Rather than optimizing a single model or metric, the focus is on careful exploratory analysis, feature characterization, and evaluating multiple modeling approaches to understand where and why loan default occur.

The workflow follows an end-to-end data science process, including data validation, exploratory analysis, feature engineering, feature selection, model comparison using scikit-learn pipelines, and post-model error analysis.

### About the data: 

This dataset was taken from Coursera's Loan Default Prediction Challenge and was programmatically retrieved using the Kagglehub package. It contains 255,347 records and 18 attributes including:

*  LoanID: A unique identifier for each loan.
*  Age: The age of the borrower.
*  Income: The annual income of the borrower.
*  LoanAmount: The amount of money being borrowed.
*  CreditScore: The credit score of the borrower, indicating their creditworthiness.
*  MonthsEmployed: The number of months the borrower has been employed.
*  NumCreditLines: The number of credit lines the borrower has open.
*  InterestRate: The interest rate for the loan.
*  LoanTerm: The term length of the loan in months.
*  DTIRatio: The debt-to-income ratio, indicating the borrower's debt compared to their income.
*  Education: The highest level of education attained by the borrower (PhD, Master's, Bachelor's, High School).
*  EmploymentType: The type of employment status of the borrower (Full-time, Part-time, Self-employed, Unemployed).
*  MaritalStatus: The marital status of the borrower (Single, Married, Divorced).
*  HasMortgage: Whether the borrower has a mortgage (Yes or No).
*  HasDependents: Whether the borrower has dependents (Yes or No).
*  LoanPurpose: The purpose of the loan (Home, Auto, Education, Business, Other).
*  HasCoSigner: Whether the loan has a co-signer (Yes or No).
*  Default: The binary target variable indicating whether the loan defaulted (1) or not (0).



Default is the target variable.

### About the task:

This project (a binary classification task) will focus on understanding and predicting borrowers at high risk of defaulting on their loan. 

Due to the imbalanced nature of the target variable, 'Default', metrics such as PR-AUC, recall and precision will be used to evaluate and monitor model performance.

### Install and load packages

In [1]:
%pip install xgboost scipy catboost




[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip



Collecting catboost
  Downloading catboost-1.2.8-cp313-cp313-win_amd64.whl.metadata (1.5 kB)
Collecting graphviz (from catboost)
  Downloading graphviz-0.21-py3-none-any.whl.metadata (12 kB)
Collecting plotly (from catboost)
  Downloading plotly-6.5.2-py3-none-any.whl.metadata (8.5 kB)
Downloading catboost-1.2.8-cp313-cp313-win_amd64.whl (102.4 MB)
   ---------------------------------------- 0.0/102.4 MB ? eta -:--:--
    --------------------------------------- 2.1/102.4 MB 11.3 MB/s eta 0:00:09
   - -------------------------------------- 4.2/102.4 MB 10.7 MB/s eta 0:00:10
   -- ------------------------------------- 6.0/102.4 MB 10.1 MB/s eta 0:00:10
   --- ------------------------------------ 7.9/102.4 MB 9.6 MB/s eta 0:00:10
   --- ------------------------------------ 10.0/102.4 MB 9.8 MB/s eta 0:00:10
   ---- ----------------------------------- 12.3/102.4 MB 9.8 MB/s eta 0:00:10
   ----- ---------------------------------- 14.2/102.4 MB 9.6 MB/s eta 0:00:10
   ------ ---------------

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')


import shutil
import kagglehub
from pathlib import Path
from scipy.stats import chi2_contingency, spearmanr
import scipy.stats as stats
from scipy import sparse
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, FunctionTransformer
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif, RFE
from sklearn.compose import ColumnTransformer
import xgboost as xgb
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, StratifiedKFold, cross_validate, learning_curve, train_test_split
from sklearn.metrics import silhouette_score, classification_report, ConfusionMatrixDisplay, confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay,roc_auc_score, average_precision_score, roc_curve, recall_score, accuracy_score, precision_score, f1_score  
from sklearn.pipeline import Pipeline, FunctionTransformer, FeatureUnion
from sklearn.inspection import permutation_importance

In [None]:
"""# Download latest version
path = kagglehub.dataset_download("nikhil1e9/loan-default")

print("Path to dataset files:", path)"""

Downloading from https://www.kaggle.com/api/v1/datasets/download/nikhil1e9/loan-default?dataset_version_number=2...


100%|██████████| 7.86M/7.86M [00:01<00:00, 4.36MB/s]

Extracting files...





Path to dataset files: C:\Users\peace\.cache\kagglehub\datasets\nikhil1e9\loan-default\versions\2


In [None]:
# retrieve cache location
kaggle_cache_csv = Path(
    r"C:\Users\peace\.cache\kagglehub\datasets\nikhil1e9"
    r"\loan-default\versions\2\Loan_default.csv")

In [4]:
# make project data folder
project_data_dir = Path("data")
project_data_dir.mkdir(exist_ok=True)

destination = project_data_dir / "Loan_default.csv"
shutil.copy(kaggle_cache_csv, destination)
print(f"Copied dataset to: {destination.resolve()}")

Copied dataset to: C:\Users\peace\OneDrive\Documents\GitHub\Machine Learning Projects\data-mining\loan-default-analysis-and-prediction\loan-default-prediction\data\Loan_default.csv


In [None]:
# load data using relative path
DATA_PATH = Path("data/Loan_default.csv")

if not DATA_PATH.exists():
    raise FileNotFoundError(
        "Dataset not found. Please place Loan_default.csv in the /data directory.")

### Load and sanity check data

In [6]:
# review data
pd.set_option('display.max_columns', None)
df = pd.read_csv(DATA_PATH)
df.head()

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


In [7]:
# sample some rows
df.sample(10)

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
158605,VCH0VF9WSJ,33,32456,128549,405,116,2,8.61,12,0.22,Bachelor's,Full-time,Divorced,No,No,Home,No,0
95905,W75KLFOU7G,51,47066,48629,670,34,1,7.84,24,0.33,High School,Self-employed,Divorced,Yes,No,Home,Yes,1
253630,IQHQK9Y124,50,134834,150103,659,15,2,16.1,36,0.79,Bachelor's,Self-employed,Single,Yes,No,Auto,Yes,0
153556,YJRZ50A273,18,111622,164233,365,11,3,7.86,12,0.88,PhD,Part-time,Married,Yes,No,Business,No,0
225297,DW5WGB68R7,55,91445,17117,632,117,3,14.6,60,0.53,PhD,Self-employed,Married,No,No,Home,No,0
63454,KY0DUS8B6L,57,72869,27913,300,9,4,11.68,24,0.4,High School,Unemployed,Divorced,Yes,No,Other,No,0
27222,7QEW639B02,59,149730,112100,817,2,3,6.18,24,0.25,Master's,Self-employed,Single,No,No,Auto,Yes,0
120573,J46P2VNJNJ,58,99294,54499,456,27,4,4.03,60,0.51,PhD,Self-employed,Divorced,Yes,No,Education,Yes,0
240456,SA7EBJCXYS,53,66029,143869,730,105,4,4.58,48,0.8,Master's,Full-time,Divorced,Yes,No,Home,No,0
183417,ZWHH4UZZD7,39,71763,224219,661,110,2,2.92,48,0.72,Bachelor's,Self-employed,Divorced,No,Yes,Other,No,0


In [8]:
# review the data structure
df = df.drop(columns=['LoanID'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255347 entries, 0 to 255346
Data columns (total 17 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Age             255347 non-null  int64  
 1   Income          255347 non-null  int64  
 2   LoanAmount      255347 non-null  int64  
 3   CreditScore     255347 non-null  int64  
 4   MonthsEmployed  255347 non-null  int64  
 5   NumCreditLines  255347 non-null  int64  
 6   InterestRate    255347 non-null  float64
 7   LoanTerm        255347 non-null  int64  
 8   DTIRatio        255347 non-null  float64
 9   Education       255347 non-null  object 
 10  EmploymentType  255347 non-null  object 
 11  MaritalStatus   255347 non-null  object 
 12  HasMortgage     255347 non-null  object 
 13  HasDependents   255347 non-null  object 
 14  LoanPurpose     255347 non-null  object 
 15  HasCoSigner     255347 non-null  object 
 16  Default         255347 non-null  int64  
dtypes: float64

In [9]:
# get column names
df.columns.tolist()

['Age',
 'Income',
 'LoanAmount',
 'CreditScore',
 'MonthsEmployed',
 'NumCreditLines',
 'InterestRate',
 'LoanTerm',
 'DTIRatio',
 'Education',
 'EmploymentType',
 'MaritalStatus',
 'HasMortgage',
 'HasDependents',
 'LoanPurpose',
 'HasCoSigner',
 'Default']

In [None]:
# separate columns by type
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = df.select_dtypes(include=['object']).columns.tolist()

In [11]:
# get descriptive statistics for numeric features
df[num_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,255347.0,43.498306,14.990258,18.0,31.0,43.0,56.0,69.0
Income,255347.0,82499.304597,38963.013729,15000.0,48825.5,82466.0,116219.0,149999.0
LoanAmount,255347.0,127578.865512,70840.706142,5000.0,66156.0,127556.0,188985.0,249999.0
CreditScore,255347.0,574.264346,158.903867,300.0,437.0,574.0,712.0,849.0
MonthsEmployed,255347.0,59.541976,34.643376,0.0,30.0,60.0,90.0,119.0
NumCreditLines,255347.0,2.501036,1.117018,1.0,2.0,2.0,3.0,4.0
InterestRate,255347.0,13.492773,6.636443,2.0,7.77,13.46,19.25,25.0
LoanTerm,255347.0,36.025894,16.96933,12.0,24.0,36.0,48.0,60.0
DTIRatio,255347.0,0.500212,0.230917,0.1,0.3,0.5,0.7,0.9
Default,255347.0,0.116128,0.320379,0.0,0.0,0.0,0.0,1.0


Descriptive (numeric) considerations and insights:

*  Data doesn't appear to have extreme values, so outlier removal techniques won't be necessary
*  Age could benefit from being binned/discretized
*  Average customer age: ~ 43.5 years
*  Average income: ~ $82,500/year
*  Average loan amount: ~ $127,600
*  Average credit score: ~ 574
*  Average months employed: ~ 60 (roughly 5 years)
*  Average number of credit lines: ~ 2.5
*  Average interest rate: ~ 13.5%
*  Average loan term: ~ 36 months (roughly 3 years)
*  Average DTI ratio: ~ 50%
*  Baseline default rate: ~12%

Most borrowers have stable and moderate income, less than average credit scores and a considerable amount of existing financial obligations.

In [12]:
# get descriptive statistics for categorical features
df[cat_cols].describe().T

Unnamed: 0,count,unique,top,freq
Education,255347,4,Bachelor's,64366
EmploymentType,255347,4,Part-time,64161
MaritalStatus,255347,3,Married,85302
HasMortgage,255347,2,Yes,127677
HasDependents,255347,2,Yes,127742
LoanPurpose,255347,5,Business,51298
HasCoSigner,255347,2,Yes,127701


Descriptive (categorical) considerations and insights:

*  Features seem fairly uniform, based on unique counts and frequency
*  Most borrowers:
    *  Have a Bachelor's degree
    *  Are employed part-time
    *  Are married
    *  Have a mortgage
    *  Have children
    *  Apply for business loans
    *  Applied with a co-signer

Most borrowers appear to be working parents and/or entrepreneurs, who may not have a strong enough credit profile to be approved individually 

In [13]:
# review categorical features for typos, casing issues and unexpected values
for col in cat_cols:
    print(f'{col}: {sorted(df[col].unique())} \n')

Education: ["Bachelor's", 'High School', "Master's", 'PhD'] 

EmploymentType: ['Full-time', 'Part-time', 'Self-employed', 'Unemployed'] 

MaritalStatus: ['Divorced', 'Married', 'Single'] 

HasMortgage: ['No', 'Yes'] 

HasDependents: ['No', 'Yes'] 

LoanPurpose: ['Auto', 'Business', 'Education', 'Home', 'Other'] 

HasCoSigner: ['No', 'Yes'] 



There doesn't appear to be any unexpected values in the categorical features.
Education should be ordered.

In [14]:
# check for missing
df.isna().sum().sort_values(ascending=False)

Age               0
Income            0
LoanAmount        0
CreditScore       0
MonthsEmployed    0
NumCreditLines    0
InterestRate      0
LoanTerm          0
DTIRatio          0
Education         0
EmploymentType    0
MaritalStatus     0
HasMortgage       0
HasDependents     0
LoanPurpose       0
HasCoSigner       0
Default           0
dtype: int64

In [15]:
# check for dups
df.duplicated().sum()

np.int64(0)

### Data Cleaning

A few items will be addressed prior to Exploratory Data Analysis including:

*  type conversions 
*  binning
*  review category proportions

In [16]:
# separate cat features by type
ordinal_cols = ['Education']
nominal_cols = [col for col in cat_cols if col not in ordinal_cols]

In [17]:
# create order for ordinal feature
edu_order = ["High School", "Bachelor's", "Master's", "PhD"]

# apply ordering
df['Education'] = pd.Categorical(df['Education'], categories=edu_order, ordered=True)
df['Education'].cat.categories

Index(['High School', 'Bachelor's', 'Master's', 'PhD'], dtype='object')

In [18]:
# convert nominal features to 'category' dtype
for col in nominal_cols:
    df[col] = df[col].astype('category')
df[nominal_cols].dtypes

EmploymentType    category
MaritalStatus     category
HasMortgage       category
HasDependents     category
LoanPurpose       category
HasCoSigner       category
dtype: object

In [19]:
# bin age feature
df['Age_bin'] = pd.cut(df['Age'], bins=[18, 25, 35, 45, 55, 65, 100], labels=['18-24', '25-34', '35-44', '45-54', '55-64', '65+'])
df['Age_bin'].value_counts()

Age_bin
25-34    49408
35-44    49220
45-54    49148
55-64    49063
18-24    34132
65+      19492
Name: count, dtype: int64

In [26]:
# calculate category proportions
for col in cat_cols:
    print(f"Proportions for {col} categories:")
    print(df[col].value_counts(normalize=True).round(3))
    print('\n')

Proportions for Education categories:
Education
Bachelor's     0.252
High School    0.250
Master's       0.249
PhD            0.249
Name: proportion, dtype: float64


Proportions for EmploymentType categories:
EmploymentType
Part-time        0.251
Unemployed       0.250
Self-employed    0.249
Full-time        0.249
Name: proportion, dtype: float64


Proportions for MaritalStatus categories:
MaritalStatus
Married     0.334
Divorced    0.333
Single      0.333
Name: proportion, dtype: float64


Proportions for HasMortgage categories:
HasMortgage
Yes    0.5
No     0.5
Name: proportion, dtype: float64


Proportions for HasDependents categories:
HasDependents
Yes    0.5
No     0.5
Name: proportion, dtype: float64


Proportions for LoanPurpose categories:
LoanPurpose
Business     0.201
Home         0.201
Education    0.200
Other        0.199
Auto         0.199
Name: proportion, dtype: float64


Proportions for HasCoSigner categories:
HasCoSigner
Yes    0.5
No     0.5
Name: proportion, dtype: 

Categorical feature proportions insights:

* Most categories are relatively stable and uniform in nature

## Exploratory Data Analysis:

*  Univariate Analysis

    1. Categorical features: Barplots
    2. Numerical features: Histograms (w/ KDE) 

*  Bivariate Analysis

    1. Categorical vs Target: Heatmap of crosstabs
    2. Numerical vs Target: Violinplots
    3. Behavioral relationships with Target: default-rate barplots
    4. Effect size: Lift barplots/lineplots
    5. Association Analysis: Cramer's V/ Spearman correlation barplots

*  Customer segmentation