# Loan Default Risk Prediction System
### Machine Learning Approach for Credit Risk Assessment

## 1. Problem Statement

Loan defaults cause direct financial losses for lending institutions.
The goal of this project is to build a machine learning model
that predicts the probability of loan default in order to support
risk-aware lending decisions.


In [1]:
import numpy as np
import pandas as pd

In [3]:
# Load the credit risk dataset from a local path (adjust to your machine).
df = pd.read_csv(r"C:\Users\Srija\Downloads\credit_risk_dataset.csv\credit_risk_dataset.csv")

## 2. Dataset Overview


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB
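
The non-null counts above imply that two columns contain missing values. A minimal sketch of that arithmetic, using only the counts printed by `df.info()` (the names `non_null`, `missing`, and `missing_pct` are illustrative and not part of the notebook):

```python
# Non-null counts copied from the df.info() output above.
n_rows = 32581
non_null = {"person_emp_length": 31686, "loan_int_rate": 29465}

# Missing values per column = total rows - non-null entries.
missing = {col: n_rows - count for col, count in non_null.items()}
missing_pct = {col: round(100 * n / n_rows, 2) for col, n in missing.items()}

for col in missing:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]}%)")
```

On the loaded DataFrame, `df.isna().sum()` should report the same counts directly.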


In [7]:
df.head()

|   | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.1 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |


In [9]:
type(df)

pandas.core.frame.DataFrame

## 3. Data Preprocessing
- Separate features and target
- Identify categorical and numerical columns
- Impute missing numeric values with the median
- One-hot encode the categorical columns
- Prevent data leakage by fitting the preprocessing on the training data only
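
To illustrate the last bullet, here is a small sketch of fitting an imputer on training rows only and reusing the learned median on test rows. The arrays are hypothetical toy values, not taken from the credit dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column data (hypothetical values, not from the credit dataset).
train = np.array([[1.0], [3.0], [np.nan], [5.0]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                     # median learned from training rows only
test_filled = imputer.transform(test)  # test rows reuse the training median

print(imputer.statistics_)  # learned per-column fill value
print(test_filled)
```

Because the extreme test value (100.0) never influences the fill value, information from the test split cannot leak into preprocessing.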


In [11]:
# Features: every column except the target; target: loan_status (1 = default).
x = df.drop("loan_status", axis=1)
y = df["loan_status"]

In [13]:
cat_cols = x.select_dtypes(include="object").columns
num_cols = x.select_dtypes(exclude="object").columns

In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Trees do not need feature scaling, so numeric columns are only median-imputed.
tree_numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
tree_preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), cat_cols),
    ("num", tree_numeric, num_cols),
])

In [16]:
from sklearn.model_selection import train_test_split
# 70/30 split, stratified so both splits keep the same default rate.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42, stratify=y
)

In [19]:
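# Fit the preprocessor on the training split only, then apply it to the test split.
# These arrays are for inspection; the model pipelines preprocess internally.
x_train_proc = tree_preprocessor.fit_transform(x_train)
x_test_proc = tree_preprocessor.transform(x_test)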

## 4. Model 1: Decision Tree Classifier
- max_depth=5
- class_weight='balanced'
- Evaluated using ROC-AUC
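
As a sketch of what `class_weight='balanced'` does: scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The example below applies that formula to the class counts from the test-split report in this section (7642 non-defaults, 2133 defaults), only because the training-split counts are not printed in this notebook. It assumes the stratified training split has nearly the same proportions.

```python
# Class counts taken from the test-split classification report (support column).
# Assumption: the stratified training split has nearly the same class proportions.
counts = {0: 7642, 1: 2133}
n_samples = sum(counts.values())
n_classes = len(counts)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

The minority (default) class receives the larger weight, so misclassified defaults cost more during training.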

In [27]:
from sklearn.tree import DecisionTreeClassifier
# Preprocessing and model live in one pipeline, so fit() sees only training rows.
tree_pipeline = Pipeline(steps=[
    ("preprocessor", tree_preprocessor),
    ("model", DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=42)),
]).fit(x_train, y_train)

In [29]:
from sklearn.metrics import classification_report, roc_auc_score
y_pred_tree = tree_pipeline.predict(x_test)               # hard labels at the default threshold
y_prob_tree = tree_pipeline.predict_proba(x_test)[:, 1]   # predicted probability of default
print(classification_report(y_test, y_pred_tree))
print("tree ROC-AUC:", roc_auc_score(y_test, y_prob_tree))

              precision    recall  f1-score   support

           0       0.93      0.92      0.92      7642
           1       0.73      0.74      0.73      2133

    accuracy                           0.88      9775
   macro avg       0.83      0.83      0.83      9775
weighted avg       0.88      0.88      0.88      9775

tree ROC-AUC: 0.8784081861619719
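
ROC-AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. A minimal sketch of that pairwise definition follows; `pairwise_auc` is a hypothetical helper for illustration, and the toy labels and scores are not model output:

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical, not model output).
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, s_toy)
print(auc_toy)
```

This quadratic loop is only for intuition; `roc_auc_score`, as used in the cell above, is the practical choice.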


## 5. Model 2: Logistic Regression (Baseline Comparison)

In [31]:
from sklearn.preprocessing import StandardScaler
numeric_pipeline = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])

In [33]:
log_preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), cat_cols),
    ("num", numeric_pipeline, num_cols),
])
# For inspection only; the logistic pipeline preprocesses internally when it is fit.
x_train_proc = log_preprocessor.fit_transform(x_train)
x_test_proc = log_preprocessor.transform(x_test)

In [48]:
from sklearn.linear_model import LogisticRegression
log_model = Pipeline(steps=[
    ("preprocessor", log_preprocessor),
    ("model", LogisticRegression(max_iter=3000, class_weight="balanced")),
]).fit(x_train, y_train)
y_pred_log = log_model.predict(x_test)
y_prob_log = log_model.predict_proba(x_test)[:, 1]
print(classification_report(y_test, y_pred_log))
print("Logistic ROC-AUC:", roc_auc_score(y_test, y_prob_log))


              precision    recall  f1-score   support

           0       0.93      0.82      0.87      7642
           1       0.55      0.79      0.65      2133

    accuracy                           0.81      9775
   macro avg       0.74      0.80      0.76      9775
weighted avg       0.85      0.81      0.82      9775

Logistic ROC-AUC: 0.8738219450754111
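
The F1 score in the report is the harmonic mean of precision and recall. As a quick check, this sketch recomputes it for the default class from the rounded values printed above (0.55 and 0.79). Because the inputs are rounded, the result can differ slightly from a score computed on raw predictions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall for class 1 from the logistic regression report.
f1_default_log = f1(0.55, 0.79)
print(round(f1_default_log, 2))
```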


## 6. Cross-Validation for Robust Evaluation


In [50]:
from sklearn.model_selection import cross_val_score
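# The pipeline is cloned and refit in each fold, so preprocessing never sees the validation fold.
log_cv_scores = cross_val_score(log_model, x, y, cv=5, scoring="roc_auc")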
print("Logistic CV ROC-AUC:", log_cv_scores.mean())
print("Logistic CV Std:", log_cv_scores.std())

Logistic CV ROC-AUC: 0.8681262948560045
Logistic CV Std: 0.007916417018385084


In [42]:
tree_cv_scores = cross_val_score(tree_pipeline, x, y, cv=5, scoring="roc_auc")
print("Tree CV ROC-AUC:", tree_cv_scores.mean())
print("Tree CV Std:", tree_cv_scores.std())

Tree CV ROC-AUC: 0.8782292209435913
Tree CV Std: 0.017766958636953985
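
To make explicit what `cross_val_score` does with a pipeline, the sketch below reproduces it with a manual loop. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the credit dataset) and relies on the fact that an integer `cv` with a classifier uses unshuffled stratified folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the credit dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", DecisionTreeClassifier(max_depth=3, random_state=0)),
])

# Manual loop: clone and refit the whole pipeline on each training fold.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_pipe = clone(pipe).fit(X_demo[train_idx], y_demo[train_idx])
    val_prob = fold_pipe.predict_proba(X_demo[val_idx])[:, 1]
    manual_scores.append(roc_auc_score(y_demo[val_idx], val_prob))

auto_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(np.allclose(manual_scores, auto_scores))
```

Since the imputer is refit inside every fold, each validation fold stays unseen during preprocessing, which is why passing the full pipeline to `cross_val_score` does not leak data.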


## 7. Final Model Selection
Final Model: Decision Tree Classifier

## 8. Result
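The Decision Tree model slightly outperformed Logistic Regression on ROC-AUC, both on the held-out test set (0.878 vs. 0.874) and in 5-fold cross-validation (0.878 vs. 0.868), while keeping precision and recall for the default class balanced (0.73 and 0.74). Cross-validation scores were stable across folds for both models, with a standard deviation of about 0.018 for the Decision Tree and 0.008 for Logistic Regression.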