<a href="https://colab.research.google.com/github/Thuan-ML/ML-II---Identifying-risky-borrowers-/blob/main/Notebook/project_Identifying_risky_borrowers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Identifying Risky Borrowers Using Logistic Regression, AdaBoost and XGBoost**

# **1. Introduction**

In this project, we want to identify borrowers who are likely to default on a loan. Using a loan dataset from Kaggle, we frame the problem as a binary classification task, where each borrower is classified as either defaulting or non-defaulting.

We start with a simple Logistic Regression model as a baseline, then apply AdaBoost as an intermediate model, and finally use XGBoost as a more advanced method to examine whether increasing model complexity improves the identification of risky borrowers.

# **2. Import Libraries**

In [None]:
!git clone https://github.com/Thuan-ML/ML-II---Identifying-risky-borrowers-.git

import os
os.chdir("ML-II---Identifying-risky-borrowers-")

Cloning into 'ML-II---Identifying-risky-borrowers-'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 11 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), 5.30 MiB | 16.60 MiB/s, done.


In [None]:

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier


# **3. Data Loading**

In [None]:
df = pd.read_excel("Data/Loan.xlsx")
df.head()

Unnamed: 0,ApplicationDate,Age,AnnualIncome,CreditScore,EmploymentStatus,EducationLevel,Experience,LoanAmount,LoanDuration,MaritalStatus,...,MonthlyIncome,UtilityBillsPaymentHistory,JobTenure,NetWorth,BaseInterestRate,InterestRate,MonthlyLoanPayment,TotalDebtToIncomeRatio,LoanApproved,RiskScore
0,2018-01-01,45,39948,617,Employed,Master,22,13152,48,Married,...,3329.0,0.724972,11,126928,0.199652,0.22759,419.805992,0.181077,0,49.0
1,2018-01-02,38,39709,628,Employed,Associate,15,26045,48,Single,...,3309.083333,0.935132,3,43609,0.207045,0.201077,794.054238,0.389852,0,52.0
2,2018-01-03,47,40724,570,Employed,Bachelor,26,17627,36,Married,...,3393.666667,0.872241,6,5205,0.217627,0.212548,666.406688,0.462157,0,52.0
3,2018-01-04,58,69084,545,Employed,High School,34,37898,96,Single,...,5757.0,0.896155,5,99452,0.300398,0.300911,1047.50698,0.313098,0,54.0
4,2018-01-05,37,103264,594,Employed,Associate,17,9184,36,Married,...,8605.333333,0.941369,5,227019,0.197184,0.17599,330.17914,0.07021,1,36.0


# **4. Data Overview**

In [None]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   ApplicationDate             20000 non-null  datetime64[ns]
 1   Age                         20000 non-null  int64         
 2   AnnualIncome                20000 non-null  int64         
 3   CreditScore                 20000 non-null  int64         
 4   EmploymentStatus            20000 non-null  object        
 5   EducationLevel              20000 non-null  object        
 6   Experience                  20000 non-null  int64         
 7   LoanAmount                  20000 non-null  int64         
 8   LoanDuration                20000 non-null  int64         
 9   MaritalStatus               20000 non-null  object        
 10  NumberOfDependents          20000 non-null  int64         
 11  HomeOwnershipStatus         20000 non-null  object    

Unnamed: 0,ApplicationDate,Age,AnnualIncome,CreditScore,Experience,LoanAmount,LoanDuration,NumberOfDependents,MonthlyDebtPayments,CreditCardUtilizationRate,...,MonthlyIncome,UtilityBillsPaymentHistory,JobTenure,NetWorth,BaseInterestRate,InterestRate,MonthlyLoanPayment,TotalDebtToIncomeRatio,LoanApproved,RiskScore
count,20000,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,...,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0
mean,2045-05-18 12:00:00,39.7526,59161.47355,571.6124,17.52275,24882.8678,54.057,1.5173,454.2927,0.286381,...,4891.715521,0.799918,5.00265,72294.32,0.239124,0.23911,911.607052,0.402182,0.239,50.76678
min,2018-01-01 00:00:00,18.0,15000.0,343.0,0.0,3674.0,12.0,0.0,50.0,0.000974,...,1250.0,0.259203,0.0,1000.0,0.130101,0.11331,97.030193,0.016043,0.0,28.8
25%,2031-09-09 18:00:00,32.0,31679.0,540.0,9.0,15575.0,36.0,0.0,286.0,0.160794,...,2629.583333,0.727379,3.0,8734.75,0.213889,0.209142,493.7637,0.179693,0.0,46.0
50%,2045-05-18 12:00:00,40.0,48566.0,578.0,17.0,21914.5,48.0,1.0,402.0,0.266673,...,4034.75,0.820962,5.0,32855.5,0.236157,0.23539,728.511452,0.302711,0.0,52.0
75%,2059-01-25 06:00:00,48.0,74391.0,609.0,25.0,30835.0,72.0,2.0,564.0,0.390634,...,6163.0,0.892333,6.0,88825.5,0.261533,0.265532,1112.770759,0.509214,0.0,56.0
max,2072-10-03 00:00:00,80.0,485341.0,712.0,61.0,184732.0,120.0,5.0,2919.0,0.91738,...,25000.0,0.999433,16.0,2603208.0,0.405029,0.446787,10892.62952,4.647657,1.0,84.0
std,,11.622713,40350.845168,50.997358,11.316836,13427.421217,24.664857,1.386325,240.507609,0.159793,...,3296.771598,0.120665,2.236804,117920.0,0.035509,0.042205,674.583473,0.338924,0.426483,7.778262


# **5. Split Data into Features and Target Variables**

The task is formulated as a supervised binary classification problem.
Given the borrower features X, the objective is to predict the binary target y, the `LoanApproved` column (1 = approved, 0 = not approved).

In [None]:
target = "LoanApproved"
# Feature variables
X = df.drop(columns=[target])
# Target variable
y = df[target]


# **6. Train–test split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
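To illustrate what `stratify=y` does in the split above, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made-up stand-ins (not the loan data), with a positive rate of roughly 24% chosen to resemble the mean of `LoanApproved` reported by `df.describe()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, roughly 24% positive class (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.24).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the positive rate in each split tracks the overall rate
print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```

Without stratification the test split could, by chance, contain noticeably more or fewer positive cases than the training split, which makes precision and recall harder to compare.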



# **7. Data Preprocessing**

## **7.1 Encoding categorical variables**

In [None]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

X_train, X_test = X_train.align(X_test, axis=1, fill_value=0)
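The role of `align` is easiest to see on a toy example. The sketch below uses two small made-up frames (`train_demo`, `test_demo`) in which the test split happens to lack one category. The category `Widowed` is assumed purely for illustration.

```python
import pandas as pd

# Toy frames: the "test" split happens to lack the 'Widowed' category
train_demo = pd.DataFrame({"MaritalStatus": ["Married", "Single", "Widowed"]})
test_demo = pd.DataFrame({"MaritalStatus": ["Single", "Married"]})

train_enc = pd.get_dummies(train_demo, drop_first=True)
test_enc = pd.get_dummies(test_demo, drop_first=True)

# align() defaults to an outer join on the columns, so both frames end up
# with the union of the dummy columns; missing ones are filled with 0
train_enc, test_enc = train_enc.align(test_enc, axis=1, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(test_enc)
```

One caveat: because `drop_first=True` is applied to each split separately, the dropped baseline category could differ if a split lacked the alphabetically first category. Fitting an encoder on the training split only (for example scikit-learn's `OneHotEncoder`) is one way to avoid this.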

## **7.2 Cleaning datatypes**

In [None]:
# Drop datetime columns
datetime_cols = X_train.select_dtypes(include=["datetime64[ns]"]).columns
X_train = X_train.drop(columns=datetime_cols)
X_test = X_test.drop(columns=datetime_cols)

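# Cast every column to int, which turns the boolean dummy columns into 0/1.
# Note: this also truncates float columns (e.g. InterestRate) toward zero.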
X_train = X_train.astype(int)
X_test = X_test.astype(int)
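A blanket `astype(int)` affects every column, not only the boolean ones. The following sketch, on a made-up two-column frame, contrasts it with casting only the boolean columns. The values and the dummy column name are illustrative, and this is one possible alternative rather than the approach the rest of the notebook was run with.

```python
import pandas as pd

# Toy frame mixing a float column with a boolean dummy column
demo = pd.DataFrame({
    "InterestRate": [0.22759, 0.201077],
    "MaritalStatus_Single": [False, True],
})

# A blanket cast truncates the float column toward zero
truncated = demo.astype(int)

# Casting only the boolean columns leaves the floats untouched
bool_cols = demo.select_dtypes(include="bool").columns
converted = demo.copy()
converted[bool_cols] = converted[bool_cols].astype(int)

print(truncated["InterestRate"].tolist())
print(converted["InterestRate"].tolist())
print(converted["MaritalStatus_Single"].tolist())
```

Switching to the selective cast would change the inputs to scaling and feature selection, so the selected features and the scores reported later would need to be recomputed.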


## **7.3 Feature scaling**

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
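The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into preprocessing. A small sketch with synthetic income-like values (made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a single numeric feature (assumption)
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50_000, scale=10_000, size=(800, 1))
test_demo = rng.normal(loc=50_000, scale=10_000, size=(200, 1))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train
test_scaled = scaler_demo.transform(test_demo)        # reuses the train statistics

# Train is centred exactly; test is only approximately centred
print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("test mean/std: ", test_scaled.mean(), test_scaled.std())
```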



## **7.4 Feature selection**

In [None]:
selector = SelectKBest(score_func=f_classif, k=10)

X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)



In [None]:
selected_features = X_train.columns[selector.get_support()]
selected_features

Index(['Age', 'AnnualIncome', 'CreditScore', 'LoanAmount', 'TotalAssets',
       'MonthlyIncome', 'NetWorth', 'MonthlyLoanPayment', 'RiskScore',
       'EducationLevel_High School'],
      dtype='object')
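`SelectKBest` with `f_classif` ranks features by their ANOVA F-score and keeps the top `k`. To see why a feature was kept, the scores can be inspected directly. The sketch below does this on synthetic data from `make_classification`; the feature names `f0` to `f7` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data with hypothetical feature names f0..f7
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector_demo = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# Rank all features by F-score; the top k are the ones the selector keeps
scores = pd.Series(selector_demo.scores_, index=X_demo.columns)
scores = scores.sort_values(ascending=False)
kept = list(X_demo.columns[selector_demo.get_support()])

print(scores.round(2))
print("kept:", kept)
```

A constant column has zero variance within and between classes, so its F-statistic is undefined and scikit-learn emits a warning for it. Checking the scores is a quick way to spot such columns.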

# **8. Baseline model – Logistic Regression**

## **8.1 Train model**

In [None]:
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_train_selected, y_train)

## **8.2 Prediction**

In [None]:
y_pred_lr = model_lr.predict(X_test_selected)

## **8.3 Evaluation**

In [None]:
acc_lr = accuracy_score(y_test, y_pred_lr)
prec_lr = precision_score(y_test, y_pred_lr)
rec_lr = recall_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)

print("Logistic Regression")
print("Accuracy:", acc_lr)
print("Precision:", prec_lr)
print("Recall:", rec_lr)
print("F1-score:", f1_lr)

Logistic Regression
Accuracy: 0.9835
Precision: 0.9713983050847458
Recall: 0.9592050209205021
F1-score: 0.9652631578947368
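Since the positive class makes up only about 24% of the rows (the mean of `LoanApproved` is 0.239), accuracy alone can look flattering, and it helps to see how precision and recall follow from the confusion matrix. The sketch below uses small made-up label arrays, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels and predictions (illustrative only, not the notebook's results)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: share of predicted positives that are truly positive
precision_manual = tp / (tp + fp)
# Recall: share of true positives that the model found
recall_manual = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision_manual, precision_score(y_true_demo, y_pred_demo))
print("recall:   ", recall_manual, recall_score(y_true_demo, y_pred_demo))
```

The same call on `y_test` and `y_pred_lr` would show how many borrowers of each class the baseline misclassifies.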


# **9. AdaBoost classifier**

## **9.1 Train AdaBoost with CV + tuning**

In [None]:
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1, 0.5]
}

grid_ada = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5
)

grid_ada.fit(X_train_selected, y_train)

model_ada = grid_ada.best_estimator_
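After a grid search it is useful to check which hyperparameters won and what the cross-validated score was. The sketch below shows the relevant attributes on a small synthetic problem; the data, the reduced grid, and `cv=3` are assumptions chosen to keep it fast, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in problem (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

grid_demo = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    {"n_estimators": [10, 20], "learning_rate": [0.1, 0.5]},
    scoring="f1",
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# best_params_ holds the winning combination, best_score_ its mean CV F1
print("best params:", grid_demo.best_params_)
print("best mean CV F1:", round(grid_demo.best_score_, 3))

# best_estimator_ is refit on all the data passed to fit() with those parameters
best_demo = grid_demo.best_estimator_
print("refit n_estimators:", best_demo.n_estimators)
```

The same attributes are available on `grid_ada`, for example `grid_ada.best_params_` and `grid_ada.best_score_`.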