# Predicting Loan Repayment Behavior Using Machine Learning

## Introduction
Predicting whether a borrower will repay a loan is central to managing credit risk and keeping lending institutions sustainable. This project uses machine learning to classify loan applicants as "good" or "bad" borrowers (stored in the output column `good/bad lender`) from financial and behavioral indicators such as revolving credit utilization, debt ratio, monthly income, and past-due history. Using historical credit data, I trained an XGBoost classifier on the `SeriousDlqin2yrs` target so that financial institutions can make better-informed lending decisions, reduce default rates, and improve their credit management strategies.

In [None]:
!pip install gitpython




## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Libraries and Packages

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import xgboost as xgb



## The Dataset

In [None]:
test = pd.read_csv('/content/drive/MyDrive/cs-test.csv')
train = pd.read_csv('/content/drive/MyDrive/cs-training.csv')

## Understanding the Data

In [None]:
train.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [None]:
test.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,,0.885519,43,0,0.177513,5700.0,4,0,0,0,0.0
1,2,,0.463295,57,0,0.527237,9141.0,15,0,4,0,2.0
2,3,,0.043275,59,0,0.687648,5083.0,12,0,1,0,2.0
3,4,,0.280308,38,1,0.925961,3200.0,7,0,2,0,0.0
4,5,,1.0,27,0,0.019917,3865.0,4,0,0,0,1.0


In [None]:
print(test.shape)

(101503, 12)


In [None]:
print(train.shape)

(150000, 12)


In [None]:
print(train.isna().sum())

Unnamed: 0                                  0
SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64


In [None]:
print(test.isna().sum())

Unnamed: 0                                   0
SeriousDlqin2yrs                        101503
RevolvingUtilizationOfUnsecuredLines         0
age                                          0
NumberOfTime30-59DaysPastDueNotWorse         0
DebtRatio                                    0
MonthlyIncome                            20103
NumberOfOpenCreditLinesAndLoans              0
NumberOfTimes90DaysLate                      0
NumberRealEstateLoansOrLines                 0
NumberOfTime60-89DaysPastDueNotWorse         0
NumberOfDependents                        2626
dtype: int64
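
The raw counts above are easier to compare as shares of each dataset. The sketch below is an illustration only: it hard-codes the row totals and missing-value counts printed in the outputs above, and the helper name `missing_share` is hypothetical.

```python
# Row totals and missing-value counts copied from the outputs above.
missing = {
    "train": {"rows": 150000, "MonthlyIncome": 29731, "NumberOfDependents": 3924},
    "test": {"rows": 101503, "MonthlyIncome": 20103, "NumberOfDependents": 2626},
}


def missing_share(counts):
    """Return the fraction of rows that are missing for each listed column."""
    rows = counts["rows"]
    return {col: n / rows for col, n in counts.items() if col != "rows"}


shares = {name: missing_share(counts) for name, counts in missing.items()}

for name, cols in shares.items():
    for col, share in cols.items():
        print(f"{name:5s} {col:20s} {share:.1%}")
```

In both files roughly one row in five lacks `MonthlyIncome`, while fewer than 3% lack `NumberOfDependents`, so the income imputation choice is likely to matter more.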


In [None]:
print(train.duplicated().sum())

0


In [None]:
print(test.duplicated().sum())

0


## Data Preprocessing

In [None]:
# Drop the CSV row-index column from each dataset
df_train = train.drop('Unnamed: 0', axis=1)
df_test = test.drop('Unnamed: 0', axis=1)

In [None]:
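# Drop the CSV row-index column, then count missing values per column
d_train = train.drop('Unnamed: 0', axis=1)
d_train.isnull().sum()  # not displayed: a cell shows only its last expression
d_test = test.drop('Unnamed: 0', axis=1)
d_test.isnull().sum()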

Unnamed: 0,0
SeriousDlqin2yrs,101503
RevolvingUtilizationOfUnsecuredLines,0
age,0
NumberOfTime30-59DaysPastDueNotWorse,0
DebtRatio,0
MonthlyIncome,20103
NumberOfOpenCreditLinesAndLoans,0
NumberOfTimes90DaysLate,0
NumberRealEstateLoansOrLines,0
NumberOfTime60-89DaysPastDueNotWorse,0


In [None]:
# Flag missing values before imputing; after fillna(), isnull() is always False
d_train['MissingIncomeFlag'] = d_train['MonthlyIncome'].isnull().astype(int)
d_train['MissingDependentsFlag'] = d_train['NumberOfDependents'].isnull().astype(int)
d_test['MissingIncomeFlag'] = d_test['MonthlyIncome'].isnull().astype(int)
d_test['MissingDependentsFlag'] = d_test['NumberOfDependents'].isnull().astype(int)

# Impute with medians computed on the training set only
income_median = d_train['MonthlyIncome'].median()
dependents_median = d_train['NumberOfDependents'].median()

d_train['MonthlyIncome'] = d_train['MonthlyIncome'].fillna(income_median)
d_train['NumberOfDependents'] = d_train['NumberOfDependents'].fillna(dependents_median)
d_test['MonthlyIncome'] = d_test['MonthlyIncome'].fillna(income_median)
d_test['NumberOfDependents'] = d_test['NumberOfDependents'].fillna(dependents_median)
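
A missing-value flag only carries information if it is recorded before the gaps are filled. The following is a minimal sketch of that ordering on toy data, not the notebook's own cell: the helper `impute_with_flag` and the toy frames are hypothetical, and the median is taken from the training frame only so that nothing from the test set leaks into preprocessing.

```python
import numpy as np
import pandas as pd


def impute_with_flag(train_df, test_df, column, flag_name):
    """Flag missing values, then fill both frames with the training median."""
    train_out, test_out = train_df.copy(), test_df.copy()
    median = train_out[column].median()  # computed on training rows only
    for frame in (train_out, test_out):
        frame[flag_name] = frame[column].isna().astype(int)  # flag first
        frame[column] = frame[column].fillna(median)         # then fill
    return train_out, test_out


# Toy data for illustration only
toy_train = pd.DataFrame({"MonthlyIncome": [1000.0, np.nan, 3000.0, 5000.0]})
toy_test = pd.DataFrame({"MonthlyIncome": [np.nan, 2000.0]})

toy_train_f, toy_test_f = impute_with_flag(
    toy_train, toy_test, "MonthlyIncome", "MissingIncomeFlag"
)
print(toy_train_f)
print(toy_test_f)
```

In the toy example the training median is 3000.0, so the missing test value is filled with 3000.0 and its flag is 1.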


In [None]:
d_test.isna().sum()

Unnamed: 0,0
SeriousDlqin2yrs,101503
RevolvingUtilizationOfUnsecuredLines,0
age,0
NumberOfTime30-59DaysPastDueNotWorse,0
DebtRatio,0
MonthlyIncome,0
NumberOfOpenCreditLinesAndLoans,0
NumberOfTimes90DaysLate,0
NumberRealEstateLoansOrLines,0
NumberOfTime60-89DaysPastDueNotWorse,0


## Model Training and Prediction

In [None]:
X = d_train.drop('SeriousDlqin2yrs', axis=1)  # Features
y = d_train['SeriousDlqin2yrs']              # Target

# Train-test split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Calculate imbalance ratio (negatives per positive) for scale_pos_weight
scale_weight = (y_train == 0).sum() / (y_train == 1).sum()

# Define XGBoost model
model = xgb.XGBClassifier(
    eval_metric='logloss',
    scale_pos_weight=scale_weight,
    random_state=42
)

# Train model
model.fit(X_train, y_train)

# Predict on validation set
y_pred = model.predict(X_val)
y_proba = model.predict_proba(X_val)[:, 1]

# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred))
print("\nClassification Report:\n", classification_report(y_val, y_pred))
print("\nROC-AUC Score:", roc_auc_score(y_val, y_proba))



Confusion Matrix:
 [[23217  4778]
 [  584  1421]]

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.83      0.90     27995
           1       0.23      0.71      0.35      2005

    accuracy                           0.82     30000
   macro avg       0.60      0.77      0.62     30000
weighted avg       0.93      0.82      0.86     30000


ROC-AUC Score: 0.85072348598053


In [None]:
# Make a copy to avoid modifying the original
d_test_features = d_test.copy()

# Drop the target and any previously added label column (ignored if absent)
d_test_features = d_test_features.drop(
    columns=['SeriousDlqin2yrs', 'good/bad lender'], errors='ignore'
)

# Predict
test_predictions = model.predict(d_test_features)

# Add 'good/bad lender' column
d_test['good/bad lender'] = ['bad' if pred == 1 else 'good' for pred in test_predictions]

# (Optional) Save to CSV
d_test.to_csv('test_with_good_bad_lender.csv', index=False)
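
Prediction assumes the test features have the same columns, in the same order, as the matrix the model was trained on. The sketch below shows one way to make that check explicit before predicting. The helper `align_features` and the toy frames are hypothetical and are not part of the notebook.

```python
import pandas as pd


def align_features(frame, feature_names):
    """Return `frame` restricted to `feature_names`, in that order.

    Raises KeyError if any expected feature column is absent.
    """
    missing = [col for col in feature_names if col not in frame.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return frame.loc[:, list(feature_names)]


# Toy data for illustration only
toy_train_X = pd.DataFrame({"age": [45, 40], "DebtRatio": [0.80, 0.12]})
toy_test = pd.DataFrame(
    {"DebtRatio": [0.18], "SeriousDlqin2yrs": [None], "age": [43]}
)

toy_test_X = align_features(toy_test, toy_train_X.columns)
print(list(toy_test_X.columns))
```

In the notebook's terms, the equivalent call would be something like `align_features(d_test, X.columns)`, which drops the empty target column and any label column in one step.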


In [None]:
d_test.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents,MissingIncomeFlag,MissingDependentsFlag,good/bad lender
0,,0.885519,43,0,0.177513,5700.0,4,0,0,0,0.0,0,0,good
1,,0.463295,57,0,0.527237,9141.0,15,0,4,0,2.0,0,0,bad
2,,0.043275,59,0,0.687648,5083.0,12,0,1,0,2.0,0,0,good
3,,0.280308,38,1,0.925961,3200.0,7,0,2,0,0.0,0,0,bad
4,,1.0,27,0,0.019917,3865.0,4,0,0,0,1.0,0,0,bad
