**Reasons why a loan could be rejected:**
* Credit score was too low
* Debt-to-income ratio was too high
* Tried to borrow too much
* Income was insufficient or unstable
* Didn’t meet the basic requirements
* Missing information on the application
* Loan purpose didn’t meet the lender’s criteria

[Reference](https://www.lendingtree.com/personal/reasons-why-your-personal-loan-was-declined/)


In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
df = pd.read_csv('Bondora_raw.csv')

In [None]:
df.head()

Unnamed: 0,ReportAsOfEOD,LoanId,LoanNumber,ListedOnUTC,BiddingStartedOn,BidsPortfolioManager,BidsApi,BidsManual,UserName,NewCreditCustomer,...,PreviousEarlyRepaymentsCountBeforeLoan,GracePeriodStart,GracePeriodEnd,NextPaymentDate,NextPaymentNr,NrOfScheduledPayments,ReScheduledOn,PrincipalDebtServicingCost,InterestAndPenaltyDebtServicingCost,ActiveLateLastPaymentCategory
0,2020-01-27,F0660C80-83F3-4A97-8DA0-9C250112D6EC,659,2009-06-11 16:40:39,2009-06-11 16:40:39,0,0,115.041,KARU,True,...,0,,,,,,,0.0,0.0,
1,2020-01-27,978BB85B-1C69-4D51-8447-9C240104A3A2,654,2009-06-10 15:48:57,2009-06-10 15:48:57,0,0,140.6057,koort681,False,...,0,,,,,,,0.0,0.0,
2,2020-01-27,EA44027E-7FA7-4BB2-846D-9C1F013C8A22,641,2009-06-05 19:12:29,2009-06-05 19:12:29,0,0,319.558,0ie,True,...,0,,,,,,,0.0,0.0,180+
3,2020-01-27,CE67AD25-2951-4BEE-96BD-9C2700C61EF4,668,2009-06-13 12:01:20,2009-06-13 12:01:20,0,0,57.5205,Alyona,True,...,0,,,,,,,0.0,0.0,
4,2020-01-27,9408BF8C-B159-4D6A-9D61-9C2400A986E3,652,2009-06-10 10:17:13,2009-06-10 10:17:13,0,0,319.5582,Kai,True,...,0,,,,,,,0.0,0.0,180+


In [None]:
df.shape

(134529, 112)

This dataset has 134,529 records and 112 columns.

In [None]:
df.describe()

Unnamed: 0,LoanNumber,BidsPortfolioManager,BidsApi,BidsManual,ApplicationSignedHour,ApplicationSignedWeekday,VerificationType,LanguageCode,Age,Gender,...,InterestAndPenaltyBalance,NoOfPreviousLoansBeforeLoan,AmountOfPreviousLoansBeforeLoan,PreviousRepaymentsBeforeLoan,PreviousEarlyRepaymentsBefoleLoan,PreviousEarlyRepaymentsCountBeforeLoan,NextPaymentNr,NrOfScheduledPayments,PrincipalDebtServicingCost,InterestAndPenaltyDebtServicingCost
count,134529.0,134529.0,134529.0,134529.0,134529.0,134529.0,134484.0,134529.0,134529.0,134484.0,...,134529.0,134529.0,134529.0,91368.0,58026.0,134529.0,97788.0,97788.0,59129.0,59129.0
mean,944939.2,966.452876,29.111664,559.33259,13.37464,3.907908,2.817257,2.827874,40.819295,0.442097,...,701.567107,1.48762,2868.652401,928.395548,320.743805,0.069903,5.178795,50.126795,5.264702,89.851455
std,478673.8,1355.686016,150.159148,750.360512,4.992375,1.726192,1.407908,1.959802,12.348693,0.636083,...,2514.595572,2.396148,4507.046575,2042.348751,1561.799076,0.359461,7.674427,12.51953,57.800582,287.449052
min,37.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,-2.66,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,620679.0,155.0,0.0,96.0,10.0,2.0,1.0,1.0,31.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,36.0,0.0,0.0
50%,923597.0,465.0,0.0,317.0,13.0,4.0,4.0,3.0,40.0,0.0,...,0.0,1.0,396.3541,197.98,0.0,0.0,3.0,60.0,0.0,0.0
75%,1311025.0,1218.0,5.0,729.0,17.0,5.0,4.0,4.0,50.0,1.0,...,202.9,2.0,4250.0,780.95,0.0,0.0,7.0,60.0,0.0,17.33
max,1855339.0,10625.0,7570.0,10630.0,23.0,7.0,4.0,22.0,77.0,2.0,...,64494.77,25.0,53762.0,34077.42,48100.0,11.0,60.0,72.0,3325.33,5295.29


In [None]:
df.columns

Index(['ReportAsOfEOD', 'LoanId', 'LoanNumber', 'ListedOnUTC',
       'BiddingStartedOn', 'BidsPortfolioManager', 'BidsApi', 'BidsManual',
       'UserName', 'NewCreditCustomer',
       ...
       'PreviousEarlyRepaymentsCountBeforeLoan', 'GracePeriodStart',
       'GracePeriodEnd', 'NextPaymentDate', 'NextPaymentNr',
       'NrOfScheduledPayments', 'ReScheduledOn', 'PrincipalDebtServicingCost',
       'InterestAndPenaltyDebtServicingCost', 'ActiveLateLastPaymentCategory'],
      dtype='object', length=112)

In [None]:
numerical_columns = df.select_dtypes(include=['float64','int64']).columns.to_list()

**Impute with median value:** Missing values in numerical columns can be replaced with the column median. The median is preferable to the mean when a column contains extreme values such as outliers, because outliers shift it far less.
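To see why the median is the safer fill value when outliers are present, here is a minimal sketch on made-up numbers (the toy `IncomeTotal` values are hypothetical, not taken from the Bondora data). It compares the value that `SimpleImputer` writes into the missing slot under the `mean` and `median` strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical income column: three typical values, one extreme outlier, one missing entry
toy = pd.DataFrame({"IncomeTotal": [900.0, 1000.0, 1100.0, 50000.0, np.nan]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

# The last row held the missing value, so it now contains the fill value
mean_fill_value = mean_filled[-1, 0]      # pulled towards the outlier
median_fill_value = median_filled[-1, 0]  # stays close to the typical incomes

print(f"mean fill:   {mean_fill_value}")
print(f"median fill: {median_fill_value}")
```

With these numbers the mean fill lands far above every typical income, while the median fill stays between the two middle values.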

In [None]:
numerical_columns

['BidsPortfolioManager',
 'BidsApi',
 'BidsManual',
 'ApplicationSignedHour',
 'ApplicationSignedWeekday',
 'VerificationType',
 'LanguageCode',
 'Age',
 'Gender',
 'AppliedAmount',
 'Amount',
 'Interest',
 'LoanDuration',
 'MonthlyPayment',
 'UseOfLoan',
 'Education',
 'MaritalStatus',
 'EmploymentStatus',
 'OccupationArea',
 'HomeOwnershipType',
 'IncomeFromPrincipalEmployer',
 'IncomeFromPension',
 'IncomeFromFamilyAllowance',
 'IncomeFromSocialWelfare',
 'IncomeFromLeavePay',
 'IncomeFromChildSupport',
 'IncomeOther',
 'IncomeTotal',
 'ExistingLiabilities',
 'LiabilitiesTotal',
 'RefinanceLiabilities',
 'DebtToIncome',
 'FreeCash',
 'MonthlyPaymentDay',
 'PlannedInterestTillDate',
 'ExpectedLoss',
 'LossGivenDefault',
 'ExpectedReturn',
 'ProbabilityOfDefault',
 'PrincipalOverdueBySchedule',
 'ModelVersion',
 'PrincipalPaymentsMade',
 'InterestAndPenaltyPaymentsMade',
 'PrincipalBalance',
 'InterestAndPenaltyBalance',
 'NoOfPreviousLoansBeforeLoan',
 'AmountOfPreviousLoansBeforeLoa

In [None]:
df.dtypes

ReportAsOfEOD                              object
BidsPortfolioManager                        int64
BidsApi                                     int64
BidsManual                                float64
NewCreditCustomer                            bool
                                           ...   
AmountOfPreviousLoansBeforeLoan           float64
PreviousRepaymentsBeforeLoan              float64
PreviousEarlyRepaymentsCountBeforeLoan      int64
NextPaymentNr                             float64
NrOfScheduledPayments                     float64
Length: 71, dtype: object

In [None]:
len(numerical_columns)

51

**Impute with mode value:** Missing values in categorical columns can be replaced with the mode, i.e. the most frequent value in the column.
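As a small illustration of mode imputation, the sketch below fills a missing entry in a toy categorical column with its most frequent value. The `Rating` values here are invented for the example; only the `most_frequent` strategy mirrors what the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry; "B" is the most frequent value
toy = pd.DataFrame(
    {"Rating": pd.Series(["A", "B", "B", np.nan, "C", "B"], dtype=object)}
)

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy)

# fit_transform returns a 2-D array; take the single column back out as a list
filled_ratings = filled[:, 0].tolist()
print(filled_ratings)
```

If several values tie for most frequent, `SimpleImputer` picks the smallest one, so the result is still deterministic.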

In [None]:
categorical_columns = df.select_dtypes(include=['object','bool']).columns.to_list()

In [None]:
categorical_columns

['ReportAsOfEOD',
 'NewCreditCustomer',
 'LoanApplicationStartedDate',
 'LoanDate',
 'FirstPaymentDate',
 'MaturityDate_Original',
 'MaturityDate_Last',
 'DateOfBirth',
 'Country',
 'County',
 'City',
 'EmploymentDurationCurrentEmployer',
 'ActiveScheduleFirstPaymentReached',
 'LastPaymentOn',
 'StageActiveSince',
 'Rating',
 'Status',
 'Restructured',
 'WorseLateCategory',
 'CreditScoreEsMicroL']

In [None]:
len(categorical_columns)

20

In [None]:
from sklearn.impute import SimpleImputer

num_imp = SimpleImputer(strategy = 'median')
df[numerical_columns]= num_imp.fit_transform(df[numerical_columns])

cat_imp = SimpleImputer(strategy = 'most_frequent')
df[categorical_columns]= cat_imp.fit_transform(df[categorical_columns])

In [None]:
mi_scores['InterestAndPenaltyBalance']

0.6785910205972808
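The cells that build `X`, `y` and `mi_scores` are not part of this section. As a hedged sketch of how such scores could be produced, the example below uses scikit-learn's `mutual_info_classif` on synthetic data and stores the result in a pandas Series indexed by column name. The names `X_demo`, `target` and `mi_scores_demo` are placeholders, and this is an assumption about the approach, not necessarily the code used for the Bondora data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the feature matrix and target (not the Bondora data)
features, target = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])

# Estimate the mutual information between each feature and the class label
mi = mutual_info_classif(X_demo, target, random_state=0)

# Index by column name so a score can be looked up as mi_scores_demo["f0"]
mi_scores_demo = pd.Series(mi, index=X_demo.columns).sort_values(ascending=False)
print(mi_scores_demo)
```

A score near zero suggests the feature carries little information about the target on its own, which is the motivation for thresholding the scores.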

In [None]:
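# Collect the columns whose mutual-information score is below 0.05
low_mi_score_col = [col for col in X.columns if mi_scores[col] < 0.05]
for col in low_mi_score_col:
    print(f'Removing the {col} column')

# Drop them in a single call instead of mutating X while looping over its columns
X.drop(columns=low_mi_score_col, inplace=True)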

Removing the BidsPortfolioManager column
Removing the BidsApi column
Removing the BidsManual column
Removing the ApplicationSignedHour column
Removing the ApplicationSignedWeekday column
Removing the VerificationType column
Removing the LanguageCode column
Removing the Age column
Removing the Gender column
Removing the Amount column
Removing the UseOfLoan column
Removing the Education column
Removing the MaritalStatus column
Removing the EmploymentStatus column
Removing the OccupationArea column
Removing the HomeOwnershipType column
Removing the IncomeFromPrincipalEmployer column
Removing the IncomeFromPension column
Removing the IncomeFromFamilyAllowance column
Removing the IncomeFromSocialWelfare column
Removing the IncomeFromLeavePay column
Removing the IncomeFromChildSupport column
Removing the IncomeOther column
Removing the IncomeTotal column
Removing the ExistingLiabilities column
Removing the LiabilitiesTotal column
Removing the RefinanceLiabilities column
Removing the DebtToIn

In [None]:
X.shape

(77394, 27)

In [None]:
X.columns

Index(['AppliedAmount', 'Interest', 'LoanDuration', 'MonthlyPayment',
       'PlannedInterestTillDate', 'ExpectedLoss', 'LossGivenDefault',
       'ExpectedReturn', 'ProbabilityOfDefault', 'PrincipalOverdueBySchedule',
       'PrincipalPaymentsMade', 'InterestAndPenaltyPaymentsMade',
       'PrincipalBalance', 'InterestAndPenaltyBalance', 'NextPaymentNr',
       'NrOfScheduledPayments', 'LoanApplicationStartedDate_year',
       'FirstPaymentDate_year', 'MaturityDate_Original_year',
       'MaturityDate_Last_year', 'LastPaymentOn_year', 'LastPaymentOn_month',
       'LastPaymentOn_week', 'StageActiveSince_year', 'StageActiveSince_month',
       'StageActiveSince_week', 'StageActiveSince_day'],
      dtype='object')

In [None]:
X.head()

Unnamed: 0,AppliedAmount,Interest,LoanDuration,MonthlyPayment,PlannedInterestTillDate,ExpectedLoss,LossGivenDefault,ExpectedReturn,ProbabilityOfDefault,PrincipalOverdueBySchedule,...,FirstPaymentDate_year,MaturityDate_Original_year,MaturityDate_Last_year,LastPaymentOn_year,LastPaymentOn_month,LastPaymentOn_week,StageActiveSince_year,StageActiveSince_month,StageActiveSince_week,StageActiveSince_day
0,319.5582,30.0,12.0,97.38,319.08,0.123398,0.506748,0.134049,0.23414,0.0,...,2009,2010,2010,2010,7,27,2019,2,6,8
1,191.7349,25.0,1.0,97.38,45.83,0.123398,0.506748,0.134049,0.23414,0.0,...,2009,2009,2009,2009,7,28,2019,2,6,8
2,319.5582,25.0,20.0,97.38,197.2926,0.123398,0.506748,0.134049,0.23414,116.35,...,2009,2011,2014,2012,10,40,2016,3,9,3
3,127.8233,45.0,15.0,97.38,293.1,0.123398,0.506748,0.134049,0.23414,0.0,...,2009,2010,2010,2010,9,37,2019,2,6,8
4,319.5582,30.0,12.0,97.38,833.81,0.123398,0.506748,0.134049,0.23414,0.0,...,2009,2010,2010,2015,7,29,2019,2,6,8


In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [None]:
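from sklearn.preprocessing import Normalizer
# Note: Normalizer rescales each row (sample) to unit L2 norm; it does not scale features column-wise
scale = Normalizer()
# Apply the same transformer object to the training and test splits
x_train = scale.fit_transform(x_train)
x_test = scale.transform(x_test)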
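It may help to see what `Normalizer` actually does, since it works per row rather than per column. The sketch below, on a made-up 3×2 array, contrasts it with `StandardScaler`, which is shown only for comparison and is not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data: rows 0 and 1 point in the same direction but differ in magnitude
toy = np.array([[3.0, 4.0],
                [6.0, 8.0],
                [1.0, 0.0]])

# Normalizer divides every row by its own L2 norm
row_normalized = Normalizer().fit_transform(toy)
row_norms = np.linalg.norm(row_normalized, axis=1)

# StandardScaler centres and scales every column using column statistics
col_standardized = StandardScaler().fit_transform(toy)
col_means = col_standardized.mean(axis=0)

print(row_normalized)
print(col_standardized)
```

After row normalization the first two samples become identical, so information about a sample's overall magnitude is discarded. Whether that is desirable depends on the features involved.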

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Fit the logistic regression model
clf = LogisticRegression()
clf.fit(x_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(x_test)

# Calculate the model's performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
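The metrics above are computed but never displayed. A minimal sketch of how they relate to the confusion matrix, using hand-made labels (`y_true_demo` and `y_pred_demo` are hypothetical, not model output from this notebook):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made binary labels: 3 true negatives, 1 false positive, 2 false negatives, 2 true positives
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true_demo, y_pred_demo)     # (TP + TN) / total
prec = precision_score(y_true_demo, y_pred_demo)   # TP / (TP + FP)
rec = recall_score(y_true_demo, y_pred_demo)       # TP / (TP + FN)
f1_demo = f1_score(y_true_demo, y_pred_demo)       # harmonic mean of precision and recall

print(cm)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1_demo:.3f}")
```

Printing the confusion matrix next to the scores makes it easier to spot a model that achieves high accuracy only by favouring the majority class.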

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5,
                                       n_estimators=100, oob_score=True)
classifier_rf.fit(x_train, y_train)

classifier_rf.oob_score_

0.9999515464750061

In [None]:
# Shallower forest (max_depth=2) with a minimum leaf size; RandomForestClassifier is already imported above
classifier_rf = RandomForestClassifier(max_depth=2, min_samples_leaf=5, n_jobs=-1,
                                       random_state=42, n_estimators=100, oob_score=True)

classifier_rf.fit(x_train, y_train)

classifier_rf.oob_score_

0.9986433013001695
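The `oob_score_` attribute reports accuracy on the training rows that each tree did not see in its bootstrap sample, so it acts as a built-in validation estimate. The sketch below, on synthetic data rather than the Bondora features, compares that estimate with accuracy on a held-out split; the variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the Bondora data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row
forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, oob_score=True, random_state=42
)
forest.fit(X_tr, y_tr)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_te, y_te)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

When the two numbers are close, the OOB score is a reasonable proxy for held-out performance. A near-perfect score on real data is usually worth checking against the feature list for columns that encode the outcome.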

In [None]:
y_pred = classifier_rf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)