# Predicting if a borrower will pay off their loan on time or not

In this project we will focus on credit modelling for Lending Club, which is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. We will be working with financial lending data.

Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower. The interest rate determines how much the borrower has to pay back on top of the requested loan amount.

**The goal of this project is to build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not.**

## Reading the data

In [78]:
import pandas as pd
pd.options.display.max_columns = 100

#Reading the data, skipping the first row, which contains only descriptive text
loans = pd.read_csv(r"Data\LoanStats3a.csv", skiprows=1, low_memory=False)

#Removing all columns containing more than 50% missing values
half_count = len(loans) / 2
loans = loans.dropna(thresh=half_count, axis=1)

#Removing 'desc' column, which contains a long text explanation for each loan
loans = loans.drop(['desc'],axis=1)

#Let's save our cleaned data (raw string, so the backslash is not read as an escape)
loans.to_csv(r'Data\loans.csv', index=False)

loans.head(3)

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens,hardship_flag,disbursement_method,debt_settlement_flag
0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2018,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N
1,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1014.53,1014.53,456.46,435.17,0.0,122.9,1.11,Apr-2013,119.66,Oct-2016,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N
2,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2017,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,N,Cash,N


In [79]:
loans.shape

(42538, 53)
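The `thresh` argument used above is easy to misread, so here is a minimal sketch on a toy frame (hypothetical data, not the Lending Club file). `thresh` is the minimum number of non-null values a column needs in order to be kept, which means a column that is exactly half empty survives.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): columns with 4, 2 and 1 non-null values out of 4 rows.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# thresh = minimum number of non-null values required to keep a column.
min_non_null = len(toy) // 2
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))  # ['a', 'b']
```

Column `b` has exactly two non-null values, so it meets the threshold and is kept; only `c` is dropped.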

## Data Cleaning

We will start by removing the columns that will not be useful for our ML model because they:
- leak information from the future (after the loan has already been funded)
- have no effect on a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
- are poorly formatted and need to be cleaned up
- require more data or a lot of processing to turn into a useful feature
- contain redundant information

From the 53 columns we are going to remove the following 18:
- **funded_amnt:** leaks data from the future (after the loan has started to be funded)
- **funded_amnt_inv:** also leaks data from the future (after the loan has started to be funded)
- **grade:** redundant with the interest rate column (int_rate)
- **sub_grade:** also redundant with the interest rate column (int_rate)
- **emp_title:** requires other data and a lot of processing to potentially be useful
- **issue_d:** leaks data from the future (after the loan has been completely funded)
- **zip_code:** mostly redundant with the addr_state column, since only the first 3 digits of the 5-digit zip code are visible (which can only be used to identify the state the borrower lives in)
- **out_prncp:** leaks data from the future (after the loan has started to be paid off)
- **out_prncp_inv:** also leaks data from the future (after the loan has started to be paid off)
- **total_pymnt:** also leaks data from the future (after the loan has started to be paid off)
- **total_pymnt_inv:** also leaks data from the future (after the loan has started to be paid off)
- **total_rec_prncp:** also leaks data from the future (after the loan has started to be paid off)
- **total_rec_int:** also leaks data from the future (after the loan has started to be paid off)
- **total_rec_late_fee:** also leaks data from the future (after the loan has started to be paid off)
- **recoveries:** also leaks data from the future (after the loan has started to be paid off)
- **collection_recovery_fee:** also leaks data from the future (after the loan has started to be paid off)
- **last_pymnt_d:** also leaks data from the future (after the loan has started to be paid off)
- **last_pymnt_amnt:** also leaks data from the future (after the loan has started to be paid off)

In [80]:
loans = loans.drop(
        ["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d",
        "zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
        "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"]
        , axis=1)
print(loans.iloc[0])
print(loans.shape[1])

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               J

We managed to reduce the number of columns from 53 to 35. Now let's remove the columns that have only one unique value.

In [81]:
drop_columns = []
all_columns = loans.columns
for c in all_columns:
    not_null = loans[c].dropna()
    unique_values = not_null.unique()
    num_of_val = len(unique_values)
    if num_of_val == 1:
        drop_columns.append(c)
        
loans = loans.drop(drop_columns, axis=1)
print(drop_columns)
print(loans.shape[1])

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'chargeoff_within_12_mths', 'hardship_flag', 'disbursement_method']
27
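The loop above can also be written with `DataFrame.nunique()`, which ignores missing values by default. This is only a sketch of an equivalent approach on a toy frame (hypothetical values, with column names borrowed from the data for readability):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "policy_code": [1.0, 1.0, 1.0],
    "pymnt_plan": ["n", np.nan, "n"],        # one distinct value plus a missing one
    "loan_amnt": [5000.0, 2500.0, 2400.0],
})

# nunique() counts distinct non-null values per column (dropna=True by default).
n_unique = toy.nunique()
single_value_cols = n_unique[n_unique == 1].index.tolist()

print(single_value_cols)  # ['policy_code', 'pymnt_plan']
reduced = toy.drop(columns=single_value_cols)
```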


We reduced the number of columns to 27. It's time to choose our target column. The best option seems to be loan_status, so let's explore its values and how often each one occurs.

In [82]:
loans["loan_status"].value_counts()

Fully Paid                                             34116
Charged Off                                             5670
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Name: loan_status, dtype: int64

We have 4 possible values for loan_status:
- **Fully Paid:** Loan has been fully paid off.
- **Charged Off:** Loan for which there is no longer a reasonable expectation of further payments.
- **Does not meet the credit policy. Status:Fully Paid:** While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.
- **Does not meet the credit policy. Status:Charged Off:** While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.

From the investor's point of view the most interesting values are Fully Paid and Charged Off, as they describe the final outcome of the loan. We are going to remove the rows where loan_status has any other value. We will also map the remaining values to numbers: Fully Paid will be 1 and Charged Off will be 0, so we can use the column as the target in our model.

In [83]:
loans = loans[(loans['loan_status'] == 'Fully Paid') | (loans['loan_status'] == 'Charged Off')]
loan_status = {
    "loan_status": {
        "Fully Paid" : 1,
        "Charged Off" : 0
    }
}
loans = loans.replace(loan_status)
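The filter-and-map step can be illustrated on a small series. This is a sketch using `isin` and `map` as an alternative spelling of the cell above, not a change to it; the toy values are hypothetical.

```python
import pandas as pd

status = pd.Series([
    "Fully Paid",
    "Charged Off",
    "Does not meet the credit policy. Status:Fully Paid",
    "Fully Paid",
])

# Keep only the two final outcomes, then map them to 1 / 0.
keep = status.isin(["Fully Paid", "Charged Off"])
target = status[keep].map({"Fully Paid": 1, "Charged Off": 0})

print(target.tolist())        # [1, 0, 1]
print(target.index.tolist())  # [0, 1, 3]
```

Note that filtering keeps the original row labels, so the index has gaps afterwards.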

## Preparing the features

We are going to prepare the data for machine learning by handling missing values, converting categorical columns to numeric columns, and removing any remaining columns that are not useful. Let's start by counting the missing values.

In [84]:
null_counts = loans.isnull().sum()
print(null_counts)

loan_amnt                 0
term                      0
int_rate                  0
installment               0
emp_length                0
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
title                    10
addr_state                0
dti                       0
delinq_2yrs               0
earliest_cr_line          0
inq_last_6mths            0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util               50
total_acc                 0
last_credit_pull_d        2
acc_now_delinq            0
delinq_amnt               0
pub_rec_bankruptcies    697
tax_liens                39
debt_settlement_flag      0
dtype: int64


Most of the columns have no missing values. Four columns (title, revol_util, last_credit_pull_d and tax_liens) have 50 or fewer missing values, and one, 'pub_rec_bankruptcies', has 697. We will follow the rule of removing columns with more than 1% of missing values (in our case just 'pub_rec_bankruptcies'); for the others we will remove the rows with missing values.

In [85]:
loans = loans.drop(['pub_rec_bankruptcies'], axis=1)
loans = loans.dropna()
loans.isnull().sum()

loan_amnt               0
term                    0
int_rate                0
installment             0
emp_length              0
home_ownership          0
annual_inc              0
verification_status     0
loan_status             0
purpose                 0
title                   0
addr_state              0
dti                     0
delinq_2yrs             0
earliest_cr_line        0
inq_last_6mths          0
open_acc                0
pub_rec                 0
revol_bal               0
revol_util              0
total_acc               0
last_credit_pull_d      0
acc_now_delinq          0
delinq_amnt             0
tax_liens               0
debt_settlement_flag    0
dtype: int64
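The 1% rule was applied by hand above. As a sketch, the same decision can be derived from the share of missing values per column. The toy frame below is hypothetical; only the column names are borrowed from the data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 200 rows, three numeric columns.
toy = pd.DataFrame({
    "dti": np.arange(200, dtype=float),
    "revol_util": np.arange(200, dtype=float),
    "pub_rec_bankruptcies": np.arange(200, dtype=float),
})
toy.loc[0, "revol_util"] = np.nan              # 1 of 200 missing  -> 0.5%
toy.loc[:4, "pub_rec_bankruptcies"] = np.nan   # 5 of 200 missing  -> 2.5%

# Share of missing values per column; drop columns above 1%, then drop incomplete rows.
missing_share = toy.isnull().mean()
too_sparse = missing_share[missing_share > 0.01].index.tolist()
cleaned = toy.drop(columns=too_sparse).dropna()

print(too_sparse)     # ['pub_rec_bankruptcies']
print(cleaned.shape)  # (199, 2)
```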

In [86]:
#Let's check how many columns we have of each data type.
print(loans.dtypes.value_counts())

float64    13
object     12
int64       1
dtype: int64


In [87]:
#We have 12 columns of the object data type, which we will have to convert to numerical values. Let's investigate them first.
object_columns_df = loans.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])

term                      36 months
int_rate                     10.65%
emp_length                10+ years
home_ownership                 RENT
verification_status        Verified
purpose                 credit_card
title                      Computer
addr_state                       AZ
earliest_cr_line           Jan-1985
revol_util                    83.7%
last_credit_pull_d         Jun-2018
debt_settlement_flag              N
Name: 0, dtype: object


**Next steps:**

- **Step 1** The columns listed below seem to be categorical, but first we should confirm the number of unique values in those columns. If the number is reasonable, we will encode these columns as dummy variables and keep them:

    - **home_ownership:** home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
    - **verification_status:** indicates if income was verified by Lending Club,
    - **term:** number of payments on the loan, either 36 or 60,
    - **addr_state:** borrower's state of residence.


- **Step 2** It looks like purpose and title columns may contain the same information. If so, we will keep just one of them:
    - **purpose:** a category provided by the borrower for the loan request,
    - **title:** loan title provided by the borrower.


- **Step 3** We will convert the emp_length column to a numerical one, since its values have a natural ordering (2 years of employment is less than 8 years):

    - **emp_length:** number of years the borrower was employed upon time of application.


- **Step 4** There are also some columns that represent numeric values but are stored as text and need to be converted:

    - **int_rate:** interest rate of the loan in %,
    - **revol_util:** revolving line utilization rate or the amount of credit the borrower is using relative to all available credit.
    
    
- **Step 5** The date features listed below would require some feature engineering to be useful for modeling, so we will remove them:
    - **earliest_cr_line:** the month the borrower's earliest reported credit line was opened,
    - **last_credit_pull_d:** the most recent month Lending Club pulled credit for this loan.
    
    
- **Step 6** The debt_settlement_flag column has mostly one value and will not bring much value to our model, so we will remove it.

### Step 1 - encode columns as dummy

In [88]:
#We will explore purpose and title columns separately, as they might contain the same information.
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(c,'\n','-'*40,'\n',loans[c].value_counts(),'\n')

home_ownership 
 ---------------------------------------- 
 RENT        18869
MORTGAGE    17669
OWN          3050
OTHER          96
NONE            1
Name: home_ownership, dtype: int64 

verification_status 
 ---------------------------------------- 
 Not Verified       16851
Verified           12833
Source Verified    10001
Name: verification_status, dtype: int64 

emp_length 
 ---------------------------------------- 
 10+ years    8894
< 1 year     4563
2 years      4386
3 years      4092
4 years      3431
5 years      3278
1 year       3233
6 years      2226
7 years      1769
8 years      1481
9 years      1258
n/a          1074
Name: emp_length, dtype: int64 

term 
 ---------------------------------------- 
  36 months    29002
 60 months    10683
Name: term, dtype: int64 

addr_state 
 ---------------------------------------- 
 CA    7094
NY    3811
FL    2864
TX    2729
NJ    1850
IL    1524
PA    1515
VA    1406
GA    1396
MA    1332
OH    1221
MD    1052
AZ     878
WA     841

The addr_state column contains too many distinct values to convert to dummy variables, so we will remove it and create dummies from the rest of the columns.

In [89]:
dummy_df = pd.get_dummies(loans[["home_ownership", "verification_status", "term"]])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(["home_ownership", "verification_status", "term", "addr_state"], axis=1)
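For reference, `pd.get_dummies` names each new column `<column>_<value>`. The `term` values in this data appear to carry a leading space (the value counts above and the later `term_ 36 months` header suggest so), which ends up in the dummy column names. The sketch below, on hypothetical toy rows, shows the naming and one optional way to tidy it by stripping whitespace first.

```python
import pandas as pd

toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Default behaviour: the leading space is preserved in the column names.
dummies = pd.get_dummies(toy[["term", "home_ownership"]])
print(list(dummies.columns))
# ['term_ 36 months', 'term_ 60 months', 'home_ownership_OWN', 'home_ownership_RENT']

# Optional clean-up (an assumption about what is desirable, not part of the original cell).
clean = pd.get_dummies(toy.assign(term=toy["term"].str.strip()))
print(list(clean.columns))
# ['term_36 months', 'term_60 months', 'home_ownership_OWN', 'home_ownership_RENT']
```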

### Step 2 -  purpose vs title column

In [90]:
print("purpose",'\n','-'*40,'\n',loans["purpose"].value_counts(),'\n')
print("title",'\n','-'*40,'\n',loans["title"].value_counts(),'\n')

purpose 
 ---------------------------------------- 
 debt_consolidation    18654
credit_card            5123
other                  3980
home_improvement       2974
major_purchase         2182
small_business         1823
car                    1549
wedding                 947
medical                 693
moving                  580
house                   381
vacation                378
educational             318
renewable_energy        103
Name: purpose, dtype: int64 

title 
 ---------------------------------------- 
 Debt Consolidation                          2189
Debt Consolidation Loan                     1732
Personal Loan                                661
Consolidation                                515
debt consolidation                           508
Credit Card Consolidation                    357
Home Improvement                             356
Debt consolidation                           334
Small Business Loan                          329
Credit Card Loan                 

It looks like both columns contain the same information. We are going to keep only the 'purpose' column, because it has far fewer distinct values. Moreover, the 'title' column has data quality issues: it contains the same information repeated with small modifications, e.g. Debt Consolidation, Debt Consolidation Loan and debt consolidation.

In [91]:
dummy_df = pd.get_dummies(loans["purpose"])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(["title", "purpose"], axis=1)
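To illustrate the data quality issue in `title`, here is a small sketch showing how much of the variation is only letter case. The strings are taken from the value counts above; the normalization itself is an illustration and was not applied in this project.

```python
import pandas as pd

titles = pd.Series([
    "Debt Consolidation",
    "debt consolidation",
    "Debt consolidation",
    "Debt Consolidation Loan",
])

# Lower-casing and trimming collapses the case-only variants.
normalized = titles.str.strip().str.lower()

print(titles.nunique(), normalized.nunique())  # 4 2
```

Even after normalization, free-text variants such as "debt consolidation loan" remain, which is one more reason to prefer the fixed categories in `purpose`.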

### Step 3 - convert emp_length column to numerical

In [92]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

loans = loans.replace(mapping_dict)
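One property of `replace` is that labels missing from the dictionary are silently left as strings. As a sketch, `Series.map` can be used to check the mapping, because it turns unknown labels into NaN. The label "12 years" below is a hypothetical unexpected value added for the demonstration.

```python
import pandas as pd

# Same mapping as above, built compactly.
emp_length_map = {f"{n} years": n for n in range(2, 10)}
emp_length_map.update({"10+ years": 10, "1 year": 1, "< 1 year": 0, "n/a": 0})

emp = pd.Series(["10+ years", "< 1 year", "n/a", "3 years", "12 years"])

# map() yields NaN for labels that are not in the dictionary.
mapped = emp.map(emp_length_map)
unmapped = emp[mapped.isnull()].unique().tolist()

print(unmapped)  # ['12 years']
```

Note also that the mapping sends both "< 1 year" and "n/a" to 0, so the model cannot tell a short employment history from a missing one.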

### Step 4 - convert int_rate and revol_util to numerical

In [93]:
loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype(float)
loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype(float)

### Step 5 - Removing two date features

In [94]:
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line"], axis=1)
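For completeness, here is a sketch of the kind of feature engineering the date columns would need if we wanted to keep them. It parses `earliest_cr_line` values from the first three rows and converts them to a credit age in months. The reference date is an assumption (the `issue_d` of those rows, Dec-2011), and the feature is not used anywhere in this project.

```python
import pandas as pd

earliest = pd.Series(["Jan-1985", "Apr-1999", "Nov-2001"])
parsed = pd.to_datetime(earliest, format="%b-%Y")

# Assumed reference point: the issue month of the sample rows.
reference = pd.Timestamp("2011-12-01")

# Whole months between the earliest credit line and the reference month.
credit_age_months = (reference.year - parsed.dt.year) * 12 + (reference.month - parsed.dt.month)

print(credit_age_months.tolist())  # [323, 152, 121]
```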

### Step 6 - Removing debt_settlement_flag

In [95]:
loans["debt_settlement_flag"].value_counts()

N    39535
Y      150
Name: debt_settlement_flag, dtype: int64

In [96]:
loans = loans.drop(["debt_settlement_flag"], axis=1)

In [97]:
loans.head(3)

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,acc_now_delinq,delinq_amnt,tax_liens,home_ownership_MORTGAGE,home_ownership_NONE,home_ownership_OTHER,home_ownership_OWN,home_ownership_RENT,verification_status_Not Verified,verification_status_Source Verified,verification_status_Verified,term_ 36 months,term_ 60 months,car,credit_card,debt_consolidation,educational,home_improvement,house,major_purchase,medical,moving,other,renewable_energy,small_business,vacation,wedding
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,0.0,13648.0,83.7,9.0,0.0,0.0,0.0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,0.0,1687.0,9.4,4.0,0.0,0.0,0.0,0,0,0,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,0.0,2956.0,98.5,10.0,0.0,0.0,0.0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [98]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39685 entries, 0 to 39753
Data columns (total 41 columns):
loan_amnt                              39685 non-null float64
int_rate                               39685 non-null float64
installment                            39685 non-null float64
emp_length                             39685 non-null int64
annual_inc                             39685 non-null float64
loan_status                            39685 non-null int64
dti                                    39685 non-null float64
delinq_2yrs                            39685 non-null float64
inq_last_6mths                         39685 non-null float64
open_acc                               39685 non-null float64
pub_rec                                39685 non-null float64
revol_bal                              39685 non-null float64
revol_util                             39685 non-null float64
total_acc                              39685 non-null float64
acc_now_delinq             

## Making predictions

In our target column we noticed a class imbalance: there are about 6 times more positive cases (value 1) than negative cases (value 0). This can cause issues with many machine learning algorithms, and it makes accuracy a misleading metric, since a model that always predicts 1 would already be right for roughly 86% of the loans. That is why, instead of accuracy, we will use the false positive rate and the true positive rate as our error metrics. We will optimize for:
- high recall (true positive rate),
- low fall-out (false positive rate).

**False positive rate** is the number of false positives divided by the number of false positives plus the number of true negatives. This divides all the cases where we thought a loan would be paid off but it wasn't by all the loans that weren't paid off:
*fpr = fp / (fp + tn)*

**True positive rate** is the number of true positives divided by the number of true positives plus the number of false negatives. This divides all the cases where we thought a loan would be paid off and it was by all the loans that were paid off:
*tpr = tp / (tp + fn)*
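The two formulas can be checked on a handful of labels. The helper `rates` below is a hypothetical name introduced for this sketch, and the labels are made up.

```python
import numpy as np

def rates(y_true, y_pred):
    """Return (fpr, tpr) for binary labels where 1 = paid off, 0 = charged off."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 1]

# A mixed set of predictions: tp=3, fn=1, fp=1, tn=1.
fpr, tpr = rates(y_true, [1, 0, 1, 1, 0, 1])
print(fpr, tpr)  # 0.5 0.75

# A model that funds everything gets a perfect tpr but also the worst possible fpr.
fpr_all, tpr_all = rates(y_true, [1] * 6)
print(fpr_all, tpr_all)  # 1.0 1.0
```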

We will start with a logistic regression model.

## Logistic Regression

Due to the class imbalance we will set class_weight to 'balanced' in our model. This tells scikit-learn to penalize the misclassification of the minority class more heavily during training. For the classifier this means that correctly classifying a row where loan_status is 0 is roughly 6 times more important than correctly classifying a row where loan_status is 1.
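The "6 times" figure can be reproduced with the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`. The sketch below plugs in the class counts printed earlier by `value_counts()`. Those counts were taken before the rows with missing values were dropped, so the exact weights used in training will differ slightly.

```python
import numpy as np

# Class counts from the value_counts() output: index 0 = Charged Off, index 1 = Fully Paid.
counts = np.array([5670, 34116])

n_samples = counts.sum()
n_classes = len(counts)

# Documented 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
weights = n_samples / (n_classes * counts)
ratio = weights[0] / weights[1]

print(weights.round(2))  # weight for class 0, then class 1
print(round(ratio, 2))   # 6.02
```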

For prediction we also use cross-validation, to make sure that we never train the model and make predictions on the same set of data.

In [99]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

features = loans[loans.columns.drop("loan_status")]
target = loans["loan_status"]

lr = LogisticRegression(class_weight = 'balanced')
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

fp = len(predictions[(predictions==1) & (loans['loan_status']==0)])
tn = len(predictions[(predictions==0) & (loans['loan_status']==0)])
fpr = fp/(fp+tn)

tp = len(predictions[(predictions==1) & (loans['loan_status']==1)])
fn = len(predictions[(predictions==0) & (loans['loan_status']==1)])
tpr = tp/(tp+fn)

print(fpr)
print(tpr)

0.5869102518623626
0.6273355892305429
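One detail worth checking in the cell above: `pd.Series(predictions)` gets a fresh 0..n-1 index, while `loans` keeps its original row labels after filtering (the `info()` output shows 39685 entries labelled 0 to 39753). Boolean operations between two Series align on labels, so the comparison may pair predictions with the wrong rows. The sketch below, on hypothetical toy data, shows the mismatch and an index-safe way to count the four outcomes by position.

```python
import numpy as np
import pandas as pd

# Toy target whose row labels have gaps, as happens after filtering and dropna().
target = pd.Series([1, 0, 1, 0], index=[0, 2, 3, 5])
raw_predictions = np.array([1, 1, 1, 0])  # hypothetical model output, in row order

# A Series built without an index does not share the target's labels.
naive = pd.Series(raw_predictions)
print(naive.index.equals(target.index))  # False

# Option 1: reuse the target's index when wrapping the predictions.
aligned = pd.Series(raw_predictions, index=target.index)

# Option 2: compare plain arrays, which are matched purely by position.
y_true = target.to_numpy()
y_pred = raw_predictions
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

print(fp / (fp + tn), tp / (tp + fn))  # 0.5 1.0
```

If the metrics are recomputed this way on the real data, the rates may differ from the ones printed in this notebook.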


Our true positive rate is around 63% and our false positive rate is around 59%. The false positive rate being lower than the true positive rate means the model rejects bad loans slightly more often than good ones, so we would do a little better at avoiding bad loans than if we funded everything. However, we would fund only about 63% of the loans that end up fully paid (the true positive rate), so we would turn away a good share of profitable loans.

To improve our model we can set the class weights manually. The dictionary below imposes a penalty of 10 for misclassifying a 0, and a penalty of 1 for misclassifying a 1.

In [100]:
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(class_weight = penalty)
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

fp = len(predictions[(predictions==1) & (loans['loan_status']==0)])
tn = len(predictions[(predictions==0) & (loans['loan_status']==0)])
fpr = fp/(fp+tn)

tp = len(predictions[(predictions==1) & (loans['loan_status']==1)])
fn = len(predictions[(predictions==0) & (loans['loan_status']==1)])
tpr = tp/(tp+fn)

print(fpr)
print(tpr)

0.20060305072720824
0.21279976460203032


It looks like assigning manual penalties lowered the false positive rate to about 20%, and thus lowered our risk. This comes at the expense of the true positive rate, which dropped to about 21%: we are missing opportunities to fund more loans and potentially make more money.

## Random Forest

We are going to apply a more complex algorithm, random forest, which is able to model nonlinear relationships. This model may give us better results if some columns are related to loan_status in a nonlinear way.

In [101]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

penalty = {
    0: 10,
    1: 1
}

rfc = RandomForestClassifier(random_state=1,class_weight = 'balanced')
predictions = cross_val_predict(rfc, features, target, cv=3)
predictions = pd.Series(predictions)

fp = len(predictions[(predictions==1) & (loans['loan_status']==0)])
tn = len(predictions[(predictions==0) & (loans['loan_status']==0)])
fpr = fp/(fp+tn)

tp = len(predictions[(predictions==1) & (loans['loan_status']==1)])
fn = len(predictions[(predictions==0) & (loans['loan_status']==1)])
tpr = tp/(tp+fn)

print(fpr)
print(tpr)

0.9659453706988294
0.9693099897013389


## Conclusion

The random forest didn't improve the false positive rate: at about 97% it is far higher than for either logistic regression model. The model is still predicting mostly 1s, which is a consequence of the class imbalance.

Our best model has a false positive rate of about 20% and a true positive rate of about 21%. This means that about 20% of the loans that end up charged off would still be funded, so the interest earned should be high enough to cover those losses. At the same time, the model would fund only about 21% of the loans that are paid back in full, so the profit made on those loans has to be high enough to cover the losses.