## Introduction

In this project, we'll be going through the full process of data cleaning, data transformation, and finally model prediction, in order to predict whether a particular loan request is likely to be paid off. We'll be doing this with financial lending data from [Lending Club](https://www.lendingclub.com/info/download-data.action), an online marketplace for personal loans that matches borrowers with investors. Naturally, we'll be playing the part of an interested investor who's looking to make a profit.

In [148]:
import pandas as pd

data = pd.read_csv("loans_2007.csv", low_memory = False)
data.head()

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599.0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | 171.62 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1077430 | 1314167.0 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | 119.66 | Sep-2013 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1077175 | 1313524.0 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | 649.91 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1076863 | 1277178.0 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | ... | 357.48 | Apr-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1075358 | 1311748.0 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | ... | 67.79 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## 1. Data Cleaning

The dataset has already been cleaned up to remove some problematic or unnecessary columns, and remove columns with more than 50% of the values missing. However, we still need to look more closely through the columns for a few things:
* Whether the column leaks information about the future
* Whether the column is irrelevant to the borrower's ability to pay back the loan
* Whether a column needs to be reformatted
* Whether a column needs to be processed in order for useful information to be extracted from it
* Whether the column contains redundant information that is present elsewhere in the dataset

Of these, the first is the most important. If a column leaks information about the future and we use it as a feature in our model, we'll end up with an overfit model that won't be any good for predicting outcomes on future loans (since we won't have that information yet).

We also need to select a column as a target column which we'll be trying to predict with our model.

To find more detailed information on the columns in our dataset, we can look at the data dictionary provided by Lending Club, found [here](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit#gid=2081333097).

In [149]:
list(data)

['id',
 'member_id',
 'loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'verification_status',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'earliest_cr_line',
 'inq_last_6mths',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'initial_list_status',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_d',
 'last_pymnt_amnt',
 'last_credit_pull_d',
 'collections_12_mths_ex_med',
 'policy_code',
 'application_type',
 'acc_now_delinq',
 'chargeoff_within_12_mths',
 'delinq_amnt',
 'pub_rec_bankruptcies',
 'tax_liens']

We'll start by cutting some columns. The following columns we'll cut because of data leakage: "funded_amnt", "funded_amnt_inv", "issue_d", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt", "last_credit_pull_d".

We'll cut the following because they're irrelevant, redundant, or hard to transform: "id", "member_id", "grade" (redundant, info in int_rate), "sub_grade" (ditto), "zip_code" (redundant, info in addr_state), "emp_title" (hard to transform).

While loan_status leaks information about the future, we need to make sure we keep it since it's going to serve as our target column. Our goal, after all, is to see which loan requests are worth taking because we expect the borrower to pay off the loan successfully.

In [150]:
data = data.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d", "zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt", "last_credit_pull_d"], axis = 1)
data.shape

(42538, 31)

There are a few columns we may need to transform, such as purpose, title, and addr_state, but we'll leave those for now.

### Cleaning the Target Column

Let's now look more closely at our target column.

In [151]:
data["loan_status"].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

Since we're trying to predict which loans will be successfully paid off, we're only interested in those loans which are either Fully Paid or Charged Off. Those loans which are classified as Current, In Grace Period, Late (16-30 days), Late (31-120 days), and Default are still undetermined.

We're also not interested in loans that were fully paid or charged off but that do not meet Lending Club's credit policy. Such loans would no longer be approved, so including those rows would mean fitting our model to loans of a kind we will never need to evaluate.

This therefore turns our problem into one of binary classification. We're trying to predict whether a given loan will result in one of two values, either fully paid or charged off. We'll remove the rows for those loans whose statuses fall into neither category, and then we'll replace Fully Paid values with 1, and Charged Off with 0.

It should be noted that doing the above results in a class imbalance between the two outcomes. That is, the dataset we're training our model on contains far more loans which were fully paid than loans which were charged off. This can bias the model towards predicting the class with more observations, in this case fully paid. This would be a disaster for us.

While we want to make as much money as possible by evaluating each prospective loan correctly, if the model were skewed towards predicting charged off rather than fully paid, this would be less of a concern. We'd miss out on funding some loans that would actually be good due to the bias of our model, but that would just mean less profit. However, given that our model would be biased towards predicting fully paid, we'd instead be funding loans which were actually bad, which would not just mean less profit, but potentially a loss.

As such, we'll need to remedy this issue, but we'll do so later. For now, we'll go ahead and alter our dataframe to contain only rows with loan statuses of fully paid or charged off.

In [152]:
data = data[(data['loan_status'] == "Fully Paid") | (data['loan_status'] == "Charged Off")]

status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}

data = data.replace(status_replace)
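
To put a number on the class imbalance mentioned above, here is a minimal sketch that rebuilds just the target column from the counts printed earlier (33,136 Fully Paid and 5,634 Charged Off) and measures each class's share. The `status` Series is a stand-in built for illustration, not the real dataframe.

```python
import pandas as pd

# Stand-in for the filtered loan_status column, rebuilt from the printed counts.
status = pd.Series(["Fully Paid"] * 33136 + ["Charged Off"] * 5634)

# Same encoding as above: 1 = Fully Paid, 0 = Charged Off.
encoded = status.map({"Fully Paid": 1, "Charged Off": 0})

# Share of each class in the target column.
class_share = encoded.value_counts(normalize=True)
print(class_share)
```

Roughly 85% of the rows are positives, which is the imbalance we'll need to deal with when we get to modelling.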

### Removing Useless Columns

Some columns may contain only one unique value. These columns will be useless for modelling as they contain no information to differentiate the loans from one another. To remove them, we first need to drop any null values from the columns, then find whether the remaining values are all alike. If they are, we'll add the column name to a list, and finally remove from the dataframe all columns in the list.

In [153]:
drop_columns = []

for col in list(data):
    non_null_unique = data[col].dropna().unique()
    if len(non_null_unique) == 1:
        drop_columns.append(col)

data = data.drop(drop_columns, axis = 1)

print(drop_columns)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
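
The same check can be written more compactly with `DataFrame.nunique()`, which ignores nulls by default. The sketch below runs on a small made-up dataframe (the column names are borrowed from the dataset purely for illustration).

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the loans data.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "policy_code": [1.0, 1.0, 1.0],
    "tax_liens": [0.0, np.nan, 0.0],
})

# nunique() skips nulls by default, so one value plus NaNs still counts as 1.
unique_counts = toy.nunique()
single_value_cols = unique_counts[unique_counts == 1].index.tolist()

toy = toy.drop(columns=single_value_cols)
print(single_value_cols)
```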


After all of this, our dataframe is looking much smaller and easier to manage.

In [154]:
print(data.head())
print(data.shape)

   loan_amnt        term int_rate  installment emp_length home_ownership  \
0     5000.0   36 months   10.65%       162.87  10+ years           RENT   
1     2500.0   60 months   15.27%        59.83   < 1 year           RENT   
2     2400.0   36 months   15.96%        84.33  10+ years           RENT   
3    10000.0   36 months   13.49%       339.31  10+ years           RENT   
5     5000.0   36 months    7.90%       156.46    3 years           RENT   

   annual_inc verification_status  loan_status         purpose  \
0     24000.0            Verified            1     credit_card   
1     30000.0     Source Verified            0             car   
2     12252.0        Not Verified            1  small_business   
3     49200.0     Source Verified            1           other   
5     36000.0     Source Verified            1         wedding   

          ...             dti delinq_2yrs  earliest_cr_line  inq_last_6mths  \
0         ...           27.65         0.0          Jan-1985        

## 2. Transforming the Data

Now that we've selected the columns we'll be using in our model, we need to make sure the data is in suitable form. We'll remove missing values and convert non-numerical data to numerical data, before giving our data one final look over to see if there are any other columns we can remove.

In [155]:
null_counts = data.isnull().sum()
null_counts

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
pub_rec_bankruptcies     697
dtype: int64

There are four columns with missing values: emp_length, title, revol_util, and pub_rec_bankruptcies. The missing rows in title and revol_util represent such a small portion of the dataset that we can safely remove those rows without significantly impacting the data.
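
To see how the four columns compare, we can express the null counts as a percentage of all rows. This is a small sketch using numbers copied from the outputs above; the row total assumes the dataframe holds exactly the Fully Paid and Charged Off loans counted earlier.

```python
# Null counts copied from the output above; the row total is the number of
# Fully Paid plus Charged Off loans kept earlier (33136 + 5634).
total_rows = 33136 + 5634
null_counts = {
    "emp_length": 1036,
    "title": 11,
    "revol_util": 50,
    "pub_rec_bankruptcies": 697,
}

# Percentage of rows that are missing a value in each column.
missing_pct = {col: 100 * n / total_rows for col, n in null_counts.items()}

for col, pct in missing_pct.items():
    print(f"{col}: {pct:.2f}%")
```

Under that assumption, title and revol_util are each missing in a fraction of a percent of rows, while emp_length and pub_rec_bankruptcies are each missing in around 2 to 3 percent.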

The emp_length and pub_rec_bankruptcies columns are a little more difficult, as rows with missing values in either of these columns represent a much larger proportion of the dataset. We have three options for dealing with these missing values. We could remove the rows with missing values in either of these two columns, we could remove the columns themselves, or we could fill in the missing values with an average of the values in each column. Let's take a closer look at these two columns to get a better idea of the information they contain before we make our decision.

In [156]:
emp_length_counts = data["emp_length"].value_counts()
emp_length_counts

10+ years    8547
< 1 year     4527
2 years      4308
3 years      4026
4 years      3362
5 years      3209
1 year       3183
6 years      2181
7 years      1718
8 years      1444
9 years      1229
Name: emp_length, dtype: int64

In [157]:
pub_rec_bank_counts = data["pub_rec_bankruptcies"].value_counts()
pub_rec_bank_counts

0.0    36422
1.0     1646
2.0        5
Name: pub_rec_bankruptcies, dtype: int64

The first option above (removing rows with missing values) is unappealing because we'll lose a lot of other information in the dataset that those rows contain.

The second option (removing the columns) looks like the better choice for the pub_rec_bankruptcies column. It would be hard to replace the missing values with any kind of average, as the median, the mode, and the rounded mean would all be 0. That might be a realistic prediction for most of those values, but we have no way of knowing whether that is the case. Furthermore, since so many of the values in the column are 0, we're not losing that much information by removing the column in its entirety. This may turn out to be a mistake if pub_rec_bankruptcies is strongly correlated with loan status, but we'll proceed by removing the column for now.

As for emp_length, the second option also looks reasonable. Assigning a mean value would be impossible without dealing with the 10+ years category (as we have no idea precisely how many years each individual in this category has been employed). Assigning the mode or the median would be possible, but neither would be representative of the range of values in emp_length. Prima facie, we also wouldn't expect emp_length to be that strongly correlated with loan_status. So we'll cut this column as well.

In [158]:
data = data.drop(["pub_rec_bankruptcies", "emp_length"], axis = 1)
data = data.dropna()

We now need to convert text columns to numerical data types. We'll look more closely at the non-numeric columns in our dataset using the select_dtypes method.

In [159]:
print(data.dtypes.value_counts())

float64    10
object      9
int64       1
dtype: int64


In [160]:
object_df = data.select_dtypes(include = ["object"])

In [161]:
object_df.head()

| | term | int_rate | home_ownership | verification_status | purpose | title | addr_state | earliest_cr_line | revol_util |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 months | 10.65% | RENT | Verified | credit_card | Computer | AZ | Jan-1985 | 83.7% |
| 1 | 60 months | 15.27% | RENT | Source Verified | car | bike | GA | Apr-1999 | 9.4% |
| 2 | 36 months | 15.96% | RENT | Not Verified | small_business | real estate business | IL | Nov-2001 | 98.5% |
| 3 | 36 months | 13.49% | RENT | Source Verified | other | personel | CA | Feb-1996 | 21% |
| 5 | 36 months | 7.90% | RENT | Source Verified | wedding | My wedding loan I promise to pay back | AZ | Nov-2004 | 28.3% |


Two easy steps come first. We'll convert the int_rate and revol_util columns to floats, as both contain numeric values stored as percentage strings, and we'll remove the earliest_cr_line column, as it contains a date that would need further processing to be useful.

In [162]:
data["int_rate"] = data["int_rate"].str.rstrip("%").astype("float")
data["revol_util"] = data["revol_util"].str.rstrip("%").astype("float")

data = data.drop(["earliest_cr_line"], axis = 1)
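
For reference, the further processing that earliest_cr_line would need is not large. Here is a hedged sketch of one possibility, parsing the `Mon-YYYY` strings and keeping only the year. It uses three sample values copied from the table above and assumes English month abbreviations; it is not a step this project actually takes.

```python
import pandas as pd

# Sample values copied from the earliest_cr_line column shown above.
earliest = pd.Series(["Jan-1985", "Apr-1999", "Nov-2001"])

# Parse the "Mon-YYYY" strings, then keep just the year as a numeric feature.
parsed = pd.to_datetime(earliest, format="%b-%Y")
earliest_year = parsed.dt.year
print(earliest_year.tolist())
```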

Let's now check to see if any of the columns are categorical, as most look as if they may be. We can check this by looking at the number of unique values in each column (and also by looking at the documentation).

In [163]:
cols = ["term", "home_ownership", "verification_status", "purpose", "title", "addr_state"]

values = {}

for col in cols:
    values[col] = object_df[col].value_counts()
    
values

{'term':  36 months    29042
  60 months     9667
 Name: term, dtype: int64, 'home_ownership': RENT        18514
 MORTGAGE    17112
 OWN          2984
 OTHER          96
 NONE            3
 Name: home_ownership, dtype: int64, 'verification_status': Not Verified       16698
 Verified           12289
 Source Verified     9722
 Name: verification_status, dtype: int64, 'purpose': debt_consolidation    18130
 credit_card            5039
 other                  3865
 home_improvement       2897
 major_purchase         2154
 small_business         1763
 car                    1510
 wedding                 929
 medical                 680
 moving                  576
 vacation                375
 house                   369
 educational             320
 renewable_energy        102
 Name: purpose, dtype: int64, 'title': Debt Consolidation                         2104
 Debt Consolidation Loan                    1632
 Personal Loan                               642
 Consolidation                 

Most of the columns contain a reasonable number of discrete categories and so can be encoded as dummy variables to be used in our model. However, the addr_state column contains many discrete values, and so would require adding many new columns if we were to represent the information in the column with dummy variables. We'll cut the column for that reason.

Additionally, purpose and title seem to contain similar information, and many of the different values in title are unique given that the borrower can give their loan application any title. Thus, we'll remove the title column and keep the purpose column.

In [164]:
data = data.drop(["addr_state", "title"], axis = 1)

In [165]:
cat_cols = ["term", "home_ownership", "verification_status", "purpose"]

dummy_df = pd.get_dummies(data[cat_cols])
data = pd.concat([data, dummy_df], axis = 1)

In [166]:
list(data)

['loan_amnt',
 'term',
 'int_rate',
 'installment',
 'home_ownership',
 'annual_inc',
 'verification_status',
 'loan_status',
 'purpose',
 'dti',
 'delinq_2yrs',
 'inq_last_6mths',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'term_ 36 months',
 'term_ 60 months',
 'home_ownership_MORTGAGE',
 'home_ownership_NONE',
 'home_ownership_OTHER',
 'home_ownership_OWN',
 'home_ownership_RENT',
 'verification_status_Not Verified',
 'verification_status_Source Verified',
 'verification_status_Verified',
 'purpose_car',
 'purpose_credit_card',
 'purpose_debt_consolidation',
 'purpose_educational',
 'purpose_home_improvement',
 'purpose_house',
 'purpose_major_purchase',
 'purpose_medical',
 'purpose_moving',
 'purpose_other',
 'purpose_renewable_energy',
 'purpose_small_business',
 'purpose_vacation',
 'purpose_wedding']

In [167]:
data = data.drop(["term", "home_ownership", "verification_status", "purpose"], axis = 1)
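
The encode, concatenate, and drop steps above can be seen end to end on a tiny example. This sketch uses a made-up three-row dataframe with values copied from the outputs above; `toy` and `encoded` are illustrative names only.

```python
import pandas as pd

# Toy rows standing in for the loans data.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "term": [" 36 months", " 60 months", " 36 months"],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

cat_cols = ["term", "home_ownership"]

# One indicator column per category, named "<column>_<value>".
dummies = pd.get_dummies(toy[cat_cols])

# Attach the indicators and drop the original text columns.
encoded = pd.concat([toy.drop(columns=cat_cols), dummies], axis=1)
print(list(encoded.columns))
```

Each row ends up with exactly one indicator set per original column, which is what lets the model treat the categories as numbers.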

We're now done with data cleaning and transformation, which means we're ready to start making models to predict whether a loan will be paid off on time.

## 3. Making Models

Before we make our model, we need to decide on an error metric which we can use to estimate the accuracy of our model, and to compare the accuracy of models to one another.

Above, we filtered the data in the loan_status column so that it only contained those loans which had either been paid off or which were never going to be paid off. In doing this, we turned our loan_status column into a binary column, thereby making our goal one of binary classification.

In this instance, our goal is to make money. To do this, we'll need to identify accurately both loans that will be paid off (which we can take on) and loans that will not be paid off (which we can avoid). We therefore want our model to avoid making predictions that are false positives or false negatives. A false positive occurs when our model predicts that a loan will be paid off, but it actually is not. A false negative occurs when our model predicts that a loan will not be paid off, but it actually is. Each false positive results in us losing money, because we take a loan which will not be paid off. Each false negative results in us missing out on money, because we decided not to take a loan that actually would have been paid off.

As discussed above, false positives are far worse for us than false negatives, because at least with a false negative we don't lose money, we only forgo profit. However, we want to maximize profit, and so we want to make sure we're taking advantage of every favorable opportunity, and avoiding every unfavorable opportunity as much as possible. We therefore want to minimize the ratio of false positives to true positives, and the ratio of false negatives to true negatives.

N.B. Maximizing profit is actually a little more complex than this. In general we want to avoid both false positives and false negatives, but in practice any model involves trade-offs. A model with more false positives that also identifies high-value true positives at a high rate may make more money than a model with fewer false positives that is bad at identifying high-value true positives. We also care more about high-value loans than low-value ones: a model that predicts the outcome of high-value loans very accurately will likely make us more money than one with higher overall accuracy but worse accuracy on high-value loans. We'll ignore these intricacies here, however, and look only at ratios of negatives and positives.

Now, we also discussed above the class imbalance in the loan_status column. There are far more loans that were paid off than were not paid off. This is an issue when evaluating a model's accuracy because a model could predict 1 for every row and still appear to have a high accuracy (in our case, it would be correct more than 85% of the time).

With this in mind, a good approach is to optimize our model for a high true positive rate (the ratio of true positives to all actual positives, that is, true positives plus false negatives) and a low false positive rate (the ratio of false positives to all actual negatives, that is, false positives plus true negatives). We can then calculate what true positive and false positive rates we need to be profitable, based on the average amount we profit from a successful loan and the average amount we lose on a failed loan.
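
The two rates can be written out directly from their definitions. The sketch below uses a hypothetical helper, `rates`, and a handful of made-up labels rather than the loans data.

```python
def rates(actual, predicted):
    """Return (tpr, fpr) for two equal-length sequences of 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    tpr = tp / (tp + fn)   # share of good loans we correctly accept
    fpr = fp / (fp + tn)   # share of bad loans we wrongly accept
    return tpr, fpr

# Made-up labels: 1 = paid off, 0 = charged off.
actual    = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0]
tpr, fpr = rates(actual, predicted)
print(tpr, fpr)
```

Here three of the four good loans are accepted (tpr of 0.75) and one of the two bad loans is accepted (fpr of 0.5).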

Given that our problem is one of binary classification, we'll begin by using a logistic regression model to predict whether a loan will be paid off. We'll use cross validation so that we train a logistic regression model using different train and test sets created from our dataset.

In [173]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

cols = data.columns
train_cols = cols.drop("loan_status")
features = data[train_cols]
target = data["loan_status"]

lr = LogisticRegression()
predictions = cross_val_predict(lr, features, target, cv = 5)

predictions = pd.Series(predictions)

In [179]:
# Finds all true positive results
tp_filter = (predictions == 1) & (data["loan_status"] == 1)
# Finds all false positive results
fp_filter = (predictions == 1) & (data["loan_status"] == 0)
# Finds all true negative results
tn_filter = (predictions == 0) & (data["loan_status"] == 0)
# Finds all false negative results
fn_filter = (predictions == 0) & (data["loan_status"] == 1)

In [180]:
tp = len(predictions[tp_filter])
fp = len(predictions[fp_filter])
tn = len(predictions[tn_filter])
fn = len(predictions[fn_filter])

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr, fpr)

0.9986645133238089 0.9976094152261861
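
One detail worth checking when counting outcomes this way: `cross_val_predict` returns a plain NumPy array, so wrapping it in `pd.Series` gives it a fresh 0, 1, 2, ... index, whereas `data["loan_status"]` keeps the original row labels (note the jump from 3 to 5 in the `data.head()` output earlier). pandas pairs up values by index label when combining two Series, not by position. The sketch below, on made-up data, shows one way to keep the two aligned by reusing the target's index. It is an illustration only, not the cell that produced the numbers above.

```python
import numpy as np
import pandas as pd

# Toy target whose index has gaps, as happens after rows are filtered out.
target = pd.Series([1, 0, 1, 1, 0, 1], index=[0, 1, 2, 3, 5, 8])
# Made-up predictions as a plain array, the form cross_val_predict returns.
raw_pred = np.array([1, 0, 1, 0, 0, 1])

# Reuse the target's index so row i of the predictions pairs with row i of the target.
pred = pd.Series(raw_pred, index=target.index)

tp = int(((pred == 1) & (target == 1)).sum())
fp = int(((pred == 1) & (target == 0)).sum())
tn = int(((pred == 0) & (target == 0)).sum())
fn = int(((pred == 0) & (target == 1)).sum())

print(tp / (tp + fn), fp / (fp + tn))
```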


These initial numbers are terrible. We want our true positive rate (tpr) to be high, and our false positive rate (fpr) to be low. We have one, but not the other.

To put these numbers more into perspective, think about what our tpr and fpr would be if our model only predicted 1's. The tpr would be 1. The model always predicts positive results, and so it will always correctly predict any positive result. Our fpr would also be 1. That is, our model would incorrectly predict every result that was actually negative.
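
The all-1's baseline is easy to work through by hand. This sketch uses the class counts from the loan_status output earlier, which is an approximation since a few rows were dropped afterwards.

```python
# Class counts taken from the loan_status output earlier in the notebook.
n_good, n_bad = 33136, 5634

# A model that predicts 1 for every loan accepts all good and all bad loans.
tp, fn = n_good, 0
fp, tn = n_bad, 0

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
accuracy = (tp + tn) / (n_good + n_bad)

print(tpr, fpr, round(accuracy, 3))
```

So a model with no discriminating power at all still scores about 85% accuracy, which is why accuracy alone tells us little here.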

Now, we want a high tpr, but we also want a low fpr, because only with a low fpr is our model becoming more refined in how it makes its predictions. The lower the fpr, the fewer bad loans we wrongly accept relative to the total number of bad loans, and the less money we'll lose by accepting bad loans.

So, we want to change our model to improve the fpr whilst still keeping a high tpr.

As for why the tpr and fpr are as bad as they are, we need to look at what's going on under the hood of our model. This explanation will be brief, but I plan to cover the topic in more detail in my blog in the near future.

Consider that each row in our dataset corresponds to a single data point. Logistic regression models work by determining whether that data point is more likely to belong to one of two classes, in our case, whether it represents a loan that will or will not be paid off.

To establish the likelihood of a point belonging to a class, the model treats each row as a point in Euclidean space, with one dimension per feature, and defines a plane which divides the data points into two groups. All of the points on one side of the plane are assigned one class (e.g. will be paid off), and the other points the other class (e.g. will not be paid off). Thus, the prediction rests entirely on the positioning of this plane.

The positioning of this plane, however, is not arbitrary. The logistic regression algorithm will try out various planes based on the features passed to the algorithm (e.g. loan_amnt, term, int_rate as above). Each plane will divide the dataset into the two classes differently. How well each plane does this will be evaluated by calculating the likelihood that each datapoint in the dataset is classified correctly. The plane which does this the best is the one selected by the algorithm to use as the model, and it is from this model that predictions are made.

The issue is that when the logistic regression algorithm selects a model, every row counts equally, so it selects the model that fits the dataset best overall. As it turns out, given that our target column has a class imbalance, the model that does this best is one that predicts 1's almost all of the time (that is, the plane that divides the datapoints will place almost all of the datapoints on one side of the plane).
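
As a rough illustration of the plane idea: a fitted logistic regression boils down to one weight per feature plus a bias, and the plane is the set of points where the weighted sum is exactly zero (a probability of 0.5). The weights, features, and helper names below are made up for the sketch and are not fitted to the loans data.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of class 1 under a logistic regression model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-score))

def predict(features, weights, bias):
    """Class 1 if the point lies on the positive side of the plane."""
    return 1 if predict_proba(features, weights, bias) >= 0.5 else 0

# Made-up weights for two made-up features.
weights, bias = [0.8, -1.2], 0.1

print(predict([2.0, 0.5], weights, bias))  # score = 0.1 + 1.6 - 0.6 = 1.1
print(predict([0.5, 2.0], weights, bias))  # score = 0.1 + 0.4 - 2.4 = -1.9
```

Moving the weights or the bias moves the plane, and with it the side on which each loan falls.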

We need to fix this to improve our model with regard to the measures we care about, namely the tpr and fpr. We can do this in two ways:
- Use oversampling and undersampling to ensure that each time the algorithm is run, it has a balanced number of each target class
- Tell the algorithm to penalize misclassifications of the less prevalent class (in our case, the negative class) more than those of the other class

The first way to do this is difficult, since we'd need to get an equal number of 1s and 0s by one of the following:
- use only a small portion of our positive data so we have as little of it as we do the negative data
- copy the negative data until we have as much of it as the positive data
- generate fake negative data until we have as much of it as the positive data
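
For a sense of what the first of these would involve, here is a minimal undersampling sketch on a made-up dataframe of 10 good loans and 3 bad ones. The names `toy`, `good_sample`, and `balanced` are illustrative; this is not a step we apply to the real data.

```python
import pandas as pd

# Toy dataset: 10 paid-off loans (1) and 3 charged-off loans (0).
toy = pd.DataFrame({
    "loan_amnt": range(1000, 14000, 1000),
    "loan_status": [1] * 10 + [0] * 3,
})

good = toy[toy["loan_status"] == 1]
bad = toy[toy["loan_status"] == 0]

# Undersampling: keep only as many good loans as there are bad ones.
good_sample = good.sample(n=len(bad), random_state=1)
balanced = pd.concat([good_sample, bad])

print(balanced["loan_status"].value_counts())
```

The classes are now even, but 7 of the 10 good loans have been thrown away, which is the cost of this approach.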

The better option, therefore, is to penalize misclassifications of the negative class more strongly. We can do this by setting the class_weight parameter of the logistic regression instance to "balanced". This means the algorithm will be more concerned with correctly classifying rows where the loan_status value is 0. This will lower the accuracy for rows where loan_status is 1, but increase it for rows where loan_status is 0. This will likely lead to a lower tpr, but it should also lead to a lower fpr, which is very important in this case.

In [190]:
balanced_lr = LogisticRegression(class_weight = "balanced")
bal_pred = cross_val_predict(balanced_lr, features, target, cv = 5)

bal_pred = pd.Series(bal_pred)
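
For intuition about what "balanced" means in practice, scikit-learn documents the weights as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula by hand to the class counts printed earlier. Those counts are an assumption here, since the final training data has slightly fewer rows after `dropna()`.

```python
# Class counts from the loan_status output earlier; treat as approximate.
counts = {0: 5634, 1: 33136}

n_samples = sum(counts.values())
n_classes = len(counts)

# scikit-learn's documented "balanced" heuristic:
# weight = n_samples / (n_classes * count_of_that_class)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

print(weights)
print(weights[0] / weights[1])
```

On these counts, a mistake on a charged-off loan weighs roughly six times as much as a mistake on a fully paid one.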

In [191]:
tp_filter = (bal_pred == 1) & (data["loan_status"] == 1)
fp_filter = (bal_pred == 1) & (data["loan_status"] == 0)
tn_filter = (bal_pred == 0) & (data["loan_status"] == 0)
fn_filter = (bal_pred == 0) & (data["loan_status"] == 1)

tp = len(bal_pred[tp_filter])
fp = len(bal_pred[fp_filter])
tn = len(bal_pred[tn_filter])
fn = len(bal_pred[fn_filter])

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr, fpr)

0.6358780048450214 0.6324016182420007


We're moving in the right direction, but we can still do better. We can reduce the false positive rate further by increasing the penalty for misclassifying negative cases. We do this by passing a dictionary to the class_weight parameter which maps each class to its weight.

In [192]:
penalty = {0: 10, 1: 1}
pen_lr = LogisticRegression(class_weight = penalty)
pen_pred = cross_val_predict(pen_lr, features, target, cv = 5)
pen_pred = pd.Series(pen_pred)

In [193]:
tp_filter = (pen_pred == 1) & (data["loan_status"] == 1)
fp_filter = (pen_pred == 1) & (data["loan_status"] == 0)
tn_filter = (pen_pred == 0) & (data["loan_status"] == 0)
fn_filter = (pen_pred == 0) & (data["loan_status"] == 1)

tp = len(pen_pred[tp_filter])
fp = len(pen_pred[fp_filter])
tn = len(pen_pred[tn_filter])
fn = len(pen_pred[fn_filter])

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr, fpr)

0.20116156282998943 0.19510849577050385


We're now substantially worse at identifying good loans, which means we're missing out on profit. But we're also much better at avoiding false positives, which will mean losing less money to bad loans.

However, our ratio of tpr to fpr hasn't improved, meaning our model is still not doing any better for us in terms of profit than the model that predicted all 1s.

The above numbers mean that we're catching 1 in every 5 good loans, and we're incorrectly taking 1 in every 5 bad loans. There are around 6 times as many good loans as bad loans, so if we evaluate 140 loans, about 120 will be good and 20 will be bad. We'll take 24 of the good ones and 4 of the bad ones, meaning 1 in 7 of the loans we take is bad. That's just the same as if we took every loan!
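
The arithmetic above can be checked with exact fractions. The numbers are the rounded ones used in the paragraph (120 good loans, 20 bad loans, and rates of 1 in 5), not the model's exact output.

```python
from fractions import Fraction

good, bad = 120, 20                 # 140 loans at roughly 6 good : 1 bad
tpr, fpr = Fraction(1, 5), Fraction(1, 5)

taken_good = good * tpr             # good loans the model accepts
taken_bad = bad * fpr               # bad loans the model accepts

bad_share_model = taken_bad / (taken_good + taken_bad)
bad_share_take_all = Fraction(bad, good + bad)

print(taken_good, taken_bad, bad_share_model, bad_share_take_all)
```

Whenever the tpr and fpr are equal, the share of bad loans among those we accept matches the share in the whole pool, so the model is no better than accepting everything.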

If return on 6/7 loans is good enough, then that's fine, but really we'd still like to improve our model's predictive ability (and, as it happens, return on 6/7 loans is certainly not good enough).

Let's use a random forest to try to improve our tpr and fpr one final time.

In [194]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(class_weight = "balanced")
rf_pred = cross_val_predict(rf, features, target, cv = 5)
rf_pred = pd.Series(rf_pred)


In [195]:
tp_filter = (rf_pred == 1) & (data["loan_status"] == 1)
fp_filter = (rf_pred == 1) & (data["loan_status"] == 0)
tn_filter = (rf_pred == 0) & (data["loan_status"] == 0)
fn_filter = (rf_pred == 0) & (data["loan_status"] == 1)

tp = len(rf_pred[tp_filter])
fp = len(rf_pred[fp_filter])
tn = len(rf_pred[tn_filter])
fn = len(rf_pred[fn_filter])

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr, fpr)

0.9669234113920119 0.9639573372563442


Unfortunately, using a random forest classifier seems to have actually reduced our tpr:fpr ratio, which is certainly not what we want.

## 4. Conclusion

None of the models we used above did a very good job predicting whether a borrower would pay back their loan. We could try to improve our models by tweaking the penalties, by trying different types of models, by using different combinations of columns, by handling columns with missing values differently, by tuning parameters, or by ensembling models of different types. For now, though, we'll leave it here.