# Predicting Loan Defaults
The goal of this project is to develop a model to predict whether a loan is likely to default or not before being funded. We'll be working on a financial dataset from Lending Club, a marketplace for personal loans. 

The dataset is contained within loans_2007.csv, a pre-cleaned file (original file: [LoanStats3a.csv](https://resources.lendingclub.com/LoanStats3a.csv.zip)) from which columns containing more than 50% missing values were already dropped, together with extraneous text and two descriptive columns which weren't useful for machine learning. For reference:

```python
import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
```
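
To make the 50% rule concrete, here is a minimal sketch on a made-up toy frame (the column names and values are purely illustrative). It assumes an integer `thresh`, which `dropna` reads as the minimum number of non-missing values a column must have to be kept.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): "b" is 75% missing, "c" is 50% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 5.0, np.nan, 6.0],
})

# thresh = minimum number of non-missing values a column needs to survive.
half_count = len(toy) // 2
kept = toy.dropna(thresh=half_count, axis=1)
print(list(kept.columns))
```

Note that a column with exactly half of its values missing is kept; only columns with more than 50% missing values are dropped.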

The dataset contains data for loans funded between 2007 and 2011, so by now the final status of the large majority of them should be known. By studying which past loans defaulted and which were paid in full, we will develop a tool for predicting the outcome of future loans.

In [1]:
import pandas as pd

In [2]:
loans = pd.read_csv("loans_2007.csv",low_memory=False)
loans.shape

(42538, 52)

In [3]:
loans.head(3)

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599.0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | 171.62 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1077430 | 1314167.0 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | 119.66 | Sep-2013 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1077175 | 1313524.0 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | 649.91 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [4]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
id                            42538 non-null object
member_id                     42535 non-null float64
loan_amnt                     42535 non-null float64
funded_amnt                   42535 non-null float64
funded_amnt_inv               42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
grade                         42535 non-null object
sub_grade                     42535 non-null object
emp_title                     39909 non-null object
emp_length                    41423 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
issue_d                       42535 non-null object
loan_status                   42535 non-null object
p

# Column selection
Not all columns will be useful.
- Particular attention must be given to columns that leak future data, i.e. data which would not be available beforehand to the person funding a loan. Since what we want to create is a tool for predicting whether a loan is likely to default BEFORE funding it, such columns would make the model look better than it really is and leave it unusable on new loans.
- Some columns may be redundant, or may not provide useful data.
- Some columns may be problematic and difficult to categorize.

A description of all the columns can be found [here](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit#gid=2081333097) (note that the list contains columns that were dropped in the pre-cleaning of the data).

Based on column descriptions, we can drop:

#### Not useful:
- "id"
- "member_id"

#### Difficult to categorize:
- "emp_title"

#### Redundant:
- "grade" (due to "int_rate")
- "sub_grade" (due to "int_rate")
- "zip_code" (due to "addr_state")

Note that "int_rate" is a continuous variable and is therefore the best suited to keep of the three.

#### Leaks future data:
- "funded_amnt"
- "funded_amnt_inv"
- "issue_d"
- "out_prncp"
- "out_prncp_inv"
- "total_pymnt"
- "total_pymnt_inv"
- "total_rec_prncp"
- "total_rec_int"
- "total_rec_late_fee"
- "recoveries"
- "collection_recovery_fee"
- "last_pymnt_d"
- "last_pymnt_amnt"
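
One way to keep the reasons attached to the column names is to store them in a dictionary and flatten it before dropping. The sketch below is only illustrative: the dictionary is abbreviated, `drop_reasons`, `to_drop` and the toy frame are hypothetical names, and the check for absent columns is an extra safety step not taken in this notebook.

```python
import pandas as pd

# Reasons and column names follow the lists above (abbreviated here).
drop_reasons = {
    "not useful": ["id", "member_id"],
    "difficult to categorize": ["emp_title"],
    "redundant": ["grade", "sub_grade", "zip_code"],
    "leaks future data": ["funded_amnt", "recoveries", "last_pymnt_d"],
}
to_drop = [col for cols in drop_reasons.values() for col in cols]

# Toy, empty frame standing in for the real data; only some columns exist.
toy = pd.DataFrame(columns=["id", "loan_amnt", "grade", "recoveries", "int_rate"])

# drop() raises on unknown labels, so separate the names that are present.
missing = [c for c in to_drop if c not in toy.columns]
present = [c for c in to_drop if c in toy.columns]
toy = toy.drop(columns=present)
print(list(toy.columns))
```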

In [5]:
drop_me = [
    "id",
    "member_id",
    "funded_amnt",
    "funded_amnt_inv",
    "grade","sub_grade",
    "emp_title",
    "issue_d",
    "zip_code",
    "out_prncp",
    "out_prncp_inv",
    "total_pymnt",
    "total_pymnt_inv",
    "total_rec_prncp",
    "total_rec_int",
    "total_rec_late_fee",
    "recoveries",
    "collection_recovery_fee",
    "last_pymnt_amnt",
    "last_pymnt_d"
]
loans = loans.drop(drop_me,axis=1)

# Target selection
We now need to select a target feature for our model to predict. We want to predict whether a loan will default or not, so "loan_status" is the ideal candidate. Let's look at the possible values of this column.

In [6]:
loans["loan_status"].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

Ideally, we want to study the final status of loans in a binary paid/defaulted configuration. Therefore it would be good to drop all other rows. Thankfully, "Fully Paid" or "Charged Off" (defaulted) make up the large majority of all cases, so we can simply keep these ones and drop the others.

"Does not meet the credit policy" indicates loans that no longer meet current credit policy and would not be issued today. We want to develop a model to use moving forward on future loans, so conservatively we may drop these rows too for the moment.

In [7]:
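# Keep only loans with a final status; .copy() avoids modifying a view of the original frame
loans = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
replace_dict = {
    "Fully Paid": 1,
    "Charged Off": 0
}
# Encode the target as 1 (paid) / 0 (defaulted)
loans["loan_status"] = loans["loan_status"].map(replace_dict)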

In [8]:
# Let's check if all went correctly
print(loans["loan_status"].value_counts())

1    33136
0     5634
Name: loan_status, dtype: int64


Finally, let's drop the columns that contain only NaN or only a single unique value, since these won't contribute to the model in any useful way.

In [9]:
drop_cols = []
for u in loans.columns:
    # nunique() ignores NaN, so it counts the distinct non-missing values
    if loans[u].nunique() <= 1: # 0 = all NaN, 1 = a single unique value
        drop_cols.append(u) # add the column to the list of columns to be dropped
loans = loans.drop(drop_cols,axis=1)
print(drop_cols)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
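
The same check can be written without an explicit loop. The following is a minimal sketch on a made-up toy frame; the column names are borrowed from the dataset, but the values and the `all_missing` column are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "policy_code": [1.0, 1.0, 1.0],           # one value everywhere
    "tax_liens": [0.0, np.nan, 0.0],          # one value plus NaN
    "all_missing": [np.nan, np.nan, np.nan],  # hypothetical all-NaN column
    "loan_amnt": [5000.0, 2500.0, 2400.0],    # genuinely varying
})

# nunique() ignores NaN by default, so an all-NaN column counts 0 values.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```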


In [10]:
# We may want to save the clean dataset at this point
loans.to_csv("filtered_loans_2007.csv",index=False)

# Data preparation
In our next step we'll prepare the data for machine learning. Most estimators in scikit-learn cannot handle missing or non-numeric values, therefore:
- Missing values must be removed or replaced.
- Categoricals must be rendered into numeric columns.

In [11]:
# Let's check how many missing values we have left.
null_counts = loans.isnull().sum()
print(null_counts)

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64


Two of the columns contain many missing values, while three others have a few of them missing. Let's look at the two major ones.

In [12]:
loans["emp_length"].value_counts(dropna=False)

10+ years    8547
< 1 year     4527
2 years      4308
3 years      4026
4 years      3362
5 years      3209
1 year       3183
6 years      2181
7 years      1718
8 years      1444
9 years      1229
NaN          1036
Name: emp_length, dtype: int64

This column could be easily converted to a numerical one. Let's try to keep as much data as possible and substitute NaN with 0. We'll also convert 10+ years to 10 and < 1 year to 0.

In [13]:
subs_dict = {
    "10+ years": 10,
    "9 years": 9,
    "8 years": 8,
    "7 years": 7,
    "6 years": 6,
    "5 years": 5,
    "4 years": 4,
    "3 years": 3,
    "2 years": 2,
    "1 year": 1,
    "< 1 year": 0
}
loans["emp_length"] = loans["emp_length"].replace(subs_dict)
loans["emp_length"] = loans["emp_length"].fillna(0)
loans["emp_length"].value_counts()

10.0    8547
0.0     5563
2.0     4308
3.0     4026
4.0     3362
5.0     3209
1.0     3183
6.0     2181
7.0     1718
8.0     1444
9.0     1229
Name: emp_length, dtype: int64

In [14]:
loans["pub_rec_bankruptcies"].value_counts(dropna=False)

 0.0    36422
 1.0     1646
NaN       697
 2.0        5
Name: pub_rec_bankruptcies, dtype: int64

This column records publicly recorded bankruptcies, and is thus potentially very impactful for our model. Conservatively, let's drop the rows where it is null so we don't risk biasing the model. But first, let's fill the missing values in the other three columns with each column's mode.

In [15]:
fill_me = ["title","revol_util","last_credit_pull_d"]
for f in fill_me:
    loans[f] = loans[f].fillna(loans[f].mode()[0])
loans.isnull().sum()

loan_amnt                 0
term                      0
int_rate                  0
installment               0
emp_length                0
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
title                     0
addr_state                0
dti                       0
delinq_2yrs               0
earliest_cr_line          0
inq_last_6mths            0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util                0
total_acc                 0
last_credit_pull_d        0
pub_rec_bankruptcies    697
dtype: int64

In [16]:
#loans = loans.drop("pub_rec_bankruptcies",axis=1)
loans = loans.dropna()
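
At this point "pub_rec_bankruptcies" is the only column with missing values left, so a plain `dropna()` removes exactly those rows. A more explicit variant, sketched here on a made-up toy frame, names the column through the `subset` parameter so that NaN values in any other column would not silently remove rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0],
    "pub_rec_bankruptcies": [0.0, np.nan, 1.0, np.nan],
})

# Only rows missing the named column are removed.
cleaned = toy.dropna(subset=["pub_rec_bankruptcies"])
print(len(toy), len(cleaned))
```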

In [17]:
loans.isnull().sum()

loan_amnt               0
term                    0
int_rate                0
installment             0
emp_length              0
home_ownership          0
annual_inc              0
verification_status     0
loan_status             0
purpose                 0
title                   0
addr_state              0
dti                     0
delinq_2yrs             0
earliest_cr_line        0
inq_last_6mths          0
open_acc                0
pub_rec                 0
revol_bal               0
revol_util              0
total_acc               0
last_credit_pull_d      0
pub_rec_bankruptcies    0
dtype: int64

All right! Missing values have been dealt with. Let's now look at categoricals.

In [18]:
object_columns = loans.select_dtypes(include=["object"]).columns
object_columns

Index(['term', 'int_rate', 'home_ownership', 'verification_status', 'purpose',
       'title', 'addr_state', 'earliest_cr_line', 'revol_util',
       'last_credit_pull_d'],
      dtype='object')

In [19]:
# Let's quickly look at the unique values for these variables. The output won't be pretty, but it'll serve the purpose
for c in object_columns:
    print(loans[c].value_counts())

 36 months    28399
 60 months     9674
Name: term, dtype: int64
 10.99%    928
 11.49%    795
  7.51%    787
 13.49%    760
  7.88%    725
  7.49%    652
  9.99%    593
  7.90%    574
  5.42%    573
 11.71%    563
 11.99%    492
 10.37%    469
 12.69%    452
  6.03%    447
  8.49%    440
 12.99%    414
  5.79%    410
 12.42%    407
 10.65%    404
  7.29%    397
  6.62%    396
  8.90%    386
 11.86%    383
  9.63%    378
  9.91%    357
 10.59%    349
  5.99%    347
 14.27%    344
  7.14%    341
  6.99%    336
          ... 
 15.76%      2
 17.15%      2
 20.20%      2
 15.07%      2
 15.38%      2
 21.82%      2
  9.51%      1
 21.48%      1
 13.93%      1
 12.36%      1
 13.30%      1
 12.49%      1
 10.91%      1
 22.64%      1
 10.28%      1
 11.22%      1
 10.46%      1
 17.54%      1
 18.72%      1
 17.03%      1
 16.01%      1
 16.20%      1
 17.44%      1
 17.46%      1
 16.96%      1
 20.52%      1
  9.83%      1
 24.40%      1
 24.59%      1
  9.01%      1
Name: int_rate, Leng

A few things to note:
- "int_rate" and "revol_util" are really numerical columns; we only need to strip the percent character and change the type to float.
- "addr_state" contains a large number of categories, which would bog down the model once the dummy columns are extracted. We should get rid of this one.
- "purpose" and "title" are in fact redundant. Of the two, "purpose" is much more orderly, while "title" contains many free-form entries, a large number of which occur only once. Let's drop "title".
- "earliest_cr_line" and "last_credit_pull_d" are date columns which would take a great deal of effort to make usable. Let's drop these too.
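
As an aside, should one prefer to keep a date column rather than drop it, a possible first step is to parse it and keep only the year as a number. This is only a sketch: the sample values are made up, and it assumes that "earliest_cr_line" uses the same "Mon-YYYY" format seen in "last_credit_pull_d" (e.g. "Jun-2016"). The column name `earliest_cr_year` is hypothetical.

```python
import pandas as pd

# Made-up sample values in the assumed "Mon-YYYY" format.
toy = pd.DataFrame({"earliest_cr_line": ["Jan-1985", "Apr-1999", "Nov-2001"]})

# Parse with an explicit format, then keep only the year as a number.
parsed = pd.to_datetime(toy["earliest_cr_line"], format="%b-%Y")
toy["earliest_cr_year"] = parsed.dt.year
print(toy["earliest_cr_year"].tolist())
```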

Once all of this is done, we can finally add the dummy columns to the dataset.

In [20]:
drop_me = ["addr_state", "title", "earliest_cr_line","last_credit_pull_d"]
loans = loans.drop(drop_me,axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")

In [21]:
dums = ["home_ownership","verification_status","purpose","term"]
# Replace each categorical column with its dummy columns (prefixed with the column name)
loans = pd.get_dummies(loans,columns=dums)
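
To see what the dummy columns look like, here is a minimal sketch on a made-up toy frame. The "term" values appear to carry a leading space in the `value_counts()` output above, so the sketch strips it first; that step is optional and is not performed in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "term": [" 36 months", " 60 months", " 36 months"],  # leading space assumed
})

toy["term"] = toy["term"].str.strip()            # optional: tidy the labels
encoded = pd.get_dummies(toy, columns=["term"])  # one column per category
print(list(encoded.columns))
```

Each category becomes its own indicator column, and the original "term" column is removed.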

In [22]:
# We may want to save the final dataset at this point
loans.to_csv("loans_final.csv",index=False)

# Target prediction
In the final step we'll try to develop the best possible method for predicting the target variable ("loan_status"). Two things must be kept in mind:
- The dataset contains many more paid loans than defaulted ones. This makes plain accuracy a misleading measure: a model that predicts every single loan as paid would look decent, since it would be wrong in only about 1/7 of the cases (14.5%). We should therefore use class weights in our classifier.
- We must choose the right metrics for evaluating the model. Since arguably what we want to do with this model is maximize profit (i.e. true positives, loans we fund that are paid off) and minimize losses (i.e. false positives, loans we fund that end up defaulting), the true positive rate (TPR) and false positive rate (FPR) are the metrics we need. The relative importance of the two depends on the investor: a conservative investor will want to minimize losses even at the cost of lower earnings. In general, reducing false positives will reduce true positives too.
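
For a sense of scale on class weights: scikit-learn documents its `"balanced"` mode as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python using the class counts from the `value_counts()` output above. Those counts predate the later row drops, so the resulting weights are approximate.

```python
from collections import Counter

# Class counts taken from the earlier value_counts() output (approximate,
# since some rows are dropped afterwards): 1 = paid, 0 = defaulted.
y = [1] * 33136 + [0] * 5634

counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
balanced = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print({c: round(w, 3) for c, w in balanced.items()})
print("ratio default/paid:", round(balanced[0] / balanced[1], 2))
```

Under this heuristic a defaulted loan weighs roughly six times as much as a paid one, which gives a reference point for manually chosen weights.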

For reference:
- TPR = true positives / (true positives + false negatives) i.e. true positives / all positives
- FPR = false positives / (false positives + true negatives) i.e. false positives / all negatives
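
The two definitions can be checked on a small made-up example. The helper `fpr_tpr` below is a hypothetical name for a plain-Python sketch, not a library function; it assumes the labelling used in this notebook (1 = paid, 0 = defaulted).

```python
def fpr_tpr(actual, predicted):
    """Return (FPR, TPR) for binary labels where 1 = paid and 0 = defaulted."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)

# Made-up labels: 6 paid loans and 4 defaulted loans.
actual    = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

fpr, tpr = fpr_tpr(actual, predicted)
print("FPR:", fpr, "TPR:", round(tpr, 3))
```

Here one of the four defaulted loans is wrongly funded (FPR = 1/4) and four of the six paid loans are correctly funded (TPR = 4/6).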

### NOTE
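Given all of the above, *we aim for an FPR below 0.145, the share of defaulted loans in the dataset*. Note that this is a target chosen for this project rather than a strict random baseline: a classifier that picks loans at random has an FPR equal to its TPR, so a useful model must also keep its TPR above its FPR.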

In [23]:
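# Reload the data we just saved. Reading it back gives the rows a fresh 0..n-1 index (rows were dropped earlier),
# which is most likely why the comparisons against pd.Series(predictions) below do not work without this step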
loans = pd.read_csv("loans_final.csv",low_memory=False)
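
A likely explanation for needing the reload is index alignment: filtering and `dropna()` keep the original row labels, while a new `pd.Series` built from the predictions is labelled 0..n-1. The sketch below illustrates this on made-up data and shows `reset_index(drop=True)` as a possible alternative to reloading; this alternative is an assumption and is not used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"loan_status": [1, 0, 0, 1, 1, 0]})
filtered = toy.drop(index=[1, 4])   # rows removed, original labels kept
print(list(filtered.index))         # labels are no longer 0..n-1

preds = pd.Series([1, 1, 1, 0])     # fresh Series: labels 0..3

# Resetting the index lines the two Series up position by position.
aligned = filtered.reset_index(drop=True)
tp = int(((preds == 1) & (aligned["loan_status"] == 1)).sum())
fp = int(((preds == 1) & (aligned["loan_status"] == 0)).sum())
print(tp, fp)
```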

# Logistic regression

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

features = loans.drop("loan_status",axis=1)
target = loans["loan_status"]

lr = LogisticRegression(class_weight="balanced")

predictions = cross_val_predict(lr,features,target,cv=3)
predictions = pd.Series(predictions)

tn = len(predictions[(predictions == 0) & (loans["loan_status"] == 0)])
tp = len(predictions[(predictions == 1) & (loans["loan_status"] == 1)])
fn = len(predictions[(predictions == 0) & (loans["loan_status"] == 1)])
fp = len(predictions[(predictions == 1) & (loans["loan_status"] == 0)])

fpr = fp/(fp+tn)
tpr = tp/(tp+fn)
print("FPR: ",fpr,"\nTPR: ",tpr)

FPR:  0.3756345177664975 
TPR:  0.6654175753294223


In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

features = loans.drop("loan_status",axis=1)
target = loans["loan_status"]

weight = {
    0: 10,
    1: 1    
}
lr = LogisticRegression(class_weight=weight)

predictions = cross_val_predict(lr,features,target,cv=3)
predictions = pd.Series(predictions)
preds1 = predictions.copy()

tn = len(predictions[(predictions == 0) & (loans["loan_status"] == 0)])
tp = len(predictions[(predictions == 1) & (loans["loan_status"] == 1)])
fn = len(predictions[(predictions == 0) & (loans["loan_status"] == 1)])
fp = len(predictions[(predictions == 1) & (loans["loan_status"] == 0)])

fpr = fp/(fp+tn)
tpr = tp/(tp+fn)
print("FPR: ",fpr,"\nTPR: ",tpr)

FPR:  0.09300217548948514 
TPR:  0.25155880455815954


# Random forest

In [29]:
from sklearn.ensemble import RandomForestClassifier

features = loans.drop("loan_status",axis=1)
target = loans["loan_status"]

rf = RandomForestClassifier(class_weight="balanced",n_estimators=10,min_samples_leaf=5,max_depth=12,random_state=0)

predictions = cross_val_predict(rf,features,target,cv=3)
predictions = pd.Series(predictions)

tn = len(predictions[(predictions == 0) & (loans["loan_status"] == 0)])
tp = len(predictions[(predictions == 1) & (loans["loan_status"] == 1)])
fn = len(predictions[(predictions == 0) & (loans["loan_status"] == 1)])
fp = len(predictions[(predictions == 1) & (loans["loan_status"] == 0)])

fpr = fp/(fp+tn)
tpr = tp/(tp+fn)
print("FPR: ",fpr,"\nTPR: ",tpr)

FPR:  0.5792240754169689 
TPR:  0.7947906748164757


In [30]:
from sklearn.ensemble import RandomForestClassifier

features = loans.drop("loan_status",axis=1)
target = loans["loan_status"]

weight = {
    0: 10,
    1: 1    
}
rf = RandomForestClassifier(class_weight=weight,n_estimators=10,min_samples_leaf=5,max_depth=12,random_state=0)

predictions = cross_val_predict(rf,features,target,cv=3)
predictions = pd.Series(predictions)

tn = len(predictions[(predictions == 0) & (loans["loan_status"] == 0)])
tp = len(predictions[(predictions == 1) & (loans["loan_status"] == 1)])
fn = len(predictions[(predictions == 0) & (loans["loan_status"] == 1)])
fp = len(predictions[(predictions == 1) & (loans["loan_status"] == 0)])

fpr = fp/(fp+tn)
tpr = tp/(tp+fn)
print("FPR: ",fpr,"\nTPR: ",tpr)

FPR:  0.3694706308919507 
TPR:  0.6081948582486101


In [31]:
from sklearn.ensemble import RandomForestClassifier

features = loans.drop("loan_status",axis=1)
target = loans["loan_status"]

weight = {
    0: 50,
    1: 1    
}
rf = RandomForestClassifier(class_weight=weight,n_estimators=10,min_samples_leaf=5,max_depth=12,random_state=0)

predictions = cross_val_predict(rf,features,target,cv=3)
predictions = pd.Series(predictions)
preds2 = predictions.copy()

tn = len(predictions[(predictions == 0) & (loans["loan_status"] == 0)])
tp = len(predictions[(predictions == 1) & (loans["loan_status"] == 1)])
fn = len(predictions[(predictions == 0) & (loans["loan_status"] == 1)])
fp = len(predictions[(predictions == 1) & (loans["loan_status"] == 0)])

fpr = fp/(fp+tn)
tpr = tp/(tp+fn)
print("FPR: ",fpr,"\nTPR: ",tpr)

FPR:  0.09372733865119652 
TPR:  0.23279171913874128
