# Credit Modelling - Lending Club Loans Assessments

### We will be trying to predict the potential for loan repayment given loan applications.


### 1. Data that we will be using:

Lending Club releases data for all of the approved and declined loan applications periodically on their [website](https://www.lendingclub.com/investing/peer-to-peer). We have elected to use data from the period 2007-2011, as many of these loans will have already been paid off (and the ones not paid off are unlikely to be).

A breakdown of all the data columns can be found in this data dictionary file hosted on [Google Drive](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit#gid=2081333097).

### 2. Data Cleaning:

Let's begin by loading in our data and examining our columns and applying some domain knowledge.

In [1]:
import pandas as pd

loans = pd.read_csv(r'C:\Users\Panda\Documents\4 Coding & Work\1 DataQuest Files\Working Files\loans_2007.csv')
print(loans.columns)
print(loans.shape)
loans.head()

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code',
       'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv',
       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'policy_code', 'application_type',
       'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
       'pub_rec_bankruptcies', 'tax_liens'],
      dtype='object')
(42538, 52)


  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


Taking a look at those columns and the data dictionary (documentation), let's go ahead and remove these for the following reasons:

- <u>id</u>: randomly generated field by Lending Club for unique identification purposes only
- <u>member_id</u>: also a randomly generated field by Lending Club for unique identification purposes only
- <u>funded_amnt</u>: leaks data from the future (after the loan is already started to be funded)
- <u>funded_amnt_inv</u>: also leaks data from the future (after the loan is already started to be funded)
- <u>grade</u>: contains redundant information as the interest rate column (int_rate)
- <u>sub_grade</u>: also contains redundant information as the interest rate column (int_rate)
- <u>emp_title</u>: requires other data and a lot of processing to potentially be useful
- <u>issue_d</u>: leaks data from the future (after the loan is already completely funded)
- <u>zip_code</u>: redundant with the addr_state column since only the first 3 digits of the 5-digit zip code are visible (which can only be used to identify the state the borrower lives in)
- <u>out_prncp</u>: leaks data from the future, (after the loan already started to be paid off)
- <u>out_prncp_inv</u>: also leaks data from the future, (after the loan already started to be paid off)
- <u>total_pymnt</u>: also leaks data from the future, (after the loan already started to be paid off)
- <u>total_pymnt_inv</u>: also leaks data from the future, (after the loan already started to be paid off)
- <u>total_rec_prncp</u>: also leaks data from the future, (after the loan already started to be paid off)
- <u>total_rec_int</u>: leaks data from the future, (after the loan has started to be paid off),
- <u>total_rec_late_fee</u>: leaks data from the future, (after the loan has started to be paid off),
- <u>recoveries</u>: leaks data from the future, (after the loan has started to be paid off),
- <u>collection_recovery_fee</u>: leaks data from the future, (after the loan has started to be paid off),
- <u>last_pymnt_d</u>: leaks data from the future, (after the loan has started to be paid off),
- <u>last_pymnt_amnt</u>: leaks data from the future, (after the loan has started to be paid off)

In [2]:
drop_cols = ['id', 'member_id', 'funded_amnt', 'funded_amnt_inv', 
             'grade', 'sub_grade', 'emp_title', 'issue_d', 'zip_code', 
             'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 
             'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 
             'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 
             'last_pymnt_amnt']
loans.drop(columns=drop_cols, inplace=True)
loans.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'loan_status',
       'pymnt_plan', 'purpose', 'title', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'initial_list_status',
       'last_credit_pull_d', 'collections_12_mths_ex_med', 'policy_code',
       'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths',
       'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens'],
      dtype='object')

Great - looks like our remaining columns will not leak any future data, and should prove useful for our analysis.

---

Next, let's decide how we will classify a loan as paid off or not (this will be the target output of our model) using the <u>loan-status</u> column.

In [3]:
loans['loan_status'].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

Here, it looks like only the top two results will be useful to us (since the others either are retrospective, or have such minimal values).

**Note**: since this is our target column, we will need to be very aware of bias in our results, since there are much more 'Fully Paid' loans than 'Charged Off'. 

- E.g. Our model might simply always predict 'Fully Paid' and still be correct 85%+ of the time in testing... however, this would be very unreliable in practice.

---

Let's remove any other values, then convert to a binary classification model.

In [4]:
loans = loans[(loans['loan_status'] == 'Fully Paid') | 
                        (loans['loan_status'] == 'Charged Off')]

binary_mapper = {'Fully Paid': 1, 'Charged Off': 0} # 1 means loan fully paid
loans['loan_status'].replace(binary_mapper, inplace=True)

---

As a final step in the initial Data Cleaning, let's remove any columns that only contain 1 value (and would therefore be useless in modelling).

In [5]:
# create a list of columns with just 1 unique value
drop_cols_2 = []
for col in loans.columns:
    col_vals = loans[col].dropna() # first, remove the NAs
    col_unique_vals = col_vals.unique()
    if len(col_unique_vals) == 1: # check if there is just 1 unique value
        drop_cols_2.append(col)

loans.drop(columns=drop_cols_2, inplace=True)

# print the columns we have removed
print('Dropped columns: {}'.format(drop_cols_2))
print('\n')
# print the remaining columns 
print('Remaining columns: {}'.format(list(loans.columns)))

Dropped columns: ['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']


Remaining columns: ['loan_amnt', 'term', 'int_rate', 'installment', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'loan_status', 'purpose', 'title', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'last_credit_pull_d', 'pub_rec_bankruptcies']


---

### 3. Feature Engineering:

Now that our data has been somewhat cleaned, let's examine the columns in details and engineer them to best fit our binary classification model that we will be building & testing in the next step.

Let's start with the null values.

In [6]:
null_counts = loans.isnull().sum()
print(null_counts[null_counts > 0].sort_values(ascending=False))
print('\n')
print(loans.shape)

emp_length              1036
pub_rec_bankruptcies     697
revol_util                50
title                     11
last_credit_pull_d         2
dtype: int64


(38770, 23)


- Those bottom 3 columns have a relatively low null values count, let's simply remove those rows with null values. 

- Domain knowledge tells us that Employment Length is quite an important metric in predicting loan repayment. Even though there are 1000+ nulls (out of 38,000+ rows, let's keep this column and just remove the rows with nulls).

- Let's take a closer look at the <u>pub_rec_bankruptcies</u> data.

In [7]:
loans['pub_rec_bankruptcies'].value_counts(normalize=True, dropna=False)

0.0    0.939438
1.0    0.042456
NaN    0.017978
2.0    0.000129
Name: pub_rec_bankruptcies, dtype: float64

It looks like this is a not very useful column, with 94% of the data coming under 1  catergory. 

That being said, number of recorded bankruptcies could be very important in loans analysis.

Let's pivot that column.

In [8]:
loans['pub_rec_bankruptcies'].fillna(3.0, inplace=True)
loans[['pub_rec_bankruptcies', 'loan_status']].pivot_table(index='pub_rec_bankruptcies', values='loan_status')

Unnamed: 0_level_0,loan_status
pub_rec_bankruptcies,Unnamed: 1_level_1
0.0,0.858684
1.0,0.777035
2.0,0.6
3.0,0.830703


3.0's in the table above are the N/A's. The mean <u>loan_status</u> is similar to that of the 0 bankruptcies (only slightly lower), so let's just fill the N/A's with 0.

In [9]:
loans['pub_rec_bankruptcies'].fillna(0.0, inplace=True)
loans.dropna(axis=0, inplace=True)

---

With null values dealt with, let's now look at column data types:

In [10]:
loans.dtypes.value_counts()

float64    11
object     11
int64       1
dtype: int64

We'll need to do some manipulation on the 'object' columns:

In [11]:
objects_df = loans.select_dtypes(include=['object'])
objects_df.head()

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Sep-2013
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,98.5%,Jun-2016
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,21%,Apr-2016
5,36 months,7.90%,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004,28.3%,Jan-2016


Firstly, we can simply tweak and convert these two columns <u>int_rate</u> and <u>revol_util</u> to float.

Second, <u>earliest_cr_line</u> and <u>last_credit_pull_d</u> are date columns that will require some feature engineering. For now, we will remove them.

For the remaining columns, we'll check the unique value counts before deciding how to proceed:

In [12]:
cols_checking = ['home_ownership', 'verification_status', 'emp_length', 
                 'term', 'addr_state', 'purpose', 'title']
for col in cols_checking:
    print(pd.DataFrame(objects_df[col].value_counts()))
    print('\n')

          home_ownership
RENT               18112
MORTGAGE           16686
OWN                 2778
OTHER                 96
NONE                   3


                 verification_status
Not Verified                   16281
Verified                       11856
Source Verified                 9538


           emp_length
10+ years        8545
< 1 year         4513
2 years          4303
3 years          4022
4 years          3353
5 years          3202
1 year           3176
6 years          2177
7 years          1714
8 years          1442
9 years          1228


             term
 36 months  28234
 60 months   9441


    addr_state
CA        6776
NY        3614
FL        2704
TX        2613
NJ        1776
IL        1447
PA        1442
VA        1347
GA        1323
MA        1272
OH        1149
MD        1008
AZ         807
WA         788
CO         748
NC         729
CT         711
MI         678
MO         648
MN         581
NV         466
SC         454
WI         427
OR         422
L

Alright, we should be able to encode most of these categorical columns as dummy variables, with a few modifications and exceptions:

1. <u>emp_length</u>: let's make this numerical, and also take some liberties with the '10+ years' and '< 1 year' and 'n/a' values.
2. <u>addr_state</u>: though possibly useful, there are simply too many unique values which could slow down our program.
3. <u>purpose</u> vs <u>title</u>: there appears to be some overlap here, and <u>title</u> has too many values and is pretty *messy*, so let's just keep and encode <u>purpose</u>.


In [13]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

drop_columns = ['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line']
loans.drop(columns=drop_columns, inplace=True)

for col in ['int_rate', 'revol_util']:
    loans[col] = loans[col].str.rstrip('%').astype('float')
    
loans.replace(mapping_dict, inplace=True)

---

Finally, we'll encode those columns and our features should be ready!

In [14]:
dummies_cols = ['home_ownership', 'verification_status', 'purpose', 'term']
for col in dummies_cols:
    dummy_df = pd.get_dummies(loans[[col]])
    loans = pd.concat([loans, dummy_df], axis=1)
    loans.drop(columns=col, inplace=True)
loans.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0,0,0,0,0,1,0,0,1,0
3,10000.0,13.49,339.31,10,49200.0,1,20.0,0.0,1.0,10.0,...,0,0,0,1,0,0,0,0,1,0
5,5000.0,7.9,156.46,3,36000.0,1,11.2,0.0,3.0,9.0,...,0,0,0,0,0,0,0,1,1,0


In [15]:
null_counts = loans.isnull().sum()
print(null_counts[null_counts > 0].sort_values(ascending=False))
print('\n')
print(loans.shape)

Series([], dtype: int64)


(37675, 39)


---

### 4. Training the Model:

We now have our data cleaned and our features prepared. Let's start modelling.

As a reminder, our target column that we will be predicting is <u>loan_status</u>, and we will need to be careful of bias given that 85%+ of our binary results are 1's.

Further, since we are talking about loans, if we were advising a conservative investor, they would be much more sensitive to False Positives than False Negatives.

- i.e. they would rather **miss** loans that **would** have been paid, than **approve** loans that are then **not** paid.

With these concepts in mind, let's start by building some True Negative, True Positive, False Negative and False Positive filters. We will be using these as our **error metrics**.

In [16]:
def true_false_matrix(target, predictions):
    tn_filter = (predictions == 0) & (target == 0)
    tn = len(predictions[tn_filter])

    tp_filter = (predictions == 1) & (target == 1)
    tp = len(predictions[tp_filter])

    fn_filter = (predictions == 0) & (target == 1)
    fn = len(predictions[fn_filter])

    fp_filter = (predictions == 1) & (target == 0)
    fp = len(predictions[fp_filter])
    
    mat_sum = tn + tp + fn + fp
    
    print(' True Negatives: {:,} ({}%)'.format(tn, round(100 * tn / mat_sum, 1)))
    print(' True Positives: {:,} ({}%)'.format(tp, round(100 * tp / mat_sum, 1)))
    print('False Negatives: {:,} ({}%)'.format(fn, round(100 * fn / mat_sum, 1)))
    print('False Positives: {:,} ({}%)'.format(fp, round(100 * fp / mat_sum, 1)))
    print('-----------------------------')
    print(' True Positive Rate: {}%'.format(round(100 * tp / (tp + fn), 2)))
    print('False Positive Rate: {}%'.format(round(100 * fp / (fp + tn), 2)))
    print('-----------------------------')
    print(' Accuracy: {}%'.format(round(100 * (tp + tn) / mat_sum, 2)))

Note: we want to aim for a **high True Positive Rate (TPR)** and a **low False Positive Rate (FPR)**, with an emphasis on the **FPR**!

---

Alright, time to start model building. 

Since we have binary classification problem, let's begin with a simple **Logistic Regression** model:

In [17]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=150) # was not converging with default 100

features = loans.drop(columns='loan_status')
target = pd.Series(loans['loan_status'])

lr.fit(features, target)

predictions = lr.predict(features)

print('Logistic Regression Metrics:')
true_false_matrix(target, predictions)

Logistic Regression Metrics:
 True Negatives: 23 (0.1%)
 True Positives: 32,243 (85.6%)
False Negatives: 43 (0.1%)
False Positives: 5,366 (14.2%)
-----------------------------
 True Positive Rate: 99.87%
False Positive Rate: 99.57%
-----------------------------
 Accuracy: 85.64%


Here we have a True Positives making up 85.7%. Further our TPR and crucially our FPR are almost 100%. 

A high FPR is bad, and also these are strong indicators of bias due to the imbalance between the <u>loan_status</u> results. We'll need to solve for this in our modelling.

---

Before that, let's perform **k-fold cross validation** to try to reduce the overfitting of the model.

In [18]:
from sklearn.model_selection import cross_val_predict

lr = LogisticRegression(max_iter=150)

predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

print('Logistic Regression (k-fold Cross Validation) Metrics:')
true_false_matrix(target, predictions)

Logistic Regression (k-fold Cross Validation) Metrics:
 True Negatives: 7 (0.0%)
 True Positives: 30,464 (85.6%)
False Negatives: 48 (0.1%)
False Positives: 5,058 (14.2%)
-----------------------------
 True Positive Rate: 99.84%
False Positive Rate: 99.86%
-----------------------------
 Accuracy: 85.65%


--- 

Let's discuss how we can now solve this issue of the imbalanced results leading to model bias.

1. One option would be to take 50/50 split sample - the issue here is we would either need to remove a lot of useful data, or the training sets would be relatively small since it's such a large disparity at the moment (85/15).
2. The more straighforward option in sklearn is to use the 'class_weight' parameter, either by setting it to 'balanced', or even by further tweaking it to our needs. This will more heavily penalize the model when misclassifying the minority class.

In [19]:
lr = LogisticRegression(max_iter=180, class_weight='balanced')

predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

print('Logistic Regression (k-fold Cross Validation, Balanced Class) Metrics:')
true_false_matrix(target, predictions)

Logistic Regression (k-fold Cross Validation, Balanced Class) Metrics:
 True Negatives: 2,565 (7.2%)
 True Positives: 15,463 (43.5%)
False Negatives: 15,049 (42.3%)
False Positives: 2,500 (7.0%)
-----------------------------
 True Positive Rate: 50.68%
False Positive Rate: 49.36%
-----------------------------
 Accuracy: 50.67%


So the good news here is that massively reduced the number of False Positives as well as the FPR. 

The bad news is that the True Positives and TPR were also reduced, however we are less concerned by that.

---

Let's try another model with an even harsher penalty on the class_weight parameter:

In [20]:
penalty = {0: 10,
           1: 1}

lr = LogisticRegression(max_iter=180, class_weight=penalty)

predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

print('Logistic Regression (k-fold Cross Validation, Harsh Penalty) Metrics:')
true_false_matrix(target, predictions)

Logistic Regression (k-fold Cross Validation, Harsh Penalty) Metrics:
 True Negatives: 4,216 (11.9%)
 True Positives: 5,206 (14.6%)
False Negatives: 25,306 (71.1%)
False Positives: 849 (2.4%)
-----------------------------
 True Positive Rate: 17.06%
False Positive Rate: 16.76%
-----------------------------
 Accuracy: 26.48%


Once again we have massively reduced our FPR, even at the expense of reducing the TPR. For the conservative investor, this is a good result. 

Let's now move onto a different model and see if we can achieve any better results.

---

We'll now build a **Random Forest Classifier** model.

In [21]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=1, class_weight='balanced')

predictions = cross_val_predict(rfc, features, target, cv=3)
predictions = pd.Series(predictions)

print('Random Forest Classifier (k-fold Cross Validation, Balanced Class) Metrics:')
true_false_matrix(target, predictions)

Random Forest Classifier (k-fold Cross Validation, Balanced Class) Metrics:
 True Negatives: 17 (0.0%)
 True Positives: 30,394 (85.4%)
False Negatives: 118 (0.3%)
False Positives: 5,048 (14.2%)
-----------------------------
 True Positive Rate: 99.61%
False Positive Rate: 99.66%
-----------------------------
 Accuracy: 85.48%


Hmm.. this result is akin to our Logistic Regression Model with zero class balancing penalties, i.e. not great.

---

Let's try Random Forest again this time with the harsher penalties applied.

In [22]:
rfc = RandomForestClassifier(random_state=1, class_weight=penalty)

predictions = cross_val_predict(rfc, features, target, cv=3)
predictions = pd.Series(predictions)

print('Random Forest Classifier (k-fold Cross Validation, Harsh Penalty) Metrics:')
true_false_matrix(target, predictions)

Random Forest Classifier (k-fold Cross Validation, Harsh Penalty) Metrics:
 True Negatives: 10 (0.0%)
 True Positives: 30,416 (85.5%)
False Negatives: 96 (0.3%)
False Positives: 5,055 (14.2%)
-----------------------------
 True Positive Rate: 99.69%
False Positive Rate: 99.8%
-----------------------------
 Accuracy: 85.52%


Even with a harsh penalty applied, it looks as though the Random Forest Classifier is simply not suited to solve this problem.

---

Let's try out a few other models that are suitable for binary classification models. We'll add a few intuitive parameters based on what we've learned so far.

In [23]:
models = {}

penalty = {0: 12,
           1: 1}

# Logistic Regression
from sklearn.linear_model import LogisticRegression
models['Logistic Regression'] = LogisticRegression(max_iter=170, class_weight=penalty)

# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier(random_state=1, class_weight=penalty)

# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier(random_state=1, class_weight=penalty, max_depth=5)

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier(n_neighbors=5, weights='distance')

In [24]:
# modified for multi-model analysis
def true_false_matrix_2(target, predictions):
    tn_filter = (predictions == 0) & (target == 0)
    tn = len(predictions[tn_filter])

    tp_filter = (predictions == 1) & (target == 1)
    tp = len(predictions[tp_filter])

    fn_filter = (predictions == 0) & (target == 1)
    fn = len(predictions[fn_filter])

    fp_filter = (predictions == 1) & (target == 0)
    fp = len(predictions[fp_filter])
    
    mat_sum = tn + tp + fn + fp
    
    return round(100 * tp / (tp + fn), 2) , round(100 * fp / (fp + tn), 2) , round(100 * (tp + tn) / mat_sum, 2)

In [25]:
# build a table to more easily compare all models
TPR, FPR, Accuracy = {}, {}, {}

for key in models.keys():
    predictions = pd.Series(cross_val_predict(models[key], features, target, cv=3))
    TPR[key], FPR[key], Accuracy[key] = true_false_matrix_2(target, predictions)
    
df_model = pd.DataFrame(index=models.keys(), columns=['TPR (%)', 'FPR (%)', 'Accuracy (%)'])
df_model['TPR (%)'] = TPR.values()
df_model['FPR (%)'] = FPR.values()
df_model['Accuracy (%)'] = Accuracy.values()
    
df_model

Unnamed: 0,TPR (%),FPR (%),Accuracy (%)
Logistic Regression,8.09,7.36,20.12
Random Forest,99.64,99.78,85.49
Decision Trees,32.13,30.58,37.43
Naive Bayes,96.68,96.56,83.41
K-Nearest Neighbor,94.8,95.34,81.97


---

### Conclusion

At this point, I'm at a bit of a loss with how to proceed with building a reliable model to predict the <u>loan_status</u>.

I think the best next steps would be to revisit **Step 3. Feature Engineering** above and rethink which columns are or aren't useful. 

Besides that, another thing to try could be in **Step 4. Training the Model**, select a training set whereby the binary result is 50/50 (even if this is a much smaller dataset. This would negate any class weight bias without the need to assign those model parameters.