# Modeling a Borrower's Credit Risk
----
The focus of this project is on credit modelling, a well known data science problem that focuses on modeling a borrower's credit risk. 
Credit has played a key role in the economy for centuries and some form of credit has existed since the beginning of commerce. Data is about lending data from Lending Club. 
Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.
The aim of the project is to answer the following question:

- Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?


In [1]:
import pandas as pd
import numpy as np

loans_2007 = pd.read_csv('loans_2007.csv')
print(loans_2007.iloc[0])
print(loans_2007.shape[1])


id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                       

  interactivity=interactivity, compiler=compiler, result=result)


# Data Cleaning
----
The Dataframe contains many columns and can be cumbersome to try to explore all at once. 
Let's break up the columns into 3 groups of 18 columns and use the data dictionary to become familiar with what each column represents. As you understand each feature, you want to pay attention to any features that:

- Leak information from the future (after the loan has already been funded)
- Don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
- Formatted poorly and need to be cleaned up
- Require more data or a lot of processing to turn into a useful feature
- Contain redundant information

After analyzing first 18 columns, we can conclude that the following features need to be removed:

- id: randomly generated field by Lending Club for unique identification purposes only
- member_id: also a randomly generated field by Lending Club for unique identification purposes only
- funded_amnt: leaks data from the future (after the loan is already started to be funded)
- funded_amnt_inv: also leaks data from the future (after the loan is already started to be funded)
- grade: contains redundant information as the interest rate column (int_rate)
- sub_grade: also contains redundant information as the interest rate column (int_rate)
- emp_title: requires other data and a lot of processing to potentially be useful
- issue_d: leaks data from the future (after the loan is already completed funded)

In [2]:
# From First 18 columns
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)

From the second group of columns, remove following columns:

- zip_code: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which only can be used to identify the state the borrower lives in)
- out_prncp: leaks data from the future, (after the loan already started to be paid off)
- out_prncp_inv: also leaks data from the future, (after the loan already started to be paid off)
- total_pymnt: also leaks data from the future, (after the loan already started to be paid off)
- total_pymnt_inv: also leaks data from the future, (after the loan already started to be paid off)
- total_rec_prncp: also leaks data from the future, (after the loan already started to be paid off)

In [3]:
#from Second 18 columns
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

In the last group of columns, we need to drop the following columns:

- total_rec_int: leaks data from the future, (after the loan already started to be paid off),
- total_rec_late_fee: also leaks data from the future, (after the loan already started to be paid off),
- recoveries: also leaks data from the future, (after the loan already started to be paid off),
- collection_recovery_fee: also leaks data from the future, (after the loan already started to be paid off),
- last_pymnt_d: also leaks data from the future, (after the loan already started to be paid off),
- last_pymnt_amnt: also leaks data from the future, (after the loan already started to be paid off).

In [4]:
# from Last group
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_2007.iloc[0]) # first row of loans_2007
print(loans_2007.shape[1]) # number of columns

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               J

## Identifing Target Column
----
We should use the loan_status column
- it's the only column that directly describes if a loan was paid off on time, had delayed payments, or was defaulted on the borrower. 
- This column contains text values and we need to convert it to a numerical one for training a model.

In [18]:
loans = loans_2007.copy()
# to return the frequency of the unique values in the loan_status column
loans_2007['loan_status'].value_counts()

1    32286
0     5389
Name: loan_status, dtype: int64


| Loan Status                                       | Meaning |
----------------------------------------------------|------------------------------|
|Fully Paid                                         | Loan has been fully paid off.|
|Charged Off                                        | Loan for which there is no longer a reasonable expectation of further payments.|
|Does not meet the credit policy. Status:Fully Paid	| While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.|
|Does not meet the credit policy. Status:Charged Off| While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.|
|In Grace Period	                                |The loan is past due but still in the grace period of 15 days.|
|Late (16-30 days)	                                | Loan hasn't been paid in 16 to 30 days (late on the current payment).|
|Late (31-120 days)	                                | Loan hasn't been paid in 31 to 120 days (late on the current payment).|
|Current                                            |	Loan is up to date on current payments.|
|Default                                            |	Loan is defaulted on and no payment has been made for more than 121 days.|

From the investor's perspective, we're interested in trying to predict which **loans will be paid off on time** and which ones won't be. 

Only the **Fully Paid** and **Charged Off** values describe the final outcome of the loan.
Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a **binary classification** one. 

Let's remove all the loans that don't contain either **Fully Paid** and **Charged Off** as the loan's status and then transform the **Fully Paid** values to **1** for the positive case and the **Charged Off** values to **0** for the negative case

In [6]:
loans_2007 = loans_2007[(loans_2007['loan_status']=='Fully Paid')| (loans_2007['loan_status']=='Charged Off')]


status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}

loans_2007 = loans_2007.replace(status_replace)
loans_2007.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
5,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,n,...,f,Jan-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [7]:
# Remove single value columns
drop_columns = []

for col in loans_2007.columns:
    unique_non_null = loans_2007[col].dropna().unique()
    
    if len(unique_non_null) <= 1:
        drop_columns.append(col)

loans_2007 = loans_2007.drop(drop_columns, axis =1)
loans_2007

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,...,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,last_credit_pull_d,pub_rec_bankruptcies
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,credit_card,...,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,Jun-2016,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,car,...,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,Sep-2013,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,small_business,...,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,Jun-2016,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,other,...,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,Apr-2016,0.0
5,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,wedding,...,0.0,Nov-2004,3.0,9.0,0.0,7963.0,28.3%,12.0,Jan-2016,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39781,2500.0,36 months,8.07%,78.42,4 years,MORTGAGE,110000.0,Not Verified,1,home_improvement,...,0.0,Nov-1990,0.0,13.0,0.0,7274.0,13.1%,40.0,Jun-2010,
39782,8500.0,36 months,10.28%,275.38,3 years,RENT,18000.0,Not Verified,1,credit_card,...,1.0,Dec-1986,1.0,6.0,0.0,8847.0,26.9%,9.0,Jul-2010,
39783,5000.0,36 months,8.07%,156.84,< 1 year,MORTGAGE,100000.0,Not Verified,1,debt_consolidation,...,0.0,Oct-1998,0.0,11.0,0.0,9698.0,19.4%,20.0,Jun-2007,
39784,5000.0,36 months,7.43%,155.38,< 1 year,MORTGAGE,200000.0,Not Verified,1,other,...,0.0,Nov-1988,0.0,17.0,0.0,85607.0,0.7%,26.0,Jun-2007,


In [8]:
# Null value columns
null_counts = loans_2007.isnull().sum()
print(null_counts)

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64


Here we can see that, most of the columns have no missing values, two columns have fifty or less rows with missing values, and two columns, **emp_length and pub_rec_bankruptcies**, contain a relatively high amount of missing values.

Domain knowledge tells us that **employment length is frequently used in assessing how risky a potential borrower is**, so we'll keep this column despite its relatively large amount of missing values.

In [9]:
# to remove the pub_rec_bankruptcies column from loans
loans_2007 = loans_2007.drop('pub_rec_bankruptcies',axis=1)

# to remove all rows from loans containing any missing values
loans_2007 = loans_2007.dropna(axis = 0)

#  to return the counts for each column data type
loans_2007.dtypes.value_counts()

object     11
float64    10
int64       1
dtype: int64

In [10]:
#to select only the columns of object type from loans
object_columns_df = loans_2007.select_dtypes(include =['object'])
object_columns_df.head(1)

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016


In [11]:
object_columns_df.iloc[0]

term                     36 months
int_rate                    10.65%
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line          Jan-1985
revol_util                   83.7%
last_credit_pull_d        Jun-2016
Name: 0, dtype: object

Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

- home_ownership: home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
- verification_status: indicates if income was verified by Lending Club,
- emp_length: number of years the borrower was employed upon time of application,
- term: number of payments on the loan, either 36 or 60,
- addr_state: borrower's state of residence,
- purpose: a category provided by the borrower for the loan request,
- title: loan title provided the borrower

There are also some columns that represent numeric values, that need to be converted:

- int_rate: interest rate of the loan in %,
- revol_util: revolving line utilization rate or the amount of credit the borrower is using relative to all available credit

Based on the first row's values for **purpose** and **title**, it seems like these columns could reflect the same information. 

In [12]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']

for c in cols:
    print(loans_2007[c].value_counts())

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
 36 months    28234
 60 months     9441
Name: term, dtype: int64
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
LA     420
AL     420
KY     311
OK     285
KS     249
UT     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
WY      76
AK      76
SD      60
VT  

In [13]:
for c in ['title' , 'purpose']:
    print(loans_2007[c].value_counts())

Debt Consolidation                        2068
Debt Consolidation Loan                   1599
Personal Loan                              624
Consolidation                              488
debt consolidation                         466
                                          ... 
Wipe Out Debts                               1
RV Trailer                                   1
Trying to get debt free                      1
Paying of medical and small card bills       1
JS Loan                                      1
Name: title, Length: 18881, dtype: int64
debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64


- The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values. We should encode these columns as dummy variables and keep them.

- Use mapping to clean the emp_length column:

    - "10+ years": 10
    - "9 years": 9
    - "8 years": 8
    - "7 years": 7
    - "6 years": 6
    - "5 years": 5
    - "4 years": 4
    - "3 years": 3
    - "2 years": 2
    - "1 year": 1
    - "< 1 year": 0
    - "n/a": 0

- The addr_state column contains many discrete values and we'd need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let's remove this column from consideration.

In [14]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
# Remove the last_credit_pull_d, addr_state, title, and earliest_cr_line columns from loans
loans_2007 = loans_2007.drop(['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line'], axis=1)

#rstrip string method to strip the right trailing percent sign (%) and convert to float
loans_2007['int_rate']= loans_2007['int_rate'].str.rstrip('%').astype('float')
loans_2007['revol_util']= loans_2007['revol_util'].str.rstrip('%').astype('float')

# mapping/replace to clean emp_length column
loans_2007 = loans_2007.replace(mapping_dict)

In [15]:
loans_2007.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc
0,5000.0,36 months,10.65,162.87,10,RENT,24000.0,Verified,1,credit_card,27.65,0.0,1.0,3.0,0.0,13648.0,83.7,9.0
1,2500.0,60 months,15.27,59.83,0,RENT,30000.0,Source Verified,0,car,1.0,0.0,5.0,3.0,0.0,1687.0,9.4,4.0
2,2400.0,36 months,15.96,84.33,10,RENT,12252.0,Not Verified,1,small_business,8.72,0.0,2.0,2.0,0.0,2956.0,98.5,10.0
3,10000.0,36 months,13.49,339.31,10,RENT,49200.0,Source Verified,1,other,20.0,0.0,1.0,10.0,0.0,5598.0,21.0,37.0
5,5000.0,36 months,7.9,156.46,3,RENT,36000.0,Source Verified,1,wedding,11.2,0.0,3.0,9.0,0.0,7963.0,28.3,12.0


Now encode the **home_ownership, verification_status, purpose,** and **term** columns as dummy variables so we can use them in our model.

In [16]:
# encode the columns as integer value
col =["home_ownership", "verification_status", "purpose", "term"]

# to return a Dataframe containing the dummy columns
dummy_df = pd.get_dummies(loans_2007[col])

# to add these dummy columns back to loans
loans_2007 = pd.concat([loans_2007, dummy_df], axis=1)

#Remove orih=ginal columns
loans_2007 = loans_2007.drop(col, axis=1)
loans_2007.head()


TypeError: 'tuple' object is not callable

# Picking Error Metric
------
An error metric will help us figure out when our model is performing well, and when it's performing poorly.
- Our objective in this is to make money -- we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. 
- An error metric will help us determine if our algorithm will make us money or lose us money.

In this case, we're primarily concerned with false positives and false negatives. Both of these are different types of misclassifications. 
 - With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. 
 - With a false negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time. This loses us potential money, since we didn't fund a loan that actually would have been paid off.

|loan_status actual|prediction|error type|
|-----|----|----|
|0|1|False Positive|
|1|1|True positive|
|0|0|True negative|
|1|0|False negative|

# Logistic Regression Model

In [19]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
cols = loans_2007.columns

train_cols = cols.drop("loan_status")

features = loans_2007[train_cols]

target = loans_2007["loan_status"]

lr.fit(features, target)
predictions = lr.predict(features)



In [20]:
from sklearn.model_selection import cross_val_predict

# Generate cross validated predictions for features
# Make predictions using 3-fold cross-validation.
predictions = cross_val_predict(lr, features, target, cv=3)



In [21]:
predictions

array([1, 1, 1, ..., 1, 1, 1])

In [22]:
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)
print(tpr)
print(fpr)

0.9987920460880877
0.9962887363147152


As you can see from the last screen, our fpr and tpr are around what we'd expect if the model was predicting all ones.

Unfortunately, even through we're not using accuracy as an error metric, the classifier is, and it isn't accounting for the imbalance in the classes. There are a few ways to get a classifier to correct for imbalanced classes. The two main ways are:

1. Use oversampling and undersampling to ensure that the classifier gets input that has a balanced number of each class.
    - They involve taking a sample that contains equal numbers of rows where loan_status is 0, and where loan_status is 1. This way, the classifier is forced to make actual predictions, since predicting all 1s or all 0s will only result in 50% accuracy at most.
    
2. Tell the classifier to penalize misclassifications of the less prevalent class more than the other class.
    - We can do this by setting the class_weight parameter to balanced when creating the LogisticRegression instance. This tells scikit-learn to penalize the misclassification of the minority class during the training process. The penalty means that the logistic regression classifier pays more attention to correctly classifying rows where loan_status is 0. This lowers accuracy when loan_status is 1, but raises accuracy when loan_status is 0.

The second method is actually much easier to implement using scikit-learn.

In [24]:
# set class_weight='balanced'
lr = LogisticRegression(class_weight='balanced')

predictions = cross_val_predict(lr,features,target,cv=3)

predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans_2007["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans_2007["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.6263109596224437
0.611253701875617


In [26]:
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans_2007["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans_2007["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.22787755637126378
0.2252714708785785


In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans_2007["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans_2007["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.9630637126376508
0.9636722606120435
